Context limits are a key factor when working with LLMs.
Simply put, the context limit is the number of tokens ("words", or more precisely sub-word pieces; see https://tiktokenizer.vercel.app/ ) that an LLM can process before it starts truncating the earlier context. If you give an LLM with a context window of 8192 tokens a text of 10,000 tokens, it won't be able to process it all, since the text exceeds its capacity. The context window is effectively the model's working memory.
For a deeper understanding of what tokens are, Karpathy has an amazing tutorial here: (I've linked directly to the part where he talks about tokens)
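If you want a feel for how text maps to tokens, you can count them yourself in a couple of lines of Python. Here is a small sketch using the tiktoken library; note that tiktoken implements OpenAI's tokenizers rather than llama3's, so treat the count as a rough approximation of what llama3 actually sees:

import tiktoken  # pip install tiktoken

# cl100k_base is an OpenAI tokenizer; llama3 uses a different one,
# so this is only an approximation of llama3's token count
enc = tiktoken.get_encoding("cl100k_base")

text = "Context limits are a key factor when working with LLMs."
tokens = enc.encode(text)
print(len(tokens), tokens[:5])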
In this blog post we'll demonstrate context length in practice, continuing from a previous blog post, Running LLMs Locally.
First, let's check llama3's context length:
mgoyal@JUSTMAGOP16GEN2:/mnt/c/Users/MGoyal$ ollama show llama3
  Model
    architecture        llama
    parameters          8.0B
    context length      8192
    embedding length    4096
    quantization        Q4_0

  Capabilities
    completion

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024
  ...
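(If you'd rather check this from code instead of the CLI, Ollama exposes the same metadata over HTTP. The sketch below assumes the /api/show endpoint and the model_info key names as reported by recent Ollama versions; the exact fields may differ on yours.)

import requests

# Ask the local Ollama server for llama3's metadata
r = requests.post("http://localhost:11434/api/show", json={"name": "llama3"})
info = r.json()

# In recent Ollama versions the context length is reported under model_info;
# fall back gracefully if the key isn't present
model_info = info.get("model_info", {})
print(model_info.get("llama.context_length", "not reported"))  # expect 8192 for llama3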
Let's run a Python script whose prompt goes past that context window:
import requests

# Bury the question inside a large amount of filler text
padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

# Send the prompt to the local Ollama server
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": big_text, "stream": False},
)
data = r.json()
print(f"Response: {data['response']}")
The response?
Response: It seems like you're trying to ignore some filler text! Don't worry, I'm here to help with any questions or topics you'd like to discuss. Just let me know what's on your mind, and we can start fresh!
As you can see, it totally ignored “What is 2+2? Answer with just the number”, because the filler text took up all the context.
Now let's change it so that the filler text doesn't take up all the space: we'll reduce the padding multiplier from 500 to 50.
import requests

# Same prompt, but with far less filler on each side of the question
padding = "Ignore this filler text. " * 50
big_text = padding + "What is 2+2? Answer with just the number." + padding

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": big_text, "stream": False},
)
data = r.json()
print(f"Response: {data['response']}")
Now it directly answers the question and ignores the filler text:
Response: 4
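As a side note, Ollama also has a per-request num_ctx option that controls how much context it actually allocates for a call; depending on your Ollama version the default can be smaller than the model's maximum, so it's worth setting explicitly when you're deliberately working with long prompts. A minimal sketch (the value 8192 simply matches llama3's maximum from the ollama show output above):

import requests

padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

# Explicitly ask Ollama to allocate the model's full 8192-token window for this call
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": big_text,
        "stream": False,
        "options": {"num_ctx": 8192},
    },
)
print(r.json()["response"])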
These two experiments clearly show that LLMs have a limited window of text they can process at once. You can easily find the context limit of any model with a quick Google search or in the official reference; OpenAI, for example, lists context windows here: https://developers.openai.com/api/docs/models/gpt-5. Since llama3 is a small model running locally on my laptop, its context window is much smaller than those offered by the big AI companies: at the time of writing, OpenAI provides a 1 million token context window for GPT-5.5, for example.
In practice, many models degrade well before the maximum limit is reached. A model will usually perform best when only part of the window is filled, say up to ~50% (don't take my word for the exact number), and its performance tends to degrade once around 75% of the context window is used.
Whenever you work with a model, you have to factor in the context window: it affects both which model you choose and how you structure the input you give it.
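One simple way to factor it in is to check a prompt against a token budget before sending it. The sketch below is only an illustration: the 8192 limit matches llama3 above, the 0.5 "safe fraction" echoes the rough rule of thumb from the previous paragraph, and tiktoken again stands in for llama3's real tokenizer, so the counts are approximate:

import tiktoken

def fits_in_budget(prompt: str, context_limit: int = 8192,
                   safe_fraction: float = 0.5, reserve_for_output: int = 512) -> bool:
    """Rough check that a prompt leaves headroom for the model's answer."""
    # Approximate count: cl100k_base is an OpenAI tokenizer, not llama3's
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    budget = int(context_limit * safe_fraction) - reserve_for_output
    return n_tokens <= budget

padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

print(fits_in_budget("What is 2+2? Answer with just the number."))  # fits comfortably
print(fits_in_budget(big_text))  # far too large for the budget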