Context limits are a key factor when working with LLMs.
Simply put, the context limit is the number of tokens ("words", or more precisely sub-word pieces; see https://tiktokenizer.vercel.app/ ) that an LLM can process before it starts truncating the earlier context. If you give an LLM with a context window of 8192 tokens a text of 10,000 tokens, it won't be able to process it all, since the text exceeds its capacity. The context window is effectively the model's working memory.
For a deeper understanding of what tokens are, Karpathy has an amazing tutorial here: (I've linked directly to the part where he talks about tokens)
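If you want a feel for how text maps to tokens, you can count them yourself in a couple of lines of Python. Here is a small sketch using the tiktoken library; note that tiktoken implements OpenAI's tokenizers rather than llama3's, so treat the count as a rough approximation of what llama3 actually sees:

import tiktoken  # pip install tiktoken

# cl100k_base is an OpenAI tokenizer; llama3 uses a different one,
# so this is only an approximation of llama3's token count
enc = tiktoken.get_encoding("cl100k_base")

text = "Context limits are a key factor when working with LLMs."
tokens = enc.encode(text)
print(len(tokens), tokens[:5])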
In this blog post we'll demonstrate context length in practice, continuing from a previous blog post, Running LLMs Locally.
First, let's check llama3's context length:
mgoyal@JUSTMAGOP16GEN2:/mnt/c/Users/MGoyal$ ollama show llama3
  Model
    architecture        llama
    parameters          8.0B
    context length      8192
    embedding length    4096
    quantization        Q4_0

  Capabilities
    completion

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024
  ...
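(If you'd rather check this from code instead of the CLI, Ollama exposes the same metadata over HTTP. The sketch below assumes the /api/show endpoint and the model_info key names as reported by recent Ollama versions; the exact fields may differ on yours.)

import requests

# Ask the local Ollama server for llama3's metadata
r = requests.post("http://localhost:11434/api/show", json={"name": "llama3"})
info = r.json()

# In recent Ollama versions the context length is reported under model_info;
# fall back gracefully if the key isn't present
model_info = info.get("model_info", {})
print(model_info.get("llama.context_length", "not reported"))  # expect 8192 for llama3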
Let's run a Python script whose prompt goes past that context window:
import requests

# Bury the question inside a large amount of filler text
padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

# Send the prompt to the local Ollama server
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": big_text, "stream": False},
)
data = r.json()
print(f"Response: {data['response']}")
The response?
Response: It seems like you're trying to ignore some filler text! Don't worry, I'm here to help with any questions or topics you'd like to discuss. Just let me know what's on your mind, and we can start fresh!
As you can see, it totally ignored “What is 2+2? Answer with just the number”, because the filler text took up all the context.
Now let's change it so that the filler text doesn't take up all the space: we'll reduce the padding multiplier from 500 to 50.
import requests

# Same prompt, but with far less filler on each side of the question
padding = "Ignore this filler text. " * 50
big_text = padding + "What is 2+2? Answer with just the number." + padding

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": big_text, "stream": False},
)
data = r.json()
print(f"Response: {data['response']}")
Now it directly answers the question and ignores the filler text:
Response: 4
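As a side note, Ollama also has a per-request num_ctx option that controls how much context it actually allocates for a call; depending on your Ollama version the default can be smaller than the model's maximum, so it's worth setting explicitly when you're deliberately working with long prompts. A minimal sketch (the value 8192 simply matches llama3's maximum from the ollama show output above):

import requests

padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

# Explicitly ask Ollama to allocate the model's full 8192-token window for this call
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": big_text,
        "stream": False,
        "options": {"num_ctx": 8192},
    },
)
print(r.json()["response"])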
These two experiments clearly show that LLMs have a limited window of text they can process at once. You can easily find the context limit of any model with a quick Google search or in the official reference; OpenAI, for example, lists context windows here: https://developers.openai.com/api/docs/models/gpt-5. Since llama3 is a small model running locally on my laptop, its context window is much smaller than those offered by the big AI companies: at the time of writing, OpenAI provides a 1 million token context window for GPT-5.5, for example.
In practice, many models degrade well before the maximum limit is reached. A model will usually perform best when only part of the window is filled, say up to ~50% (don't take my word for the exact number), and its performance tends to degrade once around 75% of the context window is used.
Whenever you work with a model, you have to factor in the context window: it affects both which model you choose and how you structure the input you give it.
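One simple way to factor it in is to check a prompt against a token budget before sending it. The sketch below is only an illustration: the 8192 limit matches llama3 above, the 0.5 "safe fraction" echoes the rough rule of thumb from the previous paragraph, and tiktoken again stands in for llama3's real tokenizer, so the counts are approximate:

import tiktoken

def fits_in_budget(prompt: str, context_limit: int = 8192,
                   safe_fraction: float = 0.5, reserve_for_output: int = 512) -> bool:
    """Rough check that a prompt leaves headroom for the model's answer."""
    # Approximate count: cl100k_base is an OpenAI tokenizer, not llama3's
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    budget = int(context_limit * safe_fraction) - reserve_for_output
    return n_tokens <= budget

padding = "Ignore this filler text. " * 500
big_text = padding + "What is 2+2? Answer with just the number." + padding

print(fits_in_budget("What is 2+2? Answer with just the number."))  # fits comfortably
print(fits_in_budget(big_text))  # far too large for the budget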