#158 Understanding the Power of Context Length in LLMs
Last week, I summarized the latest advancements in AI from OpenAI and Google. Both companies are competing closely to lead the AI revolution. After their announcements, two things stood out to me: a 2 million token context window length and AI agents, which Google calls Project Astra. I realize some of you might not know what these terms mean or how they can make a difference. In this article, we will explore the context window length in detail. If you're short on time and prefer not to read the entire post, you can find a TLDR summary at the end.
We all know that large language models (LLMs) have become very common today. Many companies are trying to boost their productivity by using these LLMs in their business. One important thing to understand is that LLMs' knowledge is limited to the data they are trained on. Simply put, data is what powers LLMs. The more data they are trained on, the better they understand the world.
Now, imagine you have a huge file, like your physics homework or an hour-long video, and you want to ask questions about it. You want the LLM to answer based on your file instead of giving general information from the internet. This ability to work with large files and make the LLM aware of its specific context is behind many of the new features Google has introduced, like talking to your photos, asking questions about YouTube videos, summarizing email threads, and more.
When we talk about context length, we mean the maximum number of tokens (roughly, pieces of words) that the LLM can take as input in a single request. If you give it more than the context window can hold, the input gets truncated or the model loses track of the context and gives irrelevant answers. There are several approaches to handle this -
One approach is prompting techniques that help the model summarize and reason over the input. I have covered several of these in previous editions of The Passion Pad, such as Chain of Thought Prompting, Plan & Solve Prompting, Large Language Models are Human-Level Prompt Engineers, and Rephrase & Respond: Let Large Language Models Ask Better Questions for Themselves.
Another method involves a vector database. The document is split into chunks, each chunk is embedded, and a similarity search retrieves the chunks most relevant to your question, which are then handed to the LLM as context. This is likely similar to how Google Photo Search works, although that feature hasn't been released yet.
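To make this concrete, here is a minimal sketch of the embed-and-search idea, assuming the sentence-transformers library and a plain in-memory array instead of a real vector database; the model name, chunks, and question are just placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed each chunk of the big document (placeholder chunks shown here).
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "Newton's second law states that F = ma.",
    "The homework is due next Friday.",
    "Kinetic energy is 0.5 * m * v^2.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Embed the question and retrieve the most similar chunk via cosine similarity.
question = "What is the formula for kinetic energy?"
question_vector = model.encode([question], normalize_embeddings=True)[0]
scores = chunk_vectors @ question_vector
best_chunk = chunks[int(np.argmax(scores))]
print(best_chunk)  # this chunk is passed to the LLM alongside the question
```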
Another method is fine-tuning the LLM with custom data. I fine-tuned LLMs for my Deep Learning project last semester. While effective, this process is expensive and changes the model's weights. Although the model may perform well for your specific use case, it might fail at other general tasks.
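For completeness, here is a rough sketch of what that fine-tuning can look like, assuming the Hugging Face transformers library, GPT-2 as a stand-in model, and a toy list of domain texts; a real project would need a proper dataset, batching, and evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

texts = ["Domain-specific document one...", "Domain-specific document two..."]  # placeholders

model.train()
for epoch in range(3):
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # For a causal LM, reusing the input ids as labels trains next-token prediction.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# The updated weights now reflect the custom data - at the cost of possibly
# degrading performance on unrelated general tasks.
```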
Instead of these methods, a simpler way to handle large amounts of data is to increase the context length of LLMs. This is what Google has done. They have introduced the first consumer-facing LLM with a 2 million token context length. For comparison, other models have much shorter context lengths: 4,096 tokens for GPT-3, 16K for GPT-3.5, 8K for GPT-4, 128K for GPT-4 Turbo, and 200K for Anthropic models. As you can see, no one else comes close.
Increasing the context length is the best approach here because it lets the LLM work with your data directly, without changing the model's weights. The same powerful model simply gets to see more of your input at once, which leads to better answers and responses.
Imagine you have a personal assistant named Gyanav. Gyanav is very knowledgeable and can help you with many tasks. However, Gyanav has a limited attention span. He can only remember and process information from a few sentences you tell him at a time.
One day, you need Gyanav to help you with a complicated project. You hand him a huge document, but since Gyanav can only remember a small part of it, he often loses track of what he's doing. As a result, his help isn't very effective because he can't see the bigger picture.
Now, imagine that Gyanav’s attention span is magically increased. He can now remember and process information from the entire document all at once. With this enhanced ability, Gyanav can understand the context of your project much better. He can connect the dots, remember important details, and provide you with more accurate and helpful assistance.
This is similar to what happens with LLMs and context length. When the context length is short, the LLM can only handle a small amount of information at a time, which limits its effectiveness. But when the context length is increased, the LLM can understand and process much larger amounts of data, making it far more useful and accurate in its responses.
For example, if you have a long email thread or a detailed report and you want to ask questions about it, an LLM with a longer context length can keep track of all the details and provide you with better answers. This is why Google's new LLM with a 2 million token context length is so impressive—it can handle much more information at once, making it incredibly powerful and useful.
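If you want to see how this is measured in practice, here is a quick sketch that counts tokens with the tiktoken library; the 2,000,000 limit stands in for the advertised 2 million token window, cl100k_base is just a convenient public encoding (not Google's own tokenizer), and the file name is a placeholder.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # public encoding, used here for illustration
document = open("long_report.txt").read()        # placeholder file

token_count = len(encoding.encode(document))
context_limit = 2_000_000                        # stand-in for a 2M-token context window

print(f"{token_count} tokens; fits in the context window: {token_count <= context_limit}")
```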
How do we increase the context length, you might wonder?
Well, this is where things get technical. There are a number of ways to do it, but for the scope of this article, I am going to explain just one. Feel free to ping me if you want to dig deeper.
Think about a traditional fully connected neural network (NN). In these networks, each weight in the first layer is tied to a specific input feature, so the number of parameters grows with the size of the input.
However, transformers work differently. Instead of processing each token one by one, they use self-attention to weigh the importance of each word in relation to others. The parameters are shared across all tokens in the sequence, so the number of parameters doesn’t depend on the input size. This allows transformers to handle inputs of varying lengths.
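A small PyTorch sketch makes this concrete: one multi-head attention layer, with a fixed parameter count, processes sequences of very different lengths.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
print(sum(p.numel() for p in attn.parameters()))  # fixed parameter count, whatever the input length

for seq_len in (10, 100, 1000):
    x = torch.randn(1, seq_len, 64)   # (batch, sequence length, embedding size)
    out, _ = attn(x, x, x)            # self-attention of the sequence with itself
    print(seq_len, out.shape)         # output length always matches the input length
```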
If a model is trained on a context length of 2048 tokens, it can technically accept any sequence length. But if the input exceeds 2048 tokens, the model might not give meaningful results because it wasn’t trained to handle so many tokens at once.
One trick to increase context length comes from a research paper called ALiBi (Attention with Linear Biases).
A common idea is to train the base model on shorter lengths (like 2048 tokens) and then fine-tune it on longer lengths (like 100K tokens). However, this approach often fails because most models are poor at a property known as extrapolation.
Extrapolation, as introduced in the ALiBi paper, means making inferences for sequences longer than those seen during training. For example, if a model was trained on 1024 tokens but receives 2048 tokens during use, it’s extrapolating.
The ALiBi method replaces the usual positional embeddings with a simple bias added to the attention scores: the further a key token is from the query token, the bigger the penalty. Using ALiBi, a 1.3 billion parameter model trained on sequences of 1024 tokens could extrapolate to sequences of 2048 tokens, while training 11% faster and using 11% less memory than a model with traditional sinusoidal positional embeddings. The paper concludes that sinusoidal embeddings are not good at extrapolation.
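Here is a sketch of the linear bias ALiBi adds to the attention scores, following the per-head slope recipe described in the paper; wiring it into a full attention layer is left out.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = torch.arange(seq_len)
    distance = positions[:, None] - positions[None, :]    # i - j for query i, key j
    bias = -slopes[:, None, None] * distance               # penalty grows linearly with distance
    return bias.masked_fill(distance < 0, float("-inf"))   # causal mask: ignore future tokens

# The bias is added to the raw attention scores before the softmax,
# in place of positional embeddings:
# scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5 + alibi_bias(n_heads, seq_len)
```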
In summary, choosing the context length (L) is a crucial design decision. A larger L means better performance thanks to the extra context, but also significantly higher training costs. The trick is to train on a smaller context length and let ALiBi's positional biases extrapolate to longer sequences at inference time.
Here’s a TLDR generated by ChatGPT -
TLDR: Last week, I covered the latest AI advancements from OpenAI and Google, highlighting Google's impressive 2 million token context window and Project Astra AI agents. Context length is crucial for large language models (LLMs) to handle large amounts of data effectively. While traditional methods like fine-tuning and vector databases exist, increasing the context length of LLMs is a more efficient approach. Google's new LLM with a 2 million token context length demonstrates how this improvement can lead to better, more accurate responses without altering model weights. Techniques like ALiBi (Attention with Linear Biases) enable this increase, allowing LLMs to process larger inputs efficiently.