StreamingLLM keeps AI models running smoothly indefinitely

The current AI gold rush in Silicon Valley and the wider enterprise tech world has centred on text-to-text large language models (LLMs) such as OpenAI’s ChatGPT, Meta’s Llama 2, and Anthropic’s Claude 2. However, most of these models share some of the same problems.

One of these issues is maintaining consistently high-quality performance over time during a single conversation with a user: responses that are equally helpful, quick, and relevant at the beginning, middle, and end of the conversation, regardless of how long it lasts or how many exchanges of dialogue it includes. The difficulty stems from the fact that LLMs are pre-trained on data blocks, or sequences, of a fixed length: 4,000 tokens in the case of Llama 2 and many other leading LLMs.

Once a user inputs more tokens than this, even spread across numerous separate prompts, the LLM starts to experience diminished performance, i.e., worse-quality replies. For businesses that want LLMs to assist customers or staff in an ongoing manner, this is unacceptable.

According to a new paper from researchers at Meta, MIT, and CMU, there is a simple way to help LLMs maintain their performance even in endlessly long conversations in which the user’s prompts collectively grow longer than what the LLM was trained to handle at once.

Their work, a new framework for deploying LLMs for inference called “StreamingLLM,” reveals a number of significant findings for other AI researchers and businesses looking to use LLMs in their work.

The issue that StreamingLLM aims to address

Anyone who has interacted with a human customer support representative, or even an internal IT engineer at their employer, knows that it often takes a protracted dialogue and several messages exchanged between you and your assigned helper to fix the issue at hand.

Whether you’re a customer or an employee, you want the individual assigned to assist you to be consistently responsive, knowledgeable, and helpful throughout the entire interaction. It can be quite frustrating and unhelpful if, deep into a conversation in which you’ve already invested time and effort explaining your problem, your assistant suddenly starts responding more slowly, with one-word answers, or without the information you require.

Although this can be a problem with people who are distracted, uninspired, or worn out by the conversation, it is a chronic problem for LLMs, because their performance deteriorates once a conversation lasts longer than the “context window,” the maximum number of tokens they can respond to at once, which is fixed when they are pre-trained. This is true even though most LLMs are built to handle lengthy, open-ended conversations.

Even if each individual message fits within the LLM’s context window, and all of them should, since most LLMs cap how much text you can enter in a single message, the cumulative sum of multiple messages in a single conversation adds up to more tokens than the LLM’s initial pre-training context window contains, and the LLM’s performance drops from that point onward.

It would be as if, after speaking to a live customer service representative for a predetermined amount of time spread across a few messages, you suddenly hit some limit that made them less intelligent and attentive.
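To make the arithmetic concrete, here is a minimal Python sketch (the per-message token counts are hypothetical, and the 4,000-token window follows the figure cited above) showing how a conversation overflows the context window even though every individual message fits comfortably:

```python
# Hypothetical per-message token counts for one ongoing support conversation.
CONTEXT_WINDOW = 4_000
messages = [350, 420, 610, 280, 530, 700, 460, 640, 390]  # tokens per exchange

running_total = 0
for i, tokens in enumerate(messages, start=1):
    running_total += tokens
    status = "OK" if running_total <= CONTEXT_WINDOW else "over the window"
    print(f"after message {i}: {running_total} tokens ({status})")

# No single message comes close to 4,000 tokens, but by the ninth message the
# running total (4,380 tokens) has passed the window, and a vanilla LLM's
# reply quality starts to degrade from that point on.
```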

The researchers who created the StreamingLLM framework sum up the difficulty in their paper: “For instance, an ideal ChatBot helper can steadily work over the content of recent day-long chats. To generalize LLM to greater sequence lengths than they were pre-trained on, nevertheless, is quite difficult.”

Although it is possible to lengthen the token sequences used to pre-train LLMs, and some researchers have already done so, it is impossible to predict how long any particular conversation with any particular user will last.

What is the best way to ensure that an LLM trained with a fixed context-window length, however long, can keep performing well once a conversation has exceeded that length over the course of numerous messages?

The answer the researchers came up with

The researchers devised a creative approach to keeping an LLM’s performance intact even after the information in a conversation grows beyond the number of tokens in its pre-training sequence.

The researchers found that LLMs pay disproportionately close attention to the tokens presented to them earliest, whether at the start of a conversation or at the start of a training sequence.

“The initial tokens receive a remarkably high amount of attention score,” they write. Why is this the case?

They explain that because autoregressive language modelling is sequential, early tokens are visible to all later tokens, whereas later tokens are visible only to the subset of tokens that come after them. Initial tokens are therefore more easily trained to act as drains for attention, soaking up attention the model does not strictly need elsewhere.

To put it another way, whatever you present to an LLM first in a conversation can and will be used by it later in subsequent prompt and output exchanges, but whatever you prompt it with later won’t always be what the LLM chooses to focus on or refer to in its responses.
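A toy example makes that visibility asymmetry concrete. The snippet below is an illustrative sketch with random scores, not the authors’ code or a trained model: it builds a causal attention matrix and sums how much attention each position receives, and the earliest positions collect the most mass simply because every later token can see them.

```python
import torch

torch.manual_seed(0)
T, d = 16, 32                        # sequence length, head dimension
q, k = torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d ** 0.5          # (T, T) raw attention scores
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))  # token t sees only tokens <= t
attn = scores.softmax(dim=-1)        # each row sums to 1 over the visible prefix

received = attn.sum(dim=0)           # total attention mass landing on each position
print(received[:4])                  # earliest positions: visible to all 16 queries
print(received[-4:])                 # latest positions: visible to almost none
```

In a trained model the effect is amplified: because the softmax weights in each row must sum to one, surplus attention that has nowhere useful to go tends to get parked on those always-visible first positions.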

However, the researchers found that reintroducing some of those initial tokens to the LLM in later exchanges is enough to bring its performance back close to its peak.

Recall our earlier comparison to human customer service? Imagine if you could compel the representative to keep providing high-quality responses even much later in the conversation, simply by repeating the same four magic words you used to start the interaction.

The researchers refer to these initial tokens, which capture the majority of the LLM’s attention, as “attention sinks,” and they observe that for most LLMs, reintroducing four initial tokens is enough to restore the LLM’s performance; reintroducing only one or two is not enough for a full recovery.

By reintroducing the attention sink tokens in every subsequent exchange, the researchers were able to maintain the performance of top models like Llama 2 and Falcon 40B across prompts totalling 4 million tokens (a 1,000-fold increase over the original context window of just 4,000 tokens), and potentially even more. They were also able to speed up subsequent responses by as much as 22.2 times.
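In practice this works out to a cache-eviction policy rather than extra typing from the user: keep the key/value entries for the first few tokens pinned, keep a sliding window of the most recent tokens, and drop everything in between. The class below is a minimal sketch of that idea; the name SinkCache, the window size, and the kv_entry placeholder are illustrative rather than the authors’ implementation, which also has to handle details such as re-assigning positions inside the rolling cache.

```python
from collections import deque

class SinkCache:
    """Sketch of an attention-sink KV cache: pin the first `num_sinks`
    tokens and keep only a rolling window of the most recent ones."""

    def __init__(self, num_sinks: int = 4, window: int = 2048):
        self.num_sinks = num_sinks
        self.sinks = []                      # KV entries for the very first tokens
        self.recent = deque(maxlen=window)   # KV entries for the newest tokens

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)      # the first four tokens are never evicted
        else:
            self.recent.append(kv_entry)     # older middle tokens fall off the left

    def context(self):
        # What the model attends over at each decoding step: the pinned
        # attention sinks plus the sliding window of recent tokens.
        return self.sinks + list(self.recent)
```

Because the cache stays a fixed size no matter how long the conversation runs, memory use and per-token cost stay flat even as a dialogue stretches into millions of tokens.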

One token to rule them all (or at least their attention)

Building on this finding, the researchers proposed and demonstrated that you can get away with adding just one special token early on to serve as an “attention sink” for the LLM, and that by reintroducing this token later, manually or automatically (behind the scenes of a user- or employee-facing LLM), the model’s performance can be kept at a high level.

The researchers report that introducing a sink token is highly effective at stabilizing the attention mechanism: the model’s performance can be properly anchored simply by using this sink token together with recent tokens. In light of these results, they advise training future LLMs with a sink token across all samples to enhance streaming deployment.
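A rough sketch of what that recommendation could look like in practice, using the Hugging Face transformers API; the “<sink>” token name and the choice of GPT-2 are arbitrary placeholders for illustration, not part of the paper.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register a dedicated, learnable sink token and give it an embedding row.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})
model.resize_token_embeddings(len(tokenizer))

def add_sink(sample_text: str):
    # Every pre-training sample starts with the same sink token, so the model
    # learns to park surplus attention on it instead of on whatever words
    # happen to open the sequence.
    return tokenizer("<sink>" + sample_text, return_tensors="pt")

batch = add_sink("Customer: my laptop will not boot after the latest update.")
```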

When asked what specific information should be used as an attention sink, Guangxuan Xiao of MIT said in an email that “the ‘attention sinks’ can be any initial tokens; the focus is more on their position than semantics. These are not particular words or concepts; even tokens (such as the linebreak ‘\n’) without semantic connotations can be used successfully.”

According to Xiao, the researchers hope to apply StreamingLLM to continuous applications such as multi-round dialogues. It is ideal for scenarios in which a model needs to run continuously without relying too heavily on historical data; a daily assistant LLM is one example. With this method, the model can retain and draw on recent interactions, removing the need for regular cache refreshes.

The researchers are also open about the limitations of their work. Contrary to some of the excitement about it on X (formerly Twitter), they were careful to emphasize that StreamingLLM does not increase the context window of LLMs, nor does it guarantee that the LLM will recall every word said at every stage of the conversation.

In actuality, Xiao said, “neither do we enhance the LLMs’ long-term memory nor do we widen their context window.”
