MIT Researchers DESTROY the Context Window Bottleneck

algieg's blog


The "KV Cache" Bottleneck #

Failure of Current "Window Attention" #

StreamingLLM and the "Attention Sink" Discovery #

Implementation of "Attention Sinks" #

Speed and Efficiency Benchmarks #

Comparison to "Long-Context" Models #

Summary #

MIT researchers have addressed the memory and performance limitations of LLM context windows by identifying "Attention Sinks." They discovered that keeping the first few tokens of a sequence permanently in the KV cache, alongside a sliding window of the most recent tokens, prevents the model from collapsing once the text outgrows the cache. This technique, called StreamingLLM, allows standard Transformer models to process millions of tokens with constant memory usage and up to 22x faster inference than a sliding-window baseline that recomputes the cache. It does not provide true long-term memory, since facts that fall outside the window are forgotten, but it enables stable interactions of effectively unlimited length without expensive hardware or model retraining.
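To make the cache policy concrete, here is a minimal Python sketch of the "a few permanent sink tokens plus a sliding window of recent tokens" idea. It is not the researchers' implementation (the real system evicts key/value tensors per attention layer and re-assigns positions within the rolled cache); the class name, the default of 4 sink tokens, and the window size here are illustrative assumptions.

```python
from collections import deque

class SinkSlidingWindowCache:
    """Toy KV-cache eviction policy in the spirit of StreamingLLM:
    pin the first `n_sink` token positions (the "attention sinks")
    and keep only a sliding window of the most recent positions.
    Names and defaults are hypothetical, for illustration only."""

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink
        self.sinks = []                      # first tokens' (key, value) pairs, never evicted
        self.recent = deque(maxlen=window)   # deque silently drops the oldest entry

    def append(self, key, value):
        # The first n_sink tokens become permanent sinks; everything
        # else rolls through the bounded recent-token window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append((key, value))
        else:
            self.recent.append((key, value))

    def cached(self):
        # What the next decoding step can attend to: sinks + recent window.
        return self.sinks + list(self.recent)


# Memory stays bounded at n_sink + window entries no matter how long the stream gets.
cache = SinkSlidingWindowCache(n_sink=4, window=8)
for t in range(100_000):
    cache.append(f"k{t}", f"v{t}")

print(len(cache.cached()))                 # 12
print([k for k, _ in cache.cached()][:5])  # ['k0', 'k1', 'k2', 'k3', 'k99992']
```

The point of the sketch is the constant-memory property: however many tokens stream in, only n_sink + window key/value pairs are ever held, with the pinned sink tokens playing the stabilizing role the researchers identified.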
