BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers
Lilly Kumari, Shengjie Wang, Tianyi Zhou, Nikhil Sarda, Anthony Rowe, Jeff Bilmes
Proceedings of First Conference on Language Modeling (COLM), 2024
Abstract
The need for Transformer-based Large Language Models (LLMs) to maintain key-value representations (a KV cache) of previously seen tokens in the GPU memory leads to a significant overhead that scales linearly with the sequence length and batch size. With the advent of extremely long context LLMs, efficiently modeling long-range dependencies becomes challenging. In this work, we focus on the problem of long context summarization by formulating it as a subset selection problem. Specifically, we propose a novel submodular optimization framework called BumbleBee that uses a mixture of submodular functions to balance the diversity amongst the context tokens in the key embedding space and their importance computed using accumulated attention attributed to them across different input tokens. Our framework can work for both the LLM prefill and decoding phases, utilizing offline or online versions of our submodular algorithm respectively. While the context sizes grow to be as large only as the summary size, the temporal extent of the contexts may grow unboundedly, justifying the moniker Infinite-Context Transformers. Empirically, we validate the effectiveness of our framework across 13 different datasets using the LLaMA 7B and 13B models. Our results show that BumbleBee improves accuracy compared to state-of-the-art techniques at comparable context reduction ratios.