Context Compression for Auto-regressive Transformers with Sentinel Tokens

Siyu Ren, Qi Jia, Kenny Zhu


Abstract
The quadratic complexity of the attention module causes it to gradually become the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when dealing with long inputs also poses severe challenges for memory footprint and inference latency. In this work, we propose a plug-and-play approach that is able to incrementally compress the intermediate activations of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression on improving the system throughput. Code is available at https://github.com/DRSY/KV_Compression.
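To make the idea concrete, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of the core mechanism the abstract describes: after a span of tokens is followed by a sentinel token, the span's key-value cache entries are discarded and only the sentinel's entries are kept, so subsequent tokens attend to a compact cache. It assumes a Hugging Face transformers version where past_key_values is a per-layer tuple of (key, value) tensors of shape (batch, heads, seq_len, head_dim); the choice of GPT-2 as backbone and reusing the EOS token as a stand-in sentinel are illustrative assumptions, and in the paper the model is fine-tuned so that sentinel activations actually absorb the span's information.

```python
# Sketch only: KV-cache compression by keeping sentinel positions.
# Assumes legacy tuple-style past_key_values; sentinel id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

SENTINEL_ID = tokenizer.eos_token_id  # stand-in for a dedicated sentinel token


@torch.no_grad()
def compress_span(span_text, past=None, num_kept=0):
    """Encode a span followed by a sentinel token, then drop the span's
    key/value entries, keeping only earlier kept positions plus the sentinel."""
    span_ids = tokenizer(span_text, return_tensors="pt").input_ids
    sentinel = torch.tensor([[SENTINEL_ID]], dtype=torch.long)
    input_ids = torch.cat([span_ids, sentinel], dim=-1)

    out = model(input_ids, past_key_values=past, use_cache=True)
    new_past = out.past_key_values

    # Positions 0..num_kept-1 were kept from earlier compressions; the
    # sentinel sits at the last position of the freshly extended cache.
    prev_len = 0 if past is None else past[0][0].shape[2]
    keep = list(range(num_kept)) + [prev_len + input_ids.shape[1] - 1]
    idx = torch.tensor(keep)

    compressed = tuple(
        (k.index_select(2, idx), v.index_select(2, idx)) for k, v in new_past
    )
    return compressed, len(keep)


past, kept = compress_span("A long paragraph of context that we want to compress.")
# `past` now contains only the sentinel position per layer; later tokens
# condition on this compact cache instead of the full span.
```

In this sketch the compressed cache simply grows by one sentinel entry per compressed span, which is where the memory and compute savings over retaining the full span come from; position handling and the training objective that makes the sentinel activations informative are omitted here.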
Anthology ID:
2023.emnlp-main.794
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12860–12867
URL:
https://aclanthology.org/2023.emnlp-main.794
DOI:
10.18653/v1/2023.emnlp-main.794
Cite (ACL):
Siyu Ren, Qi Jia, and Kenny Zhu. 2023. Context Compression for Auto-regressive Transformers with Sentinel Tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12860–12867, Singapore. Association for Computational Linguistics.
Cite (Informal):
Context Compression for Auto-regressive Transformers with Sentinel Tokens (Ren et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.794.pdf
Video:
https://aclanthology.org/2023.emnlp-main.794.mp4