Marco Vecchio


2024

Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
Yichen Jiang | Marco Vecchio | Mohit Bansal | Anders Johannsen
Findings of the Association for Computational Linguistics: EACL 2024

Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective in compressing API documentation, resulting in low accuracy compared to the baseline using an uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gist_arg token for compressing an argument and one Gist_value token for compressing an acceptable value of a categorical argument. We then dynamically reveal Gist_value tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system keeps the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.