ContextPilot: Fast Long-Context Inference via Context Reuse

Abstract

AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts grow longer, prefill latency becomes the dominant bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot builds a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns). It further proposes context ordering and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it adds succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to 3× compared to state-of-the-art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality.
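To make the core idea concrete, here is a minimal sketch of how a content-addressed context index might detect overlapping context blocks across requests, de-duplicate them, and reorder them so that already-cached blocks come first. This is an illustration only, not ContextPilot's actual implementation: all names (`ContextIndex`, `plan_request`, `block_id`) are hypothetical, and it assumes block-level KV reuse rather than strict prefix-only caching.

```python
import hashlib
from collections import defaultdict

def block_id(text: str) -> str:
    """Content-addressed identifier for a context block (hypothetical helper)."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

class ContextIndex:
    """Toy index tracking which context blocks have been seen and cached."""

    def __init__(self):
        self.freq = defaultdict(int)  # how often each block has appeared
        self.cached = set()           # blocks whose KV entries are assumed cached

    def plan_request(self, blocks: list[str]) -> list[str]:
        """De-duplicate blocks, then order cached and frequently reused blocks
        first so their KV entries can be reused across requests."""
        seen, unique = set(), []
        for b in blocks:
            bid = block_id(b)
            if bid not in seen:       # de-duplication within a request
                seen.add(bid)
                unique.append(b)
        # Cached / frequent blocks float to the front; the stable sort keeps
        # the original relative order within each group.
        ordered = sorted(
            unique,
            key=lambda b: (block_id(b) not in self.cached,
                           -self.freq[block_id(b)]),
        )
        for b in ordered:
            bid = block_id(b)
            self.freq[bid] += 1
            self.cached.add(bid)      # treat the block's KV entries as cached now
        return ordered
```

For example, a second request sharing block `"doc B"` with an earlier request would have `"doc B"` moved to the front of its ordering, so its cached KV entries are hit before any new blocks are prefilled. A real system would also have to verify that reordering and annotating blocks does not change the model's answer, which is the quality-preservation problem the abstract describes.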

Publication
In 9th Annual Conference on Machine Learning and Systems (MLSys 2026)
Yinsicheng Jiang
PhD Student
Yeqi Huang
PhD Student
Cheng Deng
Research Fellow
Xuan Sun
Research Associate
Luo Mai
Associate Professor