[11/25, Paper] BitDecoding accepted to HPCA 2026.
Thrilled to announce that our paper BitDecoding has been accepted to HPCA 2026! 🚀
The Problem: Long-context LLMs are memory-hungry. Low-bit KV-cache quantization (2-bit, 4-bit) can dramatically shrink the memory footprint, but existing systems are painfully slow: they run entirely on CUDA cores and leave Tensor Cores, the GPU's main computational powerhouse, idle.
Our Innovation: BitDecoding unlocks Tensor Cores for low-bit KV-cache inference. By cooperatively leveraging both CUDA cores and Tensor Cores with optimized layouts and smart parallelization, we achieve the best of both worlds: small memory footprint AND blazing-fast decoding.
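For intuition on the memory side, here is a minimal NumPy sketch of why 4-bit KV-cache quantization shrinks the footprint so much. This is purely illustrative: the group size, asymmetric scheme, and byte-packing layout below are my assumptions for the example, not BitDecoding's actual layouts or kernels.

```python
import numpy as np

def quantize_4bit(kv, group_size=64):
    # Per-group asymmetric 4-bit quantization (illustrative, not BitDecoding's scheme).
    flat = kv.reshape(-1, group_size).astype(np.float32)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8              # 4 bits -> 16 levels per group
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]      # two 4-bit codes per byte
    return packed, scale, lo                     # scale/lo add small per-group overhead

def dequantize_4bit(packed, scale, lo, shape):
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128)).astype(np.float16)  # toy KV-cache slice
packed, scale, lo = quantize_4bit(kv)
restored = dequantize_4bit(packed, scale, lo, kv.shape)
print(kv.nbytes, "->", packed.nbytes)                  # 2048 -> 512 bytes (4x smaller)
```

The real systems challenge, which the paper targets, is that the packed uint8 codes must be dequantized on the fly inside the attention kernel, and naive dequantization on CUDA cores becomes the bottleneck that keeps Tensor Cores starved.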
The Impact:
- Up to 8.9× faster than FP16 FlashDecoding on Hopper and Blackwell
- 4.3× faster than the state-of-the-art low-bit system
- 3× lower latency on LLaMA-3.1-8B with 128K context
- Works across Blackwell, Hopper, and Ampere GPUs
This opens the door for efficient long-context inference at scale. Looking forward to presenting at HPCA 2026!