[11/25, Paper] BitDecoding accepted to HPCA 2026.

Thrilled to announce that our paper BitDecoding has been accepted to HPCA 2026! 🚀

The Problem: Long-context LLMs are memory-hungry. While low-bit KV-cache quantization (2-bit, 4-bit) can dramatically reduce memory footprint, existing systems are painfully slow: they rely solely on CUDA cores and leave Tensor Cores, the GPU's main computational powerhouse, idle.

Our Innovation: BitDecoding unlocks Tensor Cores for low-bit KV-cache inference. By cooperatively leveraging both CUDA cores and Tensor Cores with optimized layouts and smart parallelization, we achieve the best of both worlds: small memory footprint AND blazing-fast decoding.
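To make the idea concrete, here is a minimal NumPy sketch (our own illustration, not the paper's CUDA kernels) of the kind of per-group 4-bit KV-cache quantization that such systems build on: values are quantized per group with a scale and zero point, and two 4-bit codes are packed into each byte, cutting memory 4× versus FP16. The function names and group size are illustrative choices, not from the paper.

```python
import numpy as np

def quantize_kv_4bit(kv, group_size=64):
    """Per-group asymmetric 4-bit quantization of a KV-cache tensor.

    kv: float array of shape (n_tokens, head_dim), head_dim divisible
    by group_size. Returns packed codes plus per-group scale/zero-point.
    """
    n, d = kv.shape
    g = kv.reshape(n, d // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0          # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0           # avoid divide-by-zero on flat groups
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    # Pack two 4-bit codes per byte: 4x smaller than FP16 storage.
    packed = q[..., 0::2] | (q[..., 1::2] << 4)
    return packed, scale, lo

def dequantize_kv_4bit(packed, scale, lo):
    """Unpack 4-bit codes and reconstruct approximate FP values."""
    q = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    q[..., 0::2] = packed & 0x0F
    q[..., 1::2] = packed >> 4
    g = q.astype(np.float32) * scale + lo
    return g.reshape(g.shape[0], -1)
```

A real decoding kernel would fuse this unpacking with the attention matmul so the dequantized tiles feed Tensor Cores directly instead of round-tripping through memory; that fusion and layout design is where the systems work lies.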

The Impact:

  • Up to 8.9× faster than FP16 FlashDecoding on Hopper and Blackwell
  • 4.3× faster than state-of-the-art low-bit systems
  • 3× lower latency on LLaMA-3.1-8B with 128K context
  • Works across Blackwell, Hopper and Ampere GPUs

This opens the door for efficient long-context inference at scale. Looking forward to presenting at HPCA 2026!

Luo Mai
Associate Professor
