[07/25, Paper] WaferLLM, the world's fastest LLM inference system, accepted to OSDI 2025.

Thrilled to announce that WaferLLM, the world’s fastest LLM inference system, has been accepted to OSDI 2025! 🎉

The Opportunity: Wafer-scale accelerators pack hundreds of thousands of AI cores with massive on-chip memory (tens of GB) and incredible bandwidth (tens of PB/s). But current LLM systems, built for GPUs, can’t harness this power—leaving most of the hardware idle.

Our Breakthrough: WaferLLM is the first LLM inference system purpose-built for wafer-scale architectures. We introduce:

  • PLMR, a novel device model capturing the key characteristics of wafer-scale hardware
  • Wafer-scale LLM parallelism spanning hundreds of thousands of cores
  • MeshGEMM & MeshGEMV, the first scalable GEMM and GEMV implementations for wafer-scale architectures (see the sketch below)
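
For intuition, here is a minimal, hypothetical sketch of the mesh-parallel GEMV idea, simulated with NumPy. It is not the MeshGEMV algorithm itself, which relies on wafer-specific, nearest-neighbour communication; the names `P`, `tiles`, and `shards` below are illustrative only:

```python
import numpy as np

# Toy simulation of a GEMV y = A @ x tiled over a P x P mesh of cores.
# Illustrative only; not the WaferLLM / MeshGEMV implementation.

P = 4                      # mesh side length (real wafers: hundreds per side)
n = 8 * P                  # matrix dimension, chosen divisible by P
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

b = n // P                 # tile size owned by each core

# Core (i, j) holds one b x b tile of A and the j-th shard of x.
tiles = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(P)] for i in range(P)]
shards = [x[j*b:(j+1)*b] for j in range(P)]

# Step 1: every core computes its local partial product in parallel.
partials = [[tiles[i][j] @ shards[j] for j in range(P)] for i in range(P)]

# Step 2: reduce the partials along each mesh row. On a wafer this would be
# done with nearest-neighbour shifts rather than a single global sum.
y = np.concatenate([sum(partials[i][j] for j in range(P)) for i in range(P)])

assert np.allclose(y, A @ x)
```

The point of the sketch is that both the tiles of A and the arithmetic are spread over all P² cores; making this pattern scale and stay communication-efficient at wafer scale is, roughly, the problem MeshGEMV and MeshGEMM address.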

The Results:

  • 200× higher accelerator utilization vs. state-of-the-art
  • 606× faster GEMV operations than NVIDIA A100
  • 16× more energy-efficient than A100
  • 10-20× faster end-to-end LLM inference than A100 clusters running SGLang and vLLM

This opens a new frontier for LLM inference at unprecedented scale and efficiency. Open-sourced at: https://github.com/MeshInfra/WaferLLM

Looking forward to presenting at OSDI 2025!

Luo Mai