[07/25, Paper] WaferLLM, the world's fastest LLM inference system, accepted to OSDI 2025.
Thrilled to announce that WaferLLM, the world’s fastest LLM inference system, has been accepted to OSDI 2025! 🎉
The Opportunity: Wafer-scale accelerators pack hundreds of thousands of AI cores with massive on-chip memory (tens of GB) and incredible bandwidth (tens of PB/s). But current LLM systems, built for GPUs, can’t harness this power—leaving most of the hardware idle.
Our Breakthrough: WaferLLM is the first LLM inference system purpose-built for wafer-scale architectures. We introduce:
- Novel PLMR model capturing wafer-scale hardware characteristics
- Wafer-scale LLM parallelism across hundreds of thousands of cores
- MeshGEMM & MeshGEMV—the first scalable implementations for wafer architectures
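To give a flavor of what mesh-parallel GEMV means, here is a toy sketch: the matrix is tiled across a 2D grid of "cores," each core computes a partial product on its tile, and partials are reduced along each mesh row. This is an illustrative assumption only, not WaferLLM's actual MeshGEMV algorithm; the function name and partitioning scheme are hypothetical.

```python
import numpy as np

def mesh_gemv(A, x, mesh_rows=4, mesh_cols=4):
    """Toy mesh-parallel GEMV (y = A @ x), simulated sequentially.

    Hypothetical illustration: A is tiled into a mesh_rows x mesh_cols
    grid; each tile plays the role of one core's local computation,
    and partial results accumulate along each row of the mesh.
    """
    m, n = A.shape
    row_blocks = np.array_split(np.arange(m), mesh_rows)
    col_blocks = np.array_split(np.arange(n), mesh_cols)
    y = np.zeros(m)
    for ri in row_blocks:          # one mesh row of cores
        for cj in col_blocks:      # one "core": local tile x local slice
            y[ri] += A[np.ix_(ri, cj)] @ x[cj]
    return y

A = np.random.rand(8, 8)
x = np.random.rand(8)
assert np.allclose(mesh_gemv(A, x), A @ x)
```

On real wafer-scale hardware the tiles live in each core's local memory and the row-wise reduction happens over the on-chip mesh fabric rather than in a shared array, which is where the scaling challenges the paper addresses arise.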
The Results:
- 200× higher accelerator utilization vs. state-of-the-art
- 606× faster GEMV operations than NVIDIA A100
- 16× more energy-efficient than A100
- 10-20× speedups for full LLM inference vs. A100 clusters (SGLang, vLLM)
This opens a new frontier for LLM inference at unprecedented scale and efficiency. Open-sourced at: https://github.com/MeshInfra/WaferLLM
Looking forward to presenting at OSDI 2025!