ServerlessLLM in OSDI 2024.
ServerlessLLM is a new system that enables cost-effective serverless inference for LLMs. Its key contribution is making "checkpoint locality" practical, which it achieves through three components: a new LLM checkpoint format paired with a fast loader for multi-tier storage, an efficient algorithm for live migration of LLM inference, and a new locality-friendly GPU serverless architecture.
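To illustrate the idea of checkpoint locality, here is a minimal sketch of locality-aware scheduling: a scheduler estimates each server's startup latency from where the checkpoint already resides (host memory, local SSD, or remote storage) and picks the cheapest option. All names, bandwidth numbers, and the cost model are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of locality-aware scheduling; tiers, bandwidths,
# and the cost model are illustrative, not taken from the paper.
from dataclasses import dataclass

# Assumed per-tier checkpoint read bandwidths (GB/s).
BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    tier: str              # fastest storage tier holding the checkpoint
    queued_seconds: float  # time until a GPU frees up on this server

def startup_estimate(server: Server, model_gb: float) -> float:
    """Estimated startup latency: wait for a GPU, then load the checkpoint."""
    return server.queued_seconds + model_gb / BANDWIDTH_GBPS[server.tier]

def schedule(servers: list[Server], model_gb: float) -> Server:
    """Pick the server that minimizes estimated startup latency."""
    return min(servers, key=lambda s: startup_estimate(s, model_gb))

servers = [
    Server("gpu-0", "dram", queued_seconds=4.0),    # checkpoint hot in host memory, but busy
    Server("gpu-1", "ssd", queued_seconds=0.0),     # checkpoint on local SSD, idle
    Server("gpu-2", "remote", queued_seconds=0.0),  # must pull checkpoint over the network
]
best = schedule(servers, model_gb=14.0)  # e.g. a 7B-parameter model in fp16
print(best.name)  # prints "gpu-1": loading 14 GB from local SSD beats waiting or a cold pull
```

The point of the sketch is that a locality-aware scheduler may prefer a server with the checkpoint on fast local storage over an idle server that would need a slow remote load, and live migration (as in the paper) extends this by letting running inference move out of the way of a locality-preferred placement.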
The ServerlessLLM paper has been accepted to OSDI'24, a top systems conference, and we are in the process of open-sourcing the system. Stay tuned.