Tags
#Machine Learning Systems #LLM Serving

MPI-LLM
MPI-LLM is an HPC-oriented serving system for Large Language Models that helps institutions run many LLMs on a limited GPU pool. Built on top of vLLM, it replaces the Ray backend with a Message Passing Interface (MPI) control plane and a master-side scheduler that manages long-lived GPU workers across multiple nodes. This design improves compatibility with HPC infrastructure and reduces the overheads of deployment and coordination. MPI-LLM supports hosting multiple models simultaneously and enables rapid model switching, allowing users to move between models with different GPU requirements without repeatedly tearing down and restarting jobs. Our implementation demonstrates lower startup and switching overheads on multi-node GPU clusters while preserving strong inference throughput for multi-model workloads.
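The master-side scheduler described above can be pictured as a small allocator that assigns long-lived workers (identified here by hypothetical MPI ranks) to models with differing GPU requirements, freeing and reusing capacity without restarting jobs. The sketch below is illustrative only; the class and method names are assumptions, not MPI-LLM's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Tracks which long-lived worker ranks (hypothetical IDs) are free on a shared pool."""
    free_ranks: set = field(default_factory=set)
    assignments: dict = field(default_factory=dict)  # model name -> set of worker ranks

    def load_model(self, model: str, num_gpus: int) -> set:
        """Assign num_gpus free workers to a model; fail fast if the pool is exhausted."""
        if num_gpus > len(self.free_ranks):
            raise RuntimeError(f"not enough free GPUs for {model}")
        ranks = {self.free_ranks.pop() for _ in range(num_gpus)}
        self.assignments[model] = ranks
        return ranks

    def unload_model(self, model: str) -> None:
        """Release a model's workers back to the pool; workers stay alive, no job restart."""
        self.free_ranks |= self.assignments.pop(model)


# Hypothetical 4-GPU pool hosting multiple models at once.
pool = GpuPool(free_ranks={0, 1, 2, 3})
pool.load_model("model-a", 1)   # small model on one GPU
pool.load_model("model-b", 2)   # larger model alongside it
pool.unload_model("model-a")    # model switch: free its GPU without tearing anything down
pool.load_model("model-c", 2)   # reuse the freed capacity immediately
```

In a real deployment the scheduler would additionally broadcast load/unload commands to the chosen ranks over MPI; the point of the sketch is only the bookkeeping that makes rapid model switching possible on a fixed pool.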
MPI-LLM provides:
- MPI runtime integration: Provides an MPI-native runtime layer for coordination across nodes.
- Request-level scheduling: Schedules and routes requests to coordinate multi-model execution.
- GPU allocation policies: Dynamically assigns GPU resources to support efficient model switching on a shared pool.
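As a rough illustration of the request-level scheduling point above, the sketch below routes incoming requests into per-model FIFO queues and drains them in bounded batches, so several hosted models can be served from one entry point. The names and batch policy are assumptions for illustration, not MPI-LLM internals:

```python
from collections import deque


class RequestRouter:
    """Routes requests to per-model FIFO queues to coordinate multi-model execution."""

    def __init__(self, loaded_models):
        self.queues = {m: deque() for m in loaded_models}

    def submit(self, model: str, prompt: str) -> None:
        """Enqueue a request for a currently loaded model."""
        if model not in self.queues:
            raise KeyError(f"model {model!r} is not loaded")
        self.queues[model].append(prompt)

    def next_batch(self, model: str, max_batch: int = 8) -> list:
        """Pop up to max_batch pending requests to hand to that model's workers."""
        q = self.queues[model]
        return [q.popleft() for _ in range(min(max_batch, len(q)))]


router = RequestRouter(["model-a", "model-b"])
router.submit("model-a", "hello")
router.submit("model-a", "world")
batch = router.next_batch("model-a")  # drains pending requests in arrival order
```

A per-model queue keeps one slow model from blocking requests destined for another, which is the coordination property the bullet describes.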