vLLM is an end-to-end serving system with a FastAPI frontend and a GPU-based inference engine.
The vLLM engine is written in Python and C++/CUDA; control-related components such as the scheduler and the block manager are implemented in Python, while custom CUDA kernels handle performance-critical operations.
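To make the division of labor concrete, the sketch below is a deliberately simplified Python block manager that maps a sequence's logical KV-cache blocks to physical GPU blocks. It is a hypothetical illustration only: the class and method names (`BlockManager`, `allocate`, `append_slot`, `free`) and the `BLOCK_SIZE` value are assumptions for this sketch, not vLLM's actual interfaces.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

@dataclass
class PhysicalBlock:
    block_id: int
    ref_count: int = 0

class BlockManager:
    """Maps each sequence's logical KV-cache blocks to physical GPU blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = [PhysicalBlock(i) for i in range(num_blocks)]
        self.block_tables: dict[int, list[PhysicalBlock]] = {}
        self.num_tokens: dict[int, int] = {}

    def allocate(self, seq_id: int, prompt_len: int) -> None:
        """Reserve enough physical blocks for a new sequence's prompt."""
        needed = -(-prompt_len // BLOCK_SIZE)  # ceiling division
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        for b in blocks:
            b.ref_count = 1
        self.block_tables[seq_id] = blocks
        self.num_tokens[seq_id] = prompt_len

    def append_slot(self, seq_id: int) -> None:
        """Make room for one more token; grab a new block only when the last is full."""
        if self.num_tokens[seq_id] % BLOCK_SIZE == 0:
            block = self.free_blocks.pop()
            block.ref_count = 1
            self.block_tables[seq_id].append(block)
        self.num_tokens[seq_id] += 1

    def free(self, seq_id: int) -> None:
        """Release a finished sequence's blocks back to the free pool."""
        for b in self.block_tables.pop(seq_id):
            b.ref_count -= 1
            if b.ref_count == 0:
                self.free_blocks.append(b)
        del self.num_tokens[seq_id]
```

The per-block reference counts hint at how forked sequences could share blocks copy-on-write; that path is omitted here for brevity.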
Kernel-level optimizations target the memory access patterns introduced by PagedAttention, which existing attention kernels do not support efficiently.
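As a rough illustration of why fused kernels help, the NumPy sketch below shows the access pattern a naive implementation would incur: a sequence's keys live in scattered physical blocks and must be gathered through a block table into a contiguous buffer before attention can be computed. The shapes, the block-table contents, and the `gather_keys` helper are illustrative assumptions; a fused PagedAttention kernel would instead index the block table inside the attention computation and avoid the extra copy.

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64

# Hypothetical paged KV cache: physical blocks stored in arbitrary order.
num_physical_blocks = 128
key_cache = np.random.randn(num_physical_blocks, BLOCK_SIZE, HEAD_DIM).astype(np.float32)

# One sequence of 40 tokens whose logical blocks map to scattered physical blocks.
block_table = [97, 3, 55]   # logical block i -> physical block block_table[i]
seq_len = 40

def gather_keys(block_table: list[int], seq_len: int) -> np.ndarray:
    """Naive gather: copy each logical block's keys into one contiguous buffer."""
    parts = [key_cache[b] for b in block_table]
    return np.concatenate(parts, axis=0)[:seq_len]

query = np.random.randn(HEAD_DIM).astype(np.float32)
keys = gather_keys(block_table, seq_len)      # (seq_len, HEAD_DIM)
scores = keys @ query / np.sqrt(HEAD_DIM)     # attention logits over the sequence
```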
vLLM implements its various decoding algorithms on top of three sequence-level methods, fork, append, and free, which together support parallel sampling, beam search, and prefix sharing.
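A minimal sketch of how such primitives compose follows, assuming a toy sequence manager (the names `SequenceManager` and `create`, and the token ids, are hypothetical, not vLLM's API): parallel sampling forks the prompt once per sample and then appends tokens to each copy independently, while beam search would additionally free pruned candidates.

```python
class SequenceManager:
    """Toy sketch of the fork/append/free primitives decoding algorithms build on."""

    def __init__(self):
        self._next_id = 0
        self._tokens: dict[int, list[int]] = {}

    def create(self, prompt: list[int]) -> int:
        self._next_id += 1
        self._tokens[self._next_id] = list(prompt)
        return self._next_id

    def fork(self, parent_id: int) -> int:
        """New sequence starting as a copy of an existing one; with a paged KV
        cache the copy can share the parent's blocks copy-on-write."""
        return self.create(self._tokens[parent_id])

    def append(self, seq_id: int, token: int) -> None:
        """Append one newly sampled token to a sequence."""
        self._tokens[seq_id].append(token)

    def free(self, seq_id: int) -> None:
        """Drop a finished or pruned sequence, releasing its KV blocks."""
        del self._tokens[seq_id]


# Parallel sampling with n=2: fork the prompt, then decode each copy independently.
mgr = SequenceManager()
root = mgr.create([101, 2023, 2003])             # example prompt token ids
samples = [root, mgr.fork(root)]
for seq_id, tok in zip(samples, (7592, 2088)):   # illustrative sampled tokens
    mgr.append(seq_id, tok)
```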