VLLM
Visit ToolvLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It offers state-of-the-art serving throughput and supports a wide range of models and hardware.
At a glance
Trending
vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It offers state-of-the-art serving throughput and supports a wide range of models and hardware.
Trending
About
vLLM is a fast and easy-to-use library designed for LLM inference and serving, originating from the Sky Computing Lab at UC Berkeley. It boasts state-of-the-art serving throughput and efficient memory management through PagedAttention. Key features include continuous batching, chunked prefill, prefix caching, and fast model execution with CUDA/HIP graphs. vLLM supports various quantization methods like FP8 and INT4, optimized attention kernels such as FlashAttention, and speculative decoding. It offers seamless integration with Hugging Face models, high-throughput serving with diverse decoding algorithms, and distributed inference capabilities. The tool also provides an OpenAI-compatible API server, multi-LoRA support, and broad hardware compatibility, including NVIDIA, AMD, and x86/ARM/PowerPC CPUs, along with plugins for TPUs and other accelerators. It supports over 200 model architectures, including decoder-only, Mixture-of-Expert, hybrid attention, multi-modal, embedding, and reward models.
Capabilities
Pricing & Plans
Open Source
Free
FAQs
Trending