Candle-Vllm
Visit Toolcandle-vllm is an Open Source & Models tool that provides an efficient platform for local LLM inference and serving. It includes an OpenAI-compatible API server and supports various quantization formats.
At a glance
Trending
candle-vllm is an Open Source & Models tool that provides an efficient platform for local LLM inference and serving. It includes an OpenAI-compatible API server and supports various quantization formats.
Trending
About
candle-vllm offers an efficient and easy-to-use platform for inference and serving local Large Language Models (LLMs), featuring an OpenAI-compatible API server. Its highly extensible trait-based system allows for rapid implementation of new module pipelines, and it supports streaming during generation. Key capabilities include efficient management of key-value cache with PagedAttention, continuous batching for incoming requests, and in-situ quantization (including GPTQ/Marlin 4-bit formats). The platform supports various hardware, including Mac/Metal devices, and offers multi-GPU and multi-node inference. It also features chunked prefilling, CUDA Graph support, and an OpenAI-compatible tool calling API, making it a versatile solution for deploying and managing LLMs.
Capabilities
Pricing & Plans
Open Source
Free
FAQs
Trending