Candle-Vllm

Visit Tool

candle-vllm is an Open Source & Models tool that provides an efficient platform for local LLM inference and serving. It includes an OpenAI-compatible API server and supports various quantization formats.

Claim this tool

2Views

At a glance

Pricing

Open Source

Free tier

Yes

API

Yes

Skill level

Technical

About

What is candle-vllm?

candle-vllm offers an efficient and easy-to-use platform for inference and serving local Large Language Models (LLMs), featuring an OpenAI-compatible API server. Its highly extensible trait-based system allows for rapid implementation of new module pipelines, and it supports streaming during generation. Key capabilities include efficient management of key-value cache with PagedAttention, continuous batching for incoming requests, and in-situ quantization (including GPTQ/Marlin 4-bit formats). The platform supports various hardware, including Mac/Metal devices, and offers multi-GPU and multi-node inference. It also features chunked prefilling, CUDA Graph support, and an OpenAI-compatible tool calling API, making it a versatile solution for deploying and managing LLMs.

Best used for

Ideal for developers and data scientists who need to efficiently serve local LLMs, deploy quantized models, and build custom inference pipelines. Especially valuable for those requiring an OpenAI-compatible API for local models, multi-GPU/multi-node inference, and advanced features like PagedAttention and continuous batching.

Common actions

serve LLMs locally

deploy AI models

optimize LLM inference

quantize LLMs

build custom LLM pipelines

face swappinggithub copilot"AI Agents"automated workflowopen-sourcedeepfakeworkflowscollaborationlow-code/no-code

Capabilities

Key features

OpenAI compatible API server
Streaming support
PagedAttention KV cache
Continuous batching
In-situ quantization
Multi-GPU inference
Multi-node inference

Target Audience

developerdata scientist

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What types of LLMs does candle-vllm support for inference?

candle-vllm supports a wide range of model architectures for chat serving, including LLAMA, Mistral, Phi3/Phi4, QWen2/Qwen3, Gemma, DeepSeek, GLM4, and MiniMax. It also supports various quantization formats like Q4k, Marlin, FP8, and NVFP4.

Can candle-vllm be used for multi-GPU or multi-node inference?

Yes, candle-vllm supports both multi-GPU inference (in multi-process and multi-threaded modes) and multi-node inference using an MPI runner. This allows for scaling up the serving of large language models across multiple hardware resources.

Does candle-vllm offer an OpenAI-compatible API for local LLMs?

Yes, a key feature of candle-vllm is its OpenAI-compatible API server. This allows users to interact with their locally served LLMs using familiar OpenAI API calls, simplifying integration with existing tools and workflows.

Trending

Subcategories trending in Coding & Development

Code Assistants DevOps & Infrastructure No-Code / Low-Code Testing & QA Backend & APIs Prompt Engineering

Trending

Explore

Browse AI tools by category

Content & Design Productivity & Business Coding & Development AI Agents & Automation Research & Education Wellness & Lifestyle Career Development Marketing & Growth Data & Analytics Customer Support & CX Finance E-commerce