LMCache

LMCache is an open-source KV cache layer for LLM serving, listed under AI Frameworks & Infra. It reduces time to first token (TTFT) and increases throughput, especially in long-context scenarios.

At a glance

Pricing: Open Source

Free tier: Yes

API

Skill level: Technical

About

What is LMCache?

LMCache is an open-source library that accelerates Large Language Model (LLM) serving by acting as a Key-Value (KV) cache layer. It reduces Time To First Token (TTFT) and increases throughput, particularly with long contexts. It does this by storing the KV caches of texts in storage tiers such as GPU, CPU, disk, and S3, and reusing them instead of recomputing them, with acceleration techniques such as zero CPU copy and GDS (GPUDirect Storage). It integrates with LLM serving engines such as vLLM and SGLang and offers features such as high-performance CPU KV cache offloading and disaggregated prefill. This saves latency and GPU cycles in LLM use cases such as multi-round QA and RAG.
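
The multi-tier idea described above can be illustrated with a small, self-contained sketch. This is not LMCache's implementation or API: the class name TieredKVStore, the helper chunk_keys, the chunk size, and the tier capacities are all hypothetical, and the "KV cache" values are plain strings. It only shows the general pattern of keying cached chunks by a hash of the text so far, looking them up from the fastest tier to the slowest, and demoting the least recently used entry when a tier fills up.

```python
import hashlib
from collections import OrderedDict
from typing import Optional

CHUNK_TOKENS = 4  # hypothetical chunk size, chosen small for the example


def chunk_keys(token_ids: list[int], chunk: int = CHUNK_TOKENS) -> list[str]:
    """Return one key per full chunk; each key hashes the chunk plus all tokens before it."""
    keys = []
    h = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk  # ignore a trailing partial chunk
    for start in range(0, full, chunk):
        h.update(repr(token_ids[start:start + chunk]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class TieredKVStore:
    """Toy multi-tier store (hypothetical): each tier is a bounded LRU map."""

    def __init__(self, capacities: dict[str, int]):
        # Dict order defines tier order, fastest first (e.g. gpu -> cpu -> disk).
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in capacities}

    def put(self, key: str, value: str, tier: Optional[str] = None) -> None:
        names = list(self.tiers)
        name = tier or names[0]
        store = self.tiers[name]
        store[key] = value
        store.move_to_end(key)
        if len(store) > self.capacities[name]:
            # Evict the least recently used entry and demote it one tier down.
            old_key, old_value = store.popitem(last=False)
            nxt = names.index(name) + 1
            if nxt < len(names):
                self.put(old_key, old_value, names[nxt])

    def get(self, key: str) -> Optional[tuple[str, str]]:
        """Return (tier_found_in, value) and promote the entry, or None on a miss."""
        for name, store in self.tiers.items():
            if key in store:
                value = store.pop(key)
                self.put(key, value)  # promote to the fastest tier
                return name, value
        return None


store = TieredKVStore({"gpu": 1, "cpu": 1, "disk": 2})
k0, k1 = chunk_keys(list(range(8)))
store.put(k0, "kv-chunk-0")
store.put(k1, "kv-chunk-1")   # gpu is full, so chunk 0 is demoted to cpu
first_hit = store.get(k0)     # found in cpu, then promoted back to gpu
second_hit = store.get(k0)    # now served from gpu
```

Hashing each chunk together with everything before it means two prompts share keys only for the chunks where their text is identical from the start, which is the simplest reuse rule to reason about.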

Best used for

Ideal for developers and data scientists who need to optimize Large Language Model (LLM) serving, reduce Time To First Token (TTFT), and increase inference throughput. It is especially valuable in long-context scenarios and for teams that already run an LLM serving engine such as vLLM or SGLang and want to save GPU cycles.

Common actions

optimize LLM performance

accelerate LLM inference

manage KV cache

reduce inference latency

improve LLM throughput

Capabilities

Key features

KV cache layer
Reduce TTFT
Increase throughput
Multi-tier storage
vLLM integration
SGLang integration
Disaggregated prefill

Target Audience

developer, data scientist

Integrations

vLLM, SGLang, Redis, WEKA, Pliops

Pricing & Plans

Open Source

Free

FAQs

What are the primary benefits of using LMCache for LLM serving?

LMCache reduces Time To First Token (TTFT) and increases the overall throughput of LLM serving. It does so by reusing Key-Value (KV) caches stored across several storage tiers instead of recomputing them, which saves GPU cycles and shortens user response times, especially for long-context applications.
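
To see why KV cache reuse saves GPU work in a multi-round conversation, a back-of-the-envelope sketch can help. The function prefill_tokens and the token counts below are hypothetical and are not measurements of LMCache. The sketch assumes every earlier token's KV cache is found on each round and ignores the cost of loading cached entries.

```python
def prefill_tokens(new_tokens_per_round: list[int], reuse_cache: bool) -> int:
    """Total tokens the engine must prefill over a multi-round conversation.

    new_tokens_per_round[i] is the number of tokens added in round i
    (the previous answer plus the new question).
    """
    total = 0
    history = 0  # tokens already seen in earlier rounds
    for new_tokens in new_tokens_per_round:
        if reuse_cache:
            total += new_tokens            # history comes from the KV cache
        else:
            total += history + new_tokens  # history is recomputed every round
        history += new_tokens
    return total


rounds = [1000, 200, 200]  # made-up sizes: a long document, then two follow-ups
without_cache = prefill_tokens(rounds, reuse_cache=False)
with_cache = prefill_tokens(rounds, reuse_cache=True)
saved_fraction = 1 - with_cache / without_cache
```

Under these assumptions the gap grows with every round, because the recomputed history gets longer while the new tokens per round stay roughly constant.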

Which LLM serving engines does LMCache integrate with?

LMCache has official integrations with vLLM (version 1) and SGLang. These integrations add features such as high-performance CPU KV cache offloading and disaggregated prefill.

What are the system requirements for installing and running LMCache?

LMCache primarily targets Linux systems with NVIDIA GPUs. It is typically installed with pip (pip install lmcache). The official documentation covers specific configurations, such as different vLLM versions or other serving engines, and troubleshooting for issues like 'undefined symbol' errors.
