AirLLM
AirLLM optimizes inference memory usage, allowing 70B large language models to run on a single 4GB GPU without quantization. It supports inference for models like Llama3.1 on limited VRAM.
About
AirLLM is an open-source framework that reduces inference memory usage for large language models, so that 70B models can run on a single 4GB GPU without quantization, distillation, or pruning. It can also run the 405B Llama3.1 model on 8GB of VRAM. An optional model compression feature offers up to 3x faster inference. Supported model families include Llama, Qwen, ChatGLM, Baichuan, Mistral, and InternLM, and the tool also runs on macOS. An AutoModel feature detects the model type automatically, and prefetching overlaps model loading with computation to improve speed.
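The prefetching idea mentioned above can be illustrated with a toy sketch. This is not AirLLM's code or API. It assumes the model is processed one layer at a time, and the names load_layer, apply_layer, and run_layers_with_prefetch are hypothetical stand-ins invented for this example. While the current layer is being computed, a background thread loads the next one, so at most two layers are held in memory at once.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Hypothetical stand-in for reading one layer's weights from disk."""
    return {"index": index, "scale": index + 1}


def apply_layer(layer, activations):
    """Hypothetical stand-in for running one layer on the device."""
    return [x * layer["scale"] for x in activations]


def run_layers_with_prefetch(num_layers, activations):
    """Run layers in order while a background thread loads the next layer."""
    if num_layers == 0:
        return activations
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # wait until the current layer is loaded
            if i + 1 < num_layers:
                # Start loading the next layer before computing this one,
                # so loading and computation overlap.
                pending = pool.submit(load_layer, i + 1)
            activations = apply_layer(layer, activations)
            del layer  # drop the reference so the layer's memory can be freed
    return activations


if __name__ == "__main__":
    print(run_layers_with_prefetch(3, [1.0, 2.0]))
```

In a real setting the load step would read weights from disk and the compute step would run on the GPU, so the overlap hides part of the loading time. Here both steps are trivial and only the control flow is meaningful.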
Pricing & Plans
Open Source
Free