Distributed-Llama


Distributed-llama runs distributed LLM inference across a cluster of home devices, using tensor parallelism and high-speed synchronization to speed up generation. It runs on Linux, macOS, and Windows and is optimized for ARM and x86_64 AVX2 CPUs.


At a glance

Pricing: Open Source
Free tier: Yes
API: Yes
Skill level: Technical

About

What is distributed-llama?

Distributed-llama is an open-source project that accelerates Large Language Model (LLM) inference by spreading the work across a cluster of connected home devices. It uses tensor parallelism and high-speed synchronization over Ethernet to distribute the computational load, so that adding devices can increase inference speed. It runs on Linux, macOS, and Windows and is optimized for ARM and x86_64 AVX2 CPUs. A root node loads the model and weights, while worker nodes each process a slice of the neural network. Distributed-llama supports a range of Llama and Qwen models and provides commands for inference, chat, and running worker nodes, along with an API server. It also offers manual model conversion and supports specific quantization types.
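To make the idea of tensor parallelism concrete, here is a minimal sketch in plain Python. It is not code from the project: the function names (split_rows, worker_matvec, root_matvec) are hypothetical, and the row-wise split is only one common way to slice a layer. It shows a weight matrix being cut into slices, each slice being multiplied independently (as a worker would), and the partial outputs being gathered back into one result.

```python
# Illustrative sketch only. These names are hypothetical and are not part of
# distributed-llama; the row-wise split is an assumption made for clarity.

def split_rows(weight, n_nodes):
    """Split a weight matrix (a list of rows) into n_nodes contiguous slices."""
    if n_nodes < 1 or len(weight) % n_nodes != 0:
        raise ValueError("row count must be divisible by the number of nodes")
    step = len(weight) // n_nodes
    return [weight[i * step:(i + 1) * step] for i in range(n_nodes)]


def worker_matvec(weight_slice, x):
    """What a single node computes: the output elements for its own rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_slice]


def root_matvec(weight, x, n_nodes):
    """Scatter slices, compute each partial result, then gather them in order."""
    partials = [worker_matvec(s, x) for s in split_rows(weight, n_nodes)]
    return [y for part in partials for y in part]


weight = [
    [1, 0, 2],
    [0, 1, 0],
    [3, 1, 1],
    [2, 2, 2],
]
x = [1, 2, 3]

# Splitting across 2 nodes gives the same answer as computing on 1 node.
result = root_matvec(weight, x, n_nodes=2)
print(result)
```

In this toy version the "workers" run one after another in a single process. In a real cluster each slice would live on a different device, and the gather step is where network synchronization cost appears.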

Best used for

Ideal for developers who need to accelerate LLM inference, distribute computational loads across multiple devices, and build a powerful local cluster for AI model processing. Especially valuable for those looking to leverage existing home hardware for advanced AI tasks.

Common actions

accelerate LLM inference

distribute AI models

run LLMs locally

cluster computing


Capabilities

Key features

Distributed LLM inference
Tensor parallelism
High-speed synchronization
Multi-platform support
Optimized for ARM/x86_64 CPUs
Root/worker node architecture
CLI for inference/chat/worker

Target Audience

developer

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What types of LLMs does Distributed-llama support?

Distributed-llama supports various Llama models, including Llama 3.1, Llama 3.2, and Llama 3.3, as well as Qwen 3 models (0.6B, 1.7B, 8B, 14B, 30B) and DeepSeek R1 Distill Llama 8B. It also offers experimental Vulkan support for Qwen 3 MoE models.

What are the hardware and software requirements for running Distributed-llama?

To run Distributed-llama, you need Python 3 and a C++ compiler. It is optimized for ARM and x86_64 AVX2 CPUs and supports Linux, macOS, and Windows. Every device needs sufficient RAM, and the root node needs more than the worker nodes.

How does Distributed-llama distribute the LLM inference workload?

Distributed-llama uses a root/worker node architecture. The root node loads the model and weights, forwards them to the workers, and synchronizes the neural network state. Each worker node processes its own slice of the neural network, so RAM usage is split across all connected devices.
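As a rough way to reason about the RAM split, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not the project's actual memory accounting: the even split of weights and the root_overhead_bytes value are assumptions chosen for illustration, and estimate_ram_per_node is a hypothetical helper.

```python
# Back-of-the-envelope sketch. The even split and the root overhead figure are
# assumptions; estimate_ram_per_node is a hypothetical helper, not a project API.

GIB = 1024 ** 3


def estimate_ram_per_node(model_bytes, n_nodes, root_overhead_bytes=0):
    """Return (root_bytes, worker_bytes) assuming weights are split evenly."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    per_node = model_bytes / n_nodes
    # The root holds its own slice plus whatever extra state it needs.
    return per_node + root_overhead_bytes, per_node


# Hypothetical example: an 8 GiB model on 4 devices, with 1 GiB of assumed
# extra state on the root node.
root_bytes, worker_bytes = estimate_ram_per_node(8 * GIB, 4, root_overhead_bytes=1 * GIB)
print(f"root: {root_bytes / GIB:.1f} GiB, worker: {worker_bytes / GIB:.1f} GiB")
```

Under these assumptions, doubling the number of devices halves the per-device share of the weights, while the root's extra overhead stays constant.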
