EvalPlus

EvalPlus is a Testing & QA tool that provides rigorous evaluation for LLM-synthesized code. It offers extensive test suites like HumanEval+ and MBPP+ and evaluates code efficiency with EvalPerf.

At a glance

Pricing: Open Source

Free tier: Yes

API: not stated

Skill level: Technical

About

What is EvalPlus?

EvalPlus is a rigorous evaluation framework for Large Language Models (LLMs) that generate code. It extends existing benchmarks: HumanEval+ has 80x more tests than the original HumanEval, and MBPP+ has 35x more tests than the original MBPP, so code correctness is assessed more precisely. EvalPerf additionally evaluates the efficiency of LLM-generated code using performance-exercising tasks and test inputs. The framework supports several LLM backends, including HuggingFace, vLLM, OpenAI-compatible servers, Anthropic, Google Gemini, Amazon Bedrock, and Ollama. Developers and researchers can use EvalPlus to benchmark LLMs, identify fragile code generations, and measure efficiency as well as correctness.
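To make the idea of a "fragile" generation concrete, here is a small, self-contained Python sketch. It does not use the EvalPlus API or any real benchmark data: the function `candidate_median`, the two test lists, and the helper `passes` are all hypothetical, chosen only to show how a solution can pass a small base suite and still fail once more inputs are added.

```python
from typing import Callable, List, Tuple

# Each case is an (input list, expected result) pair. All data here is made up.
Case = Tuple[List[float], float]


def candidate_median(values: List[float]) -> float:
    """Hypothetical model output: only correct for odd-length lists."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# A small "base" suite that happens to cover only odd-length inputs.
BASE_TESTS: List[Case] = [([3, 1, 2], 2), ([5], 5)]

# Extra cases of the kind an extended suite might add (even-length inputs).
EXTRA_TESTS: List[Case] = [([1, 2, 3, 4], 2.5), ([10, 20], 15.0)]


def passes(func: Callable[[List[float]], float], tests: List[Case]) -> bool:
    """Return True only if func gives the expected result on every case."""
    return all(func(list(args)) == expected for args, expected in tests)


base_ok = passes(candidate_median, BASE_TESTS)
extended_ok = passes(candidate_median, BASE_TESTS + EXTRA_TESTS)
print("passes base suite:", base_ok)
print("passes extended suite:", extended_ok)
```

The candidate passes the base suite but fails the extended one, which is the kind of gap a larger test suite is meant to expose.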

Best used for

Ideal for developers and researchers who need to rigorously evaluate the correctness of LLM-synthesized code, benchmark different LLMs, and assess the efficiency of generated solutions. Especially valuable for those working on improving and deploying code generation AI models.

Common actions

evaluate code generation

benchmark LLMs

test code efficiency

assess code correctness

Capabilities

Key features

Rigorous LLM code evaluation
HumanEval+ extended tests
MBPP+ extended tests
EvalPerf code efficiency
Multiple LLM backend support
Safe Docker execution

Target Audience

developer

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What LLM backends does EvalPlus support?

EvalPlus supports HuggingFace models, vLLM, OpenAI models, OpenAI-compatible servers (such as DeepSeek and Grok), Anthropic models, Google Gemini, Amazon Bedrock, and Ollama.

How does EvalPlus ensure safe code execution?

EvalPlus provides options to run code inside Docker containers. This isolates the execution environment from the host system, reducing the risk of security problems or unintended side effects while LLM-generated code is being evaluated.
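The isolation idea can be illustrated without Docker. The sketch below is not EvalPlus's implementation and is much weaker than a container: it only runs an untrusted snippet in a separate Python process with a timeout, so that a crash or an infinite loop does not take down the evaluator. The helper `run_isolated` and its result format are hypothetical.

```python
import subprocess
import sys


def run_isolated(source: str, timeout_s: float = 5.0) -> dict:
    """Run `source` in a child Python process and report what happened."""
    try:
        proc = subprocess.run(
            # -I starts the child in isolated mode (ignores PYTHON* environment
            # variables and the user site directory); -c runs the given source.
            [sys.executable, "-I", "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}


good = run_isolated("print(sum(range(5)))")
bad = run_isolated("raise ValueError('boom')")
hung = run_isolated("while True: pass", timeout_s=1.0)
print(good["status"], bad["status"], hung["status"])
```

A separate process does not restrict filesystem, network, or memory access; container-based execution is what adds those boundaries.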

What is the difference between HumanEval+ and MBPP+?

HumanEval+ and MBPP+ are extended versions of popular code generation benchmarks. HumanEval+ features 80 times more tests than the original HumanEval, while MBPP+ includes 35 times more tests than the original MBPP, both designed for more rigorous correctness evaluation.
