EvalPlus
At a glance
EvalPlus is a Testing & QA tool that provides rigorous evaluation for LLM-synthesized code. It offers extensive test suites like HumanEval+ and MBPP+ and evaluates code efficiency with EvalPerf.
About
EvalPlus is a rigorous evaluation framework for Large Language Models (LLMs) that generate code. It extends existing benchmarks: HumanEval+ has 80x more tests than the original HumanEval, and MBPP+ has 35x more tests than the original MBPP, which gives a more precise assessment of code correctness. EvalPerf additionally evaluates the efficiency of LLM-generated code through performance-exercising tasks and test inputs. The framework supports several LLM backends, including HuggingFace, vLLM, OpenAI-compatible servers, Anthropic, Google Gemini, Amazon Bedrock, and Ollama. With EvalPlus, developers and researchers can benchmark LLMs, identify fragile code generations, and measure performance beyond correctness.
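To illustrate why a larger test suite matters, here is a minimal sketch of how extra test inputs can expose a fragile solution that passes a small base suite. It does not use the EvalPlus API; the functions `reference_median`, `fragile_median`, and `pass_rate`, and both input lists, are hypothetical names and data made up for this example.

```python
from typing import Callable, List, Sequence


def reference_median(xs: Sequence[float]) -> float:
    """Ground-truth median: averages the two middle values for even lengths."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def fragile_median(xs: Sequence[float]) -> float:
    """Hypothetical model output: correct only for odd-length inputs."""
    s = sorted(xs)
    return s[len(s) // 2]


def pass_rate(candidate: Callable, reference: Callable,
              inputs: List[List[float]]) -> float:
    """Fraction of inputs on which the candidate matches the reference."""
    passed = sum(candidate(list(x)) == reference(list(x)) for x in inputs)
    return passed / len(inputs)


# A small base suite (assumed for illustration): odd-length lists only.
base_inputs = [[1, 2, 3], [5, 1, 9], [7]]

# An extended suite: the base inputs plus even-length and repeated-value cases.
plus_inputs = base_inputs + [[1, 2, 3, 4], [10, 20], [3, 1, 2], [4, 4, 4, 4]]

base_rate = pass_rate(fragile_median, reference_median, base_inputs)
plus_rate = pass_rate(fragile_median, reference_median, plus_inputs)

print(f"base suite pass rate:     {base_rate:.2f}")
print(f"extended suite pass rate: {plus_rate:.2f}")
```

The fragile solution passes every base input but fails the even-length cases in the extended suite, which is the kind of gap a denser test suite is meant to surface.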
Pricing & Plans
Open Source
Free