Human-Eval

Human-eval is an evaluation harness for large language models trained on code. It runs untrusted model-generated code against tests to measure how well a model generates functionally correct code.

At a glance

Pricing

Open Source

Free tier

Yes

API

Skill level

Technical

About

What is human-eval?

Human-eval is an evaluation harness for assessing large language models (LLMs) that have been trained on code. It gives researchers and developers a framework for running and testing untrusted model-generated code in order to measure a model's code generation capabilities. It covers generating samples, evaluating their functional correctness, and reporting results such as pass@k metrics. Because the harness runs untrusted code, the execution call is deliberately commented out: users must enable it themselves and are strongly encouraged to do so only inside a robust security sandbox. It is aimed at anyone developing or benchmarking code-generating AI.
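As a rough illustration of the sample-generation step, the sketch below collects several completions per task and serializes them as JSON Lines, one record per line. The `task_id` and `completion` field names are assumed to match the harness's sample format and should be checked against the project's README. The `Example/0` task, the `build_samples` and `to_jsonl` helpers, and the stand-in "model" are hypothetical and exist only for this example.

```python
import json
from typing import Callable, Dict, List


def build_samples(
    prompts: Dict[str, str],
    generate: Callable[[str], str],
    samples_per_task: int,
) -> List[dict]:
    """Collect `samples_per_task` completions for every task prompt."""
    return [
        # Field names are an assumption about the expected sample format.
        {"task_id": task_id, "completion": generate(prompt)}
        for task_id, prompt in prompts.items()
        for _ in range(samples_per_task)
    ]


def to_jsonl(records: List[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical task and a stand-in "model" that always returns the same body.
example_prompts = {"Example/0": "def add(a, b):\n"}
samples = build_samples(example_prompts, lambda prompt: "    return a + b\n", 2)
jsonl_text = to_jsonl(samples)
```

Generating more than one completion per task is what makes pass@k for k greater than 1 meaningful, since the metric asks whether any of several samples passes.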

Best used for

Ideal for AI researchers and developers who need to rigorously evaluate the code generation capabilities of large language models. It is especially useful for benchmarking new models, comparing different LLM versions, and checking the functional correctness of AI-generated code.

Common actions

evaluate code models

benchmark AI code

test generated code

assess LLM performance

Capabilities

Key features

Evaluate code LLMs
Run untrusted code
Functional correctness evaluation
Pass@k metrics
JSON Lines format

Target Audience

AI researchers
Machine learning engineers
Developers working on code generation models

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What is the primary purpose of HumanEval?

HumanEval is designed to evaluate large language models trained on code. It provides a standardized dataset and an evaluation harness to assess the functional correctness and code generation capabilities of these AI models.

How does HumanEval handle untrusted model-generated code?

The tool is built to run untrusted model-generated code. It strongly encourages users to execute this code within a robust security sandbox, with the execution call deliberately commented out to ensure users acknowledge the security implications.
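To make the idea of classifying an execution outcome concrete, here is a minimal sketch that runs a candidate program in a separate Python process with a time limit and reports whether it passed, failed, or timed out. This is not HumanEval's own execution code, and `check_completion` is a hypothetical helper. A subprocess with a timeout is also not a security sandbox: it limits runtime only, so real untrusted code should still be run inside an isolated environment.

```python
import subprocess
import sys


def check_completion(program: str, timeout: float = 3.0) -> str:
    """Run `program` in a child Python process and classify the outcome.

    Returns "passed" if the process exits with status 0, "failed" if it
    exits with any other status (for example after an AssertionError),
    and "timed out" if it exceeds `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,  # keep the child's output off our console
            timeout=timeout,      # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    return "passed" if proc.returncode == 0 else "failed"


# The program would normally be prompt + completion + test code.
candidate = "def add(a, b):\n    return a + b\n\nassert add(2, 3) == 5\n"
outcome = check_completion(candidate)
```

Running each candidate in its own process keeps a crash or infinite loop in generated code from taking down the evaluation loop itself.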

What kind of metrics does HumanEval provide for evaluation?

HumanEval reports pass@k, the estimated fraction of problems for which at least one of k generated samples passes the tests. It also records fine-grained results for each completion: whether it passed, timed out, or failed.
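One common way to compute pass@k is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that passed. The sketch below implements that formula with the standard library and averages it over problems. The function names `pass_at_k` and `mean_pass_at_k` are chosen for this example and are not taken from the harness.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 1 - 7/10, or about 0.3, which matches the intuition that a single random sample passes 30% of the time.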
