MLE-bench

MLE-bench is an open-source benchmark for measuring how well AI agents perform at machine learning engineering. It provides code for dataset construction, evaluation logic, and agents for benchmarking.

At a glance

Pricing

Open Source

Free tier

Yes

API

Skill level

Technical

About

What is MLE-bench?

MLE-bench is an open-source benchmark developed by OpenAI for evaluating how well AI agents perform machine learning engineering tasks. The repository includes the code for constructing the dataset, the evaluation logic, and the agents that were initially benchmarked. A leaderboard lists agents with their scores at each complexity level (Low, Medium, High) and overall, along with running times and the LLMs used. Users can submit their own agents by following the instructions for producing leaderboard scores. A "lite" evaluation option restricts runs to the lower-complexity tasks, which reduces computational cost and makes broader experimentation more practical. The dataset comprises 75 Kaggle competitions, with preparation scripts that split the training data and grading scripts that evaluate submissions. Additional features include a rule violation detector and a plagiarism detector.

Best used for

Ideal for data scientists and AI researchers who need to rigorously benchmark AI agents' performance in machine learning engineering tasks. Especially valuable for comparing different agent architectures, developing new ML agents, and assessing their capabilities across a diverse set of Kaggle competitions.

Common actions

benchmark AI agents

evaluate machine learning models

compare agent performance

develop ML agents

Capabilities

Key features

AI agent performance benchmark
Dataset construction code
Evaluation logic
Public leaderboard
Submission grading scripts
Rule violation detector
Plagiarism detector

Target Audience

Data scientists and AI researchers

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What kind of tasks does MLE-bench evaluate AI agents on?

MLE-bench evaluates AI agents on machine learning engineering tasks derived from 75 Kaggle competitions. The competitions span categories such as image classification, text classification, tabular data, and audio classification, and are grouped into three complexity levels (Low, Medium, High).

How can I submit my own AI agent to the MLE-bench leaderboard?

To submit your agent, you need to organize grading reports in the `runs/` folder, identify your run groups in `runs/run_group_experiments.csv`, and then use the provided Python scripts to aggregate grading reports for different splits (low, medium, high, all).
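
The aggregation step can be pictured with a minimal sketch. This is not the repository's script: the function name `medal_rate_by_split`, the report fields `competition_id` and `any_medal`, and the competition ids are all assumptions made for illustration. It only shows the general idea of rolling per-run grading results up into per-split and overall rates.

```python
from collections import defaultdict


def medal_rate_by_split(reports, split_of):
    """Aggregate per-run results into a medal rate per split and overall.

    reports:  iterable of dicts with the (assumed) keys
              'competition_id' and 'any_medal'.
    split_of: dict mapping competition id -> 'low' | 'medium' | 'high'.
    """
    totals = defaultdict(int)
    medals = defaultdict(int)
    for report in reports:
        split = split_of[report["competition_id"]]
        # Every run counts toward its own split and toward 'all'.
        for key in (split, "all"):
            totals[key] += 1
            medals[key] += int(bool(report["any_medal"]))
    return {key: medals[key] / totals[key] for key in totals}


# Hypothetical data: three made-up competitions, four runs.
split_of = {"comp-a": "low", "comp-b": "medium", "comp-c": "high"}
reports = [
    {"competition_id": "comp-a", "any_medal": True},
    {"competition_id": "comp-a", "any_medal": False},
    {"competition_id": "comp-b", "any_medal": False},
    {"competition_id": "comp-c", "any_medal": True},
]
rates = medal_rate_by_split(reports, split_of)
print(rates)
```

For an actual submission, use the scripts and file layout described in the MLE-bench repository; the sketch above only illustrates how the low, medium, high, and all splits relate to one another.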

Is there a less resource-intensive way to benchmark agents with MLE-bench?

Yes, MLE-bench offers a "lite" evaluation option. This uses only the Low complexity split of the dataset, consisting of 22 competitions, which significantly reduces the number of runs and dataset size compared to the full benchmark.
