MLE-bench
MLE-bench is an open-source benchmark for measuring how well AI agents perform at machine learning engineering. It provides code for dataset construction, evaluation logic, and agents for benchmarking.
About
MLE-bench is an open-source benchmark developed by OpenAI for evaluating how well AI agents perform machine learning engineering tasks. It includes the code used to construct the dataset, the evaluation logic, and the agents that were initially benchmarked. The dataset comprises 75 Kaggle competitions, each with a preparation script that splits the training data and a grading script that evaluates submissions.
A leaderboard lists AI agents with their scores at each complexity level (Low, Medium, High), their overall score, their running time, and the LLM they use. Users can submit their own agents for evaluation by following the instructions for producing leaderboard scores. A "lite" evaluation option covers only the lower-complexity tasks, which reduces computational cost and makes broader experimentation practical. Additional features include a rule violation detector and a plagiarism detector.
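The prepare-then-grade workflow described above can be illustrated with a minimal sketch. This is not MLE-bench's actual code: the function names prepare_split and grade_submission, the row layout, and the choice of accuracy as the metric are all assumptions made for illustration. The real benchmark uses per-competition scripts and metrics.

```python
import random


def prepare_split(rows, test_fraction=0.2, seed=0):
    """Hypothetical preparation step: carve a held-out test set
    out of the original labeled training rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (new train, held-out test)


def grade_submission(submission, answers):
    """Hypothetical grading step: accuracy of predictions keyed by row id.
    Missing predictions count as wrong."""
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(
        1 for row_id, label in answers.items() if submission.get(row_id) == label
    )
    return correct / len(answers)


# Toy data standing in for a competition's labeled training set.
rows = [{"id": i, "label": i % 2} for i in range(10)]
train, test = prepare_split(rows)

# The grader keeps the held-out labels; the agent only sees the test ids.
answers = {row["id"]: row["label"] for row in test}
perfect_submission = dict(answers)
empty_submission = {}

print(grade_submission(perfect_submission, answers))
print(grade_submission(empty_submission, answers))
```

The point of the split is that an agent can be scored locally against labels it never saw, without access to the original competition's private test set.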
Pricing & Plans
Open Source
Free