HaluEval

HaluEval is an open-source benchmark for evaluating hallucination in Large Language Models. It provides a 35,000-sample dataset and code for generating hallucinated samples, evaluating how well LLMs recognize them, and analyzing the results.

At a glance

Pricing: Open Source
Free tier: Yes
API:
Skill level: Technical

About

What is HaluEval?

HaluEval is an open-source benchmark for evaluating hallucination in Large Language Models (LLMs). Its dataset contains 35,000 samples: 5,000 human-annotated general user queries with ChatGPT responses, and 30,000 task-specific examples spanning question answering, knowledge-grounded dialogue, and text summarization. The repository provides code for three things: generating hallucinated samples, evaluating how well an LLM recognizes hallucinations, and analyzing which types of content LLMs tend to hallucinate. Together these give researchers and developers a standardized way to identify and understand hallucination tendencies when working to improve the reliability and factual consistency of LLMs.
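The recognition evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the field names `question`, `right_answer`, and `hallucinated_answer`, the "Yes"/"No" verdict convention, and the `judge` callable are all assumptions made for the example. In a real run, `judge` would wrap a call to the LLM under test.

```python
import random

def recognition_accuracy(samples, judge, seed=0):
    """Return the fraction of samples on which the judge's verdict is correct."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    correct = 0
    for sample in samples:
        # Show the judge either the hallucinated or the correct answer.
        show_hallucinated = rng.random() < 0.5
        key = "hallucinated_answer" if show_hallucinated else "right_answer"
        verdict = judge(sample["question"], sample[key])
        # Assumed convention: "Yes" means "this answer is hallucinated".
        expected = "Yes" if show_hallucinated else "No"
        if verdict == expected:
            correct += 1
    return correct / len(samples)

# Toy data and a stand-in judge (hypothetical; not taken from the benchmark).
samples = [
    {"question": "What is the capital of France?",
     "right_answer": "Paris", "hallucinated_answer": "Lyon"},
    {"question": "How many legs does a spider have?",
     "right_answer": "Eight", "hallucinated_answer": "Six"},
]
known_answers = {s["question"]: s["right_answer"] for s in samples}

def oracle_judge(question, answer):
    """Flag an answer as hallucinated when it differs from the known one."""
    return "No" if known_answers[question] == answer else "Yes"

accuracy = recognition_accuracy(samples, oracle_judge)
```

The oracle judge always knows the right answer, so it only serves to check the scoring logic; swapping in a model-backed judge leaves the rest unchanged.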

Best used for

HaluEval suits researchers and developers who need to systematically evaluate the hallucination tendencies of Large Language Models, generate synthetic hallucinated data for different tasks, and analyze the specific types of content where LLMs fail. It is especially useful for work on the factual consistency and trustworthiness of AI-generated text.

Common actions

evaluate LLM hallucination

generate hallucinated data

analyze LLM reliability

benchmark language models

Capabilities

Key features

35K hallucination data
Data generation code
LLM evaluation code
Hallucination analysis
QA hallucination samples
Dialogue hallucination samples
Summarization hallucination samples

Target Audience

AI researchers, ML engineers, NLP developers, data scientists

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What types of data are included in the HaluEval benchmark?

HaluEval includes 35,000 data points: 5,000 human-annotated general user queries with ChatGPT responses, and 30,000 task-specific examples from question answering, knowledge-grounded dialogue, and text summarization tasks. Each subset is designed to help identify and analyze LLM hallucinations.
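As a rough illustration of checking this composition on a local copy, the sketch below tallies records per subset. It assumes the data is stored as one JSON object per line and that each record carries a `subset` field; both are assumptions for the example, and the real files may be organized differently (for instance, one file per task).

```python
import json
from collections import Counter
from io import StringIO

def count_by_subset(lines):
    """Tally records per subset from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "subset" is a hypothetical field name used only in this sketch.
        counts[record.get("subset", "unknown")] += 1
    return counts

# In-memory stand-in for a data file, so the sketch runs without downloads.
demo_file = StringIO(
    '{"subset": "general", "query": "q1"}\n'
    '{"subset": "qa", "question": "q2"}\n'
    '{"subset": "qa", "question": "q3"}\n'
    '\n'
    '{"subset": "summarization", "document": "d1"}\n'
)
counts = count_by_subset(demo_file)
```

On the full benchmark, the counts would be expected to sum to the 35,000 data points described above.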

Can HaluEval be used to generate new hallucinated samples?

Yes. HaluEval provides code and instructions for generating hallucinated samples. Users take existing task datasets such as HotpotQA, OpenDialKG, and CNN/Daily Mail as seed data, then prompt ChatGPT with task-specific instructions to create new hallucinated examples for evaluation.
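The seed-plus-instruction step might look something like the sketch below, which only assembles the prompt text and does not call any model. The function name, the instruction wording, and the prompt labels are hypothetical; the repository's own instruction files define the actual wording.

```python
def build_generation_prompt(instruction, knowledge, question, right_answer):
    """Assemble a prompt asking a model for a plausible but wrong answer."""
    return (
        f"{instruction.strip()}\n\n"
        f"Knowledge: {knowledge}\n"
        f"Question: {question}\n"
        f"Correct answer: {right_answer}\n"
        "Hallucinated answer:"
    )

# Hypothetical task-specific instruction for question answering.
qa_instruction = (
    "Write an answer that sounds plausible but contradicts the knowledge below."
)

# Hypothetical seed record in the style of a QA dataset.
prompt = build_generation_prompt(
    qa_instruction,
    knowledge="The Eiffel Tower was completed in 1889.",
    question="When was the Eiffel Tower completed?",
    right_answer="1889",
)
```

The resulting string would then be sent to the generating model, and its completion stored next to the seed record as the hallucinated counterpart.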

What kind of analysis can be performed with HaluEval?

HaluEval supports analysis of LLM recognition results: it identifies the topics or content types for which an LLM succeeds or fails at recognizing hallucinations. The analysis can use topic modeling techniques such as LDA (Latent Dirichlet Allocation), run both on all samples for a task and on only those samples where the LLM failed, so the two topic distributions can be compared.
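The comparison step can be sketched as below. Note that this does not run LDA: it assumes each sample has already been assigned a topic label (for example, its dominant topic from a separately fitted topic model) and only computes, per topic, the share of samples the LLM misjudged. The topic names and results are made up for the example.

```python
from collections import Counter

def failure_rate_by_topic(results):
    """Map each topic to the share of its samples the model misjudged.

    `results` is an iterable of (topic, recognized) pairs, where
    `recognized` is True when the model judged the sample correctly.
    """
    totals, failures = Counter(), Counter()
    for topic, recognized in results:
        totals[topic] += 1
        if not recognized:
            failures[topic] += 1
    return {topic: failures[topic] / totals[topic] for topic in totals}

# Hypothetical per-sample outcomes with precomputed topic labels.
results = [
    ("film", True),
    ("film", False),
    ("sports", True),
    ("sports", True),
    ("film", False),
]
rates = failure_rate_by_topic(results)
```

Topics with a high failure rate relative to their share of all samples are the ones where the model most often misses hallucinations.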
