Glossary

What is an AI Benchmark?

Benchmarks are standardized tests for AI models, a bit like the SAT for software. They give models a fixed set of problems to solve and measure how well they perform compared to other models (and sometimes compared to humans).

When a company announces their new AI "scores 92% on MMLU" or "beats GPT-4 on HumanEval," they're talking about benchmark results. These scores help researchers, companies, and users compare AI models objectively.

The simple version: Benchmarks are standardized tests for AI. They help us answer "is this AI actually getting smarter?" with data instead of vibes.
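Under the hood, the arithmetic behind most benchmark scores is simple: run the model on every question, grade each answer, and report the fraction it got right. A minimal sketch in Python (the questions and the `model_answer` stand-in are invented for illustration, not a real benchmark or API):

```python
# Minimal sketch of benchmark scoring: score = correct / total.
# The questions and model_answer() below are toy placeholders.

questions = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "H2O is commonly called?", "answer": "water"},
]

def model_answer(prompt):
    # Stand-in for a real model call; returns canned answers,
    # including one deliberate mistake.
    canned = {
        "2 + 2 = ?": "4",
        "Capital of France?": "Paris",
        "H2O is commonly called?": "steam",
    }
    return canned[prompt]

correct = sum(model_answer(q["prompt"]) == q["answer"] for q in questions)
score = correct / len(questions)
print(f"Score: {score:.0%}")  # prints "Score: 67%"
```

Real benchmarks differ mainly in how grading works: multiple-choice answers can be checked by string match, while coding benchmarks like HumanEval actually run the generated code against tests.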

Common benchmarks you'll see in the news

- MMLU: multiple-choice questions across dozens of academic subjects, from law to physics.
- HumanEval: short Python programming problems, graded by actually running the code.
- GSM8K: grade-school math word problems that require step-by-step reasoning.
- GPQA: graduate-level science questions designed to be hard to look up.

Why benchmarks aren't the full picture

A model that aces benchmarks might still give bad advice in real conversations. Benchmarks test specific skills under controlled conditions — they don't capture everything about how useful an AI is in practice. Think of it like a student who's great at standardized tests but struggles with real-world problem-solving.

FAQ

Can AI cheat on benchmarks?

Sort of. If an AI was trained on the benchmark questions, it might "memorize" answers rather than truly reasoning. This is called "data contamination," and it's a real concern. Good benchmarks try to prevent this by keeping test questions secret or generating new ones.
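One crude way researchers probe for contamination is to look for long word-for-word overlaps between benchmark questions and the training data. A minimal sketch of that idea (the training text and question here are toy stand-ins, and real checks are far more sophisticated):

```python
# Rough sketch of an n-gram overlap contamination check:
# flag a question if it shares any 8-word window with the corpus.
# The corpus and question are toy examples.

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

training_text = ("the quick brown fox jumps over the lazy dog "
                 "while the cat watches from the fence")
benchmark_q = "the quick brown fox jumps over the lazy dog today"

overlap = ngrams(training_text) & ngrams(benchmark_q)
print("possibly contaminated" if overlap else "looks clean")
```

Labs have reported overlap checks in this spirit, though the window size and matching rules vary.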

Related Terms

Large Language Model

The technology behind ChatGPT, Claude, and Gemini — an AI trained on vast amounts of text.

Open Source AI

AI models that anyone can download, use, modify, and share for free.
