ConfidenceBench

A Benchmark for Gauging Overconfidence in LLMs

Large Language Models (LLMs) are advancing rapidly, but they still suffer from hallucinations and overconfidence, making them unreliable for high-stakes tasks. ConfidenceBench is a novel benchmark that evaluates an LLM’s ability to recognize its own uncertainty. Unlike traditional benchmarks that focus only on accuracy, ConfidenceBench penalizes models for being overconfident when wrong, highlighting a critical weakness in current AI systems.

Our dataset consists of 100 challenging multiple-choice questions across four categories:

Spatial Reasoning – Can LLMs visualize real-world physics?
High-Precision Math – Can they compute with extreme precision?
Word Lookup from Texts – Can they retrieve exact information?
Offline Knowledge – Can they recognize what they don’t know?

Humans Still Outperform LLMs
Despite rapid AI progress, human performance remains far ahead of every tested LLM on this benchmark: the human tester scored 397.0, while the best model, Claude 3.5 Sonnet, scored -574.0. ConfidenceBench reveals a domain where humans are still decisively better, a rarity as AI surpasses human ability in more and more fields, and one where human intuition and uncertainty awareness remain unmatched.

The ConfidenceBench paper on Academia.edu

Download the ConfidenceBench paper

Live Leaderboard
Below is the ConfidenceBench leaderboard, which I'll keep updated as new models are released. Scores range from -10000 to 1000. A model scores 1000 by answering every question correctly with maximum confidence, and -10000 by answering every question incorrectly with maximum confidence. Across the 100 questions, that works out to +10 for each confident correct answer and -100 for each confident wrong one, so a single overconfident mistake cancels ten confident successes.
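To make the asymmetry concrete, here is a minimal scoring sketch. This page only states the two extremes (1000 and -10000), so the rule used for intermediate confidence levels is an assumption: a correct answer earns its confidence value and a wrong answer loses the square of it. The function names `score_answer` and `total_score` are hypothetical, and the actual formula in the paper may differ.

```python
def score_answer(is_correct: bool, confidence: int) -> int:
    """Score one answer under an ASSUMED rule (not confirmed by this page)."""
    if not 1 <= confidence <= 10:
        raise ValueError("confidence must be between 1 and 10")
    # Assumption: linear reward when right, quadratic penalty when wrong.
    # This matches the stated extremes: +10 and -100 at confidence 10.
    return confidence if is_correct else -(confidence ** 2)


def total_score(answers):
    """Sum per-question scores over (is_correct, confidence) pairs."""
    return sum(score_answer(ok, conf) for ok, conf in answers)


# The two extremes described above, for a 100-question run.
best = total_score([(True, 10)] * 100)
worst = total_score([(False, 10)] * 100)
print(best, worst)  # 1000 -10000
```

Under this assumed rule, a model that is unsure does far better by reporting low confidence: a wrong answer at confidence 1 costs 1 point, while the same wrong answer at confidence 10 costs 100.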

Leaderboard

Position	Model	Company	Score
1st	Human Tester	—	397.0
2nd	Claude 3.5 Sonnet	Anthropic	-574.0
3rd	GPT-4o	OpenAI	-2038.0
4th	Gemini 1.5 Pro	Google	-3714.0

Example Questions

The dataset is kept private to avoid leakage onto the internet. Here are four example questions that are not part of the private dataset.
For every question, the model must also give a score from 1 to 10 indicating its level of confidence.

Spatial Reasoning
Question:
I am standing in the center of a room, holding a mug and a marble. I place the marble inside the mug. Then, I walk over to a table and flip the mug upside down onto the table before putting it in the fridge. The mug has a large hole in the bottom. Where is the marble now?

✅ On the floor
❌ In the fridge
❌ On the table
❌ None of these

---

High-Precision Math
Question:
Take the third digit after the decimal of the square root of 867. Multiply this digit by the square root of 456. Round the result to a whole number.

❌ 84
❌ 83
❌ 88
✅ 85

---

Word Lookup from Texts
Question:
What is the 8th word in Chapter Two of Harry Potter and the Philosopher’s Stone? It comes between the words “the” and “had.”

❌ house
✅ Dursleys
❌ car
❌ world

---

Offline Knowledge
Question:
What color was featured in the painting in the entrance hall of Flat 7B, 12 Santa Monica, Madrid on the 4th of January 2025?

✅ Green
❌ Red
❌ Blue
❌ Yellow
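
As a sanity check on the High-Precision Math example above, the arithmetic can be reproduced in a few lines. This is only an illustrative sketch using Python's standard `math` module; the variable names are my own and it is not part of the benchmark harness.

```python
import math

# Square root of 867 is roughly 29.4448...
root = math.sqrt(867)

# Third digit after the decimal point: shift three places, keep the last digit.
third_digit = int(root * 1000) % 10

# Multiply that digit by the square root of 456 and round to a whole number.
answer = round(third_digit * math.sqrt(456))

print(third_digit, answer)  # 4 85
```

The result matches the option marked correct. The distractors sit within a few units of it, so small slips in the intermediate digits are enough to land on a wrong option.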

Contact

If you have any questions about the benchmark or would like to collaborate, I would love to hear from you at my email.

Get in touch
