
Why Benchmark-Driven Selection Matters
Other tools make you guess which model is best. We use real performance data.
Traditional Model Selection
Manual research to find "best" models
Marketing claims vs. actual performance
Models go stale as new versions release
No way to know if better options exist
Configuration becomes technical debt
Hive's Benchmark-Driven Approach
7 ground-truth benchmarks (SWE-bench, LiveCodeBench, GPQA, etc.)
2 preference sources for reasoning and creativity fallback
Daily sync from OpenRouter's 340+ model catalog
Automatic reassignment when better models appear
Zero maintenance—profiles improve automatically
Evergreen Architecture
Model assignments stay current automatically—updated daily without any action required.
Daily Model Sync
Every 24 hours, Hive syncs with OpenRouter's catalog of 340+ models from 50+ providers. New models are automatically evaluated.
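As an illustration, a daily sync job might look roughly like the sketch below. It assumes OpenRouter's public /api/v1/models listing endpoint; the CatalogModel shape is simplified, and storeCatalog is a hypothetical persistence hook, not part of Hive or OpenRouter.

// Minimal sketch of a daily catalog sync (TypeScript), under the assumptions above.
interface CatalogModel {
  id: string;                                       // OpenRouter model slug ("provider/model-name")
  context_length: number;                           // maximum context window in tokens
  pricing: { prompt: string; completion: string };  // USD per token, returned as strings
}

async function syncCatalog(): Promise<CatalogModel[]> {
  // OpenRouter publishes its model catalog at this public endpoint.
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`Catalog sync failed: HTTP ${res.status}`);
  const { data } = (await res.json()) as { data: CatalogModel[] };
  await storeCatalog(data); // hypothetical: persist the refreshed catalog for scoring
  return data;
}

// Hypothetical persistence hook; swap in whatever store the scoring pipeline uses.
declare function storeCatalog(models: CatalogModel[]): Promise<void>;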
Benchmark Scoring
Models are scored against 9 benchmark sources. When a new model outperforms the current assignment, it's automatically promoted.
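The promotion rule can be pictured as a comparison over normalized scores. The sketch below uses assumed names and a plain average per capability; Hive's actual weighting and normalization are not specified here.

// Minimal sketch of benchmark-driven promotion (TypeScript), with assumed names.
interface ScoredModel {
  id: string;
  scores: Record<string, number>; // benchmark source -> normalized score in [0, 1]
}

// Average a model's scores over the benchmark sources mapped to one capability.
function capabilityScore(model: ScoredModel, sources: string[]): number {
  const available = sources.filter((s) => s in model.scores);
  if (available.length === 0) return 0;
  return available.reduce((sum, s) => sum + model.scores[s], 0) / available.length;
}

// Keep the current assignment unless a candidate scores strictly higher.
function promoteIfBetter(
  current: ScoredModel,
  candidate: ScoredModel,
  sources: string[], // e.g. the coding sources: SWE-bench, Aider Polyglot, LiveCodeBench
): ScoredModel {
  return capabilityScore(candidate, sources) > capabilityScore(current, sources)
    ? candidate
    : current;
}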
Zero Maintenance
Select a profile once. As the AI landscape evolves—new models, deprecations, benchmark changes—your profiles keep improving.
Trusted Benchmark Sources
Real performance data from 9 authoritative sources—ground-truth benchmarks, academic research, and peer-reviewed studies.
Updated daily • Verified 2025 • No marketing claims—only measured results
Ground-Truth Sources (Actual Task Completion)
SWE-bench
Princeton NLP
The gold standard for coding ability. Tests AI on real GitHub issues with verified test suites.
Measures: CODE tenant
Aider Polyglot
Paul Gauthier
Tests actual code editing across 6 programming languages, demonstrating real polyglot capability.
Measures: CODE tenant
LiveCodeBench
UC Berkeley/MIT/Cornell
A contamination-free benchmark that continuously harvests new problems from LeetCode, AtCoder, and Codeforces.
Measures: CODE tenant
GPQA Diamond
NYU/Anthropic
PhD-level science questions that experts can solve but non-experts cannot.
Measures: REASONING tenant
FaithJudge
Vectara
Measures hallucination rates using human-annotated examples. Lower hallucination = higher accuracy.
Measures: FACTUAL tenant
SimpleQA
OpenAI
OpenAI's factual accuracy benchmark. Tests whether models give correct, verifiable facts.
Measures: FACTUAL tenant
Agent Leaderboard
Galileo / HuggingFace
Multi-step task completion and tool selection across 5 real-world domains.
Measures: DOMAIN + CODE
Preference Sources (Human Feedback Fallback)
LMSYS Arena
UC Berkeley
Human preference via blind A/B comparisons. Captures quality and helpfulness that synthetic benchmarks miss.
Measures: REASONING, CREATIVITY (fallback)
OpenRouter
API Gateway
Live model availability, pricing, latency, and performance tiers from the API gateway Hive uses.
Provides: Metadata, context windows, cost optimization
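For example, context-window and pricing metadata from the catalog can gate which candidates are even considered. Below is a minimal sketch, reusing the simplified CatalogModel shape from the sync sketch above; the thresholds are illustrative, not Hive defaults.

// Filter catalog entries by minimum context window and maximum prompt price.
function affordableWithContext(
  models: CatalogModel[],
  minContext: number,           // e.g. 128_000 tokens
  maxPromptUsdPerToken: number, // e.g. 0.000005 USD per prompt token
): CatalogModel[] {
  return models.filter(
    (m) =>
      m.context_length >= minContext &&
      parseFloat(m.pricing.prompt) <= maxPromptUsdPerToken,
  );
}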
Why Ground-Truth Matters
Most AI tools rank models by popularity—which response do people prefer? Hive ranks models by capability—which model actually solves the problem?
Ground-truth benchmarks prove a model fixed the bug, answered correctly, or completed the task. Preference benchmarks only show what people liked. For hard problems, you want proven capability—not popularity.