Arena (formerly LMArena)

Roadmap & Position in AI Evaluation

Crowdsourced human-preference benchmarking platform for LLMs and generative AI models.

Company Overview

Arena is an AI evaluation platform that runs human-preference model comparisons and publishes leaderboards used by frontier labs and enterprise buyers. Its leaderboard is cited by OpenAI, Google, Anthropic, and xAI, and it sells enterprise evaluation services into software engineering, legal, medical, and research workflows.

What They're Building

The company's public product roadmap & what they're committed to building.

Community Leaderboard

Public head-to-head voting across text, image, and (as of January 2026) video models.

Arena Enterprise Evaluations

Commercial benchmarking service for model labs and enterprises, reportedly reaching $30M ARR within four months of launch.

Arena-Hard and RouteLLM

Research-grade datasets and routing tools released through the lmarena GitHub org.

Vision Arena

Multimodal preference evaluation, extending the battle format to image and video outputs.

Founder and Key Execs

Anastasios N. Angelopoulos

CEO and Co-founder (PhD ML/CV, UC Berkeley; ex-Google DeepMind researcher)

Wei-Lin Chiang

CTO and Co-founder (PhD candidate, UC Berkeley SkyLab; ex-Amazon, Google Research, MSR Asia)

Ion Stoica

Co-founder and Advisor (UC Berkeley professor; co-founder Databricks and Anyscale)

Founder Force Multiplier

The team came out of UC Berkeley's SkyLab under Ion Stoica, whose track record includes Databricks and Anyscale. That gives Arena academic credibility with the labs being evaluated, distribution into the AI research community, and a systems pedigree few evaluation startups can match.

Funding History

2025 | $100M Seed co-led by a16z and UC
2026 | $150M Series A led by Felicis

Competitors

Hugging Face:

Model hosting and open leaderboards, broader scope but less focused on human-preference battles.

Artificial Analysis:

Automated benchmarking and pricing comparisons, no crowdsourced preference layer.

Vellum:

Enterprise eval and prompt ops tooling aimed at application teams, not model labs.

Arena's Moat:

A proprietary dataset of millions of human preference votes across frontier models, a data asset competitors cannot replicate without matching Arena's community scale and neutrality.

How They're Leveraging AI

Human Preference Model Evaluation

Runs pairwise model comparisons across text, image, and video outputs, then converts crowdsourced preferences into benchmark scores and leaderboards used to compare frontier AI systems.

AI Use Overview:

Arena runs human-preference battles across text, image, and now video models, turning crowdsourced pairwise votes into Elo-style leaderboards and the underlying dataset used for research releases like Arena-Hard and RouteLLM.
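
As a rough illustration of how pairwise votes can become an Elo-style ranking, here is a minimal sketch of the classic online Elo update. It is not Arena's actual scoring method, which this profile does not describe. The function names, the model names, the K-factor of 32, and the starting rating of 1000 are all assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_leaderboard(votes, k=32.0, initial=1000.0):
    """Turn pairwise votes into a ranked list of (model, rating) pairs.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A was preferred, 0.0 if B was preferred, and 0.5 for a tie.
    k and initial are illustrative defaults, not values from the source.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = ra + k * (outcome - ea)
        ratings[model_b] = rb - k * (outcome - ea)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes between three made-up models.
votes = [
    ("model-x", "model-y", 1.0),  # x preferred over y
    ("model-y", "model-z", 1.0),  # y preferred over z
    ("model-x", "model-z", 0.5),  # tie
]
board = elo_leaderboard(votes)
for name, rating in board:
    print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which votes arrive, so a production leaderboard would more plausibly fit all votes jointly (for example with a Bradley-Terry model) and report confidence intervals; the sketch only shows the basic vote-to-score conversion.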

More Similar Companies

Ashr

Catches AI agent failures before users see them by stress-testing across text, voice, and images.

AI agents are shipping to production faster than anyone can test them. Ashr generates synthetic users that stress-test agents across text, voice, and images before real users hit the failure modes.

Cajal

Deploys AI mathematicians that formally verify proofs, grounding outputs in truth, not guesses.

LLMs hallucinate. Lean proves things. Cajal pairs LLMs with formal verification so every mathematical result is machine-checked, starting with quantum computing and finance where a wrong proof costs real money.

Cascade

Evaluates and certifies AI agents for safe deployment with red teaming and formal guarantees.

Red teaming and guardrails exist as separate tools. Cascade combines them into one platform with adaptive scaffolding that learns from production runs, already deployed across legal reasoning and customer support agents. The CEO researched graph reasoning and agentic safety at UC Berkeley's BAIR Lab.

Envariant

Lets model builders inspect and steer AI behavior inside the latent space to catch failures.

Most AI safety tools work on model outputs. Envariant operates inside the latent space itself, detecting hallucinations and drift at the representation level before they surface. Beta SDK launched with applications in text LLMs, robotic agents, and protein models.
