Pluralistic Leaderboards Explorer

Why "best AI model" depends on who you ask, and what social-choice theory does about it.

Independent reference implementation of Pluralistic Leaderboards by Nika Haghtalab, Ariel Procaccia, Han Shao, Serena Wang, and Kunhe Yang (January 2026). [paper] · [code] · Apache-2.0 · Independent project; not officially affiliated with the paper authors.

The Problem: Leaderboards Squash Minorities

LMArena ranks AI models by asking humans which one is better in head-to-head conversations. With 100k+ votes from a diverse audience, you'd think the rankings would be reliable. They are (for the average voter). But "the average voter" is a statistical fiction.

Different people use AI for different things. Coders rate models by code quality. Creative writers rate them by prose. A casual user votes on completely different criteria than a research scientist. When you mash all those preferences together with the standard scoring rule (a method called Bradley-Terry), you get a single ranking that is mathematically biased toward the majority and systematically suppresses what minority sub-groups care about.

If 30% of users strongly prefer Model X for coding and 70% prefer Model Y for everything else, Bradley-Terry says "Y wins, every position." The ranking acts like the 30% don't exist. The leaderboard fails to tell coders the right answer (an information loss, not just a fairness complaint).

The Solution: Proportional Voice for Voter Sub-Groups

A 2026 paper from Ariel Procaccia's group at Harvard SEAS (with Nika Haghtalab at UC Berkeley) proposes a fix: instead of producing one ranking that pleases the average, produce a ranking where every proportionally-large sub-group of voters has its preferred model represented. This idea comes from social choice theory (the math of how groups should aggregate preferences fairly). The key concept is local stability: a ranking is "locally stable" if no sub-group can complain that another model would have served them better.
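
To make "locally stable" slightly more concrete, here is one plausible formalization in our own notation. The proportionality quota n/k is our assumption; the paper's exact definition and threshold may differ.

```latex
% S: the top-k prefix of the ranking; n: number of voters;
% y \succ_v x: voter v prefers model y to model x.
% Our paraphrase, not the paper's exact statement.
S \text{ is locally stable} \;\iff\;
\text{there is no } y \notin S \text{ and no group } G \text{ with }
|G| \ge \tfrac{n}{k} \text{ such that }
y \succ_v x \text{ for every } v \in G \text{ and every } x \in S.
```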

The paper proposes three algorithms. The first, Algorithm 1, handles committee selection: it picks one locally stable committee of a given size. The other two extend that idea to a full ranking, top to bottom, with a stability guarantee at every position.

What This Tool Does

This tool takes 20,000 real LMArena head-to-head votes and asks: how would the leaderboard look under each method? You can compare three rankings side by side:

BT Bradley-Terry (the standard)

How it works: Treats every vote as evidence of one universal "skill" score per model. Fits a single number per model that best explains all the votes. Models are ranked by skill score.

Key property: Simple, fast, and mathematically clean, but it silently assumes everyone agrees on what "better" means. Breaks when voters split into sub-groups with different preferences.
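
For a concrete sense of what "fits a single number per model" means, here is a minimal Bradley-Terry fit using the classic MM update, with toy data encoding the 30%-coders / 70%-everyone-else scenario above. This is an illustrative sketch with our own function and variable names, not the repo's fitting code.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=200):
    """Fit one 'skill' score per model from a pairwise win-count matrix.

    wins[i, j] = number of votes where model i beat model j.
    Classic MM (minorize-maximize) update; scores are normalized to sum
    to 1 and models are ranked by score. Toy sketch, not the repo's code.
    """
    m = wins.shape[0]
    scores = np.ones(m)
    for _ in range(n_iters):
        new = np.empty(m)
        for i in range(m):
            total_wins = wins[i].sum()
            weighted_games = sum(
                (wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                for j in range(m) if j != i
            )
            new[i] = total_wins / weighted_games
        scores = new / new.sum()
    return scores

# 30% of voters (coders) pick X over Y; 70% pick Y over X.
wins = np.array([[0, 30],    # row 0: model X's wins
                 [70, 0]])   # row 1: model Y's wins
print(fit_bradley_terry(wins))   # -> roughly [0.3, 0.7]: Y outranks X,
                                 #    and the coders' preference is invisible
```

Models are then sorted by score, which is the BT column in the comparisons below.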

A2 Algorithm 2 (geometric checkpoints)

How it works: Uses Algorithm 1 to fill the leaderboard in geometrically-spaced batches (positions 1, 2, 4, 8, 16). Each batch is chosen to be locally stable on its own.

Key property: Theoretically simpler, but it can fail empirically at high voter dispersion (on our data, it violates stability at five prefix sizes when φ = 0.9). Useful as a comparison point, not as the right answer.
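
A tiny helper showing what "geometrically-spaced batches" means in code. How each checkpoint committee is chosen and ordered is Algorithm 1's job and is not reproduced here; whether the final checkpoint is the full leaderboard size is our assumption.

```python
def checkpoint_sizes(top_n=20):
    """Prefix sizes at which Algorithm 2 re-runs its committee selection:
    1, 2, 4, 8, ... capped at the full leaderboard size."""
    sizes, k = [], 1
    while k < top_n:
        sizes.append(k)
        k *= 2
    return sizes + [top_n]

print(checkpoint_sizes(20))   # [1, 2, 4, 8, 16, 20]
```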

A3 Algorithm 3 (one model at a time)

How it works: Builds the leaderboard one model at a time. For each new pick, it asks: "Would adding this model leave any proportionally-large sub-group of voters without a preferred option in the top-k so far?" If yes, picks differently. Repeats until the leaderboard is full.

Key property: No proportionally-large sub-group can complain that their favorite isn't represented. Gives every voter cluster a "voice" on the leaderboard, in proportion to its size. This is the paper's central ranking algorithm and the one to focus on.
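
Here is a plausible reading of that loop as code: our sketch of the greedy idea described in the prose, not the paper's pseudocode, with invented helper and variable names. Each voter is represented by a full preference order over the models.

```python
def pluralistic_ranking(models, voter_rankings, top_n=None):
    """Greedy sketch of the idea above (our reading, not the paper's
    pseudocode): at each rank, hand the slot to the model backed by the
    largest group of voters not yet served by the prefix."""
    top_n = top_n or len(models)
    prefix = []
    while len(prefix) < top_n:
        remaining = [m for m in models if m not in prefix]
        # For each remaining model y, count the voters who could still
        # complain on y's behalf: y is their favorite remaining model
        # and they prefer y to everything already ranked.
        support = {y: 0 for y in remaining}
        for ranking in voter_rankings:
            pos = {m: i for i, m in enumerate(ranking)}
            favorite = min(remaining, key=pos.__getitem__)
            if all(pos[favorite] < pos[x] for x in prefix):
                support[favorite] += 1
        prefix.append(max(remaining, key=support.__getitem__))
    return prefix

# Toy data: 3 coders love model-x, 7 generalists prefer model-y.
voters = ([["model-x", "model-y", "model-z"]] * 3
          + [["model-y", "model-z", "model-x"]] * 7)
print(pluralistic_ranking(["model-x", "model-y", "model-z"], voters))
# -> ['model-y', 'model-x', 'model-z']: the coders' pick surfaces at #2
```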

How to Read the Results

  1. Toggle the φ button (top of the interactive section) to change how spread out voter preferences are. Low φ = voters cluster into clear sub-groups; high φ = preferences are noisy and broad. (A sampling sketch after this list shows what φ controls under the hood.)
  2. Look at γ̂ (gamma-hat). This is the worst-case "complaint score" (how loudly the most-shortchanged sub-group could complain). Lower γ̂ = more representative ranking. The dashed red line at γ̂ = 1 is the fairness threshold; below it means no sub-group has a valid complaint.
  3. Compare the three rankings. Models that move up from Bradley-Terry's order get a green badge; models that move down get an orange one. The differences are where Algorithm 3 is doing real work (surfacing models that a sub-group prefers but Bradley-Terry's average-voter view drowned out).
  4. Try this: set φ = 0.1 (cohesive sub-groups). Watch Bradley-Terry rank gpt-4.1 at #7. Now look at Algorithm 3. gpt-4.1 drops to #19. Why? Bradley-Terry sees its average win-rate as solid, but no sub-group of voters strongly prefers it; A3's pluralism filter catches that. Compare to claude-opus-4-thinking-16k, which jumps from #11 (BT) to #7 (A3) because a niche sub-group favors it strongly enough to deserve representation.
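
As referenced in step 1, φ is the dispersion parameter of a Mallows-style preference model: each simulated voter's ranking is a noisy copy of a category's center ranking. Below is a minimal sampling sketch using the standard repeated-insertion method; the repo's actual simulation details may differ, and the center ranking shown is made up for illustration.

```python
import random

def sample_mallows(center, phi, rng=random):
    """Draw one voter ranking from a Mallows model around `center`.

    Repeated insertion: the item at index i of the center goes to position
    j of the partial ranking with probability proportional to phi**(i - j).
    Low phi keeps rankings close to the center (cohesive sub-groups);
    phi near 1 approaches a uniform shuffle (noisy, broad preferences).
    """
    ranking = []
    for i, item in enumerate(center):
        weights = [phi ** (i - j) for j in range(i + 1)]
        r, j = rng.random() * sum(weights), 0
        while r > weights[j]:
            r -= weights[j]
            j += 1
        ranking.insert(j, item)
    return ranking

# Illustrative center ranking only, not a claim about the real data.
center = ["gemini-2.5-pro", "claude-opus-4", "gpt-4.1", "sonnet-4"]
print(sample_mallows(center, phi=0.1))   # usually identical to the center
print(sample_mallows(center, phi=0.9))   # often noticeably scrambled
```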

What "Algorithm 3" Means in One Sentence

Build the leaderboard so that no proportionally-large sub-group of voters can name a different model they'd rather have at any position. At each rank, pick the model the next under-represented sub-group needs (instead of the model the average voter likes most). Repeat, top to bottom, until the leaderboard is full.

A note on the data: The model names below (Gemini 2.5 Pro, Claude Opus 4, Sonnet 4, etc.) reflect what was on LMArena in mid-2025, when the underlying dataset (lmarena-ai/arena-human-preference-140k) was released. LMArena hasn't published a newer general-preference dataset of comparable size, and this is the same snapshot the paper uses. The algorithmic point is independent of model freshness; the rankings would update with newer data, but the BT-vs-A3 separation pattern wouldn't. A sensitivity check on a smaller, newer slice (arena-expert-5k, November 2025) is in the repo if you want to verify.

The Run, Live

All numbers below come from a single end-to-end run committed in the repo (results/results.json). Reproducible from one seed in ~35 seconds on CPU.

[Run summary panel (battles, models, categories, eval users, epsilon, seed, runtime) and a voter dispersion φ slider. Lower φ = cohesive sub-groups; higher φ = noisier preferences.]

Max γ̂ at the current φ (worst-case complaint score)

Lower is better. γ̂ measures the worst-case complaint a sub-group could make about the leaderboard at the size that bothers them most. Below 1 = no sub-group has a valid complaint at any position.

Stability curves

Each line plots the complaint score at every leaderboard size from top-1 to top-19. The horizontal threshold at γ̂ = 1 is the fairness boundary; algorithms staying below it preserve representation across all leaderboard sizes. The further the green line (A3) sits below the gray line (BT), the more pluralism the algorithm is recovering.
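
Below is a sketch of how a complaint score of this form can be computed, under the same reading of local stability used above (the quota n/k is our assumption; the paper's exact definition may differ). A stability curve is just this value at every prefix length.

```python
def gamma_hat(prefix, voter_rankings, all_models):
    """Complaint score of a top-k prefix: the largest group of voters who
    unanimously prefer some excluded model to every ranked model, divided
    by the proportionality quota n/k. Values of 1 or more mean the prefix
    is not locally stable under this reading of the definition."""
    n, k = len(voter_rankings), len(prefix)
    worst = 0
    for y in all_models:
        if y in prefix:
            continue
        group = 0
        for ranking in voter_rankings:
            pos = {m: i for i, m in enumerate(ranking)}
            if all(pos[y] < pos[x] for x in prefix):
                group += 1
        worst = max(worst, group)
    return worst * k / n

# One stability curve = this score at every prefix length of a ranking:
# curve = [gamma_hat(ranking[:k], voter_rankings, models) for k in range(1, 20)]
```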

Top-20 rankings, side by side

Same models, three different rankings. Rank-change badges show movement relative to Bradley-Terry. Green up-arrows mark a model that moved up vs BT (a sub-group's favorite that BT missed). Orange down-arrows mark a model that moved down vs BT (BT over-weighted it relative to its actual sub-group support).

Where the votes come from

12 prompt categories from a 4-base × 3-difficulty flattening of LMArena's category_tag schema. The mixture weight is the share of votes in that category; the per-category top model is the Bradley-Terry winner restricted to that slice. gemini-2.5-pro wins 9 of 12 categories on this snapshot, which explains why Bradley-Terry performs well at high φ here and why a more diverse dataset would surface the BT-vs-A3 gap more starkly.
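
A sketch of how a table like this can be built from a battles dataframe. The column names (category, model_a, model_b, winner) are assumptions about the processed data, not the dataset's published schema, and the within-slice win rate is a simpler stand-in for the per-slice Bradley-Terry winner.

```python
import pandas as pd

def category_summary(battles: pd.DataFrame) -> pd.DataFrame:
    """Mixture weight (share of votes) and top model per prompt category."""
    rows = []
    weights = battles["category"].value_counts(normalize=True)
    for cat, weight in weights.items():
        sl = battles[battles["category"] == cat]
        # Stand-in for the per-slice Bradley-Terry winner: the model with
        # the best win rate within this category's battles.
        wins = pd.concat([
            sl.loc[sl["winner"] == "model_a", "model_a"],
            sl.loc[sl["winner"] == "model_b", "model_b"],
        ]).value_counts()
        appearances = pd.concat([sl["model_a"], sl["model_b"]]).value_counts()
        win_rate = (wins / appearances).fillna(0.0)
        rows.append({"category": cat,
                     "mixture_weight": round(weight, 3),
                     "top_model": win_rate.idxmax()})
    return pd.DataFrame(rows)
```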

Does This Hold on Newer Data?

The canonical run above uses LMArena's August 2025 dataset because it's the largest publicly-released slice and the same one Procaccia's paper uses. To check whether the algorithmic findings hold on more recent data, we re-ran the same Algorithm 1/2/3 + Bradley-Terry comparison on lmarena-ai/arena-expert-5k (an expert-only vote slice from November 2025: smaller but newer, with 605 in-slice battles after filtering, 19 active categories, no dominant faction).

Side-by-side: max γ̂ on both datasets

φ     Dataset                        BT max γ̂   A2 max γ̂             A3 max γ̂
0.1   arena-human-preference-140k    0.260       0.132                0.132 ✓
0.1   arena-expert-5k (Nov 2025)     0.699       0.694                0.704
0.5   arena-human-preference-140k    0.430       0.549                0.387 ✓
0.5   arena-expert-5k (Nov 2025)     0.742       0.974                0.560 ✓
0.9   arena-human-preference-140k    0.689       1.140 (5 unstable)   0.883
0.9   arena-expert-5k (Nov 2025)     0.813       0.865                1.171 (4 unstable)

✓ marks rows where A3 achieves the (joint) lowest max γ̂.

What the side-by-side shows

Three observations from the comparison:

  1. Mid-φ (0.5): the paper's main claim replicates cleanly on both datasets. Algorithm 3 cuts max γ̂ vs Bradley-Terry by 10% on 140k and by 24% on the 5k slice. Mid-dispersion (cohesive but overlapping sub-groups) is where pluralism matters most.
  2. Low-φ (0.1): the pattern weakens on 5k. All three methods land within ~1% of each other on this seed (A3 = 0.704 vs BT = 0.699). A second seed (seed = 7) shows A3 winning by 13%, so the qualitative claim holds in expectation but not strictly per-seed. The expert-5k Mallows centers are highly heterogeneous across 19 categories, so even concentrated oracles produce a high baseline γ̂.
  3. High-φ (0.9): the pattern inverts. On 140k, A2 violated stability at 5 prefixes and A3 was stable. On 5k, A3 violates at 4 prefixes (max 1.17 at k=8); A2 stays stable. Confirmed across seeds. Mechanism: the 5k slice has 19 categories of comparable mass with no dominant faction (max weight 27% on "software"); 140k had gemini-2.5-pro winning 9 of 12 categories. Algorithm 3's single-addition decomposition cannot satisfy 19 high-dispersion factions for small k; Algorithm 2's geometric checkpoints reset the lottery and recover.
Algorithm 3 reliably dominates Bradley-Terry in the mid-dispersion regime, where the paper's central claim sits and where most real LMArena-style preference data falls. At low and high dispersion, empirical performance depends on the underlying user-distribution structure: whether there's a dominant faction (favors A3) or many comparable-mass factions (favors A2). The theoretical A3 guarantee remains; empirical max γ̂ varies with the oracle's structure. Worth flagging in any joint write-up: the data sits in results/results-expert-5k.json and the discussion in run_log.md.

Why This Matters Beyond LMArena

Leaderboards shape AI training data. RLHF (the human-feedback loop that fine-tunes every modern model) flows through the same kind of preference aggregation. If Bradley-Terry suppresses minority preferences in evaluation, it suppresses them in training too. The result is models that are quietly miscalibrated for entire user populations.

Procaccia's group calls this mechanism alignment: aligning the AI's training and evaluation mechanisms with the actual diversity of human preferences, instead of an averaged-out fiction. Pluralistic Leaderboards is one piece of a three-paper 2026 program from his lab on this question.

All three papers argue the same thing from different angles: the standard aggregation method (Bradley-Terry) is a load-bearing failure mode in modern AI evaluation and alignment, and social-choice theory provides the principled fix.