Emotional Intelligence Benchmarks for LLMs
A benchmark measuring an LLM judge's ability to grade creative writing using a complex scoring rubric.
Model | Score (Calibrated) | Score (Raw) | Stability | Separability | Cost
---|---|---|---|---|---
⚖️ Judgemark v2 explanation of displayed stats:

Score (Calibrated)
: Overall Judgemark score, after calibration to normalise the judge's score distribution (see the calibration sketch below)

Score (Raw)
: Judgemark score without calibration ("out of the box" performance)

Stability
: A measure of how much the judge's assigned rankings vary between iterations

Separability
: How well the judge can separate models by ability (measured by CI99 overlap with adjacently ranked models)

Cost
: The cost to complete one run of the benchmark (via OpenRouter)
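The calibration referenced under Score (Calibrated) is not spelled out on this page. As a rough illustration only, the sketch below quantile-normalises a judge's raw scores onto a standard normal target, so that judges with compressed or skewed score distributions can be compared on an even footing. The function name, the rank offset, and the normal target are assumptions for illustration, not the actual Judgemark v2 procedure.

# Assumed illustration of score calibration (not the exact Judgemark v2
# method): rank-based quantile normalisation of a judge's raw scores
# onto a standard normal target distribution.
import numpy as np
from scipy.stats import norm, rankdata

def calibrate_scores(raw_scores):
    """Map raw judge scores onto a standard normal distribution via ranks."""
    scores = np.asarray(raw_scores, dtype=float)
    # Ranks mapped into (0, 1); the -0.5 offset keeps quantiles away from 0 and 1
    quantiles = (rankdata(scores) - 0.5) / len(scores)
    return norm.ppf(quantiles)

# Example: a judge that compresses everything into the 6-8 band
print(calibrate_scores([6.1, 6.3, 7.9, 7.2, 6.8, 7.5]))

Rank-based calibration preserves the judge's ordering of models while discarding its idiosyncratic use of the score scale, which is the kind of distribution normalisation the Calibrated column refers to.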
Judgemark Score
The final Judgemark score is a weighted average of several metrics that quantify how well the judge separates models by ability, how stable its rankings are across iterations, and how closely they agree with human preference.
# All elements here are pre-normalised 0-1 (larger = better)

# Compute an aggregated separability metric
separability_agg = (
    std_dev               # std deviation *between* models (separability)
    + kw_stat             # Kruskal-Wallis statistic (separability)
    + ci99_overlap        # confidence interval overlap between adjacently ranked models (separability)
    + score_range         # range of assigned scores (separability)
    + per_model_ci95_avg  # average CI95 per model scored (score stability + separability)
    + emd                 # earth mover's distance (separability)
) / 6.0

# Combine into final Judgemark score
judgemark_score = (
    kendall_tau_iters       # correlation between iterations (ranking stability)
    + kendall_tau_lmsys     # correlation with lmsys arena score (correlation to human preference)
    + 4 * separability_agg  # aggregate of separability metrics
) / 6.0
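As a hedged illustration of where two of these inputs could come from, the sketch below derives kendall_tau_iters and kw_stat with scipy from a per-model, per-iteration score layout. The data layout, the rescaling of Kendall's tau from [-1, 1] to [0, 1], and the p-value-based squashing of the Kruskal-Wallis result are assumptions for illustration, not the actual Judgemark v2 implementation.

# Assumed illustration of two of the component metrics, each normalised to 0-1.
import numpy as np
from scipy.stats import kendalltau, kruskal

# scores[model][iteration] -> item-level scores assigned by the judge
scores = {
    "model_a": [[7.1, 6.8, 7.4], [7.0, 6.9, 7.2]],
    "model_b": [[5.2, 5.9, 5.5], [5.4, 5.1, 5.8]],
    "model_c": [[8.3, 8.0, 8.5], [8.1, 8.4, 8.2]],
}
models = list(scores)

# Ranking stability: Kendall's tau between the model orderings produced by
# iteration 0 and iteration 1, rescaled from [-1, 1] to [0, 1].
means_iter0 = [np.mean(scores[m][0]) for m in models]
means_iter1 = [np.mean(scores[m][1]) for m in models]
tau, _ = kendalltau(means_iter0, means_iter1)
kendall_tau_iters = (tau + 1) / 2

# Separability: Kruskal-Wallis test across models (iterations pooled),
# converted to a 0-1 score via its p-value (assumed normalisation).
pooled = [np.concatenate(scores[m]) for m in models]
h_stat, p_value = kruskal(*pooled)
kw_stat = 1.0 - p_value

print(kendall_tau_iters, kw_stat)

The remaining inputs (ci99_overlap, score_range, per_model_ci95_avg, emd, kendall_tau_lmsys) would follow the same pattern: compute the raw statistic, squash it into 0-1, then feed it into the averages above.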