Judgemark v2

A benchmark measuring LLM judging ability.

Leaderboard columns: Model | Judgemark Score | Stability | Separability | Human Corr | Cost


⚖️ Judgemark v2: explanation of displayed stats

For each row, you can view:

Judgemark Score

The final Judgemark score is computed as a weighted combination of several metrics that quantify how well the judge separates the models by ability, how stable the judge's rankings are across iterations, and how well its rankings agree with human preferences.

			
# All components are pre-normalised to 0-1 (larger = better)

# Compute an aggregated separability metric
separability_agg = (
    kw_stat              # Kruskal-Wallis cluster analysis (separability)
    + ci99_overlap       # 99% confidence-interval overlap between adjacently ranked models (separability)
) / 2.0

# Combine into the final Judgemark score
judgemark_score = (
    kendall_tau_iters      # correlation between iterations (ranking stability)
    + kendall_tau_lmsys    # correlation with lmsys arena scores (agreement with human preference)
    + 4 * separability_agg # aggregated separability metrics
) / 6.0
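
To make the weighting concrete, here is a minimal worked example with hypothetical component values (the numbers below are invented purely for illustration and do not correspond to any model on the leaderboard):

# Hypothetical component values, invented for illustration only
kw_stat = 0.90            # Kruskal-Wallis separability (normalised 0-1)
ci99_overlap = 0.70       # CI-overlap separability (normalised 0-1)
kendall_tau_iters = 0.95  # ranking stability across iterations
kendall_tau_lmsys = 0.55  # agreement with lmsys arena rankings

separability_agg = (kw_stat + ci99_overlap) / 2.0   # 0.80
judgemark_score = (
    kendall_tau_iters
    + kendall_tau_lmsys
    + 4 * separability_agg
) / 6.0                                              # (0.95 + 0.55 + 3.20) / 6 ≈ 0.783

print(f"{judgemark_score:.3f}")  # 0.783

Because the aggregated separability term carries a weight of 4 out of 6, separability accounts for two thirds of the final score, with ranking stability and human-preference correlation contributing one sixth each.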