Judgemark v2

Emotional Intelligence Benchmarks for LLMs

A benchmark measuring an LLM judge's ability to grade creative writing using a complex scoring rubric.

Leaderboard columns: Model | Score (Calibrated) | Score (Raw) | Stability | Separability | Cost


⚖️ Judgemark v2: explanation of the displayed stats

For each row, you can view:

Judgemark Score

The final Judgemark score is a weighted sum of several metrics that quantify how well the judge separates the models by ability, how stable its rankings are across iterations, and how closely its rankings agree with human preference. The computation is sketched below:

# All elements here are pre-normalised to 0-1 (larger = better).

# Compute an aggregated separability metric
separability_agg = (
    std_dev              # std deviation *between* models (separability)
    + kw_stat            # Kruskal-Wallis statistic (separability)
    + ci99_overlap       # CI99 overlap between adjacently ranked models (separability)
    + score_range        # range of assigned scores (separability)
    + per_model_ci95_avg # average CI95 per model scored (score stability + separability)
    + emd                # earth mover's distance (separability)
) / 6.0

# Combine into the final Judgemark score
judgemark_score = (
    kendall_tau_iters      # correlation between iterations (ranking stability)
    + kendall_tau_lmsys    # correlation with LMSYS Arena score (corr. to human preference)
    + 4 * separability_agg # aggregate of separability metrics
) / 6.0
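
As a concrete illustration, here is a minimal, self-contained Python sketch of the same weighting, assuming all eight input metrics have already been normalised to the 0-1 range. The function name, argument order, and the example values are illustrative only, not the benchmark's actual API.

def compute_judgemark_score(
    std_dev: float,             # std deviation between models
    kw_stat: float,             # Kruskal-Wallis statistic
    ci99_overlap: float,        # CI99 overlap between adjacently ranked models
    score_range: float,         # range of assigned scores
    per_model_ci95_avg: float,  # average CI95 per model scored
    emd: float,                 # earth mover's distance
    kendall_tau_iters: float,   # correlation between iterations
    kendall_tau_lmsys: float,   # correlation with LMSYS Arena scores
) -> float:
    """Weighted sum of pre-normalised (0-1, larger = better) judge metrics."""
    separability_agg = (
        std_dev + kw_stat + ci99_overlap
        + score_range + per_model_ci95_avg + emd
    ) / 6.0
    return (kendall_tau_iters + kendall_tau_lmsys + 4 * separability_agg) / 6.0

# Example with made-up metric values:
print(compute_judgemark_score(0.7, 0.8, 0.6, 0.75, 0.65, 0.7, 0.85, 0.8))  # ≈ 0.74

Because the separability aggregate carries a weight of 4 out of 6, a judge that spreads models apart cleanly contributes far more to the final score than one that is merely consistent with itself or with human preference.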