Emotional Intelligence Benchmarks for LLMs
A benchmark measuring an LLM judge's ability to grade creative writing using a complex scoring rubric.
Model | Score (Calibrated) | Score (Raw) | Stability | Separability | Cost
---|---|---|---|---|---
⚖️ Judgemark v2 explanation of displayed stats:

Score (Calibrated)
: Overall Judgemark score, after calibration to normalise the judge's score distribution (see the calibration sketch below)

Score (Raw)
: Judgemark score without calibration ("out of the box" performance)

Stability
: A measure of how much the judge's assigned rankings vary between iterations

Separability
: How well the judge can separate models by ability (measured by CI99 overlap with adjacently ranked models)

Cost
: The cost to complete one run of the benchmark (via OpenRouter)
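The calibration referenced under Score (Calibrated) is not spelled out on this page. As a rough illustration only, the sketch below quantile-normalises a judge's raw scores onto a standard normal target, so that judges with compressed or skewed score distributions can be compared on an even footing. The function name, the rank offset, and the normal target are assumptions for illustration, not the actual Judgemark v2 procedure.

# Assumed illustration of score calibration (not the exact Judgemark v2
# method): rank-based quantile normalisation of a judge's raw scores
# onto a standard normal target distribution.
import numpy as np
from scipy.stats import norm, rankdata

def calibrate_scores(raw_scores):
    """Map raw judge scores onto a standard normal distribution via ranks."""
    scores = np.asarray(raw_scores, dtype=float)
    # Ranks mapped into (0, 1); the -0.5 offset keeps quantiles away from 0 and 1
    quantiles = (rankdata(scores) - 0.5) / len(scores)
    return norm.ppf(quantiles)

# Example: a judge that compresses everything into the 6-8 band
print(calibrate_scores([6.1, 6.3, 7.9, 7.2, 6.8, 7.5]))

Rank-based calibration preserves the judge's ordering of models while discarding its idiosyncratic use of the score scale, which is the kind of distribution normalisation the Calibrated column refers to.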
Judgemark Score
The final Judgemark score is a weighted average of several metrics that quantify how well the judge separates models by ability, how stable its rankings are across iterations, and how closely they agree with human preference.
# All elements here are pre-normalised 0-1 (larger = better)

# Compute an aggregated separability metric
separability_agg = (
    std_dev               # std deviation *between* models (separability)
    + kw_stat             # Kruskal-Wallis statistic (separability)
    + ci99_overlap        # confidence interval overlap between adjacently ranked models (separability)
    + score_range         # range of assigned scores (separability)
    + per_model_ci95_avg  # average CI95 per model scored (score stability + separability)
    + emd                 # earth mover's distance (separability)
) / 6.0

# Combine into final Judgemark score
judgemark_score = (
    kendall_tau_iters       # correlation between iterations (ranking stability)
    + kendall_tau_lmsys     # correlation with lmsys arena score (correlation to human preference)
    + 4 * separability_agg  # aggregate of separability metrics
) / 6.0
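As a hedged illustration of where two of these inputs could come from, the sketch below derives kendall_tau_iters and kw_stat with scipy from a per-model, per-iteration score layout. The data layout, the rescaling of Kendall's tau from [-1, 1] to [0, 1], and the p-value-based squashing of the Kruskal-Wallis result are assumptions for illustration, not the actual Judgemark v2 implementation.

# Assumed illustration of two of the component metrics, each normalised to 0-1.
import numpy as np
from scipy.stats import kendalltau, kruskal

# scores[model][iteration] -> item-level scores assigned by the judge
scores = {
    "model_a": [[7.1, 6.8, 7.4], [7.0, 6.9, 7.2]],
    "model_b": [[5.2, 5.9, 5.5], [5.4, 5.1, 5.8]],
    "model_c": [[8.3, 8.0, 8.5], [8.1, 8.4, 8.2]],
}
models = list(scores)

# Ranking stability: Kendall's tau between the model orderings produced by
# iteration 0 and iteration 1, rescaled from [-1, 1] to [0, 1].
means_iter0 = [np.mean(scores[m][0]) for m in models]
means_iter1 = [np.mean(scores[m][1]) for m in models]
tau, _ = kendalltau(means_iter0, means_iter1)
kendall_tau_iters = (tau + 1) / 2

# Separability: Kruskal-Wallis test across models (iterations pooled),
# converted to a 0-1 score via its p-value (assumed normalisation).
pooled = [np.concatenate(scores[m]) for m in models]
h_stat, p_value = kruskal(*pooled)
kw_stat = 1.0 - p_value

print(kendall_tau_iters, kw_stat)

The remaining inputs (ci99_overlap, score_range, per_model_ci95_avg, emd, kendall_tau_lmsys) would follow the same pattern: compute the raw statistic, squash it into 0-1, then feed it into the averages above.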