Emotional Intelligence Benchmarks for LLMs
Judgemark v2 is a benchmark measuring LLM judging ability.
Model | Judgemark Score | Stability | Separability | Human Corr | Cost
---|---|---|---|---|---
⚖️ Judgemark v2: explanation of the displayed stats

Judgemark Score
: Overall Judgemark score

Stability
: How consistent the judge's assigned rankings are between iterations

Separability
: How well the judge can separate models by ability (measured by CI99 overlap between adjacently ranked models)

Human Corr
: Correlation with human preferences (per the LMSys Arena creative writing category)

Cost
: The cost to complete one run of the benchmark (via OpenRouter)

For each row, you can view:
Judgemark Score
The final Judgemark score is computed as a weighted sum of several metrics that quantify how well the judge separates models by ability, how stable its rankings are across iterations, and how well they agree with human preferences.
# All elements here are pre-normalised to 0-1 (larger = better)

# Compute an aggregated separability metric
separability_agg = (
    kw_stat          # Kruskal-Wallis cluster analysis (separability)
    + ci99_overlap   # confidence interval overlap between adjacently ranked models (separability)
) / 2.0

# Combine into the final Judgemark score
judgemark_score = (
    kendall_tau_iters        # correlation between iterations (ranking stability)
    + kendall_tau_lmsys      # correlation with LMSys Arena score (correlation to human preference)
    + 4 * separability_agg   # aggregate of the separability metrics
) / 6.0
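
For a concrete reference point, here is a minimal sketch of how the raw component statistics above might be derived from judge output. This is not the official Judgemark v2 implementation: the input structures (scores_by_iteration, lmsys_scores, samples_by_model) are hypothetical, SciPy's kendalltau and kruskal are assumed stand-ins for the actual statistics code, and the 0-1 normalisation and CI99-overlap calculation used on the leaderboard are not shown.

# Illustrative sketch only (assumed inputs, not the official Judgemark v2 code):
#   scores_by_iteration: list of dicts, one per judging iteration, mapping model -> mean score
#   lmsys_scores:        dict mapping model -> LMSys Arena creative writing score
#   samples_by_model:    dict mapping model -> list of per-item scores from the judge
import numpy as np
from scipy.stats import kendalltau, kruskal

def component_metrics(scores_by_iteration, lmsys_scores, samples_by_model):
    models = sorted(lmsys_scores)

    # Ranking stability: mean Kendall's tau between each pair of iterations
    taus = []
    for i, a in enumerate(scores_by_iteration):
        for b in scores_by_iteration[i + 1:]:
            tau, _ = kendalltau([a[m] for m in models], [b[m] for m in models])
            taus.append(tau)
    kendall_tau_iters = float(np.mean(taus))

    # Human correlation: Kendall's tau between the judge's mean scores and the LMSys Arena scores
    mean_scores = [np.mean([it[m] for it in scores_by_iteration]) for m in models]
    kendall_tau_lmsys, _ = kendalltau(mean_scores, [lmsys_scores[m] for m in models])

    # One separability ingredient: Kruskal-Wallis H statistic across each model's score distribution
    # (the CI99 overlap term is omitted here)
    kw_stat, _ = kruskal(*[samples_by_model[m] for m in models])

    return kendall_tau_iters, kendall_tau_lmsys, kw_stat

In the benchmark itself these raw values would then be normalised to the 0-1 range before entering the weighted sum shown above.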