EQ-Bench 3

Emotional Intelligence Benchmarks for LLMs

A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7.

Heatmap scale: Low to High

Model | Abilities | Humanlike | Safety | Assertive | Social IQ | Warm | Analytic | Insight | Empathy | Compliant | Moralising | Pragmatic | Elo Score


For more details about the benchmark, see the About section.

Scoring

The Elo score shown in the leaderboard is calculated from pairwise model comparisons, in which the LLM judge rates each response against eight core dimensions of emotional intelligence:

  • Demonstrated empathy
  • Pragmatic EI (practical application of emotional intelligence)
  • Depth of insight
  • Social dexterity
  • Emotional reasoning
  • Appropriate validation and/or challenge for the scene
  • Message tailoring to the audience and context
  • Overall EQ

Note: the coloured “Abilities” heatmap columns (Humanlike, Safety, Assertive, etc.) are not used in the Elo calculation. They are purely informational, giving a quick view of each model’s stylistic traits and skill profile.

Metrics

The leaderboard displays several metrics:

  • Abilities: Click the chart icon to view detailed breakdowns of model abilities (radar/bar charts).
  • Feature Columns (Humanlike, Safety, etc.): Scores assigned by the LLM judge against specific criteria related to facets of emotional intelligence. Scores range from 0 to 10. The heatmap colour shows each score relative to other models for that feature (Low = Purple, High = Yellow). Click a column header to sort; hover over it for the definition.
  • Elo Score: A relative ranking score calculated using the Bradley-Terry model based on pairwise comparisons between models judged on the same prompts. Higher is better. The thin bar behind the score indicates the 95% confidence interval.
  • Sample Link: Provides access to example outputs generated by the model for this benchmark.
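
To make the Elo Score description above more concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise outcomes and expressing the strengths on an Elo-like scale. It is not the benchmark's actual implementation. The function names, the `(winner, loser)` input format, the 400-point scale and the mean rating of 1500 are all assumptions made for illustration. The sketch also assumes every model has at least one win and one loss, and it does not compute confidence intervals.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper: uses the classic minorise-maximise update
    p_i = wins_i / sum_j n_ij / (p_i + p_j), sweeping over the models in turn.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        for i in sorted(models):
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            strength[i] = wins[i] / denom
        # Strengths are only defined up to a common factor, so rescale.
        total = sum(strength.values())
        strength = {m: s / total for m, s in strength.items()}
    return strength


def to_elo(strength, mean_rating=1500.0, scale=400.0):
    """Map strengths to an Elo-like scale (assumed: 400 * log10, mean 1500)."""
    raw = {m: scale * math.log10(s) for m, s in strength.items()}
    shift = mean_rating - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}
```

Usage with made-up outcomes for three placeholder models:

```python
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
ratings = to_elo(fit_bradley_terry(comparisons))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(model, round(rating))
```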

Features Measured

  • Humanlike: How natural and human-like the response feels.
  • Safety: Adherence to safety guidelines; avoids harmful content.
  • Assertive: Confident, sets boundaries, and pushes back when needed.
  • Social IQ: Understands and navigates social dynamics effectively.
  • Warm: Friendly, kind, and approachable tone.
  • Analytic: Logical reasoning, problem-solving, structured thinking.
  • Insight: Offers depth, novel perspectives, spots underlying issues.
  • Empathy: Recognises, understands, and shares others’ feelings.
  • Compliant: Willingness to follow instructions or agree with the user.
  • Moralising: Tendency to judge or lecture on moral principles.
  • Pragmatic: Focus on practical, real-world solutions.
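
The feature columns listed above are coloured relative to other models, but the page does not say how that relative position is computed. One plausible reading is a per-column min-max rescaling, sketched below. The `relative_scores` helper and the min-max choice are assumptions for illustration, not the leaderboard's documented method.

```python
def relative_scores(column):
    """Rescale one feature column to [0, 1] across models.

    column: dict mapping model name -> judge score (0 to 10).
    0.0 would map to the Low (purple) end, 1.0 to the High (yellow) end.
    """
    lowest, highest = min(column.values()), max(column.values())
    if highest == lowest:
        # All models tie on this feature; place them mid-scale.
        return {model: 0.5 for model in column}
    return {
        model: (score - lowest) / (highest - lowest)
        for model, score in column.items()
    }
```

Under this reading a model's colour depends on the spread of scores in that column, so two columns with the same raw score can be shaded differently.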