Emotional Intelligence Benchmarks for LLMs
Github | Paper | | Twitter | About
💙EQ-Bench3 ✍️Longform Writing 🎨Creative Writing v3 ⚖️Judgemark v2 🎤BuzzBench 🌍DiploBench 🎨Creative Writing (Legacy) 💗EQ-Bench (Legacy)
An LLM-judged longform creative writing benchmark (v3). ✨Updated! Learn moreModel | Length | Slop | Repetition | Degradation | Score | Samples |
---|
Find the source code here: https://github.com/EQ-bench/longform-writing-bench
This benchmark evaluates several abilities:
Sharper judging and structural safeguards for more reliable longform evaluations.
Models are typically evaluated via openrouter, using temp=0.7 and min_p=0.1 as the generation settings.
Outputs are evaluated with a scoring rubric by Claude Sonnet 4.
The average chapter length (chars). This doesn't contribute to the score.
Measures the frequency of words/phrases typically overused by LLMs ("GPT-isms") in each completed chapter. The lower, the better. Does not contribute to the score.
Measures how strongly a model repeats n-grams across its outputs.
A mini-sparkline of the 8 chapter scores (averages) to visually see if the model's chapter quality drops off as it continues writing. The degradation score represents how much the final chapter quality has dropped relative to the initial chapter.
The average of all chapter scores + final scored piece, based on the rubric criteria below.
Each output is evaluated across 14 dimensions that contribute to the final score:
The rubric scoring is weighted to increase emphasis on incoherent metaphor, to compensate for the judge's difficulty in recognising this common failure mode:
Final Score = (Σ other criteria) + (5 × Forced Poetry/Metaphor1.7)
Some models exhibit a specific degradation pattern as output length increases, devolving into excessive use of single-sentence paragraphs. Since judges often fail to recognize this structural issue even with explicit instruction, we apply an automatic scaling penalty when this pattern is detected.
Detection: The system identifies when outputs contain an abnormally high proportion of short single-sentence paragraphs (5 or fewer words).
Penalty Application: When this degradation pattern is detected, the chapter scores are scaled down proportionally to the severity of the single-sentence paragraph overuse.