Emotional Intelligence Benchmarks for LLMs
💙EQ-Bench3 | 🎨Creative Writing | ⚖️Judgemark v2 | 🎤BuzzBench | 🌍DiploBench | 💗EQ-Bench (Legacy)
| Model | Params | Length | Slop | Vocab* | Creative Writing |
|---|---|---|---|---|---|
The Vocab column measures the proportion of words with 3+ syllables in a model's output. Higher values indicate more complex vocabulary usage.
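As a rough illustration of the Vocab metric described above, the sketch below computes the share of words with three or more syllables. The syllable counter is a simple vowel-group heuristic chosen for this example, and the function names `count_syllables` and `complex_word_ratio` are hypothetical; the benchmark's actual implementation may count syllables differently.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting groups of consecutive vowels (heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except for "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def complex_word_ratio(text: str, min_syllables: int = 3) -> float:
    """Return the proportion of words with at least `min_syllables` syllables."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    complex_count = sum(1 for w in words if count_syllables(w) >= min_syllables)
    return complex_count / len(words)


if __name__ == "__main__":
    sample = "The luminous, ethereal tapestry shimmered in the dark."
    print(f"{complex_word_ratio(sample):.2f}")
```

A heuristic counter like this will misjudge some English words, so the result is best read as an approximation of the metric rather than a reproduction of the leaderboard's numbers.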
The Vocab Control slider applies a scaling penalty to models that use an unusually high proportion of complex words. This addresses an observed bias where the judge model is easily impressed by over-the-top vocabulary flexing, which can artificially inflate scores.
While using complex vocabulary can enhance writing in some contexts, our view is that excessive use of complex, multisyllabic words is perceived negatively by humans. It's a kind of stylistic slop which tends to read unnaturally and sound amateurish.
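One way a scaling penalty like the Vocab Control slider could work is sketched below. This is an assumption for illustration only: the page does not state the formula, and the names `apply_vocab_penalty`, `baseline`, and `strength` are hypothetical. The sketch leaves scores untouched up to a baseline proportion of complex words and scales them down linearly beyond it.

```python
def apply_vocab_penalty(
    score: float,
    vocab_ratio: float,
    baseline: float,
    strength: float,
) -> float:
    """Scale `score` down when `vocab_ratio` exceeds `baseline`.

    All parameters are assumptions made for this sketch:
    - vocab_ratio: proportion of words with 3+ syllables in the model's output
    - baseline: proportion treated as normal, below which no penalty applies
    - strength: slider value; 0 disables the penalty entirely
    """
    excess = max(0.0, vocab_ratio - baseline)
    factor = max(0.0, 1.0 - strength * excess)
    return score * factor


if __name__ == "__main__":
    # A model within the baseline keeps its score.
    print(apply_vocab_penalty(80.0, vocab_ratio=0.10, baseline=0.12, strength=2.0))
    # A model well above the baseline is scaled down.
    print(apply_vocab_penalty(80.0, vocab_ratio=0.22, baseline=0.12, strength=2.0))
```

Penalizing only the excess over a baseline keeps ordinary use of complex words free, which matches the stated intent of targeting unusually high proportions rather than complex vocabulary as such.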
The Slop column measures the frequency of words and phrases typically overused by language models (sometimes called "GPT-isms"). The GPT-Slop Control slider allows you to penalize models that use these slop words excessively.
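The Slop measurement can be sketched in the same spirit as a count of matches against a list of overused words and phrases, normalized by text length. The three-entry `SLOP_PHRASES` list and the function name `slop_frequency` are placeholders invented for this example; the benchmark's real list and normalization are not given on this page.

```python
import re

# Hypothetical placeholder list; not the benchmark's actual slop list.
SLOP_PHRASES = ["tapestry", "testament to", "delve"]


def slop_frequency(text: str, phrases: list[str] = SLOP_PHRASES) -> float:
    """Return slop matches per word, using exact whole-word phrase matching."""
    lowered = text.lower()
    total_words = len(re.findall(r"[a-z']+", lowered))
    if total_words == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )
    return hits / total_words


if __name__ == "__main__":
    sample = "A testament to the rich tapestry of life"
    print(slop_frequency(sample))
```

Because matching is exact, inflected forms such as "delves" are not counted here; a fuller version would need stemming or a longer list.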
For more information on these metrics and how they're calculated, please see the About section.