Longform Creative Writing

Emotional Intelligence Benchmarks for LLMs



An LLM-judged longform creative writing benchmark (v3).


Longform Creative Writing Benchmark

This benchmark evaluates several abilities:

  1. Brainstorming & planning a short story/novella from a minimal prompt
  2. Reflecting on the plan & revising it
  3. Writing the short story/novella over 8x 1000-word turns

Models are typically evaluated via OpenRouter, using temp=0.7 and min_p=0.1 as the generation settings.
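As an illustration only (not the benchmark's actual harness), a request with those settings might look like the payload below. The model slug and prompt are placeholders, assuming OpenRouter's chat-completions request format:

```python
import json

# Hypothetical request body for OpenRouter's chat-completions endpoint.
# The model slug and prompt are illustrative placeholders.
payload = {
    "model": "anthropic/claude-3.7-sonnet",
    "messages": [
        {"role": "user", "content": "Brainstorm a short story from this prompt: ..."}
    ],
    "temperature": 0.7,  # sampling temperature used by the benchmark
    "min_p": 0.1,        # min-p truncation of low-probability tokens
}

print(json.dumps(payload, indent=2))
```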

Outputs are evaluated against a scoring rubric by Claude 3.7 Sonnet.

Length

The average chapter length, in characters.

Slop Score

The Slop column measures the frequency of words/phrases typically overused by LLMs ("GPT-isms") in each completed chapter. Lower is better.
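A frequency measure of this kind can be sketched as follows. The phrase list here is a tiny stand-in for the benchmark's much larger slop lexicon, and the exact normalization is an assumption:

```python
import re

# Illustrative stand-in for the benchmark's slop lexicon.
SLOP_PHRASES = ["tapestry", "testament to", "shivers down", "palpable"]

def slop_score(text: str) -> float:
    """Occurrences of listed phrases per 1,000 words (lower is better)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in SLOP_PHRASES)
    return 1000 * hits / len(words)

sample = "The city was a tapestry of light, a testament to human ambition."
print(round(slop_score(sample), 1))
```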

Repetition Metric

The Repetition column measures how strongly a model repeats words/phrases across multiple tasks. Higher means more repetition.
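One simple way to quantify cross-task repetition, sketched below, is the average trigram overlap between pairs of outputs. This is an illustrative formula, not necessarily the leaderboard's exact metric:

```python
from itertools import combinations

def trigrams(text: str) -> set:
    """Set of word trigrams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

def repetition_score(outputs: list) -> float:
    """Mean Jaccard overlap of trigram sets across all pairs of outputs.
    Higher means the model reuses more phrasing between tasks."""
    sets = [trigrams(o) for o in outputs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    overlaps = [len(a & b) / len(a | b) if (a | b) else 0.0 for a, b in pairs]
    return sum(overlaps) / len(overlaps)
```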

Degradation

A mini-sparkline of the 8 per-chapter scores (averaged across samples), showing whether the model's chapter quality drops off as it continues writing. The degradation score represents how much final-chapter quality has dropped relative to the initial chapter.
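A drop-off of this kind can be expressed as the relative decline from the first chapter's score to the last. The formula below is an assumption for illustration, not necessarily the leaderboard's exact definition:

```python
def degradation(chapter_scores: list) -> float:
    """Drop from first to last chapter as a fraction of the first
    chapter's score (illustrative formula)."""
    first, last = chapter_scores[0], chapter_scores[-1]
    if first == 0:
        return 0.0
    return (first - last) / first

# Hypothetical per-chapter scores for an 8-chapter run.
scores = [72.0, 70.5, 69.0, 68.0, 66.5, 65.0, 63.5, 61.2]
print(f"{degradation(scores):.1%}")
```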

Score (0-100)

The overall final rating assigned by the judge LLM, scaled to 0–100. Higher is better.