Longform Creative Writing

Emotional Intelligence Benchmarks for LLMs



An LLM-judged longform creative writing benchmark (v3).


Longform Creative Writing Benchmark

This benchmark evaluates several abilities (sketched in code after the list):

  1. Brainstorming & planning a short story/novella from a minimal prompt
  2. Reflecting on the plan & revising it
  3. Writing the short story/novella over 8x 1000-word turns
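At a high level, a single task flows like the sketch below. The `generate` chat-completion helper and the prompt wording are hypothetical stand-ins, not the benchmark's actual harness code.

```python
# Minimal sketch of the three-stage task flow. Assumes a generic
# chat-completion helper `generate(messages) -> str` (hypothetical).
def run_longform_task(generate, prompt: str, num_chapters: int = 8) -> list[str]:
    # Stage 1: brainstorm & plan from a minimal prompt.
    messages = [{"role": "user",
                 "content": f"Brainstorm and plan a short story/novella for: {prompt}"}]
    plan = generate(messages)

    # Stage 2: reflect on the plan and revise it.
    messages += [{"role": "assistant", "content": plan},
                 {"role": "user", "content": "Reflect on the plan and revise it."}]
    revised_plan = generate(messages)
    messages.append({"role": "assistant", "content": revised_plan})

    # Stage 3: write the piece over 8 turns of ~1000 words each.
    chapters = []
    for i in range(1, num_chapters + 1):
        messages.append({"role": "user", "content": f"Write chapter {i} (~1000 words)."})
        chapter = generate(messages)
        messages.append({"role": "assistant", "content": chapter})
        chapters.append(chapter)
    return chapters
```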

Models are typically evaluated via OpenRouter, with temperature=0.7 and min_p=0.1 as the generation settings.
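As an illustration, a single generation request with those settings might look like the following, assuming OpenRouter's OpenAI-compatible chat completions endpoint (which passes min_p through to providers that support it); the model slug and prompt are examples only, not the benchmark's harness code.

```python
import os
import requests

# Example request using the benchmark's generation settings
# (temperature=0.7, min_p=0.1). Illustrative only.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-3.7-sonnet",  # example model slug
        "messages": [{"role": "user", "content": "Write chapter 1 (~1000 words)."}],
        "temperature": 0.7,
        "min_p": 0.1,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```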

Outputs are evaluated against a scoring rubric by Claude 3.7 Sonnet.
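A rough sketch of how rubric judging can work is shown below; the prompt wording, score scale, and parsing are hypothetical, not the benchmark's actual rubric or judge prompt.

```python
import re

# Hypothetical rubric judging: ask the judge model to score a chapter per
# criterion, then parse 'criterion name: score' lines and average them.
def judge_chapter(generate_judge, chapter: str, rubric: str) -> float:
    prompt = (
        "Score the following chapter against each rubric criterion on a 0-10 scale.\n"
        "Respond with one 'criterion name: score' line per criterion.\n\n"
        f"RUBRIC:\n{rubric}\n\nCHAPTER:\n{chapter}"
    )
    reply = generate_judge([{"role": "user", "content": prompt}])
    scores = [float(m) for m in re.findall(r":\s*(\d+(?:\.\d+)?)\s*$", reply, flags=re.M)]
    return sum(scores) / len(scores) if scores else 0.0
```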

Length

The average chapter length, in characters.

Slop Score

The Slop column measures the frequency of words/phrases typically overused by LLMs (“GPT-isms”) in each completed chapter. The lower, the better.
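In spirit, the metric is a normalised count of hits against a list of over-used words/phrases, as in this sketch; the phrase list and normalisation here are placeholders, not the benchmark's actual slop list.

```python
# Illustrative slop metric: hits from a (placeholder) over-used phrase list,
# normalised per 1000 words. The real list and normalisation may differ.
SLOP_PHRASES = ["tapestry", "testament to", "palpable", "barely above a whisper"]

def slop_score(chapter: str) -> float:
    text = chapter.lower()
    hits = sum(text.count(phrase) for phrase in SLOP_PHRASES)
    return hits / max(len(text.split()), 1) * 1000
```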

Repetition Metric

The Repetition column measures how strongly a model repeats words/phrases across multiple tasks. Higher means more repetition.
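One simple way to capture cross-task repetition is to measure how many n-grams recur across a model's outputs for different tasks, as in this sketch; the n-gram size and normalisation are illustrative, not the benchmark's exact method.

```python
from collections import Counter

# Illustrative cross-task repetition: fraction of distinct trigrams that
# appear in more than one task's output. Details are placeholders.
def repetition_score(task_outputs: list[str], n: int = 3) -> float:
    counts = Counter()
    for text in task_outputs:
        tokens = text.lower().split()
        # Count each n-gram at most once per task, so counts reflect
        # how many tasks an n-gram appears in.
        counts.update(set(zip(*(tokens[i:] for i in range(n)))))
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / max(len(counts), 1)
```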

Degradation

A mini-sparkline of the 8 chapter scores (averages), showing at a glance whether the model's chapter quality drops off as it continues writing. The degradation score is the absolute value of the trendline's gradient.
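Since the degradation score is defined as the absolute slope of a linear trend fitted to the 8 chapter scores, it can be computed as follows (np.polyfit here is just one way to fit that trendline, not necessarily the benchmark's implementation):

```python
import numpy as np

# Degradation: absolute value of the gradient of a linear trendline
# fitted to the per-chapter scores.
def degradation(chapter_scores: list[float]) -> float:
    x = np.arange(len(chapter_scores))
    slope, _intercept = np.polyfit(x, chapter_scores, 1)
    return abs(slope)
```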

Score (0-100)

The overall final rating assigned by the judge LLM, scaled to 0–100. Higher is better.