---
title: "Claude Fable 5 vs Opus 4.8: Which Model Writes Better PR Story Angles? (50-Brand Blind Eval)"
date: "2026-06-09"
image: /images/blog/claude-fable-5-vs-opus-4-8-pr-angles/cover.png
authorName: Elvis Sun
authorImage: /images/manifesto/elvis-avatar.jpg
authorTwitter: https://x.com/elvissun
authorLinkedIn: https://linkedin.com/in/elvissun
excerpt: "We ran 50 real brands through Claude Fable 5 and Opus 4.8, had GPT-5.5 judge every pair blind in both orderings. Fable won 67 of 100 judgments and led 6 of 7 quality dimensions — but the position-bias finding is the part to take home."
"og:image": /images/blog/claude-fable-5-vs-opus-4-8-pr-angles/cover.png
tags: [eval, claude, ai, pr]
published: true
---

We just ran 50 real brands through both **Claude Opus 4.8** and **Claude Fable 5**, asked each to generate press angles from the same company update, and had **GPT-5.5** judge every pair blind in both orderings.

Headline: Fable won. But the more interesting result was the one we almost reported wrong.

## The setup

Both models ran the exact same `angle-generator` skill on the exact same input: a one-paragraph company update for each of 50 brands across CPG, fintech, dev tools, climate, edtech, and hardware, with a 14-person seed startup as the small-end stress test. Every generation was a clean-context subagent. Neither model knew the other existed. Neither saw the rubric.

The judge was a separate vendor's model — GPT-5.5 via `codex exec`, grounded in our `meanest-editor` skill. Two contestants from Anthropic, one neutral judge from OpenAI. Self-preference doesn't get to be a factor when no contestant judges itself.

Then we ran every brand through the judge **twice**, once as A=Opus/B=Fable and once swapped. 50 brands × 2 orderings = 100 judgments. Every judgment got re-anchored to model identity in aggregation, so slot position cancels out.

That second ordering is the whole reason this number means anything. More on that in a second.
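
The re-anchoring step described above can be sketched roughly like this. It is a minimal illustration, not the repo's aggregation script: the record fields (`brand`, `slot_a`, `winner_slot`), the placeholder brand name, and the function names are all assumptions made for this example.

```python
from collections import Counter

MODELS = ("opus", "fable")


def reanchor(slot_a_model: str, winner_slot: str) -> str:
    """Map a slot-level verdict ("A" or "B") back to the model that filled that slot."""
    if slot_a_model not in MODELS:
        raise ValueError(f"unknown model: {slot_a_model!r}")
    if winner_slot not in ("A", "B"):
        raise ValueError(f"unknown slot: {winner_slot!r}")
    slot_b_model = MODELS[1] if slot_a_model == MODELS[0] else MODELS[0]
    return slot_a_model if winner_slot == "A" else slot_b_model


def tally_wins(judgments: list[dict]) -> Counter:
    """Count wins per model, ignoring which slot each model occupied."""
    return Counter(reanchor(j["slot_a"], j["winner_slot"]) for j in judgments)


# Hypothetical records: one brand judged in both orderings.
example = [
    {"brand": "brand-1", "slot_a": "opus", "winner_slot": "B"},   # Fable won from slot B
    {"brand": "brand-1", "slot_a": "fable", "winner_slot": "A"},  # Fable won from slot A
]
print(tally_wins(example))
```

Because the tally is keyed on model identity rather than slot, a judge that always favors slot A adds one win to each model per brand and leaves the gap between them unchanged.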

Full methodology, prompts, and every per-brand verdict are in [the open-source eval repo](https://github.com/elvisun/newsjack/tree/main/eval/fable-vs-opus).

## Headline numbers

Re-anchored to model identity:

| Metric | Fable 5 | Opus 4.8 |
|---|---|---|
| Wins (of 100 judgments) | **67** | 33 |
| Dimensions led (of 7) | **6** | 0 (Proof Rigor tied) |
| Meanest-editor `publishable` verdict | 78/100 | 54/100 |
| Overall dimension mean (1–5) | **4.60** | 4.36 |

Per dimension (Fable minus Opus):

| Dimension | Fable 5 | Opus 4.8 | Δ |
|---|---|---|---|
| News Value | 4.41 | 4.08 | +0.33 |
| Distinctness | 4.78 | 4.37 | +0.41 |
| Journalist Shape | 4.90 | 4.62 | +0.28 |
| Grounding | 3.90 | 3.65 | +0.25 |
| Anti-Slop | 4.73 | 4.51 | +0.22 |
| Proof Rigor | 4.79 | 4.80 | −0.01 |
| Usefulness | 4.69 | 4.49 | +0.20 |

![Fable 5 vs Opus 4.8 per-dimension scorecard — Claude Fable 5 leads 6 of 7 quality dimensions](/images/blog/claude-fable-5-vs-opus-4-8-pr-angles/per-dimension-scorecard.png)

Clean sweep on six of seven dimensions, and an effective tie (4.79 vs 4.80) on Proof Rigor, the one dimension that measures restraint.
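
The overall means in the headline table can be checked against the per-dimension scores. This quick check assumes the overall figure is an unweighted average of the seven dimension means, which the post does not state explicitly; the variable names are ours.

```python
# Per-dimension means copied from the table above: (Fable 5, Opus 4.8).
scores = {
    "News Value": (4.41, 4.08),
    "Distinctness": (4.78, 4.37),
    "Journalist Shape": (4.90, 4.62),
    "Grounding": (3.90, 3.65),
    "Anti-Slop": (4.73, 4.51),
    "Proof Rigor": (4.79, 4.80),
    "Usefulness": (4.69, 4.49),
}

# Assumption: overall mean = unweighted average across dimensions.
fable_mean = sum(f for f, _ in scores.values()) / len(scores)
opus_mean = sum(o for _, o in scores.values()) / len(scores)

# Fable minus Opus, rounded to the table's precision.
deltas = {dim: round(f - o, 2) for dim, (f, o) in scores.items()}
fable_led = [dim for dim, d in deltas.items() if d > 0]

print(f"Fable overall: {fable_mean:.2f}")
print(f"Opus overall:  {opus_mean:.2f}")
print(f"Fable leads {len(fable_led)} of {len(scores)} dimensions")
```

Under that assumption the script reproduces the 4.60 and 4.36 overall means and the six-dimension lead reported in the tables.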

## The bit we almost got wrong

GPT-5.5, it turns out, has a hard position bias. In **69% of decisive judgments, it picked whichever set it saw first** — regardless of which model produced it. We caught this in a 1-brand pilot where slot A won 2/2 and the headline result flipped depending on ordering. That's why the full study mandates both orderings on every brand.

The honest read isn't the 67–33 win count. It's the position-independent view — who won in *both* orderings of the same brand:

| Outcome per brand (both orderings) | Brands |
|---|---|
| **Fable won both** (robust win) | **24 / 50** |
| **Opus won both** (robust win) | **7 / 50** |
| Split (judge defaulted to position) | 19 / 50 |

All 19 split brands are textbook position bias: slot A won in both orderings. So on those brands, the two sets were close enough that the judge fell back on order.

Where the judge had a real, order-independent preference: **Fable won 24 to 7, ≈77%.**
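
The win count, the position-bias rate, and the robust-win share all follow from the per-brand outcome table. Here is a short worked check using only the counts reported above (the variable names are ours, for illustration):

```python
# Brands per outcome, from the table above.
fable_both, opus_both, split = 24, 7, 19
brands = fable_both + opus_both + split      # 50
judgments = 2 * brands                       # 100

# A robust win gives one model two wins; a split gives each model one.
fable_wins = 2 * fable_both + split          # 67
opus_wins = 2 * opus_both + split            # 33

# In a robust win the winner sits in slot A once and slot B once.
# In every split brand reported here, slot A won both orderings.
slot_a_wins = (fable_both + opus_both) + 2 * split   # 69
position_bias = slot_a_wins / judgments               # 0.69

# Share of order-independent verdicts that went to Fable.
robust_share = fable_both / (fable_both + opus_both)  # 24 / 31, about 0.774

print(fable_wins, opus_wins, position_bias, round(robust_share, 3))
```

The 67 to 33 headline and the 69% slot-A rate are two views of the same 100 judgments; only the 24 to 7 split is free of slot order.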

If we'd run a single ordering and called it a day, we'd have shipped a number contaminated by which slot we happened to put each model in. That number would have been wrong in a way that's invisible without the counterbalance. We came close.

## What Fable actually did better

The wins cluster on two things: finding one more genuinely distinct angle per brand, and — somewhat counterintuitively — better hallucination discipline at the headline level.

**Canva** is the cleanest example. The brief said Canva is launching an enterprise suite, *and* recently acquired an AI image-generation startup. Opus connected them, producing an angle headlined "Canva's recent AI acquisition is already powering its enterprise suite." Nothing in the brief said the acquired tech powers the new suite. Fable refused that angle by name, listing it under "refused angles" with the reason `hallucinated_fact`. The judge called this out: "the difference between an angle and an invented connection."

**Reddit** is similar. Opus said Reddit's data-licensing deals "rival ads." The brief said $200M in data deals, no ads comparison. Fable: "the profit story now leans on $200M of data deals investors can't fully see." Same fact, no overclaim.

**Duolingo**: Fable wrote "Educators call gamified learning shallow. Duolingo's answer is a talking cartoon." The judge called Opus's version "more generic markets mush."

**Discord**: Fable separated the human server-operator story ("about to find out what their communities are worth") as its own reporting path with its own protagonist. Opus left it at trend altitude.

The pattern: Fable consistently found one extra angle per brand that was distinct from the others, and shaped it more sharply. The `distinctness` and `journalist_shape` deltas (+0.41 and +0.28) are where it earns the win.

## Where Opus won

The Opus counter-signal is real and narrow: **evidentiary restraint.** It tied `proof_rigor` and produced 7 robust wins of its own (Airbnb, Oatly, Peloton, Substack, Allbirds, Deel, Replit).

**Ramp** is the case study. The judge: "'there is no single breaking news peg supplied here' is the kind of restraint that keeps a founder out of fake-peg trouble." Opus refused to invent a peg the brief didn't supply. Fable, on the same brand, took a single launch and inflated it into an unsupported "category thesis" — the judge dinged it for that.

On **Klarna**, **Hims & Hers**, and **Replit**, Opus flagged the exact hole a reporter would probe. Fable's stronger headline energy, the judge wrote, "went past the supplied facts — how a decent angle becomes a correction waiting to happen."

So the trade is: Fable is more generative, more distinct, sharper-shaped. Opus is marginally more conservative. If your downside is "this PR pitch sounds underwhelming," pick Fable. If your downside is "this pitch contains a claim we can't back up," there's a real argument for Opus's caution, and a stronger argument for keeping a human (or a `meanest-editor` pass) in the loop regardless.

The fact that Fable also led `grounding` (+0.25) suggests its extra reach usually stayed inside the facts. But the tied `proof_rigor` is the place where Opus's discipline showed up.

## What this isn't

One judge model. Real-world PR editors don't all think like GPT-5.5, and a second judge or a human PR panel would harden the conclusion against single-judge idiosyncrasy.

Position bias is strong (slot A won 69% of judgments), and on 19 of 50 brands (38%) the two sets were too close for the judge to separate without falling back on order. Counterbalancing cancels the bias in aggregate; it doesn't make those brands suddenly decidable.

These are constructed scenarios, not live news. We're measuring angle *craft* on controlled facts, not real-time newsjacking.

And we measured the models, not the skill. The `angle-generator` skill was held fixed across every run. We did not tune the prompt to move a number. If we had, the result would be meaningless.

## Why we shipped this

If you're building anything that depends on an LLM judge (eval suites, RLHF preference data, anything where "which output is better" gets turned into a score), the position-bias finding is the part to take home. On the pilot, slot A won under both orderings, so a single ordering would have named whichever model we happened to put first. Two orderings, every time, or your number is downstream of slot order.
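
For anyone wiring this into their own pipeline, the two-orderings rule fits in a few lines. This is a minimal sketch under stated assumptions, not the harness from our repo: the `Judge` callable signature, the function names, and the stand-in judges are hypothetical, and a real judge would call an LLM instead.

```python
from typing import Callable

# Assumed interface: a judge takes (set shown first, set shown second)
# and returns "A" if it prefers the first set, "B" if the second.
Judge = Callable[[str, str], str]


def judge_both_orderings(opus_set: str, fable_set: str, judge: Judge) -> str:
    """Return "fable" or "opus" for an order-independent win, else "split"."""
    first = judge(opus_set, fable_set)    # ordering 1: A=Opus, B=Fable
    second = judge(fable_set, opus_set)   # ordering 2: A=Fable, B=Opus
    for verdict in (first, second):
        if verdict not in ("A", "B"):
            raise ValueError(f"judge returned {verdict!r}, expected 'A' or 'B'")
    winner_1 = "opus" if first == "A" else "fable"
    winner_2 = "fable" if second == "A" else "opus"
    return winner_1 if winner_1 == winner_2 else "split"


# Stand-in judges for illustration only.
def always_first(a: str, b: str) -> str:
    """Pure position bias: always prefers whatever is shown first."""
    return "A"


def prefers_longer(a: str, b: str) -> str:
    """Content-based preference that ignores slot order."""
    return "A" if len(a) >= len(b) else "B"


print(judge_both_orderings("opus angles", "fable angles", always_first))
print(judge_both_orderings("short", "a much longer set", prefers_longer))
```

A purely position-biased judge can only ever produce `"split"` here, which is why a split is best read as "no order-independent preference" and kept out of the robust-win count.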

Full methodology, brand list, prompts, every per-brand verdict, and the aggregation script are in [the eval repo on GitHub](https://github.com/elvisun/newsjack/tree/main/eval/fable-vs-opus). The headline writeup is [`runs/2026-06-09-full/verdict.md`](https://github.com/elvisun/newsjack/blob/main/eval/fable-vs-opus/runs/2026-06-09-full/verdict.md).

If you replicate this with a different judge model and see a different position bias, that's the actually-interesting follow-up. Tell us.