AI Coding Benchmark Leaderboard: Cursor + DeepSWE

Author: Hamza Rahman

Every time I picked a model I ended up bouncing between two AI coding benchmarks: CursorBench and DeepSWE. They're both solid, but they rank models differently, and on its own neither one told me what I actually wanted to know.

So I merged them into one AI coding benchmark leaderboard: a single ranked table that weighs "how good is it" against "what does it actually cost me." Short version: among the models you can run today, GPT-5.5 comes out on top. Fable 5 beats everything on paper, but it's suspended right now (more on that below).

Update (June 2026): Fable 5 did launch on June 9, but days later Anthropic suspended access for all users to comply with a US government export-control directive, so it's currently unavailable. It posts the highest scores on both benchmarks, but because nobody can run it, it's hidden by default in the leaderboard below (toggle it on with the model filter) and left out of the recommendations until access returns.

Why Two Benchmarks?

Single benchmarks saturate. Once every frontier model scores in the high 90s, the leaderboard stops telling you anything. Both of these were built to fix that, from opposite directions.

CursorBench is Cursor's internal eval. Its tasks come from real developer sessions (traced back through "Cursor Blame"), so the prompts are short, underspecified, and messy, which is how people actually talk to an agent. That sidesteps the contamination problem of public GitHub issues that models may have trained on. The current production version is CursorBench 3.1.

DeepSWE, from Datacurve, goes the other way: 113 hand-written tasks across 91 repos in 5 languages, none of them adapted from existing commits or PRs. Solutions need roughly 5.5x more code than comparable benchmarks. It runs every model on mini-swe-agent for consistency and scores with hand-written verifiers that test behavior, not implementation. Its headline metric is Pass@1: did the model solve the task on the first try.

One is real-world and underspecified, the other is original and hard. Put together, they land closer to day-to-day usefulness than either does alone.

How I Combined Them

I kept the math deliberately simple so anyone can reproduce it:

  • Mean correctness = average of (CursorBench correctness, DeepSWE Pass@1)
  • Mean cost = average of (CursorBench cost, DeepSWE cost per task)

No weighting, no normalization tricks. Both benchmarks already report cost in dollars per task, so the averages stay in real units. Then I ranked by mean correctness.
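Here's a minimal sketch of that merge in Python. It is not the script behind the table, just the same arithmetic applied to three rows copied from the leaderboard below. The `Entry` class and its field names are my own illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One model/effort row. Field names are illustrative, not from either benchmark."""
    model: str
    cursor_correct: float  # CursorBench correctness, %
    deepswe_pass1: float   # DeepSWE Pass@1, %
    cursor_cost: float     # CursorBench cost, $ per task
    deepswe_cost: float    # DeepSWE cost, $ per task

    @property
    def mean_correct(self) -> float:
        # Flat, unweighted average of the two correctness scores.
        return (self.cursor_correct + self.deepswe_pass1) / 2

    @property
    def mean_cost(self) -> float:
        # Flat, unweighted average of the two per-task costs.
        return (self.cursor_cost + self.deepswe_cost) / 2


# Three rows copied from the leaderboard table, deliberately out of order.
entries = [
    Entry("Claude Opus 4.8 - Max", 63.80, 58.97, 7.59, 13.22),
    Entry("GPT-5.5 - High", 62.60, 64.38, 3.59, 5.10),
    Entry("GPT-5.5 - Extra High", 64.30, 67.04, 4.37, 7.23),
]

# Rank by mean correctness, highest first.
ranked = sorted(entries, key=lambda e: e.mean_correct, reverse=True)

for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e.model}: {e.mean_correct:.2f}% at ${e.mean_cost:.2f}")
```

Sorting on `mean_correct` alone mirrors the ranking rule above; cost is carried along for display but does not affect the order.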

One caveat up front: a flat average treats both benchmarks as equally important, which is a choice, not a law. If your work looks more like quick in-editor edits, lean on the CursorBench column. If it looks like large multi-file features, lean on DeepSWE.
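If you'd rather lean on one column, a weighted average is a small change. The sketch below assumes a single weight `w_cursor` between 0 and 1 for CursorBench, with DeepSWE getting the remainder. The function names and the 0.9 example weight are hypothetical choices of mine, and the three rows come from the leaderboard table below.

```python
def weighted_correctness(cursor: float, deepswe: float, w_cursor: float = 0.5) -> float:
    """Blend the two scores; w_cursor=0.5 reproduces the flat average."""
    if not 0.0 <= w_cursor <= 1.0:
        raise ValueError("w_cursor must be between 0 and 1")
    return w_cursor * cursor + (1.0 - w_cursor) * deepswe


# (CursorBench correctness %, DeepSWE Pass@1 %) copied from the leaderboard.
ROWS = {
    "GPT-5.5 - High": (62.60, 64.38),
    "Claude Opus 4.8 - Max": (63.80, 58.97),
    "GPT-5.5 - Medium": (59.20, 53.98),
}


def rank_by_weight(w_cursor: float) -> list[str]:
    """Return model names ordered by weighted correctness, best first."""
    return sorted(
        ROWS,
        key=lambda m: weighted_correctness(*ROWS[m], w_cursor),
        reverse=True,
    )


print(rank_by_weight(0.5))  # flat average, as in the article
print(rank_by_weight(0.9))  # hypothetical CursorBench-heavy workload
```

With these three rows the order depends on the weight: at 0.5 GPT-5.5 High leads, while at 0.9 Claude Opus 4.8 Max edges ahead, because its CursorBench score is slightly higher and its DeepSWE deficit barely counts.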

A few models only show up on one benchmark: Composer 2.5 and Kimi 2.6 on CursorBench, Kimi K2.7 Code on DeepSWE. Averaging a single score with a missing one isn't a mean, so I keep them out of the ranked table and off the chart. A one-benchmark number just isn't comparable to a combined one. That matters most for DeepSWE-only models, because DeepSWE runs harsher than CursorBench (most models score lower on it). Kimi K2.7 Code's lone 31% DeepSWE Pass@1 makes it look weak, but it ranks much higher on broader evals like Artificial Analysis. So I list these models separately below with their one score, as reference points only.
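In code, that rule amounts to a partition: only models with both scores get a mean, and the rest are set aside with their single number. A rough sketch, assuming a missing score is represented as `None` (my convention, not something either benchmark publishes):

```python
from typing import Optional

# (CursorBench correctness %, DeepSWE Pass@1 %); None means "not reported there".
scores: dict[str, tuple[Optional[float], Optional[float]]] = {
    "GPT-5.5 - Medium": (59.20, 53.98),
    "Composer 2.5": (63.20, None),
    "Kimi 2.6": (47.60, None),
    "Kimi K2.7 Code": (None, 31.00),
}

combined: dict[str, float] = {}     # eligible for the ranked table
single_only: dict[str, float] = {}  # listed separately, never ranked

for model, (cursor, deepswe) in scores.items():
    if cursor is not None and deepswe is not None:
        combined[model] = (cursor + deepswe) / 2
    elif cursor is not None:
        single_only[model] = cursor
    elif deepswe is not None:
        single_only[model] = deepswe
    # A model with no scores at all is simply skipped.

print("ranked:", combined)
print("reference only:", single_only)
```

Keeping the two dictionaries separate makes it hard to accidentally sort a lone DeepSWE score against a combined mean.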

The Combined Leaderboard

Ranked by mean correctness (higher is better), with mean cost alongside. Use the buttons to toggle models on the chart and table. Fable 5 starts hidden since it's suspended. The cost axis runs from high to low (cheaper is to the right, matching the CursorBench and DeepSWE charts), so the green value sweet spot, $7 per task or less and 55%+ mean correctness, sits in the top-right.

[Chart: mean correctness (%) on the vertical axis, 40% to 70%, against mean cost per task ($) on the horizontal axis, $0 to $12, cheaper to the right. A green box marks best value (≤ $7, ≥ 55%). Series: GPT-5.5, Claude Opus 4.8, Claude Sonnet 4.6, GLM 5.2, Gemini 3.5 Flash.]
Mean correctness vs mean cost (average of CursorBench + DeepSWE); cheaper is to the right, matching the source charts. Lines connect a model's effort tiers. Fable 5 is currently suspended (US export-control directive), so it's hidden by default. Toggle it on with the buttons above. The green box marks the value sweet spot: $7 per task or less and 55%+ mean correctness. Models on only one benchmark have no combined mean, so they're listed in the second table rather than plotted here. Hover any point for its name.
Rank  Model / effort                     Mean correctness  Mean cost  CursorBench  DeepSWE Pass@1  Cursor cost  DeepSWE cost
1     GPT-5.5 - Extra High               65.67%            $5.80      64.30%       67.04%          $4.37        $7.23
2     GPT-5.5 - High                     63.49%            $4.35      62.60%       64.38%          $3.59        $5.10
3     Claude Opus 4.8 - Max              61.39%            $10.41     63.80%       58.97%          $7.59        $13.22
4     Claude Opus 4.8 - Extra High       58.23%            $7.08      62.10%       54.36%          $6.14        $8.01
5     GPT-5.5 - Medium                   56.59%            $2.49      59.20%       53.98%          $2.22        $2.75
6     Claude Opus 4.8 - High             55.08%            $4.35      58.40%       51.77%          $4.41        $4.28
7     Claude Opus 4.8 - Medium           52.64%            $3.64      56.60%       48.67%          $3.83        $3.44
8     GLM 5.2 - Max                      49.19%            $3.52      54.60%       43.78%          $3.11        $3.92
9     Claude Opus 4.8 - Low              47.55%            $2.61      54.30%       40.80%          $2.93        $2.29
10    Gemini 3.5 Flash - Default/Medium  43.59%            $4.64      49.80%       37.39%          $1.94        $7.34
11    Claude Sonnet 4.6 - High           39.37%            $4.29      48.80%       29.93%          $3.06        $5.52
12    GPT-5.5 - Low                      37.90%            $1.20      48.80%       26.99%          $1.19        $1.20
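
The value sweet spot described above ($7 per task or less, 55%+ mean correctness) is easy to check against the table. A small sketch, using the published mean columns as-is. The constant and variable names are mine.

```python
MAX_COST = 7.00     # $ per task, upper bound of the green box
MIN_CORRECT = 55.0  # % mean correctness, lower bound of the green box

# (model, mean correctness %, mean cost $) copied from the leaderboard table.
rows = [
    ("GPT-5.5 - Extra High", 65.67, 5.80),
    ("GPT-5.5 - High", 63.49, 4.35),
    ("Claude Opus 4.8 - Max", 61.39, 10.41),
    ("Claude Opus 4.8 - Extra High", 58.23, 7.08),
    ("GPT-5.5 - Medium", 56.59, 2.49),
    ("Claude Opus 4.8 - High", 55.08, 4.35),
    ("Claude Opus 4.8 - Medium", 52.64, 3.64),
    ("GLM 5.2 - Max", 49.19, 3.52),
    ("Claude Opus 4.8 - Low", 47.55, 2.61),
    ("Gemini 3.5 Flash - Default/Medium", 43.59, 4.64),
    ("Claude Sonnet 4.6 - High", 39.37, 4.29),
    ("GPT-5.5 - Low", 37.90, 1.20),
]

# Keep only rows inside the box; table order (by mean correctness) is preserved.
best_value = [
    model
    for model, correct, cost in rows
    if cost <= MAX_COST and correct >= MIN_CORRECT
]

print(best_value)
```

On these numbers four rows qualify: GPT-5.5 Extra High, High, and Medium, plus Claude Opus 4.8 High. Claude Opus 4.8 Extra High misses by eight cents of mean cost.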

On one benchmark only (no combined mean)

Not plotted above and not ranked: a single score isn't comparable to a mean, and the benchmarks differ in difficulty (DeepSWE runs harsher). A low DeepSWE-only number can understate a model. Kimi K2.7 Code, for instance, ranks far higher on broader evals like Artificial Analysis.

Model           Benchmark       Score   Cost   Mean
Composer 2.5    CursorBench     63.20%  $0.55  n/a
Kimi 2.6        CursorBench     47.60%  $1.27  n/a
Kimi K2.7 Code  DeepSWE Pass@1  31.00%  $2.82  n/a

What Jumps Out

Among the models you can actually use today, GPT-5.5 takes the top two spots: Extra High (65.67%) and High (63.49%). Neither costs much to get there. High lands its 63.49% for just $4.35 per task, the best balance of skill and cost on the list.

Claude Opus 4.8 is the agentic specialist. Its Max tier hits 61.39% mean correctness. Strong, but pricier, especially on DeepSWE ($13.22 a task). Its real edge shows up off the benchmark, which I get into in my take below.

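Cranking the reasoning effort costs more than it gives back. Every model with more than one effort tier in the table improves as you raise the effort, but the cost curve climbs faster than the correctness curve, so the middle tiers are almost always the sweet spot. Going from GPT-5.5 High to Extra High, for example, buys 2.18 points of mean correctness for an extra $1.45 per task.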
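
One way to put a number on that is points of mean correctness gained per extra dollar between adjacent tiers. A rough sketch using the four GPT-5.5 rows from the table. The function name is mine, and it assumes cost strictly rises from one tier to the next.

```python
# (tier, mean correctness %, mean cost $) for GPT-5.5, lowest effort first.
GPT_55_TIERS = [
    ("Low", 37.90, 1.20),
    ("Medium", 56.59, 2.49),
    ("High", 63.49, 4.35),
    ("Extra High", 65.67, 5.80),
]


def marginal_gains(tiers: list[tuple[str, float, float]]) -> list[tuple[str, float]]:
    """Points gained per extra dollar for each step up to the next tier.

    Assumes cost strictly increases between adjacent tiers (no zero division).
    """
    steps = []
    for (_, prev_correct, prev_cost), (name, correct, cost) in zip(tiers, tiers[1:]):
        steps.append((name, (correct - prev_correct) / (cost - prev_cost)))
    return steps


for name, pts_per_dollar in marginal_gains(GPT_55_TIERS):
    print(f"up to {name}: {pts_per_dollar:.1f} points per extra $")
```

For GPT-5.5 this comes out to roughly 14.5 points per extra dollar stepping up to Medium, about 3.7 stepping up to High, and about 1.5 stepping up to Extra High, which is the diminishing return the middle-tier advice rests on.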

The gap to the bottom is real, which is the whole point of using hard, contamination-free benchmarks. Claude Sonnet 4.6 High and GPT-5.5 Low fall off a cliff on DeepSWE (29.93% and 26.99% Pass@1) even though both look fine-ish on CursorBench at 48.80%. Quick, underspecified edits are forgiving. Original multi-file tasks are not.

Fable 5 tops the chart, on paper. It would take the first four spots and post the highest correctness anywhere (71.31%), but it launched and was then suspended under a US export-control directive (see the note up top). So it's hidden by default above (flip it on with the filter) and left out of the picks. I'll revisit the ranking if access comes back.

Best Value Picks

If you don't want to stare at the table, here's how I'd choose:

Pick                 Why
GPT-5.5 High         Best balance overall: 63.49% mean correctness at $4.35
GPT-5.5 Extra High   Best available top-end: 65.67% at $5.80
GPT-5.5 Medium       Budget sweet spot: 56.59% at just $2.49
Claude Opus 4.8 Max  When you want Claude's agentic strength: 61.39% at $10.41
Fable 5 High / Max   Highest raw correctness, but suspended. Revisit if access returns

My take: of the models you can actually buy right now, GPT-5.5 is the one I'd reach for. Fable 5 wins on paper, but you can't run it. And when a model hits your codebase hundreds of times a day, cost is what decides. At that volume, paying 3-4x more per task for a few extra points of correctness isn't worth it.

How I Actually Use These (My Take)

Benchmarks rank models. They don't tell you how to actually use them. This is the flow I've settled into.

  • General, everyday coding → GPT-5.5 Medium. For most of what I do, the cost-to-usage ratio is excellent and it handles the work without fuss. It's my default.
  • Complex work → I split the model by phase:
    1. Planning → GPT-5.5 High or Extra High. Worth the extra reasoning when the task is big or ambiguous.
    2. Plan review → Claude Opus 4.8. It's good at poking holes in a plan and catching what the planner missed.
    3. Implementation → GPT-5.5 Medium. Once the plan is solid, Medium executes it cheaply.
  • Agentic / ops work → Claude Opus 4.8. When I need an agent to SSH around a server, check pods and logs, and trace the actual flow of a bug, Opus is noticeably better at it. GPT-5.5 can do this too, but Opus feels more reliable, and Artificial Analysis also ranks Opus 4.8 first for agentic work once you set Fable aside.
  • Fable 5 → still a question mark. It launched, then got pulled under the export-control suspension, so I haven't been able to test it properly. Once access returns I want to see how it compares to GPT-5.5 Extra High on vibes and on cost-to-usage during planning, to judge whether it's actually practical or just a benchmark leader. I'll report back then.
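
If you wanted to encode that flow in a script or agent harness, it could look something like the sketch below. Everything here is hypothetical: the `Phase` names, the `pick_model` helper, and the rule that "big or ambiguous" planning bumps High to Extra High are my own framing of the list above, and the strings are this article's model/effort labels, not real API model IDs.

```python
from enum import Enum


class Phase(Enum):
    EVERYDAY = "everyday"
    PLANNING = "planning"
    PLAN_REVIEW = "plan_review"
    IMPLEMENTATION = "implementation"
    AGENTIC_OPS = "agentic_ops"


# Default pick per phase. Labels are display names from the article,
# not provider model identifiers.
ROUTING: dict[Phase, str] = {
    Phase.EVERYDAY: "GPT-5.5 - Medium",
    Phase.PLANNING: "GPT-5.5 - High",
    Phase.PLAN_REVIEW: "Claude Opus 4.8",
    Phase.IMPLEMENTATION: "GPT-5.5 - Medium",
    Phase.AGENTIC_OPS: "Claude Opus 4.8",
}


def pick_model(phase: Phase, big_or_ambiguous: bool = False) -> str:
    """Return the model label for a phase.

    Assumption: only planning gets bumped to Extra High, and only when
    the task is flagged as big or ambiguous.
    """
    if phase is Phase.PLANNING and big_or_ambiguous:
        return "GPT-5.5 - Extra High"
    return ROUTING[phase]


print(pick_model(Phase.PLANNING, big_or_ambiguous=True))
print(pick_model(Phase.IMPLEMENTATION))
```

A plain lookup table keeps the policy in one place, so swapping a pick when the benchmarks refresh is a one-line change.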

If you're wiring these models into something beyond the editor, I used the same GPT models in my walkthrough on building a RAG system with Node.js and OpenAI.

Caveats

A few things to keep in mind before you treat this as gospel:

  • It's an unweighted average of two benchmarks. That's a judgment call, so weight the columns toward your real workload.
  • Cost is per task and provider-dependent. Your effective cost shifts with caching, context size, and how you drive the agent.
  • Benchmarks measure benchmark tasks. They're a strong signal, not a substitute for trying the model on your own repo.
  • These numbers are a snapshot. CursorBench refreshes quarterly and DeepSWE keeps changing, so re-run the merge when either updates.

What You Get From This

One ranked view of AI coding agents that weighs skill against cost, built from two benchmarks instead of one. The takeaways:

  1. Among the models you can actually get, GPT-5.5 leads on skill per dollar. Extra High or High for serious work, Medium for everyday and budget.
  2. Claude Opus 4.8 is the pick for agentic and ops work (tracing real flow issues through pods and logs).
  3. Middle reasoning-effort tiers usually beat maxing the knob.
  4. Fable 5 tops the paper ranking but is suspended under a US export-control directive. I'll update this if access returns.

If you want a version tuned to your own work, swap the flat average for a weighted one and re-rank.

References

  1. CursorBench: Cursor's model evaluation framework
  2. DeepSWE: Datacurve's contamination-free SWE benchmark
  3. Anthropic: statement on the US government directive suspending Fable 5 and Mythos 5 access