AI Analysis

Most Coding Agents Break 75%+ of Their Own Fixes Over Time

SWE-CI is a new benchmark that evaluates coding agents on long-term codebase maintenance via continuous integration loops — not one-shot bug fixes. Most models introduced regressions on 75%+ of tasks. Only Claude Opus exceeded a 50% zero-regression rate.

Most coding agent benchmarks ask a model to fix one bug, in one snapshot of a repo, one time. SWE-CI asks something harder: can you maintain a real codebase across months of evolution without breaking what you just fixed?

The answer, for most models, is no.

  • 100 tasks from 68 real Python repos, each spanning an average of 233 days and 71 consecutive commits
  • 18 models from 8 providers tested, consuming over 10 billion tokens
  • Most models had a zero-regression rate below 0.25 — meaning they introduced regressions in 75%+ of tasks
  • Only the Claude Opus series exceeded a 50% zero-regression rate
  • Newer models consistently outperform older ones within the same provider family — but none have “solved” maintainability

The Problem with SWE-bench

SWE-bench is the canonical coding agent benchmark. An agent gets a GitHub issue, a repo snapshot, and one shot to produce a fix. If the tests pass, it wins.

That’s not how software development works.

Real codebases evolve. A feature added in January affects tests in March. A dependency upgrade breaks something from six months ago. The consequences of past decisions accumulate — and a benchmark that evaluates isolated, one-shot fixes can’t see any of that.

SWE-CI introduces what the authors call an “evolution-based” paradigm instead of a snapshot-based one. Each task plays out across multiple CI iterations (up to 20), where the agent must make code changes that not only pass today’s tests but don’t undo what was already working.

How the Benchmark Works

Task construction started from 4,923 repositories filtered down through a series of quality gates: 3+ years of active maintenance, 500+ GitHub stars, unit tests present, permissive license (MIT/Apache-2.0). From 8,311 candidate commit pairs, 1,458 were viable, and after a final quality pass, 100 tasks survived.

Each task has a minimum of 500 lines of modified source code (excluding test changes) and comes packaged in a pre-built Docker environment for reproducibility.

The evaluation loop uses a dual-agent protocol:

  • Architect — reviews failing CI tests, identifies root causes, writes a high-level requirement document (max 5 incremental requirements per iteration)
  • Programmer — reads the requirements, plans an implementation, modifies the code

This split mimics actual engineering workflow: someone understands the problem, someone else implements the fix. It also keeps the tasks tractable — the agent isn’t expected to context-switch between analysis and coding in one pass.

Scoring is done via EvoScore, a future-weighted metric that rewards long-term stability. Passing 90% of tests in iteration 10 matters more than passing 90% in iteration 2, because that’s where accumulated technical debt shows up. A regression — breaking tests that previously passed — is penalized heavily.

SWE-bench vs. SWE-CI

|                      | SWE-bench              | SWE-CI                     |
|----------------------|------------------------|----------------------------|
| Evaluation paradigm  | Snapshot-based         | Evolution-based            |
| Task scope           | Single issue, one shot | Up to 20 CI iterations     |
| Time horizon         | Single commit          | Avg. 233 days / 71 commits |
| Measures regressions | No                     | Yes                        |
| Agent protocol       | Single agent           | Architect + Programmer     |

What the Results Show

The headline finding is regression control — or the near-total lack of it.

Most models introduce regressions on more than 75% of tasks. The zero-regression rate (fraction of tasks where the model makes no change that breaks an already-passing test) is below 0.25 for the majority of models tested. For a benchmark of long-term maintainability, that’s the core failure mode.

Claude Opus is the outlier. It’s the only model family to exceed a 50% zero-regression rate. Every other provider falls short of that bar — in some cases significantly.

Within provider families, newer always beats older. This holds without exception across all 8 providers tested. The gap also widens for post-2026 releases, suggesting the industry is directionally improving on this dimension even if no model has solved it.

Zero-Regression Rate (higher is better)

Fraction of tasks where the agent introduced zero regressions across all CI iterations

  • Claude Opus (series): above 50%
  • Most other models: below 25%

18 models from 8 providers tested. Ranges are approximate; exact per-model scores are not published in the preprint.

The paper also finds that provider preferences diverge when the EvoScore weighting parameter (γ) changes. Some providers — MiniMax, DeepSeek, GPT — score better when long-term stability is weighted heavily. Others — Kimi, GLM — score better under short-term weighting. Claude and Qwen are stable across both. What you optimize for matters, and different architectures / training regimes produce different tradeoff profiles.

Why This Is Hard

The reason regression rates are so high is probably structural: agents are local optimizers. They see the current failing tests, fix them, and move on. They don’t model how their change interacts with tests that are currently passing, especially across a 200-day history of accumulated logic.

This is also why the benchmark required 10 billion tokens to run across 18 models. Each task is genuinely long — multiple CI rounds, each involving code comprehension, requirement synthesis, and implementation against a realistic repo. You can’t fake your way through this with pattern matching on file diffs.

The dual-agent split (Architect / Programmer) helps, but each task still allows up to 20 opportunities to introduce a regression. Compound errors add up fast.

Caveats

The benchmark is currently 100 tasks, all Python, all open-source repos with permissive licenses. That’s a deliberate tradeoff for quality over quantity — each task required manual validation — but it’s a narrow slice of real-world software diversity. Enterprise codebases with proprietary dependencies, mixed language stacks, and legacy constraints are not represented.

The exact per-model scores aren’t in the preprint: the version reviewed here reports qualitative observations and relative rankings rather than a numerical leaderboard.

The 20-iteration cap and 3600-second per-test timeout also set an upper bound on task complexity. Real CI pipelines can be longer and slower — the benchmark is realistic but not exhaustive.

Still, as a signal: if the best models on the market can only avoid regressions half the time on a controlled benchmark, the gap between “writes code that passes tests” and “maintains a codebase reliably” is real, large, and worth measuring.

#research #agents #benchmarks #coding