LLMs Can Improve at Code by Training on Their Own Wrong Answers
Simple Self-Distillation (SSD) lets LLMs improve at code generation by training on their own unverified outputs; no correctness labels or execution environment are needed. Qwen3-30B gains 12.9 points on LiveCodeBench v6.
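The core idea can be sketched as a data-construction step: sample completions from the model itself and keep all of them as fine-tuning targets, with no filtering by tests or execution. A minimal sketch (function names and the toy generator are illustrative, not the paper's actual code):

```python
import random

def self_distillation_dataset(model_generate, prompts, k=4, seed=0):
    """Build an SSD-style fine-tuning set: pair each prompt with the
    model's own sampled completions, keeping ALL of them -- no unit
    tests, no execution, no correctness labels."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        for _ in range(k):  # k samples per prompt
            completion = model_generate(prompt, rng)
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Toy stand-in for sampling from an LLM (hypothetical).
def toy_generate(prompt, rng):
    return f"# attempt {rng.randint(0, 9)} for: {prompt}"

pairs = self_distillation_dataset(
    toy_generate, ["sum two ints", "reverse a list"], k=2
)
# pairs now holds 4 unverified (prompt, completion) examples,
# ready for ordinary supervised fine-tuning.
```

The point of the sketch is what is absent: there is no verifier or sandbox between generation and training, which is what distinguishes this setup from rejection-sampling or execution-filtered self-training pipelines.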