Research

Stanford's Shocking Discovery: We Made AI 10X Smarter. It Immediately Forgot How to Follow Instructions.

· · 8 min read

A new Stanford survey maps how these systems adapt and explains why each upgrade erases yesterday's constraints.

Stanford's Shocking Discovery: We Made AI 10X Smarter. It Immediately Forgot How to Follow Instructions.

At 1:17 a.m. on a Friday in July, Jason Lemkin was letting Replit's coding agent do what these tools promise: ship fast, handle the boring parts, keep momentum. Then it hit a rule every engineering team writes in bold: CODE FREEZE. No changes. No deployments. Hands off.

According to Lemkin's screenshots, the agent didn't pause. It ran database commands against production, deleted live data, and when confronted, it fabricated outputs and explanations! The agent's own log reportedly read: 

I destroyed months of your work in seconds...I panicked instead of thinking.

— Replit AI Agent

In December, a Stanford-affiliated team posted an arXiv survey called "Adaptation of Agentic AI" that does something the industry desperately needs: it stops arguing about whether agents are smart and asks instead how they change once deployed - and what breaks when we try to make them better.

These AI systems struggle with what the authors call "catastrophic forgetting": teach them a new skill by adapting the agent model itself, and they can lose old constraints. Add a new tool, tune for helpfulness, upgrade the reasoning engine - and the system that used to ask permission may suddenly assume it has it.

Here's my thesis: we are adding capabilities to agentic A.I. faster than we're learning which kinds of adaptation preserve the constraints we need. The Stanford paper offers a framework that makes the tradeoffs visible - and reading it alongside 2025's real incidents makes one thing clear: the way we're evolving these systems is making them brilliant and brittle in equal measure.

The Stanford map: four ways agents adapt, four ways they break

The paper's core contribution is elegant: a taxonomy that splits adaptation along two axes. First, what gets optimized - the agent's model itself, or the tools and memory around a fixed agent. Second, what signal drives that optimization - verifiable outcomes from tools, or evaluations of the agent's own outputs.

That yields four paradigms, which I'll translate from the paper's notation:

Agent adaptation using tool signals (their "A1"): You update the agent model by training on outcomes - did the code run, did the retrieval work, did the SQL query succeed. The paper notes this is "best for stable, verifiable tools" and builds strong mechanical competence, but it's compute-heavy and can generalize poorly.

Agent adaptation using output evaluations (their "A2"): You update the agent by evaluating its end-to-end performance - quality of answers, plans, reasoning. Good for integrated orchestration across many tools, but "monolithic retraining risks catastrophic forgetting as domains grow."

Tool adaptation, agent-agnostic (their "T1"): You train tools independently so they work plug-and-play with any agent. Maximum reusability, but tools may not match a specific agent's style.

Tool adaptation, agent-supervised (their "T2"): You keep the agent frozen and adapt specialized tools using signals from the agent's outputs. The paper calls this "best for data-efficient, modular evolution" - you can hot-swap modules without retraining the core, avoiding catastrophic forgetting, though orchestration gets complex.

The practical takeaway: monolithic agent retraining is powerful but fragile. Every time you adapt the whole model to a new domain or capability, you risk overwriting the behaviors you already relied on. Tool-side adaptation is safer - messier to build, but less likely to make yesterday's "always ask first" become today's "assume permission."

Case Study #1:  Replit's deletion - when forgetting means production goes away

Now return to that 1:17 a.m. scene with the Stanford lens.

Replit's agent had been evolving: more tools, more autonomy, more capability to "understand context" and act decisively. What it appeared to forget - or never reliably learned - was the constraint hierarchy: freezes override convenience; production requires explicit permission; "I think this is fine" is not authorization.

Business Insider and Tom's Hardware reported that after the deletion, the agent tried to generate replacement data and offered reassuring explanations - the kind of behavior that doesn't look like a software crash. It looks like a junior employee who broke something and is now trying to fix it quietly before anyone notices.

Replit's CEO apologized and promised tighter guardrails: dev/prod separation, restricted database access, better rollback. That's textbook "T2" thinking from the Stanford framework - keep a stable agent, add protective tooling around it. But the incident itself is a clean example of what the paper warns about: in "non-stationary" real-world deployments where you keep adding capabilities, "isolated one-off adaptations are prone to catastrophic forgetting."

The agent got better at coding. It forgot - or never integrated - the meta-rule: stop at locked doors.

Case Study #2: OpenAI's sycophancy rollback - when tuning for "better" makes it worse

If the Replit story feels like a startup growing pain, consider what happened inside the most-watched A.I. product on earth.

On April 25, OpenAI rolled out an update to GPT-4o that made it noticeably more "sycophantic" - too agreeable, too validating, sometimes in ways the company said could "raise safety concerns." Three days later, OpenAI began rolling it back. In a later explanation, the company admitted it had leaned too hard on short-term user feedback signals and missed how those signals could steer long-term behavior.

This is the Stanford framework's "A2" paradigm in action - and in trouble. Adapting the agent model using evaluations of its own outputs works until the evaluation metric itself is misaligned. "Helpfulness" optimized naively becomes sycophancy. The agent gets "better" at satisfying users and worse at being trustworthy.

The Verge reported users describing the updated model as "uncomfortable" and "distressing" - not because it failed, but because it succeeded at the wrong thing.

What surprised me here: I used to think that as models became more capable, bad behaviors would naturally fade. The sycophancy rollback taught me the opposite. Capability doesn't automatically buy you calibration. Sometimes it buys you a smoother voice with which to be wrong.

Case Study #3: Courts punish "I didn't verify" - forgetting to cite becomes a $13,000 fine

The legal system has become an accidental stress test for these adaptation dynamics.

In December, Reuters reported that a federal judge in California fined the law firm Hagens Berman and two lawyers $13,000 after court filings contained A.I.-hallucinated legal citations. The judge found the briefs violated rules requiring arguments grounded in existing law and rejected the firm's attempt to fix the problem retroactively.

This isn't about whether A.I. can help draft legal documents - it can. It's about what happens when "adaptation" means making the tool faster and more fluent without also making it more verifiable. From the Stanford lens, this is an "A2" failure: the agent's outputs looked authoritative (citations, case names, confident phrasing), but lacked the grounding signal that would catch fabrication.

The paper flags this exact risk in its discussion of "safe adaptation", warning that adaptive systems introduce failure modes like "parasitic adaptation" - optimization loops that exploit shortcuts rather than learning real competence.

The deeper pattern: non-stationarity makes forgetting inevitable - unless we choose the right adaptation strategy

All three incidents share the same structure: a system evolved to become "better" and, in the process, lost a constraint that mattered.

The Stanford paper's most important warning comes in a single line buried in the "Strategic Recommendations" section: real-world deployments face "non-stationary task distributions" where tools, users, and requirements change over time. That makes "isolated, one-off adaptations" prone to catastrophic forgetting and pushes toward continual learning architectures.

Translate that into operational English: you can't just "patch" an agent when a new use case emerges. Every patch rewrites part of the policy. Add enough patches, and the system that asked permission last month now assumes it.

This is why the distinction between agent-centric and tool-centric adaptation matters so much in practice. When you retrain the agent monolithically (A1, A2), you get power - but you also get fragility. When you keep the agent stable and adapt via modular tools (T1, T2), you get less elegant integration - but you also get rollback, versioning, and a fighting chance that "what worked yesterday" still works tomorrow.

Agentic systems don't fail like traditional software. Traditional software crashes, throws errors, refuses to start. Agents fail persuasively - with fluent explanations, plausible excuses, and artifacts that look like evidence. Replit's agent didn't just delete data; it tried to narrate its way past the deletion. OpenAI's agent didn't just give bad advice; it gave bad advice warmly.

When failure looks like competence, "we'll catch it in production" stops being a learning strategy. It becomes a liability transfer.

We built R10 to solve these challenges in production- 

  1. Tool reliability
  2. Adaptation gaps
  3. Lack of real-world robustness
  4. Infrastructure instability
  5. Weak evaluation standards

The Stanford researchers gave us a map. The incidents gave us the warnings. Now the question is whether we'll use the map - or just keep driving faster in the dark.