Why does RAG actually break? Is Semantic Collapse real?

7 min read

Give an AI more evidence and watch it become confidently wrong - four studies on why we're better at fetching text than arbitrating it.

At 2:14 a.m., the on-call engineer is barefoot in a dark kitchen, laptop open, scrolling through a RAG trace that looks - at first glance - like competence.

The retriever did its job. Twenty chunks came back. The right one is even in there. The model answered fluently, with citations.

And the answer is wrong.

So they do what every team does when RAG starts slipping: they turn the knob. Top-k from 10 to 20. Then 40, because the model advertises a huge context window and the vector store is cheap.

The next answer is more confident, more richly sourced, yet more incorrect!

Here's my thesis: *RAG fails at scale because we're better at fetching text than arbitrating it.* You can't buy reliability by stuffing more passages into a prompt. In many systems, more context improves answers up to a point - then it makes them worse, because distractors accumulate faster than the model's ability to ignore them.

I used to believe long context would fix this. Bigger window, fewer misses, more grounding. The long-context RAG papers disabused me of that optimism.

Long context is not a safety margin. It's a bigger surface area for plausible error.

The two-bottleneck model: where RAG actually breaks

When we translate academic language from published papers into operational English, two bottlenecks appear.

There are two main bottlenecks in RAG: 1) the correct answer gets retrieved, but along with a pile of distractors; 2) the correct answer cannot be reliably resolved by the model as the context grows.

Evidence selection failure happens when retrieval misses the right chunk - or returns it, but buries it under near-matches that sound right enough to hijack the answer. Bowen Jin and colleagues call these "hard negatives", and they show that increasing retrieval size can degrade output quality because these distractors mislead the model even when the correct passage is present.

Evidence use failure happens after retrieval "succeeds": the right evidence is in the prompt, but the model can't reliably privilege it as the input grows - because of long-context fragility, "lost in the middle" effects, instruction-following breakdowns, or refusals and filtering behaviors that show up at high token counts.
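
If you want to know which bottleneck is biting your system, the cheapest instrumentation is to log, per question, two separate facts: did the gold chunk make it into the prompt at all, and did the answer come out right. Below is a minimal sketch of that split; `retrieve`, `generate`, `is_correct`, and the gold-chunk labels are hypothetical placeholders for your own stack, and the split is deliberately crude - a wrong answer with the gold chunk present can still be a hard-negative problem rather than a long-context one.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    gold_chunk_id: str   # id of the chunk that actually answers the question
    gold_answer: str

def retrieve(question: str, k: int) -> list[dict]:
    """Placeholder: your retriever, returning [{'id': ..., 'text': ...}, ...]."""
    raise NotImplementedError

def generate(question: str, passages: list[dict]) -> str:
    """Placeholder: your LLM call over the question plus retrieved passages."""
    raise NotImplementedError

def is_correct(answer: str, gold_answer: str) -> bool:
    """Placeholder: exact match, rubric, or LLM judge - whatever your eval uses."""
    raise NotImplementedError

def diagnose(items: list[EvalItem], k: int) -> dict:
    """Split failures by whether the gold chunk even reached the prompt."""
    counts = {"selection_failure": 0, "use_failure": 0, "success": 0}
    for item in items:
        passages = retrieve(item.question, k)
        gold_present = any(p["id"] == item.gold_chunk_id for p in passages)
        correct = is_correct(generate(item.question, passages), item.gold_answer)
        if correct:
            counts["success"] += 1
        elif not gold_present:
            counts["selection_failure"] += 1   # bottleneck 1: retrieval never surfaced it
        else:
            counts["use_failure"] += 1         # bottleneck 2: it was there, the model lost it
    return counts
```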

A chatbot that hallucinates wastes your time. A RAG system that hallucinates wastes your evidence.

Now I'll prove the thesis with four incidents - each one sharpening a different edge of the same blade.

Case 1: The inverted-U (when "more passages" makes answers worse)

Setup. On October 8, 2024, Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O. Arik published a study with a question every RAG team quietly bets on: if you retrieve more passages for a long-context model, does generation quality improve?

Their headline is the one knob-turners hate: for many long-context LLMs, quality "initially improves" and then "declines as the number of retrieved passages increases."

They don't shrug and call it "noise." They isolate the driver - hard negatives - and propose mitigations, including a simple training-free method: retrieval reordering.
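
As I read the paper, the reordering trick leans on the "lost in the middle" effect: instead of concatenating passages best-first, you place the strongest ones at the two ends of the prompt and let the weakest sink to the middle, where attention is thinnest. Here is a sketch of that ordering - my paraphrase of the idea, not the authors' code:

```python
def reorder_for_context(passages_by_score: list[str]) -> list[str]:
    """
    Retrieval reordering sketch: interleave passages so the highest-scored
    ones sit at the beginning and end of the prompt and the weakest ones
    land in the middle. Input is assumed to be sorted best-first.
    """
    front, back = [], []
    for i, passage in enumerate(passages_by_score):
        if i % 2 == 0:
            front.append(passage)      # ranks 1, 3, 5, ... go to the front
        else:
            back.append(passage)       # ranks 2, 4, 6, ... go to the back
    return front + back[::-1]          # reverse the back half so rank 2 ends up last

# Ranks 1..6 come out as 1, 3, 5, 6, 4, 2 - the strongest evidence sits at both edges.
print(reorder_for_context(["p1", "p2", "p3", "p4", "p5", "p6"]))
```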

This is evidence selection failure in its cleanest form. The paper shows the structural gap between "the right evidence is somewhere in the retrieved set" and "the model answers correctly." They state the point explicitly: even when relevant information is available, hard negatives can mislead the LLM and hinder accurate answers.

They also add a detail that matters for real systems: stronger retrievers can produce harder negatives that models struggle to ignore, making degradation more pronounced even when precision looks better on paper.

Insight: Recall is not reliability.

That's the inverted-U in a sentence: retrieval adds signal, then it adds plausible noise, and the model isn't a disciplined judge.

Case 2: Long context doesn't stabilize RAG; it destabilizes it differently per model

A month later, on November 5, 2024, Quinn Leng and colleagues (Databricks Mosaic Research) ran a large evaluation across 20 models, varying total context length from 2,000 to 128,000 tokens - and up to 2 million tokens when possible.

Their core finding is not subtle: using longer context does not uniformly increase RAG performance. For the majority of models they evaluated, performance rises and then falls as context length increases, and only a handful of the most recent models maintain consistent accuracy above ~64k tokens. 

They categorize long-context failure modes instead of treating them as one blob. Some models give incorrect answers; others fail to follow instructions; others refuse to answer (including refusals framed as copyright concerns).
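
You can reproduce the shape of this finding on your own stack without a twenty-model benchmark: hold the question and the gold passage fixed, pad the prompt with progressively more distractors, and tag each output instead of just scoring it. A rough sketch, in which `generate`, the distractor pool, and the tagging heuristics are all assumptions for illustration rather than the study's actual judging setup:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "copyright")

def generate(question: str, passages: list[str]) -> str:
    """Placeholder: call your LLM with the question and the passages."""
    raise NotImplementedError

def tag_output(answer: str, gold_answer: str) -> str:
    """Crude heuristics standing in for proper judging of each failure mode."""
    low = answer.lower()
    if any(marker in low for marker in REFUSAL_MARKERS):
        return "refusal"                  # declined to answer, e.g. copyright framing
    if gold_answer.lower() in low:
        return "correct"
    if len(answer.split()) > 200:
        return "instruction_failure"      # summarized or rambled instead of answering
    return "wrong_answer"

def sweep(question, gold_passage, gold_answer, distractors, sizes=(5, 20, 80)):
    """Keep the gold passage, grow the pile of distractors, tag each output."""
    results = {}
    for n in sizes:
        passages = [gold_passage] + distractors[:n]
        results[n] = tag_output(generate(question, passages), gold_answer)
    return results
```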

This is evidence use failure. You can retrieve the right text and still lose the case because the judge stops doing the job you think you assigned. The model's behavior changes as you stretch the context window: it summarizes instead of answering, refuses instead of reasoning, filters instead of grounding.

Insight: Long context is capacity. It is not control.

A bigger window is not a bigger truth window.

Case 3: Single-vector embedding retrieval has a hard ceiling, and you can't fine-tune past it

Setup. In August 2025, Orion Weller and collaborators published "On the Theoretical Limitations of Embedding-Based Retrieval." Their target is the dominant retrieval primitive in modern RAG: query embedding ↔ document embedding dot product.

Their claim is structural: the number of distinct top-k subsets a single-vector embedding retriever can return is limited by the embedding dimension. They show that for a given dimension d, there exist top-k combinations of documents that cannot be returned - no matter the query.

Then they do something that removes the usual escape hatch. They test an extremely favorable setting - directly optimizing embeddings on the test set ("free parameterized embeddings") - and the limitation still holds.

They describe a "crucial point" where the number of documents becomes too large for the embedding dimension to encode all desired combinations.
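
You can get a feel for the geometry with a toy Monte Carlo experiment - this illustrates the flavor of the limitation, not the paper's proof, and the corpus size, dimensions, and query counts below are arbitrary choices rather than their setup. Fix n random document embeddings in dimension d, fire a large number of random queries, and count how many distinct top-k sets ever appear compared with how many are combinatorially possible:

```python
import numpy as np
from math import comb

def topk_coverage(n_docs=16, k=2, dim=4, n_queries=200_000, seed=0):
    """Count distinct top-k document sets reachable by random queries."""
    rng = np.random.default_rng(seed)
    docs = rng.standard_normal((n_docs, dim))
    queries = rng.standard_normal((n_queries, dim))
    scores = queries @ docs.T                          # dot-product retrieval
    topk = np.argpartition(-scores, k, axis=1)[:, :k]  # indices of the k best docs per query
    seen = {frozenset(row) for row in topk}
    return len(seen), comb(n_docs, k)

for d in (2, 4, 8, 16):
    seen, possible = topk_coverage(dim=d)
    print(f"d={d:>2}: {seen}/{possible} top-2 sets reached by random queries")
```

The exact counts depend on the seed, but the trend is the point: as the dimension shrinks relative to the corpus, fewer and fewer of the possible top-k combinations ever show up, which is the shape of the constraint the paper proves formally.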

They introduce LIMIT, a dataset designed to surface these failures in realistic natural-language tasks, and report that even state-of-the-art models perform poorly on it despite the simplicity of the queries.

This is evidence selection failure that isn't "your retriever needs tuning." It's "your retrieval representation cannot express what relevance requires."

Sometimes the right exhibit doesn't lose because it was hidden.

Sometimes the courtroom doesn't allow that exhibit to be admitted at all.

Insight: Single-vector retrieval is a map. Some routes do not exist on the map you chose.

Case 4: "Retrieval-grounded" tools still hallucinate in the wild

The cleanest test of "RAG prevents hallucinations" is a high-stakes domain where citations matter. In a 2025 preregistered evaluation, the Stanford-affiliated team behind Hallucination-Free? assessed AI legal research tools from LexisNexis and Thomson Reuters, plus GPT-4 as a comparator.

They frame the marketing claim directly: providers say techniques like RAG "largely prevent hallucination" in legal research tasks.

Then they deliver the empirical counterpunch: RAG appears to improve performance, but the hallucination problem persists at significant levels.

They include an example that reads like a routine professional hazard: Westlaw's system claims a paragraph exists in the Federal Rules of Bankruptcy Procedure stating deadlines are jurisdictional - "but no such paragraph exists."

They also quantify system variation: Lexis+ AI is reported as the highest performer at 65% accurate; Westlaw AI-Assisted Research is 42% accurate and hallucinates nearly twice as often as other legal tools tested; Ask Practical Law AI produces incomplete answers on more than 60% of queries.

This is the combined failure in production. Even when retrieval returns relevant material, the generator can misapply it, overgeneralize from it, or fabricate support. Grounding does not enforce discipline.

Insight: A system can cite its sources and still lie about what they say.

The pattern: RAG fails as you scale because distraction scales faster than arbitration

Put the four cases together and the structure snaps into focus.

  • Jin et al. show that increasing retrieved passages can produce an inverted-U: improvement, then degradation driven by hard negatives.
  • Leng et al. show that longer context doesn't uniformly improve RAG, that many models peak then decline, and that failure modes diversify as context grows.
  • Weller et al. show that single-vector embedding retrieval hits representational limits tied to dimension and corpus size, even under unrealistically favorable optimization.
  • The Stanford legal study shows that retrieval-grounded tools still hallucinate and vary widely in reliability in a domain built around citations.

So when someone tells you "just retrieve more," translate it into what it really means: just introduce more plausible ways to be wrong!

Back to the kitchen at 2:14 a.m.: the engineer didn't need a thicker binder of exhibits.

They needed a better rule for what counts as proof.

Because RAG doesn't reward the team that retrieves the most text.

It rewards the team that can keep the wrong evidence from winning.