GraphRAG
March 12, 2026
GraphRAG breaks in production. Here's the structural reason.
Every GraphRAG demo uses clean data, stable entities, and zero contradictions. Production has none of those things. The problem isn't the graph — it's what's missing on top of it.
Jan Szymanski
Founder, theup
We've been building on knowledge graphs for two years. We think graph-structured retrieval is strictly better than flat vector search for enterprise knowledge. Microsoft Research's GraphRAG paper validated what practitioners already knew: structured relationships produce better answers than chunk similarity.
But there's a growing gap between GraphRAG in demos and GraphRAG in production. Most implementations break the moment you leave the demo environment. We think the failure is structural, not operational.
The demo vs. production gap
In every GraphRAG demo you've seen, the knowledge graph was built from a small, curated corpus. The entities are clean. The relationships are consistent. There's one source of truth for each fact.
In production, none of this holds:
- Entities drift and duplicate. "Microsoft Corp," "Microsoft," "MSFT," and "Microsoft Corporation" are four nodes for the same entity. Embedding similarity catches some duplicates but misses context-dependent variations — "Apple" the company vs. "Apple" the ingredient in a pharmaceutical formulation.
- Sources contradict each other. Two clinical trials report different efficacy rates. A pitch deck says $10M ARR; the board minutes say $8M. Neither is "wrong" — they're from different times, different contexts, different levels of authority.
- The graph densifies and gets noisy. Past 100K entities and millions of relationships, traversal returns more context than any prompt can hold. Without a way to rank which relationships are trustworthy, the LLM drowns in noise.
- Nobody knows what's stale. A fact ingested 18 months ago sits at the same confidence level as a fact ingested yesterday. The graph has no concept of temporal decay, domain-appropriate freshness, or evidence aging.
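The first failure mode is easy to demonstrate. A minimal sketch (all names here are illustrative, not from any particular system): naive surface-form normalization collapses the obvious variants, but it can't touch tickers or context-dependent names.

```python
import re

# Naive surface-form normalization: lowercase, keep word characters,
# drop common corporate suffixes. Enough for obvious variants only.
SUFFIXES = {"corp", "corporation", "inc", "ltd", "llc"}

def naive_key(name: str) -> str:
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

variants = ["Microsoft Corp", "Microsoft", "MSFT", "Microsoft Corporation"]
keys = {v: naive_key(v) for v in variants}
# Three of the four variants collapse to "microsoft" -- but the ticker
# "MSFT" stays a separate key, and "Apple" the company vs. "apple" the
# ingredient would wrongly collapse to the SAME key. Normalization alone
# both under-merges and over-merges.
```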
The result: GraphRAG in production makes your agent more confidently wrong. It retrieves contradictory facts, presents them as structured knowledge, and the LLM synthesizes them into a coherent, authoritative, incorrect answer.
Eight questions to ask before buying any GraphRAG pitch
If you're evaluating GraphRAG for an enterprise use case — especially in regulated industries — these are the questions that separate demo-ready from production-ready:
What happens when two sources in the graph assert contradictory values for the same entity?
If the answer is "newest wins" or "we let the LLM figure it out at query time," the system doesn't handle contradictions — it hides them. You need multi-source claim tracking where every assertion exists alongside its provenance, and a conflict detection layer that flags contradictions deterministically.
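What multi-source claim tracking looks like in miniature (a sketch with hypothetical names, not a prescribed schema): every assertion keeps its provenance, and conflict detection is a deterministic grouping operation, not an LLM judgment call.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Claim:
    entity: str
    attribute: str
    value: str
    source: str       # provenance: where the assertion came from
    asserted_at: str  # ISO date the source made the claim

def detect_conflicts(claims):
    """Group claims by (entity, attribute); flag any group whose values disagree."""
    by_key = defaultdict(list)
    for c in claims:
        by_key[(c.entity, c.attribute)].append(c)
    return {k: v for k, v in by_key.items() if len({c.value for c in v}) > 1}

claims = [
    Claim("AcmeCo", "ARR", "$10M", "pitch_deck.pdf", "2025-06-01"),
    Claim("AcmeCo", "ARR", "$8M", "board_minutes.doc", "2025-07-15"),
]
conflicts = detect_conflicts(claims)
# ("AcmeCo", "ARR") is flagged with BOTH claims and their provenance intact --
# neither value is silently discarded by a "newest wins" rule.
```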
How does entity resolution work at scale — and what happens when it's wrong?
Embedding similarity with a threshold works in demos. In production, you need multi-stage resolution: exact matching, pattern normalization, semantic comparison, and a rollback mechanism when a merge turns out to be incorrect. A false merge in a clinical knowledge base — combining two different drugs because their names are similar — can be catastrophic.
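The staged structure can be sketched as follows. This is an illustration under simplifying assumptions (the class and method names are ours, and the semantic-comparison stage is omitted because it needs an embedding model); the point is the shape: ordered stages, plus a merge log that makes every merge reversible.

```python
class EntityResolver:
    """Illustrative multi-stage resolver: exact match first, then pattern
    normalization. Every merge is logged so a bad merge can be rolled back."""

    def __init__(self):
        self.canonical = {}   # surface form or normalized key -> canonical id
        self.merge_log = []   # (surface_form, canonical_id), in merge order

    @staticmethod
    def _normalize(name):
        n = name.lower().strip(" .")
        for suffix in (" corporation", " corp", " inc", " ltd"):
            if n.endswith(suffix):
                n = n[: -len(suffix)]
        return n

    def resolve(self, name):
        if name in self.canonical:                 # stage 1: exact match
            return self.canonical[name]
        key = self._normalize(name)
        if key in self.canonical:                  # stage 2: pattern match
            self.canonical[name] = self.canonical[key]
            self.merge_log.append((name, self.canonical[key]))
            return self.canonical[name]
        self.canonical[name] = name                # no match: new entity
        self.canonical[key] = name
        return name

    def rollback(self, surface_form):
        """Undo a merge that review showed to be incorrect."""
        self.merge_log = [m for m in self.merge_log if m[0] != surface_form]
        self.canonical.pop(surface_form, None)
```

Without the merge log, a false merge between two drug names is permanent; with it, the bad merge is one reversible record.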
Can the system distinguish between "this fact is well-supported" and "this fact comes from one unverified upload"?
If every relationship in the graph has equal weight, the system treats a single Slack message and a peer-reviewed study as equally authoritative. You need source-weighted confidence scoring — ideally one that learns over time as sources are proven right or wrong.
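One simple way to make that concrete (the authority values below are illustrative priors, not a fixed scheme): aggregate source authority with a noisy-OR, and nudge each source's authority toward its observed track record.

```python
# Illustrative authority priors -- in a real system these would be learned.
SOURCE_AUTHORITY = {
    "peer_reviewed_study": 0.9,
    "internal_report": 0.6,
    "slack_message": 0.2,
}

def fact_confidence(supporting_sources):
    """Noisy-OR aggregation: each independent supporting source lowers the
    probability that the fact is wrong. Unknown sources get a low prior."""
    p_wrong = 1.0
    for src in supporting_sources:
        p_wrong *= 1.0 - SOURCE_AUTHORITY.get(src, 0.1)
    return 1.0 - p_wrong

def record_outcome(src, was_correct, rate=0.1):
    """Nudge a source's authority toward its observed track record."""
    target = 1.0 if was_correct else 0.0
    SOURCE_AUTHORITY[src] += rate * (target - SOURCE_AUTHORITY[src])
```

Under this scheme a lone Slack message yields confidence 0.2 while a study plus an internal report yields 0.96, and sources proven right gradually earn more weight.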
How does the graph handle temporal relevance?
A financial metric from 18 months ago isn't as relevant as one from last week. A pharmaceutical trial result from 2 years ago is still highly relevant. The system needs domain-configurable temporal decay — not a universal recency bias that discards old-but-valid evidence.
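Domain-configurable decay is just exponential decay with a per-domain half-life. A minimal sketch (the half-life values are illustrative assumptions, not recommendations):

```python
# Illustrative half-lives: how long evidence in each domain stays fresh.
HALF_LIFE_DAYS = {
    "financial_metric": 90,         # quarterly cadence: goes stale fast
    "clinical_trial_result": 1825,  # ~5 years: old evidence stays valid
}

def temporal_weight(age_days, domain):
    """Exponential decay with a domain-specific half-life."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS[domain])

temporal_weight(540, "financial_metric")       # 18 months old: ~0.016
temporal_weight(730, "clinical_trial_result")  # 2 years old: ~0.76
```

A universal recency bias is the degenerate case where every domain gets the same half-life, which is exactly what discards old-but-valid clinical evidence.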
When the agent answers a question, can you prove what knowledge it used?
EU AI Act Article 12 requires automatic event logging "appropriate to the intended purpose" for high-risk systems. For an agent that acts on a knowledge graph, that means recording what the graph state was at query time — which relationships were active, what their confidence was, whether conflicts existed. Without this, you can't reconstruct why the agent said what it said.
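In practice that means writing a record per query that captures the retrieved subgraph as the agent saw it. A sketch under our own assumptions about the record shape (field names are hypothetical):

```python
import datetime
import hashlib
import io
import json

def log_query_event(query, active_relationships, sink):
    """Append one reconstructable record: the query, every relationship the
    agent retrieved, its confidence and conflict status, plus a digest."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "query": query,
        "graph_state": [
            {"relationship": r["id"], "confidence": r["confidence"],
             "conflicted": r["conflicted"]}
            for r in active_relationships
        ],
    }
    event["digest"] = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()).hexdigest()
    sink.write(json.dumps(event) + "\n")
    return event

buf = io.StringIO()  # stands in for an append-only log file
log_query_event(
    "What is AcmeCo's ARR?",
    [{"id": "acme-arr", "confidence": 0.62, "conflicted": True}],
    buf,
)
# buf now holds one JSON line recording exactly what the agent acted on.
```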
Does the system get better with use, or does it require constant manual curation?
A graph that requires a team of knowledge engineers to maintain is a consultancy, not infrastructure. Look for systems where every conflict resolution, every expert validation, and every source verification feeds back into the model — improving accuracy without manual intervention.
Can the system detect conflicts proactively, or only when a user happens to ask the right question?
Reactive conflict detection — "we'll find it when someone queries" — means contradictions sit undetected until they cause harm. Proactive detection scans the graph continuously using parallel detectors, surfacing conflicts before any agent encounters them.
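The proactive pattern can be sketched in a few lines (detector names and the toy graph shape are ours): each detector is a deterministic query over the full graph, the scan runs them in parallel, and findings surface with no user query involved.

```python
from concurrent.futures import ThreadPoolExecutor

def contradictory_values(graph):
    """Flag (entity, attribute) pairs asserted with more than one value."""
    return [k for k, vals in graph.items() if len(set(vals)) > 1]

def orphaned_entities(graph):
    """Placeholder second detector; a real scan would run many of these."""
    return []

DETECTORS = [contradictory_values, orphaned_entities]

def scan(graph):
    """Run every detector over the whole graph in parallel. Each detector is
    a deterministic check, so findings don't depend on an LLM's judgment."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda d: (d.__name__, d(graph)), DETECTORS)
    return {name: findings for name, findings in results if findings}

graph = {("AcmeCo", "ARR"): ["$10M", "$8M"], ("AcmeCo", "HQ"): ["Berlin"]}
scan(graph)  # flags ("AcmeCo", "ARR") before any agent ever queries it
```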
What's the audit trail?
For every fact in the graph: where did it come from? How confident is it? Has it been contested? Who resolved the conflict? When? What was their determination? If the answer to any of these is "we don't track that" — you're building on sand.
The architecture gap
These eight questions all point to the same structural gap. Most GraphRAG systems have two layers:
What most GraphRAG implementations have:
Layer 2: Agent / Application
LLM, orchestrator, chat interface
Layer 1: Knowledge Graph
Entities, relationships, embeddings, traversal
What's missing:
Layer 2: Agent / Application
Layer 1.5: Truth Layer ← consensus scoring, conflict detection,
expert routing, governance gates,
temporal decay, audit trail
Layer 1: Knowledge Graph
Layer 1 gives you structure. Layer 2 gives you intelligence. But without the truth layer in between, the intelligence operates on unverified structure — and that's how you get agents that are more confidently wrong, not less.
This is why we built Brain
Brain is that missing layer. It sits on top of any knowledge graph and provides:
- Continuous consensus scoring — every relationship carries a confidence value based on source authority, evidence weight, expert validation, and temporal decay
- Proactive conflict detection — parallel detectors scan the graph deterministically, using graph queries rather than LLM judgment
- Multi-stage entity resolution — exact matching, pattern normalization, semantic comparison, with rollback when merges are incorrect
- A learning flywheel — every resolution strengthens the model. Sources proven right gain authority. Experts build accuracy profiles. The system improves with use.
- Query-time governance — before an agent acts, Brain returns a verdict (ALLOW / WARN / BLOCK) based on the consensus state of the knowledge relevant to that query
- Sealed audit trail — every decision, every consensus score, every conflict resolution is hash-chained for compliance
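To make the last bullet concrete, here is the hash-chaining idea in miniature (a sketch, not Brain's implementation): each entry commits to the previous entry's digest, so editing any historical record invalidates every digest after it.

```python
import hashlib
import json

class AuditChain:
    """Append-only log where each entry commits to the previous entry's
    hash, so any retroactive edit breaks verification from that point on."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def append(self, record):
        prev = self.entries[-1]["digest"] if self.entries else self.GENESIS
        payload = json.dumps({"prev": prev, "record": record}, sort_keys=True)
        self.entries.append({
            "prev": prev,
            "record": record,
            "digest": hashlib.sha256(payload.encode()).hexdigest(),
        })

    def verify(self):
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps({"prev": prev, "record": e["record"]},
                                 sort_keys=True)
            if e["prev"] != prev or \
               e["digest"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
            prev = e["digest"]
        return True

chain = AuditChain()
chain.append({"event": "conflict_resolved", "verdict": "ALLOW"})
chain.append({"event": "consensus_update", "score": 0.91})
chain.verify()                                   # True: chain is intact
chain.entries[0]["record"]["verdict"] = "BLOCK"  # tamper with history
chain.verify()                                   # False: tampering detected
```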
We didn't build Brain because we think GraphRAG is wrong. We built it because GraphRAG is right about the structure — and wrong about the assumption that the structure is clean.
Enterprise knowledge is messy, contradictory, and constantly changing. The graph captures that reality. Brain makes it safe to act on.
See how Brain upgrades your GraphRAG pipeline
Consensus scoring, conflict detection, and governance gates — layered on top of any knowledge graph.
Get a Demo