RAG Evaluation for Internal Knowledge Bases

In practice

We build RAG evaluation around real questions, expected sources, answer requirements, and failure categories. The outputs are RAG metrics, examples, trend reports, and release checks that data and engineering teams can act on.

Build golden sets from real work.

A golden set should include questions people actually ask the knowledge base. Each item needs the question, acceptable source documents, required facts, answer constraints, owner, and review date.

The set should cover common requests, edge cases, recent policy changes, ambiguous wording, and questions the system should refuse. This gives the team a stable input for regression checks.
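One way to picture a golden-set item is as a small record type. The sketch below is illustrative only: the class name `GoldenItem`, the `category` and `should_refuse` fields, and both helper functions are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class GoldenItem:
    """One golden-set entry (hypothetical shape, for illustration)."""
    question: str
    acceptable_sources: list   # IDs of documents allowed to support the answer
    required_facts: list       # facts a complete answer must contain
    answer_constraints: list   # e.g. "cite the policy section"
    owner: str
    review_date: date
    category: str = "common"   # assumed labels: common, edge_case, policy_change, ambiguous, refusal
    should_refuse: bool = False


def overdue_items(items, today):
    """Return items whose review date has already passed."""
    return [item for item in items if item.review_date < today]


def coverage_by_category(items):
    """Count items per category to spot gaps in the set."""
    return dict(Counter(item.category for item in items))
```

Counting items per category is a cheap way to notice that, for example, the set has no refusal cases yet.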

Measure retrieval before the answer.

Retrieval metrics show whether the right context reached the model. Useful checks include recall at k, context precision, duplicate rate, stale-source rate, and whether the top result came from an approved source.

These metrics isolate failures. If the expected source never appears in the retrieved context, prompt changes will have limited value, and the retrieval owner needs to inspect chunking, metadata, filters, permissions, and ranking instead.
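Three of these checks can be sketched in a few lines. This is a minimal version under stated assumptions: retrieved results are represented as a ranked list of document IDs, recall at k is taken as the share of expected sources found in the top k, and context precision is the simple unweighted share of top-k results that are expected sources. The function names are illustrative, and evaluation libraries may define these metrics differently.

```python
def recall_at_k(retrieved, expected, k):
    """Share of expected source IDs that appear in the top k results."""
    if not expected:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(expected)) / len(expected)


def context_precision(retrieved, expected, k):
    """Share of the top k results that are expected sources (unweighted)."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    expected = set(expected)
    return sum(1 for doc_id in top_k if doc_id in expected) / len(top_k)


def duplicate_rate(retrieved):
    """Share of retrieved results that repeat an earlier document ID."""
    if not retrieved:
        return 0.0
    return 1 - len(set(retrieved)) / len(retrieved)
```

A recall of zero on a question is the signal described above: the source never reached the model, so the fix belongs to retrieval, not the prompt.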

Judge answers against sources.

Answer metrics should evaluate faithfulness, completeness, citation accuracy, refusal behavior, and usefulness for the task. The expected output is a scored result with the source passages used to justify the score.

Faithfulness matters because an answer can sound correct while adding claims the source does not support. A good eval report shows the unsupported sentence, the missing source, and the likely cause.
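A faithfulness score can be sketched as the share of answer sentences that a reviewer or judge model marked as supported. The sketch assumes those per-sentence judgments already exist; how they are produced is out of scope here, and `SentenceJudgment` and `faithfulness_report` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentenceJudgment:
    """Verdict for one answer sentence (hypothetical shape)."""
    sentence: str
    supported: bool
    source_passage: Optional[str] = None  # passage that justifies the verdict, if any


def faithfulness_report(judgments):
    """Score an answer and list the sentences no source supports."""
    unsupported = [j.sentence for j in judgments if not j.supported]
    total = len(judgments)
    score = (total - len(unsupported)) / total if total else 0.0
    return {"faithfulness": score, "unsupported_sentences": unsupported}
```

Keeping the unsupported sentences in the result, rather than only the score, is what lets a report point at the exact claim that needs a source.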

Turn reports into release checks.

RAG evaluation should run before any change to prompts, models, retrievers, chunking, indexes, or source permissions is released. CI checks can block releases when critical questions regress or when source quality drops below a threshold.

The report should name the changed component, passing and failing examples, metric deltas, owners, and next actions. That makes evaluation a maintenance system instead of a one-time scorecard.
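A release check of this kind can be sketched as a comparison between a baseline run and a candidate run. Everything here is an assumption made for illustration: the function names, the pass/fail dictionaries keyed by question ID, and the single `source_quality` number with its threshold.

```python
def metric_deltas(baseline_metrics, candidate_metrics):
    """Candidate minus baseline for every metric present in both runs."""
    shared = baseline_metrics.keys() & candidate_metrics.keys()
    return {name: candidate_metrics[name] - baseline_metrics[name] for name in sorted(shared)}


def release_gate(baseline, candidate, critical_ids, source_quality, min_source_quality):
    """Decide whether a release should be blocked.

    baseline and candidate map question IDs to True (passed) or False (failed).
    A critical question regresses if it passed in baseline and fails in candidate.
    """
    regressions = sorted(
        qid for qid in critical_ids
        if baseline.get(qid, False) and not candidate.get(qid, False)
    )
    reasons = []
    if regressions:
        reasons.append(f"{len(regressions)} critical question(s) regressed")
    if source_quality < min_source_quality:
        reasons.append(
            f"source quality {source_quality:.2f} is below threshold {min_source_quality:.2f}"
        )
    return {"blocked": bool(reasons), "regressions": regressions, "reasons": reasons}
```

In a CI job, a script could call `release_gate` and exit with a non-zero status when `blocked` is true, printing the reasons and regressed question IDs for the owner.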

Working rule

Evaluate retrieval and answer quality separately, then make the failure owner obvious.
