RAG evaluation harness for retrieval quality

Overview

Everyone evaluates RAG differently. Some teams use retrieval precision@k. Others measure answer faithfulness against the ground truth. A few just eyeball the outputs. None of these approaches scale when you’re iterating on embedding models, chunking strategies, and reranker configurations simultaneously.

This harness provides a structured evaluation pipeline that runs retrieval and generation against a test set, then scores each stage independently: retrieval quality, answer faithfulness, and end-to-end answer quality, with latency tracked alongside.

Why Separate the Metrics

The core insight is that RAG failure happens at different stages, and you need to know which stage to fix. A bad answer could mean:

The right document wasn’t retrieved (retrieval issue)
The right document was retrieved but the model ignored it (faithfulness issue)
The right document was retrieved and used, but the answer is wrong (model capability issue)

By scoring each stage independently, the harness tells you where to invest engineering effort.
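
The three failure cases above can be turned into a simple triage rule. The sketch below is illustrative only: the `StageScores` fields, the `diagnose` function, and the 0.5 thresholds are assumptions made for this example, not part of the harness as described.

```python
from dataclasses import dataclass


@dataclass
class StageScores:
    """Per-query scores in [0, 1]; field names are hypothetical."""
    retrieval_recall: float  # share of ground-truth documents that were retrieved
    faithfulness: float      # how closely the answer sticks to the retrieved context
    answer_correct: bool     # whether the final answer matches the ground truth


def diagnose(scores: StageScores,
             min_recall: float = 0.5,
             min_faithfulness: float = 0.5) -> str:
    """Return the earliest stage that plausibly explains a bad answer.

    The thresholds are placeholders; tune them on your own test set.
    """
    if scores.answer_correct:
        return "ok"
    if scores.retrieval_recall < min_recall:
        return "retrieval"          # the right document never reached the model
    if scores.faithfulness < min_faithfulness:
        return "faithfulness"       # the document was there but the model ignored it
    return "model capability"       # retrieved and used, but still wrong


print(diagnose(StageScores(retrieval_recall=1.0, faithfulness=0.2, answer_correct=False)))
```

Checking the stages in pipeline order matters: a low faithfulness score is hard to interpret if the right document was never retrieved in the first place.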

Architecture

The harness has three scoring layers:

Retrieval scoring — For each query in the test set, the ground truth documents are known. The harness measures precision@k and recall@k across different embedding models and chunk sizes. It also tracks the average embedding dimension and cosine similarity distribution to catch embedding drift.
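
As a minimal sketch of the retrieval metrics, here are the standard definitions of precision@k and recall@k over document IDs. The function names and the use of string IDs are assumptions for illustration; the harness may represent documents differently.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by ground-truth documents.

    Divides by k even if fewer than k results came back, so short
    result lists are penalised.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of ground-truth documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # no ground truth for this query; treated as zero here
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical query: two of three ground-truth documents are in the top 5.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
print(precision_at_k(retrieved, relevant, k=5))  # 2 hits / 5 slots
print(recall_at_k(retrieved, relevant, k=5))     # 2 hits / 3 relevant
```

Reporting both numbers helps separate "the ranking is noisy" (low precision) from "the right document is missing entirely" (low recall).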

Faithfulness scoring — The generated answer is compared against the retrieved context using a lightweight LLM-as-judge prompt. The judge scores whether the answer stays within the bounds of the retrieved context, flags hallucinations, and identifies when the model introduces external knowledge.
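
One way the judge step could be wired up is sketched below. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, and the `judge` callable, which stands in for whatever LLM client is actually used (it takes a prompt string and returns the model's text reply).

```python
import json
from typing import Callable, Dict, List

# Hypothetical judge prompt; the real harness prompt is not shown in this write-up.
JUDGE_TEMPLATE = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

Reply with JSON only: {{"score": <0.0-1.0>, "unsupported_claims": [<strings>]}}
A claim is unsupported if the context does not back it, even if it is true."""


def score_faithfulness(answer: str, chunks: List[str],
                       judge: Callable[[str], str]) -> Dict:
    """Ask the judge whether the answer stays within the retrieved context."""
    prompt = JUDGE_TEMPLATE.format(context="\n---\n".join(chunks), answer=answer)
    raw = judge(prompt)
    try:
        verdict = json.loads(raw)
        score = float(verdict["score"])
        claims = list(verdict.get("unsupported_claims", []))
    except (ValueError, KeyError, TypeError, AttributeError):
        # Judges sometimes return malformed output; record it rather than guess.
        return {"score": None, "unsupported_claims": [], "parse_error": True}
    score = min(1.0, max(0.0, score))  # clamp out-of-range scores
    return {"score": score, "unsupported_claims": claims, "parse_error": False}


# Stub judge so the sketch runs without a model behind it.
def fake_judge(prompt: str) -> str:
    return '{"score": 0.5, "unsupported_claims": ["The statute was amended in 2019."]}'


print(score_faithfulness("Some answer", ["chunk one", "chunk two"], fake_judge))
```

Passing the judge in as a plain callable keeps the scoring logic testable with a stub and independent of any particular model provider.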

End-to-end scoring — The final answer is compared against the ground truth answer using a combination of semantic similarity (embedding-based) and exact match on key facts. This gives a single number that correlates with user-perceived quality.
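
A rough sketch of how the two signals might be combined into one number follows. The equal weighting, the case-insensitive substring check for key facts, and all function names are assumptions; the embeddings are expected to be computed elsewhere and passed in as plain vectors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def key_fact_match(answer: str, key_facts: Sequence[str]) -> float:
    """Share of key facts found verbatim (case-insensitive) in the answer."""
    if not key_facts:
        return 1.0  # nothing to check
    text = answer.lower()
    return sum(1 for fact in key_facts if fact.lower() in text) / len(key_facts)


def e2e_score(answer_vec: Sequence[float], truth_vec: Sequence[float],
              answer: str, key_facts: Sequence[str],
              semantic_weight: float = 0.5) -> float:
    """Weighted blend of semantic similarity and key-fact coverage, in [0, 1]."""
    semantic = max(0.0, cosine_similarity(answer_vec, truth_vec))
    facts = key_fact_match(answer, key_facts)
    return semantic_weight * semantic + (1.0 - semantic_weight) * facts


# Toy vectors stand in for real embeddings of the answer and the ground truth.
print(e2e_score([0.6, 0.8], [0.6, 0.8],
                "The filing was made in Delaware.", ["Delaware", "2021"]))
```

The exact-match term is what keeps a fluent but factually wrong answer from scoring well on semantic similarity alone.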

Current Results

Running against a 500-query test set from a legal document corpus:

Configuration	Precision@5	Faithfulness	E2E Score	Latency (p50)
text-embedding-3-small, 512-chunk	0.72	0.68	0.61	1.2s
text-embedding-3-large, 256-chunk	0.78	0.74	0.67	1.8s
custom fine-tuned, 512-chunk	0.84	0.81	0.76	1.4s

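The fine-tuned embedding model shows a meaningful lift over text-embedding-3-small at the same 512 chunk size (Precision@5 from 0.72 to 0.84, E2E score from 0.61 to 0.76), but the biggest gains came from reducing chunk overlap from 20% to 10%, which leaves less duplicate context in the retrieval pool.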
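
To make the overlap setting concrete, here is a minimal sliding-window chunker. It is a sketch under assumptions: it chunks a pre-tokenized list, treats chunk size as a token count, and expresses overlap as a fraction of the chunk size; the harness's actual chunker may work differently.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: float) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share a fraction `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Each window starts `step` tokens after the previous one.
    step = max(1, chunk_size - int(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [str(i) for i in range(100)]
print(len(chunk_tokens(tokens, chunk_size=10, overlap=0.2)))  # more, heavily shared windows
print(len(chunk_tokens(tokens, chunk_size=10, overlap=0.1)))  # fewer windows, less duplication
```

Lower overlap means fewer chunks and fewer near-duplicates competing for the same top-k slots, at the cost of a higher chance that a relevant passage is split across a chunk boundary.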

What’s Left

Adding adversarial query generation to test retrieval robustness
Building a dashboard for tracking metric drift over time
Supporting multi-modal retrieval (images + text)
