RAG evaluation harness for retrieval quality

    Overview

    Everyone evaluates RAG differently. Some teams use retrieval precision@k. Others measure answer faithfulness against the retrieved context. A few just eyeball the outputs. None of these approaches scales on its own when you’re iterating on embedding models, chunking strategies, and reranker configurations simultaneously.

    This harness provides a structured evaluation pipeline that runs retrieval and generation against a test set, then scores each stage independently: retrieval quality, answer faithfulness, and end-to-end answer quality, with latency recorded alongside.
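
    A minimal sketch of that run loop, assuming the retriever and generator are passed in as plain callables. The names `EvalCase`, `RunRecord`, `run_test_set`, and `p50_latency` are hypothetical and are not taken from the harness itself.

```python
import time
from dataclasses import dataclass
from statistics import median
from typing import Callable, Sequence


@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]    # ground-truth document ids for this query
    reference_answer: str     # ground-truth answer


@dataclass
class RunRecord:
    case: EvalCase
    retrieved_ids: list[str]
    answer: str
    latency_s: float          # wall-clock time for retrieval + generation


def run_test_set(
    cases: Sequence[EvalCase],
    retrieve: Callable[[str], list[str]],        # assumed: query -> ranked doc ids
    generate: Callable[[str, list[str]], str],   # assumed: (query, doc ids) -> answer
) -> list[RunRecord]:
    """Run every test case once and keep the raw outputs for later scoring."""
    records = []
    for case in cases:
        start = time.perf_counter()
        retrieved = retrieve(case.query)
        answer = generate(case.query, retrieved)
        latency = time.perf_counter() - start
        records.append(RunRecord(case, retrieved, answer, latency))
    return records


def p50_latency(records: Sequence[RunRecord]) -> float:
    """Median end-to-end latency in seconds."""
    return median(r.latency_s for r in records)
```

    Keeping the raw retrieved ids and answers in each record means the scorers can run afterwards without repeating retrieval or generation.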

    Why Separate the Metrics

    The core insight is that RAG failure happens at different stages, and you need to know which stage to fix. A bad answer could mean:

    • The right document wasn’t retrieved (retrieval issue)
    • The right document was retrieved but the model ignored it (faithfulness issue)
    • The right document was retrieved and used, but the answer is wrong (model capability issue)

    By scoring each stage independently, the harness tells you where to invest engineering effort.
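
    The three failure cases above can be read as a simple triage rule. The sketch below assumes each stage has already been reduced to a boolean per query; `FailureStage` and `diagnose` are illustrative names, not part of the harness.

```python
from collections import Counter
from enum import Enum


class FailureStage(Enum):
    NONE = "none"
    RETRIEVAL = "retrieval"
    FAITHFULNESS = "faithfulness"
    MODEL_CAPABILITY = "model capability"


def diagnose(retrieved_relevant: bool, answer_grounded: bool, answer_correct: bool) -> FailureStage:
    """Attribute a bad answer to the earliest stage that failed."""
    if answer_correct:
        return FailureStage.NONE
    if not retrieved_relevant:
        return FailureStage.RETRIEVAL          # right document never reached the model
    if not answer_grounded:
        return FailureStage.FAITHFULNESS       # document was there, model ignored it
    return FailureStage.MODEL_CAPABILITY       # document used, answer still wrong


def tally(flags: list[tuple[bool, bool, bool]]) -> Counter:
    """Count failures per stage across a test set."""
    return Counter(diagnose(*f) for f in flags)
```

    Tallying the stages over the whole test set gives a rough picture of which component is responsible for most of the bad answers.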

    Architecture

    The harness has three scoring layers:

    Retrieval scoring: For each query in the test set, the ground truth documents are known. The harness measures precision@k and recall@k across different embedding models and chunk sizes. It also records each model's embedding dimension and tracks the cosine similarity distribution to catch embedding drift.
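
    For reference, precision@k and recall@k can be computed as in the sketch below. It uses the common convention of dividing precision by k and assumes document ids are unique within a ranking; the function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are ground-truth relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the ground-truth relevant ids that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # assumption: a query with no relevant documents scores 0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)
```

    Averaging both values over all queries gives one number per configuration, which is how a column such as Precision@5 would typically be filled in.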

    Faithfulness scoring: The generated answer is compared against the retrieved context using a lightweight LLM-as-judge prompt. The judge scores whether the answer stays within the bounds of the retrieved context, flags hallucinations, and identifies when the model introduces external knowledge.
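
    One way such a judge could be wired up is sketched below. The prompt wording, the JSON reply format, and the `call_llm` callable are all assumptions made for illustration; the actual judge prompt used by the harness is not shown in this write-up.

```python
import json
from typing import Callable

# Hypothetical judge prompt; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are grading an answer for faithfulness to the provided context.

Context:
{context}

Answer:
{answer}

Reply with JSON only: {{"score": <number from 0 to 1>, "unsupported_claims": [<strings>]}}
A score of 1 means every claim in the answer is supported by the context."""


def judge_faithfulness(
    answer: str,
    context_chunks: list[str],
    call_llm: Callable[[str], str],   # assumed: prompt text -> raw model reply
) -> dict:
    """Ask the judge model for a faithfulness score and parse its JSON reply."""
    prompt = JUDGE_PROMPT.format(context="\n---\n".join(context_chunks), answer=answer)
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
        score = float(verdict["score"])
        claims = [str(c) for c in verdict.get("unsupported_claims", [])]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed replies are surfaced rather than silently scored.
        return {"score": None, "unsupported_claims": [], "parse_error": True}
    return {
        "score": min(max(score, 0.0), 1.0),   # clamp to [0, 1]
        "unsupported_claims": claims,
        "parse_error": False,
    }
```

    Passing the model call in as a function keeps the sketch independent of any particular LLM client and makes it easy to test with a stub.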

    End-to-end scoring: The final answer is compared against the ground truth answer using a combination of semantic similarity (embedding-based) and exact match on key facts. This gives a single number intended to track user-perceived quality.
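
    A rough sketch of how the two signals might be blended. The equal weighting, the case-insensitive substring test for key facts, and the function names are assumptions; the write-up does not state how the harness combines them. Embeddings are taken as plain lists of floats produced elsewhere.

```python
import math
import re
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace before matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def key_fact_match(answer: str, key_facts: Sequence[str]) -> float:
    """Fraction of key facts that appear verbatim (after normalization) in the answer."""
    if not key_facts:
        return 0.0
    normalized = _normalize(answer)
    return sum(1 for fact in key_facts if _normalize(fact) in normalized) / len(key_facts)


def e2e_score(
    answer_vec: Sequence[float],
    reference_vec: Sequence[float],
    answer: str,
    key_facts: Sequence[str],
    semantic_weight: float = 0.5,   # assumed weighting, not from the harness
) -> float:
    """Weighted blend of semantic similarity and key-fact match, in [0, 1]."""
    semantic = max(cosine_similarity(answer_vec, reference_vec), 0.0)
    if not key_facts:
        return semantic             # nothing to match, fall back to similarity only
    return semantic_weight * semantic + (1.0 - semantic_weight) * key_fact_match(answer, key_facts)
```

    Exact matching is deliberately strict: it catches wrong dates or amounts that an embedding comparison alone can rate as highly similar.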

    Current Results

    Running against a 500-query test set from a legal document corpus:

    Configuration                      Precision@5  Faithfulness  E2E Score  Latency (p50)
    text-embedding-3-small, 512-chunk  0.72         0.68          0.61       1.2s
    text-embedding-3-large, 256-chunk  0.78         0.74          0.67       1.8s
    custom fine-tuned, 512-chunk       0.84         0.81          0.76       1.4s

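    The fine-tuned embedding model shows a meaningful lift over text-embedding-3-small at the same 512 chunk size (precision@5 from 0.72 to 0.84, E2E score from 0.61 to 0.76), but the biggest gains came from reducing chunk overlap from 20% to 10%, which leaves less duplicate context in the retrieval pool.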
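
    To make the overlap effect concrete, the sketch below splits a token list into fixed-size chunks with a fractional overlap and measures how much of the resulting pool is repeated text. It assumes chunk sizes are counted in tokens, which the results table does not state, and the function names are illustrative.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: float) -> list[list[str]]:
    """Split tokens into chunks of chunk_size, each sharing `overlap` of its length with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, chunk_size - round(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def duplicated_fraction(n_tokens: int, chunks: list[list[str]]) -> float:
    """Share of tokens in the chunk pool that are repeats of tokens in another chunk."""
    total = sum(len(c) for c in chunks)
    if total == 0:
        return 0.0
    return (total - n_tokens) / total


# Usage: compare the two overlap settings on a synthetic 1000-token document.
tokens = [f"t{i}" for i in range(1000)]
for overlap in (0.2, 0.1):
    chunks = chunk_tokens(tokens, chunk_size=100, overlap=overlap)
    print(overlap, len(chunks), round(duplicated_fraction(len(tokens), chunks), 3))
```

    Lower overlap produces fewer chunks and a smaller duplicated share, so near-identical neighbours are less likely to crowd the top-k results.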

    What’s Left

    • Adding adversarial query generation to test retrieval robustness
    • Building a dashboard for tracking metric drift over time
    • Supporting multi-modal retrieval (images + text)