MASAR

Model quality measurement · guards against drift over time

Eval Harness

How it works

For every published entry flagged as an eval case, we re-ask the question through the full /ask pipeline (RAG + few-shot + LLM) and compare the result against the canonical answer. The composite score runs from 0 to 100: token recall contributes 60% and citation-page Jaccard similarity contributes 40%. A score of 60 or higher passes.
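The scoring rule can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names, the lowercase word-level tokenizer, the use of unique tokens for recall, and the handling of empty inputs are all assumptions. Only the 60/40 weights and the pass threshold of 60 come from the description above.

```python
import re

# Weights and threshold as described in the text.
RECALL_WEIGHT = 0.6
JACCARD_WEIGHT = 0.4
PASS_THRESHOLD = 60.0


def tokenize(text):
    """Assumed tokenizer: lowercase, split on word characters."""
    return re.findall(r"\w+", text.lower())


def token_recall(canonical, candidate):
    """Fraction of unique canonical-answer tokens found in the candidate."""
    expected = set(tokenize(canonical))
    if not expected:
        return 1.0  # assumption: nothing to recall counts as full recall
    produced = set(tokenize(candidate))
    return len(expected & produced) / len(expected)


def page_jaccard(expected_pages, cited_pages):
    """Jaccard similarity between expected and cited page sets."""
    a, b = set(expected_pages), set(cited_pages)
    if not a and not b:
        return 1.0  # assumption: no citations expected, none given
    return len(a & b) / len(a | b)


def composite_score(canonical, candidate, expected_pages, cited_pages):
    """Weighted 0-100 score: 60% token recall, 40% citation-page Jaccard."""
    recall = token_recall(canonical, candidate)
    jaccard = page_jaccard(expected_pages, cited_pages)
    return 100.0 * (RECALL_WEIGHT * recall + JACCARD_WEIGHT * jaccard)


def passes(score):
    """An eval case passes at a composite score of 60 or higher."""
    return score >= PASS_THRESHOLD
```

As a hypothetical case, a candidate answer that recovers five of six canonical tokens and cites one of two expected pages gets a recall of 5/6 and a Jaccard of 1/2, for a composite of about 70, which passes.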

Scope