MASAR
How it works
For every published entry flagged as an eval case, we re-ask the question through the full /ask pipeline (RAG + few-shot + LLM) and compare the result against the canonical answer. The composite score (0-100) is a weighted sum: token recall counts for 60% and citation-page Jaccard similarity for 40%. A case passes with a score of 60 or higher.
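The scoring rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MASAR's actual implementation: the function names, the lowercase word tokenizer, the set-based definition of token recall, and the handling of empty inputs are all hypothetical choices. Only the 60/40 weights and the pass threshold of 60 come from the description.

```python
import re
from typing import Iterable


def tokenize(text: str) -> set[str]:
    # Assumption: lowercase word tokens, deduplicated. The real tokenizer may differ.
    return set(re.findall(r"\w+", text.lower()))


def token_recall(canonical: str, generated: str) -> float:
    # Share of canonical-answer tokens that also appear in the generated answer.
    canon = tokenize(canonical)
    if not canon:
        return 1.0  # Assumption: nothing to recall counts as full recall.
    return len(canon & tokenize(generated)) / len(canon)


def page_jaccard(canonical_pages: Iterable[int], generated_pages: Iterable[int]) -> float:
    # Jaccard similarity: |intersection| / |union| of the cited page sets.
    a, b = set(canonical_pages), set(generated_pages)
    if not a and not b:
        return 1.0  # Assumption: two empty citation sets count as a match.
    return len(a & b) / len(a | b)


def composite_score(canonical: str, generated: str,
                    canonical_pages: Iterable[int],
                    generated_pages: Iterable[int]) -> float:
    # Weights from the description: 60% token recall, 40% citation-page Jaccard.
    recall = token_recall(canonical, generated)
    jaccard = page_jaccard(canonical_pages, generated_pages)
    return 100.0 * (0.6 * recall + 0.4 * jaccard)


def passes(score: float, threshold: float = 60.0) -> bool:
    # A case passes at or above the threshold.
    return score >= threshold
```

For example, a generated answer that recovers two of three canonical tokens and shares one of three distinct cited pages would have a recall of 2/3 and a Jaccard of 1/3, which lands below the pass threshold under these assumptions.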