# HazelJS Eval Package

`@hazeljs/eval` is an evaluation toolkit for AI-native HazelJS apps: golden datasets (JSON cases with expected answers, tool names, and document IDs), classical IR metrics (precision/recall@k, MRR, NDCG), RAG heuristics, agent trajectory scoring, LLM-as-judge prompt builders, and CI-friendly reporting (`reportEvalForCi`).

Use it to regression-test RAG pipelines, HCEL chains, and agents before deploy: the same idea as unit tests, but for non-deterministic LLM behavior, with explicit thresholds.
## Quick Reference

- Purpose: Run golden datasets against your pipeline, score outputs (substring match, tool order, retrieval precision), and fail CI when quality drops.
- When to use: After you have a working RAG or agent flow and want repeatable checks in CI; optional offline metrics are available without calling an LLM.
- Key concepts: `GoldenDataset` JSON → `runGoldenDataset(runner)` → `EvalRunResult` → `reportEvalForCi`.
- Dependencies: peer dependency on `@hazeljs/core`. For end-to-end tests that call HazelAI + RAG, also install `@hazeljs/ai` and `@hazeljs/rag`.
- CLI: `hazel eval <golden.json>` runs a smoke check with a placeholder runner; real evals use `runGoldenDataset` in your own script.
## Installation

```bash
npm install @hazeljs/eval @hazeljs/core
```
## How scoring works (`runGoldenDataset`)

For each case, the runner returns `{ output, toolCalls?, retrievedIds? }`. Scores combine:

- `expectedOutput` — `1` if the model output contains the expected string (a case-insensitive exact match also counts), else `0`.
- `expectedToolCalls` — combined with the output score via an average, using trajectory alignment (`trajectoryScore`).
- `expectedRetrievedIds` — combined via precision@5 over retrieved IDs vs. expected relevant IDs.

If a case defines multiple expectations, scores are averaged across the active signals. The run passes when every case meets `minAverageScore` (default `0.7`), the run average meets the same threshold, and there are no errors.

Empty expectations (no output, tools, or IDs) yield a neutral score of `1` for that case — useful for smoke datasets.
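The scoring rules above can be sketched in a few lines. Everything here is an illustrative approximation, not the `@hazeljs/eval` implementation: the helper names (`scoreCase`, `outputScore`, `toolOrderScore`), the exact trajectory alignment, and how precision@5 handles fewer than five results are all assumptions.

```typescript
// Illustrative sketch of the case-scoring rules; names are hypothetical,
// not @hazeljs/eval APIs.

interface GoldenCaseLike {
  expectedOutput?: string;
  expectedToolCalls?: string[];
  expectedRetrievedIds?: string[];
}

interface RunnerResultLike {
  output: string;
  toolCalls?: string[];
  retrievedIds?: string[];
}

// 1 if the output contains the expected string (case-insensitive), else 0.
function outputScore(output: string, expected: string): number {
  return output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
}

// Fraction of expected tool calls found in order — a crude stand-in for the
// package's trajectoryScore alignment.
function toolOrderScore(actual: string[], expected: string[]): number {
  let matched = 0;
  for (const call of actual) {
    if (matched < expected.length && call === expected[matched]) matched++;
  }
  return expected.length === 0 ? 1 : matched / expected.length;
}

// precision@k over retrieved IDs vs. expected relevant IDs.
function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  const top = retrieved.slice(0, k);
  if (top.length === 0) return 0;
  return top.filter((id) => relevant.includes(id)).length / top.length;
}

// Average across whichever expectation signals the case defines;
// a case with no expectations scores a neutral 1.
function scoreCase(c: GoldenCaseLike, r: RunnerResultLike): number {
  const signals: number[] = [];
  if (c.expectedOutput !== undefined) {
    signals.push(outputScore(r.output, c.expectedOutput));
  }
  if (c.expectedToolCalls) {
    signals.push(toolOrderScore(r.toolCalls ?? [], c.expectedToolCalls));
  }
  if (c.expectedRetrievedIds) {
    signals.push(precisionAtK(r.retrievedIds ?? [], c.expectedRetrievedIds, 5));
  }
  if (signals.length === 0) return 1;
  return signals.reduce((a, b) => a + b, 0) / signals.length;
}
```

The key property to notice is the averaging: a case that nails the output but misses a tool call still earns partial credit rather than failing outright.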
## API overview

| Area | Exports |
|---|---|
| Dataset | `loadGoldenDatasetFromJson`; types `GoldenDataset`, `GoldenCase` |
| Runner | `runGoldenDataset`, `CaseRunner` |
| IR metrics | `precisionAtK`, `recallAtK`, `meanReciprocalRank`, `ndcgAtK` |
| RAG | `evaluateRetrieval`, `answerContextOverlap` |
| Agent | `trajectoryScore`, `toolCallAccuracy`, `AgentTrajectory` |
| LLM judge | `parseJudgeScore`, `buildRelevanceJudgePrompt`, `buildFaithfulnessJudgePrompt`, `LLMJudgeFn` |
| CI | `reportEvalForCi`, `CiReporterOptions` |
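To make the IR metrics concrete, here is a minimal, self-contained sketch of the math they compute. The parameter shapes (arrays plus a `Set` of relevant IDs) are assumptions for illustration, not the package's actual signatures.

```typescript
// Self-contained sketches of the standard IR metric definitions.
// Parameter shapes are assumptions, not @hazeljs/eval signatures.

// precision@k: fraction of the top-k retrieved IDs that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  if (top.length === 0) return 0;
  return top.filter((id) => relevant.has(id)).length / top.length;
}

// recall@k: fraction of all relevant IDs found in the top-k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const top = retrieved.slice(0, k);
  return top.filter((id) => relevant.has(id)).length / relevant.size;
}

// MRR: reciprocal rank of the first relevant result, 0 if none appears.
function meanReciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// NDCG@k with binary relevance: DCG of the ranking over the ideal DCG.
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Precision and recall ignore ordering within the top-k window; MRR and NDCG are the rank-sensitive ones, which is why a retriever that buries the right document at position 5 can pass precision@5 yet score poorly on NDCG.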
## Golden dataset JSON

```json
{
  "name": "support-bot",
  "version": "1.0.0",
  "cases": [
    {
      "id": "refund-1",
      "input": "How do I get a refund?",
      "expectedOutput": "within 30 days",
      "expectedToolCalls": ["lookup_policy"],
      "expectedRetrievedIds": ["policy-refunds"]
    }
  ]
}
```

Optional `metadata` per case is passed through to the runner for fixtures or A/B flags.
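If you generate golden files programmatically, a shape check like the following can catch malformed JSON before it reaches `loadGoldenDatasetFromJson`. The types mirror the example above, but the guard itself (`isGoldenDataset`) is a hypothetical helper, not part of the package.

```typescript
// Hypothetical shape guard mirroring the golden dataset JSON above;
// isGoldenDataset is not a @hazeljs/eval export.

interface GoldenCase {
  id: string;
  input: string;
  expectedOutput?: string;
  expectedToolCalls?: string[];
  expectedRetrievedIds?: string[];
  metadata?: Record<string, unknown>; // passed through to the runner
}

interface GoldenDataset {
  name: string;
  version: string;
  cases: GoldenCase[];
}

// Checks the required fields; optional expectation fields are left to the
// harness, which treats missing ones as inactive signals.
function isGoldenDataset(value: unknown): value is GoldenDataset {
  if (typeof value !== 'object' || value === null) return false;
  const d = value as Record<string, unknown>;
  return (
    typeof d.name === 'string' &&
    typeof d.version === 'string' &&
    Array.isArray(d.cases) &&
    d.cases.every(
      (c) =>
        typeof c === 'object' &&
        c !== null &&
        typeof (c as GoldenCase).id === 'string' &&
        typeof (c as GoldenCase).input === 'string'
    )
  );
}
```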
## CLI smoke check

The HazelJS CLI can load a file and run the eval harness with a placeholder runner (echo-style) to validate JSON shape and wiring:

```bash
hazel eval ./eval/golden.json
hazel eval ./eval/golden.json --ci
```

For production evals, call `runGoldenDataset` from a script with your real HazelAI / agent / RAG service.
## Related Pages

- AI Package — `HazelAI` and the RAG facade (`ai.rag.ask`)
- RAG Package — Retrieval pipelines and vector stores
- Agent Package — Tool calls and trajectories to assert against
- Guardrails Package — Complementary safety checks on inputs/outputs
- HCEL Guide — Chains you can wrap in a custom eval runner
- Installation — Full package list
## Recipes

### Recipe: Run a golden dataset with a custom runner

```typescript
// File: scripts/run-eval.ts
import * as path from 'path';
import {
  loadGoldenDatasetFromJson,
  runGoldenDataset,
  reportEvalForCi,
} from '@hazeljs/eval';

async function main() {
  const dataset = loadGoldenDatasetFromJson(
    path.join(process.cwd(), 'eval/golden.json')
  );
  const result = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      // myApp is a placeholder for your application service; wire in your
      // own RAG or agent call here.
      const output = await myApp.answer(input);
      return {
        output,
        // Report the tool calls and retrieved doc IDs your pipeline
        // actually produced (hardcoded here for illustration).
        toolCalls: ['search', 'summarize'],
        retrievedIds: ['doc-1', 'doc-2'],
      };
    },
    { concurrency: 2, minAverageScore: 0.75 }
  );
  reportEvalForCi(result, { exitOnFail: process.env.CI === 'true' });
}

main().catch((e) => {
  console.error(e);
  process.exitCode = 1;
});
```
### Recipe: Retrieval metrics only (no golden runner)

```typescript
// File: src/eval/retrieval-spot-check.ts
import { evaluateRetrieval } from '@hazeljs/eval';

const metrics = evaluateRetrieval({
  query: 'What is HazelJS?',
  retrievedIds: ['a', 'b', 'c'],
  relevantIds: ['a', 'x'],
  k: 5,
});
// metrics.precisionAtK, recallAtK, mrr, ndcgAtK
```
### Recipe: Fail CI when evals do not pass

After `runGoldenDataset` returns, pass the result to `reportEvalForCi`. With `exitOnFail: true`, a failed run sets `process.exitCode = 1` so GitHub Actions and other CI systems mark the job red.

```typescript
import {
  runGoldenDataset,
  reportEvalForCi,
  loadGoldenDatasetFromJson,
} from '@hazeljs/eval';

const dataset = loadGoldenDatasetFromJson('./eval/golden.json');
const result = await runGoldenDataset(
  dataset,
  async ({ input }) => {
    // Call your RAG or agent here; return { output, toolCalls?, retrievedIds? }.
    return { output: '' };
  },
  { minAverageScore: 0.8 }
);
reportEvalForCi(result, { exitOnFail: true });
```