HazelJS Eval Package


@hazeljs/eval is an evaluation toolkit for AI-native HazelJS apps: golden datasets (JSON cases with expected answers, tool names, and document IDs), classical IR metrics (precision/recall@k, MRR, NDCG), RAG heuristics, agent trajectory scoring, LLM-as-judge prompt builders, and CI-friendly reporting (reportEvalForCi).

Use it to regression-test RAG pipelines, HCEL chains, and agents before deploy—same idea as unit tests, but for non-deterministic LLM behavior with explicit thresholds.

Quick Reference

  • Purpose: Run golden datasets against your pipeline, score outputs (substring match, tool order, retrieval precision), and fail CI when quality drops.
  • When to use: After you have a working RAG or Agent flow and want repeatable checks in CI; optional offline metrics without calling an LLM.
  • Key concepts: GoldenDataset JSON → runGoldenDataset(runner) → EvalRunResult → reportEvalForCi.
  • Dependencies: Peer @hazeljs/core. For end-to-end tests that call HazelAI + RAG, also install @hazeljs/ai and @hazeljs/rag.
  • CLI: hazel eval <golden.json> runs a smoke check with a placeholder runner; real evals call runGoldenDataset from your own script.

Installation

npm install @hazeljs/eval @hazeljs/core

How scoring works (runGoldenDataset)

For each case, the runner returns { output, toolCalls?, retrievedIds? }. Scores combine:

  1. expectedOutput: 1 if the model output contains the expected string (case-insensitive; an exact match counts too), else 0.
  2. expectedToolCalls: scored by trajectory alignment (trajectoryScore) of the runner's reported toolCalls.
  3. expectedRetrievedIds: scored as precision@5 of the runner's retrievedIds against the expected relevant IDs.

If a case defines multiple expectations, scores are averaged across the active signals. The run passes when every case meets minAverageScore (default 0.7), the run average meets the same threshold, and there are no errors.

Empty expectations (no output, tools, or IDs) yield a neutral score of 1 for that case—useful for smoke datasets.
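As a rough sketch, the per-case combination described above can be pictured like this. This is an illustrative reimplementation, not the library's actual code; the trajectory and precision stand-ins are deliberately simplified:

```typescript
interface GoldenCaseLike {
  expectedOutput?: string;
  expectedToolCalls?: string[];
  expectedRetrievedIds?: string[];
}

interface CaseResultLike {
  output: string;
  toolCalls?: string[];
  retrievedIds?: string[];
}

function scoreCase(c: GoldenCaseLike, r: CaseResultLike): number {
  const signals: number[] = [];

  if (c.expectedOutput !== undefined) {
    // 1 when the output contains the expected string (case-insensitive).
    signals.push(
      r.output.toLowerCase().includes(c.expectedOutput.toLowerCase()) ? 1 : 0
    );
  }

  if (c.expectedToolCalls !== undefined) {
    // Stand-in for trajectoryScore: fraction of expected tools seen in order.
    let matched = 0;
    for (const tool of r.toolCalls ?? []) {
      if (tool === c.expectedToolCalls[matched]) matched++;
    }
    signals.push(
      c.expectedToolCalls.length === 0 ? 1 : matched / c.expectedToolCalls.length
    );
  }

  if (c.expectedRetrievedIds !== undefined) {
    // Stand-in for precision@5 over the retrieved IDs.
    const topK = (r.retrievedIds ?? []).slice(0, 5);
    const relevant = new Set(c.expectedRetrievedIds);
    signals.push(
      topK.length === 0
        ? 0
        : topK.filter((id) => relevant.has(id)).length / topK.length
    );
  }

  // No expectations at all → neutral 1; otherwise average the active signals.
  if (signals.length === 0) return 1;
  return signals.reduce((a, b) => a + b, 0) / signals.length;
}
```

A case expecting "within 30 days" against the output "Refunds are honored WITHIN 30 DAYS." scores 1 on the output signal; adding an expectedRetrievedIds signal that scores 0.5 would bring the case average to 0.75.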

API overview

  • Dataset: loadGoldenDatasetFromJson; types GoldenDataset, GoldenCase
  • Runner: runGoldenDataset, CaseRunner
  • IR metrics: precisionAtK, recallAtK, meanReciprocalRank, ndcgAtK
  • RAG: evaluateRetrieval, answerContextOverlap
  • Agent: trajectoryScore, toolCallAccuracy, AgentTrajectory
  • LLM judge: parseJudgeScore, buildRelevanceJudgePrompt, buildFaithfulnessJudgePrompt, LLMJudgeFn
  • CI: reportEvalForCi, CiReporterOptions
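For intuition, the classical IR metrics can be sketched as pure functions over a ranked list of IDs. These are simplified reimplementations for illustration only; the library's own signatures and edge-case handling may differ:

```typescript
// precision@k: fraction of the top-k retrieved IDs that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.length === 0 ? 0 : top.filter((id) => relevant.has(id)).length / top.length;
}

// recall@k: fraction of all relevant IDs found in the top-k.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// MRR: reciprocal rank of the first relevant hit (0 if none).
function meanReciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const rank = retrieved.findIndex((id) => relevant.has(id));
  return rank === -1 ? 0 : 1 / (rank + 1);
}

// NDCG@k with binary relevance: discounted gain over the ideal ordering.
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

For retrieved = [a, b, c] and relevant = {a, x} at k = 5, this gives precision ≈ 0.33, recall = 0.5, and MRR = 1.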

Golden dataset JSON

{
  "name": "support-bot",
  "version": "1.0.0",
  "cases": [
    {
      "id": "refund-1",
      "input": "How do I get a refund?",
      "expectedOutput": "within 30 days",
      "expectedToolCalls": ["lookup_policy"],
      "expectedRetrievedIds": ["policy-refunds"]
    }
  ]
}

Optional metadata per case is passed through to the runner for fixtures or A/B flags.
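For example, a case carrying pass-through metadata might look like this (the metadata field name and its contents are assumptions for illustration, not a documented schema):

```json
{
  "id": "refund-2",
  "input": "Can I return an opened item?",
  "expectedOutput": "restocking fee",
  "metadata": { "fixture": "eu-region", "variant": "B" }
}
```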

CLI smoke check

The HazelJS CLI can load a file and run the eval harness with a placeholder runner (echo-style) to validate JSON shape and wiring:

hazel eval ./eval/golden.json
hazel eval ./eval/golden.json --ci

For production evals, call runGoldenDataset from a script with your real HazelAI / agent / RAG service.

Recipes

Recipe: Run a golden dataset with a custom runner

// File: scripts/run-eval.ts
import * as path from 'path';
import {
  loadGoldenDatasetFromJson,
  runGoldenDataset,
  reportEvalForCi,
} from '@hazeljs/eval';

async function main() {
  const dataset = loadGoldenDatasetFromJson(
    path.join(process.cwd(), 'eval/golden.json')
  );

  const result = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      // `myApp` is a placeholder for your real RAG pipeline or agent.
      const output = await myApp.answer(input);
      return {
        output,
        // Report the tool calls and document IDs your pipeline actually produced:
        toolCalls: ['search', 'summarize'],
        retrievedIds: ['doc-1', 'doc-2'],
      };
    },
    { concurrency: 2, minAverageScore: 0.75 }
  );

  reportEvalForCi(result, { exitOnFail: process.env.CI === 'true' });
}

main().catch((e) => {
  console.error(e);
  process.exitCode = 1;
});

Recipe: Retrieval metrics only (no golden runner)

// File: src/eval/retrieval-spot-check.ts
import { evaluateRetrieval } from '@hazeljs/eval';

const metrics = evaluateRetrieval({
  query: 'What is HazelJS?',
  retrievedIds: ['a', 'b', 'c'],
  relevantIds: ['a', 'x'],
  k: 5,
});
// metrics.precisionAtK, recallAtK, mrr, ndcgAtK

Recipe: Fail CI when evals do not pass

After runGoldenDataset returns, pass the result to reportEvalForCi. With exitOnFail: true, a failed run sets process.exitCode = 1 so GitHub Actions and other CI systems mark the job red.

import { runGoldenDataset, reportEvalForCi, loadGoldenDatasetFromJson } from '@hazeljs/eval';

const dataset = loadGoldenDatasetFromJson('./eval/golden.json');
const result = await runGoldenDataset(dataset, async ({ input }) => {
  /* call your RAG or agent; return { output, toolCalls?, retrievedIds? } */
  return { output: '' };
}, { minAverageScore: 0.8 });
reportEvalForCi(result, { exitOnFail: true });