HazelJS Eval Package


@hazeljs/eval is an evaluation toolkit for AI-native HazelJS apps: golden datasets (JSON cases with expected answers, tool names, and document IDs), classical IR metrics (precision/recall@k, MRR, NDCG), RAG heuristics, agent trajectory scoring, LLM-as-judge prompt builders, and CI-friendly reporting (reportEvalForCi).

Use it to regression-test RAG pipelines, HCEL chains, and agents before deploy—same idea as unit tests, but for non-deterministic LLM behavior with explicit thresholds.

Quick Reference

  • Purpose: Run golden datasets against your pipeline, score outputs (substring match, tool order, retrieval precision), and fail CI when quality drops.
  • When to use: After you have a working RAG or Agent flow and want repeatable checks in CI; optional offline metrics without calling an LLM.
  • Key concepts: GoldenDataset JSON → runGoldenDataset(runner) → EvalRunResult → reportEvalForCi.
  • Dependencies: Peer @hazeljs/core. For end-to-end tests that call HazelAI + RAG, also install @hazeljs/ai and @hazeljs/rag.
  • CLI: hazel eval <golden.json> runs a smoke check with a placeholder runner; real evals call runGoldenDataset from your own script.

Installation

npm install @hazeljs/eval @hazeljs/core

How scoring works (runGoldenDataset)

For each case, the runner returns { output, toolCalls?, retrievedIds? }. Scores combine:

  1. expectedOutput: 1 if the model output contains the expected string (case-insensitive; an exact match counts too), else 0.
  2. expectedToolCalls: scored by trajectory alignment (trajectoryScore) of the runner's reported toolCalls.
  3. expectedRetrievedIds: scored as precision@5 of the runner's retrievedIds against the expected relevant IDs.

If a case defines multiple expectations, scores are averaged across the active signals. The run passes when every case meets minAverageScore (default 0.7), the run average meets the same threshold, and there are no errors.

Empty expectations (no output, tools, or IDs) yield a neutral score of 1 for that case—useful for smoke datasets.
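As a rough sketch, the per-case combination described above can be pictured like this. This is an illustrative reimplementation, not the library's actual code; the trajectory and precision stand-ins are deliberately simplified:

```typescript
interface GoldenCaseLike {
  expectedOutput?: string;
  expectedToolCalls?: string[];
  expectedRetrievedIds?: string[];
}

interface CaseResultLike {
  output: string;
  toolCalls?: string[];
  retrievedIds?: string[];
}

function scoreCase(c: GoldenCaseLike, r: CaseResultLike): number {
  const signals: number[] = [];

  if (c.expectedOutput !== undefined) {
    // 1 when the output contains the expected string (case-insensitive).
    signals.push(
      r.output.toLowerCase().includes(c.expectedOutput.toLowerCase()) ? 1 : 0
    );
  }

  if (c.expectedToolCalls !== undefined) {
    // Stand-in for trajectoryScore: fraction of expected tools seen in order.
    let matched = 0;
    for (const tool of r.toolCalls ?? []) {
      if (tool === c.expectedToolCalls[matched]) matched++;
    }
    signals.push(
      c.expectedToolCalls.length === 0 ? 1 : matched / c.expectedToolCalls.length
    );
  }

  if (c.expectedRetrievedIds !== undefined) {
    // Stand-in for precision@5 over the retrieved IDs.
    const topK = (r.retrievedIds ?? []).slice(0, 5);
    const relevant = new Set(c.expectedRetrievedIds);
    signals.push(
      topK.length === 0
        ? 0
        : topK.filter((id) => relevant.has(id)).length / topK.length
    );
  }

  // No expectations at all → neutral 1; otherwise average the active signals.
  if (signals.length === 0) return 1;
  return signals.reduce((a, b) => a + b, 0) / signals.length;
}
```

A case expecting "within 30 days" against the output "Refunds are honored WITHIN 30 DAYS." scores 1 on the output signal; adding an expectedRetrievedIds signal that scores 0.5 would bring the case average to 0.75.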

API overview

  • Dataset: loadGoldenDatasetFromJson; types GoldenDataset, GoldenCase
  • Runner: runGoldenDataset, CaseRunner
  • IR metrics: precisionAtK, recallAtK, meanReciprocalRank, ndcgAtK
  • RAG: evaluateRetrieval, answerContextOverlap
  • Agent: trajectoryScore, toolCallAccuracy, AgentTrajectory
  • LLM judge: parseJudgeScore, buildRelevanceJudgePrompt, buildFaithfulnessJudgePrompt, LLMJudgeFn
  • CI: reportEvalForCi, CiReporterOptions
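For intuition, the classical IR metrics can be sketched as pure functions over a ranked list of IDs. These are simplified reimplementations for illustration only; the library's own signatures and edge-case handling may differ:

```typescript
// precision@k: fraction of the top-k retrieved IDs that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.length === 0 ? 0 : top.filter((id) => relevant.has(id)).length / top.length;
}

// recall@k: fraction of all relevant IDs found in the top-k.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// MRR: reciprocal rank of the first relevant hit (0 if none).
function meanReciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const rank = retrieved.findIndex((id) => relevant.has(id));
  return rank === -1 ? 0 : 1 / (rank + 1);
}

// NDCG@k with binary relevance: discounted gain over the ideal ordering.
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

For retrieved = [a, b, c] and relevant = {a, x} at k = 5, this gives precision ≈ 0.33, recall = 0.5, and MRR = 1.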

Golden dataset JSON

{
  "name": "support-bot",
  "version": "1.0.0",
  "cases": [
    {
      "id": "refund-1",
      "input": "How do I get a refund?",
      "expectedOutput": "within 30 days",
      "expectedToolCalls": ["lookup_policy"],
      "expectedRetrievedIds": ["policy-refunds"]
    }
  ]
}

Optional metadata per case is passed through to the runner for fixtures or A/B flags.
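For example, a case carrying pass-through metadata might look like this (the metadata field name and its contents are assumptions for illustration, not a documented schema):

```json
{
  "id": "refund-2",
  "input": "Can I return an opened item?",
  "expectedOutput": "restocking fee",
  "metadata": { "fixture": "eu-region", "variant": "B" }
}
```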

CLI smoke check

The HazelJS CLI can load a file and run the eval harness with a placeholder runner (echo-style) to validate JSON shape and wiring:

hazel eval ./eval/golden.json
hazel eval ./eval/golden.json --ci

For production evals, call runGoldenDataset from a script with your real HazelAI / agent / RAG service.

Recipes

Recipe: Run a golden dataset with a custom runner

// File: scripts/run-eval.ts
import * as path from 'path';
import {
  loadGoldenDatasetFromJson,
  runGoldenDataset,
  reportEvalForCi,
} from '@hazeljs/eval';

async function main() {
  const dataset = loadGoldenDatasetFromJson(
    path.join(process.cwd(), 'eval/golden.json')
  );

  const result = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      // `myApp` is a placeholder for your real RAG pipeline or agent.
      const output = await myApp.answer(input);
      return {
        output,
        // Report the tool calls and document IDs your pipeline actually produced:
        toolCalls: ['search', 'summarize'],
        retrievedIds: ['doc-1', 'doc-2'],
      };
    },
    { concurrency: 2, minAverageScore: 0.75 }
  );

  reportEvalForCi(result, { exitOnFail: process.env.CI === 'true' });
}

main().catch((e) => {
  console.error(e);
  process.exitCode = 1;
});

Recipe: Retrieval metrics only (no golden runner)

// File: src/eval/retrieval-spot-check.ts
import { evaluateRetrieval } from '@hazeljs/eval';

const metrics = evaluateRetrieval({
  query: 'What is HazelJS?',
  retrievedIds: ['a', 'b', 'c'],
  relevantIds: ['a', 'x'],
  k: 5,
});
// metrics.precisionAtK, recallAtK, mrr, ndcgAtK

Recipe: Fail CI when evals do not pass

After runGoldenDataset returns, pass the result to reportEvalForCi. With exitOnFail: true, a failed run sets process.exitCode = 1 so GitHub Actions and other CI systems mark the job red.

import { runGoldenDataset, reportEvalForCi, loadGoldenDatasetFromJson } from '@hazeljs/eval';

const dataset = loadGoldenDatasetFromJson('./eval/golden.json');
const result = await runGoldenDataset(dataset, async ({ input }) => {
  /* call your RAG or agent; return { output, toolCalls?, retrievedIds? } */
  return { output: '' };
}, { minAverageScore: 0.8 });
reportEvalForCi(result, { exitOnFail: true });