
Design — services/eval-harness/ (Phase 0 weeks 7-9)

Pre-code design memo for the headless offline runner that produces the two plots and one table the SBIR D2P2 proposal must carry:

  1. Exploitability-vs-iterations — convergence story for the solver.
  2. Wall-clock-vs-problem-size — scaling story across 3, 10, 30 effectors / 3, 10, 30, 100 tracks.
  3. Agent comparison table: SolverAgent vs scriptedAgent vs llmAgent on identical seeds, scored by bluePayoff.

Everything else in plan/sbir-osw26bz02-dv004-game-theoretic-coa.md already exists in the repo: solver kernel, daemon, agent selection, red-agent, doctrinal subroutines. The eval-harness is what turns those working pieces into evidence.

The PR ships deliberately scoped to the headless runner + report schema + score replay + two plot generators. The CI workflow (.github/workflows/eval.yml) and the operational scaling sweep (3 → 10 → 30 effectors) ship as follow-ups so reviewers see the runner working before the bench grows.

The headline acceptance gate: a reviewer runs

pnpm --filter @uci-demo/eval-harness exec tsx src/main.ts \
  --scenarios counter-uas-tripwire \
  --agents scripted,solver \
  --degrade none \
  --episodes-per-cell 3 \
  --report out/eval/smoke.json

and gets a complete EvalReport JSON with per-episode scores, plus NDJSON timelines on disk under out/eval/<runId>/, in under five minutes wall-clock.


Strand 1 — services/eval-harness/

What it does (one paragraph)

A standalone Node CLI that boots Mosquitto via the existing docker-compose.yml, then in-process imports startWorldSim() from @uci-demo/world-sim and startCopilot() from @uci-demo/copilot (both exported as service factories in PR #30), drives the resulting service triad over a parameterized sweep scenarios × agents × degrade-presets × episodes-per-cell, captures the full bus to disk, adjudicates bluePayoff from the captured wire via scoreReplay.ts, and emits a versioned EvalReport JSON. The harness is the only place in the repo that owns end-to-end reproducibility: it seeds the scenario YAML's mulberry32 RNG, the red-agent's mulberry32 RNG, and the solver's ε-greedy RNG from one master seed per episode, so re-running with the same --seed is bit-identical. No UI. No human in the loop. No external network calls unless an episode uses the llmAgent (then it pulls the same @uci-demo/llm provider the live copilot would).

What it does NOT do

  • Compute true exploitability for Tripwire. That number is what the convergence plot needs, but Tripwire's depth-50 tree makes exact best-response intractable. The harness computes local-best-response (LBR) exploitability (Lisý et al. 2014) as a tractable proxy and ships the exact exploitability for Kuhn poker — the latter is the gate the kernel already passes (packages/uci-solver/test/kuhn.test.ts). The Kuhn point on the same plot makes the LBR curve credible.
  • Replace the cop-ui. The harness does not subscribe any UI topics. The copilot's existing publishers (uci-demo/copilot/* and uci-demo/solver/*) are captured by the bus logger and replayed offline by scoreReplay.ts; the harness does not render them.
  • Train a new blueprint. When agents=solver, the harness boots the existing services/solver-daemon/ as a child process and lets it train against the same dynamics it always trains against, then consumes uci-demo/solver/blueprint exactly like the live copilot does. Training cadence is decoupled from episode boundaries — the daemon keeps training across episodes; each episode reads whatever blueprint is current. This matches the live demo's behavior and avoids the false story of "trained from scratch per episode."
  • Run human-in-the-loop tournaments. That's the week 9-11 SME micro-tournament, a separate workstream.
  • Persist between runs. Each --report writes a fresh tree. Aggregation across runs is a one-liner over the JSON files and is outside the harness's scope.

Service shape

services/eval-harness/
├── package.json          # workspace deps on @uci-demo/{bus,codec,game,solver,world-sim,copilot,red-agent}
├── tsconfig.json
├── src/
│   ├── main.ts           # CLI entry: arg parsing → runner.run() → write report
│   ├── cli.ts            # arg-parse helper (no third-party dep; we already avoid yargs)
│   ├── runner.ts         # core sweep loop; one process, sequential episodes
│   ├── episode.ts        # single-episode lifecycle: boot → run → drain → score → tear down
│   ├── scoreReplay.ts    # offline PayoffCounters reconstruction from captured bus
│   ├── scenarios.ts      # registry of named scenarios with synthetic-scale variants
│   ├── busLogger.ts      # subscribes WILDCARD_ALL; writes one NDJSON per episode
│   ├── seeds.ts          # one master seed → derived seeds for scenario/red/solver
│   ├── lbr.ts            # local-best-response exploitability (Tripwire)
│   ├── kuhnExploit.ts    # exact exploitability (Kuhn — credibility anchor)
│   ├── plots/
│   │   ├── exploitVsIter.ts   # reads daemon status archive; writes SVG
│   │   └── clockVsSize.ts     # reads scaling-sweep report; writes SVG
│   └── types.ts          # EvalReport, EpisodeResult, RunnerConfig
├── regression.config.json   # cells the perf workflow exercises on every label
└── test/
    ├── runner.test.ts        # 1-cell smoke: scripted on Tripwire, 1 episode, asserts report shape
    ├── scoreReplay.test.ts   # adjudicates a hand-crafted NDJSON, checks all 9 counters
    └── seeds.test.ts         # determinism: same seed twice → identical EpisodeResult

The package is private: true like every other service. No published artifacts; the test surface gates correctness.


Strand 2 — EvalReport schema (the contract)

The whole point of the harness is to emit a JSON that someone other than us can read and reproduce. The schema is versioned exactly like the solver blueprint:

// services/eval-harness/src/types.ts

export const EVAL_REPORT_VERSION = 1;

export interface EvalReport {
  schemaVersion: 1;
  runId: string;                    // ISO timestamp + 6-char nonce
  startedAt: string;                // ISO
  finishedAt: string;               // ISO
  durationMs: number;
  hostname: string;                 // os.hostname()
  nodeVersion: string;              // process.version
  gitSha: string | null;            // null if not in a git repo
  versions: {
    beliefV: number;                // BELIEF_V from @uci-demo/game
    payoffV: number;                // PAYOFF_V
    infosetV: number;               // INFOSET_V
    solverSchemaVersion: number;    // SCHEMA_VERSION
    evalReportV: 1;
  };
  config: RunnerConfig;             // exactly what was on the CLI, normalized
  cells: CellResult[];              // one per (scenario × agent × degrade) cross
  summary: {
    totalEpisodes: number;
    totalDurationMs: number;
    failedEpisodes: number;         // episodes that timed out or threw
    cellsBlueWon: number;           // cells with mean bluePayoff > 0
  };
}

export interface CellResult {
  scenario: string;
  agent: AgentSelector;             // "scripted" | "solver" | { llm: <provider> }
  degradePreset: string;
  episodes: EpisodeResult[];
  aggregate: {
    bluePayoffMean: number;
    bluePayoffStdev: number;
    bluePayoffMin: number;
    bluePayoffMax: number;
    counterMeans: PayoffCounters;   // mean across episodes, per counter
    meanTimeToDecisionMs: number;
    meanWallClockSec: number;
  };
}

export interface EpisodeResult {
  episodeId: string;                // "<cellKey>-<idx>" — also the NDJSON dir name
  seed: number;                     // master seed for this episode
  startedAt: string;
  durationMs: number;               // wall-clock from boot → tear-down
  simulatedSeconds: number;         // scenario seconds elapsed (≤ scenario.loopSeconds)
  status: "ok" | "timeout" | "error";
  errorMessage?: string;            // when status !== "ok"
  counters: PayoffCounters;         // final
  bluePayoff: number;
  redPayoff: number;
  validatorAuditPercent: number;    // from validator's /audit?n=N response
  decisionLatencies: number[];      // copilot evaluate() wall-times, ms, capped at 5000
  blueprintAtStart: {               // null if agent !== "solver" or daemon was cold
    iterations: number;
    infoSetCount: number;
    deltaNorm: number;
    variant: "es" | "os";
  } | null;
  artifacts: {
    busNdjsonPath: string;          // relative to runId dir
    daemonStatusPath: string | null;
  };
}

export interface RunnerConfig {
  scenarios: string[];              // ["counter-uas-tripwire", ...]
  agents: AgentSelector[];          // ["scripted", "solver", { llm: "anthropic" }]
  degradePresets: string[];         // ["none", "comms-flap-30s", "burst-2x"]
  episodesPerCell: number;
  episodeTimeoutMs: number;         // default 600_000 (10 min wall-clock)
  scenarioSimTimeoutSec: number;    // default 180 (3 min sim-time)
  masterSeed: number;               // root seed; episodes derive from it
  brokerUrl: string;
  reportPath: string;
  outDir: string;                   // out/eval/<runId>/
  daemonWarmIterations: number;     // wait until daemon has run ≥ N iter before first solver episode
  validatorFullAudit: boolean;      // sets VALIDATOR_FULL_AUDIT=1 for the run
}

export type AgentSelector =
  | "scripted"
  | "solver"
  | { llm: "anthropic" | "ollama" | "bedrock" | "openai-compat" };

AgentSelector accommodates the three live agent paths from services/copilot/src/service.ts. The harness translates these into the env vars startCopilot() already understands (USE_SOLVER=1 or LLM_PROVIDER=…) AND injects an Agent directly via the existing StartCopilotOptions.agent slot when finer control is needed (e.g. deterministic scripted-agent in cells where we want zero env state leakage).
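
The CLI spells LLM agents as llm:<provider> while the type uses an object, so some translation step is implied. A minimal sketch of that step, assuming hypothetical helper names parseAgentSelector and agentLabel (neither appears in the repo as far as this memo states):

```typescript
// Mirrors the AgentSelector union from types.ts so the sketch is self-contained.
type LlmProvider = "anthropic" | "ollama" | "bedrock" | "openai-compat";
type AgentSelector = "scripted" | "solver" | { llm: LlmProvider };

const LLM_PROVIDERS: readonly LlmProvider[] = [
  "anthropic",
  "ollama",
  "bedrock",
  "openai-compat",
];

// Hypothetical helper: turns one --agents token into an AgentSelector.
function parseAgentSelector(token: string): AgentSelector {
  const t = token.trim();
  if (t === "scripted" || t === "solver") return t;
  if (t.startsWith("llm:")) {
    const provider = t.slice("llm:".length);
    const match = LLM_PROVIDERS.find((p) => p === provider);
    if (match !== undefined) return { llm: match };
    throw new Error(
      `unknown llm provider "${provider}"; expected one of ${LLM_PROVIDERS.join(", ")}`,
    );
  }
  throw new Error(`unknown agent "${t}"; expected scripted, solver, or llm:<provider>`);
}

// Hypothetical inverse: a stable string form for cell keys and report labels.
function agentLabel(agent: AgentSelector): string {
  return typeof agent === "string" ? agent : `llm:${agent.llm}`;
}
```

Keeping both directions in one place would let the cell key, the NDJSON directory name, and the CLI all agree on a single spelling.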

Why version this

PAYOFF_V and BELIEF_V already bump when the math changes (PR #33 took PAYOFF_V from 1 → 2). The EvalReport carries them as a freeze-frame so the report can be diffed across repo HEADs. Adding a counter or reweighting requires bumping PAYOFF_V; the report faithfully records which world it was scored under. The plot generator refuses to mix reports with different payoffV.
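
The refusal to mix reports can be a small guard in front of the plot generators. A sketch, assuming a hypothetical function name assertSamePayoffV and only the two report fields the check needs:

```typescript
// Only the fields this check reads; the full EvalReport carries more.
interface VersionedReport {
  runId: string;
  versions: { payoffV: number };
}

// Hypothetical guard: returns the shared payoffV, or throws if the
// reports were scored under different PAYOFF_V values.
function assertSamePayoffV(reports: VersionedReport[]): number {
  const first = reports[0];
  if (first === undefined) throw new Error("no reports to plot");
  for (const report of reports) {
    if (report.versions.payoffV !== first.versions.payoffV) {
      throw new Error(
        `payoffV mismatch: ${first.runId} has ${first.versions.payoffV}, ` +
          `${report.runId} has ${report.versions.payoffV}`,
      );
    }
  }
  return first.versions.payoffV;
}
```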


Strand 3 — Per-episode lifecycle (episode.ts)

One episode = one (scenario, agent, degradePreset, seed) tuple. The sequence below is bracketed by a try / finally so any failure still hits tear-down. Each step has a wall-clock budget so a hung service can't poison a sweep.

1.  derive seeds from master (deriveSeeds in seeds.ts):
      { scenario: scenarioSeed,
        red: redSeed,
        solverEpsilon: solverSeed } = deriveSeeds(master, episodeIdx)

2.  start NDJSON bus logger
      subscribe WILDCARD_ALL, write each {topic, ts, payload?} line.
      payload included only when length ≤ 4 KiB; longer payloads
      noted as { topic, ts, payloadLen, payloadSha256 }.

3.  publish degrade-preset events
      (uci-demo/world/degrade) with retained=false; presets are
      named in scenarios.ts, e.g. "comms-flap-30s" = 30 s at lossRate 0.5.

4.  startWorldSim({ installSignalHandlers: false,
                    scenarioPath: scenarioFile,
                    scenarioSeed })
      (needs a thin extension: StartWorldSimOptions.scenarioSeed is
       forwarded as seedOverride to loadScenarioFromFile in
       scenario.ts; additive, default unchanged)

5.  startCopilot({ installSignalHandlers: false,
                   agent: buildAgent(agentSelector, ctx) })
      where buildAgent selects between:
        "scripted"          → import { scriptedAgent } from copilot/scriptedAgent.js
        "solver"            → createSolverAgent (subscribes the live
                              uci-demo/solver/blueprint that the
                              already-running daemon publishes)
        { llm: provider }   → makeLlmAgent(createClient({ provider, ... }))

6.  if agentSelector === "solver" and daemon not running:
      childProcess.spawn("pnpm", ["--filter", "@uci-demo/solver-daemon", "dev"],
                         { env: { ...process.env, SOLVER_SEED: String(solverSeed) }})
      wait until uci-demo/solver/status iterations ≥ daemonWarmIterations
      (10 min cap before giving up)
      daemon is reused across episodes within a single solver-cell; it
      is stopped at cell boundary, not episode boundary.

7.  if agentSelector starts with "red:" or sweep includes red:
      childProcess.spawn("pnpm", ["--filter", "@uci-demo/red-agent", "dev"],
                         { env: { ...process.env, RED_AGENT_SEED: String(redSeed) }})

8.  loop until simulatedSeconds ≥ scenarioSimTimeoutSec
        or counters indicate terminal state
        or wall-clock ≥ episodeTimeoutMs.

9.  scrape final counters from uci-demo/copilot/score
      (retained — last published is authoritative)
      AND re-adjudicate via scoreReplay.ts to catch any drift.
      Discrepancy → log warning, prefer scoreReplay (offline truth).

10. drain — wait 2s for in-flight publishes, close bus logger.

11. fetch http://127.0.0.1:7700/audit?n=500 → validatorAuditPercent.

12. tear-down: copilotHandle.stop(); worldSimHandle.stop().
       red-agent / daemon survive to next episode in the same cell.

13. write EpisodeResult to the in-memory cell list; flush partial
       report to disk every cell boundary (crash safety).
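
Step 2's line format (inline payload up to 4 KiB, length plus SHA-256 above that) could be produced along these lines. This is a sketch under assumptions the memo does not settle: ts is taken to be epoch milliseconds, payloads are stored as UTF-8 text, and toBusLogLine is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

const MAX_INLINE_BYTES = 4 * 1024; // 4 KiB threshold from step 2

type BusLogLine =
  | { topic: string; ts: number; payload: string }
  | { topic: string; ts: number; payloadLen: number; payloadSha256: string };

// Hypothetical serializer: one NDJSON line per bus message.
function toBusLogLine(topic: string, ts: number, payload: Uint8Array): string {
  const record: BusLogLine =
    payload.byteLength <= MAX_INLINE_BYTES
      ? // Assumption: payloads are UTF-8 text; binary payloads would need base64.
        { topic, ts, payload: new TextDecoder().decode(payload) }
      : {
          topic,
          ts,
          payloadLen: payload.byteLength,
          payloadSha256: createHash("sha256").update(payload).digest("hex"),
        };
  return JSON.stringify(record) + "\n";
}
```

Score replay can only use inlined payloads, so any topic that feeds a counter needs to stay under the threshold or be exempted from it.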

Timeout discipline

  • scenarioSimTimeoutSec defaults to 180 (Tripwire loops at 180s).
  • episodeTimeoutMs defaults to 600 000 (10 min), roughly 3.3 times the 180 s sim cap, to cover boot, daemon warm-wait, and tear-down.
  • A timed-out episode emits status: "timeout", partial counters, and the bus NDJSON is still kept (it's the only forensic trail).
  • Three consecutive timeouts in one cell → harness aborts the cell and continues to the next. The summary records failedEpisodes.
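
The abort rule in the last bullet reduces to a check over the statuses recorded so far in a cell. A sketch with hypothetical helper names:

```typescript
type EpisodeStatus = "ok" | "timeout" | "error";

const MAX_CONSECUTIVE_TIMEOUTS = 3;

// Hypothetical helper: true once the most recent episodes in the cell
// are all timeouts. An "ok" or "error" in between resets the streak.
function shouldAbortCell(statuses: EpisodeStatus[]): boolean {
  if (statuses.length < MAX_CONSECUTIVE_TIMEOUTS) return false;
  return statuses
    .slice(-MAX_CONSECUTIVE_TIMEOUTS)
    .every((status) => status === "timeout");
}

// Hypothetical helper: what summary.failedEpisodes would count for one cell.
function countFailed(statuses: EpisodeStatus[]): number {
  return statuses.filter((status) => status !== "ok").length;
}
```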

Why a child-process daemon + same-process services

The daemon already exists as a long-lived service with its own bus connection and its own training loop. Running it in-process would require either rebuilding the daemon's interface or pulling the solver kernel into the harness process — both bigger lifts than just spawning the existing service. The world-sim and copilot, by contrast, were already factored into start* factories in PR #30 precisely so the harness can boot them in-process. That keeps episode boundaries cheap (no docker churn between episodes within a cell) and keeps the daemon's training continuity intact across episodes.


Strand 4 — Determinism (seeds.ts)

The contract: same --seed N → identical EpisodeResult (modulo wall-clock fields, which are documented as non-deterministic).

import { mulberry32 } from "./mulberry32.js";

export interface DerivedSeeds {
  scenario: number;
  red: number;
  solverEpsilon: number;
}

export function deriveSeeds(master: number, episodeIdx: number): DerivedSeeds {
  // mulberry32 cycles long enough that we can index into it for
  // each role without correlation, but to be safe each role gets
  // its own keyed stream: hash(master, role, episodeIdx).
  const key = (role: number) =>
    Math.imul(master ^ 0x9e3779b9, role + 1) + episodeIdx * 0x85ebca6b;
  return {
    scenario: key(0) >>> 0,
    red: key(1) >>> 0,
    solverEpsilon: key(2) >>> 0,
  };
}

Seed plumbing checklist

| Consumer | Today | Eval-harness needs |
| --- | --- | --- |
| scenario.ts | mulberry32 hardcoded in some events | accept seedOverride in loadScenarioFromFile; thread through |
| services/red-agent/policies/scripted.ts | RED_AGENT_SEED env | unchanged; harness sets env on child spawn |
| packages/uci-solver/src/osCfr.ts | ε-greedy Math.random() | bring in an injectable RNG arg (additive, defaults to Math.random) |
| solver-daemon | seeds the solver via env | accept SOLVER_SEED env; daemon passes through to iterate() |
| services/copilot/src/llmAgent.ts | sampling temp + seed in client | not deterministic; llmAgent cells are advisory and the report flags them with nonDeterministic: true |

The osCfr RNG change is the only subtle one. It's a 5-line change to take an optional rng?: () => number parameter, defaulted. Kuhn correctness test gets a deterministic-seed variant added so the exploitability ≤ 0.05 gate is now reproducible bit-for-bit.
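
The shape of that change, sketched outside the real kernel. mulberry32 below is the widely circulated 32-bit generator; sampleAction is a hypothetical stand-in for whatever osCfr samples, and mixing a uniform distribution into the strategy with weight epsilon is an assumption about what "ε-greedy" means there.

```typescript
// Standard mulberry32: 32-bit state, returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical sampler with an injectable RNG. Callers that pass
// nothing keep today's Math.random behavior.
function sampleAction(
  strategy: number[],
  epsilon: number,
  rng: () => number = Math.random,
): number {
  const n = strategy.length;
  if (n === 0) throw new Error("empty strategy");
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < n; i++) {
    // Probability of action i: epsilon spread uniformly, the rest on-policy.
    cumulative += epsilon / n + (1 - epsilon) * strategy[i];
    if (u < cumulative) return i;
  }
  return n - 1; // guards against floating-point shortfall in the sum
}
```

Two samplers built from mulberry32 with the same seed produce the same action sequence, which is the property the deterministic Kuhn variant would rely on.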


Strand 5 — Score replay (scoreReplay.ts)

The live copilot publishes uci-demo/copilot/score continuously and the final retained value is normally authoritative. But the harness needs to be able to re-score a captured episode offline — both as a cross-check on the live value and to re-score historical runs after a PAYOFF_V bump.

export function scoreReplayFromNdjson(
  ndjsonPath: string,
  payoffV: number = PAYOFF_V,
): { counters: PayoffCounters; bluePayoff: number; redPayoff: number };

Implementation is straightforward: stream the NDJSON, accumulate counters using the same rules scoreMirror.ts uses online (PR #33's 9-counter set), and call bluePayoff(counters) at the end. The same rules are re-implemented here intentionally — scoreMirror.ts lives in services/copilot/src/ and pulling it into a library would be gratuitous coupling. The two implementations are unit-tested against each other on a shared NDJSON fixture so they cannot silently drift.

If the caller passes a payoffV that does not match the constants in the bundled @uci-demo/game, scoreReplayFromNdjson throws — same discipline as deserializeBlueprint.
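
A skeleton of the replay fold, with the version guard up front. The actual nine counter rules live in scoreMirror.ts and are not reproduced here; the Rule type, the replayCounters name, and the local PAYOFF_V constant are placeholders for illustration.

```typescript
// Placeholder for the constant exported by @uci-demo/game.
const PAYOFF_V = 2;

type Counters = Record<string, number>;

interface BusRecord {
  topic: string;
  ts: number;
  payload?: string; // absent when the logger stored only length + hash
}

// A rule inspects one record and mutates the running counters.
type Rule = (counters: Counters, record: BusRecord) => void;

// Hypothetical core of scoreReplay: parse NDJSON text, apply every rule
// to every record in capture order, return the final counters.
function replayCounters(
  ndjson: string,
  rules: Rule[],
  payoffV: number = PAYOFF_V,
): Counters {
  if (payoffV !== PAYOFF_V) {
    throw new Error(`payoffV ${payoffV} does not match bundled PAYOFF_V ${PAYOFF_V}`);
  }
  const counters: Counters = {};
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue; // tolerate a trailing newline
    const record = JSON.parse(line) as BusRecord;
    for (const rule of rules) rule(counters, record);
  }
  return counters;
}
```

Expressing rules as plain functions over records would also make the shared-fixture test straightforward: the same fixture can be fed to this fold and to the online mirror.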


Strand 6 — Sweep semantics (runner.ts)

The runner is a deliberately boring nested loop:

for scenario in config.scenarios:
  for agentSelector in config.agents:
    for degradePreset in config.degradePresets:
      # one cell. solver-daemon is started here if agent === "solver",
      # stopped at cell exit. red-agent likewise if in scope.
      for episodeIdx in 0..config.episodesPerCell:
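        # agentSelector is left out of the derivation so every agent
        # in the comparison table is scored on identical seeds
        seed = deriveEpisodeSeed(masterSeed, scenario,
                                 degradePreset, episodeIdx)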
        episode = await runEpisode({ scenario, agentSelector, degradePreset,
                                     seed, ...config })
        cell.episodes.push(episode)
        flushReportToDisk(report)   # crash safety
      cell.aggregate = computeAggregate(cell.episodes)
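
The bluePayoff part of computeAggregate is plain descriptive statistics. A sketch, assuming the caller has already decided which episodes to include (the memo does not say whether timed-out episodes count) and assuming sample standard deviation with n − 1 in the denominator:

```typescript
interface BluePayoffAggregate {
  bluePayoffMean: number;
  bluePayoffStdev: number;
  bluePayoffMin: number;
  bluePayoffMax: number;
}

// Hypothetical helper covering four of the CellResult.aggregate fields.
function aggregateBluePayoff(payoffs: number[]): BluePayoffAggregate {
  const n = payoffs.length;
  if (n === 0) throw new Error("cannot aggregate an empty cell");
  const mean = payoffs.reduce((sum, x) => sum + x, 0) / n;
  // Sample standard deviation; a single episode has no spread to report.
  const squaredDeviation = payoffs.reduce((sum, x) => sum + (x - mean) ** 2, 0);
  const stdev = n > 1 ? Math.sqrt(squaredDeviation / (n - 1)) : 0;
  return {
    bluePayoffMean: mean,
    bluePayoffStdev: stdev,
    bluePayoffMin: Math.min(...payoffs),
    bluePayoffMax: Math.max(...payoffs),
  };
}
```

For payoffs [1, 2, 3] this yields mean 2, stdev 1, min 1, max 3.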

What's intentionally sequential

  • Episodes within a cell: they share daemon/red state and the same bus broker; parallelism would require multi-broker support, which is a Phase II problem.
  • Cells across agents — the daemon is per-cell, and llm cells make rate-limited external calls.

What can be parallelized later

  • Outer scenario loop, once we ship multi-broker support (Phase II).
  • Read-only score-replay across an existing run — already pure function, trivially parallelizable.

This memo does not introduce parallelism. The 3-effector Tripwire sweep on a 4-core developer machine takes ~45 minutes for agents=[scripted, solver] × episodes=10 × degrade=[none, comms-flap-30s], which is the smoke-able size for PRs.


Strand 7 — Local Best Response exploitability (lbr.ts)

Tractable proxy for true exploitability on tactical-scale games. Lisý, Lanctot, Bowling (2014); Davis et al. (2014). The idea: hold the trained strategy fixed for one player, locally best-respond at each information set the other player visits during play, and measure the gap. Bounds true exploitability from below; correlates strongly in practice.

export function localBestResponseGap(
  blueprint: Blueprint,
  dynamics: GameDynamics,
  rng: () => number,
  iterations: number,
): { bluePayoffVsLbrRed: number; redPayoffVsLbrBlue: number; gap: number };

gap = (bluePayoffVsLbrRed + redPayoffVsLbrBlue) / 2. Lower is better. We plot gap against blueprint iteration count by reading the per-iteration blueprint snapshots the daemon archives to disk during a long training run (additive: daemon gains an opt-in SOLVER_ARCHIVE_INTERVAL=1000 env that writes blueprint snapshots to out/blueprint-archive/<iter>.json).

The Kuhn point is computed via kuhnExploit.ts, which computes exact exploitability in closed form (the same code the existing packages/uci-solver/test/kuhn.test.ts uses). Plotting both on the same axis gives reviewers an anchor: "Kuhn converges to known-correct 0.006; LBR-gap on Tripwire drops monotonically and approaches a plateau under the same kernel."


Strand 8 — Plot generators (plots/*.ts)

SVG-only. No matplotlib subprocess, no headless Chromium. We render two SVGs directly from typed-array data with one small helper (packages/uci-game already lives without plotting deps; the harness can take a single tiny zero-dep helper or just emit raw SVG strings). Rationale: SVG renders deterministically, is text-diffable in code review, and embeds cleanly in the white paper.

exploitVsIter({
  series: { kuhn: KuhnPoint[]; tripwire: LbrPoint[] },
  out: "out/eval/<runId>/exploit-vs-iter.svg",
  width: 800, height: 500,
});

clockVsSize({
  points: { size: number; wallClockSec: number; gapAtEnd: number }[],
  out: "out/eval/<runId>/clock-vs-size.svg",
  width: 800, height: 500,
});

Both functions also emit the underlying CSV alongside the SVG so the proposal author can re-style in any tool.
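
Emitting raw SVG strings needs very little code. A sketch of a single-series line plot plus its CSV twin, with hypothetical names renderLineSvg and toCsv and no axes or labels:

```typescript
interface PlotPoint {
  x: number;
  y: number;
}

// Hypothetical renderer: scales points to the viewport and emits one
// polyline. Fixed-precision coordinates keep the output byte-stable.
function renderLineSvg(points: PlotPoint[], width: number, height: number): string {
  if (points.length === 0) throw new Error("no points to plot");
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  const xMin = Math.min(...xs);
  const xMax = Math.max(...xs);
  const yMin = Math.min(...ys);
  const yMax = Math.max(...ys);
  const scaleX = (x: number): number =>
    xMax === xMin ? width / 2 : ((x - xMin) / (xMax - xMin)) * width;
  // SVG y grows downward, so flip to put larger values higher.
  const scaleY = (y: number): number =>
    yMax === yMin ? height / 2 : height - ((y - yMin) / (yMax - yMin)) * height;
  const coords = points
    .map((p) => `${scaleX(p.x).toFixed(2)},${scaleY(p.y).toFixed(2)}`)
    .join(" ");
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" ` +
    `viewBox="0 0 ${width} ${height}">` +
    `<polyline fill="none" stroke="black" points="${coords}"/></svg>`
  );
}

// Hypothetical CSV twin written next to the SVG.
function toCsv(points: PlotPoint[]): string {
  return ["x,y", ...points.map((p) => `${p.x},${p.y}`)].join("\n") + "\n";
}
```

Because the output is a pure function of the input points, two runs over the same report produce identical bytes, which is what makes the SVG reviewable as a text diff.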


Strand 9 — CLI (main.ts, cli.ts)

No third-party arg parser. The existing codebase parses env vars and the few CLI surfaces it has are bespoke. Pattern:

tsx services/eval-harness/src/main.ts \
  --scenarios counter-uas-tripwire,tripwire-10effector \
  --agents scripted,solver \
  --degrade none,comms-flap-30s \
  --episodes-per-cell 5 \
  --seed 42 \
  --report out/eval/2026-05-21-smoke.json

Flags:

| Flag | Default | Meaning |
| --- | --- | --- |
| --scenarios | counter-uas-tripwire | comma list of scenario names registered in scenarios.ts |
| --agents | scripted | comma list; scripted, solver, llm:<provider> |
| --degrade | none | comma list of named presets in scenarios.ts |
| --episodes-per-cell | 3 | positive int |
| --seed | 42 | master seed |
| --episode-timeout-ms | 600000 | per-episode wall-clock cap |
| --scenario-sim-timeout-sec | 180 | per-episode sim-time cap |
| --daemon-warm-iter | 1000 | wait until daemon iter ≥ N before first solver episode |
| --broker-url | mqtt://127.0.0.1:1883 | override broker |
| --report | out/eval/<isots>.json | report output path |
| --validator-full-audit | false | sets VALIDATOR_FULL_AUDIT=1 |
| --no-boot-broker | false | assume broker is up; skip docker compose up -d |
| --dry-run | false | print the cell matrix that would run, then exit |

A --help lists them. Unknown flags abort with a clear error.
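
A dependency-free parser for this flag set can stay small. The sketch below covers only a subset of the flags and uses hypothetical helper names; it is meant to show the unknown-flag and missing-value handling, not the final cli.ts.

```typescript
// Subset of the flags: value flags consume the next token, boolean flags do not.
const VALUE_FLAGS = new Set([
  "--scenarios",
  "--agents",
  "--degrade",
  "--episodes-per-cell",
  "--seed",
  "--report",
]);
const BOOLEAN_FLAGS = new Set(["--dry-run", "--no-boot-broker", "--validator-full-audit"]);

// Hypothetical parser: returns raw values keyed by flag, throws on anything unknown.
function parseArgs(argv: string[]): Map<string, string | true> {
  const parsed = new Map<string, string | true>();
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (BOOLEAN_FLAGS.has(flag)) {
      parsed.set(flag, true);
      continue;
    }
    if (!VALUE_FLAGS.has(flag)) throw new Error(`unknown flag: ${flag}`);
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`missing value for ${flag}`);
    }
    parsed.set(flag, value);
    i++; // skip the value we just consumed
  }
  return parsed;
}

// Hypothetical converters for the two value shapes in the table.
function commaList(raw: string): string[] {
  return raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

function positiveInt(raw: string, flag: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) throw new Error(`${flag} must be a positive int, got "${raw}"`);
  return n;
}
```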


Strand 10 — Scenario registry (scenarios.ts)

Tactical: the canonical counter-uas-tripwire.yaml.

Synthetic-scale: programmatic generators that emit larger Tripwire-shaped YAML strings into a tmpdir at sweep start. The naming convention:

tripwire-3effector      ← canonical, 3 Blue effectors / 3-7 tracks/loop
tripwire-10effector     ← 10 effectors / 10-20 tracks/loop
tripwire-30effector     ← 30 effectors / 30-50 tracks/loop
tripwire-100effector    ← 100 effectors / 100-200 tracks/loop (operational bench)

Each variant is a deterministic function of one integer (effectors) and the same FOB center. The generator lives in scenarios.ts; it emits the scenario YAML to os.tmpdir()/<scenario-name>.yaml and hands the path to startWorldSim(). Synthetic scenarios are deterministically reproducible from the scenario name alone — the generator does not take a seed; same-name → same-bytes.
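
Since the name is the only input, the generator's first step is recovering the effector count from it. A sketch with hypothetical helper names; the YAML body itself is not shown because its schema is not specified in this memo.

```typescript
// Matches the naming convention above, e.g. "tripwire-30effector".
const SCENARIO_NAME = /^tripwire-(\d+)effector$/;

// Hypothetical helper: extracts the effector count that parameterizes a variant.
function effectorCountFromName(name: string): number {
  const match = SCENARIO_NAME.exec(name);
  if (match === null) {
    throw new Error(`not a synthetic scenario name: "${name}"`);
  }
  const count = Number(match[1]);
  if (count < 1) throw new Error(`effector count must be at least 1, got ${count}`);
  return count;
}

// Hypothetical helper: file name under the temp directory for a variant.
function scenarioFileName(name: string): string {
  effectorCountFromName(name); // validates the name before it is used in a path
  return `${name}.yaml`;
}
```

Validating the name before using it in a path also keeps arbitrary --scenarios input from escaping the temp directory.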

Degrade presets

export const DEGRADE_PRESETS: Record<string, DegradeEvent[]> = {
  none: [],
  "comms-flap-30s": [
    { atSec: 60, durationSec: 30, lossRate: 0.5 },
  ],
  "burst-2x": [
    { atSec: 30, durationSec: 10, lossRate: 0.9 },
    { atSec: 120, durationSec: 10, lossRate: 0.9 },
  ],
  "blackout-15s": [
    { atSec: 90, durationSec: 15, lossRate: 1.0 },
  ],
};

The harness publishes these to uci-demo/world/degrade directly; the copilot integrates over them for commsDegradeSeconds just like in the live demo.
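
For fixtures and sanity checks it helps to know how many degraded seconds a preset should produce. The sketch below computes that from the preset alone; it is an assumption about what "integrates" means (any second with lossRate > 0 counts once, overlaps are not double-counted), not a description of the copilot's code.

```typescript
interface DegradeEvent {
  atSec: number;
  durationSec: number;
  lossRate: number;
}

// Hypothetical helper: total seconds with lossRate > 0 inside
// [0, simSeconds], with overlapping events merged.
function degradedSeconds(events: DegradeEvent[], simSeconds: number): number {
  const intervals = events
    .filter((e) => e.lossRate > 0)
    .map((e): [number, number] => [
      Math.max(0, e.atSec),
      Math.min(simSeconds, e.atSec + e.durationSec),
    ])
    .filter(([start, end]) => end > start)
    .sort((a, b) => a[0] - b[0]);

  let total = 0;
  let coveredUntil = -Infinity;
  for (const [start, end] of intervals) {
    const from = Math.max(start, coveredUntil); // skip the part already counted
    if (end > from) {
      total += end - from;
      coveredUntil = end;
    }
  }
  return total;
}
```

Under these assumptions "burst-2x" yields 20 degraded seconds over a 180 s episode and "comms-flap-30s" yields 30.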


Strand 11 — Test surface

test/runner.test.ts
  - boots a real broker (test/utils/broker.ts wraps docker-compose)
  - runs one cell: scripted on Tripwire, 1 episode, episode-timeout 30s
  - asserts:
      report.schemaVersion === 1
      cells.length === 1
      cells[0].episodes.length === 1
      episode.status === "ok"
      episode.counters has all 9 keys
      episode.bluePayoff is finite
      report.versions.payoffV === PAYOFF_V

test/scoreReplay.test.ts
  - loads test/fixtures/episode-fixture.ndjson (50-message hand-crafted)
  - asserts every counter matches a known expected value
  - asserts bluePayoff matches the closed-form against PAYOFF_WEIGHTS

test/seeds.test.ts
  - runs two episodes with the same seed back-to-back
  - asserts deep-equal on counters, bluePayoff,
    and decisionLatencies.length (latencies themselves vary by clock)

The runner.test.ts smoke needs docker; it's gated by a SKIP_BROKER_TESTS=1 env so CI without docker can skip without red-marking. Local default: runs. Vitest timeout: 120 sec.


Critical files

New files

  • services/eval-harness/package.json
  • services/eval-harness/tsconfig.json
  • services/eval-harness/src/main.ts
  • services/eval-harness/src/cli.ts
  • services/eval-harness/src/runner.ts
  • services/eval-harness/src/episode.ts
  • services/eval-harness/src/scoreReplay.ts
  • services/eval-harness/src/scenarios.ts
  • services/eval-harness/src/busLogger.ts
  • services/eval-harness/src/seeds.ts
  • services/eval-harness/src/lbr.ts
  • services/eval-harness/src/kuhnExploit.ts
  • services/eval-harness/src/plots/exploitVsIter.ts
  • services/eval-harness/src/plots/clockVsSize.ts
  • services/eval-harness/src/types.ts
  • services/eval-harness/regression.config.json
  • services/eval-harness/test/runner.test.ts
  • services/eval-harness/test/scoreReplay.test.ts
  • services/eval-harness/test/seeds.test.ts
  • services/eval-harness/test/fixtures/episode-fixture.ndjson
  • services/eval-harness/test/utils/broker.ts

Edited files (additive, default behavior preserved)

  • services/world-sim/src/scenario.ts: loadScenarioFromFile accepts an optional { seedOverride?: number } arg; threads through to the existing mulberry32 sites; default unchanged.
  • services/world-sim/src/service.ts: StartWorldSimOptions gains scenarioSeed?: number; forwarded to loadScenarioFromFile.
  • services/solver-daemon/src/main.ts — reads SOLVER_SEED env; passes through to iterate() via the new rng parameter.
  • services/solver-daemon/src/daemon.ts — optional SOLVER_ARCHIVE_INTERVAL env; writes blueprint snapshots to out/blueprint-archive/<iter>.json when set.
  • packages/uci-solver/src/escfr.ts: iterate(opts) gains optional rng?: () => number; threaded through to osCfr / es-cfr internal RNGs. Default Math.random.
  • packages/uci-solver/test/kuhn.test.ts — gains a it("converges deterministically with injected seed") block.
  • package.json (root) — pnpm run eval runs the harness against the smoke cell; pnpm run eval:full runs the full sweep (10×3×3 = 90 episodes, ~3 hours on developer hardware).
  • README.md — Eval Harness section pointing at pnpm run eval.

Out of scope for this PR (named so reviewers don't ask)

  • .github/workflows/eval.yml — perf workflow runs on solver-perf label. Lands in a follow-up PR after the runner is in main and we've measured the broker-boot cost on the GitHub runner.
  • Operational scaling sweep across 10 / 30 / 100 effectors. The synthetic-scale generators ship; an actual recorded sweep on self-hosted hardware is the week 8-10 workstream.
  • SME micro-tournament harness. Different shape — needs a human-in-the-loop replay UI, not a headless runner.
  • Persistent eval database. The harness emits self-contained JSON files; aggregation across runs is grep/jq.

Existing utilities to reuse (do not duplicate)

| Utility | Where | Use |
| --- | --- | --- |
| connectBus, WILDCARD_ALL | @uci-demo/bus | bus logger subscribes; episode driver publishes degrade |
| startWorldSim / startCopilot | @uci-demo/world-sim, @uci-demo/copilot | episode boot; installSignalHandlers: false |
| loadScenarioFromFile | services/world-sim/src/scenario.ts | synthetic scenario YAML round-trip |
| bluePayoff, redPayoff, PAYOFF_WEIGHTS, PAYOFF_V | @uci-demo/game | offline scoring + report versioning |
| deserializeBlueprint, SCHEMA_VERSION | @uci-demo/solver | report stamps blueprintAtStart from a freshly-deserialized payload |
| scriptedAgent, makeLlmAgent, createSolverAgent | services/copilot/src/* | direct agent injection |
| createClient, selectClientConfigFromEnv | @uci-demo/llm | LLM agent construction |
| RED_AGENT_* env vars | services/red-agent/ | seed + cadence pinning |
| XMLParser | fast-xml-parser | only used by lbr.ts if it ever needs to parse a captured EntityMT; otherwise score replay is pure JSON-side |
| xmllint-wasm validator endpoint | services/validator/ | /audit?n=500 HTTP call per episode |

Open questions (resolve before code)

  1. Does the daemon need a --scenario flag so the harness can run scaling sweeps where the daemon trains against the same synthetic scenario the world-sim is running? Currently the daemon hard-codes the Tripwire scenario path. Recommendation: accept SOLVER_SCENARIO_PATH env in this PR; the daemon already has scenario loading.

  2. Is scoreReplay.ts worth the duplication of scoreMirror.ts logic? The shared fixture test gates them; the alternative is to pull scoreMirror into @uci-demo/game as a pure function. Recommendation: defer the refactor. Duplication is small, tests catch drift, and pulling scoreMirror cleanly into @uci-demo/game is its own discussion (it touches a few topic strings that today live in services/copilot/).

  3. Local best response on Tripwire — what depth? Full-depth LBR on a 50-ply tree is itself expensive. Recommendation: ship depth-limited LBR (default depth 10), document the bound, leave full-depth for the eval workflow.

  4. Do we record decision rationale text in the EpisodeResult? The intel-rail text from uci-demo/copilot/reason/* is rich forensic data but it's also large and the LLM-agent path is non-deterministic. Recommendation: write the uci-demo/copilot/reason/* traffic to a separate reasoning.ndjson inside the episode dir; do not embed in EpisodeResult (keeps the JSON diffable). The white paper's "3 interpretability case studies" pull from that file.

  5. Episode count for the proposal plots — 3? 10? 30? Affects wall-clock and statistical credibility. Recommendation: 10 per cell for the headline numbers; 3 per cell for the PR-level smoke. The white paper text states the n explicitly and the EvalReport.summary carries it.


Acceptance gates

For the PR landing this memo:

  • The following command completes in under five minutes and writes a valid EvalReport:

      pnpm --filter @uci-demo/eval-harness exec tsx src/main.ts \
        --scenarios counter-uas-tripwire --agents scripted \
        --episodes-per-cell 2 --report /tmp/smoke.json
  • pnpm -r typecheck and pnpm -r test both green.
  • runner.test.ts smoke passes against a real broker.
  • scoreReplay.test.ts and scoreMirror.ts agree to within floating-point error on the shared fixture.
  • seeds.test.ts round-trip identical on two consecutive runs.
  • No mutation of services/world-sim/, services/copilot/, services/solver-daemon/ beyond the additive edits listed above.

For the follow-up PRs (named so they don't get dropped):

  • .github/workflows/eval.yml: the solver-perf label triggers a 3-cell tactical bench; failures comment on the PR.
  • Scaling sweep recorded: a real run of --scenarios tripwire-3effector,tripwire-10effector,tripwire-30effector --agents solver --episodes-per-cell 10 on self-hosted hardware, results checked in under docs/benchmarks/.
  • LBR convergence plot (exploit-vs-iter.svg) checked in next to the blueprint archive that produced it.

Why this design over alternatives

Why not a Python harness? Every other piece in the repo is TypeScript on tsx. A Python sidecar would introduce a second package manager, a second test runner, and a second CI lane. The harness has zero numerical work that TypeScript can't do — the solver kernel is already TS, plot generation is SVG strings.

Why not run everything in one process? The solver-daemon is already a service. Inlining it into the harness would mean either duplicating its training loop or surgically extracting it — both worse than a child_process.spawn of an existing entrypoint. The copilot + world-sim, by contrast, were designed to be embedded (PR #30 service factories), so they run in-process.

Why not just use the cop-ui replay panel? That's a human-driven forensic tool, not a sweep runner. Different shape, different audience. The eval harness is upstream of that — it produces the captures that the replay panel can later display, but it does not depend on the panel existing.

Why version EvalReport from day one? Because we already know we'll bump PAYOFF_V and BELIEF_V again. The white paper carries specific numbers from specific reports; the schema version is what lets a reader confirm the report was scored under the same world they're reading about.


What's next after this PR

In sequence, the follow-ups that depend on this landing:

  1. CI workflow: .github/workflows/eval.yml runs the runner on a solver-perf PR label. Bench fails the PR if mean bluePayoff for solver drops more than X% vs the last recorded main value (X TBD; start with 25%).
  2. Operational scaling sweep — record one full 3/10/30 effector sweep, check in the resulting SVG + CSV under docs/benchmarks/. This is the wall-clock-vs-problem-size plot the proposal cites.
  3. SME micro-tournament — the human-in-the-loop variant of the harness; reuses scoreReplay.ts, swaps the agent for a thin approval-rail-driven shim, records human decision latencies.
  4. White paper — pulls plots from docs/benchmarks/, pulls case studies from reasoning.ndjson, pulls headline numbers from a specific EvalReport.runId. The version constants in that report are quoted in the methodology section verbatim.