Design — services/eval-harness/ (Phase 0 weeks 7-9)
Pre-code design memo for the headless offline runner that produces the two plots and one table the SBIR D2P2 proposal must carry:
- Exploitability-vs-iterations — convergence story for the solver.
- Wall-clock-vs-problem-size — scaling story across 3, 10, 30 effectors / 3, 10, 30, 100 tracks.
- Agent comparison table — SolverAgent vs scriptedAgent vs llmAgent on identical seeds, scored by bluePayoff.
Everything else in plan/sbir-osw26bz02-dv004-game-theoretic-coa.md
already exists in the repo: solver kernel, daemon, agent selection,
red-agent, doctrinal subroutines. The eval-harness is what turns those
working pieces into evidence.
The PR ships deliberately scoped to the headless runner + report
schema + score replay + two plot generators. The CI workflow
(.github/workflows/eval.yml) and the operational scaling sweep
(3 → 10 → 30 effectors) ship as follow-ups so reviewers see the
runner working before the bench grows.
The headline acceptance gate: a reviewer runs
pnpm --filter @uci-demo/eval-harness exec tsx src/main.ts \
--scenarios counter-uas-tripwire \
--agents scripted,solver \
--degrade none \
--episodes-per-cell 3 \
--report out/eval/smoke.json
and gets a complete EvalReport JSON with per-episode scores, plus
NDJSON timelines on disk under out/eval/<runId>/, in under five
minutes wall-clock.
Strand 1 — services/eval-harness/
What it does (one paragraph)
A standalone Node CLI that boots Mosquitto via the existing
docker-compose.yml, then in-process imports startWorldSim() from
@uci-demo/world-sim and startCopilot() from @uci-demo/copilot
(both exported as service factories in PR #30), drives the resulting
service triad over a parameterized sweep scenarios × agents ×
degrade-presets × episodes-per-cell, captures the full bus to disk,
adjudicates bluePayoff from the captured wire via
scoreReplay.ts, and emits a versioned EvalReport JSON. The
harness is the only place in the repo that owns end-to-end
reproducibility: it seeds the scenario YAML's mulberry32 RNG, the
red-agent's mulberry32 RNG, and the solver's ε-greedy RNG from one
master seed per episode, so re-running with the same --seed is
bit-identical (wall-clock fields aside). No UI. No human in the loop. No external network calls
unless an episode uses the llmAgent (then it pulls the same
@uci-demo/llm provider the live copilot would).
What it does NOT do
- Compute true exploitability for Tripwire. That number is what the convergence plot needs, but Tripwire's depth-50 tree makes exact best-response intractable. The harness computes local-best-response (LBR) exploitability (Lisý et al. 2014) as a tractable proxy and ships the exact exploitability for Kuhn poker — the latter is the gate the kernel already passes (packages/uci-solver/test/kuhn.test.ts). The Kuhn point on the same plot makes the LBR curve credible.
- Replace the cop-ui. The harness does not subscribe to any UI topics. The copilot's existing publishers (uci-demo/copilot/* and uci-demo/solver/*) are captured by the bus logger and replayed offline by scoreReplay.ts; the harness does not render them.
- Train a new blueprint. When agents=solver, the harness boots the existing services/solver-daemon/ as a child process and lets it train against the same dynamics it always trains against, then consumes uci-demo/solver/blueprint exactly like the live copilot does. Training cadence is decoupled from episode boundaries — the daemon keeps training across episodes; each episode reads whatever blueprint is current. This matches the live demo's behavior and avoids the false story of "trained from scratch per episode."
- Run human-in-the-loop tournaments. That's the week 9-11 SME micro-tournament, a separate workstream.
- Persist between runs. Each --report writes a fresh tree. Aggregation across runs is a one-liner over the JSON files and is outside the harness's scope.
Service shape
services/eval-harness/
├── package.json               # workspace deps on @uci-demo/{bus,codec,game,solver,world-sim,copilot,red-agent}
├── tsconfig.json
├── src/
│   ├── main.ts                # CLI entry: arg parsing → runner.run() → write report
│   ├── cli.ts                 # arg-parse helper (no third-party dep; we already avoid yargs)
│   ├── runner.ts              # core sweep loop; one process, sequential episodes
│   ├── episode.ts             # single-episode lifecycle: boot → run → drain → score → tear down
│   ├── scoreReplay.ts         # offline PayoffCounters reconstruction from captured bus
│   ├── scenarios.ts           # registry of named scenarios with synthetic-scale variants
│   ├── busLogger.ts           # subscribes WILDCARD_ALL; writes one NDJSON per episode
│   ├── seeds.ts               # one master seed → derived seeds for scenario/red/solver
│   ├── lbr.ts                 # local-best-response exploitability (Tripwire)
│   ├── kuhnExploit.ts         # exact exploitability (Kuhn — credibility anchor)
│   ├── plots/
│   │   ├── exploitVsIter.ts   # reads daemon status archive; writes SVG
│   │   └── clockVsSize.ts     # reads scaling-sweep report; writes SVG
│   └── types.ts               # EvalReport, EpisodeResult, RunnerConfig
├── regression.config.json     # cells the perf workflow exercises on every label
└── test/
    ├── runner.test.ts         # 1-cell smoke: scripted on Tripwire, 1 episode, asserts report shape
    ├── scoreReplay.test.ts    # adjudicates a hand-crafted NDJSON, checks all 9 counters
    └── seeds.test.ts          # determinism: same seed twice → identical EpisodeResult
The package is private: true like every other service. No published
artifacts; the test surface gates correctness.
Strand 2 — EvalReport schema (the contract)
The whole point of the harness is to emit a JSON that someone other than us can read and reproduce. The schema is versioned exactly like the solver blueprint:
// services/eval-harness/src/types.ts
export const EVAL_REPORT_VERSION = 1;
export interface EvalReport {
  schemaVersion: 1;
  runId: string;                 // ISO timestamp + 6-char nonce
  startedAt: string;             // ISO
  finishedAt: string;            // ISO
  durationMs: number;
  hostname: string;              // os.hostname()
  nodeVersion: string;           // process.version
  gitSha: string | null;         // null if not in a git repo
  versions: {
    beliefV: number;             // BELIEF_V from @uci-demo/game
    payoffV: number;             // PAYOFF_V
    infosetV: number;            // INFOSET_V
    solverSchemaVersion: number; // SCHEMA_VERSION
    evalReportV: 1;
  };
  config: RunnerConfig;          // exactly what was on the CLI, normalized
  cells: CellResult[];           // one per (scenario × agent × degrade) cross
  summary: {
    totalEpisodes: number;
    totalDurationMs: number;
    failedEpisodes: number;      // episodes that timed out or threw
    cellsBlueWon: number;        // cells with mean bluePayoff > 0
  };
}
export interface CellResult {
  scenario: string;
  agent: AgentSelector;          // "scripted" | "solver" | { llm: <provider> }
  degradePreset: string;
  episodes: EpisodeResult[];
  aggregate: {
    bluePayoffMean: number;
    bluePayoffStdev: number;
    bluePayoffMin: number;
    bluePayoffMax: number;
    counterMeans: PayoffCounters; // mean across episodes, per counter
    meanTimeToDecisionMs: number;
    meanWallClockSec: number;
  };
}
export interface EpisodeResult {
  episodeId: string;             // "<cellKey>-<idx>" — also the NDJSON dir name
  seed: number;                  // master seed for this episode
  startedAt: string;
  durationMs: number;            // wall-clock from boot → tear-down
  simulatedSeconds: number;      // scenario seconds elapsed (≤ scenario.loopSeconds)
  status: "ok" | "timeout" | "error";
  errorMessage?: string;         // when status !== "ok"
  counters: PayoffCounters;      // final
  bluePayoff: number;
  redPayoff: number;
  validatorAuditPercent: number; // from validator's /audit?n=N response
  decisionLatencies: number[];   // copilot evaluate() wall-times, ms, capped at 5000
  blueprintAtStart: {            // null if agent !== "solver" or daemon was cold
    iterations: number;
    infoSetCount: number;
    deltaNorm: number;
    variant: "es" | "os";
  } | null;
  artifacts: {
    busNdjsonPath: string;       // relative to runId dir
    daemonStatusPath: string | null;
  };
}
export interface RunnerConfig {
  scenarios: string[];           // ["counter-uas-tripwire", ...]
  agents: AgentSelector[];       // ["scripted", "solver", { llm: "anthropic" }]
  degradePresets: string[];      // ["none", "comms-flap-30s", "burst-2x"]
  episodesPerCell: number;
  episodeTimeoutMs: number;      // default 600_000 (10 min wall-clock)
  scenarioSimTimeoutSec: number; // default 180 (3 min sim-time)
  masterSeed: number;            // root seed; episodes derive from it
  brokerUrl: string;
  reportPath: string;
  outDir: string;                // out/eval/<runId>/
  daemonWarmIterations: number;  // wait until daemon has run ≥ N iter before first solver episode
  validatorFullAudit: boolean;   // sets VALIDATOR_FULL_AUDIT=1 for the run
}
export type AgentSelector =
  | "scripted"
  | "solver"
  | { llm: "anthropic" | "ollama" | "bedrock" | "openai-compat" };
AgentSelector accommodates the three live agent paths from
services/copilot/src/service.ts. The harness translates these into
the env vars startCopilot() already understands (USE_SOLVER=1 or
LLM_PROVIDER=…) AND injects an Agent directly via the existing
StartCopilotOptions.agent slot when finer control is needed (e.g.
deterministic scripted-agent in cells where we want zero env state
leakage).
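The CLI spells LLM agents as llm:<provider>, while the AgentSelector type carries them as an object. A minimal sketch of how the two could be bridged, assuming hypothetical helpers parseAgentSelector and selectorToEnv (neither name exists in the plan); USE_SOLVER and LLM_PROVIDER are the env vars named above.

```typescript
type LlmProvider = "anthropic" | "ollama" | "bedrock" | "openai-compat";
type AgentSelector = "scripted" | "solver" | { llm: LlmProvider };

const LLM_PROVIDERS: readonly LlmProvider[] = [
  "anthropic", "ollama", "bedrock", "openai-compat",
];

// Hypothetical helper: turn one CLI token into an AgentSelector.
function parseAgentSelector(token: string): AgentSelector {
  if (token === "scripted" || token === "solver") return token;
  if (token.startsWith("llm:")) {
    const provider = token.slice("llm:".length);
    const match = LLM_PROVIDERS.find((p) => p === provider);
    if (match !== undefined) return { llm: match };
  }
  throw new Error(`unknown agent selector: ${token}`);
}

// Hypothetical helper: env overrides the harness would hand to startCopilot().
function selectorToEnv(sel: AgentSelector): Record<string, string> {
  if (sel === "scripted") return {};
  if (sel === "solver") return { USE_SOLVER: "1" };
  return { LLM_PROVIDER: sel.llm };
}
```

Rejecting unknown tokens at parse time keeps a typo in --agents from silently producing an empty cell.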
Why version this
PAYOFF_V and BELIEF_V already bump when the math changes (PR #33
took PAYOFF_V from 1 → 2). The EvalReport carries them as a
freeze-frame so the report can be diffed across repo HEADs. Adding a
counter or reweighting requires bumping PAYOFF_V; the report
faithfully records which world it was scored under. The plot
generator refuses to mix reports with different payoffV.
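The refusal to mix payoffV values could look like the guard below. This is a sketch under assumptions: assertSamePayoffV and the trimmed ReportHeader shape are hypothetical, and only the versions.payoffV field is taken from the schema above.

```typescript
// Trimmed, hypothetical view of EvalReport: just the fields the guard reads.
interface ReportHeader {
  runId: string;
  versions: { payoffV: number };
}

// Returns the shared payoffV, or throws naming the first report that disagrees.
function assertSamePayoffV(reports: ReportHeader[]): number {
  if (reports.length === 0) throw new Error("no reports to plot");
  const expected = reports[0].versions.payoffV;
  for (const r of reports) {
    if (r.versions.payoffV !== expected) {
      throw new Error(
        `report ${r.runId} has payoffV=${r.versions.payoffV}, expected ${expected}`,
      );
    }
  }
  return expected;
}
```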
Strand 3 — Per-episode lifecycle (episode.ts)
One episode = one (scenario, agent, degradePreset, seed) tuple. The
sequence below is bracketed by a try / finally so any failure
still hits tear-down. Each step has a wall-clock budget so a
hung service can't poison a sweep.
1. derive seeds from master:
scenarioSeed = mulberry32(master, 0)
redSeed = mulberry32(master, 1)
solverSeed = mulberry32(master, 2)
2. start NDJSON bus logger
subscribe WILDCARD_ALL, write each {topic, ts, payload?} line.
payload included only when length ≤ 4 KiB; longer payloads
noted as { topic, ts, payloadLen, payloadSha256 }.
3. publish degrade-preset events
(uci-demo/world/degrade) with retained=false; presets are
named in scenarios.ts, e.g. "comms-flap-30s" = 30s 0.5 loss.
4. startWorldSim({ installSignalHandlers: false,
scenarioPath: scenarioFile,
seedOverride: scenarioSeed })
(need a thin extension to scenario.ts to accept seedOverride
— additive, default unchanged)
5. startCopilot({ installSignalHandlers: false,
agent: buildAgent(agentSelector, ctx) })
where buildAgent selects between:
"scripted" → import { scriptedAgent } from copilot/scriptedAgent.js
"solver" → createSolverAgent (subscribes the live
uci-demo/solver/blueprint that the
already-running daemon publishes)
{ llm: provider } → makeLlmAgent(createClient({ provider, ... }))
6. if agentSelector === "solver" and daemon not running:
childProcess.spawn("pnpm", ["--filter", "@uci-demo/solver-daemon", "dev"],
{ env: { ...process.env, SOLVER_SEED: String(solverSeed) }})
wait until uci-demo/solver/status iterations ≥ daemonWarmIterations
(10 min cap before giving up)
daemon is reused across episodes within a single solver-cell; it
is stopped at cell boundary, not episode boundary.
7. if agentSelector starts with "red:" or sweep includes red:
childProcess.spawn("pnpm", ["--filter", "@uci-demo/red-agent", "dev"],
{ env: { ...process.env, RED_AGENT_SEED: String(redSeed) }})
8. loop until simulatedSeconds ≥ scenarioSimTimeoutSec
or counters indicate terminal state
or wall-clock ≥ episodeTimeoutMs.
9. scrape final counters from uci-demo/copilot/score
(retained — last published is authoritative)
AND re-adjudicate via scoreReplay.ts to catch any drift.
Discrepancy → log warning, prefer scoreReplay (offline truth).
10. drain — wait 2s for in-flight publishes, close bus logger.
11. fetch http://127.0.0.1:7700/audit?n=500 → validatorAuditPercent.
12. tear-down: copilotHandle.stop(); worldSimHandle.stop().
red-agent / daemon survive to next episode in the same cell.
13. write EpisodeResult to the in-memory cell list; flush partial
report to disk every cell boundary (crash safety).
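Step 2's line format can be sketched as follows. The 4 KiB threshold and the field names come from the step above; encodeBusLine itself, the numeric ts, and storing the payload as a UTF-8 string are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

const MAX_INLINE_BYTES = 4 * 1024; // payloads up to 4 KiB are stored inline

type BusLine =
  | { topic: string; ts: number; payload: string }
  | { topic: string; ts: number; payloadLen: number; payloadSha256: string };

// Hypothetical helper: encode one captured bus message as a single NDJSON line.
function encodeBusLine(topic: string, ts: number, payload: Buffer): string {
  const line: BusLine =
    payload.length <= MAX_INLINE_BYTES
      ? { topic, ts, payload: payload.toString("utf8") }
      : {
          topic,
          ts,
          payloadLen: payload.length,
          payloadSha256: createHash("sha256").update(payload).digest("hex"),
        };
  return JSON.stringify(line) + "\n";
}
```

Keeping the length and digest for oversized payloads lets a later replay confirm which message was seen without storing it.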
Timeout discipline
- scenarioSimTimeoutSec defaults to 180 (Tripwire loops at 180s).
- episodeTimeoutMs defaults to 600 000, a little over three times the sim cap, to cover boot, daemon warm-wait, and tear-down.
- A timed-out episode emits status: "timeout" and partial counters, and the bus NDJSON is still kept (it's the only forensic trail).
- Three consecutive timeouts in one cell → harness aborts the cell and continues to the next. The summary records failedEpisodes.
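The three-strikes rule is small enough to sketch directly. shouldAbortCell is a hypothetical name; the status values match EpisodeResult.status.

```typescript
type EpisodeStatus = "ok" | "timeout" | "error";

const MAX_CONSECUTIVE_TIMEOUTS = 3;

// True once the cell has seen three timeouts in a row; any other status resets the run.
function shouldAbortCell(statuses: EpisodeStatus[]): boolean {
  let run = 0;
  for (const s of statuses) {
    run = s === "timeout" ? run + 1 : 0;
    if (run >= MAX_CONSECUTIVE_TIMEOUTS) return true;
  }
  return false;
}
```

Whether an "error" status should also reset the counter is a judgment call; this sketch treats it as a reset.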
Why a child-process daemon + same-process services
The daemon already exists as a long-lived service with its own bus
connection and its own training loop. Running it in-process would
require either rebuilding the daemon's interface or pulling the
solver kernel into the harness process — both bigger lifts than just
spawning the existing service. The world-sim and copilot, by
contrast, were already factored into start* factories in PR #30
precisely so the harness can boot them in-process. That keeps episode
boundaries cheap (no docker churn between episodes within a cell) and
keeps the daemon's training continuity intact across episodes.
Strand 4 — Determinism (seeds.ts)
The contract: same --seed N → identical EpisodeResult (modulo
wall-clock fields, which are documented as non-deterministic).
import { mulberry32 } from "./mulberry32.js";
export interface DerivedSeeds {
  scenario: number;
  red: number;
  solverEpsilon: number;
}
export function deriveSeeds(master: number, episodeIdx: number): DerivedSeeds {
  // mulberry32 cycles long enough that we can index into it for
  // each role without correlation, but to be safe each role gets
  // its own keyed stream: hash(master, role, episodeIdx).
  const key = (role: number) =>
    Math.imul(master ^ 0x9e3779b9, role + 1) + episodeIdx * 0x85ebca6b;
  return {
    scenario: key(0) >>> 0,
    red: key(1) >>> 0,
    solverEpsilon: key(2) >>> 0,
  };
}
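The ./mulberry32.js module is imported above but not shown. For reference, the widely circulated mulberry32 generator looks like the sketch below; the repo's own implementation may differ in detail, so treat this as an assumption rather than a copy of it.

```typescript
// Commonly published mulberry32: a 32-bit seed in, a () => number in [0, 1) out.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed walk the same sequence.
const first = mulberry32(42);
const second = mulberry32(42);
const sequencesMatch = [0, 1, 2].every(() => first() === second());
```

The generator keeps all of its state in one closed-over integer, which is what makes per-role seeding from deriveSeeds sufficient for replay.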
Seed plumbing checklist
| Consumer | Today | Eval-harness needs |
|---|---|---|
| scenario.ts mulberry32 | hardcoded in some events | accept seedOverride in loadScenarioFromFile; thread through |
| services/red-agent/policies/scripted.ts | RED_AGENT_SEED env | unchanged — harness sets env on child spawn |
| packages/uci-solver/src/osCfr.ts ε-greedy | Math.random() | bring in an injectable RNG arg (additive, defaults to Math.random) |
| solver-daemon | seeds the solver via env | accept SOLVER_SEED env; daemon passes through to iterate() |
| services/copilot/src/llmAgent.ts | sampling temp + seed in client | not deterministic — llmAgent cells are advisory; the report flags them with nonDeterministic: true |
The osCfr RNG change is the only subtle one. It's a 5-line change to
take an optional rng?: () => number parameter, defaulted. Kuhn
correctness test gets a deterministic-seed variant added so the
exploitability ≤ 0.05 gate is now reproducible bit-for-bit.
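The shape of that change, shown on a generic ε-greedy action pick rather than on the real osCfr code (which is not reproduced here). epsilonGreedyPick is a hypothetical function; the point is only the defaulted rng parameter.

```typescript
// Generic ε-greedy pick with an injectable RNG. Callers that pass nothing
// keep today's behavior because rng defaults to Math.random.
function epsilonGreedyPick(
  values: number[],
  epsilon: number,
  rng: () => number = Math.random,
): number {
  if (values.length === 0) throw new Error("no actions to choose from");
  // Explore: with probability epsilon, pick a uniformly random action index.
  if (rng() < epsilon) return Math.floor(rng() * values.length);
  // Exploit: otherwise pick the first index holding the maximum value.
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}
```

A test can pass a seeded generator, or a constant stub, to pin which branch runs.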
Strand 5 — Score replay (scoreReplay.ts)
The live copilot publishes uci-demo/copilot/score continuously and
the final retained value is normally authoritative. But the harness
needs to be able to re-score a captured episode offline — both as
a cross-check on the live value and to re-score historical runs after
a PAYOFF_V bump.
export function scoreReplayFromNdjson(
  ndjsonPath: string,
  payoffV: number = PAYOFF_V,
): { counters: PayoffCounters; bluePayoff: number; redPayoff: number };
Implementation is straightforward: stream the NDJSON, accumulate
counters using the same rules scoreMirror.ts uses online (PR #33's
9-counter set), and call bluePayoff(counters) at the end. The same
rules are re-implemented here intentionally — scoreMirror.ts lives
in services/copilot/src/ and pulling it into a library would be
gratuitous coupling. The two implementations are unit-tested against
each other on a shared NDJSON fixture so they cannot silently drift.
If the caller passes a payoffV that does not match the constants in
the bundled @uci-demo/game, scoreReplayFromNdjson throws — same
discipline as deserializeBlueprint.
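A skeleton of the replay loop, with the real adjudication rules left out. This sketch only counts messages per topic; the nine PayoffCounters and their rules live in scoreMirror.ts and are not reproduced. tallyTopics is a hypothetical name, and the local PAYOFF_V = 2 is a stand-in for the constant exported by @uci-demo/game.

```typescript
const PAYOFF_V = 2; // stand-in for the @uci-demo/game export

interface BusRecord {
  topic: string;
  ts: number;
  payload?: string;
}

// Walk NDJSON text line by line and accumulate a per-topic message count.
// Throws up front when the requested payoffV is not the bundled one.
function tallyTopics(ndjson: string, payoffV: number = PAYOFF_V): Map<string, number> {
  if (payoffV !== PAYOFF_V) {
    throw new Error(`payoffV ${payoffV} does not match bundled PAYOFF_V ${PAYOFF_V}`);
  }
  const counts = new Map<string, number>();
  for (const raw of ndjson.split("\n")) {
    const line = raw.trim();
    if (line === "") continue; // tolerate blank lines and the trailing newline
    const rec = JSON.parse(line) as BusRecord;
    counts.set(rec.topic, (counts.get(rec.topic) ?? 0) + 1);
  }
  return counts;
}
```

A production version would stream the file from disk instead of holding it in memory; the accumulate-then-score shape stays the same.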
Strand 6 — Sweep semantics (runner.ts)
The runner is a deliberately boring nested loop:
for scenario in config.scenarios:
  for agentSelector in config.agents:
    for degradePreset in config.degradePresets:
      # one cell. solver-daemon is started here if agent === "solver",
      # stopped at cell exit. red-agent likewise if in scope.
      for episodeIdx in 0..config.episodesPerCell:
        seed = deriveEpisodeSeed(masterSeed, scenario, agentSelector,
                                 degradePreset, episodeIdx)
        episode = await runEpisode({ scenario, agentSelector, degradePreset,
                                     seed, ...config })
        cell.episodes.push(episode)
        flushReportToDisk(report) # crash safety
      cell.aggregate = computeAggregate(cell.episodes)
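The per-cell statistics behind computeAggregate reduce to a mean, spread, and range over the episode payoffs. A minimal sketch, assuming a hypothetical summarize helper and the sample (n - 1) standard deviation; the memo does not say which stdev convention the report uses.

```typescript
interface Summary {
  mean: number;
  stdev: number;
  min: number;
  max: number;
}

// Mean, sample standard deviation, min and max of a non-empty list.
function summarize(xs: number[]): Summary {
  if (xs.length === 0) throw new Error("cannot summarize an empty cell");
  const n = xs.length;
  const mean = xs.reduce((acc, x) => acc + x, 0) / n;
  const sumSq = xs.reduce((acc, x) => acc + (x - mean) ** 2, 0);
  // Assumption: sample stdev; a single episode reports 0 rather than NaN.
  const stdev = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0;
  return { mean, stdev, min: Math.min(...xs), max: Math.max(...xs) };
}
```

With 3 episodes per cell in the smoke run, the difference between sample and population stdev is large, so the white paper should state which one it quotes.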
What's intentionally sequential
- Episodes within a cell — they share daemon/red state and the same bus broker; parallelism would require multi-broker support, which is a Phase II problem.
- Cells across agents — the daemon is per-cell, and llm cells make rate-limited external calls.
What can be parallelized later
- Outer scenario loop, once we ship multi-broker support (Phase II).
- Read-only score-replay across an existing run — already pure function, trivially parallelizable.
This memo does not introduce parallelism. The 3-effector Tripwire
sweep on a 4-core developer machine takes ~45 minutes for
agents=[scripted, solver] × episodes=10 × degrade=[none,
comms-flap-30s], which is the smoke-able size for PRs.
Strand 7 — Local Best Response exploitability (lbr.ts)
Tractable proxy for true exploitability on tactical-scale games. Lisý, Lanctot, Bowling (2014); Davis et al. (2014). The idea: hold the trained strategy fixed for one player, locally best-respond at each information set the other player visits during play, and measure the gap. Bounds true exploitability from below; correlates strongly in practice.
export function localBestResponseGap(
  blueprint: Blueprint,
  dynamics: GameDynamics,
  rng: () => number,
  iterations: number,
): { bluePayoffVsLbrRed: number; redPayoffVsLbrBlue: number; gap: number };
gap = (bluePayoffVsLbrRed + redPayoffVsLbrBlue) / 2. Lower is
better. We plot gap against blueprint iteration count by reading
the per-iteration blueprint snapshots the daemon archives to disk
during a long training run (additive: daemon gains an opt-in
SOLVER_ARCHIVE_INTERVAL=1000 env that writes blueprint snapshots
to out/blueprint-archive/<iter>.json).
The Kuhn point is computed via kuhnExploit.ts which closed-forms
exact exploitability (the same code the existing
packages/uci-solver/test/kuhn.test.ts uses). Plotting both on the
same axis gives reviewers an anchor: "Kuhn converges to known-correct
0.006; LBR-gap on Tripwire drops monotonically and approaches a
plateau under the same kernel."
Strand 8 — Plot generators (plots/*.ts)
SVG-only. No matplotlib subprocess, no headless Chromium. We render
two SVGs directly from typed-array data with one small helper
(packages/uci-game already lives without plotting deps; the harness
can take a single tiny zero-dep helper or just emit raw SVG strings).
Rationale: SVG renders deterministically, is text-diffable in
code review, and embeds cleanly in the white paper.
exploitVsIter({
  series: { kuhn: KuhnPoint[]; tripwire: LbrPoint[] },
  out: "out/eval/<runId>/exploit-vs-iter.svg",
  width: 800, height: 500,
});
clockVsSize({
  points: { size: number; wallClockSec: number; gapAtEnd: number }[],
  out: "out/eval/<runId>/clock-vs-size.svg",
  width: 800, height: 500,
});
Both functions also emit the underlying CSV alongside the SVG so the proposal author can re-style in any tool.
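Emitting raw SVG strings plus a CSV needs very little code. The sketch below is one way to do it, not the planned helper: toSvgPolyline and toCsv are hypothetical names, and axes, ticks, and labels are omitted.

```typescript
interface Point {
  x: number;
  y: number;
}

// CSV with a header row, one point per line.
function toCsv(points: Point[], xName: string, yName: string): string {
  const rows = points.map((p) => `${p.x},${p.y}`);
  return [`${xName},${yName}`, ...rows].join("\n") + "\n";
}

// Scale points into the viewport (y flipped so larger values sit higher)
// and return a complete SVG document containing one polyline.
function toSvgPolyline(points: Point[], width: number, height: number): string {
  if (points.length === 0) throw new Error("no points to plot");
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  const xMin = Math.min(...xs), xMax = Math.max(...xs);
  const yMin = Math.min(...ys), yMax = Math.max(...ys);
  const sx = (x: number) => (xMax === xMin ? width / 2 : ((x - xMin) / (xMax - xMin)) * width);
  const sy = (y: number) =>
    yMax === yMin ? height / 2 : height - ((y - yMin) / (yMax - yMin)) * height;
  const coords = points.map((p) => `${sx(p.x).toFixed(1)},${sy(p.y).toFixed(1)}`).join(" ");
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" ` +
    `viewBox="0 0 ${width} ${height}">` +
    `<polyline fill="none" stroke="black" points="${coords}"/></svg>`
  );
}
```

Fixed-precision coordinates keep the output byte-stable across runs, which is what makes the SVG diffable in review.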
Strand 9 — CLI (main.ts, cli.ts)
No third-party arg parser. The existing codebase parses env vars and the few CLI surfaces it has are bespoke. Pattern:
tsx services/eval-harness/src/main.ts \
--scenarios counter-uas-tripwire,tripwire-10effector \
--agents scripted,solver \
--degrade none,comms-flap-30s \
--episodes-per-cell 5 \
--seed 42 \
--report out/eval/2026-05-21-smoke.json
Flags:
| Flag | Default | Meaning |
|---|---|---|
| --scenarios | counter-uas-tripwire | comma list of scenario names registered in scenarios.ts |
| --agents | scripted | comma list; scripted, solver, llm:<provider> |
| --degrade | none | comma list of named presets in scenarios.ts |
| --episodes-per-cell | 3 | positive int |
| --seed | 42 | master seed |
| --episode-timeout-ms | 600000 | per-episode wall-clock cap |
| --scenario-sim-timeout-sec | 180 | per-episode sim-time cap |
| --daemon-warm-iter | 1000 | wait until daemon iter ≥ N before first solver episode |
| --broker-url | mqtt://127.0.0.1:1883 | override broker |
| --report | out/eval/<isots>.json | report output path |
| --validator-full-audit | false | sets VALIDATOR_FULL_AUDIT=1 |
| --no-boot-broker | false | assume broker is up; skip docker compose up -d |
| --dry-run | false | print the cell matrix that would run, then exit |
A --help lists them. Unknown flags abort with a clear error.
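A dependency-free parser with the abort-on-unknown behavior fits in a few dozen lines. This is a sketch of the idea, not the planned cli.ts: parseArgs and splitList are hypothetical names, and only a subset of the flags from the table is wired in.

```typescript
interface ParsedArgs {
  values: Map<string, string>;
  flags: Set<string>;
}

// Subset of the flag table, split by whether the flag takes a value.
const VALUE_FLAGS = new Set([
  "--scenarios", "--agents", "--degrade", "--episodes-per-cell", "--seed", "--report",
]);
const BOOLEAN_FLAGS = new Set(["--dry-run", "--no-boot-broker", "--validator-full-audit"]);

function parseArgs(argv: string[]): ParsedArgs {
  const values = new Map<string, string>();
  const flags = new Set<string>();
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (BOOLEAN_FLAGS.has(arg)) {
      flags.add(arg);
    } else if (VALUE_FLAGS.has(arg)) {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`missing value for ${arg}`);
      }
      values.set(arg, value);
      i++; // skip the value we just consumed
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return { values, flags };
}

// Comma lists such as "scripted,solver" become trimmed, non-empty items.
function splitList(value: string): string[] {
  return value.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}
```

In main.ts this would be called as parseArgs(process.argv.slice(2)), with defaults from the table applied afterwards.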
Strand 10 — Scenario registry (scenarios.ts)
Tactical: the canonical counter-uas-tripwire.yaml.
Synthetic-scale: programmatic generators that emit larger Tripwire-shaped YAML strings into a tmpdir at sweep start. The naming convention:
tripwire-3effector ← canonical, 3 Blue effectors / 3-7 tracks/loop
tripwire-10effector ← 10 effectors / 10-20 tracks/loop
tripwire-30effector ← 30 effectors / 30-50 tracks/loop
tripwire-100effector ← 100 effectors / 100-200 tracks/loop (operational bench)
Each variant is a deterministic function of one integer (effectors)
and the same FOB center. The generator lives in scenarios.ts; it
emits the scenario YAML to os.tmpdir()/<scenario-name>.yaml and
hands the path to startWorldSim(). Synthetic scenarios are
deterministically reproducible from the scenario name alone —
the generator does not take a seed; same-name → same-bytes.
Degrade presets
export const DEGRADE_PRESETS: Record<string, DegradeEvent[]> = {
  none: [],
  "comms-flap-30s": [
    { atSec: 60, durationSec: 30, lossRate: 0.5 },
  ],
  "burst-2x": [
    { atSec: 30, durationSec: 10, lossRate: 0.9 },
    { atSec: 120, durationSec: 10, lossRate: 0.9 },
  ],
  "blackout-15s": [
    { atSec: 90, durationSec: 15, lossRate: 1.0 },
  ],
};
The harness publishes these to uci-demo/world/degrade directly; the
copilot integrates over them for commsDegradeSeconds just like in
the live demo.
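For sanity-checking a preset against the commsDegradeSeconds an episode reports, a rough expected value can be computed from the preset alone. The sketch below is an assumption about the rule, not the copilot's actual integration: degradedSeconds is a hypothetical helper that counts every window with nonzero loss at full weight, clips to the sim horizon, and assumes windows do not overlap.

```typescript
interface DegradeEvent {
  atSec: number;
  durationSec: number;
  lossRate: number;
}

// Total seconds inside [0, horizonSec] covered by windows with lossRate > 0.
// Assumes non-overlapping windows; overlapping ones would be double-counted.
function degradedSeconds(events: DegradeEvent[], horizonSec: number): number {
  let total = 0;
  for (const e of events) {
    if (e.lossRate <= 0) continue;
    const start = Math.max(0, e.atSec);
    const end = Math.min(horizonSec, e.atSec + e.durationSec);
    if (end > start) total += end - start;
  }
  return total;
}
```

Under these assumptions the burst-2x preset yields 20 degraded seconds over a 180-second episode.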
Strand 11 — Test surface
test/runner.test.ts
  - boots a real broker (test/utils/broker.ts wraps docker-compose)
  - runs one cell: scripted on Tripwire, 1 episode, episode-timeout 30s
  - asserts:
      report.schemaVersion === 1
      cells.length === 1
      cells[0].episodes.length === 1
      episode.status === "ok"
      episode.counters has all 9 keys
      episode.bluePayoff is finite
      report.versions.payoffV === PAYOFF_V
test/scoreReplay.test.ts
  - loads test/fixtures/episode-fixture.ndjson (50-message hand-crafted)
  - asserts every counter matches a known expected value
  - asserts bluePayoff matches the closed-form against PAYOFF_WEIGHTS
test/seeds.test.ts
  - runs two episodes with the same seed back-to-back
  - asserts deep-equal on counters, bluePayoff,
    and decisionLatencies.length (latencies themselves vary by clock)
The runner.test.ts smoke needs docker; it's gated by a
SKIP_BROKER_TESTS=1 env so CI without docker can skip without
red-marking. Local default: runs. Vitest timeout: 120 sec.
Critical files
New files
- services/eval-harness/package.json
- services/eval-harness/tsconfig.json
- services/eval-harness/src/main.ts
- services/eval-harness/src/cli.ts
- services/eval-harness/src/runner.ts
- services/eval-harness/src/episode.ts
- services/eval-harness/src/scoreReplay.ts
- services/eval-harness/src/scenarios.ts
- services/eval-harness/src/busLogger.ts
- services/eval-harness/src/seeds.ts
- services/eval-harness/src/lbr.ts
- services/eval-harness/src/kuhnExploit.ts
- services/eval-harness/src/plots/exploitVsIter.ts
- services/eval-harness/src/plots/clockVsSize.ts
- services/eval-harness/src/types.ts
- services/eval-harness/regression.config.json
- services/eval-harness/test/runner.test.ts
- services/eval-harness/test/scoreReplay.test.ts
- services/eval-harness/test/seeds.test.ts
- services/eval-harness/test/fixtures/episode-fixture.ndjson
- services/eval-harness/test/utils/broker.ts
Edited files (additive, default behavior preserved)
- services/world-sim/src/scenario.ts — loadScenarioFromFile accepts optional { seedOverride?: number } arg; threads through to the existing mulberry32 sites; default unchanged.
- services/world-sim/src/service.ts — StartWorldSimOptions gains scenarioSeed?: number; forwarded to loadScenarioFromFile.
- services/solver-daemon/src/main.ts — reads SOLVER_SEED env; passes through to iterate() via the new rng parameter.
- services/solver-daemon/src/daemon.ts — optional SOLVER_ARCHIVE_INTERVAL env; writes blueprint snapshots to out/blueprint-archive/<iter>.json when set.
- packages/uci-solver/src/escfr.ts — iterate(opts) gains optional rng?: () => number; threaded through to osCfr / es-cfr internal RNGs. Default Math.random.
- packages/uci-solver/test/kuhn.test.ts — gains an it("converges deterministically with injected seed") block.
- package.json (root) — pnpm run eval runs the harness against the smoke cell; pnpm run eval:full runs the full sweep (10×3×3 = 90 episodes, ~3 hours on developer hardware).
- README.md — Eval Harness section pointing at pnpm run eval.
Out of scope for this PR (named so reviewers don't ask)
- .github/workflows/eval.yml — perf workflow runs on solver-perf label. Lands in a follow-up PR after the runner is in main and we've measured the broker-boot cost on the GitHub runner.
- Operational scaling sweep across 10 / 30 / 100 effectors. The synthetic-scale generators ship; an actual recorded sweep on self-hosted hardware is the week 8-10 workstream.
- SME micro-tournament harness. Different shape — needs a human-in-the-loop replay UI, not a headless runner.
- Persistent eval database. The harness emits self-contained JSON files; aggregation across runs is grep/jq.
Existing utilities to reuse (do not duplicate)
| Utility | Where | Use |
|---|---|---|
| connectBus, WILDCARD_ALL | @uci-demo/bus | bus logger subscribes; episode driver publishes degrade |
| startWorldSim / startCopilot | @uci-demo/world-sim, @uci-demo/copilot | episode boot; installSignalHandlers: false |
| loadScenarioFromFile | services/world-sim/src/scenario.ts | synthetic scenario YAML round-trip |
| bluePayoff, redPayoff, PAYOFF_WEIGHTS, PAYOFF_V | @uci-demo/game | offline scoring + report versioning |
| deserializeBlueprint, SCHEMA_VERSION | @uci-demo/solver | report stamps blueprintAtStart from a freshly-deserialized payload |
| scriptedAgent, makeLlmAgent, createSolverAgent | services/copilot/src/* | direct agent injection |
| createClient, selectClientConfigFromEnv | @uci-demo/llm | LLM agent construction |
| RED_AGENT_* env vars | services/red-agent/ | seed + cadence pinning |
| XMLParser | fast-xml-parser | only used by lbr.ts if it ever needs to parse a captured EntityMT; otherwise score replay is pure JSON-side |
| xmllint-wasm validator endpoint | services/validator/ | /audit?n=500 HTTP call per episode |
Open questions (resolve before code)
- Does the daemon need a --scenario flag so the harness can run scaling sweeps where the daemon trains against the same synthetic scenario the world-sim is running? Currently the daemon hard-codes the Tripwire scenario path. Recommendation: accept SOLVER_SCENARIO_PATH env in this PR; the daemon already has scenario loading.
- Is scoreReplay.ts worth the duplication of scoreMirror.ts logic? The shared fixture test gates them; the alternative is to pull scoreMirror into @uci-demo/game as a pure function. Recommendation: defer the refactor. Duplication is small, tests catch drift, and pulling scoreMirror cleanly into @uci-demo/game is its own discussion (it touches a few topic strings that today live in services/copilot/).
- Local best response on Tripwire — what depth? Full-depth LBR on a 50-ply tree is itself expensive. Recommendation: ship depth-limited LBR (default depth 10), document the bound, leave full-depth for the eval workflow.
- Do we record decision rationale text in the EpisodeResult? The intel-rail text from uci-demo/copilot/reason/* is rich forensic data but it's also large and the LLM-agent path is non-deterministic. Recommendation: write the uci-demo/copilot/reason/* traffic to a separate reasoning.ndjson inside the episode dir; do not embed in EpisodeResult (keeps the JSON diffable). The white paper's "3 interpretability case studies" pull from that file.
- Episode count for the proposal plots — 3? 10? 30? Affects wall-clock and statistical credibility. Recommendation: 10 per cell for the headline numbers; 3 per cell for the PR-level smoke. The white paper text states the n explicitly and the EvalReport.summary carries it.
Acceptance gates
For the PR landing this memo:
- pnpm --filter @uci-demo/eval-harness exec tsx src/main.ts --scenarios counter-uas-tripwire --agents scripted --episodes-per-cell 2 --report /tmp/smoke.json completes under five minutes and writes a valid EvalReport.
- pnpm -r typecheck and pnpm -r test both green.
- runner.test.ts smoke passes against a real broker.
- scoreReplay.test.ts and scoreMirror.ts agree to within floating-point error on the shared fixture.
- seeds.test.ts round-trip identical on two consecutive runs.
- No mutation of services/world-sim/, services/copilot/, services/solver-daemon/ beyond the additive edits listed above.
For the follow-up PRs (named so they don't get dropped):
- .github/workflows/eval.yml — solver-perf label triggers a 3-cell tactical bench; failures comment on the PR.
- Scaling sweep recorded: a real run of --scenarios tripwire-3effector,tripwire-10effector,tripwire-30effector --agents solver --episodes-per-cell 10 on self-hosted hardware, results checked in under docs/benchmarks/.
- LBR convergence plot (exploit-vs-iter.svg) checked in next to the blueprint archive that produced it.
Why this design over alternatives
Why not a Python harness? Every other piece in the repo is
TypeScript on tsx. A Python sidecar would introduce a second
package manager, a second test runner, and a second CI lane. The
harness has zero numerical work that TypeScript can't do — the
solver kernel is already TS, plot generation is SVG strings.
Why not run everything in one process? The solver-daemon is
already a service. Inlining it into the harness would mean either
duplicating its training loop or surgically extracting it — both
worse than a child_process.spawn of an existing entrypoint. The
copilot + world-sim, by contrast, were designed to be embedded
(PR #30 service factories), so they run in-process.
Why not just use the cop-ui replay panel? That's a human-driven forensic tool, not a sweep runner. Different shape, different audience. The eval harness is upstream of that — it produces the captures that the replay panel can later display, but it does not depend on the panel existing.
Why version EvalReport from day one? Because we already know
we'll bump PAYOFF_V and BELIEF_V again. The white paper carries
specific numbers from specific reports; the schema version is what
lets a reader confirm the report was scored under the same world
they're reading about.
What's next after this PR
In sequence, the follow-ups that depend on this landing:
- CI workflow — .github/workflows/eval.yml runs the runner on a solver-perf PR label. Bench fails the PR if mean bluePayoff for solver drops more than X% vs the last recorded main value (X TBD; start with 25%).
- Operational scaling sweep — record one full 3/10/30 effector sweep, check in the resulting SVG + CSV under docs/benchmarks/. This is the wall-clock-vs-problem-size plot the proposal cites.
- SME micro-tournament — the human-in-the-loop variant of the harness; reuses scoreReplay.ts, swaps the agent for a thin approval-rail-driven shim, records human decision latencies.
- White paper — pulls plots from docs/benchmarks/, pulls case studies from reasoning.ndjson, pulls headline numbers from a specific EvalReport.runId. The version constants in that report are quoted in the methodology section verbatim.