Design — services/solver-daemon/ + SolverPill (Phase 0 weeks 5-7)¶
Pre-code design memo for the workstream that makes the
@uci-demo/solver kernel visibly alive in the demo. The kernel
ships (PR #32); doctrinal subroutines + bank ship (PR #33); now
we run ES-MCCFR continuously and publish the blueprint so the
SolverAgent (next PR) and eval-harness (week 7-9) have a real
trained policy to consume.
This PR is deliberately scoped to the daemon plus the visible proof-of-life pill. SolverAgent integration and red-agent are follow-ups in PR #35 and #36.
The headline acceptance gate: a reviewer running
USE_SOLVER=1 pnpm run up sees a SOLVER ▸ 12.4K iter · 3,217 IS ·
δ 0.014 pill in the top-right of the COP, with the iteration
counter incrementing. Without that visible proof, the algorithmic
story remains invisible to anyone who doesn't read the code.
Strand 1 — services/solver-daemon/¶
What it does (one paragraph)¶
A standalone Node service that loads a synthetic version of the
Tripwire scenario, builds a GameDynamics over it via
@uci-demo/game, and runs iterate() from @uci-demo/solver
continuously. Every ~100 iterations it publishes a status
heartbeat to uci-demo/solver/status (retained). Every ~1000
iterations it publishes the current trained blueprint to
uci-demo/solver/blueprint (retained). The daemon never reads
from uci/v2_5/* — its world is the synthetic dynamics, not the
live bus. The blueprint it publishes is the contract; downstream
consumers (SolverAgent in PR #35, eval-harness in week 7-9)
subscribe and consume in-memory without RPC.
What it does NOT do¶
- Drive Blue decisions. PR #35 ships SolverAgent consuming the published blueprint via a BlueprintHolder module. The daemon's only output is the blueprint itself.
- Read live bus traffic. Self-play happens against synthetic dynamics. Live integration is the SolverAgent's job.
- Run Red. The red-agent service is PR #36.
- Compute true exploitability. Tactical-scale exact best-response is intractable. We publish a delta-norm proxy (L2 norm of strategy change over the last 1k iterations) as the visible convergence signal. Real exploitability via local best-response (Lisý et al.) lands with the eval-harness.
Service shape¶
services/solver-daemon/
├── package.json # workspace deps on @uci-demo/{bus,game,solver}
├── tsconfig.json
├── src/
│ ├── main.ts # entry point, env parsing, bus connect, kick off daemon
│ ├── daemon.ts # iteration loop, heartbeat + blueprint publish
│ ├── scenario.ts # Tripwire YAML → GameState + GameDynamics adapter
│ └── deltaNorm.ts # convergence-proxy helper
└── test/
└── daemon.test.ts # 1k-iter convergence smoke + publish-cadence check
Strand 2 — MQTT side-channels¶
Both topics are under uci-demo/solver/... (free-form JSON, the
same convention as the existing uci-demo/world/... and
uci-demo/copilot/... channels).
uci-demo/solver/status (retained)¶
Published every STATUS_INTERVAL_ITER = 100 iterations.
{
"schemaV": 1,
"iterations": 12400,
"infoSetCount": 3217,
"deltaNorm": 0.0142,
"warm": true, // true once iterations >= WARM_THRESHOLD (5000)
"scenarioName": "OPERATION TRIPWIRE",
"ts": 1779320000000,
"beliefV": 1,
"payoffV": 2,
"infosetV": 1,
"solverSchemaVersion": 1 // matches SCHEMA_VERSION from @uci-demo/solver
}
- iterations: monotonically increasing.
- infoSetCount: regretTableBlue.size(), a visible memory proxy. Reaches a stable plateau as the policy covers the reachable info-set space.
- deltaNorm: L2 norm of (σ̄_now - σ̄_1000_ago) averaged across info-sets. Decreasing curve = convergence.
- warm: gate flag for SolverAgent. When false, the agent rationale should tag traces as "cold-start; using fallback."
- Version tags: every payload embeds the four version constants from @uci-demo/game + @uci-demo/solver. Stale UI or SolverAgent can refuse to trust mismatched payloads.
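A consumer-side version check might look like the sketch below. The VersionTags interface, the EXPECTED_VERSIONS constant, and the mismatchedVersions helper are hypothetical names introduced for illustration; the expected values are copied from the example payload above, whereas real code would presumably import them from @uci-demo/game and @uci-demo/solver.

```typescript
// Hypothetical shape: only the version fields of the status payload.
interface VersionTags {
  readonly beliefV: number;
  readonly payoffV: number;
  readonly infosetV: number;
  readonly solverSchemaVersion: number;
}

// Assumption: local stand-ins for the package constants, with the values
// taken from the example status payload.
const EXPECTED_VERSIONS: VersionTags = {
  beliefV: 1,
  payoffV: 2,
  infosetV: 1,
  solverSchemaVersion: 1,
};

/** Returns the names of version fields that are missing or differ from `expected`. */
function mismatchedVersions(
  payload: Partial<VersionTags>,
  expected: VersionTags = EXPECTED_VERSIONS,
): string[] {
  const keys: (keyof VersionTags)[] = [
    "beliefV",
    "payoffV",
    "infosetV",
    "solverSchemaVersion",
  ];
  return keys.filter((k) => payload[k] !== expected[k]);
}
```

A consumer would trust the payload only when the returned list is empty, and could log the mismatched field names otherwise.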
uci-demo/solver/blueprint (retained)¶
Published every BLUEPRINT_INTERVAL_ITER = 1000 iterations.
Payload is the JSON produced by serializeBlueprint() from
@uci-demo/solver — already version-tagged, base64-encoded
Float32Array buffers, refuses to load on version mismatch.
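Size: ~5 MB of raw table data for tactical-scale tables (10⁴ info-sets
× 64 actions × 4 bytes × 2 tables ≈ 5.1 MB), or roughly 6.8 MB once
base64 inflates it by 4/3. Mosquitto handles this fine; SolverAgent
will need to chunk or offload the deserialize (for example in a
worker thread) to avoid blocking the event loop (PR #35 concern).
queueMicrotask alone would not help, because microtasks run before
the event loop returns to I/O.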
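As a rough check of the size figure, the arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch assumes the tables are encoded as one contiguous buffer and ignores JSON envelope overhead, so it is an estimate rather than the exact output size of serializeBlueprint().

```typescript
/** Length of the padded base64 encoding of `n` raw bytes. */
function base64Length(n: number): number {
  return 4 * Math.ceil(n / 3);
}

/** Estimates raw and base64-encoded sizes for Float32 tables (4 bytes per entry). */
function estimateBlueprintBytes(
  infoSets: number,
  actions: number,
  tables: number,
): { rawBytes: number; base64Bytes: number } {
  const rawBytes = infoSets * actions * 4 * tables;
  return { rawBytes, base64Bytes: base64Length(rawBytes) };
}

const estimate = estimateBlueprintBytes(10_000, 64, 2);
// estimate.rawBytes    === 5_120_000  (about 5.1 MB)
// estimate.base64Bytes === 6_826_668  (about 6.8 MB)
```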
What's NOT published yet¶
- Per-side blueprints. The current serializeBlueprint envelope carries Blue's regret/avg-strategy only (red's is held in redRegret but not serialized). When PR #36 ships the red-agent, we'll either ship two retained envelopes (one per player) OR extend the schema (and bump SCHEMA_VERSION to 2).
Strand 3 — Synthetic scenario adapter¶
The daemon doesn't drive an MQTT world. It builds an in-memory
GameState from the Tripwire YAML once at startup, plus a
ScenarioTruth (the hidden identity/threat-type/fuel map), and
calls createTacticalDynamics() to get a callable simulator.
Where the truth comes from¶
The Tripwire YAML (scenarios/counter-uas-tripwire.yaml) defines
spawn events with identity and threatType per track. For the
daemon's purposes, that IS the truth — we mirror those fields
into ScenarioTruth.trueIdentity and ScenarioTruth.trueThreatType.
Effector fuel truth: each EFFECTOR_* asset starts at 1.0
(NORMAL), drains by 0.05 per engage. The daemon's trueFuel
map is updated by the dynamics' apply step (which already
tracks fuelBurnedTotal in counters — we just need to project
that back to per-effector fuel fractions).
For Phase 0 simplicity, hardcode the Tripwire truth in the adapter. A YAML schema extension lands as a follow-up when we generalize to multi-scenario daemons.
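The fuel projection described above could be sketched as below. The per-effector engage-count map and the projectFuel name are assumptions made for illustration (the source only says the counters track fuelBurnedTotal), and clamping at zero is also an assumption.

```typescript
const INITIAL_FUEL = 1.0; // NORMAL
const FUEL_PER_ENGAGE = 0.05;

/**
 * Projects per-effector engage counts onto remaining fuel fractions.
 * Assumption: fuel never goes below 0.
 */
function projectFuel(engageCounts: ReadonlyMap<string, number>): Map<string, number> {
  const fuel = new Map<string, number>();
  engageCounts.forEach((engages, effectorId) => {
    fuel.set(effectorId, Math.max(0, INITIAL_FUEL - FUEL_PER_ENGAGE * engages));
  });
  return fuel;
}
```

With three engages an effector would sit at roughly 0.85; after twenty or more it would be empty.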
Adapter signature¶
export interface SolverScenario {
readonly name: string;
readonly initialState: GameState;
readonly truth: ScenarioTruth;
readonly dynamics: GameDynamics;
}
export function loadTripwireScenario(): SolverScenario;
The adapter:
1. Builds the WorldSnapshot from the Tripwire spawn events
2. Builds the ScenarioTruth from the YAML identity/threatType
3. Calls buildGameState({world, truth, ...}) from @uci-demo/game
4. Calls createTacticalDynamics() for the dynamics
5. Returns the bundle
Reset / episode boundaries¶
The daemon iterates against this single static initial state.
dynamics.isTerminal(state) triggers episode reset; the inner
loop in @uci-demo/solver's iterate() already handles this via
rootStateFactory(rng). The daemon's rootStateFactory returns
the adapter's initialState (immutable) and lets apply() walk
the tree from there.
Strand 4 — Daemon iteration loop¶
Cadence¶
- Inner ES-MCCFR loop: BATCH_ITERATIONS = 50 iterations per loop tick (a small batch keeps the event loop responsive)
- Outer batch loop: schedules the next batch via setImmediate() so we don't starve the event loop
- Status publish: every STATUS_INTERVAL_ITER = 100 iterations (so every 2 batches)
- Blueprint publish: every BLUEPRINT_INTERVAL_ITER = 1000 iterations
- Delta-norm computation: every 100 iterations, comparing current avg-strategy to a snapshot taken 1000 iterations ago. Snapshots are stored in a small ring buffer
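One way the snapshot ring buffer from the cadence list could work is sketched below. SnapshotRing is a hypothetical name, not part of @uci-demo/solver. The sketch assumes one snapshot is pushed per delta-norm computation (every 100 iterations) and that the ring holds 10 entries, so reading the oldest entry before pushing the current one yields the snapshot from 1000 iterations ago.

```typescript
/** Fixed-capacity ring of snapshots; the oldest entry is evicted first. */
class SnapshotRing<T> {
  private readonly slots: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  /** Adds the newest snapshot, dropping the oldest once the ring is full. */
  push(snapshot: T): void {
    this.slots.push(snapshot);
    if (this.slots.length > this.capacity) this.slots.shift();
  }

  /** Oldest retained snapshot, or null until `capacity` snapshots have been pushed. */
  oldestIfFull(): T | null {
    return this.slots.length === this.capacity ? this.slots[0] ?? null : null;
  }
}
```

In the daemon, T would be the Map<bigint, Float32Array> snapshot type; until the ring fills, the delta-norm has no baseline and the "no signal yet" value applies.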
Pseudocode¶
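let totalIterations = 0;
let shuttingDown = false; // set when shutdown is requested (handler not shown)
const blueRegret = createRegretTable(64);
const redRegret = createRegretTable(64);
// Simplification: a single snapshot, refreshed at each blueprint publish.
// It stays empty until iteration 1000, so deltaNorm reads 1.0 until then,
// and afterwards the comparison window varies between 100 and 1000
// iterations. The ring buffer from the cadence list would pin it at 1000.
let lastSnapshot: Map<bigint, Float32Array> = new Map();

function tick() {
  if (shuttingDown) return;
  // Run one batch. The regret tables are read back below, so iterate()
  // is assumed to update them in place.
  iterate({
    iterations: BATCH_ITERATIONS,
    dynamics: scenario.dynamics,
    rootStateFactory: () => scenario.initialState,
    rng: mulberry32(SEED + totalIterations), // distinct, reproducible stream per batch
    regretBlue: blueRegret,
    regretRed: redRegret,
  });
  totalIterations += BATCH_ITERATIONS;
  if (totalIterations % STATUS_INTERVAL_ITER === 0) {
    const dnorm = computeDeltaNorm(blueRegret, lastSnapshot);
    publishStatus({
      iterations: totalIterations,
      infoSetCount: blueRegret.size(),
      deltaNorm: dnorm,
      warm: totalIterations >= WARM_THRESHOLD,
      // ... version tags ...
    });
  }
  if (totalIterations % BLUEPRINT_INTERVAL_ITER === 0) {
    publishBlueprint(blueRegret);
    lastSnapshot = snapshotAvgStrategy(blueRegret); // baseline for the next delta-norm window
  }
  setImmediate(tick); // let pending I/O callbacks run, then start the next batch
}

tick(); // kick off the loop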
Convergence proxy: delta-norm¶
function computeDeltaNorm(table: RegretTable, prev: Map<bigint, Float32Array>): number {
  let sumSq = 0;
  let count = 0;
  for (const { info, avgStrategy } of table.entries()) {
    const prevRow = prev.get(info);
    if (!prevRow) continue;
    for (let i = 0; i < avgStrategy.length; i++) {
      const d = avgStrategy[i] - prevRow[i];
      sumSq += d * d;
      count++;
    }
  }
  return count > 0 ? Math.sqrt(sumSq / count) : 1.0; // 1.0 = "no signal yet"
}
As the policy approaches Nash, consecutive averages converge and the L2 norm shrinks. Reaches ~0.001 in well-converged tables.
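Written out, the quantity returned by computeDeltaNorm is a root-mean-square over every (info-set, action) entry that exists in both the current table and the previous snapshot. The notation below is introduced here for illustration: \(\mathcal{I}\) is the set of info-sets present in both, \(A\) the action set, and \(\bar\sigma\) the average strategy.

```latex
\delta \;=\; \sqrt{\frac{1}{|\mathcal{I}|\,|A|}
  \sum_{I \in \mathcal{I}} \sum_{a \in A}
  \bigl(\bar\sigma_{\text{now}}(I,a) - \bar\sigma_{\text{prev}}(I,a)\bigr)^{2}}
```

As a worked example, a single info-set with two actions whose average strategy moves from (0.5, 0.5) to (0.6, 0.4) gives \(\delta = \sqrt{(0.01 + 0.01)/2} = 0.1\).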
Why setImmediate, not setInterval¶
setImmediate() yields control to I/O (the MQTT publish + any
incoming messages) between batches without artificially throttling
the iteration rate. Performance-bound by the kernel's per-iteration
cost, not by a wall-clock interval.
Throughput target¶
Phase 0 acceptance: 1k iterations per second of wall-clock time on commodity dev hardware (M1/M2 MacBook). At that rate, the daemon publishes its first blueprint (1000 iterations) after about 1 second and reaches "warm" (5000 iterations) in about 5 seconds, comfortably inside the 1-minute window in the acceptance criteria.
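The timing implied by the throughput target can be checked with a few lines of arithmetic. The constant names mirror the ones used in this memo; secondsToReach is a hypothetical helper.

```typescript
const TARGET_ITER_PER_SEC = 1000;
const WARM_THRESHOLD = 5000;
const BLUEPRINT_INTERVAL_ITER = 1000;

/** Wall-clock seconds needed to complete `iterations` at a steady rate. */
function secondsToReach(iterations: number, iterPerSec: number): number {
  return iterations / iterPerSec;
}

const secondsToWarm = secondsToReach(WARM_THRESHOLD, TARGET_ITER_PER_SEC); // 5
const secondsToFirstBlueprint = secondsToReach(BLUEPRINT_INTERVAL_ITER, TARGET_ITER_PER_SEC); // 1

// Sensitivity check: at a tenth of the target rate the daemon is warm after 50 s.
const secondsToWarmSlow = secondsToReach(WARM_THRESHOLD, TARGET_ITER_PER_SEC / 10); // 50
```

Even at a tenth of the target rate, both milestones would still land within a minute.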
Strand 5 — SolverPill UI component¶
Slot¶
Top-right of TopStrip, next to the existing ROE indicator. The
TopStrip currently shows scenario name + ROE + comms; the pill
goes between ROE and comms.
Props¶
interface SolverPillProps {
readonly status: SolverStatus | null;
}
The pill is a pure render of its status prop, which is fed from
state.solverStatus in the zustand store.
Visual spec¶
When status === null (daemon offline):
[ SOLVER ▸ OFFLINE ]
In --color-fg-faint, no glow.
When status is set but !warm (cold-start):
[ SOLVER ▸ 2.1K iter · 412 IS · δ — ]
In --color-amber with phosphor glow. The δ shows — because no
prior snapshot exists yet.
When status.warm === true (running normally):
[ SOLVER ▸ 12.4K iter · 3,217 IS · δ 0.014 ]
In --color-cyan with phosphor glow. The δ value renders in:
- --color-grant when δ < 0.02 (converged-ish)
- --color-cyan when 0.02 ≤ δ < 0.1 (training)
- --color-amber when δ ≥ 0.1 (early-training)
Stale heartbeat (last ts > 30s ago):
[ SOLVER ▸ STALLED 12.4K iter ]
In --color-threat. Detection via ts field on the most recent
status payload.
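The four visual states and the δ color thresholds above could be derived with two small pure functions, sketched below. The names pillMode, deltaColor, and PillStatus are hypothetical, and checking for a stale heartbeat before the warm flag is an assumption about precedence that the spec does not state.

```typescript
// Hypothetical subset of the status payload needed to pick a visual state.
interface PillStatus {
  readonly warm: boolean;
  readonly deltaNorm: number;
  readonly ts: number; // epoch milliseconds of the heartbeat
}

type PillMode = "offline" | "stalled" | "cold" | "warm";

const STALE_AFTER_MS = 30_000;

/** Picks the pill state; staleness is assumed to win over cold/warm. */
function pillMode(status: PillStatus | null, nowMs: number): PillMode {
  if (status === null) return "offline";
  if (nowMs - status.ts > STALE_AFTER_MS) return "stalled";
  return status.warm ? "warm" : "cold";
}

/** Maps a delta-norm value to the CSS custom property named in the spec. */
function deltaColor(deltaNorm: number): string {
  if (deltaNorm < 0.02) return "--color-grant";
  if (deltaNorm < 0.1) return "--color-cyan";
  return "--color-amber";
}
```

Passing the current time in as an argument keeps both functions pure; the component would need a periodic re-render (for example a one-second timer) for the stalled state to appear without a new message.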
Iteration formatter¶
function fmtIter(n: number): string {
  if (n < 1000) return n.toString();
  if (n < 1_000_000) return (n / 1000).toFixed(1) + "K";
  return (n / 1_000_000).toFixed(2) + "M";
}
12_437 → 12.4K. 1_412_000 → 1.41M.
InfoSet count formatter¶
function fmtIS(n: number): string {
return n.toLocaleString(); // 3217 → "3,217"
}
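Putting the two formatters together, the warm-state label might be assembled as in the sketch below. The warmPillLabel name is hypothetical, the three-decimal δ rendering is inferred from the example label in the visual spec, and the explicit "en-US" locale is an assumption added here so the thousands separator is deterministic.

```typescript
function fmtIter(n: number): string {
  if (n < 1000) return n.toString();
  if (n < 1_000_000) return (n / 1000).toFixed(1) + "K";
  return (n / 1_000_000).toFixed(2) + "M";
}

function fmtIS(n: number): string {
  // Assumption: pin the locale so the separator does not vary by machine.
  return n.toLocaleString("en-US");
}

/** Builds the label shown while the solver is warm. */
function warmPillLabel(iterations: number, infoSetCount: number, deltaNorm: number): string {
  return `SOLVER ▸ ${fmtIter(iterations)} iter · ${fmtIS(infoSetCount)} IS · δ ${deltaNorm.toFixed(3)}`;
}

const label = warmPillLabel(12_437, 3217, 0.0142);
// label === "SOLVER ▸ 12.4K iter · 3,217 IS · δ 0.014"
```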
Tooltip¶
title attribute with full breakdown:
"iterations=12437 IS=3217 δ=0.0142 warm=true scenario=OPERATION TRIPWIRE payoffV=2 beliefV=1 infosetV=1"
Strand 6 — USE_SOLVER=1 gating¶
The default pnpm run up should NOT spawn the daemon — existing
demo behavior is preserved. The daemon is opt-in via env flag:
USE_SOLVER=1 pnpm run up # spawns the solver-daemon alongside everything else
pnpm run up # existing behavior; daemon stays off
Implementation¶
Two approaches:
A. Conditional concurrently argv (clean, declarative).
Update root package.json's up script to spawn the daemon
conditionally via a small shell expansion:
"up": "docker compose up -d mosquitto && concurrently ... $([ \"$USE_SOLVER\" = \"1\" ] && echo '--names VAL,SIM,CP,UI,ADSB,SLV --prefix-colors cyan,yellow,magenta,green,blue,violet' || echo '--names VAL,SIM,CP,UI,ADSB --prefix-colors cyan,yellow,magenta,green,blue') ... \"pnpm --filter @uci-demo/validator start\" ... $([ \"$USE_SOLVER\" = \"1\" ] && echo '\"pnpm --filter @uci-demo/solver-daemon start\"' || echo '')"
That's ugly. Reject.
B. Two scripts (clearest).
"up": "...existing concurrently list, no daemon...",
"up:solver": "... existing list + '\"pnpm --filter @uci-demo/solver-daemon start\"' ..."
Operator runs pnpm run up:solver when they want the solver.
Two scripts, no shell magic, completely transparent.
Recommendation: B. Add a one-line note in README.md /
CLAUDE.md explaining when to use which.
Why not a runtime flag inside the daemon?¶
We could ship the daemon always-on and have it gate its own heartbeat on an env var. But that wastes CPU on a service that's mostly unused in dev. The two-script approach matches the "intentional gates" discipline from CLAUDE.md.
Strand 7 — Cop-UI integration¶
Store extensions (lib/store.ts)¶
interface SolverStatus {
  readonly iterations: number;
  readonly infoSetCount: number;
  readonly deltaNorm: number;
  readonly warm: boolean;
  readonly scenarioName: string;
  readonly ts: number;
  readonly beliefV: number;
  readonly payoffV: number;
  readonly infosetV: number;
}
interface CopState {
  // ...existing fields
  solverStatus: SolverStatus | null;
  /** Most recently received blueprint blob (raw JSON string), kept for replay/debug. */
  solverBlueprintJson: string | null;
  setSolverStatus: (s: SolverStatus | null) => void;
  setSolverBlueprintJson: (json: string) => void;
}
Bus subscriber additions (lib/busSubscriber.ts)¶
Two new dispatch branches:
if (topic === "uci-demo/solver/status") {
  try {
    const json = JSON.parse(payloadText) as SolverStatus & { schemaV?: number };
    if (typeof json.iterations === "number" && typeof json.ts === "number") {
      useCop.getState().setSolverStatus(json);
    }
  } catch (err) { console.error("[cop-ui] solver/status parse error:", err); }
  return;
}
if (topic === "uci-demo/solver/blueprint") {
  // Just store the JSON; SolverAgent (PR #35) deserializes when it ships.
  if (payloadText.length > 0) {
    useCop.getState().setSolverBlueprintJson(payloadText);
  }
  return;
}
Subscribe list extension (currently uci-demo/copilot/#,
uci-demo/world/#, uci-demo/scenario/#): add
uci-demo/solver/#.
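The status branch above checks only two fields before trusting the cast. A stricter runtime guard could look like the following sketch; isSolverStatus and SolverStatusShape are hypothetical names, and the field list simply mirrors the SolverStatus interface in this memo.

```typescript
// Mirrors the SolverStatus interface from the store extension.
interface SolverStatusShape {
  readonly iterations: number;
  readonly infoSetCount: number;
  readonly deltaNorm: number;
  readonly warm: boolean;
  readonly scenarioName: string;
  readonly ts: number;
  readonly beliefV: number;
  readonly payoffV: number;
  readonly infosetV: number;
}

const NUMERIC_KEYS = [
  "iterations", "infoSetCount", "deltaNorm", "ts", "beliefV", "payoffV", "infosetV",
];

/** Narrows an unknown parsed JSON value to the status shape, field by field. */
function isSolverStatus(value: unknown): value is SolverStatusShape {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const numbersOk = NUMERIC_KEYS.every((k) => {
    const field = v[k];
    return typeof field === "number" && Number.isFinite(field);
  });
  return numbersOk && typeof v["warm"] === "boolean" && typeof v["scenarioName"] === "string";
}
```

Used in the dispatch branch, the guard would replace the two typeof checks, so a malformed payload never reaches the store.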
Replay buffer¶
The existing message buffer captures all subscribed topics with no allowlist filtering. The solver topics are heavy (~5 MB blueprint, retained-and-resent-on-reconnect), so consider capping buffer entries for the blueprint topic OR excluding it from the buffer. Trade-off: blueprint excluded means replay doesn't reconstruct the solver pill's state precisely. Acceptable for Phase 0; revisit when replay UX matures.
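Both options (capping or excluding the blueprint topic) fit in one small helper, sketched below under the assumption that the replay buffer is an array of topic/payload records. BufferedMessage and bufferMessage are hypothetical names and do not describe the existing buffer implementation.

```typescript
const BLUEPRINT_TOPIC = "uci-demo/solver/blueprint";

// Hypothetical record shape for a buffered bus message.
interface BufferedMessage {
  readonly topic: string;
  readonly payloadText: string;
}

/**
 * Returns a new buffer with `msg` appended, keeping at most `maxBlueprints`
 * blueprint entries (oldest dropped first). Pass 0 to exclude blueprints.
 */
function bufferMessage(
  buffer: readonly BufferedMessage[],
  msg: BufferedMessage,
  maxBlueprints: number,
): BufferedMessage[] {
  if (msg.topic !== BLUEPRINT_TOPIC) return [...buffer, msg];
  if (maxBlueprints <= 0) return [...buffer];
  const next = [...buffer, msg];
  let extra = next.filter((m) => m.topic === BLUEPRINT_TOPIC).length - maxBlueprints;
  return next.filter((m) => {
    if (extra > 0 && m.topic === BLUEPRINT_TOPIC) {
      extra--;
      return false; // drop the oldest surplus blueprint
    }
    return true;
  });
}
```

A cap of 1 keeps replay able to show the latest blueprint while bounding memory to a single multi-megabyte entry.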
Strand 8 — Tests¶
services/solver-daemon/test/daemon.test.ts¶
- Synthetic 1k-iter convergence smoke: build the Tripwire adapter, run 1000 iterations against the dynamics, assert regretBlue.size() > 0 (info-sets were touched) and deltaNorm < 1.0 (not stuck at initial uniform). With the loop as written in the Strand 4 pseudocode, the first snapshot is only taken at iteration 1000, so this assertion needs a snapshot taken earlier in the run.
- Status payload shape: mock the bus client, run 200 iterations, capture published uci-demo/solver/status messages. Assert exactly 2 status emissions (at iter 100 and 200). Assert each has the version tags.
- Blueprint cadence: run 2000 iterations, assert exactly 2 blueprint emissions (at iter 1000 and 2000). Deserialize each via deserializeBlueprint from @uci-demo/solver; it must not throw.
- Delta-norm decreasing: after warming up (5k iterations), the delta-norm at iter 5000 should be smaller than at an earlier iteration. (Probabilistic: tolerate one failure with a seed sweep, similar to the Kuhn correctness gate.)
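The "tolerate one failure with a seed sweep" rule for the probabilistic test can be expressed as a small helper, sketched here. passesSeedSweep is a hypothetical name, and the check callback stands in for running the daemon with a given seed and comparing delta-norms.

```typescript
/**
 * Runs `check` once per seed and passes if at most `maxFailures` seeds fail.
 * `check` should return true when the property held for that seed.
 */
function passesSeedSweep(
  seeds: readonly number[],
  check: (seed: number) => boolean,
  maxFailures: number = 1,
): boolean {
  let failures = 0;
  for (const seed of seeds) {
    if (!check(seed)) failures++;
  }
  return failures <= maxFailures;
}
```

In a test this would wrap the delta-norm comparison, for example asserting that passesSeedSweep over a handful of seeds returns true, so a single unlucky seed does not make the suite flaky.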
Cop-UI smoke¶
No formal test (cop-ui has no vitest harness). Manual via the acceptance gate below.
Acceptance criteria¶
- pnpm -r typecheck clean across all 12 packages plus the new solver-daemon.
- pnpm -r test passes all existing 380 tests plus the new daemon tests.
- USE_SOLVER=1 pnpm run up (or pnpm run up:solver; exact syntax per the memo's Strand 6) boots all services AND the solver-daemon. Default pnpm run up continues to NOT spawn the daemon.
- Within 10 seconds of pnpm run up:solver:
  - mosquitto_sub -t uci-demo/solver/status -C 1 shows a JSON payload with iterations > 0
  - The COP's SolverPill reads SOLVER ▸ <N> iter · <M> IS · δ <V> and the iteration count visibly increments
- Within 1 minute:
  - uci-demo/solver/blueprint has been published at least once
  - mosquitto_sub on that topic shows a JSON blob ≥ 100 KB
- Validator audit stays at 100% valid: none of the new solver topics touch the schema-bound uci/v2_5/# channels.
What this memo deliberately does not specify¶
- SolverAgent integration (PR #35): the daemon publishes; an Agent that consumes the blueprint to drive Blue decisions is a separate workstream. PR #35 adds BlueprintHolder + SolverAgent + three-way agent selection in services/copilot/.
- Red-agent service (PR #36): the bus-symmetric Red. Requires the daemon to be stable but doesn't block on it.
- Eval-harness (weeks 7-9): consumes the blueprint offline, computes true exploitability via local best-response (Lisý et al.), generates the SBIR proposal's exploitability plot.
- YAML schema extension for ScenarioTruth: the Tripwire truth is hardcoded in scenario.ts for Phase 0. Generalization lands when the daemon supports multi-scenario training.
- Multi-process scaling: single Node process for Phase 0. Worker thread or process-level parallelism (multiple daemons with periodic regret-table merge) is a Phase II problem.
- Local best-response exploitability: proxied via delta-norm. True LBR computation is a substantial standalone module.
Open questions to resolve before code lands¶
- Blueprint retention behavior on broker restart. Retained messages survive subscriber disconnects but are lost when Mosquitto restarts, unless broker persistence is enabled. Acceptable for Phase 0 (the daemon re-emits at the next blueprint cadence, within 1000 iterations, which is about 1 second at the target throughput). Reviewer can flag if a persistent blueprint store is needed.
- Delta-norm sensitivity to newly discovered info-sets. When the regret table sees new info-sets during a window, the snapshot map doesn't have rows for them, so they're skipped in the delta calc. Slight underestimate but consistent.
- Solver shutdown behavior. Should SIGINT mid-batch publish a final status with a warm=false-equivalent plus a shuttingDown flag, or just stop cleanly without notifying? Recommend the latter for simplicity; downstream consumers treat stale heartbeats as offline (already specified above).
- TopStrip space. The current TopStrip has scenario name, ROE, comms, mute indicator, scenario menu. Adding one more indicator is fine on standard viewports but might wrap on narrow ones. The memo's recommended slot (between ROE and comms) should work; reviewer can verify visually.
None block writing the rest of the package.