
Design — services/solver-daemon/ + SolverPill (Phase 0 weeks 5-7)

Pre-code design memo for the workstream that makes the @uci-demo/solver kernel visibly alive in the demo. The kernel ships (PR #32); doctrinal subroutines + bank ship (PR #33); now we run ES-MCCFR continuously and publish the blueprint so the SolverAgent (next PR) and eval-harness (week 7-9) have a real trained policy to consume.

This PR is deliberately scoped to the daemon plus the visible proof-of-life pill. SolverAgent integration and red-agent are follow-ups in PR #35 and #36.

The headline acceptance gate: a reviewer running USE_SOLVER=1 pnpm run up (pnpm run up:solver under Strand 6's recommended two-script form) sees a SOLVER ▸ 12.4K iter · 3,217 IS · δ 0.014 pill in the top-right of the COP, with the iteration counter incrementing. Without that visible proof, the algorithmic story remains invisible to anyone who doesn't read the code.


Strand 1 — services/solver-daemon/

What it does (one paragraph)

A standalone Node service that loads a synthetic version of the Tripwire scenario, builds a GameDynamics over it via @uci-demo/game, and runs iterate() from @uci-demo/solver continuously. Every ~100 iterations it publishes a status heartbeat to uci-demo/solver/status (retained). Every ~1000 iterations it publishes the current trained blueprint to uci-demo/solver/blueprint (retained). The daemon never reads from uci/v2_5/* — its world is the synthetic dynamics, not the live bus. The blueprint it publishes is the contract; downstream consumers (SolverAgent in PR #35, eval-harness in week 7-9) subscribe and consume in-memory without RPC.

What it does NOT do

  • Drive Blue decisions. PR #35 ships SolverAgent consuming the published blueprint via a BlueprintHolder module. The daemon's only output is the blueprint itself.
  • Read live bus traffic. Self-play happens against synthetic dynamics. Live integration is the SolverAgent's job.
  • Run Red. The red-agent service is PR #36.
  • Compute true exploitability. Tactical-scale exact best-response is intractable. We publish a delta-norm proxy (root-mean-square change in Blue's average strategy over the last 1k iterations) as the visible convergence signal. Real exploitability via local best-response (Lisý et al.) lands with the eval-harness.

Service shape

services/solver-daemon/
├── package.json          # workspace deps on @uci-demo/{bus,game,solver}
├── tsconfig.json
├── src/
│   ├── main.ts           # entry point, env parsing, bus connect, kick off daemon
│   ├── daemon.ts         # iteration loop, heartbeat + blueprint publish
│   ├── scenario.ts       # Tripwire YAML → GameState + GameDynamics adapter
│   └── deltaNorm.ts      # convergence-proxy helper
└── test/
    └── daemon.test.ts    # 1k-iter convergence smoke + publish-cadence check

Strand 2 — MQTT side-channels

Both topics are under uci-demo/solver/... (free-form JSON, the same convention as the existing uci-demo/world/... and uci-demo/copilot/... channels).

uci-demo/solver/status (retained)

Published every STATUS_INTERVAL_ITER = 100 iterations.

{
  "schemaV": 1,
  "iterations": 12400,
  "infoSetCount": 3217,
  "deltaNorm": 0.0142,
  "warm": true,           // true once iterations >= WARM_THRESHOLD (5000)
  "scenarioName": "OPERATION TRIPWIRE",
  "ts": 1779320000000,
  "beliefV": 1,
  "payoffV": 2,
  "infosetV": 1,
  "solverSchemaVersion": 1   // matches SCHEMA_VERSION from @uci-demo/solver
}
  • iterations: monotonically increasing.
  • infoSetCount: regretTableBlue.size() — visible memory proxy. Reaches a stable plateau as the policy covers the reachable info-set space.
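  • deltaNorm: root-mean-square per-action difference between σ̄_now and σ̄_1000_ago (the L2 norm of the difference divided by the square root of the number of compared entries), pooled across info-sets present in both. Decreasing curve = the average strategy is converging.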
  • warm: gate flag for SolverAgent — when false, the agent rationale should tag traces as "cold-start; using fallback."
  • Version tags: every payload embeds the four version constants from @uci-demo/game + @uci-demo/solver. Stale UI or SolverAgent can refuse to trust mismatched payloads.
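A minimal sketch of that refusal check, assuming the status field names shown above. EXPECTED_VERSIONS, SolverStatusPayload, and isTrustedStatus are hypothetical names; in real code the expected numbers would be the constants exported by @uci-demo/game and @uci-demo/solver rather than literals.

```typescript
// Hypothetical expected version tags; real values would come from the
// constants exported by @uci-demo/game and @uci-demo/solver.
const EXPECTED_VERSIONS = {
  beliefV: 1,
  payoffV: 2,
  infosetV: 1,
  solverSchemaVersion: 1,
} as const;

interface SolverStatusPayload {
  schemaV: number;
  iterations: number;
  infoSetCount: number;
  deltaNorm: number;
  warm: boolean;
  scenarioName: string;
  ts: number;
  beliefV: number;
  payoffV: number;
  infosetV: number;
  solverSchemaVersion: number;
}

/** Structural check plus exact version-tag match; anything else is untrusted. */
function isTrustedStatus(x: unknown): x is SolverStatusPayload {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Record<string, unknown>;
  const numericFields = [
    "schemaV", "iterations", "infoSetCount", "deltaNorm", "ts",
    "beliefV", "payoffV", "infosetV", "solverSchemaVersion",
  ];
  if (!numericFields.every((k) => typeof o[k] === "number")) return false;
  if (typeof o.warm !== "boolean" || typeof o.scenarioName !== "string") return false;
  // Every version tag must match; a stale producer or consumer is rejected.
  return (Object.keys(EXPECTED_VERSIONS) as (keyof typeof EXPECTED_VERSIONS)[])
    .every((k) => o[k] === EXPECTED_VERSIONS[k]);
}
```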

uci-demo/solver/blueprint (retained)

Published every BLUEPRINT_INTERVAL_ITER = 1000 iterations. Payload is the JSON produced by serializeBlueprint() from @uci-demo/solver: already version-tagged, with the Float32Array buffers base64-encoded, and deserializeBlueprint() refuses to load it on a version mismatch.

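Size: ~5 MB raw for tactical-scale tables (10⁴ info-sets × 64 actions × 4 bytes × 2 tables = 5.12 MB), ~7 MB after base64 inflation (×4/3). Mosquitto handles this fine. SolverAgent must keep the deserialize from blocking the event loop, and queueMicrotask does not achieve that: a microtask still runs synchronously on the main thread before any pending I/O. The parse needs to move off the main thread, e.g. into a worker thread (PR #35 concern).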
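The size arithmetic, as a sketch. It assumes (hypothetically) that each of the two Float32Array tables is base64-encoded on its own and ignores the JSON envelope; standard padded base64 emits 4 characters per 3 input bytes, rounded up. base64Length and estimateBlueprintBytes are illustrative names.

```typescript
/** Length in characters of the standard (padded) base64 encoding of n bytes. */
function base64Length(nBytes: number): number {
  return 4 * Math.ceil(nBytes / 3);
}

/**
 * Rough blueprint size: `tables` dense Float32Array buffers of
 * infoSets × actions entries, each base64-encoded separately.
 * Ignores JSON keys and version tags, which are negligible by comparison.
 */
function estimateBlueprintBytes(
  infoSets: number,
  actions: number,
  tables = 2,
): { raw: number; encoded: number } {
  const perTable = infoSets * actions * Float32Array.BYTES_PER_ELEMENT;
  return { raw: perTable * tables, encoded: base64Length(perTable) * tables };
}

const est = estimateBlueprintBytes(10_000, 64);
console.log(est); // { raw: 5120000, encoded: 6826672 }
```

At the 1k iter/s throughput target, BLUEPRINT_INTERVAL_ITER = 1000 works out to roughly one publish of this size per second.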

What's NOT published yet

  • Per-side blueprints. The current serializeBlueprint envelope carries Blue's regret/avg-strategy only (red's is held in redRegret but not serialized). When PR #36 ships the red-agent, we'll either ship two retained envelopes (one per player) OR extend the schema (and bump SCHEMA_VERSION to 2).

Strand 3 — Synthetic scenario adapter

The daemon doesn't drive an MQTT world. It builds an in-memory GameState from the Tripwire YAML once at startup, plus a ScenarioTruth (the hidden identity/threat-type/fuel map), and calls createTacticalDynamics() to get a callable simulator.

Where the truth comes from

The Tripwire YAML (scenarios/counter-uas-tripwire.yaml) defines spawn events with identity and threatType per track. For the daemon's purposes, that IS the truth — we mirror those fields into ScenarioTruth.trueIdentity and ScenarioTruth.trueThreatType.
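A sketch of that mirroring step, assuming ScenarioTruth can be built from two maps keyed by track id. SpawnEvent, TruthMaps, and truthFromSpawns are hypothetical stand-ins, not the real types from the scenario loader or @uci-demo/game; rejecting duplicate track ids is likewise an assumed guard.

```typescript
// Hypothetical shapes; the real SpawnEvent and ScenarioTruth types may differ.
interface SpawnEvent {
  readonly trackId: string;
  readonly identity: string;   // as written in the YAML spawn event
  readonly threatType: string; // as written in the YAML spawn event
}

interface TruthMaps {
  readonly trueIdentity: ReadonlyMap<string, string>;
  readonly trueThreatType: ReadonlyMap<string, string>;
}

/** Mirror per-track identity/threatType from spawn events into truth maps. */
function truthFromSpawns(spawns: readonly SpawnEvent[]): TruthMaps {
  const trueIdentity = new Map<string, string>();
  const trueThreatType = new Map<string, string>();
  for (const s of spawns) {
    // A track spawned twice would make "the truth" ambiguous, so fail loudly.
    if (trueIdentity.has(s.trackId)) {
      throw new Error(`duplicate spawn for track ${s.trackId}`);
    }
    trueIdentity.set(s.trackId, s.identity);
    trueThreatType.set(s.trackId, s.threatType);
  }
  return { trueIdentity, trueThreatType };
}
```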

Effector fuel truth: each EFFECTOR_* asset starts at 1.0 (NORMAL) and drains by 0.05 per engage. The daemon's trueFuel map is updated by the dynamics' apply step, which already tracks fuelBurnedTotal in counters. That counter is a sum across all effectors, so projecting it back to per-effector fuel fractions also requires knowing how many engages each effector made.
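A sketch of the per-effector projection. It assumes a hypothetical per-effector engage counter (engageCounts) is available, since the aggregate fuelBurnedTotal cannot be split back across effectors on its own; projectFuel and the two constants are illustrative names.

```typescript
const FUEL_START = 1.0;       // NORMAL
const FUEL_PER_ENGAGE = 0.05; // drain per engage

/**
 * Project per-effector engage counts to fuel fractions, clamped at 0.
 * `engageCounts` is a hypothetical per-effector counter; effectors with
 * no entry are treated as never having engaged.
 */
function projectFuel(
  effectorIds: readonly string[],
  engageCounts: ReadonlyMap<string, number>,
): Map<string, number> {
  const fuel = new Map<string, number>();
  for (const id of effectorIds) {
    const engages = engageCounts.get(id) ?? 0;
    fuel.set(id, Math.max(0, FUEL_START - FUEL_PER_ENGAGE * engages));
  }
  return fuel;
}
```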

For Phase 0 simplicity, hardcode the Tripwire truth in the adapter. A YAML schema extension lands as a follow-up when we generalize to multi-scenario daemons.

Adapter signature

export interface SolverScenario {
  readonly name: string;
  readonly initialState: GameState;
  readonly truth: ScenarioTruth;
  readonly dynamics: GameDynamics;
}

export function loadTripwireScenario(): SolverScenario;

The adapter:
  1. Builds the WorldSnapshot from the Tripwire spawn events
  2. Builds the ScenarioTruth from the YAML identity/threatType
  3. Calls buildGameState({world, truth, ...}) from @uci-demo/game
  4. Calls createTacticalDynamics() for the dynamics
  5. Returns the bundle

Reset / episode boundaries

The daemon iterates against this single static initial state. dynamics.isTerminal(state) triggers episode reset; the inner loop in @uci-demo/solver's iterate() already handles this via rootStateFactory(rng). The daemon's rootStateFactory returns the adapter's initialState (immutable) and lets apply() walk the tree from there.


Strand 4 — Daemon iteration loop

Cadence

  • Inner ES-MCCFR loop: BATCH_ITERATIONS = 50 iterations per loop tick (a small batch keeps the event loop responsive)
  • Outer batch loop: schedules the next batch via setImmediate() so we don't starve the event loop
  • Status publish: every STATUS_INTERVAL_ITER = 100 iterations (so every 2 batches)
  • Blueprint publish: every BLUEPRINT_INTERVAL_ITER = 1000 iterations
  • Delta-norm computation: every 100 iterations, comparing current avg-strategy to a snapshot taken 1000 iterations ago. Stored in a small ring buffer
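The snapshot ring buffer can be as small as the following sketch: a generic fixed-capacity buffer (RingBuffer is a hypothetical name) where push() overwrites the oldest slot once full. With one avg-strategy snapshot per status publish and capacity 10, oldest() is the snapshot from ~1000 iterations earlier.

```typescript
/** Fixed-capacity ring buffer: push() overwrites the oldest entry once full. */
class RingBuffer<T> {
  private readonly slots: (T | undefined)[];
  private next = 0;  // index the next push() writes to
  private count = 0; // number of filled slots, capped at capacity

  constructor(readonly capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new Error("capacity must be a positive integer");
    }
    this.slots = new Array<T | undefined>(capacity).fill(undefined);
  }

  push(item: T): void {
    this.slots[this.next] = item;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  get full(): boolean {
    return this.count === this.capacity;
  }

  /** Oldest retained entry, or undefined when empty. */
  oldest(): T | undefined {
    if (this.count === 0) return undefined;
    // When full, the oldest entry is the one the next push() will overwrite;
    // before that, entries were written from index 0 upward.
    return this.full ? this.slots[this.next] : this.slots[0];
  }
}
```

The daemon would read oldest() before pushing the current snapshot and treat a not-yet-full buffer as "no signal yet", so at iteration 1100 it compares against the iteration-100 snapshot.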

Pseudocode

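let totalIterations = 0;
let shuttingDown = false;  // flipped by the SIGINT handler in main.ts
const blueRegret = createRegretTable(64);
const redRegret = createRegretTable(64);
// Avg-strategy snapshots, one per status publish. 10 slots × 100 iterations
// means the oldest slot is ~1000 iterations behind once the buffer is full.
const SNAPSHOT_SLOTS = 1000 / STATUS_INTERVAL_ITER;
const snapshots: Map<bigint, Float32Array>[] = [];

function tick() {
  if (shuttingDown) return;
  iterate({
    iterations: BATCH_ITERATIONS,
    dynamics: scenario.dynamics,
    rootStateFactory: () => scenario.initialState,
    rng: mulberry32(SEED + totalIterations),  // deterministic per-batch seed
    regretBlue: blueRegret,
    regretRed: redRegret,
  });
  totalIterations += BATCH_ITERATIONS;
  // The % checks below assume BATCH_ITERATIONS divides both publish intervals.

  if (totalIterations % STATUS_INTERVAL_ITER === 0) {
    // Compare against the snapshot from ~1000 iterations ago; an empty map
    // makes computeDeltaNorm return its 1.0 "no signal yet" sentinel.
    const windowStart =
      snapshots.length === SNAPSHOT_SLOTS ? snapshots[0] : new Map<bigint, Float32Array>();
    publishStatus({
      iterations: totalIterations,
      infoSetCount: blueRegret.size(),
      deltaNorm: computeDeltaNorm(blueRegret, windowStart),
      warm: totalIterations >= WARM_THRESHOLD,
      // ... version tags ...
    });
    snapshots.push(snapshotAvgStrategy(blueRegret));
    if (snapshots.length > SNAPSHOT_SLOTS) snapshots.shift();  // drop the oldest
  }

  if (totalIterations % BLUEPRINT_INTERVAL_ITER === 0) {
    publishBlueprint(blueRegret);
  }

  setImmediate(tick);  // yield to pending I/O (MQTT), then run the next batch
}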

Convergence proxy: delta-norm

// RMS of the per-action change in Blue's average strategy, over info-sets present in both tables.
function computeDeltaNorm(table: RegretTable, prev: Map<bigint, Float32Array>): number {
  let sumSq = 0;
  let count = 0;
  for (const { info, avgStrategy } of table.entries()) {
    const prevRow = prev.get(info);
    if (!prevRow) continue;
    for (let i = 0; i < avgStrategy.length; i++) {
      const d = avgStrategy[i] - prevRow[i];
      sumSq += d * d;
      count++;
    }
  }
  return count > 0 ? Math.sqrt(sumSq / count) : 1.0;  // 1.0 = "no signal yet"
}

As the policy approaches Nash, consecutive averages converge and the L2 norm shrinks. Reaches ~0.001 in well-converged tables.
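Why δ falls even while the current strategy is still moving: σ̄ is a running weighted average, so each update shifts it by only the newest weight's share. A short derivation for a single info-set, assuming w_t is the averaging weight added at that info-set's t-th update and W_t = Σ_{s≤t} w_s (uniform averaging is w_t = 1, W_t = t):

```latex
\bar\sigma_t = \frac{W_{t-1}\,\bar\sigma_{t-1} + w_t\,\sigma_t}{W_t}
\quad\Longrightarrow\quad
\bar\sigma_t - \bar\sigma_{t-1} = \frac{w_t}{W_t}\,\bigl(\sigma_t - \bar\sigma_{t-1}\bigr)

% Both \sigma_t and \bar\sigma_{t-1} are probability vectors, so
% \|\sigma_t - \bar\sigma_{t-1}\|_2 \le \sqrt{2}. Summing over a k-update window:
\|\bar\sigma_{t+k} - \bar\sigma_t\|_2 \;\le\; \sqrt{2}\sum_{s=t+1}^{t+k} \frac{w_s}{W_s}
\;\overset{w_s = 1}{\le}\; \frac{\sqrt{2}\,k}{t+1}
```

With uniform weights, t = 11,400 and k = 1,000 give a per-info-set bound of about 0.124, and the bound keeps shrinking like 1/t whether or not σ̄ is near equilibrium. A falling δ therefore shows the average stabilizing rather than low exploitability, which is why true exploitability waits for the eval-harness.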

Why setImmediate, not setInterval

setImmediate() yields control to I/O (the MQTT publish and any incoming messages) between batches without artificially throttling the iteration rate, so throughput is bound by the kernel's per-iteration cost rather than by a wall-clock interval. setInterval would instead cap the rate at one batch per interval, however fast the kernel runs.

Throughput target

Phase 0 acceptance: 1k iterations / second of wall-clock on commodity dev hardware (M1/M2 MacBook). At that rate, the daemon reaches "warm" (5000 iterations) in ~5 seconds and ships its first blueprint (iteration 1000) after ~1 second, then publishes a ~5 MB blueprint roughly every second.
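One way to check the throughput target without a profiler, as a hypothetical helper that is not part of the planned package: feed it (timestamp, totalIterations) pairs, for example at each status publish, and read a sliding-window rate. RateMeter and its 10-second default window are assumptions.

```typescript
/** Iterations per second over a sliding wall-clock window of samples. */
class RateMeter {
  private samples: { t: number; iters: number }[] = [];

  constructor(private readonly windowMs = 10_000) {}

  record(tMs: number, totalIterations: number): void {
    this.samples.push({ t: tMs, iters: totalIterations });
    // Drop samples older than the window, but keep at least two for a rate.
    while (this.samples.length > 2 && tMs - this.samples[0].t > this.windowMs) {
      this.samples.shift();
    }
  }

  /** Rate across the retained samples, or 0 with fewer than two samples. */
  perSecond(): number {
    if (this.samples.length < 2) return 0;
    const first = this.samples[0];
    const last = this.samples[this.samples.length - 1];
    const dt = last.t - first.t;
    return dt > 0 ? ((last.iters - first.iters) * 1000) / dt : 0;
  }
}
```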


Strand 5 — SolverPill UI component

Slot

Top-right of TopStrip, next to the existing ROE indicator. The TopStrip currently shows scenario name + ROE + comms; the pill goes between ROE and comms.

Props

interface SolverPillProps {
  readonly status: SolverStatus | null;
}

The pill is a pure render of its status prop; its parent reads state.solverStatus from the zustand store and passes it in.

Visual spec

When status === null (daemon offline):

[ SOLVER ▸ OFFLINE ]

In --color-fg-faint, no glow.

When status is set but !warm (cold-start):

[ SOLVER ▸ 2.1K iter · 412 IS · δ — ]

In --color-amber with phosphor glow. The δ slot shows "—" during cold start, when the delta-norm window may not yet hold a prior snapshot.

When status.warm === true (running normally):

[ SOLVER ▸ 12.4K iter · 3,217 IS · δ 0.014 ]

In --color-cyan with phosphor glow. The δ value renders in:
  • --color-grant when δ < 0.02 (converged-ish)
  • --color-cyan when 0.02 ≤ δ < 0.1 (training)
  • --color-amber when δ ≥ 0.1 (early-training)

Stale heartbeat (last ts > 30s ago):

[ SOLVER ▸ STALLED 12.4K iter ]

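In --color-threat. Detection compares the ts field of the most recent status payload against the browser's Date.now(). A stalled daemon publishes nothing, so the check has to run on a timer (e.g. once per second) rather than only when a message arrives; it also assumes the daemon and browser clocks roughly agree.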
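A sketch of the state selection as a pure function, assuming the thresholds above. PillStatus, PillVariant, and pillVariant are hypothetical names, and checking staleness before warm (so a dead, previously warm daemon reads STALLED) is an assumed precedence the spec implies but does not state.

```typescript
// Minimal status shape for this sketch (a subset of the store's SolverStatus).
interface PillStatus {
  readonly iterations: number;
  readonly deltaNorm: number;
  readonly warm: boolean;
  readonly ts: number; // epoch ms stamped by the daemon
}

type PillVariant =
  | { kind: "offline"; color: "--color-fg-faint" }
  | { kind: "stalled"; color: "--color-threat" }
  | { kind: "cold"; color: "--color-amber" }
  | { kind: "running"; color: "--color-cyan"; deltaColor: string };

const STALE_AFTER_MS = 30_000;

function deltaColor(delta: number): string {
  if (delta < 0.02) return "--color-grant"; // converged-ish
  if (delta < 0.1) return "--color-cyan";   // training
  return "--color-amber";                   // early training
}

/** Staleness is checked before warm, so a dead warm daemon reads STALLED. */
function pillVariant(status: PillStatus | null, nowMs: number): PillVariant {
  if (status === null) return { kind: "offline", color: "--color-fg-faint" };
  if (nowMs - status.ts > STALE_AFTER_MS) return { kind: "stalled", color: "--color-threat" };
  if (!status.warm) return { kind: "cold", color: "--color-amber" };
  return { kind: "running", color: "--color-cyan", deltaColor: deltaColor(status.deltaNorm) };
}
```

Passing nowMs in, instead of reading Date.now() inside, keeps the function pure and testable.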

Iteration formatter

function fmtIter(n: number): string {
  if (n < 1000) return n.toString();
  // Switch to "M" at 999,950: from there, toFixed(1) would round up to "1000.0K".
  if (n < 999_950) return (n / 1000).toFixed(1) + "K";
  return (n / 1_000_000).toFixed(2) + "M";
}

12_437 → 12.4K. 1_412_000 → 1.41M.

InfoSet count formatter

function fmtIS(n: number): string {
  return n.toLocaleString("en-US");  // 3217 → "3,217" regardless of the viewer's locale
}

Tooltip

title attribute with full breakdown: "iterations=12437 IS=3217 δ=0.0142 warm=true scenario=OPERATION TRIPWIRE payoffV=2 beliefV=1 infosetV=1"


Strand 6 — USE_SOLVER=1 gating

The default pnpm run up should NOT spawn the daemon — existing demo behavior is preserved. The daemon is opt-in via env flag:

USE_SOLVER=1 pnpm run up   # spawns the solver-daemon alongside everything else
pnpm run up                 # existing behavior; daemon stays off

Implementation

Two approaches:

A. Conditional concurrently argv (clean, declarative).

Update root package.json's up script to spawn the daemon conditionally via a small shell expansion:

"up": "docker compose up -d mosquitto && concurrently ... $([ \"$USE_SOLVER\" = \"1\" ] && echo '--names VAL,SIM,CP,UI,ADSB,SLV --prefix-colors cyan,yellow,magenta,green,blue,violet' || echo '--names VAL,SIM,CP,UI,ADSB --prefix-colors cyan,yellow,magenta,green,blue') ... \"pnpm --filter @uci-demo/validator start\" ... $([ \"$USE_SOLVER\" = \"1\" ] && echo '\"pnpm --filter @uci-demo/solver-daemon start\"' || echo '')"

That's ugly. Reject.

B. Two scripts (clearest).

"up": "...existing concurrently list, no daemon...",
"up:solver": "... existing list + '\"pnpm --filter @uci-demo/solver-daemon start\"' ..."

Operator runs pnpm run up:solver when they want the solver. Two scripts, no shell magic, completely transparent.

Recommendation: B. Under B, USE_SOLVER=1 pnpm run up is spelled pnpm run up:solver. Add a one-line note in README.md / CLAUDE.md explaining when to use which.

Why not a runtime flag inside the daemon?

We could ship the daemon always-on and have it gate its own heartbeat on an env var. But that wastes CPU on a service that's mostly unused in dev. The two-script approach matches the "intentional gates" discipline from CLAUDE.md.


Strand 7 — Cop-UI integration

Store extensions (lib/store.ts)

interface SolverStatus {
  readonly iterations: number;
  readonly infoSetCount: number;
  readonly deltaNorm: number;
  readonly warm: boolean;
  readonly scenarioName: string;
  readonly ts: number;
  readonly beliefV: number;
  readonly payoffV: number;
  readonly infosetV: number;
  readonly solverSchemaVersion: number;
}

interface CopState {
  // ...existing fields
  solverStatus: SolverStatus | null;
  /** Most recently received blueprint blob (raw JSON string) — for replay/debug. */
  solverBlueprintJson: string | null;
  setSolverStatus: (s: SolverStatus | null) => void;
  setSolverBlueprintJson: (json: string) => void;
}

Bus subscriber additions (lib/busSubscriber.ts)

Two new dispatch branches:

if (topic === "uci-demo/solver/status") {
  try {
    const json = JSON.parse(payloadText) as SolverStatus & { schemaV?: number };
    if (typeof json.iterations === "number" && typeof json.ts === "number") {
      useCop.getState().setSolverStatus(json);
    }
  } catch (err) { console.error("[cop-ui] solver/status parse error:", err); }
  return;
}

if (topic === "uci-demo/solver/blueprint") {
  // Just store the JSON; SolverAgent (PR #35) deserializes when it ships.
  if (payloadText.length > 0) {
    useCop.getState().setSolverBlueprintJson(payloadText);
  }
  return;
}

Subscribe list extension (currently uci-demo/copilot/#, uci-demo/world/#, uci-demo/scenario/#): add uci-demo/solver/#.

Replay buffer

The existing message buffer captures all subscribed topics with no allowlist filtering. The blueprint topic is heavy (~5 MB, retained and re-sent on every reconnect), so consider capping buffer entries for it or excluding it from the buffer; the status heartbeat is small and can stay. Trade-off: with the blueprint excluded, replay cannot reconstruct solverBlueprintJson, though the pill (which reads only the status topic) still replays. Acceptable for Phase 0; revisit when replay UX matures.
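A sketch of the exclusion option, assuming a simple array-backed buffer. ReplayEntry, REPLAY_EXCLUDED_TOPICS, and appendReplay are hypothetical; the real cop-ui buffer may be shaped differently.

```typescript
// Hypothetical replay-buffer entry; the real cop-ui buffer may differ.
interface ReplayEntry {
  readonly topic: string;
  readonly payloadText: string;
  readonly receivedAt: number; // epoch ms
}

// Topics too heavy to keep in the replay buffer.
const REPLAY_EXCLUDED_TOPICS: ReadonlySet<string> = new Set([
  "uci-demo/solver/blueprint",
]);

/** Append unless the topic is excluded; returns whether the entry was kept. */
function appendReplay(buffer: ReplayEntry[], entry: ReplayEntry): boolean {
  if (REPLAY_EXCLUDED_TOPICS.has(entry.topic)) return false;
  buffer.push(entry);
  return true;
}
```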


Strand 8 — Tests

services/solver-daemon/test/daemon.test.ts

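  • Synthetic convergence smoke: build the Tripwire adapter, run 1,100 iterations against the dynamics (the first status emission that has a prior snapshot to compare against), and assert regretBlue.size() > 0 (info-sets were touched) and 0 < deltaNorm < 1.0 (below the 1.0 "no signal yet" sentinel, and the average strategy actually moved).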
  • Status payload shape: mock the bus client, run 200 iterations, capture published uci-demo/solver/status messages. Assert exactly 2 status emissions (at iter 100 and 200). Assert each has the version tags.
  • Blueprint cadence: run 2000 iterations, assert exactly 2 blueprint emissions (at iter 1000 and 2000). Deserialize each via deserializeBlueprint from @uci-demo/solver — must not throw.
  • Delta-norm decreasing: after warming up (5k iterations), the delta-norm at iter 5000 should be smaller than at iter
    (Probabilistic: tolerate one failure with a seed sweep, similar to the Kuhn correctness gate.)

Cop-UI smoke

No formal test (cop-ui has no vitest harness). Manual via the acceptance gate below.


Acceptance criteria

  1. pnpm -r typecheck clean across all 12 packages plus the new solver-daemon.
  2. pnpm -r test passes all existing 380 tests plus the new daemon tests.
  3. USE_SOLVER=1 pnpm run up (or pnpm run up:solver — exact syntax per the memo's Strand 6) boots all services AND the solver-daemon. Default pnpm run up continues to NOT spawn the daemon.
  4. Within 10 seconds of pnpm run up:solver:
      • mosquitto_sub -t uci-demo/solver/status -C 1 shows a JSON payload with iterations > 0
      • The COP's SolverPill reads SOLVER ▸ <N> iter · <M> IS · δ <V> and the iteration count visibly increments
  5. Within 1 minute:
      • uci-demo/solver/blueprint has been published at least once
      • mosquitto_sub on that topic shows a JSON blob ≥ 100 KB
  6. Validator audit stays at 100% valid: none of the new solver topics touch the schema-bound uci/v2_5/# channels.

What this memo deliberately does not specify

  • SolverAgent integration (PR #35) — the daemon publishes; an Agent that consumes the blueprint to drive Blue decisions is a separate workstream. PR #35 adds BlueprintHolder + SolverAgent + three-way agent selection in services/copilot/.
  • Red-agent service (PR #36) — the bus-symmetric Red. Requires the daemon to be stable but doesn't block on it.
  • Eval-harness (weeks 7-9) — consumes the blueprint offline, computes true exploitability via local best-response (Lisý et al.), generates the SBIR proposal's exploitability plot.
  • YAML schema extension for ScenarioTruth — the Tripwire truth is hardcoded in scenario.ts for Phase 0. Generalization lands when the daemon supports multi-scenario training.
  • Multi-process scaling — single Node process for Phase 0. Worker thread or process-level parallelism (multiple daemons with periodic regret-table merge) is a Phase II problem.
  • Local best-response exploitability — proxied via delta-norm. True LBR computation is a substantial standalone module.

Open questions to resolve before code lands

  1. Blueprint retention behavior on broker restart. Retained messages survive subscriber disconnects but are lost when Mosquitto restarts unless broker persistence is enabled (persistence true in mosquitto.conf). Acceptable for Phase 0: the daemon re-emits on the next blueprint cadence, about 1 second later at the 1k iter/s target. Reviewer can flag if a persistent blueprint store is needed.
  2. Delta-norm sensitivity to info-set growth. When the regret table sees new info-sets during a window, the snapshot map doesn't have rows for them, so they're skipped in the delta calc. Slight underestimate but consistent.
  3. Solver shutdown behavior. SIGINT mid-batch should publish a final status with warm=false-equivalent + a shuttingDown flag? Or just stop cleanly without notifying? Recommend the latter for simplicity; downstream consumers treat stale heartbeats as offline (already specified above).
  4. TopStrip space. The current TopStrip has scenario name, ROE, comms, mute indicator, scenario menu. Adding a 5th indicator is fine on standard viewports but might wrap on narrow ones. The memo's recommended slot (between ROE and comms) should work; reviewer can verify visually.

None block writing the rest of the package.