Skip to content

Design — @uci-demo/solver (Phase 0 weeks 3-6)

Pre-code design memo for the solver workstream that follows the @uci-demo/game fill-out. This package contains the algorithmic heart of the SBIR proposal: an External-Sampling Monte Carlo Counterfactual Regret Minimization (ES-MCCFR) kernel, regret + average-strategy tables, eight doctrinal subroutines, a softmax strategy bank, and a versioned blueprint serializer.

The point of writing this memo first is the same as for the game memo: fix the algorithm, the table layout, and the correctness gate before parallel implementation lands. Reviewers should be able to point at a specific equation or byte layout and ask "why is this 0.4?" — not discover the choice during code review.

The correctness gate is non-negotiable: the kernel must converge to <0.01 exploitability in 10k iterations on Kuhn poker. Without that test passing no reviewer trusts anything else in this package.


ES-MCCFR algorithm

External-Sampling Monte Carlo CFR. Lanctot et al. 2009. The traverser explores every action at their decision nodes; opponent decisions and chance outcomes are sampled. This is the right variant for our setting: the tactical action space at T1 is small (5×11=55 for Blue), so exhaustive traversal of the traverser's actions is cheap, but the chance node space (sensor confidence, effect outcomes) is large enough that sampling beats enumeration.

Update equation

For traverser i walking history h:

v(h) =
  // Terminal
  u_i(h)                                            if h is terminal

  // Chance node
  v(h ⊕ sample_a~chance)                            if h is chance

  // Opponent decision (sampled)
  v(h ⊕ sample_a~σ_{-i}(I_{-i}(h)))                 if active(h) ≠ i

  // Traverser decision (enumerated)
  Σ_a σ_i(I, a) · v(h ⊕ a)                          if active(h) = i

At each traverser decision node h with information set I = infoSetKey(h, i):

v(h, a) = cfrTraverse(h ⊕ a, i)              # value of taking action a, recursing
v(h)    = Σ_a σ_i(I, a) · v(h, a)            # node value under current strategy
R(I, a) ← max(R(I, a) + (v(h, a) − v(h)), 0) # RM+ regret update
S(I)    ← S(I) + σ_i(I)                       # average-strategy accumulator

R is the cumulative-regret table; S is the average-strategy table. The max(·, 0) clip is RM+ (Tammelin et al. 2014) — the variant that converges faster than vanilla regret matching in practice and is the standard choice for tournament-grade CFR solvers.

Regret matching+ (RM+)

Given regret vector R(I, ·):

σ_i(I, a) = max(R(I, a), 0) / Σ_b max(R(I, b), 0)     if Σ_b max(R(I, b), 0) > 0
          = 1 / |A|                                     otherwise (uniform fallback)

The denominator can only be zero when every regret has been clipped to zero on this info-set. In that case the uniform fallback prevents divide-by-zero and represents "no evidence yet — explore uniformly."

Average strategy

For ES-MCCFR the standard recipe is linear average strategy with arithmetic accumulation:

S(I, a) ← S(I, a) + σ_i(I, a)               # accumulate every traverser visit
σ̄_i(I, a) = S(I, a) / Σ_b S(I, b)            # normalize at extract time

The average strategy converges to a Nash equilibrium under regret-matching; the current strategy oscillates. The solver-daemon exposes the average strategy as its blueprint.

Outer iteration loop

function iterate(N, dynamics, rootStateFactory, rng):
  for t in 1..N:
    traverser = (t mod 2 = 0) ? "blue" : "red"
    h0 = rootStateFactory(rng)               # seeded initial state
    cfrTraverse(h0, traverser, dynamics, rng)

Alternating-traverser is the textbook recipe; both players' regret tables get updated every two iterations.

Why ES-MCCFR (not outcome-sampling, not Deep CFR)

  • Outcome-sampling (OS-MCCFR) samples the traverser's actions too — faster per iteration but higher-variance regret estimates. For our small tactical action space, enumerating the traverser's actions is cheap and gets us better regret signal per iteration.
  • Deep CFR replaces tables with neural networks. We explicitly want to keep the kernel CPU-only and interpretable; tables on a Float32Array give us exactly the auditability the RFP attribute #2 demands.
  • External-sampling is the chosen variant per the SBIR plan (plan/sbir-osw26bz02-dv004-game-theoretic-coa.md line 28).

Regret + strategy table layout

Both tables share storage strategy: a Map<bigint, Float32Array> where the key is infoSetKey(state, viewer) and the value is a fixed-length vector of regrets (or accumulated strategy mass) indexed by action index.

Action canonicalization

The actionIndex is the position of an action in the array returned by dynamics.legalActions(state). The @uci-demo/game createTacticalDynamics returns this array canonically sorted by compareActions (kind, then trackId, then effect, then effectorId) — see packages/uci-game/src/dynamics.ts. Same state ⇒ same array ⇒ same indexing. Stable.

Storage

interface RegretTable {
  /** Number of action slots per info-set; matches dynamics.legalActions max. */
  readonly maxActions: number;

  /** Get the regret value for (info, a). 0 for unseen (I, a). */
  getRegret(info: bigint, a: number): number;

  /** Add `delta` to regret(info, a), then clip to ≥0 (RM+). */
  addRegret(info: bigint, a: number, delta: number): void;

  /** Compute the RM+ strategy distribution over `legalActionCount` actions for this info-set. */
  getStrategy(info: bigint, legalActionCount: number): Float32Array;

  /** Accumulate σ into the average-strategy table, weighted by 1.0 per traverser visit. */
  accumulateAvgStrategy(info: bigint, sigma: Float32Array, legalActionCount: number): void;

  /** Normalized average strategy; the blueprint output. */
  getAverageStrategy(info: bigint, legalActionCount: number): Float32Array;

  /** Number of unique info-sets touched. Anytime-status read. */
  size(): number;
}

Why fixed-length rows. The memo on @uci-demo/game set the T1 action upper bound at ~55 (5 tracks × (1 withhold + 5 effects × 2 effectors)). We allocate maxActions = 64 per row (next power of two) and use the per-call legalActionCount to ignore the tail. Float32Array gives us O(1) random access and tight memory: 256 bytes per info-set × 2 tables × 10⁴ info-sets ≈ 5 MB — fits comfortably in any deployment.

Why bigint keys, not strings. infoSetKey already returns bigint (packages/uci-game/src/hash.ts). BigInt keys in JS Maps are fast enough for tactical-scale tables; if benchmarks show keying overhead later, the same encoding maps trivially to a number via % 2**32 with a collision-acceptance argument analogous to the FNV-1a-64 collision analysis in the game memo.

Lazy allocation. Tables are sparse — most theoretical info-sets are never reached. We allocate a row on first access only. Map.get(info) returning undefined means "no regret yet, treat as zeros, σ is uniform."

Public API surface

export function createRegretTable(maxActions: number): RegretTable;

That's it. No globals. Solver instantiates one regret table for Blue and one for Red.


Kuhn poker correctness gate

The single most important test in this package. Kuhn poker is the classic CFR test case — small enough to compute the exact Nash equilibrium, well-studied enough that any deviation from known convergence behavior signals a kernel bug.

Kuhn poker rules (canonical)

  • 3-card deck: J, Q, K.
  • Each player gets one card; the third stays hidden.
  • Round 1: P1 may check or bet 1 chip.
  • Round 2: P2 responds:
  • If P1 checked: P2 may check (showdown) or bet 1 chip (P1 then folds or calls).
  • If P1 bet: P2 may fold (P1 wins ante) or call (showdown).
  • Showdown: higher card wins the pot; ante = 1 chip each.

Game tree size

  • 6 card-deal permutations × decision tree.
  • 12 information sets total (6 per player; each player sees their card and the action history).
  • Action space ≤ 2 at every decision node.

Test methodology

1. Define KuhnDynamics implementing GameDynamics from @uci-demo/game.
   - GameState carries {dealtCards, history, terminal?, winner?}.
   - legalActions returns ["check", "bet"] or ["fold", "call"].
   - apply transitions deterministically.
   - resolveChance deals cards via rand().
   - infoSetKey hashes (myCard, history).

2. Run escfr.iterate(N=10_000, KuhnDynamics, mulberry32(seed=42)).

3. Compute exploitability via best-response search over the strategy:
   - For each player i, find max_{π_i} E[payoff | π_i, σ̄_{-i}]
   - exploitability = (BR_blue − v̄_blue) + (BR_red − v̄_red)
   - Best-response is exact tree-search (Kuhn is small).

4. assert exploitability < 0.01

Expected convergence

Iterations Expected exploitability (RM+, alternating traversers)
100 ~0.5
1,000 ~0.05
10,000 ~0.005
100,000 ~0.0005

The 10k threshold of 0.01 is loose by a factor of 2 against published benchmarks — chosen so the test runs in under a second of vitest wall time and tolerates RNG-seed variation across machines.

If 10k iterations does not converge to <0.01, the kernel is broken. There is no second test that compensates.

Reference Kuhn Nash strategy (sanity check)

The Nash equilibrium for player 1 (action probability table, for reference — useful when debugging):

Card P(check) P(bet)
J 1 - α α
Q 1 0
K 1 - 3α

where α ∈ [0, 1/3] parameterizes the family of equilibria; the average-strategy convergence picks one out of this family.

A separate unit test in kuhn.test.ts extracts σ̄(K_p1) and asserts σ̄(bet) ∈ [0, 1] (i.e. ≤ 1 and ≥ 0 — basic sanity, not convergence to a specific α).


Subroutines

The regret table is keyed over action indices (per the game's legalActions), but the strategy bank weighting works over subroutines. This is the architectural answer to RFP attribute #2 (interpretability): every output trace decomposes into a per-subroutine contribution.

Subroutine interface

export interface SubroutineContext {
  readonly info: GameState;
  readonly viewer: Player;
  readonly legalActions: readonly Action[];
}

export interface SubroutineTrace {
  readonly id: string;
  readonly weight: number;          // softmax mass assigned to this subroutine
  readonly topAction: Action | null;
  readonly rationale: string;        // ≤120 char operator-facing string
}

export interface Subroutine {
  readonly id: string;

  /** Weight signal — how strongly this subroutine wants to be heard. */
  weight(ctx: SubroutineContext): number;

  /** Probability vector over `ctx.legalActions`. Must sum to 1 (or be zero-mass when no opinion). */
  distribution(ctx: SubroutineContext): Float32Array;

  /** Human-readable trace explaining this subroutine's contribution. */
  explain(ctx: SubroutineContext): SubroutineTrace;
}

weight is a non-negative real number. The bank softmaxes the per-subroutine weights into mixing coefficients, then mixes the per-subroutine distributions.

The eight subroutines

Seeded from the existing scripted agent's factoring (services/copilot/src/scriptedAgent.ts:24-66):

Subroutine Trigger Output
IdentityGate P(FRIEND \| belief) > 0.4 Heavy mass on withhold
RoeEscalation roe === "RED" Mass shifts to kinetic effects (DEFEAT_DESTROY, ATTRIT)
SoftKillFirst roe === "AMBER"threatConfObs < 70 Mass shifts to soft-kill (DISRUPT, DEGRADE, DENY)
ReplanEscalation Last engage on this track was soft-kill ∧ track still in play Kinetic-first
JammerCounter threatTypeObs === "JAMMER" Skip EW effects (DISRUPT/DEGRADE) — mirrors scriptedAgent.ts:82-85
FratricideAvoidance P(FRIEND \| belief) > 0.25 Mass on withhold (softer than IdentityGate)
FuelConservation Any effector at fuel band CRITICAL or LOW Shift mass away from that effector
CommsDegradeHedge commsDegrade.dropPercent > 20 Prefer autonomous-capable effector (kinetic in this codebase)

Each subroutine is a ≤80-LOC TypeScript file under packages/uci-solver/src/subroutines/<id>.ts with a vitest fixture in test/subroutines/<id>.test.ts. All pure — read GameState, output a probability vector. No solver-table reads, no async, no I/O.

Why subroutines, not raw actions

A regret table keyed on actions tells you which action the solver prefers; a subroutine-decorated trace tells you why. The plan (plan/sbir-osw26bz02-dv004-game-theoretic-coa.md line 72-90) treats this as the architectural differentiator from neural CFR. We honor it: the strategy-bank explain method always emits one trace per active subroutine, surfaced to the COP via the existing uci-demo/copilot/reason/<planId> channel.


Strategy bank

The bank consumes the regret table + the subroutines and emits a single mixed policy per (infoSetKey, legalActions) query.

Composition rule

function composedPolicy(ctx: SubroutineContext, regretTable: RegretTable): Float32Array {
  const weights = subroutines.map((s) => s.weight(ctx));      // raw weights
  const mixing = softmax(weights, temperature = 1.0);          // → Σ = 1
  const dists = subroutines.map((s) => s.distribution(ctx));   // per-subroutine distributions
  // Linear mixture
  const N = ctx.legalActions.length;
  const out = new Float32Array(N);
  for (let s = 0; s < subroutines.length; s++) {
    for (let a = 0; a < N; a++) {
      out[a] += mixing[s] * dists[s][a];
    }
  }
  // Mix in the regret-table strategy (the CFR signal) at fixed weight λ = 0.6.
  const cfrStrategy = regretTable.getAverageStrategy(infoSetKey(ctx.info, ctx.viewer), N);
  for (let a = 0; a < N; a++) {
    out[a] = 0.6 * cfrStrategy[a] + 0.4 * out[a];
  }
  return out;
}

The CFR signal dominates (λ = 0.6) — the subroutines are the prior that biases behavior toward operator-doctrine when the CFR table has little or no signal yet (cold-start case). As iterations accumulate and the regret table sharpens, the subroutine contribution becomes a soft regularization rather than the primary action selector.

bank.explain(ctx, regrets)

Returns a readonly SubroutineTrace[] enumerating only the subroutines whose weight > 0.05 after softmaxing. The copilot publishes one MQTT reasoning-line per trace via the existing publishReasoningLine at services/copilot/src/main.ts:120-133.


Blueprint schema + serialization

The blueprint is what the solver-daemon ships to the SolverAgent for fast online lookup. It is a single JSON envelope containing the regret table, the average-strategy table, and version metadata.

export interface Blueprint {
  readonly schemaVersion: 1;
  readonly beliefV: number;      // BELIEF_V at training time
  readonly payoffV: number;      // PAYOFF_V at training time
  readonly infosetV: number;     // INFOSET_V at training time
  readonly iterations: number;
  readonly exploitabilityEstimate: number | null;  // null if not measured
  readonly infoSetCount: number;
  readonly maxActions: number;
  /** base64-encoded Float32Array buffers, keyed by infoSet bigint serialized as decimal string. */
  readonly regret: Record<string, string>;
  readonly avgStrategy: Record<string, string>;
}

Why JSON over Protobuf

  • The blueprint is shipped via MQTT in the live demo (uci-demo/solver/blueprint); MQTT payloads are bytes-with-text content negotiation, and base64-in-JSON is the cheapest interop.
  • Tactical-scale blueprints (10³ info-sets × 64 actions × 4 bytes × 2 tables) are ~500 KB. Compresses well; even uncompressed it fits in one MQTT message.
  • We do not need cross-language compatibility — the only consumer is another TypeScript service.

Version-tag invariant

A blueprint loaded with beliefV ≠ BELIEF_V or infosetV ≠ INFOSET_V must be refused with a clear error. This is the silently-invalidate-cache hazard the game memo also called out. The deserializer enforces the check; the solver-daemon refuses to apply a stale blueprint.

Serialize/deserialize API

export function serializeBlueprint(b: Blueprint): string;        // JSON.stringify with binary buffers base64-encoded
export function deserializeBlueprint(json: string): Blueprint;   // throws on version mismatch

Action-space pruning

Per the game memo: legalActions returns all legal actions; the solver may prune. For Phase 0 we do no pruning at the solver level — the action set is small enough at T1 (max 55 per Blue node) that enumeration is cheaper than the bookkeeping a pruning heuristic would add.

If T2 scaling (10 effectors × 30 tracks) blows the action set past ~500, the first pruning lever is dominated-action pruning: any action a whose regret has been clipped to 0 for ≥1000 consecutive visits is excluded from legalActions enumeration. That's a post-Phase-0 problem; the interface is stable enough to add later without breaking the kernel.


TypeScript module layout

packages/uci-solver/src/
├── index.ts              # re-exports
├── regret.ts             # RegretTable + RM+ + Float32Array storage
├── escfr.ts              # cfrTraverse + iterate
├── bank.ts               # StrategyBank.composedPolicy + explain
├── blueprint.ts          # trainBlueprint, Blueprint shape
├── serialize.ts          # serializeBlueprint / deserializeBlueprint
└── subroutines/
    ├── index.ts          # Subroutine interface + barrel re-exports
    ├── identityGate.ts
    ├── roeEscalation.ts
    ├── softKillFirst.ts
    ├── replanEscalation.ts
    ├── jammerCounter.ts
    ├── fratricideAvoidance.ts
    ├── fuelConservation.ts
    └── commsDegradeHedge.ts

packages/uci-solver/test/
├── regret.test.ts        # RM+ unit tests, accumulation
├── escfr.test.ts         # smoke test on a 2-node toy game
├── kuhn.test.ts          # CORRECTNESS GATE: <0.01 exploitability @ 10k
├── bank.test.ts          # composedPolicy sums to 1, explain shape
├── blueprint.test.ts     # round-trip, version-mismatch refusal
└── subroutines/
    ├── identityGate.test.ts
    ├── roeEscalation.test.ts
    ├── softKillFirst.test.ts
    ├── replanEscalation.test.ts
    ├── jammerCounter.test.ts
    ├── fratricideAvoidance.test.ts
    ├── fuelConservation.test.ts
    └── commsDegradeHedge.test.ts

Tests use only synthetic fixtures — no MQTT, no codec, no real scenarios. The package remains a pure-domain workspace member alongside @uci-demo/game.


Acceptance criteria

  1. All new modules typecheck under the workspace's strict settings.
  2. pnpm --filter @uci-demo/solver test passes including kuhn.test.ts asserting exploitability < 0.01 after 10k iterations.
  3. SCHEMA_VERSION = 1 constant is exported from serialize.ts.
  4. Blueprint version tags (beliefV, payoffV, infosetV) match the exported constants from @uci-demo/game at training time, and the deserializer refuses to load a blueprint whose tags disagree with the runtime values.
  5. pnpm up smoke unaffected. The copilot still consumes only createWorldMirror from @uci-demo/game; nothing in @uci-demo/solver is wired into the running demo in this PR. The solver-daemon + SolverAgent integration is a separate workstream (Phase 0 weeks 5-7 per the SBIR sprint table).

What this memo deliberately does not specify

  • Solver-daemon service (services/solver-daemon/). Weeks 5-7.
  • Red-agent service (services/red-agent/). Weeks 5-7.
  • SolverAgent integration in the copilot. Weeks 6-8.
  • Eval-harness exploitability plotting. Weeks 7-9.
  • PBS subgame-resolving for T3 scale. Phase II Year 1.
  • Neural-CFR ablation. Not in scope — explicitly out per the SBIR plan ("no GPU dependency, no neural-network black boxes").

Open questions to resolve before code lands

  1. Average-strategy weighting variant. The memo specifies linear averaging (each iteration's σ adds with weight 1). Some published variants use iteration-weighted averaging (weight = iteration index) for faster late-stage convergence. Reviewer can flag if the kuhn convergence at 10k looks slow — easy swap with one constant.
  2. CFR-vs-subroutines mixing weight (λ = 0.6). Chosen by analogy to AlphaGo's policy-network/value-function mix. Tunable per scenario family; the bank exports it as a constructor option so the eval-harness can sweep.
  3. Mulberry32 vs xoshiro256. Mulberry32 is fine for Kuhn; for T2/T3 scaling, xoshiro256** has lower correlation between successive samples. Phase 0 ships mulberry32; bench upgrade can land later.
  4. Pluggable rollout policy at chance nodes. The current spec samples chance from the game's resolveChance(state, rand). A more sophisticated solver might bias chance sampling toward high-information outcomes. Out of scope for Phase 0.

These are knowable knobs, not algorithmic risks. None block writing the rest of the package.


Why this memo is shorter than the game memo

The game memo had to fix five separate artifacts (state factoring, belief math, payoff weights, infoset encoding, dynamics) before implementation could safely fan out. The solver memo has one algorithm (ES-MCCFR) and one correctness gate (Kuhn). Once those are locked, the eight subroutines + bank + blueprint are mechanical from the game's GameDynamics and infoSetKey surface.

The lesson from the game workstream: the prose pages of the memo were worth their weight in reviewer time. This memo follows the same shape but commits less ink to material the game memo already covered.