Design — @uci-demo/solver (Phase 0 weeks 3-6)¶
Pre-code design memo for the solver workstream that follows the
@uci-demo/game fill-out. This package
contains the algorithmic heart of the SBIR proposal: an
External-Sampling Monte Carlo Counterfactual Regret Minimization
(ES-MCCFR) kernel, regret + average-strategy tables, eight doctrinal
subroutines, a softmax strategy bank, and a versioned blueprint
serializer.
The point of writing this memo first is the same as for the game memo: fix the algorithm, the table layout, and the correctness gate before parallel implementation lands. Reviewers should be able to point at a specific equation or byte layout and ask "why is this 0.4?" — not discover the choice during code review.
The correctness gate is non-negotiable: the kernel must converge to <0.01 exploitability in 10k iterations on Kuhn poker. Without that test passing no reviewer trusts anything else in this package.
ES-MCCFR algorithm¶
External-Sampling Monte Carlo CFR. Lanctot et al. 2009. The traverser explores every action at their decision nodes; opponent decisions and chance outcomes are sampled. This is the right variant for our setting: the tactical action space at T1 is small (5×11=55 for Blue), so exhaustive traversal of the traverser's actions is cheap, but the chance node space (sensor confidence, effect outcomes) is large enough that sampling beats enumeration.
Update equation¶
For traverser i walking history h:
v(h) =
// Terminal
u_i(h) if h is terminal
// Chance node
v(h ⊕ sample_a~chance) if h is chance
// Opponent decision (sampled)
v(h ⊕ sample_a~σ_{-i}(I_{-i}(h))) if active(h) ≠ i
// Traverser decision (enumerated)
Σ_a σ_i(I, a) · v(h ⊕ a) if active(h) = i
At each traverser decision node h with information set I = infoSetKey(h, i):
v(h, a) = cfrTraverse(h ⊕ a, i) # value of taking action a, recursing
v(h) = Σ_a σ_i(I, a) · v(h, a) # node value under current strategy
R(I, a) ← max(R(I, a) + (v(h, a) − v(h)), 0) # RM+ regret update
S(I) ← S(I) + σ_i(I) # average-strategy accumulator
R is the cumulative-regret table; S is the average-strategy table.
The max(·, 0) clip is RM+ (Tammelin et al. 2014) — the variant
that converges faster than vanilla regret matching in practice and is
the standard choice for tournament-grade CFR solvers.
Regret matching+ (RM+)¶
Given regret vector R(I, ·):
σ_i(I, a) = max(R(I, a), 0) / Σ_b max(R(I, b), 0) if Σ_b max(R(I, b), 0) > 0
= 1 / |A| otherwise (uniform fallback)
The denominator can only be zero when every regret has been clipped to zero on this info-set. In that case the uniform fallback prevents divide-by-zero and represents "no evidence yet — explore uniformly."
Average strategy¶
For ES-MCCFR the standard recipe is linear average strategy with arithmetic accumulation:
S(I, a) ← S(I, a) + σ_i(I, a) # accumulate every traverser visit
σ̄_i(I, a) = S(I, a) / Σ_b S(I, b) # normalize at extract time
The average strategy converges to a Nash equilibrium under regret-matching; the current strategy oscillates. The solver-daemon exposes the average strategy as its blueprint.
Outer iteration loop¶
function iterate(N, dynamics, rootStateFactory, rng):
for t in 1..N:
traverser = (t mod 2 = 0) ? "blue" : "red"
h0 = rootStateFactory(rng) # seeded initial state
cfrTraverse(h0, traverser, dynamics, rng)
Alternating-traverser is the textbook recipe; both players' regret tables get updated every two iterations.
Why ES-MCCFR (not outcome-sampling, not Deep CFR)¶
- Outcome-sampling (OS-MCCFR) samples the traverser's actions too — faster per iteration but higher-variance regret estimates. For our small tactical action space, enumerating the traverser's actions is cheap and gets us better regret signal per iteration.
- Deep CFR replaces tables with neural networks. We explicitly want to keep the kernel CPU-only and interpretable; tables on a Float32Array give us exactly the auditability the RFP attribute #2 demands.
- External-sampling is the chosen variant per the SBIR plan
(
plan/sbir-osw26bz02-dv004-game-theoretic-coa.mdline 28).
Regret + strategy table layout¶
Both tables share storage strategy: a Map<bigint, Float32Array> where
the key is infoSetKey(state, viewer) and the value is a fixed-length
vector of regrets (or accumulated strategy mass) indexed by action
index.
Action canonicalization¶
The actionIndex is the position of an action in the array returned by
dynamics.legalActions(state). The @uci-demo/game createTacticalDynamics
returns this array canonically sorted by compareActions (kind, then
trackId, then effect, then effectorId) — see packages/uci-game/src/dynamics.ts.
Same state ⇒ same array ⇒ same indexing. Stable.
Storage¶
interface RegretTable {
/** Number of action slots per info-set; matches dynamics.legalActions max. */
readonly maxActions: number;
/** Get the regret value for (info, a). 0 for unseen (I, a). */
getRegret(info: bigint, a: number): number;
/** Add `delta` to regret(info, a), then clip to ≥0 (RM+). */
addRegret(info: bigint, a: number, delta: number): void;
/** Compute the RM+ strategy distribution over `legalActionCount` actions for this info-set. */
getStrategy(info: bigint, legalActionCount: number): Float32Array;
/** Accumulate σ into the average-strategy table, weighted by 1.0 per traverser visit. */
accumulateAvgStrategy(info: bigint, sigma: Float32Array, legalActionCount: number): void;
/** Normalized average strategy; the blueprint output. */
getAverageStrategy(info: bigint, legalActionCount: number): Float32Array;
/** Number of unique info-sets touched. Anytime-status read. */
size(): number;
}
Why fixed-length rows. The memo on @uci-demo/game set the T1
action upper bound at ~55 (5 tracks × (1 withhold + 5 effects × 2
effectors)). We allocate maxActions = 64 per row (next power of two)
and use the per-call legalActionCount to ignore the tail. Float32Array
gives us O(1) random access and tight memory: 256 bytes per info-set
× 2 tables × 10⁴ info-sets ≈ 5 MB — fits comfortably in any
deployment.
Why bigint keys, not strings. infoSetKey already returns
bigint (packages/uci-game/src/hash.ts). BigInt keys in JS Maps are
fast enough for tactical-scale tables; if benchmarks show keying
overhead later, the same encoding maps trivially to a number via % 2**32
with a collision-acceptance argument analogous to the FNV-1a-64 collision
analysis in the game memo.
Lazy allocation. Tables are sparse — most theoretical info-sets are
never reached. We allocate a row on first access only. Map.get(info)
returning undefined means "no regret yet, treat as zeros, σ is uniform."
Public API surface¶
export function createRegretTable(maxActions: number): RegretTable;
That's it. No globals. Solver instantiates one regret table for Blue and one for Red.
Kuhn poker correctness gate¶
The single most important test in this package. Kuhn poker is the classic CFR test case — small enough to compute the exact Nash equilibrium, well-studied enough that any deviation from known convergence behavior signals a kernel bug.
Kuhn poker rules (canonical)¶
- 3-card deck: J, Q, K.
- Each player gets one card; the third stays hidden.
- Round 1: P1 may check or bet 1 chip.
- Round 2: P2 responds:
- If P1 checked: P2 may check (showdown) or bet 1 chip (P1 then folds or calls).
- If P1 bet: P2 may fold (P1 wins ante) or call (showdown).
- Showdown: higher card wins the pot; ante = 1 chip each.
Game tree size¶
- 6 card-deal permutations × decision tree.
- 12 information sets total (6 per player; each player sees their card and the action history).
- Action space ≤ 2 at every decision node.
Test methodology¶
1. Define KuhnDynamics implementing GameDynamics from @uci-demo/game.
- GameState carries {dealtCards, history, terminal?, winner?}.
- legalActions returns ["check", "bet"] or ["fold", "call"].
- apply transitions deterministically.
- resolveChance deals cards via rand().
- infoSetKey hashes (myCard, history).
2. Run escfr.iterate(N=10_000, KuhnDynamics, mulberry32(seed=42)).
3. Compute exploitability via best-response search over the strategy:
- For each player i, find max_{π_i} E[payoff | π_i, σ̄_{-i}]
- exploitability = (BR_blue − v̄_blue) + (BR_red − v̄_red)
- Best-response is exact tree-search (Kuhn is small).
4. assert exploitability < 0.01
Expected convergence¶
| Iterations | Expected exploitability (RM+, alternating traversers) |
|---|---|
| 100 | ~0.5 |
| 1,000 | ~0.05 |
| 10,000 | ~0.005 |
| 100,000 | ~0.0005 |
The 10k threshold of 0.01 is loose by a factor of 2 against published benchmarks — chosen so the test runs in under a second of vitest wall time and tolerates RNG-seed variation across machines.
If 10k iterations does not converge to <0.01, the kernel is broken. There is no second test that compensates.
Reference Kuhn Nash strategy (sanity check)¶
The Nash equilibrium for player 1 (action probability table, for reference — useful when debugging):
| Card | P(check) | P(bet) |
|---|---|---|
| J | 1 - α | α |
| Q | 1 | 0 |
| K | 1 - 3α | 3α |
where α ∈ [0, 1/3] parameterizes the family of equilibria; the
average-strategy convergence picks one out of this family.
A separate unit test in kuhn.test.ts extracts σ̄(K_p1) and asserts
σ̄(bet) ∈ [0, 1] (i.e. ≤ 1 and ≥ 0 — basic sanity, not convergence to
a specific α).
Subroutines¶
The regret table is keyed over action indices (per the game's
legalActions), but the strategy bank weighting works over
subroutines. This is the architectural answer to RFP attribute #2
(interpretability): every output trace decomposes into a per-subroutine
contribution.
Subroutine interface¶
export interface SubroutineContext {
readonly info: GameState;
readonly viewer: Player;
readonly legalActions: readonly Action[];
}
export interface SubroutineTrace {
readonly id: string;
readonly weight: number; // softmax mass assigned to this subroutine
readonly topAction: Action | null;
readonly rationale: string; // ≤120 char operator-facing string
}
export interface Subroutine {
readonly id: string;
/** Weight signal — how strongly this subroutine wants to be heard. */
weight(ctx: SubroutineContext): number;
/** Probability vector over `ctx.legalActions`. Must sum to 1 (or be zero-mass when no opinion). */
distribution(ctx: SubroutineContext): Float32Array;
/** Human-readable trace explaining this subroutine's contribution. */
explain(ctx: SubroutineContext): SubroutineTrace;
}
weight is a non-negative real number. The bank softmaxes the
per-subroutine weights into mixing coefficients, then mixes the
per-subroutine distributions.
The eight subroutines¶
Seeded from the existing scripted agent's factoring
(services/copilot/src/scriptedAgent.ts:24-66):
| Subroutine | Trigger | Output |
|---|---|---|
IdentityGate |
P(FRIEND \| belief) > 0.4 |
Heavy mass on withhold |
RoeEscalation |
roe === "RED" |
Mass shifts to kinetic effects (DEFEAT_DESTROY, ATTRIT) |
SoftKillFirst |
roe === "AMBER" ∧ threatConfObs < 70 |
Mass shifts to soft-kill (DISRUPT, DEGRADE, DENY) |
ReplanEscalation |
Last engage on this track was soft-kill ∧ track still in play | Kinetic-first |
JammerCounter |
threatTypeObs === "JAMMER" |
Skip EW effects (DISRUPT/DEGRADE) — mirrors scriptedAgent.ts:82-85 |
FratricideAvoidance |
P(FRIEND \| belief) > 0.25 |
Mass on withhold (softer than IdentityGate) |
FuelConservation |
Any effector at fuel band CRITICAL or LOW |
Shift mass away from that effector |
CommsDegradeHedge |
commsDegrade.dropPercent > 20 |
Prefer autonomous-capable effector (kinetic in this codebase) |
Each subroutine is a ≤80-LOC TypeScript file under
packages/uci-solver/src/subroutines/<id>.ts with a vitest fixture in
test/subroutines/<id>.test.ts. All pure — read GameState, output
a probability vector. No solver-table reads, no async, no I/O.
Why subroutines, not raw actions¶
A regret table keyed on actions tells you which action the solver
prefers; a subroutine-decorated trace tells you why. The plan
(plan/sbir-osw26bz02-dv004-game-theoretic-coa.md line 72-90) treats
this as the architectural differentiator from neural CFR. We honor it:
the strategy-bank explain method always emits one trace per active
subroutine, surfaced to the COP via the existing
uci-demo/copilot/reason/<planId> channel.
Strategy bank¶
The bank consumes the regret table + the subroutines and emits a single
mixed policy per (infoSetKey, legalActions) query.
Composition rule¶
function composedPolicy(ctx: SubroutineContext, regretTable: RegretTable): Float32Array {
const weights = subroutines.map((s) => s.weight(ctx)); // raw weights
const mixing = softmax(weights, temperature = 1.0); // → Σ = 1
const dists = subroutines.map((s) => s.distribution(ctx)); // per-subroutine distributions
// Linear mixture
const N = ctx.legalActions.length;
const out = new Float32Array(N);
for (let s = 0; s < subroutines.length; s++) {
for (let a = 0; a < N; a++) {
out[a] += mixing[s] * dists[s][a];
}
}
// Mix in the regret-table strategy (the CFR signal) at fixed weight λ = 0.6.
const cfrStrategy = regretTable.getAverageStrategy(infoSetKey(ctx.info, ctx.viewer), N);
for (let a = 0; a < N; a++) {
out[a] = 0.6 * cfrStrategy[a] + 0.4 * out[a];
}
return out;
}
The CFR signal dominates (λ = 0.6) — the subroutines are the prior that biases behavior toward operator-doctrine when the CFR table has little or no signal yet (cold-start case). As iterations accumulate and the regret table sharpens, the subroutine contribution becomes a soft regularization rather than the primary action selector.
bank.explain(ctx, regrets)¶
Returns a readonly SubroutineTrace[] enumerating only the
subroutines whose weight > 0.05 after softmaxing. The copilot
publishes one MQTT reasoning-line per trace via the existing
publishReasoningLine at services/copilot/src/main.ts:120-133.
Blueprint schema + serialization¶
The blueprint is what the solver-daemon ships to the SolverAgent for fast online lookup. It is a single JSON envelope containing the regret table, the average-strategy table, and version metadata.
export interface Blueprint {
readonly schemaVersion: 1;
readonly beliefV: number; // BELIEF_V at training time
readonly payoffV: number; // PAYOFF_V at training time
readonly infosetV: number; // INFOSET_V at training time
readonly iterations: number;
readonly exploitabilityEstimate: number | null; // null if not measured
readonly infoSetCount: number;
readonly maxActions: number;
/** base64-encoded Float32Array buffers, keyed by infoSet bigint serialized as decimal string. */
readonly regret: Record<string, string>;
readonly avgStrategy: Record<string, string>;
}
Why JSON over Protobuf¶
- The blueprint is shipped via MQTT in the live demo
(
uci-demo/solver/blueprint); MQTT payloads are bytes-with-text content negotiation, and base64-in-JSON is the cheapest interop. - Tactical-scale blueprints (10³ info-sets × 64 actions × 4 bytes × 2 tables) are ~500 KB. Compresses well; even uncompressed it fits in one MQTT message.
- We do not need cross-language compatibility — the only consumer is another TypeScript service.
Version-tag invariant¶
A blueprint loaded with beliefV ≠ BELIEF_V or infosetV ≠ INFOSET_V
must be refused with a clear error. This is the
silently-invalidate-cache hazard the game memo also called out. The
deserializer enforces the check; the solver-daemon refuses to apply a
stale blueprint.
Serialize/deserialize API¶
export function serializeBlueprint(b: Blueprint): string; // JSON.stringify with binary buffers base64-encoded
export function deserializeBlueprint(json: string): Blueprint; // throws on version mismatch
Action-space pruning¶
Per the game memo: legalActions returns all legal actions; the
solver may prune. For Phase 0 we do no pruning at the solver level —
the action set is small enough at T1 (max 55 per Blue node) that
enumeration is cheaper than the bookkeeping a pruning heuristic would
add.
If T2 scaling (10 effectors × 30 tracks) blows the action set past
~500, the first pruning lever is dominated-action pruning: any
action a whose regret has been clipped to 0 for ≥1000 consecutive
visits is excluded from legalActions enumeration. That's a
post-Phase-0 problem; the interface is stable enough to add later
without breaking the kernel.
TypeScript module layout¶
packages/uci-solver/src/
├── index.ts # re-exports
├── regret.ts # RegretTable + RM+ + Float32Array storage
├── escfr.ts # cfrTraverse + iterate
├── bank.ts # StrategyBank.composedPolicy + explain
├── blueprint.ts # trainBlueprint, Blueprint shape
├── serialize.ts # serializeBlueprint / deserializeBlueprint
└── subroutines/
├── index.ts # Subroutine interface + barrel re-exports
├── identityGate.ts
├── roeEscalation.ts
├── softKillFirst.ts
├── replanEscalation.ts
├── jammerCounter.ts
├── fratricideAvoidance.ts
├── fuelConservation.ts
└── commsDegradeHedge.ts
packages/uci-solver/test/
├── regret.test.ts # RM+ unit tests, accumulation
├── escfr.test.ts # smoke test on a 2-node toy game
├── kuhn.test.ts # CORRECTNESS GATE: <0.01 exploitability @ 10k
├── bank.test.ts # composedPolicy sums to 1, explain shape
├── blueprint.test.ts # round-trip, version-mismatch refusal
└── subroutines/
├── identityGate.test.ts
├── roeEscalation.test.ts
├── softKillFirst.test.ts
├── replanEscalation.test.ts
├── jammerCounter.test.ts
├── fratricideAvoidance.test.ts
├── fuelConservation.test.ts
└── commsDegradeHedge.test.ts
Tests use only synthetic fixtures — no MQTT, no codec, no real
scenarios. The package remains a pure-domain workspace member alongside
@uci-demo/game.
Acceptance criteria¶
- All new modules typecheck under the workspace's strict settings.
pnpm --filter @uci-demo/solver testpasses includingkuhn.test.tsasserting exploitability < 0.01 after 10k iterations.SCHEMA_VERSION = 1constant is exported fromserialize.ts.- Blueprint version tags (
beliefV,payoffV,infosetV) match the exported constants from@uci-demo/gameat training time, and the deserializer refuses to load a blueprint whose tags disagree with the runtime values. pnpm upsmoke unaffected. The copilot still consumes onlycreateWorldMirrorfrom@uci-demo/game; nothing in@uci-demo/solveris wired into the running demo in this PR. The solver-daemon + SolverAgent integration is a separate workstream (Phase 0 weeks 5-7 per the SBIR sprint table).
What this memo deliberately does not specify¶
- Solver-daemon service (
services/solver-daemon/). Weeks 5-7. - Red-agent service (
services/red-agent/). Weeks 5-7. - SolverAgent integration in the copilot. Weeks 6-8.
- Eval-harness exploitability plotting. Weeks 7-9.
- PBS subgame-resolving for T3 scale. Phase II Year 1.
- Neural-CFR ablation. Not in scope — explicitly out per the SBIR plan ("no GPU dependency, no neural-network black boxes").
Open questions to resolve before code lands¶
- Average-strategy weighting variant. The memo specifies linear averaging (each iteration's σ adds with weight 1). Some published variants use iteration-weighted averaging (weight = iteration index) for faster late-stage convergence. Reviewer can flag if the kuhn convergence at 10k looks slow — easy swap with one constant.
- CFR-vs-subroutines mixing weight (λ = 0.6). Chosen by analogy to AlphaGo's policy-network/value-function mix. Tunable per scenario family; the bank exports it as a constructor option so the eval-harness can sweep.
- Mulberry32 vs xoshiro256. Mulberry32 is fine for Kuhn; for T2/T3 scaling, xoshiro256** has lower correlation between successive samples. Phase 0 ships mulberry32; bench upgrade can land later.
- Pluggable rollout policy at chance nodes. The current spec
samples chance from the game's
resolveChance(state, rand). A more sophisticated solver might bias chance sampling toward high-information outcomes. Out of scope for Phase 0.
These are knowable knobs, not algorithmic risks. None block writing the rest of the package.
Why this memo is shorter than the game memo¶
The game memo had to fix five separate artifacts (state factoring,
belief math, payoff weights, infoset encoding, dynamics) before
implementation could safely fan out. The solver memo has one
algorithm (ES-MCCFR) and one correctness gate (Kuhn). Once those are
locked, the eight subroutines + bank + blueprint are mechanical from
the game's GameDynamics and infoSetKey surface.
The lesson from the game workstream: the prose pages of the memo were worth their weight in reviewer time. This memo follows the same shape but commits less ink to material the game memo already covered.