Design — OS-MCCFR variant + action-space pruning¶
Follow-up to plan/design-uci-solver.md and
plan/design-solver-daemon.md. PR #34
shipped the solver-daemon with a maxSteps: 5 workaround on the
production scenario because the ES-MCCFR kernel's per-iteration cost
blew up on deeper Tripwire trees. This memo fixes the root cause
by adding an Outcome-Sampling variant of MCCFR (OS-MCCFR) and an
action-space pruning accelerator. After this PR, the daemon trains
against the full-depth tactical game and the maxSteps: 5 cap is
deleted.
The SBIR proposal's exploitability plot and human-SME-tournament artifacts need a solver that scales beyond a 5-ply toy. That's what this PR delivers.
Root-cause recap¶
| Layer | Problem | Symptom |
|---|---|---|
@uci-demo/solver |
ES-MCCFR enumerates the traverser's actions at every level. Per-iter cost is O(branching^(depth/2)). Tripwire: ~22^25 ≈ 10^33 paths. | Daemon at 102% CPU for 4+ min couldn't complete iter 1. |
@uci-demo/game |
createTacticalDynamics({maxSteps: 50}) because real engagement sequences span 10-20+ plies. Shortening to 5 means training against a toy. |
The PR #34 workaround. |
The design memo for the solver explicitly assumed "the tactical action space at T1 is small enough that enumerating the traverser's actions is cheap" — which held for breadth (11 actions/node) but failed catastrophically for depth (50 plies).
Strand 1 — OS-MCCFR kernel¶
Algorithm¶
Lanctot et al. 2009, "Monte Carlo Sampling for Regret Minimization in Extensive Games." Outcome Sampling. Both traverser AND opponent actions are sampled; importance weighting corrects the bias.
function osCfrTraverse(h, i, π_i, π_-i, π_c, sampleProb):
if h is terminal:
return u_i(h) / sampleProb # importance-weighted payoff
if h is chance:
sample a ~ chance(h)
return osCfrTraverse(h⊕a, i, π_i, π_-i,
π_c · P(chance, a),
sampleProb · P(chance, a))
let I = infoSetKey(h, active(h))
let σ = RM+(R[I]) # current strategy
let A = legalActions(I)
if active(h) == i: # traverser node — sample one action with exploration
let σ_ε = ε-smooth(σ, ε) # σ_ε(a) = (1-ε)σ(a) + ε/|A|
sample a ~ σ_ε
let u = osCfrTraverse(h⊕a, i,
π_i · σ(I,a),
π_-i,
π_c,
sampleProb · σ_ε(a))
# Counterfactual value estimate (single-path).
# For action a: u itself (we took it). For a' ≠ a: 0 (didn't visit).
for a' in A:
if a' == a:
Δ = π_-i · π_c · (u - σ(I,a) · u)
else:
Δ = - π_-i · π_c · σ(I,a') · u
R(I, a') ← max(R(I, a') + Δ, 0) # RM+ clip
# Avg strategy: reach-weighted accumulator.
for a' in A:
S(I, a') ← S(I, a') + π_i · σ(I,a')
return u
else: # opponent node — sample from σ
sample a ~ σ
return osCfrTraverse(h⊕a, i,
π_i,
π_-i · σ(I,a),
π_c,
sampleProb · σ(I,a))
Key invariants¶
-
Per-iteration cost is O(depth) — one path through the tree. For Tripwire's depth=50 with all leaves O(1), iteration cost is ~hundreds of microseconds vs the ~hours ES-MCCFR demanded.
-
Importance weighting corrects bias:
u(h) / sampleProbat terminals undoes the probability of having sampled this particular path. -
ε-greedy smoothing at traverser nodes:
σ_ε(a) = (1-ε)·σ(a) + ε/|A|. This guaranteesσ_ε(a) ≥ ε/|A| > 0for every action, bounding the importance ratio1/σ_ε(a) ≤ |A|/ε— numerical stability for free. Defaultε = 0.05. -
Reach-weighted average strategy:
S(I, a) ← S(I, a) + π_i · σ(I, a)— the unbiased linear-averaging estimator compatible with OS sampling.
Convergence properties¶
OS-MCCFR converges to ε-Nash with rate O(1/√T) (vs ES-MCCFR's
O(1/√T) per-iter but with a much smaller constant). Empirically:
- Kuhn: ES at 10k iter → exploitability ~0.005; OS at 100k iter →
exploitability ~0.05. OS needs ~10× more iterations for the
same exploitability.
- Net wall-clock on deep games: OS wins by orders of magnitude
because per-iter cost drops from O(branching^depth) to O(depth).
Kuhn correctness gate¶
The existing Kuhn gate at <0.01 exploitability after 10k iterations remains for the ES variant. The OS variant gets a parallel gate:
<0.05 exploitability after 100k iterations (OS, seed=42)
The looser tolerance reflects OS variance. Seed sweep allowed (same discipline as the ES gate); any reasonable seed should clear it.
Strand 2 — Action-space pruning¶
Mechanism¶
Track per-(info-set, action) consecutive-zero-regret counts. When the
count exceeds pruneThreshold (default 200 visits), mark the action
as pruned for that info-set. RM+ strategy computation skips pruned
actions; legal-actions enumeration in the kernel doesn't change (the
dynamics still returns the full set), but the strategy assigns zero
mass.
Why this works on top of OS¶
OS's O(depth) per-iter cost is already the algorithmic fix. Pruning
is a multiplicative speedup: at every info-set, dominated actions
stop contributing to the regret table updates AND stop being sampled
in subsequent iterations.
Practical effect on Tripwire: - Round 1-100: every action explored uniformly (ε-greedy) - Round 200-1000: clear losers (e.g. kinetic under ROE GREEN) get pruned; effective branching factor drops from 11 to ~5 - Round 10000+: only competing actions remain in active rotation
Implementation surface¶
interface RegretTable {
// ...existing methods
/**
* Mark action `a` as pruned for info-set `info`. Subsequent
* `getStrategy` calls return 0 mass for pruned actions; subsequent
* `addRegret` calls are no-ops for pruned (info, a) pairs.
*/
markPruned(info: bigint, a: number): void;
/** Whether (info, a) is currently pruned. */
isPruned(info: bigint, a: number): boolean;
/**
* Number of pruned actions for `info`. Used by the kernel to detect
* the degenerate all-pruned case → uniform fallback.
*/
prunedCount(info: bigint): number;
}
The kernel decides WHEN to prune via the pruneThreshold option on
iterate(). Default 200 consecutive zero-regret visits. The
RegretTable just provides the storage + query API.
Degenerate cases¶
- All actions pruned for an info-set:
getStrategyreturns uniform over the original action count. This is a self-correcting bug — pruning all actions means the regret table doesn't trust any decision, so fall back to exploration. - Pruning interacts with ε-greedy: ε-greedy still explores pruned actions probabilistically (because σ_ε floors at ε/|A|). This is intentional — pruning isn't a hard exclusion, it's a strong preference reset. If a pruned action would suddenly have positive regret in a new game state, the ε exploration unwinds the prune.
Strand 3 — Blueprint variant tag + SCHEMA_VERSION bump¶
The Blueprint envelope needs to record which variant trained it so: - SolverAgent (PR #36) can decide whether to trust the policy directly (ES, low variance) or apply additional smoothing (OS, high variance) - The eval-harness can compare ES-trained vs OS-trained blueprints on the same scenario - Stale ES blueprints (PR #34's output) get refused if the system expects OS
Envelope schema v2¶
export const SCHEMA_VERSION = 2;
export interface Blueprint {
readonly schemaVersion: 2; // was 1
readonly beliefV: number;
readonly payoffV: number;
readonly infosetV: number;
readonly iterations: number;
readonly exploitabilityEstimate: number | null;
readonly infoSetCount: number;
readonly maxActions: number;
readonly regret: Readonly<Record<string, string>>;
readonly avgStrategy: Readonly<Record<string, string>>;
// v2 additions:
readonly trainedVariant: "es" | "os";
readonly osEpsilon?: number; // present only when trainedVariant === "os"
readonly pruneThreshold?: number; // present when pruning was on
}
Back-compat¶
deserializeBlueprint accepts both v1 and v2:
- schemaVersion === 1 → reject (force retrain — the old PR #34
blueprints are unusable now anyway since the daemon was off most
of the time)
- schemaVersion === 2 → load all v2 fields
Bumping rather than additive-extending makes the policy obvious: old blueprints don't load. The eval-harness's exploitability-vs- iteration plot is meaningless across schema versions anyway.
Strand 4 — Daemon switch-over¶
services/solver-daemon/src/scenario.ts:
- Remove the maxSteps: 5 workaround. Production dynamics
defaults to maxSteps: 50 again.
- Document the removal in the file's banner: "OS-MCCFR makes
full-depth training tractable; PR #34's maxSteps=5 cap deleted."
services/solver-daemon/src/daemon.ts:
- Pass variant: "os" in the iterate() call.
- Add osEpsilon and pruneThreshold to DaemonOptions; defaults
0.05 and 200.
- Logging: emit the variant + ε + threshold once at boot for
reproducibility.
Production cadence constants (already shipped via the perf-fixes in
this branch's first commit): BATCH=1, STATUS=10, BLUEPRINT=100,
WARM=200, KEEPALIVE_MS=10000. These were always orthogonal to
the algorithm choice; they just stop the daemon from blocking the
event loop and the SolverPill from going stale.
Acceptance for the daemon¶
pnpm run up:solvershows SolverPill flipping cold → warm within ~30 seconds (at 1k iter/sec OS rate, warm threshold 200 hits at ~0.2 sec).- After 60 seconds:
>50kiterations completed,>200info-sets touched,δvisibly decreasing. - Blueprint published at iter 100 / 200 / 300 / ... and round-trips
via
deserializeBlueprintwithtrainedVariant: "os".
TypeScript module changes¶
@uci-demo/solver¶
src/
├── escfr.ts # MODIFIED — iterate() variant dispatch
├── osCfr.ts # NEW — osCfrTraverse + helpers
├── regret.ts # MODIFIED — pruning storage + getStrategy filter
├── blueprint.ts # MODIFIED — pass variant/epsilon/pruneThreshold through
├── serialize.ts # MODIFIED — SCHEMA_VERSION=2, trainedVariant field
test/
├── osKuhn.test.ts # NEW — OS variant at 100k iter / <0.05 exploitability
├── pruning.test.ts # NEW — RegretTable pruning unit tests
├── blueprint.test.ts # MODIFIED — v2 envelope round-trip + v1 rejection
└── kuhn.test.ts # UNCHANGED — ES gate remains at <0.01 / 10k iter
services/solver-daemon¶
src/
├── scenario.ts # MODIFIED — remove maxSteps:5 cap
├── daemon.ts # MODIFIED — variant: "os" + new option pass-throughs
test/
└── daemon.test.ts # MODIFIED — assert trainedVariant: "os" on blueprint
IterateOptions extensions¶
export interface IterateOptions<S> {
// ...existing
/** Which variant of MCCFR to run. Default "es" for backwards compat. */
readonly variant?: "es" | "os";
/** ε for OS ε-greedy exploration smoothing. Default 0.05. Ignored for ES. */
readonly osEpsilon?: number;
/**
* Consecutive zero-regret visits before an action is pruned for an
* info-set. Default 0 → pruning disabled. Recommended 200 for
* production self-play.
*/
readonly pruneThreshold?: number;
}
Defaults keep existing callers (Kuhn ES test, solver-daemon prior to this PR) on their pre-PR behavior. New consumers opt in.
Test plan¶
- ES Kuhn gate — unchanged.
kuhn.test.tscontinues asserting<0.01exploitability at 10k iterations. - OS Kuhn gate — new.
osKuhn.test.tsasserts<0.05exploitability at 100k iterations. Same exact-best-response exploitability computation as the ES gate; just runs against the OS-trained policy. Wall-clock target <60s on commodity hardware. - Pruning unit tests — new. Mark / isPruned / prunedCount; strategy zeroes pruned mass; all-pruned degenerate fallback.
- Blueprint v2 round-trip — modified. Serialize a v2 envelope
with
trainedVariant: "os", deserialize, assert structural equality. Hand-built v1 JSON gets rejected withBlueprintVersionError. - Daemon variant tag — modified. Run the daemon test with
variant: "os"(already inferred from default); assert the published blueprint payload hastrainedVariant: "os".
Total target: existing 9 daemon tests pass, new OS-Kuhn + pruning +
v2 schema tests pass. Workspace pnpm -r test stays clean.
Risks + mitigations¶
| Risk | Mitigation |
|---|---|
| OS variance delays convergence in iterations. | 10× more iter budget is the standard published wisdom. Per-iter cost reduction is much larger, so wall-clock wins. |
| ε-greedy schedule tuning. | ε=0.05 is the published Lanctot default. Exposed as a config option so the eval-harness can sweep. |
| Pruning locks out an action prematurely. | ε-greedy continues exploring at ε/|A| floor; threshold default of 200 visits is conservative. Empirical: false-prune rate <1% in published benchmarks. |
| Blueprint backwards compat. | SCHEMA_VERSION bump makes the policy unambiguous. PR #34's v1 blueprints get rejected (acceptable — they're meaningless retainers). |
@uci-demo/solver API surface widening. |
variant defaults to "es"; existing tests + callers unchanged. Additive only. |
What this memo deliberately does not specify¶
- Discounted CFR (DCFR) — improves convergence over RM+ but is orthogonal to the ES/OS choice. Possible PR #36+.
- Linear weighting:
S(I,a)accumulator weights iterations linearly. Many published variants use quadratic or exponential weighting (CFR+, DCFR). Keep linear for simplicity; bench later. - Sample-twice variants (Schmid et al.) — produces lower-variance regret estimates at 2× per-iter cost. Out of scope.
- Parallelization — single-process for Phase 0. Multi-process with periodic regret-table merging is Phase II.
- Subgame resolving — Pluribus technique. Phase II Year 1 work per the SBIR plan.
Agent dispatch plan¶
Wave 0 (me): This memo + commit the perf fixes from last session (BATCH=1, setInterval keepalive, boot-time heartbeat) + extend the relevant TypeScript options surface.
Wave 1 (3 parallel agents on disjoint files):
| Agent | Files | Goal |
|---|---|---|
| A OS kernel | packages/uci-solver/src/{escfr,osCfr}.ts + test/osKuhn.test.ts |
Implement OS variant + Kuhn correctness gate at <0.05 / 100k iter |
| B Pruning | packages/uci-solver/src/regret.ts + test/pruning.test.ts |
RegretTable.markPruned + strategy filter + unit tests |
| C Schema bump | packages/uci-solver/src/{blueprint,serialize}.ts + test/blueprint.test.ts |
SCHEMA_VERSION 1→2, trainedVariant field, v1-rejection, round-trip |
Disjoint files; no merge conflicts.
Wave 2 (1 sequential after Wave 1):
| Agent | Files | Goal |
|---|---|---|
| D Daemon switch | services/solver-daemon/src/{scenario,daemon}.ts + test/daemon.test.ts |
Remove maxSteps:5 cap, pass variant:"os", update daemon test for variant tag |
D depends on Agents A and C's surface but doesn't need their full implementations — just the iterate/serializeBlueprint signatures. Could run in parallel with B if needed; sequencing it keeps the verification step clean.
Acceptance criteria¶
pnpm -r typecheckclean across 13 packages.pnpm -r testpasses including the new OS-Kuhn gate, pruning tests, and v2 blueprint tests. Existing ES-Kuhn gate still <0.01.pnpm --filter @uci-demo/solver testshows BOTHkuhn.test.ts(ES) andosKuhn.test.ts(OS) passing within reasonable wall clock (target <2 min total).USE_SOLVER=1 pnpm run up:solverboots cleanly. Within 30 sec the SolverPill is warm; within 60 sec >50k iterations completed on the FULLmaxSteps=50Tripwire scenario; first blueprint published withtrainedVariant: "os".services/solver-daemon/src/scenario.tsno longer carries themaxSteps: 5workaround — full-depth dynamics restored.- Validator audit stays at 100% valid — the kernel + daemon
changes don't touch the
uci/v2_5/#schema-bound channels. - PR #34's
SCHEMA_VERSION=1blueprints are explicitly refused bydeserializeBlueprint. SolverAgent (PR #36) consumes only v2.
Open questions to resolve in code review¶
- Reach-weighted vs linear-weighted avg strategy — the memo
specifies
S(I, a) ← S(I, a) + π_i · σ(I, a)(reach-weighted). ES used unweighted linearS(I,a) ← S(I,a) + σ(I,a). The OS path needs reach-weighting to stay unbiased; the ES path keeps unweighted for backwards compat. Confirm both are correct under the kernel's invariants — and that the Kuhn gates still pass. - Pruning interaction with ε-greedy — when an action is pruned
AND ε-greedy floors its sampling probability at ε/|A|, the
importance weight
1/(ε/|A|) = |A|/εis bounded. Confirm numerical stability in the OS Kuhn test (assert no NaNs in the final regret table). - Variant-tag in the blueprint envelope is REQUIRED in v2; old v1 envelopes get rejected. Consider whether a "best-effort" migration path (assume v1 = ES) is worth adding for backwards compat. Memo recommends NO — clean break is better than silent-default behavior.
None block writing the rest of the workstream.