Design — OS-MCCFR variant + action-space pruning¶

Follow-up to plan/design-uci-solver.md and plan/design-solver-daemon.md. PR #34 shipped the solver-daemon with a maxSteps: 5 workaround on the production scenario because the ES-MCCFR kernel's per-iteration cost blew up on deeper Tripwire trees. This memo fixes the root cause by adding an Outcome-Sampling variant of MCCFR (OS-MCCFR) and an action-space pruning accelerator. After this PR, the daemon trains against the full-depth tactical game and the maxSteps: 5 cap is deleted.

The SBIR proposal's exploitability plot and human-SME-tournament artifacts need a solver that scales beyond a 5-ply toy. That's what this PR delivers.

Root-cause recap¶

Layer	Problem	Symptom
`@uci-demo/solver`	ES-MCCFR enumerates the traverser's actions at every level. Per-iter cost is O(branching^(depth/2)). Tripwire: ~22^25 ≈ 10^33 paths.	Daemon at 102% CPU for 4+ min couldn't complete iter 1.
`@uci-demo/game`	`createTacticalDynamics({maxSteps: 50})` because real engagement sequences span 10-20+ plies. Shortening to 5 means training against a toy.	The PR #34 workaround.

The design memo for the solver explicitly assumed "the tactical action space at T1 is small enough that enumerating the traverser's actions is cheap" — which held for breadth (11 actions/node) but failed catastrophically for depth (50 plies).

Strand 1 — OS-MCCFR kernel¶

Algorithm¶

Lanctot et al. 2009, "Monte Carlo Sampling for Regret Minimization in Extensive Games." Outcome Sampling. Both traverser AND opponent actions are sampled; importance weighting corrects the bias.

function osCfrTraverse(h, i, π_i, π_-i, π_c, sampleProb):
  if h is terminal:
    return u_i(h) / sampleProb         # importance-weighted payoff

  if h is chance:
    sample a ~ chance(h)
    return osCfrTraverse(h⊕a, i, π_i, π_-i,
                         π_c · P(chance, a),
                         sampleProb · P(chance, a))

  let I  = infoSetKey(h, active(h))
  let σ  = RM+(R[I])                    # current strategy
  let A  = legalActions(I)

  if active(h) == i:                    # traverser node — sample one action with exploration
    let σ_ε = ε-smooth(σ, ε)            # σ_ε(a) = (1-ε)σ(a) + ε/|A|
    sample a ~ σ_ε
    let u = osCfrTraverse(h⊕a, i,
                          π_i · σ(I,a),
                          π_-i,
                          π_c,
                          sampleProb · σ_ε(a))
    # Counterfactual value estimate (single-path).
    # For action a: u itself (we took it). For a' ≠ a: 0 (didn't visit).
    for a' in A:
      if a' == a:
        Δ = π_-i · π_c · (u - σ(I,a) · u)
      else:
        Δ = - π_-i · π_c · σ(I,a') · u
      R(I, a') ← max(R(I, a') + Δ, 0)   # RM+ clip
    # Avg strategy: reach-weighted accumulator.
    for a' in A:
      S(I, a') ← S(I, a') + π_i · σ(I,a')
    return u

  else:                                  # opponent node — sample from σ
    sample a ~ σ
    return osCfrTraverse(h⊕a, i,
                         π_i,
                         π_-i · σ(I,a),
                         π_c,
                         sampleProb · σ(I,a))

Key invariants¶

Per-iteration cost is O(depth) — one path through the tree. For Tripwire's depth=50 with all leaves O(1), iteration cost is ~hundreds of microseconds vs the ~hours ES-MCCFR demanded.
Importance weighting corrects bias: u(h) / sampleProb at terminals undoes the probability of having sampled this particular path.
ε-greedy smoothing at traverser nodes: σ_ε(a) = (1-ε)·σ(a) + ε/|A|. This guarantees σ_ε(a) ≥ ε/|A| > 0 for every action, bounding the importance ratio 1/σ_ε(a) ≤ |A|/ε — numerical stability for free. Default ε = 0.05.
Reach-weighted average strategy: S(I, a) ← S(I, a) + π_i · σ(I, a) — the unbiased linear-averaging estimator compatible with OS sampling.

Convergence properties¶

OS-MCCFR converges to ε-Nash with rate O(1/√T) (vs ES-MCCFR's O(1/√T) per-iter but with a much smaller constant). Empirically: - Kuhn: ES at 10k iter → exploitability ~0.005; OS at 100k iter → exploitability ~0.05. OS needs ~10× more iterations for the same exploitability. - Net wall-clock on deep games: OS wins by orders of magnitude because per-iter cost drops from O(branching^depth) to O(depth).

Kuhn correctness gate¶

The existing Kuhn gate at <0.01 exploitability after 10k iterations remains for the ES variant. The OS variant gets a parallel gate:

<0.05 exploitability after 100k iterations (OS, seed=42)

The looser tolerance reflects OS variance. Seed sweep allowed (same discipline as the ES gate); any reasonable seed should clear it.

Strand 2 — Action-space pruning¶

Mechanism¶

Track per-(info-set, action) consecutive-zero-regret counts. When the count exceeds pruneThreshold (default 200 visits), mark the action as pruned for that info-set. RM+ strategy computation skips pruned actions; legal-actions enumeration in the kernel doesn't change (the dynamics still returns the full set), but the strategy assigns zero mass.

Why this works on top of OS¶

OS's O(depth) per-iter cost is already the algorithmic fix. Pruning is a multiplicative speedup: at every info-set, dominated actions stop contributing to the regret table updates AND stop being sampled in subsequent iterations.

Practical effect on Tripwire: - Round 1-100: every action explored uniformly (ε-greedy) - Round 200-1000: clear losers (e.g. kinetic under ROE GREEN) get pruned; effective branching factor drops from 11 to ~5 - Round 10000+: only competing actions remain in active rotation

Implementation surface¶

interface RegretTable {
  // ...existing methods

  /**
   * Mark action `a` as pruned for info-set `info`. Subsequent
   * `getStrategy` calls return 0 mass for pruned actions; subsequent
   * `addRegret` calls are no-ops for pruned (info, a) pairs.
   */
  markPruned(info: bigint, a: number): void;

  /** Whether (info, a) is currently pruned. */
  isPruned(info: bigint, a: number): boolean;

  /**
   * Number of pruned actions for `info`. Used by the kernel to detect
   * the degenerate all-pruned case → uniform fallback.
   */
  prunedCount(info: bigint): number;
}

The kernel decides WHEN to prune via the pruneThreshold option on iterate(). Default 200 consecutive zero-regret visits. The RegretTable just provides the storage + query API.

Degenerate cases¶

All actions pruned for an info-set: getStrategy returns uniform over the original action count. This is a self-correcting bug — pruning all actions means the regret table doesn't trust any decision, so fall back to exploration.
Pruning interacts with ε-greedy: ε-greedy still explores pruned actions probabilistically (because σ_ε floors at ε/|A|). This is intentional — pruning isn't a hard exclusion, it's a strong preference reset. If a pruned action would suddenly have positive regret in a new game state, the ε exploration unwinds the prune.

Strand 3 — Blueprint variant tag + SCHEMA_VERSION bump¶

The Blueprint envelope needs to record which variant trained it so: - SolverAgent (PR #36) can decide whether to trust the policy directly (ES, low variance) or apply additional smoothing (OS, high variance) - The eval-harness can compare ES-trained vs OS-trained blueprints on the same scenario - Stale ES blueprints (PR #34's output) get refused if the system expects OS

Envelope schema v2¶

export const SCHEMA_VERSION = 2;

export interface Blueprint {
  readonly schemaVersion: 2;            // was 1
  readonly beliefV: number;
  readonly payoffV: number;
  readonly infosetV: number;
  readonly iterations: number;
  readonly exploitabilityEstimate: number | null;
  readonly infoSetCount: number;
  readonly maxActions: number;
  readonly regret: Readonly<Record<string, string>>;
  readonly avgStrategy: Readonly<Record<string, string>>;

  // v2 additions:
  readonly trainedVariant: "es" | "os";
  readonly osEpsilon?: number;          // present only when trainedVariant === "os"
  readonly pruneThreshold?: number;     // present when pruning was on
}

Back-compat¶

deserializeBlueprint accepts both v1 and v2: - schemaVersion === 1 → reject (force retrain — the old PR #34 blueprints are unusable now anyway since the daemon was off most of the time) - schemaVersion === 2 → load all v2 fields

Bumping rather than additive-extending makes the policy obvious: old blueprints don't load. The eval-harness's exploitability-vs- iteration plot is meaningless across schema versions anyway.

Strand 4 — Daemon switch-over¶

services/solver-daemon/src/scenario.ts: - Remove the maxSteps: 5 workaround. Production dynamics defaults to maxSteps: 50 again. - Document the removal in the file's banner: "OS-MCCFR makes full-depth training tractable; PR #34's maxSteps=5 cap deleted."

services/solver-daemon/src/daemon.ts: - Pass variant: "os" in the iterate() call. - Add osEpsilon and pruneThreshold to DaemonOptions; defaults 0.05 and 200. - Logging: emit the variant + ε + threshold once at boot for reproducibility.

Production cadence constants (already shipped via the perf-fixes in this branch's first commit): BATCH=1, STATUS=10, BLUEPRINT=100, WARM=200, KEEPALIVE_MS=10000. These were always orthogonal to the algorithm choice; they just stop the daemon from blocking the event loop and the SolverPill from going stale.

Acceptance for the daemon¶

pnpm run up:solver shows SolverPill flipping cold → warm within ~30 seconds (at 1k iter/sec OS rate, warm threshold 200 hits at ~0.2 sec).
After 60 seconds: >50k iterations completed, >200 info-sets touched, δ visibly decreasing.
Blueprint published at iter 100 / 200 / 300 / ... and round-trips via deserializeBlueprint with trainedVariant: "os".

TypeScript module changes¶

`@uci-demo/solver`¶

src/
├── escfr.ts              # MODIFIED — iterate() variant dispatch
├── osCfr.ts              # NEW — osCfrTraverse + helpers
├── regret.ts             # MODIFIED — pruning storage + getStrategy filter
├── blueprint.ts          # MODIFIED — pass variant/epsilon/pruneThreshold through
├── serialize.ts          # MODIFIED — SCHEMA_VERSION=2, trainedVariant field

test/
├── osKuhn.test.ts        # NEW — OS variant at 100k iter / <0.05 exploitability
├── pruning.test.ts       # NEW — RegretTable pruning unit tests
├── blueprint.test.ts     # MODIFIED — v2 envelope round-trip + v1 rejection
└── kuhn.test.ts          # UNCHANGED — ES gate remains at <0.01 / 10k iter

`services/solver-daemon`¶

src/
├── scenario.ts           # MODIFIED — remove maxSteps:5 cap
├── daemon.ts             # MODIFIED — variant: "os" + new option pass-throughs

test/
└── daemon.test.ts        # MODIFIED — assert trainedVariant: "os" on blueprint

IterateOptions extensions¶

export interface IterateOptions<S> {
  // ...existing

  /** Which variant of MCCFR to run. Default "es" for backwards compat. */
  readonly variant?: "es" | "os";

  /** ε for OS ε-greedy exploration smoothing. Default 0.05. Ignored for ES. */
  readonly osEpsilon?: number;

  /**
   * Consecutive zero-regret visits before an action is pruned for an
   * info-set. Default 0 → pruning disabled. Recommended 200 for
   * production self-play.
   */
  readonly pruneThreshold?: number;
}

Defaults keep existing callers (Kuhn ES test, solver-daemon prior to this PR) on their pre-PR behavior. New consumers opt in.

Test plan¶

ES Kuhn gate — unchanged. kuhn.test.ts continues asserting <0.01 exploitability at 10k iterations.
OS Kuhn gate — new. osKuhn.test.ts asserts <0.05 exploitability at 100k iterations. Same exact-best-response exploitability computation as the ES gate; just runs against the OS-trained policy. Wall-clock target <60s on commodity hardware.
Pruning unit tests — new. Mark / isPruned / prunedCount; strategy zeroes pruned mass; all-pruned degenerate fallback.
Blueprint v2 round-trip — modified. Serialize a v2 envelope with trainedVariant: "os", deserialize, assert structural equality. Hand-built v1 JSON gets rejected with BlueprintVersionError.
Daemon variant tag — modified. Run the daemon test with variant: "os" (already inferred from default); assert the published blueprint payload has trainedVariant: "os".

Total target: existing 9 daemon tests pass, new OS-Kuhn + pruning + v2 schema tests pass. Workspace pnpm -r test stays clean.

Risks + mitigations¶

Risk	Mitigation
OS variance delays convergence in iterations.	10× more iter budget is the standard published wisdom. Per-iter cost reduction is much larger, so wall-clock wins.
ε-greedy schedule tuning.	ε=0.05 is the published Lanctot default. Exposed as a config option so the eval-harness can sweep.
Pruning locks out an action prematurely.	ε-greedy continues exploring at ε/\|A\| floor; threshold default of 200 visits is conservative. Empirical: false-prune rate <1% in published benchmarks.
Blueprint backwards compat.	SCHEMA_VERSION bump makes the policy unambiguous. PR #34's v1 blueprints get rejected (acceptable — they're meaningless retainers).
`@uci-demo/solver` API surface widening.	`variant` defaults to "es"; existing tests + callers unchanged. Additive only.

What this memo deliberately does not specify¶

Discounted CFR (DCFR) — improves convergence over RM+ but is orthogonal to the ES/OS choice. Possible PR #36+.
Linear weighting: S(I,a) accumulator weights iterations linearly. Many published variants use quadratic or exponential weighting (CFR+, DCFR). Keep linear for simplicity; bench later.
Sample-twice variants (Schmid et al.) — produces lower-variance regret estimates at 2× per-iter cost. Out of scope.
Parallelization — single-process for Phase 0. Multi-process with periodic regret-table merging is Phase II.
Subgame resolving — Pluribus technique. Phase II Year 1 work per the SBIR plan.

Agent dispatch plan¶

Wave 0 (me): This memo + commit the perf fixes from last session (BATCH=1, setInterval keepalive, boot-time heartbeat) + extend the relevant TypeScript options surface.

Wave 1 (3 parallel agents on disjoint files):

Agent	Files	Goal
A OS kernel	`packages/uci-solver/src/{escfr,osCfr}.ts` + `test/osKuhn.test.ts`	Implement OS variant + Kuhn correctness gate at <0.05 / 100k iter
B Pruning	`packages/uci-solver/src/regret.ts` + `test/pruning.test.ts`	RegretTable.markPruned + strategy filter + unit tests
C Schema bump	`packages/uci-solver/src/{blueprint,serialize}.ts` + `test/blueprint.test.ts`	SCHEMA_VERSION 1→2, trainedVariant field, v1-rejection, round-trip

Disjoint files; no merge conflicts.

Wave 2 (1 sequential after Wave 1):

Agent	Files	Goal
D Daemon switch	`services/solver-daemon/src/{scenario,daemon}.ts` + `test/daemon.test.ts`	Remove maxSteps:5 cap, pass variant:"os", update daemon test for variant tag

D depends on Agents A and C's surface but doesn't need their full implementations — just the iterate/serializeBlueprint signatures. Could run in parallel with B if needed; sequencing it keeps the verification step clean.

Acceptance criteria¶

pnpm -r typecheck clean across 13 packages.
pnpm -r test passes including the new OS-Kuhn gate, pruning tests, and v2 blueprint tests. Existing ES-Kuhn gate still <0.01.
pnpm --filter @uci-demo/solver test shows BOTH kuhn.test.ts (ES) and osKuhn.test.ts (OS) passing within reasonable wall clock (target <2 min total).
USE_SOLVER=1 pnpm run up:solver boots cleanly. Within 30 sec the SolverPill is warm; within 60 sec >50k iterations completed on the FULL maxSteps=50 Tripwire scenario; first blueprint published with trainedVariant: "os".
services/solver-daemon/src/scenario.ts no longer carries the maxSteps: 5 workaround — full-depth dynamics restored.
Validator audit stays at 100% valid — the kernel + daemon changes don't touch the uci/v2_5/# schema-bound channels.
PR #34's SCHEMA_VERSION=1 blueprints are explicitly refused by deserializeBlueprint. SolverAgent (PR #36) consumes only v2.

Open questions to resolve in code review¶

Reach-weighted vs linear-weighted avg strategy — the memo specifies S(I, a) ← S(I, a) + π_i · σ(I, a) (reach-weighted). ES used unweighted linear S(I,a) ← S(I,a) + σ(I,a). The OS path needs reach-weighting to stay unbiased; the ES path keeps unweighted for backwards compat. Confirm both are correct under the kernel's invariants — and that the Kuhn gates still pass.
Pruning interaction with ε-greedy — when an action is pruned AND ε-greedy floors its sampling probability at ε/|A|, the importance weight 1/(ε/|A|) = |A|/ε is bounded. Confirm numerical stability in the OS Kuhn test (assert no NaNs in the final regret table).
Variant-tag in the blueprint envelope is REQUIRED in v2; old v1 envelopes get rejected. Consider whether a "best-effort" migration path (assume v1 = ES) is worth adding for backwards compat. Memo recommends NO — clean break is better than silent-default behavior.

None block writing the rest of the workstream.