Design — `SolverAgent` + `BlueprintHolder` + three-way agent selection

Follow-up to plan/design-solver-daemon.md and plan/design-os-mccfr.md. The daemon ships (PR #34), OS-MCCFR makes the deep tree tractable (PR #35), and v2 blueprints flow over uci-demo/solver/blueprint. This PR makes the copilot actually consume that blueprint — Blue decisions are driven by the trained policy instead of the scripted heuristic or the LLM.

The SBIR proposal's human-SME-tournament artifact requires a real SolverAgent playing through the bus. That's what this PR delivers.

What ships

| Surface | File | Goal |
| --- | --- | --- |
| `BlueprintHolder` | `services/copilot/src/blueprintHolder.ts` (new) | Subscribes to `uci-demo/solver/blueprint`, deserializes the payload, holds the result in memory, exposes `current()` |
| `SolverAgent` | `services/copilot/src/solverAgent.ts` (new) | Implements the `Agent` interface using the blueprint + `StrategyBank` |
| Three-way agent selection | `services/copilot/src/service.ts` (modify) | Priority: `USE_SOLVER=1` > `LLM_PROVIDER` set > scripted |

`BlueprintHolder` contract

export interface BlueprintHolder {
  /** Most recent successfully-deserialized blueprint, or null if none arrived yet. */
  current(): TrainedBlueprint | null;
  /** Iteration count from the most recent blueprint; 0 if none. */
  iterationsSeen(): number;
  /** True once at least one blueprint has been received and verified. */
  warm(): boolean;
  /** Disconnect the bus subscription + clear in-memory state. */
  dispose(): void;
}

export interface TrainedBlueprint {
  readonly blueprint: Blueprint;        // from @uci-demo/solver
  readonly blueRegret: RegretTable;     // reconstructed from blueprint.regret + avgStrategy
  readonly receivedAt: number;          // epoch ms
}

export function createBlueprintHolder(bus: BusLike): BlueprintHolder;

Subscription mechanics

Subscribes once at construction to uci-demo/solver/blueprint (QoS 1 to match the daemon's publish QoS, retained payloads delivered immediately on subscribe).
On message:
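Defer to a microtask via queueMicrotask so that deserializing a 5 MB blueprint payload (~50 ms typical) runs off the bus client's message-callback stack. The parse itself is still synchronous work on the main thread and blocks the event loop while it runs (see Open questions).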
Call deserializeBlueprint(json) from @uci-demo/solver. The v2 schema enforcer + version-tag check fires here.
On success: reconstruct the RegretTable (populate from blueprint.regret + blueprint.avgStrategy Float32Array buffers). Atomic swap of the held state.
On BlueprintVersionError: log + keep the prior blueprint (or null if no prior). Don't crash the copilot on a stale daemon emission.
dispose() unsubscribes and clears the held blueprint. Idempotent.
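The mechanics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the shipped `blueprintHolder.ts`: `MiniBus`, `MiniBlueprint`, `parseBlueprint`, and `createMiniHolder` are hypothetical stand-ins (the real code uses `BusLike` and `deserializeBlueprint` from `@uci-demo/solver`), the deferral is injectable so the behaviour can be checked synchronously, and the sketch treats every parse failure like a version mismatch, whereas the memo only names `BlueprintVersionError`.

```typescript
// Hypothetical minimal bus: subscribe() returns an unsubscribe function.
interface MiniBus {
  subscribe(topic: string, onMessage: (payload: string) => void): () => void;
}

// Stand-in for the real Blueprint; only the fields this sketch needs.
interface MiniBlueprint {
  readonly version: number;
  readonly iterations: number;
}

// Stand-in for deserializeBlueprint: throws on anything that is not v2.
function parseBlueprint(json: string): MiniBlueprint {
  const raw = JSON.parse(json) as { version?: number; iterations?: number };
  if (raw.version !== 2) {
    throw new Error(`unsupported blueprint version: ${String(raw.version)}`);
  }
  return { version: 2, iterations: raw.iterations ?? 0 };
}

interface MiniHolder {
  current(): MiniBlueprint | null;
  iterationsSeen(): number;
  warm(): boolean;
  dispose(): void;
}

function createMiniHolder(
  bus: MiniBus,
  // The memo uses queueMicrotask; a resolved promise schedules the same kind of microtask.
  defer: (work: () => void) => void = (work) => {
    void Promise.resolve().then(work);
  },
): MiniHolder {
  let held: MiniBlueprint | null = null;
  let disposed = false;

  const unsubscribe = bus.subscribe("uci-demo/solver/blueprint", (payload) => {
    defer(() => {
      if (disposed) return; // message raced with dispose()
      try {
        held = parseBlueprint(payload); // one assignment: readers see old or new, never a mix
      } catch {
        // Assumption: any rejected payload keeps the prior blueprint (or null).
      }
    });
  });

  return {
    current: () => held,
    iterationsSeen: () => held?.iterations ?? 0,
    warm: () => held !== null,
    dispose() {
      if (disposed) return; // idempotent
      disposed = true;
      unsubscribe();
      held = null;
    },
  };
}
```

Passing `(work) => work()` as the second argument runs the handler inline, which is convenient in unit tests; leaving it out keeps the microtask deferral.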

Cold-start semantics

When current() is null, callers should treat the SolverAgent as running in "subroutine-prior-only" mode — the bank still produces a useful policy via the doctrinal weights, just without the trained regret signal. The agent logs this state.

`SolverAgent` contract

Implements the Agent interface from services/copilot/src/types.ts:

interface Agent {
  evaluate(
    track: TrackSnapshot,
    world: WorldSnapshot,
    context?: EvaluationContext,
  ): Promise<AgentDecision>;
  name: string;
}

Decision flow

async evaluate(track, world, ctx):
  1. Build GameState via buildGameState({world, truth: EMPTY_SCENARIO_TRUTH, ...})
     (the live mirror doesn't carry truth — the agent sees what it sees)
  2. legalActions = createTacticalDynamics().legalActions(gameState)
  3. Build subCtx: SubroutineContext = { info: gameState, viewer: "blue", legalActions }
     (distinct from the optional EvaluationContext argument ctx)
  4. blueprint = blueprintHolder.current()
     regretTable = blueprint?.blueRegret ?? createRegretTable(64)  // cold-start uniform
  5. policy = bank.composedPolicy(subCtx, regretTable)
     // Float32Array of probabilities over legalActions
  6. Pick the highest-mass action a* (argmax of policy)
     (Could also sample; argmax is the textbook "exploit" pick for online play.)
  7. If a*.kind === "withhold": return { kind: "withhold", reason: rationale }
     Else: return { kind: "propose", effect, effector, rationale, predictedOutcome }
  8. Rationale: bank.explain(subCtx) → SubroutineTrace[] → map each trace.rationale
     to a string in the decision's rationale[] field. Prepend the one-line
     warm/cold header specified under "Cold-start logging".
  9. predictedOutcome: policy[argmax_index] — the probability mass on the chosen action
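Steps 4 to 6 and 9 can be illustrated with a small sketch. It uses standard regret matching as a stand-in for turning a regret row into a policy; the memo does not say how `StrategyBank.composedPolicy` combines the regret table with the subroutine priors, so `regretMatching` and `argmax` below are hypothetical helpers, not the bank's implementation. The sketch does reproduce the cold-start property that an all-zero regret row yields a uniform policy.

```typescript
// Regret matching: normalise the positive regrets into a probability vector.
// When no regret is positive (e.g. a fresh, all-zero table) fall back to uniform.
function regretMatching(regrets: Float32Array): Float32Array {
  const positiveSum = regrets.reduce((sum, r) => sum + Math.max(r, 0), 0);
  const n = regrets.length;
  return regrets.map((r) => (positiveSum > 0 ? Math.max(r, 0) / positiveSum : 1 / n));
}

// Index of the highest-mass entry; ties resolve to the first index, -1 for an empty policy.
function argmax(policy: Float32Array): number {
  let best = -1;
  let bestMass = -Infinity;
  policy.forEach((mass, i) => {
    if (mass > bestMass) {
      bestMass = mass;
      best = i;
    }
  });
  return best;
}

// Warm case: regrets favour action 1.
const warmPolicy = regretMatching(new Float32Array([0, 3, -2, 1]));
const chosen = argmax(warmPolicy);

// Step 9: predictedOutcome is the probability mass on the chosen action.
const predictedOutcome = Array.from(warmPolicy)[chosen] ?? 0;

// Cold case: all-zero regrets give a uniform policy over four actions.
const coldPolicy = regretMatching(new Float32Array(4));
```

Because ties resolve to the first index, a cold-start argmax over a purely uniform policy would always pick the first legal action; in the memo's design the subroutine priors are what break that tie.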

Filtering legalActions to "engage" / "withhold"

The bank operates over the FULL action space (engage × effects × effectors + withhold). The copilot's AgentDecision shape only carries effect + effector for engage proposals OR a withhold reason. Map:

Argmax {kind: "engage", effect, effectorId} → {kind: "propose", effect, effector: effectorId, ...}
Argmax {kind: "withhold"} → {kind: "withhold", reason: "policy says withhold"}
Red-side actions never appear (we only ever evaluate Blue's decision node)
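The mapping above is small enough to write as a single function. The sketch below assumes simplified shapes: `TacticalAction` and `MappedDecision` are hypothetical stand-ins for the solver's action type and the copilot's `AgentDecision`, and `effect` / `effectorId` are modelled as plain strings.

```typescript
// Hypothetical, simplified solver-side action (Blue decision node only).
type TacticalAction =
  | { kind: "engage"; effect: string; effectorId: string }
  | { kind: "withhold" };

// Hypothetical, simplified copilot-side decision.
type MappedDecision =
  | {
      kind: "propose";
      effect: string;
      effector: string;
      rationale: string[];
      predictedOutcome: number;
    }
  | { kind: "withhold"; reason: string };

// Translate the argmax action into the decision shape the copilot emits.
function toDecision(
  action: TacticalAction,
  mass: number, // probability mass the policy put on this action
  rationale: string[],
): MappedDecision {
  if (action.kind === "withhold") {
    return { kind: "withhold", reason: "policy says withhold" };
  }
  return {
    kind: "propose",
    effect: action.effect,
    effector: action.effectorId, // effectorId is renamed to effector on the way out
    rationale,
    predictedOutcome: mass,
  };
}
```

Using a discriminated union on `kind` lets the compiler check that the `propose` branch never reads `effect` or `effectorId` from a withhold action.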

`evaluate()` timing budget

Per services/copilot/src/types.ts: "resolve within a few seconds and never throw." The SolverAgent's path is O(legal_actions × subroutines) per evaluate call and runs entirely against the in-memory blueprint, so it should take a few hundred microseconds. No risk of timing out.
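One way to honour the "never throw" half of that contract is to funnel the whole decision path through a guard that converts any exception into a withhold. This is a sketch under assumptions: `GuardedDecision`, `decideSafely`, and `evaluateSafely` are hypothetical names, and the wording of the fallback reason is illustrative only.

```typescript
// Hypothetical, simplified decision shape for this sketch.
type GuardedDecision =
  | { kind: "propose"; effect: string; effector: string }
  | { kind: "withhold"; reason: string };

// Run the decision path; on any exception fall back to a withhold with a reason.
function decideSafely(decide: () => GuardedDecision): GuardedDecision {
  try {
    return decide();
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { kind: "withhold", reason: `solver error: ${detail}` };
  }
}

// Async wrapper matching the Promise-returning evaluate() shape: it always resolves.
async function evaluateSafely(decide: () => GuardedDecision): Promise<GuardedDecision> {
  return decideSafely(decide);
}

// A malformed world (here simulated by a throw) degrades to withhold.
const guarded = decideSafely(() => {
  throw new Error("world.assets is missing");
});
```

Keeping the synchronous core separate from the async wrapper makes the error path easy to unit-test without awaiting anything.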

Cold-start logging

When blueprintHolder.current() === null, prepend rationale with:

"solver-blueprint cold; subroutine-prior only — daemon hasn't published yet"

When warm, prepend with:

"solver-blueprint @ <iterations> iter · ε=<osEpsilon>"

The operator-facing intel stream gets this verbatim; helpful for telling whether the solver is in cold-start mode during the demo.
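The two header strings can be produced by one small helper. This is a sketch: `HeaderInput` and `blueprintHeader` are hypothetical names, and it assumes the iteration count and `osEpsilon` are available as plain numbers on whatever the holder returns.

```typescript
// Hypothetical view of the fields the header needs from a held blueprint.
interface HeaderInput {
  iterations: number;
  osEpsilon: number;
}

// Build the one-line rationale header; null means no blueprint has arrived yet.
function blueprintHeader(held: HeaderInput | null): string {
  if (held === null) {
    return "solver-blueprint cold; subroutine-prior only — daemon hasn't published yet";
  }
  return `solver-blueprint @ ${held.iterations} iter · ε=${held.osEpsilon}`;
}

// Prepend the header to the per-subroutine rationale lines.
function withHeader(held: HeaderInput | null, traces: string[]): string[] {
  return [blueprintHeader(held), ...traces];
}
```

Both strings share the `solver-blueprint` prefix, so an operator or a test can detect cold-start with a simple prefix check on the first rationale line.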

Three-way agent selection

Replace the existing block at services/copilot/src/service.ts:104-130:

let agent: Agent;
if (opts.agent) {
  // Injected (eval-harness path)
  agent = opts.agent;
  console.log(c.dim("[copilot]"), `agent ▸ ${c.violet(agent.name)}  ${c.dim("(injected)")}`);
} else if (process.env.USE_SOLVER === "1") {
  // SolverAgent path. Daemon may or may not be online; cold-start tolerated.
  const holder = createBlueprintHolder({ raw: bus.raw });
  agent = createSolverAgent({ blueprintHolder: holder });
  console.log(
    c.dim("[copilot]"),
    `agent ▸ ${c.violet(agent.name)}  ${c.dim(`(USE_SOLVER=1; ${holder.warm() ? "warm" : "cold-start"})`)}`,
  );
} else {
  // ... existing LLM / scripted selection unchanged
}

Notes:

USE_SOLVER=1 is the explicit opt-in. Default pnpm run up → LLM/scripted as before.
The pnpm run up:solver script (added in PR #34) spawns the daemon. The copilot also needs USE_SOLVER=1 to consume it.
BlueprintHolder is constructed AFTER the bus connect (where the existing code is). The subscription happens in the holder's constructor.
On copilot shutdown, handle.dispose() cascades to the BlueprintHolder. The existing copilot teardown path needs one new line.

Why a fourth env var?

We already have LLM_PROVIDER. Adding USE_SOLVER keeps the selection priority explicit and prevents accidental selection (e.g. solver-daemon up but operator wanted LLM-only). The flag is a deliberate engagement signal.
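The selection priority can be captured as a pure function, which makes it easy to test without booting the service. This is a sketch: `AgentKind` and `selectAgentKind` are hypothetical names, and it assumes "`LLM_PROVIDER` set" means a non-empty string.

```typescript
type AgentKind = "injected" | "solver" | "llm" | "scripted";

// Priority: injected agent > USE_SOLVER=1 > LLM_PROVIDER set > scripted.
function selectAgentKind(
  env: Record<string, string | undefined>,
  hasInjectedAgent: boolean,
): AgentKind {
  if (hasInjectedAgent) return "injected"; // eval-harness path
  if (env["USE_SOLVER"] === "1") return "solver"; // only the exact value "1" opts in
  if (env["LLM_PROVIDER"]) return "llm"; // assumption: any non-empty value counts as set
  return "scripted";
}
```

Because the check is an exact match on `"1"`, values such as `USE_SOLVER=true` leave the existing LLM/scripted selection untouched, which fits the intent that the solver is never selected by accident.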

Tests

`services/copilot/test/blueprintHolder.test.ts`

Round-trip happy path: synthesize a serialized v2 blueprint, feed via a mock bus, assert current() returns the deserialized object within one microtask tick.
Version refusal: feed a hand-built v1 JSON payload, assert current() stays at the prior state (or null), no throw.
Iteration tracking: after one valid blueprint, iterationsSeen() returns the embedded iter count.
Warm transition: starts warm() === false; flips after first valid blueprint.
Dispose: after dispose(), current() returns null and new messages are ignored.

`services/copilot/test/solverAgent.test.ts`

Cold-start path: with blueprintHolder.current() === null, evaluate() returns a propose decision (subroutine prior alone). Rationale contains the cold-start tag.
Warm path: with a hand-built TrainedBlueprint whose regret table has all mass on action[3] for the test info-set, evaluate() picks that exact action.
Withhold path: with a track whose belief has P(FRIEND) > 0.4 (IdentityGate triggers), evaluate() returns withhold (because the bank's composed policy heavily prefers withhold).
Error safety: a malformed world (e.g. missing assets) causes the agent to return withhold with a reason, NEVER throws.

Use vi.fn() spies + structural BusLike from the existing copilot test pattern.

Module layout

services/copilot/
├── package.json                          # already depends on @uci-demo/solver (PR #34)
├── src/
│   ├── blueprintHolder.ts                # NEW
│   ├── solverAgent.ts                    # NEW
│   ├── service.ts                        # MODIFIED — three-way selection
│   └── ...existing
└── test/
    ├── blueprintHolder.test.ts           # NEW
    ├── solverAgent.test.ts               # NEW
    └── ...existing

Acceptance criteria

pnpm -r typecheck and pnpm -r test both run clean.
Default pnpm run up (no USE_SOLVER) — agent selection unchanged. LLM or scripted as before.
USE_SOLVER=1 pnpm run up:solver — copilot logs agent ▸ SolverAgent (USE_SOLVER=1; cold-start) at boot. SolverPill in COP shows iteration count climbing. Copilot's first proposal carries "solver-blueprint cold..." rationale; subsequent proposals (after the first blueprint arrives) carry "solver-blueprint @ N iter...".
Validator audit stays 100% valid. The new agent emits the same UCI MTs as the existing agents (EffectPlanCommandMT etc.); no schema-bound channels touched.
Pre-existing services/copilot/test/{beliefMirror,scoreMirror,doctrinePublisher}.test.ts + the new tests all pass.

Open questions

Argmax vs sample at decision time. Memo specifies argmax (textbook "exploit" pick). Sampling from the policy distribution gives a stochastic agent that probes harder during human-SME play — interesting variant but adds variance to a single-game evaluation. Stick with argmax for now; sweep later.
Blueprint provenance check. When a blueprint arrives, we verify the trainedVariant is "os" (or "es" — either should work for online consumption). What if a future Red-side blueprint shows up on the same topic? Current scope only consumes Blue's regret table. Document inline; PR #37 (red-agent) will likely separate per-side blueprint topics.
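Microtask deserialize on a 5 MB payload. A microtask still runs synchronously before the event loop regains control, so the parse blocks the loop for ~50 ms. If that's too long, switching to setImmediate would let already-queued I/O callbacks run first, but the parse still blocks for the same duration once it starts. Bench in the smoke; tune if needed.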

None block writing the rest of the workstream.

What this memo deliberately does not specify

Red-side SolverAgent — PR #37 (red-agent service).
Blueprint serving over a separate RPC channel — out of scope; we use the retained MQTT topic as the single source of truth.
Multi-blueprint A/B comparison — the holder holds one current blueprint; if the daemon switches scenarios mid-stream, the holder swaps. Eval-harness will run scenario × agent matrices offline (weeks 7-9).
Solver-vs-LLM ensemble — combining the solver's policy with an LLM-generated rationale is interesting but adds complexity. The cold-start path already gracefully degrades; ensemble is a Phase II ergonomic improvement.
