Design: SolverAgent + BlueprintHolder + three-way agent selection
Follow-up to plan/design-solver-daemon.md
and plan/design-os-mccfr.md. The daemon ships
(PR #34), OS-MCCFR makes the deep tree tractable (PR #35), and v2
blueprints flow over uci-demo/solver/blueprint. This PR makes the
copilot actually consume that blueprint — Blue decisions are driven by
the trained policy instead of the scripted heuristic or the LLM.
The SBIR proposal's human-SME-tournament artifact requires a real SolverAgent playing through the bus. That's what this PR delivers.
What ships
| Surface | File | Goal |
|---|---|---|
| BlueprintHolder | `services/copilot/src/blueprintHolder.ts` (new) | Subscribes to `uci-demo/solver/blueprint`, deserializes, holds in memory, exposes `current()` |
| SolverAgent | `services/copilot/src/solverAgent.ts` (new) | Implements the `Agent` interface using the blueprint + StrategyBank |
| Three-way agent selection | `services/copilot/src/service.ts` (modify) | `USE_SOLVER=1` > `LLM_PROVIDER` set > scripted |
BlueprintHolder contract
export interface BlueprintHolder {
  /** Most recent successfully-deserialized blueprint, or null if none arrived yet. */
  current(): TrainedBlueprint | null;
  /** Iteration count from the most recent blueprint; 0 if none. */
  iterationsSeen(): number;
  /** True once at least one blueprint has been received and verified. */
  warm(): boolean;
  /** Disconnect the bus subscription + clear in-memory state. */
  dispose(): void;
}

export interface TrainedBlueprint {
  readonly blueprint: Blueprint;     // from @uci-demo/solver
  readonly blueRegret: RegretTable;  // reconstructed from blueprint.regret + avgStrategy
  readonly receivedAt: number;       // epoch ms
}

export function createBlueprintHolder(bus: BusLike): BlueprintHolder;
Subscription mechanics
- Subscribes once at construction to `uci-demo/solver/blueprint` (QoS 1 to match the daemon's publish QoS; retained payloads are delivered immediately on subscribe).
- On message:
  - Defer to a microtask via `queueMicrotask` so that deserializing a 5 MB blueprint payload (~50 ms typical) happens outside the bus client's message callback. The parse itself is still synchronous once it runs (see Open questions).
  - Call `deserializeBlueprint(json)` from `@uci-demo/solver`. The v2 schema enforcer + version-tag check fires here.
  - On success: reconstruct the `RegretTable` (populate from the `blueprint.regret` + `blueprint.avgStrategy` Float32Array buffers). Atomic swap of the held state.
  - On `BlueprintVersionError`: log + keep the prior blueprint (or null if no prior). Don't crash the copilot on a stale daemon emission.
- `dispose()` unsubscribes and clears the held blueprint. Idempotent.
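The handling above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the shipped holder: `MiniBus`, `MiniBlueprint`, `parseBlueprint`, and `createMiniHolder` are hypothetical stand-ins (the real `BusLike`, `Blueprint`, and `deserializeBlueprint` signatures are not shown in this memo), and the injectable `schedule` parameter is an illustration choice so a test can run the deferred work synchronously.

```typescript
// Hypothetical stand-ins; the real BusLike / Blueprint types may differ.
type Unsubscribe = () => void;
interface MiniBus {
  subscribe(topic: string, onMessage: (payload: string) => void): Unsubscribe;
}
interface MiniBlueprint {
  version: number;
  iterations: number;
}
class BlueprintVersionError extends Error {}

// Stand-in for deserializeBlueprint: rejects anything that is not v2.
function parseBlueprint(json: string): MiniBlueprint {
  const raw = JSON.parse(json) as Partial<MiniBlueprint>;
  if (raw.version !== 2 || typeof raw.iterations !== "number") {
    throw new BlueprintVersionError(`unsupported blueprint version: ${String(raw.version)}`);
  }
  return { version: 2, iterations: raw.iterations };
}

function createMiniHolder(
  bus: MiniBus,
  schedule: (task: () => void) => void = (task) => queueMicrotask(task),
) {
  let held: MiniBlueprint | null = null;
  let disposed = false;
  const unsubscribe = bus.subscribe("uci-demo/solver/blueprint", (payload) => {
    schedule(() => {
      if (disposed) return; // message raced with dispose()
      try {
        held = parseBlueprint(payload); // one assignment, so readers never see a partial state
      } catch (err) {
        // Any parse or version failure keeps the prior blueprint (or null).
        console.warn("[holder] rejected blueprint:", (err as Error).message);
      }
    });
  });
  return {
    current: () => held,
    iterationsSeen: () => held?.iterations ?? 0,
    warm: () => held !== null,
    dispose: () => {
      if (disposed) return; // idempotent
      disposed = true;
      unsubscribe();
      held = null;
    },
  };
}
```

Passing `(task) => task()` as `schedule` makes the holder fully synchronous, which keeps unit tests free of timing concerns; production code would keep the default.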
Cold-start semantics
When current() is null, callers should treat the SolverAgent as
running in "subroutine-prior-only" mode — the bank still produces a
useful policy via the doctrinal weights, just without the trained
regret signal. The agent logs this state.
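One way to see why an empty regret table contributes no signal is the standard regret-matching rule from CFR: probabilities are proportional to positive regret, and fall back to uniform when no action has positive regret. The sketch below illustrates only that rule; `regretMatching` is a hypothetical name, and how `StrategyBank` actually mixes the doctrinal prior with the regret table is not specified in this memo.

```typescript
// Regret matching: mass proportional to positive regret; uniform if none is positive.
function regretMatching(regrets: Float32Array): Float32Array {
  const n = regrets.length;
  const policy = new Float32Array(n);
  let positiveSum = 0;
  for (let i = 0; i < n; i++) {
    positiveSum += Math.max(regrets[i], 0);
  }
  for (let i = 0; i < n; i++) {
    policy[i] = positiveSum > 0 ? Math.max(regrets[i], 0) / positiveSum : 1 / n;
  }
  return policy;
}

// A freshly allocated table is all zeros, so every legal action gets equal mass.
const coldPolicy = regretMatching(new Float32Array(4));
// A trained table shifts mass toward actions with positive regret.
const warmPolicy = regretMatching(new Float32Array([3, -1, 1, 0]));
console.log(Array.from(coldPolicy), Array.from(warmPolicy));
```

Under this rule the cold-start table is neutral, so whatever preference the composed policy shows in that mode would come from the subroutine priors alone.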
SolverAgent contract
Implements the Agent interface from services/copilot/src/types.ts:
interface Agent {
  evaluate(
    track: TrackSnapshot,
    world: WorldSnapshot,
    context?: EvaluationContext,
  ): Promise<AgentDecision>;
  name: string;
}
Decision flow
async evaluate(track, world, ctx):
  1. Build GameState via buildGameState({world, truth: EMPTY_SCENARIO_TRUTH, ...})
     (the live mirror doesn't carry truth; the agent sees what it sees)
  2. legalActions = createTacticalDynamics().legalActions(gameState)
  3. subCtx = { info: gameState, viewer: "blue", legalActions }   // SubroutineContext, distinct from the ctx argument
  4. blueprint = blueprintHolder.current()
     regretTable = blueprint?.blueRegret ?? createRegretTable(64)  // cold-start uniform
  5. policy = bank.composedPolicy(subCtx, regretTable)
     // Float32Array of probabilities over legalActions
  6. Pick the highest-mass action a* (argmax of policy)
     (Could also sample; argmax is the textbook "exploit" pick for online play.)
  7. If a*.kind === "withhold": return { kind: "withhold", reason: rationale }
     Else: return { kind: "propose", effect, effector, rationale, predictedOutcome }
  8. Rationale: bank.explain(subCtx) → SubroutineTrace[] → map each trace.rationale
     to a string in the decision's rationale[] field. Prepend a one-line
     "solver-blueprint <warm|cold> · <N> iter" header.
  9. predictedOutcome: policy[argmax_index], the probability mass on the chosen action
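Steps 6 and 9 can be sketched as a single helper. `pickAction` and `Choice` are hypothetical names, and the sketch assumes `policy` is index-aligned with `legalActions`, as the comment in step 5 suggests.

```typescript
interface Choice<A> {
  action: A;
  predictedOutcome: number; // probability mass on the chosen action
}

function pickAction<A>(legalActions: readonly A[], policy: Float32Array): Choice<A> {
  if (legalActions.length === 0 || legalActions.length !== policy.length) {
    throw new RangeError("policy must be non-empty and aligned with legalActions");
  }
  let best = 0;
  for (let i = 1; i < policy.length; i++) {
    // Strict ">" keeps the lowest index when two actions tie.
    if (policy[i] > policy[best]) best = i;
  }
  return { action: legalActions[best], predictedOutcome: policy[best] };
}

const choice = pickAction(["engage-a", "engage-b", "withhold"], new Float32Array([0.1, 0.6, 0.3]));
console.log(choice.action); // engage-b
```

Because the helper throws on misaligned input, a caller bound by the "never throw" contract would need to catch the `RangeError` and fall back to a withhold decision.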
Filtering legalActions to "engage" / "withhold"
The bank operates over the FULL action space (engage × effects × effectors + withhold). The copilot's AgentDecision shape only carries effect + effector for engage proposals OR a withhold reason. Map:
- Argmax `{kind: "engage", effect, effectorId}` → `{kind: "propose", effect, effector: effectorId, ...}`
- Argmax `{kind: "withhold"}` → `{kind: "withhold", reason: "policy says withhold"}`
- Red-side actions never appear (we only ever evaluate Blue's decision node)
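The mapping can be written as an exhaustive switch over the action kind. A minimal sketch, assuming simplified shapes inferred from this memo: `BlueAction`, `SketchDecision`, and `toDecision` are hypothetical names, and the real `AgentDecision` in services/copilot/src/types.ts may carry more fields.

```typescript
type BlueAction =
  | { kind: "engage"; effect: string; effectorId: string }
  | { kind: "withhold" };

type SketchDecision =
  | {
      kind: "propose";
      effect: string;
      effector: string;
      rationale: string[];
      predictedOutcome: number;
    }
  | { kind: "withhold"; reason: string };

function toDecision(
  action: BlueAction,
  rationale: string[],
  predictedOutcome: number,
): SketchDecision {
  switch (action.kind) {
    case "engage":
      // effectorId on the solver side becomes effector on the copilot side.
      return {
        kind: "propose",
        effect: action.effect,
        effector: action.effectorId,
        rationale,
        predictedOutcome,
      };
    case "withhold":
      return { kind: "withhold", reason: "policy says withhold" };
  }
}
```

Keeping the union closed means the compiler flags the switch if a new action kind is added later without a matching case.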
evaluate() timing budget
Per services/copilot/src/types.ts: "resolve within a few seconds and never throw." The SolverAgent's path is O(legal_actions × subroutines) per evaluate, on the order of a few hundred microseconds, so there is no risk of timing out.
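The "never throw" half of that contract is commonly enforced with a catch-all around the decision path. A minimal sketch, assuming a simplified decision shape; `evaluateSafely` and `SafeDecision` are hypothetical names rather than code from the repo.

```typescript
type SafeDecision =
  | { kind: "propose"; effect: string; effector: string }
  | { kind: "withhold"; reason: string };

// Runs the decision path and converts any synchronous throw or promise
// rejection into a withhold decision that carries the error message.
async function evaluateSafely(
  inner: () => Promise<SafeDecision> | SafeDecision,
): Promise<SafeDecision> {
  try {
    return await inner();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "withhold", reason: `solver error: ${message}` };
  }
}
```

Falling back to withhold is the conservative choice here: a malformed world snapshot results in no proposal instead of a crashed copilot.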
Cold-start logging
When blueprintHolder.current() === null, prepend rationale with:
"solver-blueprint cold; subroutine-prior only — daemon hasn't published yet"
When warm, prepend with:
"solver-blueprint @ <iterations> iter · ε=<osEpsilon>"
The operator-facing intel stream gets this verbatim; helpful for telling whether the solver is in cold-start mode during the demo.
Three-way agent selection
Replace the existing block at services/copilot/src/service.ts:104-130:
let agent: Agent;
if (opts.agent) {
  // Injected (eval-harness path)
  agent = opts.agent;
  console.log(c.dim("[copilot]"), `agent ▸ ${c.violet(agent.name)} ${c.dim("(injected)")}`);
} else if (process.env.USE_SOLVER === "1") {
  // SolverAgent path. Daemon may or may not be online; cold-start tolerated.
  const holder = createBlueprintHolder({ raw: bus.raw });
  agent = createSolverAgent({ blueprintHolder: holder });
  console.log(
    c.dim("[copilot]"),
    `agent ▸ ${c.violet(agent.name)} ${c.dim(`(USE_SOLVER=1; ${holder.warm() ? "warm" : "cold-start"})`)}`,
  );
} else {
  // ... existing LLM / scripted selection unchanged
}
Notes:
- USE_SOLVER=1 is the explicit opt-in. Default pnpm run up → LLM/scripted as before.
- pnpm run up:solver script (added in PR #34) spawns the daemon. The copilot also needs USE_SOLVER=1 to consume it.
- BlueprintHolder is constructed AFTER the bus connect (where the existing code is). The subscription happens in the holder's constructor.
- On copilot shutdown, handle.dispose() cascades to the BlueprintHolder. The existing copilot teardown path needs one new line.
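The priority order can be isolated in a pure function, which makes it easy to unit-test without booting the service. A minimal sketch: `selectAgentKind` and `AgentKind` are hypothetical names, and treating any non-empty `LLM_PROVIDER` value as "set" is an assumption about the existing selection logic.

```typescript
type AgentKind = "injected" | "solver" | "llm" | "scripted";

function selectAgentKind(
  env: Record<string, string | undefined>,
  hasInjectedAgent = false,
): AgentKind {
  if (hasInjectedAgent) return "injected"; // eval-harness path wins outright
  if (env["USE_SOLVER"] === "1") return "solver"; // explicit opt-in only
  if (env["LLM_PROVIDER"]) return "llm"; // assumption: any non-empty value counts as set
  return "scripted";
}

console.log(selectAgentKind({ USE_SOLVER: "1", LLM_PROVIDER: "some-provider" })); // solver
```

Only the exact string `"1"` selects the solver in this sketch, mirroring the `process.env.USE_SOLVER === "1"` check in the block above.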
Why a fourth env var?
We already have LLM_PROVIDER. Adding USE_SOLVER keeps the selection priority explicit and prevents accidental selection (e.g. solver-daemon up but operator wanted LLM-only). The flag is a deliberate engagement signal.
Tests
services/copilot/test/blueprintHolder.test.ts
- Round-trip happy path: synthesize a serialized v2 blueprint, feed via a mock bus, assert `current()` returns the deserialized object within one microtask tick.
- Version refusal: feed a hand-built v1 JSON payload, assert `current()` stays at the prior state (or null), no throw.
- Iteration tracking: after one valid blueprint, `iterationsSeen()` returns the embedded iter count.
- Warm transition: starts `warm() === false`; flips after first valid blueprint.
- Dispose: after `dispose()`, `current()` returns null and new messages are ignored.
services/copilot/test/solverAgent.test.ts
- Cold-start path: with `blueprintHolder.current() === null`, evaluate() returns a `propose` decision (subroutine prior alone). Rationale contains the cold-start tag.
- Warm path: with a hand-built TrainedBlueprint whose regret table has all mass on action[3] for the test info-set, evaluate() picks that exact action.
- Withhold path: with a track whose belief has `P(FRIEND) > 0.4` (IdentityGate triggers), evaluate() returns `withhold` (because the bank's composed policy heavily prefers withhold).
- Error safety: a malformed `world` (e.g. missing assets) causes the agent to return `withhold` with a reason, NEVER throws.
Use vi.fn() spies + structural BusLike from the existing copilot test pattern.
Module layout
services/copilot/
├── package.json # already depends on @uci-demo/solver (PR #34)
├── src/
│ ├── blueprintHolder.ts # NEW
│ ├── solverAgent.ts # NEW
│ ├── service.ts # MODIFIED — three-way selection
│ └── ...existing
├── test/
│ ├── blueprintHolder.test.ts # NEW
│ ├── solverAgent.test.ts # NEW
│ └── ...existing
Acceptance criteria
- `pnpm -r typecheck` + `pnpm -r test` clean.
- Default `pnpm run up` (no `USE_SOLVER`): agent selection unchanged. LLM or scripted as before.
- `USE_SOLVER=1 pnpm run up:solver`: copilot logs `agent ▸ SolverAgent (USE_SOLVER=1; cold-start)` at boot. SolverPill in COP shows iteration count climbing. Copilot's first proposal carries `"solver-blueprint cold..."` rationale; subsequent proposals (after the first blueprint arrives) carry `"solver-blueprint @ N iter..."`.
- Validator audit stays 100% valid. The new agent emits the same UCI MTs as the existing agents (`EffectPlanCommandMT` etc.); no schema-bound channels touched.
- Pre-existing `services/copilot/test/{beliefMirror,scoreMirror,doctrinePublisher}.test.ts` + the new tests all pass.
Open questions
- Argmax vs sample at decision time. Memo specifies argmax (textbook "exploit" pick). Sampling from the policy distribution gives a stochastic agent that probes harder during human-SME play; an interesting variant, but it adds variance to a single-game evaluation. Stick with argmax for now; sweep later.
- Blueprint provenance check. When a blueprint arrives, we verify the `trainedVariant` is `"os"` (or `"es"`; either should work for online consumption). What if a future Red-side blueprint shows up on the same topic? Current scope only consumes Blue's regret table. Document inline; PR #37 (red-agent) will likely separate per-side blueprint topics.
- Microtask deserialize on a 5 MB payload. Even one microtask is sync work that blocks the loop for ~50 ms. If that's too long, switch to `setImmediate` (next tick boundary). Bench in the smoke; tune if needed.
None block writing the rest of the workstream.
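For the argmax-vs-sample question, the sampling variant is small enough to sketch for a later sweep. This is an illustration only: `sampleIndex` is a hypothetical name, and the uniform draw `u` is passed in so a seeded generator can make evaluation runs reproducible.

```typescript
// Inverse-CDF sampling over a policy vector. `u` is a uniform draw in [0, 1).
// The policy is normalized on the fly, so it does not have to sum to exactly 1.
function sampleIndex(policy: Float32Array, u: number): number {
  let total = 0;
  for (let i = 0; i < policy.length; i++) total += policy[i];
  if (policy.length === 0 || !(total > 0)) {
    throw new RangeError("policy must contain positive mass");
  }
  const target = u * total;
  let cumulative = 0;
  for (let i = 0; i < policy.length; i++) {
    cumulative += policy[i];
    if (target < cumulative) return i;
  }
  return policy.length - 1; // guards against floating-point shortfall
}

const policy = new Float32Array([0.25, 0.5, 0.25]);
console.log(sampleIndex(policy, Math.random()));
```

With a fixed sequence of draws the sampled agent is as repeatable as argmax, which keeps the extra variance confined to the choice of seed.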
What this memo deliberately does not specify
- Red-side SolverAgent — PR #37 (red-agent service).
- Blueprint serving over a separate RPC channel — out of scope; we use the retained MQTT topic as the single source of truth.
- Multi-blueprint A/B comparison — the holder holds one current blueprint; if the daemon switches scenarios mid-stream, the holder swaps. Eval-harness will run scenario × agent matrices offline (weeks 7-9).
- Solver-vs-LLM ensemble — combining the solver's policy with an LLM-generated rationale is interesting but adds complexity. The cold-start path already gracefully degrades; ensemble is a Phase II ergonomic improvement.