Plan — Game-Theoretic COA Engine on `uci-demo` for SBIR OSW26BZ02-DV004 (D2P2)¶

Context¶

SBIR topic OSW26BZ02-DV004 (SCO, Direct-to-Phase-II) asks for a mature, scalable, robust game-theoretic AI that computes approximate Nash equilibria in imperfect-information multi-domain wargames; beats expert red teams; stays human-interpretable; scales tactical → operational; runs CPU-only; is anytime. The "non-responsive" floor for the proposal is a working prototype with quantified performance against expert humans or recognized AI benchmarks — narrative does not clear the bar.

uci-demo today (post v1.3.0 main, 21/596 UCI v2.5 MTs) is a strong demo substrate but is not the engine the RFP wants. The Agent interface in services/copilot/src/types.ts:73-86 and the modular scriptedAgent.ts ↔ claudeAgent.ts slot exactly fit the "modular doctrinal subroutines, not monolithic NN" attribute. Comms-degrade injection (apps/cop-ui/lib/degrade.ts + services/world-sim/src/sim.ts:137-218) and replay (apps/cop-ui/lib/{messageBuffer,replay}.ts) partially answer Phase II V&V. Validator audit (services/validator/) and CI gate exist. But there is no solver, no Red agent, no utility function, no headless eval harness anywhere in the repo — every scoring story is net-new code.

A reviewer audit of the prior chat plan found three structural weaknesses: (1) the 4–6-week pre-proposal sprint is too short for a credible D2P2 prior-art package — realistic is 12 weeks; (2) benchmarking only against scriptedAgent does not address the "paramount evaluation criterion" of defeating experienced human red teams; (3) AFSIM access realistically takes 6–12 months post-award (ITAR/JCP), so M&S transition must lead with Command: Professional Edition. This plan re-scopes Phase 0 accordingly.

The intended outcome: a 12-week pre-proposal sprint that produces a CPU-only ES-MCCFR + Public Belief State solver, a programmatic Red agent, a headless eval harness, a 4–6 SME human micro-tournament, and the white-paper / plot pair / video that make the D2P2 proposal responsive on every required attribute. The architecture choices preserve the demo's existing "smart-but-readable" code character: pure TypeScript, single MQTT transport, single XML parser, no GPU dependency, no neural-network black boxes.

Standards posture (UCI v2.5 + OMS v2.5)¶

The demo substrate is already shaped like an OMS v2.5 Mission Package (released 2026-01-22, governed by the OACWG): Mosquitto is the Abstract Service Bus, packages/uci-bus + packages/uci-codec are the de facto Critical Abstraction Layer, services/* are OMS Services, and services/adsb-bridge is shaped like an OMS Isolator. Phase 0 does not add OMS conformance work — that's a Phase II Year-1 deliverable in the companion plan — but the proposal text names OMS v2.5 as the framing standard so reviewers see the system positioned for an OMS-compliant Phase II without overpromising. The pre-proposal sprint stays focused on the solver, the Red agent, and the SME tournament.

Recommended approach¶

Architecture (one paragraph)¶

Five new workspace members. Two library packages that build to dist/ like @uci-demo/bus and @uci-demo/codec: @uci-demo/game (pure domain types: GameState, InformationSet, PublicBeliefState, Payoff, GameDynamics, no I/O) and @uci-demo/solver (ES-MCCFR core, Float32Array-backed regret tables, StrategyBank of modular subroutines, AnytimeBlueprint query API, Kuhn-poker correctness test). Three services: services/solver-daemon/ (long-lived self-play, owns the only RegretTable instance, answers info-set queries over MQTT side-channel uci-demo/solver/query/+/+ with a uci-demo/solver/reply/<requestId> response and uci-demo/solver/status retained heartbeat), services/red-agent/ (programmatic adversary publishing EntityNotificationMT + PositionReportMT exactly like services/adsb-bridge/src/bridge.ts:120-275, with zero changes to world-sim or scenario.ts), and services/eval-harness/ (headless N scenarios × M agents × K degrade presets runner emitting a versioned EvalReport JSON). The third Agent implementation, SolverAgent, is not a new package: it lives at services/copilot/src/solverAgent.ts alongside the existing scripted/claude impls, queries the daemon over MQTT, and returns an AgentDecision; the copilot's existing orchestration (services/copilot/src/main.ts:88-109) publishes everything to the wire. Solver core is pure TypeScript on tsx with typed-array regret tables. Rust/N-API is deferred behind a RegretTable interface escape hatch and is only invoked if the operational benchmark workflow demands it.

Language & runtime constraints¶

TypeScript only (no Rust core in Phase 0). V8 typed arrays sustain ~10⁷–10⁸ regret updates/sec/core, covering tactical scale.
One MQTT client per service via @uci-demo/bus connectBus. Solver query/reply rides MQTT, not gRPC.
One XML parser: solver-daemon reuses the XMLParser pattern from services/copilot/src/worldState.ts:4-10. Red-agent only builds XML via codec, never parses.
ESM + verbatimModuleSyntax + .js import extensions for all new TS code (matches tsconfig.base.json).
No schema/UCI_v2_5/ changes. Every new MT use already has a builder in packages/uci-codec/src/builders/.
No @uci-demo/codec top-level import from apps/cop-ui/ — if cop-ui ever needs solver visualization, use @uci-demo/codec/browser.
LLM rationalization layer is fully model-agnostic; Claude stays first-class. New packages/uci-llm/ defines a LanguageModelClient interface with structured tool-use, completion, streaming, prompt caching (where supported), and arbitrary sampling-param overrides. The operator can use any backend they want — every client is a single ≤200-LOC adapter under packages/uci-llm/src/clients/<name>.ts implementing the same interface. Headline set shipped on day 1: anthropic (Claude — the preferred default wherever reachable, prompt-caching on, current demo behavior preserved), ollama (local models — preferred default in air-gapped deployments), bedrock (Claude / Nova / Llama / Mistral on AWS, including GovCloud), and openai-compat (one client covers OpenAI, Azure OpenAI, Together, Groq, Fireworks, vLLM with OpenAI-compatible endpoint, llama.cpp server, and any OpenAI-shaped HTTP service). Provider chosen via env: LLM_PROVIDER, LLM_BASE_URL, LLM_MODEL, LLM_API_KEY. Default selection: LLM_PROVIDER if set, else anthropic when ANTHROPIC_API_KEY is present (current demo behavior), else ollama. Capability flags (supportsToolUse, supportsPromptCache, supportsStreaming, supportsGrammar) declared per client; where a backend lacks native tool-use the interface layer transparently falls back to JSON-mode + grammar or instruction-prompted structured output, so the same Agent contract works against any backend. A registerClient(name, factory) registry lets downstream consumers add custom backends without forking packages/uci-llm/. The existing services/copilot/src/claudeAgent.ts is renamed llmAgent.ts in Phase 0 week 1-2 and refactored to consume the abstraction — the system prompt, tool schemas, prompt-cache strategy, and tool-use semantics are preserved verbatim; only the SDK call site moves behind the interface. No service may import any vendor SDK directly outside packages/uci-llm/src/clients/. 
A shared structured-output parity test (packages/uci-llm/test/parity.test.ts) runs the same prompt + tool-schema against every shipped backend so swapping providers never silently changes agent behavior. An optional narrate?: (trace) => Promise<string[]> on SolverAgent accepts a LanguageModelClient and renders a regret decomposition in natural language; the action itself is always the solver's.
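The default-selection rule and the registry described above could be sketched roughly as follows. Only `LanguageModelClient`, `registerClient`, the four capability flags, and the environment variable names come from this plan; the `complete` method, the `provider` field, `selectProvider`, `createClient`, and the `echo` stub backend are illustrative assumptions, not the real interface.

```typescript
// Sketch only: shapes marked "assumed" do not appear in the plan.
type Env = Record<string, string | undefined>;

interface Capabilities {
  supportsToolUse: boolean;
  supportsPromptCache: boolean;
  supportsStreaming: boolean;
  supportsGrammar: boolean;
}

interface LanguageModelClient {
  readonly provider: string;                 // assumed field
  readonly capabilities: Capabilities;
  complete(prompt: string): Promise<string>; // assumed minimal surface
}

type ClientFactory = (env: Env) => LanguageModelClient;

const registry = new Map<string, ClientFactory>();

// Lets downstream consumers add a backend without forking the package.
function registerClient(name: string, factory: ClientFactory): void {
  registry.set(name, factory);
}

// Selection order from the plan: LLM_PROVIDER if set, else anthropic when
// ANTHROPIC_API_KEY is present, else ollama.
function selectProvider(env: Env): string {
  const explicit = env["LLM_PROVIDER"];
  if (explicit) return explicit;
  if (env["ANTHROPIC_API_KEY"]) return "anthropic";
  return "ollama";
}

function createClient(env: Env): LanguageModelClient {
  const name = selectProvider(env);
  const factory = registry.get(name);
  if (!factory) throw new Error(`no LLM client registered for "${name}"`);
  return factory(env);
}

// Hypothetical stub backend, useful for offline tests.
registerClient("echo", (): LanguageModelClient => ({
  provider: "echo",
  capabilities: {
    supportsToolUse: false,
    supportsPromptCache: false,
    supportsStreaming: false,
    supportsGrammar: false,
  },
  complete: async (prompt: string): Promise<string> => prompt,
}));
```

Keeping selection as a pure function of the environment makes the three-way fallback easy to unit-test without touching any vendor SDK.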

Utility function (`packages/uci-game/src/payoff.ts`)¶

Zero-sum: Red's payoff is -U_B. Version-tagged PAYOFF_V = 1 so weight changes don't silently invalidate historic runs. Computed online by solver-daemon's worldMirror.ts and offline by services/eval-harness/src/scoreReplay.ts.

U_B = + 1.0  · neutralized_hostiles               // EntityLostMT for HOSTILE/SUSPECT trackIds
      - 5.0  · fratricide_events                   // EntityLostMT for FRIEND inside any CapabilityCoverageAreaMT polygon
      - 0.2  · roe_violations                      // proposals violating uci-demo/world/roe band
      - 0.05 · fuel_fraction_burned_total          // integrated SubsystemStatusMT.state bands
      - 0.3  · failed_effects                      // EffectStatusMT.state = FAILED
      - 0.001· comms_degrade_seconds               // integral over uci-demo/world/degrade window
      - 0.002· mean_time_to_decision_ms / 1000     // copilot evaluate() wall time, capped 5s

Weights are constants, not learned: they are the surface through which the operator expresses doctrinal preferences.
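A minimal sketch of how the formula above might be evaluated in code. The weights and `PAYOFF_V` are taken from this section; the `EpisodeTallies` shape and the function names are assumptions, and the "capped 5s" note is interpreted here as clamping mean decision time to 5000 ms before weighting.

```typescript
const PAYOFF_V = 1; // bump whenever a weight changes

// Hypothetical per-episode tallies; field names are illustrative.
interface EpisodeTallies {
  neutralizedHostiles: number;
  fratricideEvents: number;
  roeViolations: number;
  fuelFractionBurnedTotal: number;
  failedEffects: number;
  commsDegradeSeconds: number;
  meanTimeToDecisionMs: number;
}

// Blue utility U_B with the constant weights listed above.
function blueUtility(t: EpisodeTallies): number {
  // Assumption: the 5 s cap applies to the mean decision time itself.
  const decisionSeconds = Math.min(t.meanTimeToDecisionMs, 5000) / 1000;
  return (
    1.0 * t.neutralizedHostiles -
    5.0 * t.fratricideEvents -
    0.2 * t.roeViolations -
    0.05 * t.fuelFractionBurnedTotal -
    0.3 * t.failedEffects -
    0.001 * t.commsDegradeSeconds -
    0.002 * decisionSeconds
  );
}

// Zero-sum: Red receives the negation of Blue's utility.
function redUtility(t: EpisodeTallies): number {
  return -blueUtility(t);
}

// Tag every score with the payoff version so old runs stay comparable.
function scoreEpisode(t: EpisodeTallies): { payoffVersion: number; blue: number; red: number } {
  return { payoffVersion: PAYOFF_V, blue: blueUtility(t), red: redUtility(t) };
}
```

Because the same pure function can run inside the daemon and in an offline replay scorer, online and offline scores stay identical for a given payoff version.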

Information-set / PBS factoring¶

Wire field	Visibility	Goes into
`PositionReportMT.{lat,lng,alt}`	public	`belief.publicPositions`
`EntityNotificationMT.Severity` + `EntityMT.Identity.Platform.ThreatType/Confidence`	public (noisy)	drives Bayesian update of `identityBelief`
true `Identity` enum	hidden from Blue	`GameState.hidden.trueIdentity`
`SubsystemStatusMT.state` band	public	`belief.fuelBelief`
exact fuel fraction	hidden from Red	`GameState.hidden.trueFuel`
`uci-demo/world/roe` (retained)	public	`belief.roe`
`uci-demo/world/degrade`	public	`belief.commsDegrade`
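The Bayesian update of `identityBelief` named in the table can be illustrated with a small sketch. The three identity labels appear elsewhere in this plan; the `Belief` type, the likelihood values in the usage note, and the choice to keep the prior when an observation has zero total likelihood are assumptions made for illustration.

```typescript
type Identity = "HOSTILE" | "SUSPECT" | "FRIEND";
type Belief = Record<Identity, number>;

const IDENTITIES: readonly Identity[] = ["HOSTILE", "SUSPECT", "FRIEND"];

// posterior(id) is proportional to prior(id) * P(observation | id).
function bayesUpdate(prior: Belief, likelihood: Belief): Belief {
  const post: Belief = { HOSTILE: 0, SUSPECT: 0, FRIEND: 0 };
  let evidence = 0;
  for (const id of IDENTITIES) {
    post[id] = prior[id] * likelihood[id];
    evidence += post[id];
  }
  // Assumption: an observation that is impossible under the prior is
  // treated as uninformative rather than as an error.
  if (evidence === 0) return { ...prior };
  for (const id of IDENTITIES) post[id] /= evidence;
  return post;
}
```

For example, starting from a uniform prior and observing a report with hypothetical likelihoods `{ HOSTILE: 0.8, SUSPECT: 0.15, FRIEND: 0.05 }`, the posterior equals those likelihoods because they already sum to one.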

InfoSet key = FNV-1a-64 over canonical encoding of bucketed (roe, commsBucket, ∀trackId: identityBelief decile + threatType bucket, ∀effectorId: fuel band, recent_actions[last 8]). Bucketing holds tactical info-set count near 10³.
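A sketch of the key derivation, assuming a pipe-delimited string as the canonical encoding and a scalar P(HOSTILE) as the bucketed identity belief. The FNV-1a-64 constants are the standard ones; the `InfoSetInput` shape, the field names, and the encoding layout are hypothetical.

```typescript
const FNV_OFFSET = 0xcbf29ce484222325n;
const FNV_PRIME = 0x100000001b3n;
const MASK_64 = 0xffffffffffffffffn;

// Standard FNV-1a, 64-bit, over the UTF-8 bytes of the input.
function fnv1a64(text: string): bigint {
  let hash = FNV_OFFSET;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= BigInt(byte);
    hash = (hash * FNV_PRIME) & MASK_64;
  }
  return hash;
}

// Map a probability to a decile bucket 0..9.
function decile(p: number): number {
  return Math.min(9, Math.max(0, Math.floor(p * 10)));
}

// Hypothetical bucketed view of one decision point.
interface InfoSetInput {
  roe: string;
  commsBucket: number;
  tracks: { trackId: string; pHostile: number; threatType: string }[];
  effectors: { effectorId: string; fuelBand: string }[];
  recentActions: string[];
}

const byKey = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

// Sort by id so the key does not depend on message arrival order.
function canonicalEncoding(s: InfoSetInput): string {
  const tracks = [...s.tracks]
    .sort((a, b) => byKey(a.trackId, b.trackId))
    .map((t) => `${t.trackId}:${decile(t.pHostile)}:${t.threatType}`);
  const effectors = [...s.effectors]
    .sort((a, b) => byKey(a.effectorId, b.effectorId))
    .map((e) => `${e.effectorId}:${e.fuelBand}`);
  return [
    `roe=${s.roe}`,
    `comms=${s.commsBucket}`,
    `tracks=${tracks.join(",")}`,
    `fuel=${effectors.join(",")}`,
    `recent=${s.recentActions.slice(-8).join(",")}`,
  ].join("|");
}

function infoSetKey(s: InfoSetInput): bigint {
  return fnv1a64(canonicalEncoding(s));
}
```

Sorting before hashing is the important design choice: two world mirrors that ingest the same tracks in different order still land on the same info set.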

Modular doctrinal subroutines (`packages/uci-solver/src/subroutines/`)¶

The regret table is keyed over subroutine IDs, not raw actions. This is the architectural answer to RFP attribute #2 (interpretability) — nothing is a neural blob, every component is reviewable TS.

Seeded from scriptedAgent.ts:24-66's existing factoring:

Subroutine	Doctrine
`IdentityGate`	withhold when P(FRIEND) > 0.4
`RoeEscalation`	ROE RED ⇒ kinetic-first
`SoftKillFirst`	AMBER + low PID confidence ⇒ EW-first
`ReplanEscalation`	soft-kill failed ⇒ kinetic
`JammerCounter`	`threatType==="JAMMER"` ⇒ skip EW (mirrors `scriptedAgent.ts:82-85`)
`FratricideAvoidance`	withhold if FRIEND inside `CapabilityCoverageAreaMT` polygon
`FuelConservation`	degrade effector preference when fuel band CRITICAL
`CommsDegradeHedge`	high `belief.commsDegrade` ⇒ prefer autonomous-capable effector

StrategyBank.composedPolicy softmax-weights each subroutine's distribution by regret, mixes, renormalizes. bank.explain(info, regrets) emits one SubroutineTrace per active subroutine — these become the decision.rationale[] strings the copilot publishes to uci-demo/copilot/reason/<planId> via existing publishReasoningLine (services/copilot/src/main.ts:120-133).
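The softmax-weight, mix, renormalize sequence could be sketched as below. This is an illustrative reading of the description, not the planned implementation: the `SubroutineOutput` shape, the uniform fallback when no subroutine contributes mass, and the trace string format are assumptions.

```typescript
// Hypothetical output of one subroutine at one info set.
interface SubroutineOutput {
  id: string;
  regret: number;   // scalar regret score for this subroutine
  dist: number[];   // the subroutine's distribution over actions
}

// Numerically stable softmax (subtract the max before exponentiating).
function softmax(xs: number[]): number[] {
  if (xs.length === 0) return [];
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function composedPolicy(
  subs: SubroutineOutput[],
  numActions: number,
): { policy: number[]; weights: number[] } {
  const weights = softmax(subs.map((s) => s.regret));
  const mix: number[] = new Array<number>(numActions).fill(0);
  subs.forEach((s, i) => {
    const w = weights[i] ?? 0;
    for (let a = 0; a < numActions; a++) {
      mix[a] = (mix[a] ?? 0) + w * (s.dist[a] ?? 0);
    }
  });
  const total = mix.reduce((a, b) => a + b, 0);
  // Assumption: with no usable mass, fall back to a uniform policy.
  const policy = total > 0 ? mix.map((p) => p / total) : mix.map(() => 1 / numActions);
  return { policy, weights };
}

// One human-readable line per subroutine, e.g. "IdentityGate 0.50".
function explain(subs: SubroutineOutput[], weights: number[]): string[] {
  return subs.map((s, i) => `${s.id} ${(weights[i] ?? 0).toFixed(2)}`);
}
```

Returning the weights alongside the policy keeps the explanation honest: the rationale lines are generated from the same numbers that produced the action distribution.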

Anytime semantics¶

Blueprint-in-daemon model. Solver-daemon runs ES-MCCFR continuously; blueprint exploitability trends downward with daemon uptime (the exploitability bound on the average strategy shrinks as O(1/√T) with high probability, although sampled exploitability is noisy rather than strictly monotone). SolverAgent.evaluate() is an O(1) info-set lookup plus a ~50ms MQTT RPC, comfortably inside the 5s budget at services/copilot/src/types.ts:73-86. The retained uci-demo/solver/status heartbeat carries {iterations, exploitability, infoSetCount}; that retained message makes the anytime property observable, because any client can read the current blueprint quality at any moment. Cold-start falls back to scriptedAgent and tags the rationale with "solver-blueprint cold; using scripted fallback".
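A sketch of the cold-start decision, under stated assumptions. The heartbeat fields and the fallback rationale string come from this section; what counts as "cold" (no heartbeat yet, zero iterations, or an info set the blueprint has never visited) and the `chooseDecisionSource` helper are illustrative guesses.

```typescript
// Payload of the retained status heartbeat, as listed above.
interface SolverStatus {
  iterations: number;
  exploitability: number;
  infoSetCount: number;
}

interface DecisionSource {
  source: "solver" | "scripted";
  rationaleTag: string;
}

const COLD_TAG = "solver-blueprint cold; using scripted fallback";

// Assumption: the blueprint is "cold" when no heartbeat has been seen,
// when self-play has not started, or when the queried info set is unknown.
function chooseDecisionSource(
  status: SolverStatus | undefined,
  infoSetKnown: boolean,
): DecisionSource {
  if (!status || status.iterations === 0 || !infoSetKnown) {
    return { source: "scripted", rationaleTag: COLD_TAG };
  }
  return {
    source: "solver",
    rationaleTag:
      `solver-blueprint at ${status.iterations} iterations, ` +
      `exploitability ${status.exploitability.toFixed(3)}`,
  };
}
```

Surfacing the iteration count and exploitability in the rationale lets an operator judge how much to trust an answer that arrived early.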

Red agent¶

Identical pattern to services/adsb-bridge/src/bridge.ts:120-275. Publishes EntityNotificationMT, EntityMT, PositionReportMT, EntityLostMT with a distinct senderSystemId and RED- prefixed topic ids. No changes to services/world-sim/src/sim.ts or services/world-sim/src/scenario.ts. Two policy backends: scripted (heuristic baseline for Phase 0 ladder) and solver-driven (queries solver-daemon for Red-side policy via the same RPC surface).

Headless eval harness¶

CLI: tsx services/eval-harness/src/main.ts --scenarios ... --agents ... --degrade ... --episodes-per-cell N --report out/eval/<ts>.json. Boots Mosquitto via existing docker-compose.yml, then imports extracted startWorldSim() and startCopilot() functions in-process — no cop-ui. Deterministic seeds for scenario, Red, MCCFR. Emits EvalReport JSON + per-episode NDJSON timeline + bus log to out/eval/<runId>/<scenario>-<agent>-<degrade>-<idx>/.
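The scenario × agent × degrade × episode matrix and the per-episode output directory could be enumerated as in the sketch below. The directory layout follows the path pattern above; `enumerateCells`, `seedFor`, and the decision to derive seeds without the agent name (so that every agent plays identical episodes) are assumptions.

```typescript
interface EpisodeCell {
  scenario: string;
  agent: string;
  degrade: string;
  index: number;
  seed: number;
  outDir: string;
}

// Deterministic 32-bit seed from a label (FNV-1a over UTF-16 code units).
function seedFor(label: string, baseSeed: number): number {
  let h = (0x811c9dc5 ^ baseSeed) >>> 0;
  for (let i = 0; i < label.length; i++) {
    h ^= label.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function enumerateCells(
  runId: string,
  scenarios: string[],
  agents: string[],
  degrades: string[],
  episodesPerCell: number,
  baseSeed = 0,
): EpisodeCell[] {
  const cells: EpisodeCell[] = [];
  for (const scenario of scenarios) {
    for (const agent of agents) {
      for (const degrade of degrades) {
        for (let index = 0; index < episodesPerCell; index++) {
          cells.push({
            scenario,
            agent,
            degrade,
            index,
            // Assumption: the agent is excluded from the seed label so that
            // agents are compared on the same scenario draw.
            seed: seedFor(`${scenario}|${degrade}|${index}`, baseSeed),
            outDir: `out/eval/${runId}/${scenario}-${agent}-${degrade}-${index}/`,
          });
        }
      }
    }
  }
  return cells;
}
```

Calling `enumerateCells("r1", ["tripwire"], ["scripted", "solver"], ["none"], 2)` yields four cells, and the scripted and solver cells for a given episode index share a seed.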

Phase 0 — 12-week sprint (not 4-6)¶

The 4–6-week estimate in the prior chat plan was unrealistic. Re-scoped against the actual D2P2 evidence list:

Week	Workstream
1-2	Refactor for extractability. Extract `startCopilot()` / `startWorldSim()`; promote `worldState.ts` to `@uci-demo/game/worldMirror`; extract `runWithConcurrency` from `adsb-bridge/bridge.ts:89-108`; introduce `packages/uci-llm/` (`LanguageModelClient` + `AnthropicClient` + `OllamaClient`) and refactor `claudeAgent.ts` → `llmAgent.ts` behind it. Every refactor PR runs an end-to-end smoke test — `pnpm up`, validator audit at `http://127.0.0.1:7700/audit?n=50` is 100% valid, approval card / MODIFY round-trip / comms-degrade injection / replay reconstruction all still work. Typecheck + unit tests alone do not gate refactor merges. PR-mergeable to `main` independently.
2-4	`@uci-demo/game` types + dynamics + PBS belief update + payoff. Vitest suite for belief Bayesian update + payoff math.
3-6	`@uci-demo/solver` ES-MCCFR + regret/strategy tables + strategy bank + blueprint. Kuhn poker correctness test at `packages/uci-solver/test/kuhn.test.ts`, which must converge to <0.01 exploitability in 10k iterations. Without this, no reviewer will trust the kernel.
5-7	`services/red-agent/` + `services/solver-daemon/`. Subprocess gated on `USE_SOLVER=1` so default `pnpm up` is unchanged.
6-8	`services/copilot/src/solverAgent.ts` + three-way agent selection at `services/copilot/src/main.ts:58-68`.
7-9	`services/eval-harness/` + `.github/workflows/eval.yml`. Tactical bench runs on every PR labeled `solver-perf`; operational bench nightly on self-hosted.
8-10	Scaling study. 3-effector Tripwire → 10/30 synthetic → 30/100 synthetic. Plot wall-clock-to-target-exploitability vs problem size (3 data points = credible log-log fit).
9-11	Human SME micro-tournament. 4–6 retired O-3/O-4 with C-UAS or air-ops backgrounds, 3 games each vs solver on Tripwire/Vanguard. That yields 12–18 games with logged decisions; recruit at least 5 SMEs (or add games per SME) to reach the n≥15 required by the tournament artifact. Budget ~$25k in honoraria. Highest-leverage single line item in the entire sprint.
10-12	White paper (≥15 pages: model, algorithm, abstraction, exploitability plot, scaling plot, tournament results, integration story, transition path) + demo video (3–5 min, solver-vs-scripted side-by-side on same scenario seed) + interpretability case studies (3 worked decisions with regret decomposition + mixed strategy + outcome).

Phase II Year 1 (post-award)¶

Primary M&S integration: Command: Professional Edition (~$3–10k/seat, commercially licensable, Lua API, achievable on SBIR budget). New services/cpe-bridge/ mirrors the adsb-bridge pattern.

Secondary, stretch: AFSIM via JCP/DD2345 sponsorship — apply for JCP now, before proposal submission, since realistic access timeline is 6–12 months. Fallback if government access slips a quarter: hardened services/world-sim/ as the V&V environment.

Multi-domain expansion of services/world-sim/src/scenario.ts schema (sea + land asset kinds; corresponding MT builders). Scaling work via PBS subgame re-solving (the ReBeL-style technique; Pluribus itself used depth-limited search on a blueprint rather than public belief states). Containerization: Dockerfile per service + docker-compose.prod.yml. TDP: architecture doc + interpretability case-book + V&V report.

Phase III (commercialization)¶

Cut the breadth claim in the prior chat plan ("cyber, supply chain, market-making"). Cyber and supply chain are not naturally two-player zero-sum; the claim signals the team has not thought through the framing limits. Replace with one defensible vertical: Counter-UAS in joint coalition contexts (or naval surface engagement planning). One grounded paragraph beats three handwaves.

Critical files¶

New files (exhaustive)¶

Phase 0 refactor PR (independent of solver work):
- packages/uci-game/: package.json, tsconfig.json, src/{index,types,dynamics,belief,actions,hash,payoff,worldMirror,report}.ts, test/{dynamics,belief,payoff,hash}.test.ts, fixtures/assets.json
- packages/uci-llm/: package.json, tsconfig.json, src/{index,types,registry,toolUse,fallbackStructuredOutput,select}.ts, src/clients/{anthropic,ollama,bedrock,openai-compat}.ts, test/{registry,parity,toolUse,fallback}.test.ts. LanguageModelClient interface + 4 shipped clients + registerClient(name, factory) registry for user-defined backends. Parity test runs the same prompt + tool schema against every shipped backend and asserts structurally equivalent output.
- services/copilot/src/llmAgent.ts: refactor of claudeAgent.ts:1-207 consuming @uci-demo/llm; original file is renamed in this PR.
- services/copilot/src/service.ts: extracted startCopilot(opts)
- services/world-sim/src/service.ts: extracted startWorldSim(opts)
- packages/uci-bus/src/concurrency.ts: extracted runWithConcurrency

Solver core PR:
- packages/uci-solver/: package.json, tsconfig.json, src/{index,escfr,regret,bank,blueprint,serialize}.ts, src/subroutines/{identityGate,roeEscalation,softKillFirst,replanEscalation,jammerCounter,fratricideAvoidance,fuelConservation,commsDegradeHedge,index}.ts, test/{escfr,regret,bank,blueprint,kuhn}.test.ts, test/subroutines/*.test.ts

Red + daemon PR:
- services/red-agent/: package.json, tsconfig.json, src/{main,redLoop,scenarios}.ts, src/policies/{scripted,solverDriven,index}.ts, test/redLoop.test.ts
- services/solver-daemon/: package.json, tsconfig.json, src/{main,selfPlay,rpc}.ts, test/rpc.test.ts

SolverAgent + eval PR:
- services/copilot/src/solverAgent.ts, services/copilot/src/solverAgent.test.ts
- services/eval-harness/: package.json, tsconfig.json, src/{main,runner,scoreReplay,scenarios}.ts, regression.config.json, test/runner.test.ts
- .github/workflows/eval.yml

Whitepaper / artifacts PR:
- docs/whitepaper/: markdown source + figures
- docs/benchmarks/: exploitability-vs-iterations plot, wall-clock-vs-problem-size plot, tournament results JSON

Edited files¶

services/copilot/src/main.ts — lines 54-786 body extracted to service.ts; lines 58-68 agent selection becomes three-way (USE_SOLVER=1 > LLM_PROVIDER set > scripted). Selection no longer keys on ANTHROPIC_API_KEY specifically — that env var is just one input to packages/uci-llm provider selection.
services/copilot/src/claudeAgent.ts → renamed services/copilot/src/llmAgent.ts; body refactored to consume LanguageModelClient from @uci-demo/llm. No direct @anthropic-ai/sdk import remains anywhere in services/copilot/.
services/copilot/src/worldState.ts — re-export from @uci-demo/game/worldMirror
services/copilot/package.json: add @uci-demo/game, @uci-demo/solver, and @uci-demo/llm workspace deps (llmAgent.ts consumes the latter)
services/world-sim/src/main.ts — body extracted to service.ts
services/adsb-bridge/src/bridge.ts — import runWithConcurrency from @uci-demo/bus/concurrency
package.json (root) — pnpm up script adds RED + SOLVER to concurrently list, gated on USE_SOLVER=1 (default off — keeps existing demo behavior)
README.md — add Solver Quickstart section pointing at USE_SOLVER=1 pnpm up
BUILD.md — Day 12+ entries

Existing utilities to reuse (do not duplicate)¶

Agent contract: services/copilot/src/types.ts:73-86 (Agent interface, AgentDecision union, EvaluationContext).
External-source publishing pattern: services/adsb-bridge/src/bridge.ts:120-275 (template for red-agent — emit EntityNotification + PositionReport + EntityLost lifecycle).
World state mirroring: services/copilot/src/worldState.ts:4-10 (XMLParser instance) and :49-139 (entity ingestion) — promote to @uci-demo/game/worldMirror, both copilot and solver-daemon import from there.
Reasoning-line streaming: services/copilot/src/main.ts:120-133 (publishReasoningLine). SolverAgent populates decision.rationale[]; copilot publishes one line per subroutine via existing loop at :447-453.
Codec builders: buildEntityNotification, buildEntity, buildPositionReport, buildEntityLost from @uci-demo/codec (Red agent reuses; never hand-rolls XML).
Bus client: connectBus from @uci-demo/bus (every new service).
Comms-degrade injection: apps/cop-ui/lib/degrade.ts publishDegrade() is callable from the eval harness (not UI-coupled).
Validator audit: http://127.0.0.1:7700/audit?n=N — eval harness consumes for schema-validity row in EvalReport. Note sampling regime (first 20/MT + 1-in-10) — extend validator with VALIDATOR_FULL_AUDIT=1 env for benchmark runs.
Scenario YAML schema: services/world-sim/src/scenario.ts:1-122 — extend with new event types (runtime_spawn, red_inject) only if non-bridge Red is needed; default approach uses bridge pattern.
Playwright extensibility: apps/cop-ui/playwright.config.ts — can wrap E2E in scenario × degrade × agent matrix later; not Phase 0 critical.

Verification¶

Algorithmic correctness (CI)¶

pnpm -F @uci-demo/solver test
# Must pass: packages/uci-solver/test/kuhn.test.ts
# (ES-MCCFR converges to <0.01 exploitability in 10k iterations on Kuhn poker)

This is the gate that says "the kernel is real." Reviewers will look for it.
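For orientation, the regret-matching step at the heart of any CFR variant can be sketched over a Float32Array-backed table as below. This is only the per-info-set building block, not ES-MCCFR itself (there is no sampling or tree traversal here), and the class layout and method names are assumptions rather than the planned `RegretTable` interface.

```typescript
// Flat regret storage: one row of numActions floats per info set.
class RegretTable {
  private readonly regrets: Float32Array;
  private readonly numActions: number;

  constructor(numInfoSets: number, numActions: number) {
    this.numActions = numActions;
    this.regrets = new Float32Array(numInfoSets * numActions);
  }

  // Regret matching: play each action in proportion to its positive
  // cumulative regret; play uniformly when no regret is positive.
  strategy(infoSet: number): Float32Array {
    const base = infoSet * this.numActions;
    const out = new Float32Array(this.numActions);
    let positiveSum = 0;
    for (let a = 0; a < this.numActions; a++) {
      const r = Math.max(0, this.regrets[base + a] ?? 0);
      out[a] = r;
      positiveSum += r;
    }
    for (let a = 0; a < this.numActions; a++) {
      out[a] = positiveSum > 0 ? (out[a] ?? 0) / positiveSum : 1 / this.numActions;
    }
    return out;
  }

  // Accumulate regret: (value of taking action a) - (value of the node
  // under the current strategy), for every action at this info set.
  update(infoSet: number, actionValues: ArrayLike<number>, nodeValue: number): void {
    const base = infoSet * this.numActions;
    for (let a = 0; a < this.numActions; a++) {
      const delta = (actionValues[a] ?? 0) - nodeValue;
      this.regrets[base + a] = (this.regrets[base + a] ?? 0) + delta;
    }
  }
}
```

A Kuhn poker test then amounts to driving `update` from sampled traversals and checking that the averaged strategy's exploitability falls below the stated threshold.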

End-to-end smoke¶

# 1. Baseline matrix.
pnpm -F @uci-demo/eval run bench -- \
  --scenarios tripwire,vanguard,stillwater \
  --agents scripted \
  --degrade none,light,heavy \
  --episodes-per-cell 20 \
  --report out/eval/baseline-scripted.json

# 2. Run solver-daemon to convergence (~10 min wall-clock on tactical).
USE_SOLVER=1 pnpm -F @uci-demo/solver-daemon start &
until [ "$(mosquitto_sub -t 'uci-demo/solver/status' -C 1 | jq '.exploitability < 0.05')" = "true" ]; do sleep 30; done

# 3. Solver-agent matrix.
pnpm -F @uci-demo/eval run bench -- \
  --scenarios tripwire,vanguard,stillwater \
  --agents solver \
  --degrade none,light,heavy \
  --episodes-per-cell 20 \
  --report out/eval/solver.json

# 4. Compare. Pass condition: mean ΔU_B >= +0.5, Welch's t-test p < 0.05.
pnpm -F @uci-demo/eval run compare \
  --baseline out/eval/baseline-scripted.json \
  --candidate out/eval/solver.json \
  --metric blueUtility
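The statistical core of the compare step might look like the sketch below, which computes Welch's t statistic and the Welch–Satterthwaite degrees of freedom for two samples of per-episode Blue utility. Welch's test compares means. The function names are assumptions, and turning `t` and `df` into a p-value needs a Student-t CDF, which is deliberately left out here.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

interface WelchResult {
  meanDelta: number; // candidate mean minus baseline mean
  t: number;         // Welch's t statistic
  df: number;        // Welch–Satterthwaite degrees of freedom
}

function welch(baseline: number[], candidate: number[]): WelchResult {
  if (baseline.length < 2 || candidate.length < 2) {
    throw new Error("each sample needs at least two observations");
  }
  const vb = sampleVariance(baseline) / baseline.length;
  const vc = sampleVariance(candidate) / candidate.length;
  if (vb + vc === 0) throw new Error("both samples have zero variance");
  const meanDelta = mean(candidate) - mean(baseline);
  const t = meanDelta / Math.sqrt(vb + vc);
  const df =
    (vb + vc) ** 2 /
    (vb ** 2 / (baseline.length - 1) + vc ** 2 / (candidate.length - 1));
  return { meanDelta, t, df };
}
```

Welch's variant is a reasonable default here because scripted and solver runs are unlikely to share a variance, and it does not assume equal sample sizes.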

Manual demo verification¶

USE_SOLVER=1 RED_AGENT=solver-driven pnpm up. Browser: proposals appear within ~3s of each Red contact. Reasoning panel shows subroutine-weighted explanations (e.g., "ReplanEscalation 0.72 — soft-kill failed twice; escalating to kinetic"). Validator audit feed at http://127.0.0.1:7700/audit?n=20 shows 100% valid.
mosquitto_pub -t uci-demo/world/degrade -m '{"dropPercent":80,"latencyMs":500,"durationMs":20000}'. Within one self-play epoch (visible on uci-demo/solver/status retained heartbeat), Blue blueprint shifts mass toward CommsDegradeHedge. Reasoning trace reflects the shift.
Force a HAWK-2 fuel exhaustion. Observe FuelConservation subroutine gain weight on the next evaluation; copilot recommends GUARDIAN-3 handoff in the reasoning rail.
Click MODIFY on an active proposal. Copilot publishes UPDATE trio; reasoning rail logs OPERATOR // MODIFY; solver-daemon's worldMirror records the operator action as a public observation, which the next info-set's belief reflects.

Plot/artifact verification (proposal deliverables)¶

docs/benchmarks/exploitability-vs-iterations.png exists and shows exploitability trending downward with iterations on Kuhn (and on the Tripwire abstraction); sampling noise means the curve need not be strictly monotone.
docs/benchmarks/wallclock-vs-problemsize.{png,json} exists with 3 data points (3/10/30 effectors).
docs/benchmarks/sme-tournament.json exists with n≥15 logged games; aggregate Blue-utility result and per-SME breakdown.
docs/whitepaper/main.md ≥15 pages covering model, algorithm, abstraction, results, scaling, integration, transition.
docs/video/solver-vs-scripted-tripwire.mp4 exists, 3–5 min, side-by-side on identical scenario seed.
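The log-log fit behind the wall-clock-vs-problem-size plot is an ordinary least-squares line through (log size, log seconds); its slope is the empirical scaling exponent. A minimal sketch follows, with `fitPowerLaw` and the `Point` shape as assumed names. The numbers in the usage note are synthetic placeholders, not measurements.

```typescript
interface Point {
  size: number;    // problem size, e.g. effector count
  seconds: number; // wall-clock time to target exploitability
}

// Fit seconds ≈ coefficient * size^exponent by least squares in log space.
function fitPowerLaw(points: Point[]): { exponent: number; coefficient: number } {
  if (points.length < 2) throw new Error("need at least two points");
  const xs = points.map((p) => Math.log(p.size));
  const ys = points.map((p) => Math.log(p.seconds));
  const n = points.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    const dx = (xs[i] ?? 0) - meanX;
    sxy += dx * ((ys[i] ?? 0) - meanY);
    sxx += dx * dx;
  }
  if (sxx === 0) throw new Error("problem sizes must differ");
  const exponent = sxy / sxx;
  const coefficient = Math.exp(meanY - exponent * meanX);
  return { exponent, coefficient };
}
```

With synthetic points (3, 9), (10, 100), (30, 900) the fit recovers an exponent of 2 and a coefficient of 1. With only three real data points the fit has a single residual degree of freedom, so the exponent is best reported together with the raw points.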

Compliance gate against RFP attributes¶

Before submission, every row in this table must be answerable with a specific repo artifact:

RFP attribute	Required artifact	Source
Dominant Performance	SME tournament JSON (n≥15) + scripted-baseline delta with CI95	`docs/benchmarks/sme-tournament.json` + `out/eval/*.json`
Human-Interpretability	3 worked case studies showing regret decomposition + mixed strategy + outcome	`docs/whitepaper/case-studies/`
Scalability	Wall-clock-vs-problem-size plot, 3 data points, log-log fit	`docs/benchmarks/wallclock-vs-problemsize.{png,json}`
Computational Efficiency	Single-workstation benchmark, no GPU, named CPU spec	`docs/benchmarks/host.json` + reproduce instructions
Anytime	Exploitability-vs-iterations plot + retained `uci-demo/solver/status` heartbeat in live demo	`docs/benchmarks/exploitability-vs-iterations.png` + video
