agentsfirst-radar — Operational Runbook
Companion to the design doc at docs/plans/2026-05-15-agentsfirst-radar-design.md and the rules at tools/agentsfirst-radar/AGENTS.md. Reach for this file when something goes wrong in production.
Quick reference
| Symptom | First thing to check |
|---|---|
| No briefing this morning | Healthchecks.io agentsfirst-radar-daily slug — is it red? Then check GHA workflow runs. |
| Dead-man’s-switch paged | state/last-run.txt timestamp + the joshhome runner status |
| iMessage didn’t land but briefing committed | state/imsg-last-success.json age + chat.db lock + LaunchAgent on joshhome |
agent_paused flag is true |
state/radar-state.json — read pause_reason. Don’t unpause without RCA. |
| Cold-start banner in briefing | state/radar-state.json is missing or unparseable. See “Cold start” below. |
| Schema rollback needed (v2 → v1) | See “Schema rollback” below. |
| Multipov spend exhausted | state.multipov_spend_today_usd ≥ $10. Resets at UTC midnight. |
| Listener daemon dead | LaunchAgent on joshhome; see “imsg-listener” below. |
Infrastructure inventory (set up once at deploy)
These are the TODOs flagged in tools/agentsfirst-radar/README.md — they don’t exist until you create them.
| Item | Owner | Purpose |
|---|---|---|
Healthchecks.io slug agentsfirst-radar-daily |
ops | Heartbeat for the daily cron run; alert if >28h stale |
Healthchecks.io slug agentsfirst-radar-paused |
ops | Distinct alert path when agent_paused = true (so a paused state isn’t read as a dead cron) |
Healthchecks.io slug agentsfirst-radar-imsg |
ops | iMessage delivery — alert if state/imsg-last-success.json >25h stale even when daily slug is green |
| Cloudflare Worker dead-man’s-switch | ops, joshshop account | Polls state/last-run.txt on the agentsfirst repo via GitHub API every 6h; pages Josh (Pushover) if file >28h old |
GHA secrets (5): ANTHROPIC_API_KEY, XAI_API_KEY, MULTIPOV_API_KEY, SOCIAL_WORKER_KEY, HC_RADAR_PING_URL |
ops | All 5 must be set on the agentsfirst repo before the first cron fires |
| “Radar: Triage” Asana project | COS team | Unassigned tasks land here; COS routes |
| Self-hosted RSSHub Docker container on joshhome | ops | Public rsshub.app is untrusted; we self-host per P2-R3 |
| LaunchAgent for imsg-listener on joshhome | ops | Always-on daemon that watches chat.db for replies. Plist at ~/Library/LaunchAgents/com.capitalthought.agentsfirst-radar-listener.plist; load with launchctl bootstrap gui/$(id -u) |
state/cos-skip-dates.json first commit |
COS team | List of dates to skip (SXSW prep, board prep, family travel) |
Cold start (first-ever run, or recovery from lost state)
cd ~/Xcode/agentsfirst/tools/agentsfirst-radar
npm install
npm run state:init # writes a fresh v1 state to ../../state/radar-state.json
npm run radar:dry-run # smoke test — no commits, no iMessage; prints briefing preview
If dry-run looks reasonable, kick a real run from the GHA workflow_dispatch UI with dry_run = false and skip_tz_guard = true. Watch the heartbeat slug.
Schema rollback (v2 deployed; need to revert to v1)
The radar refuses to run on a state file whose schema_version > CURRENT_SCHEMA_VERSION of the binary. Per design §5.1, rollback is NOT automatic.
Two scenarios:
A) Rollback the code, keep the v2 state (NOT POSSIBLE)
Don’t do this. The v1 code will fail-closed on the v2 state and you’ll get a cold-start banner forever. See path B.
B) Rollback both code AND state (the supported path)
- On joshhome, find the most recent pre-bump backup:
ls -t state/radar-state.json.v1.bak | head -1 - Restore it over the live state:
cp state/radar-state.json.v1.bak state/radar-state.json - Revert the radar package to a tag/commit known to be on v1:
git checkout <pre-v2-commit> -- tools/agentsfirst-radar/ - Smoke-test:
cd tools/agentsfirst-radar && npm run radar:dry-run - Commit the rollback (don’t
--forcepush — make a real revert commit):git -C ~/Xcode/agentsfirst add tools/agentsfirst-radar/ git -C ~/Xcode/agentsfirst commit -m "chore(radar): rollback to v1 schema (RCA: <reason>)" git -C ~/Xcode/agentsfirst push
If the v2 migration was tagged irreversible: true, the down() doesn’t exist and you have to:
- Either accept the loss of new fields added in v2 (acceptable for most fields)
- Or wait for v1.5 code that knows how to read v2 state with backward compatibility (preferred)
The GHA workflow’s “State backup” step automatically writes state/radar-state.json.v${SCHEMA_VERSION}.bak before every run, so the most recent backup is at most 24h old.
If no backup exists, the recovery path is npm run state:reset -- --keep-seen-urls (NOT IMPLEMENTED YET — TODO for v2 work).
State corruption (zod validation fails)
The radar’s prep gate runs readState() which zod-validates. On failure it writes a degraded briefing (“🚨 Cold start — state was missing or unparseable”) and exits.
Inspect the bad state:
cd ~/Xcode/agentsfirst
jq . state/radar-state.json | head -100 # surface JSON parse errors
node -e "const z = require('./tools/agentsfirst-radar/node_modules/zod'); const { RadarState } = require('./tools/agentsfirst-radar/dist/state.js'); RadarState.parse(JSON.parse(require('fs').readFileSync('state/radar-state.json', 'utf8')))" 2>&1 | head -20
Most common causes:
- Manual edit introduced a typo (e.g., a
status: 'acceptedd'typo invalidates the RecStatus enum) - A migration was started but didn’t complete (schema_version bumped but field shape didn’t)
- A truncated write (shouldn’t happen with atomic write, but disk-full mid-rename is possible)
Recovery: copy from the most recent .v${N}.bak backup, or re-init from scratch with npm run state:init.
Pause / resume (the kill switch)
Activate the kill switch when the radar produces a bad recommendation:
cd ~/Xcode/agentsfirst
jq '.agent_paused = true | .pause_reason = "<RCA description>"' state/radar-state.json > /tmp/s.json
mv /tmp/s.json state/radar-state.json
git add state/radar-state.json # NOTE: state/radar-state.json is gitignored.
# Pause is enforced via state read at prep gate, but state itself is not committed.
# To make the pause persistent across joshhome restarts that lose the working tree,
# the better path is to set `agent_paused: true` in the workflow_dispatch input
# (and revert via the same UI). Or use the snapshot-weekly.json path which IS committed.
Then the next cron run reads the flag, writes a paused briefing, pings agentsfirst-radar-paused HC slug, and exits clean.
To resume: set agent_paused: false and clear pause_reason. Next cron fires normally.
iMessage delivery debugging
Symptoms: briefing committed (docs/checks/<date>-radar.md exists), HC agentsfirst-radar-daily is green, but no iMessage in Josh’s 1:1 with [email protected].
- Check
state/imsg-last-success.json— what’s the most recent successful send? - If stale: did the iMessage account on joshhome log out? (Open Messages.app, check accounts.)
- Is chat.db locked?
lsof | grep chat.dbfrom joshhome. - Did
~/bin/imsgget deleted? It’s the wrapper; fall-back is osascript via Messages.app. - Did the chat get archived? Resolution depends on the chat being style=45 (1:1) and active.
Manual smoke test from joshhome:
~/bin/imsg send [email protected] "🛰️ Radar smoke test from $(hostname)"
If the manual test works but the radar’s send doesn’t, check RADAR_DRY_RUN isn’t set in the GHA env.
imsg-listener daemon
LaunchAgent on joshhome runs npm run listener from tools/agentsfirst-radar/ as a long-lived process. Restart:
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.capitalthought.agentsfirst-radar-listener.plist
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.capitalthought.agentsfirst-radar-listener.plist
Tail the daemon’s log:
tail -F ~/Library/Logs/agentsfirst-radar-listener.log
Manually replay a parsed mutation (if the daemon dropped it):
cd ~/Xcode/agentsfirst
echo '{"iso":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","verb":"accept","rec_id":"22ea794e","raw_reply":"manual replay","imsg_guid":"manual"}' >> state/mutations.jsonl
# Next cron run will fold this into state.
Daemon’s seen-guids cache (prevents replay after restart):
cat ~/.cache/agentsfirst-radar/seen-imsg-guids.json | jq .
# Wipe to force re-processing of recent replies (rarely needed):
rm ~/.cache/agentsfirst-radar/seen-imsg-guids.json
Multipov cap exceeded mid-run
Symptoms: website-lane recs missing from the briefing, “Website lane: degraded — multipov cap exceeded” in the source-health line.
The cap (MULTIPOV_DAILY_CAP_USD, default $10) resets at UTC midnight. To override for one run:
RADAR_MULTIPOV_CAP_USD=25 npm run radar
To override permanently: set RADAR_MULTIPOV_CAP_USD in the GHA workflow env block.
Cron didn’t fire
- Check GHA Actions tab — is the scheduled workflow run listed?
- If yes but with skip output (“Chicago hour is X, not 07”): that’s the DST guard working as designed. Manually fire via workflow_dispatch with
skip_tz_guard = trueif you need an immediate run. - If no run listed at all: GHA’s scheduled triggers can lag during high-load periods. Wait 30 minutes; if still nothing, fire manually.
- If the run started but never completed: check joshhome runner status (
~/actions-runner-org/svc.sh status).
Source-list churn
The radar requires source-list maintenance (per design Month-1 acceptance criterion: ≥1 added/week, ≥1 retired/replaced/month). Without this, sources decay silently into noise.
Edit tools/agentsfirst-radar/sources.json and commit. Next run picks it up. To preview the impact of a sources change before committing, use --dry-run.
Where things live (path cheat sheet)
| What | Where |
|---|---|
| Design doc | docs/plans/2026-05-15-agentsfirst-radar-design.md |
| Agent rules (Contract First) | tools/agentsfirst-radar/AGENTS.md |
| Source TypeScript | tools/agentsfirst-radar/src/ |
| Listener daemon entry | tools/agentsfirst-radar/bin/imsg-listener.ts |
| Main cron entry | tools/agentsfirst-radar/src/radar.ts |
| GHA workflow | .github/workflows/agentsfirst-radar.yml |
| MCP server (radar_overview tool) | tools/agentsfirst-mcp/src/server.ts |
| State file (gitignored) | state/radar-state.json |
| Mutations log (gitignored) | state/mutations.jsonl |
| Unparsed-reply log (gitignored) | state/imsg-unparsed.jsonl |
| LLM anomaly log (gitignored) | state/llm-anomalies.jsonl |
| Dead-man’s-switch input (committed) | state/last-run.txt |
| Cold-start anchor (committed) | state/snapshot-weekly.json |
| COS skip-dates (committed) | state/cos-skip-dates.json |
| Previous state backups | state/radar-state.json.v${N}.bak (gitignored) |
When to retire the radar
Per design §9 quarter-1 acceptance: if month-3 acceptance rate is <50%, retire the radar and revert to manual /agentsfirst-check. Don’t keep a noisy agent alive.
Retirement procedure:
- Disable the GHA workflow (Settings → Actions → Workflows → Disable).
- Bootout the listener LaunchAgent.
- Set
agent_paused: truewithpause_reason: 'retired-month-3-noise'(so a manualnpm run radaris also a no-op). - Archive
state/to~/icloud/agentsfirst-radar-retired/<date>/(preserve the audit trail). - Remove the package:
rm -rf tools/agentsfirst-radar. Keep the design doc + this RUNBOOK as the post-mortem record.