Nightly Maintenance Architecture

Status: 8 active workers, 4 planned
Last updated: 2026-03-10

Overview

The nightly maintenance pipeline is a set of focused, single-purpose workers that run on a staggered schedule between 00:00 and 05:30 UTC. Each worker answers one question about the codebase, makes one type of change, and produces output reviewable in under a minute.

Goal: Automate the work that would otherwise require dedicated engineers — code review, documentation, dependency management, tech debt reduction, and task execution — without adding headcount.

Pipeline Timeline

UTC   Worker                       Status      Slack Channel      Depends On
────  ─────────────────────────  ──────────  ──────────────     ──────────
00:00 ① Daily Overview           ✅ ACTIVE    #daily-overview    —
02:00 ② Dead Code Cleanup        ✅ ACTIVE    #ai-janitor        —
03:00 ③ Dependency Health (daily) ✅ ACTIVE    #ai-janitor        —
04:00 ④ Kanban Hygiene           ✅ ACTIVE    #ai-janitor        ①
04:30 ⑤ Task Worker              ✅ ACTIVE    #ai-janitor        agent-queue.json
05:00 ⑥ Architecture Review      ✅ ACTIVE    #ai-janitor        —
05:30 ⑦ Planning Worker          ✅ ACTIVE    #ai-janitor        ⑤ (runs after)
Sun 05:00
      ⑧ Dep Health (weekly)      ✅ ACTIVE    #ai-janitor        ③
──── PLANNED ────────────────────────────────────────────────────────────
      ⑨ New Code Reviewer        🔲 PLANNED   #ai-janitor        ①
      ⑩ Boy Scout Scanner        🔲 PLANNED   #ai-janitor        —
      ⑪ Performance Baseline     🔲 PLANNED   #ai-janitor        —
      ⑫ Documentation Generator  🔲 PLANNED   TBD                —

Workers run staggered to avoid CI resource contention. The planning worker (producer) and the task worker (consumer) form a producer-consumer pair connected by agent-queue.json.

Orchestrated Manual Trigger

All workers can be triggered individually via workflow_dispatch. To run the full pipeline manually (e.g. after a big merge), use the orchestrator workflow:

File: .github/workflows/nightly-run-all.yml

Manual trigger → runs all active workers in sequence with proper wait times

See Manual Orchestration section below.

Universal Worker Pattern

Every nightly worker follows the same 3-step architecture:

┌─────────────────────────────────────────────────────┐
│                   GitHub Actions                     │
│                                                      │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────┐ │
│  │  Step 1       │   │  Step 2       │   │  Step 3   │ │
│  │  PRE-SCAN     │──▶│  LLM REVIEW   │──▶│  APPLY    │ │
│  │              │   │              │   │          │ │
│  │ Deterministic │   │ Claude (if    │   │ Create PR │ │
│  │ TypeScript    │   │ candidates    │   │ Post Slack│ │
│  │ Zero LLM cost│   │ exist)        │   │ Update    │ │
│  │              │   │              │   │ memory    │ │
│  └──────────────┘   └──────────────┘   └──────────┘ │
│        │                   │                  │      │
│        ▼                   ▼                  ▼      │
│   candidates.json    decisions.json    PR + Slack    │
└─────────────────────────────────────────────────────┘

Why this pattern?

Zero cost when clean — Pre-scan gates the LLM. No candidates = no Claude API call.
Testable — Deterministic steps can be unit tested without LLM mocking.
Reviewable — Each step produces a JSON artifact that's inspectable.
Bounded cost — LLM step has explicit --max-turns cap.
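
The gating described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the pipeline's actual code: the `Candidate`, `Decision`, `WorkerSteps`, and `runWorker` names are hypothetical, the real candidates.json and decisions.json schemas are not shown in this document, and the steps are synchronous here only to keep the sketch short.

```typescript
// Hypothetical shapes; the real candidates.json / decisions.json schemas may differ.
interface Candidate { id: string; detail: string }
interface Decision { candidateId: string; action: "apply" | "skip" }

interface WorkerSteps {
  preScan: () => Candidate[];                          // Step 1: deterministic, no LLM
  llmReview: (candidates: Candidate[]) => Decision[];  // Step 2: called only when candidates exist
  apply: (decisions: Decision[]) => string;            // Step 3: returns a summary (PR + Slack in the real pipeline)
}

interface RunResult { llmCalled: boolean; summary: string }

function runWorker(steps: WorkerSteps): RunResult {
  const candidates = steps.preScan();
  if (candidates.length === 0) {
    // Zero cost when clean: the LLM step is never reached.
    return { llmCalled: false, summary: "No candidates found; nothing to review." };
  }
  const decisions = steps.llmReview(candidates);
  return { llmCalled: true, summary: steps.apply(decisions) };
}
```

Because `preScan` and `apply` are plain functions, they can be unit tested by passing a stub for `llmReview`, which is the testability property listed above.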

Worker Details

① Daily Overview (00:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	What did we ship today?
Pre-scan	`get-daily-overview-window.ts` computes 24h window, `gh` fetches merged PRs + commits
LLM	Classifies changes by impact, generates HEAD (Slack headline) + COMPACT (detail)
Output	`.kanbn/daily-overview/YYYY-MM-DD-daily-overview.md` → PR → Slack `#daily-overview`
Workflow	`.github/workflows/daily-overview.yml`

② Dead Code Cleanup (02:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	Is there unused code in workspace X?
Pre-scan	Knip static analysis on rotation target (35 workspaces)
LLM	Reviews Knip findings, removes dead code with skip rules
Output	Auto-merge PR + Slack `#ai-janitor`
Memory	`.kanbn/memory/dead-code-rotation.json`
Workflow	`.github/workflows/nightly-dead-code-cleanup.yml`

Rotation: 35 workspaces, one per night. Full rotation = ~5 weeks.
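
A rotation like this can be driven by a single persisted index. The sketch below is an assumption about how such a rotation might work, not the contents of dead-code-rotation.json: `RotationState`, `nextIndex`, and `pickRotationTarget` are hypothetical names.

```typescript
// Hypothetical state shape; the real dead-code-rotation.json schema is not shown in this document.
interface RotationState { nextIndex: number }

function pickRotationTarget(
  targets: string[],
  state: RotationState,
): { target: string; next: RotationState } {
  if (targets.length === 0) throw new Error("rotation needs at least one target");
  // The modulo keeps the index valid even if the target list shrank since the last run.
  const index = state.nextIndex % targets.length;
  return { target: targets[index], next: { nextIndex: (index + 1) % targets.length } };
}

// One target per night, so a full pass takes targets.length nights.
function weeksPerFullRotation(targetCount: number): number {
  return Math.ceil(targetCount / 7);
}
```

With 35 workspaces, `weeksPerFullRotation(35)` gives 5, matching the figure above.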

③ Dependency Health — Daily (03:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	Are there safe bumps to apply? Any NEW vulnerabilities?
Pre-scan	`pnpm audit` + `pnpm outdated` + diff against known vulns registry
LLM	Haiku (~10 turns): apply bumps, handle failures, compose Slack summary, flag new critical
Output	Auto-merge PR for safe bumps, human-friendly Slack summary
Memory	`.kanbn/memory/dependency-health.json` (known vulns + daily log)
Workflow	`.github/workflows/nightly-dependency-health.yml`
Prompt	`.github/prompts/nightly-dependency-health.md`

Daily design: The collect step diffs today's vulns against knownVulnerabilities in memory. Only NEW vulns get flagged. Known transitive vulns (e.g., 35 minimatch/tar/rollup issues waiting on upstream) are not re-investigated. The daily LLM uses Haiku (cheap, fast) — its job is mechanical: apply bumps, revert failures with detailed error logging, compose a Slack summary. No full report written to the repo.
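
The diff against known vulnerabilities is a set difference keyed on an advisory identifier. The following is a minimal sketch under stated assumptions: the `Advisory` shape, its `id` field, and both function names are hypothetical, and real `pnpm audit` output and the memory schema will differ in detail.

```typescript
// Hypothetical advisory shape for illustration only.
interface Advisory {
  id: string;
  packageName: string;
  severity: "low" | "moderate" | "high" | "critical";
}

// Returns only advisories whose id is not already recorded in memory.
function findNewVulnerabilities(today: Advisory[], knownIds: string[]): Advisory[] {
  const known = new Set(knownIds);
  return today.filter((advisory) => !known.has(advisory.id));
}

// True when at least one of the new advisories is critical and should be flagged.
function hasNewCritical(newVulns: Advisory[]): boolean {
  return newVulns.some((advisory) => advisory.severity === "critical");
}
```

When `findNewVulnerabilities` returns an empty list, known transitive issues produce no extra investigation work for the LLM step.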

Auto-merge: Safe-only PRs (lockfile + memory) get the auto-merge label. The dependency-health-safe pattern in check-auto-merge-eligibility.ts enables this.

⑧ Dependency Health — Weekly Digest (Sunday 05:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	What happened with dependencies this week? What should humans act on?
Pre-scan	Fresh `pnpm audit` + `pnpm outdated` + week's daily log from memory
LLM	Sonnet (~50 turns): consolidate week, write report, produce actionable items
Output	`docs/reports/dependency-health/YYYY-WXX.md` + PR + Slack digest
Workflow	`.github/workflows/weekly-dependency-health.yml`
Prompt	`.github/prompts/weekly-dependency-health.md`

Weekly design: ONE report per week (not per day). Consolidates all daily bump results, failures, new/resolved vulns into a single human-readable digest. The "You should review" section is deduplicated — if "upgrade posthog-js" appeared 5 times in daily logs, it appears once. Cleans up daily log entries older than 14 days and deletes old daily report files.
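
The deduplication and the 14-day cleanup can be sketched as two small pure functions. The `DailyLogEntry` shape and its `reviewItems` field are assumptions made for illustration; the real dailyLog[] schema in dependency-health.json is not shown in this document.

```typescript
// Hypothetical log entry; `date` is an ISO date string (YYYY-MM-DD).
interface DailyLogEntry { date: string; reviewItems: string[] }

// Collapses repeated review items across the week, keeping first-seen order.
function dedupeReviewItems(entries: DailyLogEntry[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const entry of entries) {
    for (const item of entry.reviewItems) {
      if (!seen.has(item)) {
        seen.add(item);
        result.push(item);
      }
    }
  }
  return result;
}

// Keeps entries that are at most maxAgeDays old relative to `today`.
function pruneOldEntries(entries: DailyLogEntry[], today: string, maxAgeDays = 14): DailyLogEntry[] {
  const cutoff = Date.parse(today) - maxAgeDays * 24 * 60 * 60 * 1000;
  return entries.filter((entry) => Date.parse(entry.date) >= cutoff);
}
```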

Daily (Mon-Sun 3AM)              Weekly (Sunday 5AM)
├── Collect + diff known vulns   ├── Fresh collect
├── Haiku: apply bumps           ├── Sonnet: consolidate week
├── Log to dailyLog[]            ├── Write YYYY-WXX.md report
├── Auto-merge safe bumps        ├── "You should review" digest
├── Slack one-liner              ├── Clean up old daily data
└── NO report file               └── Detailed Slack digest

④ Kanban Hygiene (04:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	Is the kanban board accurate?
Pre-scan	`nightly-pv-collect.ts` — cross-reference daily overview with task statuses
LLM	Review candidates — approve/reject status changes and duplicates
Output	Single-purpose PR with clear commit messages per change type
Workflow	`.github/workflows/nightly-kanban-hygiene.yml`

Focused on 4 checks:

Status sync (task status matches merged PR state)
Staleness detection (>60 days without activity in Backlog)
Duplicate flagging (shared impactedApps + overlapping title keywords)
Archive Done tasks (move to .kanbn/archived-tasks/)
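
Two of these checks, staleness and duplicate flagging, are simple enough to sketch deterministically. Everything below is illustrative: only `impactedApps` is named in this document, so the rest of the `KanbanTask` shape, the keyword rule (lowercase words of four or more characters), and the threshold of two shared keywords are assumptions.

```typescript
// Hypothetical task shape; dates are ISO strings (YYYY-MM-DD).
interface KanbanTask {
  id: string;
  title: string;
  impactedApps: string[];
  column: string;
  lastActivity: string;
}

const STALE_AFTER_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Stale = sitting in Backlog with more than 60 days since the last activity.
function isStale(task: KanbanTask, today: string): boolean {
  const ageDays = (Date.parse(today) - Date.parse(task.lastActivity)) / MS_PER_DAY;
  return task.column === "Backlog" && ageDays > STALE_AFTER_DAYS;
}

// Assumption: a keyword is a lowercase alphanumeric word of 4+ characters.
function titleKeywords(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter((word) => word.length >= 4));
}

// Duplicate candidate = at least one shared app AND enough overlapping title keywords.
function looksLikeDuplicate(a: KanbanTask, b: KanbanTask, minSharedKeywords = 2): boolean {
  const sharesApp = a.impactedApps.some((app) => b.impactedApps.includes(app));
  if (!sharesApp) return false;
  const otherWords = titleKeywords(b.title);
  let shared = 0;
  titleKeywords(a.title).forEach((word) => {
    if (otherWords.has(word)) shared++;
  });
  return shared >= minSharedKeywords;
}
```

A heuristic like this only produces candidates; in the pattern above, the LLM step would still approve or reject each flagged pair.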

⑤ Task Worker (04:30 UTC) — ✅ ACTIVE

Aspect	Detail
Question	What approved task should the agent implement tonight?
Pre-scan	`nightly-task-worker-collect.ts` — reads agent queue, picks first approved task (FIFO)
LLM	Claude Opus implements the task (~80 turns, 60 min timeout)
Output	Implementation PR + Slack `#ai-janitor` with human review instructions
Memory	`.kanbn/memory/task-worker-history.json` (7-day cooldown)
Memory	`.kanbn/memory/agent-queue.json` (removes completed task)
Workflow	`.github/workflows/nightly-task-worker.yml`
Prompt	`.github/prompts/nightly-task-worker.md`

Selection: Strict queue-only. Each run reads agent-queue.json and takes "approved" entries in FIFO order; "proposed" entries are never picked up as a fallback. If there are no approved tasks, the worker reports to Slack with clear instructions for the human.

History persistence: After execution, creates an auto-merge PR to push task-worker-history.json + agent-queue.json updates to main. This ensures the 7-day cooldown works across nightly runs.
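
A minimal sketch of queue-only selection combined with the cooldown follows. It rests on an assumption this document does not confirm: that the 7-day cooldown means a task attempted within the last 7 days is skipped in favor of the next approved entry. `HistoryRecord`, `attemptedAt`, and `selectNextTask` are hypothetical names.

```typescript
interface QueueEntry { taskId: string; reason: string; status: "approved" | "proposed" }

// Hypothetical history record; the real task-worker-history.json schema is not shown here.
interface HistoryRecord { taskId: string; attemptedAt: string } // ISO date (YYYY-MM-DD)

const COOLDOWN_DAYS = 7;

function selectNextTask(
  queueEntries: QueueEntry[],
  pastRuns: HistoryRecord[],
  today: string,
): QueueEntry | null {
  const cutoff = Date.parse(today) - COOLDOWN_DAYS * 24 * 60 * 60 * 1000;
  // Assumption: any task attempted after the cutoff is still cooling down.
  const coolingDown = new Set(
    pastRuns.filter((run) => Date.parse(run.attemptedAt) > cutoff).map((run) => run.taskId),
  );
  // FIFO: first approved entry that is not cooling down. Proposed entries are never a fallback.
  const next = queueEntries.find(
    (entry) => entry.status === "approved" && !coolingDown.has(entry.taskId),
  );
  return next ?? null;
}
```

A `null` result corresponds to the "no approved tasks" case, where the worker posts guidance to Slack instead of implementing anything.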

Slack report always includes:

What was done (task ID, changed files)
What was validated (lint, typecheck)
What the human should do next (review PR, check acceptance criteria)

agent-queue.json ──▶ collect (FIFO) ──▶ Claude Opus implements
                                              │
                              ┌────────────────┼───────────────┐
                              ▼                ▼               ▼
                         Code changes     No changes       Failed
                              │                │               │
                              ▼                ▼               ▼
                         Create PR        Record in       Record in
                         + labels         history         history
                              │                │               │
                              └────────┬───────┘               │
                                       ▼                       │
                                Auto-merge PR               Slack report
                                for memory update           with guidance
                                       │
                                       ▼
                                Slack report with
                                human review steps

⑥ Architecture Review (05:00 UTC) — ✅ ACTIVE

Aspect	Detail
Question	Do our architecture docs match the actual code?
Pre-scan	`nightly-architecture-review-collect.ts` — rotation through 22 targets
LLM	Reads code and docs, verifies alignment, updates with Mermaid diagrams
Output	PR with doc updates + kanban tasks for improvements
Memory	`.kanbn/memory/architecture-review-rotation.json`
Workflow	`.github/workflows/nightly-architecture-review.yml`

Rotation: 22 targets (11 apps + 11 packages), one per night. Full rotation ~3 weeks.

⑦ Planning Worker (05:30 UTC) — ✅ ACTIVE

Aspect	Detail
Question	What should the task worker work on next?
Pre-scan	`nightly-planning-worker-collect.ts` — gathers agent-eligible tasks, verifies file references against codebase, checks open PRs, loads current queue
LLM	Claude Sonnet curates queue: proposes verified tasks, flags stale tasks, blocks tasks needing human judgment
Output	PR with updated `agent-queue.json` for human review
Workflow	`.github/workflows/nightly-planning-worker.yml`
Prompt	`.github/prompts/nightly-planning-worker.md`

Producer-consumer pattern with verification:

Planning Worker (05:30)              Human Review                Task Worker (04:30 next day)
├── Verify task refs vs codebase     ├── /approve-tasks skill    ├── Read agent-queue.json
├── Evaluate agent-eligible tasks    │   or review PR directly   ├── 3 parallel slots (matrix)
├── Flag stale/outdated tasks        ├── Approve/reject/reorder  ├── Each picks next approved (FIFO)
├── Propose execution order          ├── Handle flagged tasks    ├── Implement with Claude
├── Block tasks needing human        │   (close/demote/update)   ├── Create implementation PRs
└── Create PR with queue update      └── Merge PR to main        └── Auto-merge memory updates

Queue lifecycle:

Planning worker verifies task references against codebase (files exist? recently changed?)
Verified tasks proposed (status: "proposed"), stale tasks moved to "flagged" list
Human uses /approve-tasks skill to review: approve, reject, reorder, or handle flagged tasks
Flagged tasks: human decides — close, demote to user request registry (.kanbn/user-requests/*.md), or update description
Human merges PR to main
Next night, task worker runs 3 parallel slots — each picks the next approved task (slot 0 = 1st, slot 1 = 2nd, slot 2 = 3rd)
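
The slot rule in the last step can be sketched as an index into the approved entries. This is an illustration only: `SlotQueueEntry` and `pickForSlot` are hypothetical names, and the sketch assumes every slot reads the same snapshot of agent-queue.json.

```typescript
interface SlotQueueEntry { taskId: string; status: string }

// Slot 0 takes the 1st approved entry, slot 1 the 2nd, slot 2 the 3rd.
// Returns null when there are fewer approved entries than the slot needs.
function pickForSlot(entries: SlotQueueEntry[], slot: number): SlotQueueEntry | null {
  const approved = entries.filter((entry) => entry.status === "approved");
  return approved[slot] ?? null;
}
```

Deriving the task from the slot index, rather than having each slot claim and remove an entry, means parallel slots cannot pick the same task as long as they see the same queue file.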

Checks existing PRs: Tasks with open nightly-task/ PRs are reported as "ongoing" and not re-queued, preventing the worker from proposing tasks that are already being worked on.
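
One way to implement this check is to compare task IDs against the head branches of open PRs. The sketch assumes branches are named `nightly-task/<taskId>`; this document only gives the `nightly-task/` prefix, so the exact naming and the `splitOngoing` helper are hypothetical.

```typescript
// Assumption: the branch name is the prefix followed directly by the task ID.
const BRANCH_PREFIX = "nightly-task/";

function splitOngoing(
  taskIds: string[],
  openPrBranches: string[],
): { ongoing: string[]; eligible: string[] } {
  const inFlight = new Set(
    openPrBranches
      .filter((branch) => branch.startsWith(BRANCH_PREFIX))
      .map((branch) => branch.slice(BRANCH_PREFIX.length)),
  );
  return {
    ongoing: taskIds.filter((id) => inFlight.has(id)),    // reported as "ongoing", not re-queued
    eligible: taskIds.filter((id) => !inFlight.has(id)),  // may be proposed for the queue
  };
}
```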

Task Worker + Planning Worker: Agent Queue

The agent queue (agent-queue.json) is the coordination point between the planning worker and the task worker.

{
  "queue": [
    { "taskId": "perf-investigate-hello-start", "reason": "...", "status": "approved" },
    { "taskId": "bvf-font-loading-optimization", "reason": "...", "status": "proposed" }
  ],
  "blocked": [
    { "taskId": "convex-monorepo-setup", "reason": "Needs human decisions on schema" }
  ],
  "flagged": [
    { "taskId": "stale-task", "reason": "3/4 files deleted", "recommendation": "close",
      "verificationFindings": ["file X no longer exists"], "flaggedAt": "2026-03-12" }
  ],
  "completed": [
    { "taskId": "some-done-task", "completedAt": "2026-03-09", "outcome": "pr_created:1234" }
  ]
}
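
The structure above maps directly onto TypeScript types. The interfaces below mirror the fields in the example; the `markCompleted` helper is a hypothetical sketch of how the task worker might move a finished task from `queue` to `completed`, not the actual apply script.

```typescript
interface QueuedTask { taskId: string; reason: string; status: "approved" | "proposed" }
interface BlockedTask { taskId: string; reason: string }
interface FlaggedTask {
  taskId: string;
  reason: string;
  recommendation: string;
  verificationFindings: string[];
  flaggedAt: string;
}
interface CompletedTask { taskId: string; completedAt: string; outcome: string }

interface AgentQueue {
  queue: QueuedTask[];
  blocked: BlockedTask[];
  flagged: FlaggedTask[];
  completed: CompletedTask[];
}

// Returns a new state with the task removed from `queue` and appended to `completed`.
// An unknown taskId leaves the state unchanged.
function markCompleted(
  state: AgentQueue,
  taskId: string,
  completedAt: string,
  outcome: string,
): AgentQueue {
  if (!state.queue.some((task) => task.taskId === taskId)) return state;
  return {
    ...state,
    queue: state.queue.filter((task) => task.taskId !== taskId),
    completed: [...state.completed, { taskId, completedAt, outcome }],
  };
}
```

Returning a new object instead of mutating in place keeps the function easy to test and makes the resulting diff to agent-queue.json predictable.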

Why not priority-based selection? The old approach sorted by P0→P3 + effort, which caused a P0 task needing human judgment (like convex-monorepo-setup) to block all other work indefinitely. The queue gives humans explicit control over execution order while letting the LLM propose sensible defaults.

Manual Orchestration

Run individual workers

Each worker has workflow_dispatch enabled:

# Run a specific worker
gh workflow run nightly-task-worker.yml
gh workflow run nightly-planning-worker.yml
gh workflow run nightly-kanban-hygiene.yml
gh workflow run nightly-dead-code-cleanup.yml
gh workflow run nightly-dependency-health.yml
gh workflow run weekly-dependency-health.yml
gh workflow run nightly-architecture-review.yml
gh workflow run daily-overview.yml

# Task worker with specific task override
gh workflow run nightly-task-worker.yml -f task_id=perf-investigate-hello-start

Run entire pipeline

Use the orchestrator workflow to run all workers in sequence:

# Run the full nightly pipeline now
gh workflow run nightly-run-all.yml

# Run with specific workers only
gh workflow run nightly-run-all.yml -f workers="task-worker,planning-worker"

Scaling Considerations: GitHub Actions vs. Alternatives

Current approach: GitHub Actions

Pros:

Zero infrastructure to maintain
Built-in secrets management
Native gh CLI access
Cron scheduling
Workflow dispatch for manual triggers
Concurrency controls
Each worker is independently deployable

Cons as it scales:

8 scheduled workflows = 8 independent cron schedules with no coordination
No shared state between runs (each checkout is fresh)
Hard to express "run B after A completes" across workflows
Billing: 4vCPU runners for 60-min task worker runs add up
Log visibility: must click into each workflow run separately

When to consider alternatives

Signal	What it means
>10 scheduled workflows	Cron management becomes unwieldy
Workers need each other's output	GitHub Actions can't pass artifacts between workflows easily
Cost exceeds ~$200/month	Self-hosted runners or alternative CI would be cheaper
Need real-time queue processing	Cron is too coarse — need event-driven triggers
Need human-in-the-loop during execution	GitHub Actions has no interactive approval mid-run

Possible migration paths

Short term (now → 15 workflows): Stay on GitHub Actions. Add orchestrator workflow for manual full-pipeline runs. Use workflow_run triggers for chaining.
Medium term (15+ workflows): Extract to a lightweight orchestrator (e.g. Temporal, Inngest, or a simple Express app on Cloud Run) that:
- Manages the schedule centrally
- Passes artifacts between steps natively
- Provides a dashboard for all worker runs
- Still triggers Claude via claude-code-action or direct API
Long term: If workers become event-driven (e.g. "run task worker when a queue entry is approved"), consider a message queue (Cloud Tasks, PubSub) + Cloud Run workers.

Recommendation: Stay on GitHub Actions for now. The current count of 8 scheduled workflows is manageable. Add the orchestrator workflow for manual runs. Revisit when you hit 12+ scheduled workflows or need inter-workflow artifact passing.

Memory & State

Nightly workers persist state in .kanbn/memory/:

File	Purpose	Updated by
`dead-code-rotation.json`	Rotation index, scan history	② Dead Code Cleanup
`dependency-health.json`	Known vulns registry, daily log, trend history	③ Daily + ⑧ Weekly
`board-health.json`	Board metrics, recurring flags	④ Kanban Hygiene
`task-worker-history.json`	Execution history, 7-day cooldown	⑤ Task Worker
`agent-queue.json`	Curated task execution queue + flagged list	⑤ Task Worker + ⑦ Planning Worker
`architecture-review-rotation.json`	Rotation index, review history	⑥ Architecture Review

User request demand signals live in .kanbn/user-requests/*.md (markdown, not JSON in memory/). See CLAUDE-extended § "User Request Registry".

Protected main: Memory updates are pushed to main via auto-merge PRs (not direct commits). The kanban-memory pattern in check-auto-merge-eligibility.ts enables this.
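
Auto-merge gating of this kind usually comes down to an allowlist over the changed files of a PR. The sketch below is an assumption about the general shape, not the contents of check-auto-merge-eligibility.ts: the root `pnpm-lock.yaml` path, the regular expressions, and the `isAutoMergeEligible` name are illustrative.

```typescript
// Assumed allowlist: the root lockfile plus JSON files directly under .kanbn/memory/.
const SAFE_PATTERNS: RegExp[] = [
  /^pnpm-lock\.yaml$/,
  /^\.kanbn\/memory\/[^/]+\.json$/,
];

// Eligible only if the PR changes at least one file and every changed file is allowlisted.
function isAutoMergeEligible(changedFiles: string[]): boolean {
  return (
    changedFiles.length > 0 &&
    changedFiles.every((file) => SAFE_PATTERNS.some((pattern) => pattern.test(file)))
  );
}
```

Requiring every file to match, rather than any file, is what keeps a PR that also touches source code out of the auto-merge path.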

File Index

.github/
├── workflows/
│   ├── daily-overview.yml                 ① trigger
│   ├── daily-overview-post.yml            ① Slack relay
│   ├── nightly-dead-code-cleanup.yml      ② trigger
│   ├── nightly-dependency-health.yml      ③ daily trigger
│   ├── weekly-dependency-health.yml       ⑧ weekly trigger
│   ├── nightly-kanban-hygiene.yml         ④ trigger
│   ├── nightly-task-worker.yml            ⑤ trigger
│   ├── nightly-architecture-review.yml    ⑥ trigger
│   ├── nightly-planning-worker.yml        ⑦ trigger
│   ├── nightly-run-all.yml               orchestrator (manual)
│   └── auto-merge-eligible.yml            auto-merge gating
├── prompts/
│   ├── daily-overview.md                  ① prompt
│   ├── nightly-dead-code-cleanup.md       ② prompt
│   ├── nightly-dependency-health.md       ③ daily prompt
│   ├── weekly-dependency-health.md        ⑧ weekly prompt
│   ├── nightly-kanban-hygiene.md          ④ prompt
│   ├── nightly-task-worker.md             ⑤ prompt
│   ├── nightly-architecture-review.md     ⑥ prompt
│   └── nightly-planning-worker.md         ⑦ prompt

packages/ci-scripts/src/
├── get-daily-overview-window.ts                 ① helper
├── find-daily-overview-file.ts                  ① helper
├── post-daily-overview-to-slack.ts              ① Slack
├── post-dead-code-cleanup-to-slack.ts           ② Slack
├── nightly-dependency-health-collect.ts         ③⑧ Step 1
├── post-dependency-health-to-slack.ts           ③⑧ Slack
├── nightly-pv-collect.ts                        ④ Step 1
├── nightly-pv-apply.ts                          ④ Step 3
├── post-product-verification-to-slack.ts        ④ Slack
├── nightly-task-worker-collect.ts               ⑤ Step 1
├── nightly-task-worker-apply.ts                 ⑤ Step 3
├── post-task-worker-to-slack.ts                 ⑤ Slack
├── nightly-architecture-review-collect.ts       ⑥ Step 1
├── post-architecture-review-to-slack.ts         ⑥ Slack
├── nightly-planning-worker-collect.ts           ⑦ Step 1
├── nightly-planning-worker-apply.ts             ⑦ Step 3
├── post-planning-worker-to-slack.ts             ⑦ Slack
└── check-auto-merge-eligibility.ts              auto-merge patterns

.kanbn/memory/
├── dead-code-rotation.json              ② state
├── dependency-health.json               ③⑧ state (known vulns + daily log + history)
├── board-health.json                    ④ state
├── task-worker-history.json             ⑤ state
├── agent-queue.json                     ⑤⑦ shared state (queue + flagged)
└── architecture-review-rotation.json    ⑥ state

.kanbn/user-requests/
└── *.md                                 demand signals (markdown, /approve-tasks skill)