docs/architecture/nightly-maintenance-architecture.md

Nightly Maintenance Architecture

Status: 8 active workers, 4 planned
Last updated: 2026-03-10

Overview

The nightly maintenance pipeline is a set of focused, single-purpose workers that run sequentially after hours (UTC). Each worker answers one question about the codebase, makes one type of change, and produces output reviewable in under a minute.

Goal: Automate the work that would otherwise require dedicated engineers — code review, documentation, dependency management, tech debt reduction, and task execution — without adding headcount.

Pipeline Timeline

UTC   Worker                       Status      Slack Channel      Depends On
────  ─────────────────────────  ──────────  ──────────────     ──────────
00:00 ① Daily Overview           ✅ ACTIVE    #daily-overview    —
02:00 ② Dead Code Cleanup        ✅ ACTIVE    #ai-janitor        —
03:00 ③ Dependency Health (daily) ✅ ACTIVE    #ai-janitor        —
04:00 ④ Kanban Hygiene           ✅ ACTIVE    #ai-janitor        ①
04:30 ⑤ Task Worker              ✅ ACTIVE    #ai-janitor        agent-queue.json
05:00 ⑥ Architecture Review      ✅ ACTIVE    #ai-janitor        —
05:30 ⑦ Planning Worker          ✅ ACTIVE    #ai-janitor        ⑤ (runs after)
Sun 05:00
      ⑧ Dep Health (weekly)      ✅ ACTIVE    #ai-janitor        ③
──── PLANNED ────────────────────────────────────────────────────────────
      ⑨ New Code Reviewer        🔲 PLANNED   #ai-janitor        ①
      ⑩ Boy Scout Scanner        🔲 PLANNED   #ai-janitor        —
      ⑪ Performance Baseline     🔲 PLANNED   #ai-janitor        —
      ⑫ Documentation Generator  🔲 PLANNED   TBD                —

Workers run staggered to avoid CI resource contention. The task worker and planning worker form a producer-consumer pair connected by agent-queue.json.

Orchestrated Manual Trigger

All workers can be triggered individually via workflow_dispatch. To run the full pipeline manually (e.g. after a big merge), use the orchestrator workflow:

File: .github/workflows/nightly-run-all.yml

Manual trigger → runs all active workers in sequence with proper wait times

See the Manual Orchestration section below.

Universal Worker Pattern

Every nightly worker follows the same 3-step architecture:

┌─────────────────────────────────────────────────────┐
│                   GitHub Actions                     │
│                                                      │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────┐ │
│  │  Step 1       │   │  Step 2       │   │  Step 3   │ │
│  │  PRE-SCAN     │──▶│  LLM REVIEW   │──▶│  APPLY    │ │
│  │              │   │              │   │          │ │
│  │ Deterministic │   │ Claude (if    │   │ Create PR │ │
│  │ TypeScript    │   │ candidates    │   │ Post Slack│ │
│  │ Zero LLM cost│   │ exist)        │   │ Update    │ │
│  │              │   │              │   │ memory    │ │
│  └──────────────┘   └──────────────┘   └──────────┘ │
│        │                   │                  │      │
│        ▼                   ▼                  ▼      │
│   candidates.json    decisions.json    PR + Slack    │
└─────────────────────────────────────────────────────┘

Why this pattern?

  1. Zero cost when clean — Pre-scan gates the LLM. No candidates = no Claude API call.
  2. Testable — Deterministic steps can be unit tested without LLM mocking.
  3. Reviewable — Each step produces a JSON artifact that's inspectable.
  4. Bounded cost — LLM step has explicit --max-turns cap.
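
The gating behaviour behind points 1 and 2 can be sketched in TypeScript. This is a minimal illustration, not the repository's code: `WorkerSteps`, `runWorker`, and the `Candidate`/`Decision` shapes are hypothetical names, and the real candidates.json and decisions.json schemas may differ.

```typescript
// Hypothetical shapes; the real candidates.json / decisions.json schemas may differ.
interface Candidate { id: string; detail: string }
interface Decision { candidateId: string; approved: boolean }

interface WorkerSteps {
  preScan: () => Candidate[];                          // Step 1: deterministic, no LLM cost
  llmReview: (candidates: Candidate[]) => Decision[];  // Step 2: only runs when candidates exist
  apply: (decisions: Decision[]) => string;            // Step 3: e.g. create PR, post to Slack
}

// Runs the three steps in order and skips the LLM entirely when the pre-scan finds nothing.
function runWorker(steps: WorkerSteps): { llmCalled: boolean; result: string } {
  const candidates = steps.preScan();
  if (candidates.length === 0) {
    return { llmCalled: false, result: "clean: nothing to review" };
  }
  const decisions = steps.llmReview(candidates);
  const approved = decisions.filter((d) => d.approved);
  return { llmCalled: true, result: steps.apply(approved) };
}
```

Because each step is injected as a plain function, the deterministic parts can be unit tested with stub callbacks and no LLM mocking.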

Worker Details

① Daily Overview (00:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | What did we ship today? |
| Pre-scan | get-daily-overview-window.ts computes 24h window, gh fetches merged PRs + commits |
| LLM | Classifies changes by impact, generates HEAD (Slack headline) + COMPACT (detail) |
| Output | .kanbn/daily-overview/YYYY-MM-DD-daily-overview.md → PR → Slack #daily-overview |
| Workflow | .github/workflows/daily-overview.yml |
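
As an illustration of the pre-scan's window computation, here is a minimal sketch. It assumes the window is simply the 24 hours ending at the run time; `dailyWindow` and `TimeWindow` are hypothetical names, and the actual get-daily-overview-window.ts may anchor the window differently.

```typescript
// Hypothetical result shape: ISO-8601 UTC timestamps bounding the window.
interface TimeWindow { since: string; until: string }

// Returns the 24 hours ending at `now`.
function dailyWindow(now: Date): TimeWindow {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return {
    since: new Date(now.getTime() - DAY_MS).toISOString(),
    until: now.toISOString(),
  };
}
```

For a run at 2026-03-10T00:00:00Z this yields since = 2026-03-09T00:00:00.000Z and until = 2026-03-10T00:00:00.000Z, which can then bound the merged-PR and commit queries.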

② Dead Code Cleanup (02:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | Is there unused code in workspace X? |
| Pre-scan | Knip static analysis on rotation target (35 workspaces) |
| LLM | Reviews Knip findings, removes dead code with skip rules |
| Output | Auto-merge PR + Slack #ai-janitor |
| Memory | .kanbn/memory/dead-code-rotation.json |
| Workflow | .github/workflows/nightly-dead-code-cleanup.yml |

Rotation: 35 workspaces, one per night. Full rotation = ~5 weeks.
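
A one-per-night rotation like this can be sketched as a wrapping index. The `RotationState` shape and `advanceRotation` name are assumptions for illustration; the real dead-code-rotation.json also stores scan history and may use different field names.

```typescript
// Hypothetical memory shape; the real rotation file may store more fields.
interface RotationState { nextIndex: number }

// Picks tonight's target and returns the state to persist for the next run.
function advanceRotation<T>(targets: T[], state: RotationState): { target: T; next: RotationState } {
  if (targets.length === 0) throw new Error("rotation needs at least one target");
  const index = state.nextIndex % targets.length; // wrap around after the last target
  return { target: targets[index], next: { nextIndex: (index + 1) % targets.length } };
}
```

With 35 targets the index returns to 0 after 35 nightly runs, which is the 5-week full rotation mentioned above.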

③ Dependency Health — Daily (03:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | Are there safe bumps to apply? Any NEW vulnerabilities? |
| Pre-scan | pnpm audit + pnpm outdated + diff against known vulns registry |
| LLM | Haiku (~10 turns): apply bumps, handle failures, compose Slack summary, flag new critical |
| Output | Auto-merge PR for safe bumps, human-friendly Slack summary |
| Memory | .kanbn/memory/dependency-health.json (known vulns + daily log) |
| Workflow | .github/workflows/nightly-dependency-health.yml |
| Prompt | .github/prompts/nightly-dependency-health.md |

Daily design: The collect step diffs today's vulns against knownVulnerabilities in memory. Only NEW vulns get flagged. Known transitive vulns (e.g., 35 minimatch/tar/rollup issues waiting on upstream) are not re-investigated. The daily LLM uses Haiku (cheap, fast) — its job is mechanical: apply bumps, revert failures with detailed error logging, compose a Slack summary. No full report written to the repo.
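
The "only NEW vulns get flagged" rule is essentially a set difference. The sketch below assumes each advisory carries a stable `id`; the `Vulnerability` shape and `findNewVulnerabilities` name are illustrative and not taken from the collect script.

```typescript
// Hypothetical advisory shape; field names are illustrative.
interface Vulnerability {
  id: string;
  pkg: string;
  severity: "low" | "moderate" | "high" | "critical";
}

// Returns only the advisories that are not yet in the known registry.
function findNewVulnerabilities(today: Vulnerability[], known: Vulnerability[]): Vulnerability[] {
  const knownIds = new Set(known.map((v) => v.id));
  return today.filter((v) => !knownIds.has(v.id));
}
```

An empty result means the LLM has nothing new to flag, so known transitive issues are not re-investigated.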

Auto-merge: Safe-only PRs (lockfile + memory) get the auto-merge label. The dependency-health-safe pattern in check-auto-merge-eligibility.ts enables this.
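
A safe-only check of this kind could look like the sketch below. The allow-list is an assumption based on the "lockfile + memory" description; the actual dependency-health-safe rules live in check-auto-merge-eligibility.ts and may be broader or stricter.

```typescript
// Assumed allow-list for a "safe-only" PR; the real pattern may differ.
const SAFE_PATHS = ["pnpm-lock.yaml", ".kanbn/memory/dependency-health.json"];

// A PR qualifies only if it changes at least one file and every changed file is allow-listed.
function isSafeOnlyPr(changedFiles: string[]): boolean {
  return changedFiles.length > 0 && changedFiles.every((f) => SAFE_PATHS.includes(f));
}
```

Requiring every changed file to match keeps a PR that also touches application code out of the auto-merge path.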

⑧ Dependency Health — Weekly Digest (Sunday 05:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | What happened with dependencies this week? What should humans act on? |
| Pre-scan | Fresh pnpm audit + pnpm outdated + week's daily log from memory |
| LLM | Sonnet (~50 turns): consolidate week, write report, produce actionable items |
| Output | docs/reports/dependency-health/YYYY-WXX.md + PR + Slack digest |
| Workflow | .github/workflows/weekly-dependency-health.yml |
| Prompt | .github/prompts/weekly-dependency-health.md |

Weekly design: ONE report per week (not per day). The digest consolidates all daily bump results, failures, and new/resolved vulns into a single human-readable report. The "You should review" section is deduplicated: if "upgrade posthog-js" appeared 5 times in the daily logs, it appears once. The weekly run also cleans up daily log entries older than 14 days and deletes old daily report files.

Daily (Mon-Sun 3AM)              Weekly (Sunday 5AM)
├── Collect + diff known vulns   ├── Fresh collect
├── Haiku: apply bumps           ├── Sonnet: consolidate week
├── Log to dailyLog[]            ├── Write YYYY-WXX.md report
├── Auto-merge safe bumps        ├── "You should review" digest
├── Slack one-liner              ├── Clean up old daily data
└── NO report file               └── Detailed Slack digest
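
The weekly deduplication and the 14-day cleanup can be sketched as two small pure functions. The `DailyLogEntry` shape (a date plus a list of review items) is assumed for illustration; the real dailyLog[] entries in dependency-health.json may be structured differently.

```typescript
// Assumed log entry shape; date is YYYY-MM-DD (UTC).
interface DailyLogEntry { date: string; reviewItems: string[] }

// Collapses repeated review items across the week, keeping first-seen order.
function dedupeReviewItems(entries: DailyLogEntry[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const entry of entries) {
    for (const item of entry.reviewItems) {
      if (!seen.has(item)) {
        seen.add(item);
        unique.push(item);
      }
    }
  }
  return unique;
}

// Drops log entries older than `maxAgeDays` relative to `today` (YYYY-MM-DD, UTC).
function pruneDailyLog(entries: DailyLogEntry[], today: string, maxAgeDays = 14): DailyLogEntry[] {
  const cutoff = Date.parse(today) - maxAgeDays * 24 * 60 * 60 * 1000;
  return entries.filter((e) => Date.parse(e.date) >= cutoff);
}
```

In this sketch an entry exactly 14 days old is kept; only strictly older entries are removed.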

④ Kanban Hygiene (04:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | Is the kanban board accurate? |
| Pre-scan | nightly-pv-collect.ts: cross-reference daily overview with task statuses |
| LLM | Review candidates: approve/reject status changes and duplicates |
| Output | Single-purpose PR with clear commit messages per change type |
| Workflow | .github/workflows/nightly-kanban-hygiene.yml |

Focused on 4 checks:

  1. Status sync (task status matches merged PR state)
  2. Staleness detection (>60 days without activity in Backlog)
  3. Duplicate flagging (shared impactedApps + overlapping title keywords)
  4. Archive Done tasks (move to .kanbn/archived-tasks/)
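
Checks 2 and 3 lend themselves to deterministic pre-scan code. The sketch below is one possible reading: the `BoardTask` shape, the keyword heuristic (words longer than 3 characters), and the `minShared` threshold of 2 are all assumptions, not the rules implemented in nightly-pv-collect.ts.

```typescript
// Assumed task shape; lastActivity is YYYY-MM-DD (UTC).
interface BoardTask { id: string; title: string; impactedApps: string[]; lastActivity: string }

// Stale: no activity for more than `thresholdDays` (the doc uses 60 for Backlog tasks).
function isStale(task: BoardTask, today: string, thresholdDays = 60): boolean {
  const ageDays = (Date.parse(today) - Date.parse(task.lastActivity)) / (24 * 60 * 60 * 1000);
  return ageDays > thresholdDays;
}

// Lowercased title words longer than 3 characters; a crude stand-in for keyword extraction.
function keywords(title: string): string[] {
  return title.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3);
}

// Possible duplicate: at least one shared app AND at least `minShared` shared title keywords.
function looksDuplicate(a: BoardTask, b: BoardTask, minShared = 2): boolean {
  const sharesApp = a.impactedApps.some((app) => b.impactedApps.includes(app));
  const bWords = new Set(keywords(b.title));
  const shared = new Set(keywords(a.title).filter((w) => bWords.has(w)));
  return sharesApp && shared.size >= minShared;
}
```

Pairs flagged this way are only candidates; the LLM review step still approves or rejects each one.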

⑤ Task Worker (04:30 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | What approved task should the agent implement tonight? |
| Pre-scan | nightly-task-worker-collect.ts: reads agent queue, picks first approved task (FIFO) |
| LLM | Claude Opus implements the task (~80 turns, 60 min timeout) |
| Output | Implementation PR + Slack #ai-janitor with human review instructions |
| Memory | .kanbn/memory/task-worker-history.json (7-day cooldown) |
| Memory | .kanbn/memory/agent-queue.json (removes completed task) |
| Workflow | .github/workflows/nightly-task-worker.yml |
| Prompt | .github/prompts/nightly-task-worker.md |

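Selection: Strict queue-only. Reads agent-queue.json and picks the first "approved" entry in FIFO order; when the worker runs as 3 parallel slots (see Queue lifecycle under ⑦ Planning Worker), slot 0 takes the 1st approved entry, slot 1 the 2nd, and slot 2 the 3rd. There is no fallback: if no tasks are approved, the worker reports to Slack with clear instructions for the human.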

History persistence: After execution, creates an auto-merge PR to push task-worker-history.json + agent-queue.json updates to main. This ensures the 7-day cooldown works across nightly runs.
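
A sketch of queue-only selection combined with the cooldown follows. It assumes the cooldown means "skip any task attempted within the last 7 days"; that interpretation, the `Attempt` shape, and the `pickNextTask` name are assumptions, since the document does not spell out the cooldown rule.

```typescript
interface QueueEntry { taskId: string; status: "proposed" | "approved" }
// Assumed history record; attemptedAt is YYYY-MM-DD (UTC).
interface Attempt { taskId: string; attemptedAt: string }

// Returns the first approved entry that is not in cooldown, or null when nothing is runnable.
function pickNextTask(
  queue: QueueEntry[],
  attempts: Attempt[],
  today: string,
  cooldownDays = 7,
): QueueEntry | null {
  const cutoff = Date.parse(today) - cooldownDays * 24 * 60 * 60 * 1000;
  const cooling = new Set(
    attempts.filter((a) => Date.parse(a.attemptedAt) > cutoff).map((a) => a.taskId),
  );
  for (const entry of queue) {
    if (entry.status === "approved" && !cooling.has(entry.taskId)) return entry;
  }
  return null; // no fallback: the caller reports to Slack instead
}
```

This only works across nights if the history file is actually persisted to main, which is why the memory update PR matters.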

Slack report always includes:

  • What was done (task ID, changed files)
  • What was validated (lint, typecheck)
  • What the human should do next (review PR, check acceptance criteria)

agent-queue.json ──▶ collect (FIFO) ──▶ Claude Opus implements
                                              │
                              ┌────────────────┼───────────────┐
                              ▼                ▼               ▼
                         Code changes     No changes       Failed
                              │                │               │
                              ▼                ▼               ▼
                         Create PR        Record in       Record in
                         + labels         history         history
                              │                │               │
                              └────────┬───────┘               │
                                       ▼                       │
                                Auto-merge PR               Slack report
                                for memory update           with guidance
                                       │
                                       ▼
                                Slack report with
                                human review steps

⑥ Architecture Review (05:00 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | Do our architecture docs match the actual code? |
| Pre-scan | nightly-architecture-review-collect.ts: rotation through 22 targets |
| LLM | Reads code and docs, verifies alignment, updates with Mermaid diagrams |
| Output | PR with doc updates + kanban tasks for improvements |
| Memory | .kanbn/memory/architecture-review-rotation.json |
| Workflow | .github/workflows/nightly-architecture-review.yml |

Rotation: 22 targets (11 apps + 11 packages), one per night. Full rotation ~3 weeks.

⑦ Planning Worker (05:30 UTC) — ✅ ACTIVE

| Aspect | Detail |
| --- | --- |
| Question | What should the task worker work on next? |
| Pre-scan | nightly-planning-worker-collect.ts: gathers agent-eligible tasks, verifies file references against codebase, checks open PRs, loads current queue |
| LLM | Claude Sonnet curates queue: proposes verified tasks, flags stale tasks, blocks tasks needing human judgment |
| Output | PR with updated agent-queue.json for human review |
| Workflow | .github/workflows/nightly-planning-worker.yml |
| Prompt | .github/prompts/nightly-planning-worker.md |

Producer-consumer pattern with verification:

Planning Worker (05:30)              Human Review                Task Worker (04:30 next day)
├── Verify task refs vs codebase     ├── /approve-tasks skill    ├── Read agent-queue.json
├── Evaluate agent-eligible tasks    │   or review PR directly   ├── 3 parallel slots (matrix)
├── Flag stale/outdated tasks        ├── Approve/reject/reorder  ├── Each picks next approved (FIFO)
├── Propose execution order          ├── Handle flagged tasks    ├── Implement with Claude
├── Block tasks needing human        │   (close/demote/update)   ├── Create implementation PRs
└── Create PR with queue update      └── Merge PR to main        └── Auto-merge memory updates

Queue lifecycle:

  1. Planning worker verifies task references against codebase (files exist? recently changed?)
  2. Verified tasks proposed (status: "proposed"), stale tasks moved to "flagged" list
  3. Human uses /approve-tasks skill to review: approve, reject, reorder, or handle flagged tasks
  4. Flagged tasks: human decides — close, demote to user request registry (.kanbn/user-requests/*.md), or update description
  5. Human merges PR to main
  6. Next night, task worker runs 3 parallel slots — each picks the next approved task (slot 0 = 1st, slot 1 = 2nd, slot 2 = 3rd)
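
The slot assignment in step 6 can be sketched as an index into the approved entries. `pickForSlot` is a hypothetical helper; how the real matrix job passes its slot number to the collect script is not described here.

```typescript
// slotIndex 0 takes the 1st approved entry, 1 the 2nd, 2 the 3rd; null means the slot idles.
function pickForSlot(queue: { taskId: string; status: string }[], slotIndex: number): string | null {
  const approved = queue.filter((e) => e.status === "approved");
  return slotIndex >= 0 && slotIndex < approved.length ? approved[slotIndex].taskId : null;
}
```

Because every slot reads the same merged queue and derives its pick from its own index, the slots do not need to coordinate to avoid choosing the same task.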

Checks existing PRs: Tasks with open nightly-task/ PRs are reported as "ongoing" and not re-queued, preventing the worker from proposing tasks that are already being worked on.

Task Worker + Planning Worker: Agent Queue

The agent queue (agent-queue.json) is the coordination point between the planning worker and the task worker.

{
  "queue": [
    { "taskId": "perf-investigate-hello-start", "reason": "...", "status": "approved" },
    { "taskId": "bvf-font-loading-optimization", "reason": "...", "status": "proposed" }
  ],
  "blocked": [
    { "taskId": "convex-monorepo-setup", "reason": "Needs human decisions on schema" }
  ],
  "flagged": [
    { "taskId": "stale-task", "reason": "3/4 files deleted", "recommendation": "close",
      "verificationFindings": ["file X no longer exists"], "flaggedAt": "2026-03-12" }
  ],
  "completed": [
    { "taskId": "some-done-task", "completedAt": "2026-03-09", "outcome": "pr_created:1234" }
  ]
}
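
The structure above can be described with TypeScript types, which makes the transitions between lists explicit. The type names and the `completeTask` helper are illustrative; only the field names and the "approved"/"proposed" status values come from the example JSON.

```typescript
interface QueuedTask { taskId: string; reason: string; status: "proposed" | "approved" }
interface BlockedTask { taskId: string; reason: string }
interface FlaggedTask extends BlockedTask {
  recommendation: string;
  verificationFindings: string[];
  flaggedAt: string;
}
interface CompletedTask { taskId: string; completedAt: string; outcome: string }

interface AgentQueue {
  queue: QueuedTask[];
  blocked: BlockedTask[];
  flagged: FlaggedTask[];
  completed: CompletedTask[];
}

// Moves a finished task from `queue` to `completed` without mutating the input.
function completeTask(q: AgentQueue, taskId: string, completedAt: string, outcome: string): AgentQueue {
  return {
    ...q,
    queue: q.queue.filter((t) => t.taskId !== taskId),
    completed: [...q.completed, { taskId, completedAt, outcome }],
  };
}
```

Returning a new object rather than editing in place keeps the before and after states easy to diff in the memory update PR.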

Why not priority-based selection? The old approach sorted by P0→P3 + effort, which caused a P0 task needing human judgment (like convex-monorepo-setup) to block all other work indefinitely. The queue gives humans explicit control over execution order while letting the LLM propose sensible defaults.

Manual Orchestration

Run individual workers

Each worker has workflow_dispatch enabled:

# Run a specific worker
gh workflow run nightly-task-worker.yml
gh workflow run nightly-planning-worker.yml
gh workflow run nightly-kanban-hygiene.yml
gh workflow run nightly-dead-code-cleanup.yml
gh workflow run nightly-dependency-health.yml
gh workflow run nightly-architecture-review.yml
gh workflow run daily-overview.yml
gh workflow run weekly-dependency-health.yml

# Task worker with specific task override
gh workflow run nightly-task-worker.yml -f task_id=perf-investigate-hello-start

Run entire pipeline

Use the orchestrator workflow to run all workers in sequence:

# Run the full nightly pipeline now
gh workflow run nightly-run-all.yml

# Run with specific workers only
gh workflow run nightly-run-all.yml -f workers="task-worker,planning-worker"
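
One way an orchestrator could interpret the `workers` input is sketched below. Only "task-worker" and "planning-worker" appear in this document as input values; the other identifiers, the `selectWorkers` name, and the rule that an empty input selects everything are assumptions. The Sunday-only weekly digest is left out of this sketch.

```typescript
// Pipeline order from the timeline. Identifiers other than "task-worker" and
// "planning-worker" are assumed by analogy with the workflow file names.
const PIPELINE_ORDER = [
  "daily-overview",
  "dead-code-cleanup",
  "dependency-health",
  "kanban-hygiene",
  "task-worker",
  "architecture-review",
  "planning-worker",
];

// Parses the comma-separated `workers` input; an empty input selects the full pipeline.
// The result always follows pipeline order, regardless of the order requested.
function selectWorkers(input: string): string[] {
  const requested = input.split(",").map((w) => w.trim()).filter((w) => w.length > 0);
  const unknown = requested.filter((w) => !PIPELINE_ORDER.includes(w));
  if (unknown.length > 0) throw new Error("unknown workers: " + unknown.join(", "));
  if (requested.length === 0) return [...PIPELINE_ORDER];
  return PIPELINE_ORDER.filter((w) => requested.includes(w));
}
```

Sorting the selection back into pipeline order preserves the dependency that the planning worker runs after the task worker.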

Scaling Considerations: GitHub Actions vs. Alternatives

Current approach: GitHub Actions

Pros:

  • Zero infrastructure to maintain
  • Built-in secrets management
  • Native gh CLI access
  • Cron scheduling
  • Workflow dispatch for manual triggers
  • Concurrency controls
  • Each worker is independently deployable

Cons as it scales:

  • 8 scheduled workflows = 8 independent cron schedules with no coordination
  • No shared state between runs (each checkout is fresh)
  • Hard to express "run B after A completes" across workflows
  • Billing: 4vCPU runners for 60-min task worker runs add up
  • Log visibility: must click into each workflow run separately

When to consider alternatives

| Signal | What it means |
| --- | --- |
| >10 scheduled workflows | Cron management becomes unwieldy |
| Workers need each other's output | GitHub Actions can't pass artifacts between workflows easily |
| Cost exceeds ~$200/month | Self-hosted runners or alternative CI would be cheaper |
| Need real-time queue processing | Cron is too coarse; event-driven triggers are needed |
| Need human-in-the-loop during execution | GitHub Actions has no interactive approval mid-run |

Possible migration paths

  1. Short term (now → 15 workflows): Stay on GitHub Actions. Add orchestrator workflow for manual full-pipeline runs. Use workflow_run triggers for chaining.

  2. Medium term (15+ workflows): Extract to a lightweight orchestrator (e.g. Temporal, Inngest, or a simple Express app on Cloud Run) that:

    • Manages the schedule centrally
    • Passes artifacts between steps natively
    • Provides a dashboard for all worker runs
    • Still triggers Claude via claude-code-action or direct API

  3. Long term: If workers become event-driven (e.g. "run task worker when a queue entry is approved"), consider a message queue (Cloud Tasks, PubSub) + Cloud Run workers.

Recommendation: Stay on GitHub Actions for now. The current count of 8 scheduled workflows is manageable. Add the orchestrator workflow for manual runs. Revisit when you hit 12+ scheduled workflows or need inter-workflow artifact passing.

Memory & State

Nightly workers persist state in .kanbn/memory/:

| File | Purpose | Updated by |
| --- | --- | --- |
| dead-code-rotation.json | Rotation index, scan history | ② Dead Code Cleanup |
| dependency-health.json | Known vulns registry, daily log, trend history | ③ Daily + ⑧ Weekly |
| board-health.json | Board metrics, recurring flags | ④ Kanban Hygiene |
| task-worker-history.json | Execution history, 7-day cooldown | ⑤ Task Worker |
| agent-queue.json | Curated task execution queue + flagged list | ⑤ Task Worker + ⑦ Planning Worker |
| architecture-review-rotation.json | Rotation index, review history | ⑥ Architecture Review |

User request demand signals live in .kanbn/user-requests/*.md (markdown, not JSON in memory/). See CLAUDE-extended § "User Request Registry".

Protected main: Memory updates are pushed to main via auto-merge PRs (not direct commits). The kanban-memory pattern in check-auto-merge-eligibility.ts enables this.

File Index

.github/
├── workflows/
│   ├── daily-overview.yml                 ① trigger
│   ├── daily-overview-post.yml            ① Slack relay
│   ├── nightly-dead-code-cleanup.yml      ② trigger
│   ├── nightly-dependency-health.yml      ③ daily trigger
│   ├── weekly-dependency-health.yml       ⑧ weekly trigger
│   ├── nightly-kanban-hygiene.yml         ④ trigger
│   ├── nightly-task-worker.yml            ⑤ trigger
│   ├── nightly-architecture-review.yml    ⑥ trigger
│   ├── nightly-planning-worker.yml        ⑦ trigger
│   ├── nightly-run-all.yml               orchestrator (manual)
│   └── auto-merge-eligible.yml            auto-merge gating
├── prompts/
│   ├── daily-overview.md                  ① prompt
│   ├── nightly-dead-code-cleanup.md       ② prompt
│   ├── nightly-dependency-health.md       ③ daily prompt
│   ├── weekly-dependency-health.md        ⑧ weekly prompt
│   ├── nightly-kanban-hygiene.md          ④ prompt
│   ├── nightly-task-worker.md             ⑤ prompt
│   ├── nightly-architecture-review.md     ⑥ prompt
│   └── nightly-planning-worker.md         ⑦ prompt

packages/ci-scripts/src/
├── get-daily-overview-window.ts                 ① helper
├── find-daily-overview-file.ts                  ① helper
├── post-daily-overview-to-slack.ts              ① Slack
├── post-dead-code-cleanup-to-slack.ts           ② Slack
├── nightly-dependency-health-collect.ts         ③⑧ Step 1
├── post-dependency-health-to-slack.ts           ③⑧ Slack
├── nightly-pv-collect.ts                        ④ Step 1
├── nightly-pv-apply.ts                          ④ Step 3
├── post-product-verification-to-slack.ts        ④ Slack
├── nightly-task-worker-collect.ts               ⑤ Step 1
├── nightly-task-worker-apply.ts                 ⑤ Step 3
├── post-task-worker-to-slack.ts                 ⑤ Slack
├── nightly-architecture-review-collect.ts       ⑥ Step 1
├── post-architecture-review-to-slack.ts         ⑥ Slack
├── nightly-planning-worker-collect.ts           ⑦ Step 1
├── nightly-planning-worker-apply.ts             ⑦ Step 3
├── post-planning-worker-to-slack.ts             ⑦ Slack
└── check-auto-merge-eligibility.ts              auto-merge patterns

.kanbn/memory/
├── dead-code-rotation.json              ② state
├── dependency-health.json               ③⑧ state (known vulns + daily log + history)
├── board-health.json                    ④ state
├── task-worker-history.json             ⑤ state
├── agent-queue.json                     ⑤⑦ shared state (queue + flagged)
└── architecture-review-rotation.json    ⑥ state

.kanbn/user-requests/
└── *.md                                 demand signals (markdown, /approve-tasks skill)