Nightly Maintenance Architecture
Status: 8 active workers, 4 planned. Last updated: 2026-03-10
Overview
The nightly maintenance pipeline is a set of focused, single-purpose workers that run sequentially after hours (UTC). Each worker answers one question about the codebase, makes one type of change, and produces output reviewable in under a minute.
Goal: Automate the work that would otherwise require dedicated engineers — code review, documentation, dependency management, tech debt reduction, and task execution — without adding headcount.
Pipeline Timeline
UTC Worker Status Slack Channel Depends On
──── ───────────────────────── ────────── ────────────── ──────────
00:00 ① Daily Overview ✅ ACTIVE #daily-overview —
02:00 ② Dead Code Cleanup ✅ ACTIVE #ai-janitor —
03:00 ③ Dependency Health (daily) ✅ ACTIVE #ai-janitor —
04:00 ④ Kanban Hygiene ✅ ACTIVE #ai-janitor ①
04:30 ⑤ Task Worker ✅ ACTIVE #ai-janitor agent-queue.json
05:00 ⑥ Architecture Review ✅ ACTIVE #ai-janitor —
05:30 ⑦ Planning Worker ✅ ACTIVE #ai-janitor ⑤ (runs after)
Sun 05:00 ⑧ Dep Health (weekly) ✅ ACTIVE #ai-janitor ③
──── PLANNED ────────────────────────────────────────────────────────────
⑨ New Code Reviewer 🔲 PLANNED #ai-janitor ①
⑩ Boy Scout Scanner 🔲 PLANNED #ai-janitor —
⑪ Performance Baseline 🔲 PLANNED #ai-janitor —
⑫ Documentation Generator 🔲 PLANNED TBD —
Workers run staggered to avoid CI resource contention. The task worker and planning worker form a producer-consumer pair connected by agent-queue.json.
Orchestrated Manual Trigger
All workers can be triggered individually via workflow_dispatch. To run the full pipeline manually (e.g. after a big merge), use the orchestrator workflow:
File: .github/workflows/nightly-run-all.yml
Manual trigger → runs all active workers in sequence with proper wait times
See Manual Orchestration section below.
Universal Worker Pattern
Every nightly worker follows the same 3-step architecture:
┌─────────────────────────────────────────────────────┐
│ GitHub Actions │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Step 1 │ │ Step 2 │ │ Step 3 │ │
│ │ PRE-SCAN │──▶│ LLM REVIEW │──▶│ APPLY │ │
│ │ │ │ │ │ │ │
│ │ Deterministic │ │ Claude (if │ │ Create PR │ │
│ │ TypeScript │ │ candidates │ │ Post Slack│ │
│ │ Zero LLM cost│ │ exist) │ │ Update │ │
│ │ │ │ │ │ memory │ │
│ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ candidates.json decisions.json PR + Slack │
└─────────────────────────────────────────────────────┘
Why this pattern?
- Zero cost when clean — Pre-scan gates the LLM. No candidates = no Claude API call.
- Testable — Deterministic steps can be unit tested without LLM mocking.
- Reviewable — Each step produces a JSON artifact that's inspectable.
- Bounded cost — LLM step has an explicit `--max-turns` cap.
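As a sketch, the Step 1 → Step 2 gate amounts to the following (hypothetical names; the real workers wire this through workflow step outputs, not a shared function):

```typescript
// Sketch of the pre-scan → LLM gate (illustrative shape, not the actual worker code).
interface Candidate {
  id: string;
  file: string;
  reason: string;
}

// Step 1 writes candidates.json; Step 2 runs only if candidates exist.
function shouldInvokeLlm(candidates: Candidate[]): boolean {
  // Zero cost when clean: no candidates means no Claude API call.
  return candidates.length > 0;
}
```

In the workflow itself, Step 1 would export this as a step output and Step 2 would carry an `if:` condition on it, so the Claude job never even starts on a clean night.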
Worker Details
① Daily Overview (00:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | What did we ship today? |
| Pre-scan | get-daily-overview-window.ts computes 24h window, gh fetches merged PRs + commits |
| LLM | Classifies changes by impact, generates HEAD (Slack headline) + COMPACT (detail) |
| Output | .kanbn/daily-overview/YYYY-MM-DD-daily-overview.md → PR → Slack #daily-overview |
| Workflow | .github/workflows/daily-overview.yml |
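The window computation is simple enough to sketch (assumed behavior of get-daily-overview-window.ts; the actual helper may differ in details):

```typescript
// Assumed behavior: a 24-hour window ending "now", as ISO timestamps
// suitable for a gh search query. Illustrative, not the real helper.
function getDailyOverviewWindow(now: Date): { since: string; until: string } {
  const until = now.toISOString();
  const since = new Date(now.getTime() - 24 * 60 * 60 * 1000).toISOString();
  return { since, until };
}
```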
② Dead Code Cleanup (02:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | Is there unused code in workspace X? |
| Pre-scan | Knip static analysis on rotation target (35 workspaces) |
| LLM | Reviews Knip findings, removes dead code with skip rules |
| Output | Auto-merge PR + Slack #ai-janitor |
| Memory | .kanbn/memory/dead-code-rotation.json |
| Workflow | .github/workflows/nightly-dead-code-cleanup.yml |
Rotation: 35 workspaces, one per night. Full rotation = ~5 weeks.
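The rotation bookkeeping can be sketched as follows (the RotationState fields are assumptions, not the actual dead-code-rotation.json schema):

```typescript
// Hypothetical shape of the rotation memory; field names are assumptions.
interface RotationState {
  index: number;
  lastScanned?: string;
}

function pickRotationTarget(
  state: RotationState,
  workspaces: string[],
): { target: string; next: RotationState } {
  const target = workspaces[state.index % workspaces.length];
  // Advance modulo the workspace count so a full pass wraps around
  // (~5 weeks at 35 workspaces, one per night).
  return {
    target,
    next: { index: (state.index + 1) % workspaces.length, lastScanned: target },
  };
}
```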
③ Dependency Health — Daily (03:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | Are there safe bumps to apply? Any NEW vulnerabilities? |
| Pre-scan | pnpm audit + pnpm outdated + diff against known vulns registry |
| LLM | Haiku (~10 turns): apply bumps, handle failures, compose Slack summary, flag new critical |
| Output | Auto-merge PR for safe bumps, human-friendly Slack summary |
| Memory | .kanbn/memory/dependency-health.json (known vulns + daily log) |
| Workflow | .github/workflows/nightly-dependency-health.yml |
| Prompt | .github/prompts/nightly-dependency-health.md |
Daily design: The collect step diffs today's vulns against knownVulnerabilities in memory. Only NEW vulns get flagged. Known transitive vulns (e.g., 35 minimatch/tar/rollup issues waiting on upstream) are not re-investigated. The daily LLM uses Haiku (cheap, fast) — its job is mechanical: apply bumps, revert failures with detailed error logging, compose a Slack summary. No full report written to the repo.
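The NEW-vuln diff boils down to a set difference (illustrative shape; field names are assumptions, not the actual dependency-health.json schema):

```typescript
// Illustrative vuln shape; the real registry lives in dependency-health.json.
interface Vuln {
  id: string;
  package: string;
  severity: string;
}

function diffNewVulns(today: Vuln[], known: Vuln[]): Vuln[] {
  const knownIds = new Set(known.map((v) => v.id));
  // Known transitive vulns waiting on upstream are filtered out here,
  // so the daily LLM step only sees genuinely new findings.
  return today.filter((v) => !knownIds.has(v.id));
}
```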
Auto-merge: Safe-only PRs (lockfile + memory) get the auto-merge label. The dependency-health-safe pattern in check-auto-merge-eligibility.ts enables this.
⑧ Dependency Health — Weekly Digest (Sunday 05:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | What happened with dependencies this week? What should humans act on? |
| Pre-scan | Fresh pnpm audit + pnpm outdated + week's daily log from memory |
| LLM | Sonnet (~50 turns): consolidate week, write report, produce actionable items |
| Output | docs/reports/dependency-health/YYYY-WXX.md + PR + Slack digest |
| Workflow | .github/workflows/weekly-dependency-health.yml |
| Prompt | .github/prompts/weekly-dependency-health.md |
Weekly design: ONE report per week (not per day). Consolidates all daily bump results, failures, new/resolved vulns into a single human-readable digest. The "You should review" section is deduplicated — if "upgrade posthog-js" appeared 5 times in daily logs, it appears once. Cleans up daily log entries older than 14 days and deletes old daily report files.
Daily (Mon-Sun 3AM) Weekly (Sunday 5AM)
├── Collect + diff known vulns ├── Fresh collect
├── Haiku: apply bumps ├── Sonnet: consolidate week
├── Log to dailyLog[] ├── Write YYYY-WXX.md report
├── Auto-merge safe bumps ├── "You should review" digest
├── Slack one-liner ├── Clean up old daily data
└── NO report file └── Detailed Slack digest
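The dedup in the "You should review" section can be sketched like this (purely illustrative; in practice the Sonnet prompt performs this consolidation, not a script):

```typescript
// First occurrence wins; later repeats across the week's daily logs are dropped.
function dedupeReviewItems(dailyLog: { review: string[] }[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const day of dailyLog) {
    for (const item of day.review) {
      if (!seen.has(item)) {
        seen.add(item);
        out.push(item);
      }
    }
  }
  return out;
}
```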
④ Kanban Hygiene (04:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | Is the kanban board accurate? |
| Pre-scan | nightly-pv-collect.ts — cross-reference daily overview with task statuses |
| LLM | Review candidates — approve/reject status changes and duplicates |
| Output | Single-purpose PR with clear commit messages per change type |
| Workflow | .github/workflows/nightly-kanban-hygiene.yml |
Focused on 4 checks:
- Status sync (task status matches merged PR state)
- Staleness detection (>60 days without activity in Backlog)
- Duplicate flagging (shared impactedApps + overlapping title keywords)
- Archive Done tasks (move to .kanbn/archived-tasks/)
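The staleness check is a simple cutoff comparison (hypothetical task shape; the real collect script reads kanban task files):

```typescript
// Hypothetical task shape; the actual fields come from the kanban task files.
interface Task {
  id: string;
  column: string;
  lastActivity: string; // ISO timestamp
}

function findStaleBacklogTasks(tasks: Task[], now: Date, maxDays = 60): Task[] {
  const cutoff = now.getTime() - maxDays * 24 * 60 * 60 * 1000;
  // Only Backlog tasks count; active columns are expected to move slowly anyway.
  return tasks.filter(
    (t) => t.column === "Backlog" && new Date(t.lastActivity).getTime() < cutoff,
  );
}
```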
⑤ Task Worker (04:30 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | What approved task should the agent implement tonight? |
| Pre-scan | nightly-task-worker-collect.ts — reads agent queue, picks first approved task (FIFO) |
| LLM | Claude Opus implements the task (~80 turns, 60 min timeout) |
| Output | Implementation PR + Slack #ai-janitor with human review instructions |
| Memory | .kanbn/memory/task-worker-history.json (7-day cooldown) |
| Memory | .kanbn/memory/agent-queue.json (removes completed task) |
| Workflow | .github/workflows/nightly-task-worker.yml |
| Prompt | .github/prompts/nightly-task-worker.md |
Selection: Strict queue-only. Reads agent-queue.json, picks first "approved" entry. No fallback. If no approved tasks, reports to Slack with clear instructions for the human.
History persistence: After execution, creates an auto-merge PR to push task-worker-history.json + agent-queue.json updates to main. This ensures the 7-day cooldown works across nightly runs.
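The cooldown itself amounts to a lookup against recent history entries (field names are assumptions; task-worker-history.json holds the real schema):

```typescript
// Assumed history entry shape; illustrative only.
interface HistoryEntry {
  taskId: string;
  executedAt: string; // ISO timestamp
}

function isOnCooldown(
  history: HistoryEntry[],
  taskId: string,
  now: Date,
  days = 7,
): boolean {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  // Any execution inside the window blocks re-selection of the same task.
  return history.some(
    (h) => h.taskId === taskId && new Date(h.executedAt).getTime() >= cutoff,
  );
}
```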
Slack report always includes:
- What was done (task ID, changed files)
- What was validated (lint, typecheck)
- What the human should do next (review PR, check acceptance criteria)
agent-queue.json ──▶ collect (FIFO) ──▶ Claude Opus implements
│
┌────────────────┼───────────────┐
▼ ▼ ▼
Code changes No changes Failed
│ │ │
▼ ▼ ▼
Create PR Record in Record in
+ labels history history
│ │ │
└────────┬───────┘ │
▼ │
Auto-merge PR Slack report
for memory update with guidance
│
▼
Slack report with
human review steps
⑥ Architecture Review (05:00 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | Do our architecture docs match the actual code? |
| Pre-scan | nightly-architecture-review-collect.ts — rotation through 22 targets |
| LLM | Reads code and docs, verifies alignment, updates with Mermaid diagrams |
| Output | PR with doc updates + kanban tasks for improvements |
| Memory | .kanbn/memory/architecture-review-rotation.json |
| Workflow | .github/workflows/nightly-architecture-review.yml |
Rotation: 22 targets (11 apps + 11 packages), one per night. Full rotation ~3 weeks.
⑦ Planning Worker (05:30 UTC) — ✅ ACTIVE
| Aspect | Detail |
|---|---|
| Question | What should the task worker work on next? |
| Pre-scan | nightly-planning-worker-collect.ts — gathers agent-eligible tasks, verifies file references against codebase, checks open PRs, loads current queue |
| LLM | Claude Sonnet curates queue: proposes verified tasks, flags stale tasks, blocks tasks needing human judgment |
| Output | PR with updated agent-queue.json for human review |
| Workflow | .github/workflows/nightly-planning-worker.yml |
| Prompt | .github/prompts/nightly-planning-worker.md |
Producer-consumer pattern with verification:
Planning Worker (05:30) Human Review Task Worker (04:30 next day)
├── Verify task refs vs codebase ├── /approve-tasks skill ├── Read agent-queue.json
├── Evaluate agent-eligible tasks │ or review PR directly ├── 3 parallel slots (matrix)
├── Flag stale/outdated tasks ├── Approve/reject/reorder ├── Each picks next approved (FIFO)
├── Propose execution order ├── Handle flagged tasks ├── Implement with Claude
├── Block tasks needing human │ (close/demote/update) ├── Create implementation PRs
└── Create PR with queue update └── Merge PR to main └── Auto-merge memory updates
Queue lifecycle:
- Planning worker verifies task references against codebase (files exist? recently changed?)
- Verified tasks proposed (status: "proposed"), stale tasks moved to the "flagged" list
- Human uses the /approve-tasks skill to review: approve, reject, reorder, or handle flagged tasks
- Flagged tasks: human decides — close, demote to the user request registry (.kanbn/user-requests/*.md), or update the description
- Human merges the PR to main
- Next night, task worker runs 3 parallel slots — each picks the next approved task (slot 0 = 1st, slot 1 = 2nd, slot 2 = 3rd)
Checks existing PRs: Tasks with open nightly-task/ PRs are reported as "ongoing" and not re-queued, preventing the worker from proposing tasks that are already being worked on.
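The slot-based selection can be sketched as follows (illustrative; slot indices come from the workflow matrix, and the entry shape follows the agent-queue.json example in the next section):

```typescript
// Illustrative queue entry; see agent-queue.json for the real shape.
interface Entry {
  taskId: string;
  status: string;
}

// Slot N takes the (N+1)th approved entry: slot 0 = 1st, slot 1 = 2nd, ...
function pickForSlot(queue: Entry[], slot: number): Entry | null {
  const approved = queue.filter((e) => e.status === "approved");
  return approved[slot] ?? null;
}
```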
Task Worker + Planning Worker: Agent Queue
The agent queue (agent-queue.json) is the coordination point between the planning worker and the task worker.
```json
{
  "queue": [
    { "taskId": "perf-investigate-hello-start", "reason": "...", "status": "approved" },
    { "taskId": "bvf-font-loading-optimization", "reason": "...", "status": "proposed" }
  ],
  "blocked": [
    { "taskId": "convex-monorepo-setup", "reason": "Needs human decisions on schema" }
  ],
  "flagged": [
    {
      "taskId": "stale-task",
      "reason": "3/4 files deleted",
      "recommendation": "close",
      "verificationFindings": ["file X no longer exists"],
      "flaggedAt": "2026-03-12"
    }
  ],
  "completed": [
    { "taskId": "some-done-task", "completedAt": "2026-03-09", "outcome": "pr_created:1234" }
  ]
}
```
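The same shape can be expressed as TypeScript types (inferred from the example above; the real schema may carry more fields), with a trivial helper to make it concrete:

```typescript
// Types inferred from the agent-queue.json example above (illustrative).
type QueueStatus = "proposed" | "approved";

interface AgentQueue {
  queue: { taskId: string; reason: string; status: QueueStatus }[];
  blocked: { taskId: string; reason: string }[];
  flagged: {
    taskId: string;
    reason: string;
    recommendation: string;
    verificationFindings: string[];
    flaggedAt: string;
  }[];
  completed: { taskId: string; completedAt: string; outcome: string }[];
}

// How many entries the task worker could pick up tonight.
function countApproved(q: AgentQueue): number {
  return q.queue.filter((e) => e.status === "approved").length;
}
```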
Why not priority-based selection? The old approach sorted by P0→P3 + effort, which caused a P0 task needing human judgment (like convex-monorepo-setup) to block all other work indefinitely. The queue gives humans explicit control over execution order while letting the LLM propose sensible defaults.
Manual Orchestration
Run individual workers
Each worker has workflow_dispatch enabled:
```sh
# Run specific worker
gh workflow run nightly-task-worker.yml
gh workflow run nightly-planning-worker.yml
gh workflow run nightly-kanban-hygiene.yml
gh workflow run nightly-dead-code-cleanup.yml
gh workflow run nightly-dependency-health.yml
gh workflow run nightly-architecture-review.yml
gh workflow run daily-overview.yml

# Task worker with specific task override
gh workflow run nightly-task-worker.yml -f task_id=perf-investigate-hello-start
```
Run entire pipeline
Use the orchestrator workflow to run all workers in sequence:
```sh
# Run the full nightly pipeline now
gh workflow run nightly-run-all.yml

# Run with specific workers only
gh workflow run nightly-run-all.yml -f workers="task-worker,planning-worker"
```
Scaling Considerations: GitHub Actions vs. Alternatives
Current approach: GitHub Actions
Pros:
- Zero infrastructure to maintain
- Built-in secrets management
- Native gh CLI access
- Cron scheduling
- Workflow dispatch for manual triggers
- Concurrency controls
- Each worker is independently deployable
Cons as it scales:
- 7 workflows = 7 independent cron schedules with no coordination
- No shared state between runs (each checkout is fresh)
- Hard to express "run B after A completes" across workflows
- Billing: 4vCPU runners for 60-min task worker runs add up
- Log visibility: must click into each workflow run separately
When to consider alternatives
| Signal | What it means |
|---|---|
| >10 scheduled workflows | Cron management becomes unwieldy |
| Workers need each other's output | GitHub Actions can't pass artifacts between workflows easily |
| Cost exceeds ~$200/month | Self-hosted runners or alternative CI would be cheaper |
| Need real-time queue processing | Cron is too coarse — need event-driven triggers |
| Need human-in-the-loop during execution | GitHub Actions has no interactive approval mid-run |
Possible migration paths
- Short term (now → 15 workflows): Stay on GitHub Actions. Add orchestrator workflow for manual full-pipeline runs. Use workflow_run triggers for chaining.
- Medium term (15+ workflows): Extract to a lightweight orchestrator (e.g. Temporal, Inngest, or a simple Express app on Cloud Run) that:
  - Manages the schedule centrally
  - Passes artifacts between steps natively
  - Provides a dashboard for all worker runs
  - Still triggers Claude via claude-code-action or the direct API
- Long term: If workers become event-driven (e.g. "run task worker when a queue entry is approved"), consider a message queue (Cloud Tasks, PubSub) + Cloud Run workers.
Recommendation: Stay on GitHub Actions for now. The 7-workflow count is manageable. Add the orchestrator workflow for manual runs. Revisit when you hit 12+ scheduled workflows or need inter-workflow artifact passing.
Memory & State
Nightly workers persist state in .kanbn/memory/:
| File | Purpose | Updated by |
|---|---|---|
| dead-code-rotation.json | Rotation index, scan history | ② Dead Code Cleanup |
| dependency-health.json | Known vulns registry, daily log, trend history | ③ Daily + ⑧ Weekly |
| board-health.json | Board metrics, recurring flags | ④ Kanban Hygiene |
| task-worker-history.json | Execution history, 7-day cooldown | ⑤ Task Worker |
| agent-queue.json | Curated task execution queue + flagged list | ⑤ Task Worker + ⑦ Planning Worker |
| architecture-review-rotation.json | Rotation index, review history | ⑥ Architecture Review |
User request demand signals live in .kanbn/user-requests/*.md (markdown, not JSON in memory/). See CLAUDE-extended § "User Request Registry".
Protected main: Memory updates are pushed to main via auto-merge PRs (not direct commits). The kanban-memory pattern in check-auto-merge-eligibility.ts enables this.
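In the spirit of check-auto-merge-eligibility.ts, a pattern check for memory-only PRs might look like the following (the regex and function are illustrative, not the actual implementation):

```typescript
// Illustrative kanban-memory pattern: every changed file must live under
// .kanbn/memory/ for the auto-merge label to apply.
const KANBAN_MEMORY_PATTERN = /^\.kanbn\/memory\/[^/]+\.json$/;

function isMemoryOnlyPr(changedFiles: string[]): boolean {
  return (
    changedFiles.length > 0 &&
    changedFiles.every((f) => KANBAN_MEMORY_PATTERN.test(f))
  );
}
```

The key property is all-or-nothing: one file outside the allowed paths makes the whole PR ineligible, so mixed code-plus-memory PRs still require human review.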
File Index
.github/
├── workflows/
│ ├── daily-overview.yml ① trigger
│ ├── daily-overview-post.yml ① Slack relay
│ ├── nightly-dead-code-cleanup.yml ② trigger
│ ├── nightly-dependency-health.yml ③ daily trigger
│ ├── weekly-dependency-health.yml ⑧ weekly trigger
│ ├── nightly-kanban-hygiene.yml ④ trigger
│ ├── nightly-task-worker.yml ⑤ trigger
│ ├── nightly-architecture-review.yml ⑥ trigger
│ ├── nightly-planning-worker.yml ⑦ trigger
│ ├── nightly-run-all.yml orchestrator (manual)
│ └── auto-merge-eligible.yml auto-merge gating
├── prompts/
│ ├── daily-overview.md ① prompt
│ ├── nightly-dead-code-cleanup.md ② prompt
│ ├── nightly-dependency-health.md ③ daily prompt
│ ├── weekly-dependency-health.md ⑧ weekly prompt
│ ├── nightly-kanban-hygiene.md ④ prompt
│ ├── nightly-task-worker.md ⑤ prompt
│ ├── nightly-architecture-review.md ⑥ prompt
│ └── nightly-planning-worker.md ⑦ prompt
packages/ci-scripts/src/
├── get-daily-overview-window.ts ① helper
├── find-daily-overview-file.ts ① helper
├── post-daily-overview-to-slack.ts ① Slack
├── post-dead-code-cleanup-to-slack.ts ② Slack
├── nightly-dependency-health-collect.ts ③⑧ Step 1
├── post-dependency-health-to-slack.ts ③⑧ Slack
├── nightly-pv-collect.ts ④ Step 1
├── nightly-pv-apply.ts ④ Step 3
├── post-product-verification-to-slack.ts ④ Slack
├── nightly-task-worker-collect.ts ⑤ Step 1
├── nightly-task-worker-apply.ts ⑤ Step 3
├── post-task-worker-to-slack.ts ⑤ Slack
├── nightly-architecture-review-collect.ts ⑥ Step 1
├── post-architecture-review-to-slack.ts ⑥ Slack
├── nightly-planning-worker-collect.ts ⑦ Step 1
├── nightly-planning-worker-apply.ts ⑦ Step 3
├── post-planning-worker-to-slack.ts ⑦ Slack
└── check-auto-merge-eligibility.ts auto-merge patterns
.kanbn/memory/
├── dead-code-rotation.json ② state
├── dependency-health.json ③⑧ state (known vulns + daily log + history)
├── board-health.json ④ state
├── task-worker-history.json ⑤ state
├── agent-queue.json ⑤⑦ shared state (queue + flagged)
└── architecture-review-rotation.json ⑥ state
.kanbn/user-requests/
└── *.md demand signals (markdown, /approve-tasks skill)