
docs/architecture/hls-migration-findings.md

HLS Migration Findings — Preview Environment

Date: 2026-03-05
Branch: emdash/migration-hls-47j (PR #1259)
Environment: Preview (Cloud Run video-processor-preview, Upstash Redis, Cloudflare R2)

Overview

This document records findings from backfilling HLS streams for existing video responses that only have raw .mp4 URLs. The backfill endpoint (POST /api/backfill-hls on admin) discovers responses with a videoUrl but no streamUrl, sends them to the video-processor for HLS encoding, then writes the resulting streamUrl back to the session in Redis.
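
The discovery step can be sketched as a simple filter over session responses. This is a minimal sketch: videoUrl and streamUrl come from the document above, but the surrounding types and data are illustrative, not the real schema.

```typescript
// Minimal sketch of discovery: pick responses that have a raw .mp4 URL
// but no HLS stream URL yet. Only videoUrl/streamUrl are real field
// names; everything else here is illustrative.
interface QuestionResponse {
  videoUrl?: string | null;
  streamUrl?: string | null;
}

function findUnprocessed(responses: QuestionResponse[]): QuestionResponse[] {
  return responses.filter((r) => Boolean(r.videoUrl) && !r.streamUrl);
}

const sample: QuestionResponse[] = [
  { videoUrl: "https://blob.example/a.mp4", streamUrl: null },
  { videoUrl: "https://blob.example/b.mp4", streamUrl: "https://r2.example/b/master.m3u8" },
  { videoUrl: null },
];

console.log(findUnprocessed(sample).length); // 1: only a.mp4 still needs encoding
```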

Infrastructure

Endpoint

  • Admin API: apps/admin/app/api/backfill-hls/route.ts
  • Auth: Bearer <CRON_SECRET> (Clerk middleware bypassed for this route)
  • Max duration: 300s (Vercel function timeout)
  • Parameters: { flowId?, sessionId?, limit?, startFrom? }
  • limit: 0 = dry-run (discovery only, no processing)
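
A dry-run invocation can be sketched as follows. The endpoint path, Bearer auth, and parameter names come from the list above; the host and the request-building helper are illustrative assumptions.

```typescript
// Sketch of calling the backfill endpoint in dry-run mode (limit: 0 =
// discovery only, no processing). Host and secret are placeholders.
const CRON_SECRET = "<CRON_SECRET>"; // real value comes from the environment

function buildBackfillRequest(params: {
  flowId?: string;
  sessionId?: string;
  limit?: number;
  startFrom?: number;
}) {
  return {
    url: "https://admin.example.com/api/backfill-hls", // placeholder host
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${CRON_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(params),
  };
}

const dryRun = buildBackfillRequest({ limit: 0 });
console.log(dryRun.body); // {"limit":0}
```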

Video Processor (Cloud Run)

  • Service: video-processor-preview in europe-west1
  • Image: Express.js + FFmpeg, Docker on Cloud Run gen2
  • HLS encode: 3 renditions (800k, 1500k, 3000k) of input video → HLS segments + playlists + master manifest
  • Output: 7 files per video (3 segments + 3 playlists + 1 master) uploaded to Cloudflare R2
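
The 3-rendition encode can be sketched as an ffmpeg argument builder. The flags below (`-var_stream_map`, `-master_pl_name`, `-hls_time`, per-stream `-b:v:N`) are standard ffmpeg HLS muxer options, but the exact flags, codecs, and segment length the service uses are assumptions.

```typescript
// Sketch of the ffmpeg argument list for a 3-rendition HLS encode.
// Bitrates match the document (800k/1500k/3000k); rendition heights,
// segment length, and output layout are illustrative assumptions.
const RENDITIONS = [
  { bitrate: "800k", height: 480 },
  { bitrate: "1500k", height: 720 },
  { bitrate: "3000k", height: 1080 },
];

function buildHlsArgs(input: string, outDir: string): string[] {
  const args = ["-i", input];
  RENDITIONS.forEach((r, i) => {
    args.push("-map", "0:v", "-map", "0:a?"); // audio optional
    args.push(`-b:v:${i}`, r.bitrate, `-filter:v:${i}`, `scale=-2:${r.height}`);
  });
  args.push(
    "-f", "hls",
    "-hls_time", "6",
    "-hls_playlist_type", "vod",
    "-master_pl_name", "master.m3u8",     // the 1 master manifest
    "-var_stream_map", "v:0,a:0 v:1,a:1 v:2,a:2",
    `${outDir}/stream_%v.m3u8`,           // the 3 per-rendition playlists
  );
  return args;
}

console.log(buildHlsArgs("input.mp4", "out").includes("3000k")); // true
```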

Data Flow

Admin endpoint → VideoProcessingClient.hlsEncode() → Cloud Run video-processor
  → Download .mp4 from Vercel Blob
  → ffprobe metadata
  → ffmpeg HLS encode (3 renditions)
  → Upload segments to R2
  → Return manifest URL
Admin endpoint → Write streamUrl back to Redis session

Dataset (Preview)

  • Total sessions: 110
  • Unprocessed videos: 137 (after initial test processing)
  • Max file size: 34 MB
  • Avg file size: 4.8 MB
  • No videos over 50 MB in preview

Size Distribution

| Bucket | Count |
|---|---|
| <1 MB | 20 |
| 1–5 MB | 76 |
| 5–10 MB | 22 |
| 10–50 MB | 19 |
| 50–100 MB | 0 |
| >100 MB | 0 |

Memory Experiments

All tests with concurrency=1, max instances=20.

| Memory | CPU | Result |
|---|---|---|
| 512 Mi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 1 Gi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 2 Gi | 1 | Success — stable for all tested videos (up to 11.4 MB) |
| 2 Gi | 2 | Success — stable, faster encoding |

Conclusion: 2 Gi is the minimum for HLS encoding with 3 renditions at 1080p.

CPU Benchmark: 1 CPU vs 2 CPU

1 CPU, 2 Gi (revision 00498)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 0.97 MB | 2.7s | 33.8s | 12.5x realtime |
| 2.01 MB | 6.5s | 66.3s | 10.2x realtime |

2 CPU, 2 Gi (revision 00499)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 1.77 MB | 5.5s | 23.5s | 4.3x realtime |
| 11.43 MB | 20.4s | 112.9s | 5.5x realtime |
| 6.92 MB | 6.9s | 27.0s | 3.9x realtime |
| 5.98 MB | 6.1s | 22.3s | 3.6x realtime |
| 2.01 MB | 6.5s | 31.4s | 4.8x realtime |
| 10.22 MB | 9.0s | 28.6s | 3.2x realtime |

Observations

  • 2 CPU is ~2.5–3x faster per encode than 1 CPU
  • ffprobe latency drops from ~1.8s to ~0.2s on warm instances
  • Upload to R2 is negligible: consistently <1s
  • Download from Vercel Blob is negligible: <0.5s even for 11 MB
  • FFmpeg is 95%+ of total time
  • Encoding scales roughly linearly with video duration
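
The "roughly linear" observation suggests a back-of-envelope estimator: encode time ≈ realtime ratio × video duration, with ratios taken from the benchmark tables above. A sketch, not a fitted model:

```typescript
// Back-of-envelope encode-time estimate. The benchmarks above show
// FFmpeg time scaling roughly linearly with video duration: about
// 4x realtime on 2 CPU, and 10-12x realtime on 1 CPU.
function estimateEncodeSeconds(videoDurationSec: number, realtimeRatio: number): number {
  return videoDurationSec * realtimeRatio;
}

// A 6s video at the ~4x ratio typical of 2 CPU:
console.log(estimateEncodeSeconds(6, 4)); // 24
// The 20.4s video at its measured 5.5x ratio — close to the 112.9s in the table:
console.log(Math.round(estimateEncodeSeconds(20.4, 5.5))); // 112
```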

Processing Mode

  • Original: Sequential for loop — one video at a time
  • Current: Promise.allSettled — all videos in the batch fire in parallel
  • With concurrency=1 on Cloud Run, parallel requests trigger auto-scaling to multiple instances
  • Cloud Run instance spin-up time observed: ~8 seconds (cold start with startup probe)
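
The parallel mode can be sketched as follows. Promise.allSettled is the mechanism named above; hlsEncode here is a stub standing in for the real VideoProcessingClient call, so per-video failures are counted rather than aborting the batch.

```typescript
// Sketch of the parallel processing mode: every video in the batch
// fires at once via Promise.allSettled; with concurrency=1 on Cloud
// Run, each in-flight request lands on its own (auto-scaled) instance.
// hlsEncode is a stub, not the real client.
async function hlsEncode(videoUrl: string): Promise<string> {
  if (!videoUrl.endsWith(".mp4")) throw new Error(`not an mp4: ${videoUrl}`);
  return videoUrl.replace(/\.mp4$/, "/master.m3u8");
}

async function processBatch(urls: string[]) {
  const results = await Promise.allSettled(urls.map(hlsEncode));
  const ok = results.filter((r) => r.status === "fulfilled").length;
  return { ok, failed: results.length - ok };
}

processBatch(["a.mp4", "b.mp4", "broken.webm"]).then((summary) =>
  console.log(summary) // { ok: 2, failed: 1 }
);
```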

Estimated Backfill Time (137 remaining)

| Config | Avg encode (6s video) | With max 20 instances | Est. total |
|---|---|---|---|
| 1 CPU, 2 Gi | ~50s | ~7 batches | ~6–7 min |
| 2 CPU, 2 Gi | ~25s | ~7 batches | ~3–4 min |
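
The estimate is simple batch arithmetic (a sketch; the extra minute or so in the table covers cold starts and per-request overhead):

```typescript
// Batch arithmetic behind the estimate: with concurrency=1 and
// maxScale=20, the 137 videos split into ceil(137 / 20) = 7 waves,
// each taking roughly one average encode time.
function estimateBackfillSeconds(
  videos: number,
  maxInstances: number,
  avgEncodeSec: number,
): number {
  const batches = Math.ceil(videos / maxInstances);
  return batches * avgEncodeSec;
}

console.log(estimateBackfillSeconds(137, 20, 50)); // 350 (~6 min) at 1 CPU
console.log(estimateBackfillSeconds(137, 20, 25)); // 175 (~3 min) at 2 CPU
```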

Issues Encountered

Clerk middleware blocking API route

  • /api/backfill-hls was not in the isPublicRoute list in apps/admin/middleware.ts
  • Clerk's auth.protect() intercepted unauthenticated API calls → 404
  • Fix: Added /api/backfill-hls(.*) to public routes (route has its own Bearer <CRON_SECRET> auth)
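
The middleware change is roughly the following. This is a sketch of the Clerk clerkMiddleware/createRouteMatcher pattern; the actual matcher list in apps/admin/middleware.ts will contain more routes than shown here.

```typescript
// apps/admin/middleware.ts (sketch) — exclude the backfill route from
// Clerk protection; the route authenticates itself via Bearer <CRON_SECRET>.
import { clerkMiddleware, createRouteMatcher } from "@clerk/nextjs/server";

const isPublicRoute = createRouteMatcher([
  "/api/backfill-hls(.*)",
  // ...other public routes
]);

export default clerkMiddleware((auth, req) => {
  if (!isPublicRoute(req)) auth.protect();
});
```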

Vercel admin not deploying

  • turbo-ignore skipped admin builds on rebased branches (unreachable VERCEL_GIT_PREVIOUS_SHA)
  • Fix: Rebase onto latest main gave turbo-ignore a valid comparison base

Sequential processing timeout

  • 20 videos sequentially exceeded the 300s Vercel function timeout
  • Fix: Switched to Promise.allSettled for parallel processing

4 CPU Benchmark (revision 00498, 4 CPU / 2 Gi)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 6.92 MB | 6.9s | 14.2s | 2.1x realtime |
| 5.98 MB | 6.1s | 11.8s | 1.9x realtime |
| 10.22 MB | 9.0s | 16.3s | 1.8x realtime |

4 CPU is ~2x faster than 2 CPU, roughly halving encode time again.

Large Video OOM (4 CPU / 2 Gi)

  • Video: 200+ MB, 179s duration (session session_1772627716500, flow vf_nejiqtqdxtjc)
  • Result: OOM — FFmpeg SIGKILL after 12.4s of encoding
  • Log: [encodeToHls] ffmpeg failed | dur=179.178667s, error=ffmpeg process exited with code null, signal SIGKILL. Possible OOM
  • HLS failure handled as non-fatal; thumbnails, audio extraction, silence/quality detection still completed
  • Conclusion: 2 Gi is NOT sufficient for 200MB+ / 3-minute videos even with 4 CPU. Need 4 Gi or 8 Gi for production.

4 Gi Memory Test (200MB+ video)

  • Video: 200+ MB, 179s duration (session session_1772627716500)
  • Config: 4 CPU, 4 Gi (revision 00502)
  • Result: Success — HLS encoding completed without OOM
  • Conclusion: 4 Gi is sufficient for 200MB+ / 3-minute videos

Memory Requirements Summary

| Memory | Small (<34 MB) | Large (200MB+, 179s) |
|---|---|---|
| 512 Mi | OOM | OOM |
| 1 Gi | OOM | OOM |
| 2 Gi | OK | OOM |
| 4 Gi | OK | OK |

Production Backfill — 2026-03-10

Dataset (Production)

  • Total unprocessed videos: 2,019 (at start)
  • After normalizeResponse fix: 1,491 (452 had been processed but streamUrl was being stripped — see bug below)

Bug: normalizeResponse dropping streamUrl

The normalizeResponse function in packages/registries/src/server/video-processing-utils.ts explicitly constructs its return object but did not include streamUrl, even though:

  1. QuestionResponse type declares it: streamUrl?: string | null (video-processing-types.ts:111)
  2. It's correctly saved to Redis by updateSession
  3. It exists in the raw JSON when read from Redis

Impact: The backfill endpoint checks !response.streamUrl to find unprocessed videos — but since normalizeResponse strips it, every video always appears unprocessed. The backfill re-encoded the same videos in an infinite loop. HLS playback was also broken everywhere since no consumer could see streamUrl.

Fix: Added streamUrl extraction to the return object in normalizeResponse (same PR).
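
The bug pattern, reduced to a sketch (streamUrl is the real field name; the rest of the shape is illustrative):

```typescript
// Reduced reproduction of the normalizeResponse bug: an explicitly
// constructed return object silently drops any field it doesn't list,
// even if the type declares it and Redis contains it.
interface QuestionResponse {
  videoUrl?: string | null;
  streamUrl?: string | null;
}

// Before: streamUrl never copied, so every video looked unprocessed.
function normalizeResponseBefore(raw: QuestionResponse): QuestionResponse {
  return { videoUrl: raw.videoUrl ?? null };
}

// After: streamUrl included in the return object.
function normalizeResponseAfter(raw: QuestionResponse): QuestionResponse {
  return { videoUrl: raw.videoUrl ?? null, streamUrl: raw.streamUrl ?? null };
}

const raw = { videoUrl: "a.mp4", streamUrl: "a/master.m3u8" };
console.log(!normalizeResponseBefore(raw).streamUrl); // true — appears unprocessed
console.log(!normalizeResponseAfter(raw).streamUrl);  // false — correctly skipped
```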

Cloud Run Configuration (Production)

| Setting | Value | Effect |
|---|---|---|
| containerConcurrency | 1 | Each instance handles exactly 1 request at a time |
| minScale | 0 | No warm instances when idle — everything cold-starts |
| maxScale | 20 | Up to 20 instances can exist |
| cpu-throttling | true | CPU is reduced when instance has no active request |
| cpu | 8 | 8 vCPU per instance |
| memory | 4 Gi | 4 GB RAM per instance |
| startup-cpu-boost | true | Full CPU during startup |
| Startup probe | 3 attempts, 5s apart, 3s timeout | ~16s max to pass health check |

Issue: 429 Rate Limiting & 503 Startup Failures

When sending 20 parallel requests, 4 out of 20 failed:

  • 3x 429 Too Many Requests ("Rate exceeded")
  • 1x 503 Service Unavailable (readiness check failure)

Root Cause: Thundering Herd During Instance Recycling

With concurrency: 1 and minScale: 0, there's a critical gap between batches where no instances are available.

(Diagrams in the original, omitted here: how Cloud Run routes requests with concurrency=1, the timeline reconstructed from Cloud Run logs, and the instance recycling gap.)

The gap between batch N finishing and batch N+1 instances becoming ready is where the 429s happen. With minScale: 0, there's no warm instance to absorb requests during this transition.

Verified from logs

Exact timestamps from Cloud Run logs at 07:51:39:

  • 07:51:39.580–.620: 16 requests return 200 (previous batch finishing)
  • 07:51:39.589: Instance c404 fails readiness check → 503
  • 07:51:39.597–.599: Old instances start recycling
  • 07:51:39.616–.618: 3 requests aborted — 429 (no available instance)
  • 07:51:39.625–.637: 14 new instances begin starting up (too late for the 429'd requests)

All of this happens within ~40ms.

Fix: Increased Startup Probe Threshold

Changed failure_threshold from 3 to 5 in terraform/modules/video-processor/main.tf. This gives the container ~26s instead of ~16s to pass the health check, reducing 503s from startup probe failures. Applies to all environments (prod, staging, preview).
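
In google_cloud_run_v2_service terms, the probe block looks roughly like this. A sketch only: the real resource layout in terraform/modules/video-processor/main.tf may differ, and the /health path is an assumption.

```hcl
startup_probe {
  period_seconds    = 5
  timeout_seconds   = 3
  failure_threshold = 5  # was 3 — gives ~26s instead of ~16s to pass

  http_get {
    path = "/health"     # assumed health endpoint
  }
}
```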

Recommended Further Fixes

| Option | Fixes 429? | Fixes 503? | Risk | Effort |
|---|---|---|---|---|
| minScale: 1 | Partially | No | Very low (~$0/month with cpu-throttling) | 1 line in Terraform |
| Sequential sub-batches of 5 in endpoint | Yes | Reduces | None | Code change |
| concurrency: 2–3 | Yes | No | FFmpeg memory pressure | 1 line in Terraform |
| Combo: minScale=1 + sub-batches of 5 | Yes | Yes | None | Both changes |

The sub-batch approach changes the endpoint from:

```ts
// Current: all 20 fire at once
await Promise.allSettled(toProcess.map(item => videoClient.hlsEncode(...)));

// Recommended: process in sub-batches of 5, collecting results as we go
const results = [];
for (let i = 0; i < toProcess.length; i += 5) {
  const batch = toProcess.slice(i, i + 5);
  results.push(...(await Promise.allSettled(
    batch.map(item => videoClient.hlsEncode(...))
  )));
}
```

Pending Experiments

  • Determine optimal CPU/memory config for production → 8 CPU / 4 Gi
  • Test batch of 20+ parallel with auto-scaling behavior → 429/503 failures, see above
  • Implement sub-batch processing (groups of 5) in backfill endpoint
  • Consider minScale: 1 for production to eliminate cold-start gap

Files Changed

  • apps/admin/app/api/backfill-hls/route.ts — parallel processing, sessionId filter
  • apps/admin/middleware.ts — bypass Clerk for backfill route
  • apps/video-processor/src/index.ts — sanitize videoUrl in logs (strip query params)
  • terraform/modules/video-processor/main.tf — add startup probe
  • terraform/environments/preview.tfvars — lightweight preview config