
docs/architecture/hls-migration-findings.md

HLS Migration Findings — Preview Environment

Date: 2026-03-05
Branch: emdash/migration-hls-47j (PR #1259)
Environment: Preview (Cloud Run video-processor-preview, Upstash Redis, Cloudflare R2)

Overview

This document records findings from backfilling HLS streams for existing video responses that only have raw .mp4 URLs. The backfill endpoint (POST /api/backfill-hls on admin) discovers responses with a videoUrl but no streamUrl, sends them to the video-processor for HLS encoding, then writes the resulting streamUrl back to the session in Redis.
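
The discovery step can be sketched as a simple filter over session responses. This is a minimal sketch: videoUrl and streamUrl come from the document above, but the surrounding types and data are illustrative, not the real schema.

```typescript
// Minimal sketch of discovery: pick responses that have a raw .mp4 URL
// but no HLS stream URL yet. Only videoUrl/streamUrl are real field
// names; everything else here is illustrative.
interface QuestionResponse {
  videoUrl?: string | null;
  streamUrl?: string | null;
}

function findUnprocessed(responses: QuestionResponse[]): QuestionResponse[] {
  return responses.filter((r) => Boolean(r.videoUrl) && !r.streamUrl);
}

const sample: QuestionResponse[] = [
  { videoUrl: "https://blob.example/a.mp4", streamUrl: null },
  { videoUrl: "https://blob.example/b.mp4", streamUrl: "https://r2.example/b/master.m3u8" },
  { videoUrl: null },
];

console.log(findUnprocessed(sample).length); // 1: only a.mp4 still needs encoding
```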

Infrastructure

Endpoint

  • Admin API: apps/admin/app/api/backfill-hls/route.ts
  • Auth: Bearer <CRON_SECRET> (Clerk middleware bypassed for this route)
  • Max duration: 300s (Vercel function timeout)
  • Parameters: { flowId?, sessionId?, limit?, startFrom? }
  • limit: 0 = dry-run (discovery only, no processing)
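
A dry-run invocation can be sketched as follows. The endpoint path, Bearer auth, and parameter names come from the list above; the host and the request-building helper are illustrative assumptions.

```typescript
// Sketch of calling the backfill endpoint in dry-run mode (limit: 0 =
// discovery only, no processing). Host and secret are placeholders.
const CRON_SECRET = "<CRON_SECRET>"; // real value comes from the environment

function buildBackfillRequest(params: {
  flowId?: string;
  sessionId?: string;
  limit?: number;
  startFrom?: number;
}) {
  return {
    url: "https://admin.example.com/api/backfill-hls", // placeholder host
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${CRON_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(params),
  };
}

const dryRun = buildBackfillRequest({ limit: 0 });
console.log(dryRun.body); // {"limit":0}
```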

Video Processor (Cloud Run)

  • Service: video-processor-preview in europe-west1
  • Image: Express.js + FFmpeg, Docker on Cloud Run gen2
  • HLS encode: 3 renditions (800k, 1500k, 3000k) of input video → HLS segments + playlists + master manifest
  • Output: 7 files per video (3 segments + 3 playlists + 1 master) uploaded to Cloudflare R2
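
The 3-rendition encode can be sketched as an ffmpeg argument builder. The flags below (`-var_stream_map`, `-master_pl_name`, `-hls_time`, per-stream `-b:v:N`) are standard ffmpeg HLS muxer options, but the exact flags, codecs, and segment length the service uses are assumptions.

```typescript
// Sketch of the ffmpeg argument list for a 3-rendition HLS encode.
// Bitrates match the document (800k/1500k/3000k); rendition heights,
// segment length, and output layout are illustrative assumptions.
const RENDITIONS = [
  { bitrate: "800k", height: 480 },
  { bitrate: "1500k", height: 720 },
  { bitrate: "3000k", height: 1080 },
];

function buildHlsArgs(input: string, outDir: string): string[] {
  const args = ["-i", input];
  RENDITIONS.forEach((r, i) => {
    args.push("-map", "0:v", "-map", "0:a?"); // audio optional
    args.push(`-b:v:${i}`, r.bitrate, `-filter:v:${i}`, `scale=-2:${r.height}`);
  });
  args.push(
    "-f", "hls",
    "-hls_time", "6",
    "-hls_playlist_type", "vod",
    "-master_pl_name", "master.m3u8",     // the 1 master manifest
    "-var_stream_map", "v:0,a:0 v:1,a:1 v:2,a:2",
    `${outDir}/stream_%v.m3u8`,           // the 3 per-rendition playlists
  );
  return args;
}

console.log(buildHlsArgs("input.mp4", "out").includes("3000k")); // true
```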

Data Flow

Admin endpoint → VideoProcessingClient.hlsEncode() → Cloud Run video-processor
  → Download .mp4 from Vercel Blob
  → ffprobe metadata
  → ffmpeg HLS encode (3 renditions)
  → Upload segments to R2
  → Return manifest URL
Admin endpoint → Write streamUrl back to Redis session

Dataset (Preview)

  • Total sessions: 110
  • Unprocessed videos: 137 (after initial test processing)
  • Max file size: 34 MB
  • Avg file size: 4.8 MB
  • No videos over 50 MB in preview

Size Distribution

| Bucket | Count |
|---|---|
| <1 MB | 20 |
| 1–5 MB | 76 |
| 5–10 MB | 22 |
| 10–50 MB | 19 |
| 50–100 MB | 0 |
| >100 MB | 0 |

Memory Experiments

All tests with concurrency=1, max instances=20.

| Memory | CPU | Result |
|---|---|---|
| 512 Mi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 1 Gi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 2 Gi | 1 | Success — stable for all tested videos (up to 11.4 MB) |
| 2 Gi | 2 | Success — stable, faster encoding |

Conclusion: 2 Gi is the minimum for HLS encoding with 3 renditions at 1080p.

CPU Benchmark: 1 CPU vs 2 CPU

1 CPU, 2 Gi (revision 00498)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 0.97 MB | 2.7s | 33.8s | 12.5x realtime |
| 2.01 MB | 6.5s | 66.3s | 10.2x realtime |

2 CPU, 2 Gi (revision 00499)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 1.77 MB | 5.5s | 23.5s | 4.3x realtime |
| 11.43 MB | 20.4s | 112.9s | 5.5x realtime |
| 6.92 MB | 6.9s | 27.0s | 3.9x realtime |
| 5.98 MB | 6.1s | 22.3s | 3.6x realtime |
| 2.01 MB | 6.5s | 31.4s | 4.8x realtime |
| 10.22 MB | 9.0s | 28.6s | 3.2x realtime |

Observations

  • 2 CPU is ~2.5–3x faster per encode than 1 CPU
  • ffprobe latency drops from ~1.8s to ~0.2s on warm instances
  • Upload to R2 is negligible: consistently <1s
  • Download from Vercel Blob is negligible: <0.5s even for 11 MB
  • FFmpeg is 95%+ of total time
  • Encoding scales roughly linearly with video duration
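
The "roughly linear" observation suggests a back-of-envelope estimator: encode time ≈ realtime ratio × video duration, with ratios taken from the benchmark tables above. A sketch, not a fitted model:

```typescript
// Back-of-envelope encode-time estimate. The benchmarks above show
// FFmpeg time scaling roughly linearly with video duration: about
// 4x realtime on 2 CPU, and 10-12x realtime on 1 CPU.
function estimateEncodeSeconds(videoDurationSec: number, realtimeRatio: number): number {
  return videoDurationSec * realtimeRatio;
}

// A 6s video at the ~4x ratio typical of 2 CPU:
console.log(estimateEncodeSeconds(6, 4)); // 24
// The 20.4s video at its measured 5.5x ratio — close to the 112.9s in the table:
console.log(Math.round(estimateEncodeSeconds(20.4, 5.5))); // 112
```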

Processing Mode

  • Original: Sequential for loop — one video at a time
  • Current: Promise.allSettled — all videos in the batch fire in parallel
  • With concurrency=1 on Cloud Run, parallel requests trigger auto-scaling to multiple instances
  • Cloud Run instance spin-up time observed: ~8 seconds (cold start with startup probe)
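
The parallel mode can be sketched as follows. Promise.allSettled is the mechanism named above; hlsEncode here is a stub standing in for the real VideoProcessingClient call, so per-video failures are counted rather than aborting the batch.

```typescript
// Sketch of the parallel processing mode: every video in the batch
// fires at once via Promise.allSettled; with concurrency=1 on Cloud
// Run, each in-flight request lands on its own (auto-scaled) instance.
// hlsEncode is a stub, not the real client.
async function hlsEncode(videoUrl: string): Promise<string> {
  if (!videoUrl.endsWith(".mp4")) throw new Error(`not an mp4: ${videoUrl}`);
  return videoUrl.replace(/\.mp4$/, "/master.m3u8");
}

async function processBatch(urls: string[]) {
  const results = await Promise.allSettled(urls.map(hlsEncode));
  const ok = results.filter((r) => r.status === "fulfilled").length;
  return { ok, failed: results.length - ok };
}

processBatch(["a.mp4", "b.mp4", "broken.webm"]).then((summary) =>
  console.log(summary) // { ok: 2, failed: 1 }
);
```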

Estimated Backfill Time (137 remaining)

| Config | Avg encode (6s video) | With max 20 instances | Est. total |
|---|---|---|---|
| 1 CPU, 2 Gi | ~50s | ~7 batches | ~6–7 min |
| 2 CPU, 2 Gi | ~25s | ~7 batches | ~3–4 min |
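
The estimate is simple batch arithmetic (a sketch; the extra minute or so in the table covers cold starts and per-request overhead):

```typescript
// Batch arithmetic behind the estimate: with concurrency=1 and
// maxScale=20, the 137 videos split into ceil(137 / 20) = 7 waves,
// each taking roughly one average encode time.
function estimateBackfillSeconds(
  videos: number,
  maxInstances: number,
  avgEncodeSec: number,
): number {
  const batches = Math.ceil(videos / maxInstances);
  return batches * avgEncodeSec;
}

console.log(estimateBackfillSeconds(137, 20, 50)); // 350 (~6 min) at 1 CPU
console.log(estimateBackfillSeconds(137, 20, 25)); // 175 (~3 min) at 2 CPU
```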

Issues Encountered

Clerk middleware blocking API route

  • /api/backfill-hls was not in the isPublicRoute list in apps/admin/middleware.ts
  • Clerk's auth.protect() intercepted unauthenticated API calls → 404
  • Fix: Added /api/backfill-hls(.*) to public routes (route has its own Bearer <CRON_SECRET> auth)
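
The middleware change is roughly the following. This is a sketch of the Clerk clerkMiddleware/createRouteMatcher pattern; the actual matcher list in apps/admin/middleware.ts will contain more routes than shown here.

```typescript
// apps/admin/middleware.ts (sketch) — exclude the backfill route from
// Clerk protection; the route authenticates itself via Bearer <CRON_SECRET>.
import { clerkMiddleware, createRouteMatcher } from "@clerk/nextjs/server";

const isPublicRoute = createRouteMatcher([
  "/api/backfill-hls(.*)",
  // ...other public routes
]);

export default clerkMiddleware((auth, req) => {
  if (!isPublicRoute(req)) auth.protect();
});
```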

Vercel admin not deploying

  • turbo-ignore skipped admin builds on rebased branches (unreachable VERCEL_GIT_PREVIOUS_SHA)
  • Fix: Rebase onto latest main gave turbo-ignore a valid comparison base

Sequential processing timeout

  • 20 videos sequentially exceeded the 300s Vercel function timeout
  • Fix: Switched to Promise.allSettled for parallel processing

4 CPU Benchmark (revision 00498, 4 CPU / 2 Gi)

| Video Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 6.92 MB | 6.9s | 14.2s | 2.1x realtime |
| 5.98 MB | 6.1s | 11.8s | 1.9x realtime |
| 10.22 MB | 9.0s | 16.3s | 1.8x realtime |

4 CPU is ~2x faster than 2 CPU, roughly halving encode time again.

Large Video OOM (4 CPU / 2 Gi)

  • Video: 200+ MB, 179s duration (session session_1772627716500, flow vf_nejiqtqdxtjc)
  • Result: OOM — FFmpeg SIGKILL after 12.4s of encoding
  • Log: [encodeToHls] ffmpeg failed | dur=179.178667s, error=ffmpeg process exited with code null, signal SIGKILL. Possible OOM
  • HLS failure handled as non-fatal; thumbnails, audio extraction, silence/quality detection still completed
  • Conclusion: 2 Gi is NOT sufficient for 200MB+ / 3-minute videos even with 4 CPU. Need 4 Gi or 8 Gi for production.

4 Gi Memory Test (200MB+ video)

  • Video: 200+ MB, 179s duration (session session_1772627716500)
  • Config: 4 CPU, 4 Gi (revision 00502)
  • Result: Success — HLS encoding completed without OOM
  • Conclusion: 4 Gi is sufficient for 200MB+ / 3-minute videos

Memory Requirements Summary

| Memory | Small (<34 MB) | Large (200MB+, 179s) |
|---|---|---|
| 512 Mi | OOM | OOM |
| 1 Gi | OOM | OOM |
| 2 Gi | OK | OOM |
| 4 Gi | OK | OK |

Production Backfill — 2026-03-10

Dataset (Production)

  • Total unprocessed videos: 2,019 (at start)
  • After normalizeResponse fix: 1,491 (452 had been processed but streamUrl was being stripped — see bug below)

Bug: normalizeResponse dropping streamUrl

The normalizeResponse function in packages/registries/src/server/video-processing-utils.ts explicitly constructs its return object but did not include streamUrl, even though:

  1. QuestionResponse type declares it: streamUrl?: string | null (video-processing-types.ts:111)
  2. It's correctly saved to Redis by updateSession
  3. It exists in the raw JSON when read from Redis

Impact: The backfill endpoint checks !response.streamUrl to find unprocessed videos — but since normalizeResponse strips it, every video always appears unprocessed. The backfill re-encoded the same videos in an infinite loop. HLS playback was also broken everywhere since no consumer could see streamUrl.

Fix: Added streamUrl extraction to the return object in normalizeResponse (same PR).
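
The bug pattern, reduced to a sketch (streamUrl is the real field name; the rest of the shape is illustrative):

```typescript
// Reduced reproduction of the normalizeResponse bug: an explicitly
// constructed return object silently drops any field it doesn't list,
// even if the type declares it and Redis contains it.
interface QuestionResponse {
  videoUrl?: string | null;
  streamUrl?: string | null;
}

// Before: streamUrl never copied, so every video looked unprocessed.
function normalizeResponseBefore(raw: QuestionResponse): QuestionResponse {
  return { videoUrl: raw.videoUrl ?? null };
}

// After: streamUrl included in the return object.
function normalizeResponseAfter(raw: QuestionResponse): QuestionResponse {
  return { videoUrl: raw.videoUrl ?? null, streamUrl: raw.streamUrl ?? null };
}

const raw = { videoUrl: "a.mp4", streamUrl: "a/master.m3u8" };
console.log(!normalizeResponseBefore(raw).streamUrl); // true — appears unprocessed
console.log(!normalizeResponseAfter(raw).streamUrl);  // false — correctly skipped
```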

Cloud Run Configuration (Production)

| Setting | Value | Effect |
|---|---|---|
| containerConcurrency | 1 | Each instance handles exactly 1 request at a time |
| minScale | 0 | No warm instances when idle — everything cold-starts |
| maxScale | 20 | Up to 20 instances can exist |
| cpu-throttling | true | CPU is reduced when instance has no active request |
| cpu | 8 | 8 vCPU per instance |
| memory | 4 Gi | 4 GB RAM per instance |
| startup-cpu-boost | true | Full CPU during startup |
| Startup probe | 3 attempts, 5s apart, 3s timeout | ~16s max to pass health check |

Issue: 429 Rate Limiting & 503 Startup Failures

When sending 20 parallel requests, 4 out of 20 failed:

  • 3x 429 Too Many Requests ("Rate exceeded")
  • 1x 503 Service Unavailable (readiness check failure)

Root Cause: Thundering Herd During Instance Recycling

With concurrency: 1 and minScale: 0, there's a critical gap between batches where no instances are available.

(Diagrams in the original, omitted here: how Cloud Run routes requests with concurrency=1, the timeline reconstructed from Cloud Run logs, and the instance recycling gap.)

The gap between batch N finishing and batch N+1 instances becoming ready is where the 429s happen. With minScale: 0, there's no warm instance to absorb requests during this transition.

Verified from logs

Exact timestamps from Cloud Run logs at 07:51:39:

  • 07:51:39.580–.620: 16 requests return 200 (previous batch finishing)
  • 07:51:39.589: Instance c404 fails readiness check → 503
  • 07:51:39.597–.599: Old instances start recycling
  • 07:51:39.616–.618: 3 requests aborted — 429 (no available instance)
  • 07:51:39.625–.637: 14 new instances begin starting up (too late for the 429'd requests)

All of this happens within ~40ms.

Fix: Increased Startup Probe Threshold

Changed failure_threshold from 3 to 5 in terraform/modules/video-processor/main.tf. This gives the container ~26s instead of ~16s to pass the health check, reducing 503s from startup probe failures. Applies to all environments (prod, staging, preview).
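
In google_cloud_run_v2_service terms, the probe block looks roughly like this. A sketch only: the real resource layout in terraform/modules/video-processor/main.tf may differ, and the /health path is an assumption.

```hcl
startup_probe {
  period_seconds    = 5
  timeout_seconds   = 3
  failure_threshold = 5  # was 3 — gives ~26s instead of ~16s to pass

  http_get {
    path = "/health"     # assumed health endpoint
  }
}
```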

Recommended Further Fixes

| Option | Fixes 429? | Fixes 503? | Risk | Effort |
|---|---|---|---|---|
| minScale: 1 | Partially | No | Very low (~$0/month with cpu-throttling) | 1 line in Terraform |
| Sequential sub-batches of 5 in endpoint | Yes | Reduces | None | Code change |
| concurrency: 2–3 | Yes | No | FFmpeg memory pressure | 1 line in Terraform |
| Combo: minScale=1 + sub-batches of 5 | Yes | Yes | None | Both changes |

The sub-batch approach changes the endpoint from:

```ts
// Current: all 20 fire at once
await Promise.allSettled(toProcess.map(item => videoClient.hlsEncode(...)));

// Recommended: process in sub-batches of 5, collecting results as we go
const results = [];
for (let i = 0; i < toProcess.length; i += 5) {
  const batch = toProcess.slice(i, i + 5);
  results.push(...(await Promise.allSettled(
    batch.map(item => videoClient.hlsEncode(...))
  )));
}
```

Pending Experiments

  • Determine optimal CPU/memory config for production → 8 CPU / 4 Gi
  • Test batch of 20+ parallel with auto-scaling behavior → 429/503 failures, see above
  • Implement sub-batch processing (groups of 5) in backfill endpoint
  • Consider minScale: 1 for production to eliminate cold-start gap

Files Changed

  • apps/admin/app/api/backfill-hls/route.ts — parallel processing, sessionId filter
  • apps/admin/middleware.ts — bypass Clerk for backfill route
  • apps/video-processor/src/index.ts — sanitize videoUrl in logs (strip query params)
  • terraform/modules/video-processor/main.tf — add startup probe
  • terraform/environments/preview.tfvars — lightweight preview config