HLS Migration Findings — Preview Environment
Date: 2026-03-05
Branch: emdash/migration-hls-47j (PR #1259)
Environment: Preview (Cloud Run video-processor-preview, Upstash Redis, Cloudflare R2)
Overview
Backfilling HLS streams for existing video responses that only have raw `.mp4` URLs. The backfill endpoint (`POST /api/backfill-hls` on admin) discovers responses with a `videoUrl` but no `streamUrl`, sends them to the video-processor for HLS encoding, then writes the resulting `streamUrl` back to the session in Redis.
Infrastructure
Endpoint
- Admin API: `apps/admin/app/api/backfill-hls/route.ts`
- Auth: `Bearer <CRON_SECRET>` (Clerk middleware bypassed for this route)
- Max duration: 300s (Vercel function timeout)
- Parameters: `{ flowId?, sessionId?, limit?, startFrom? }`; `limit: 0` = dry run (discovery only, no processing)
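For reference, a sketch of how a caller might invoke the endpoint. The helper name, base URL, and secret below are illustrative, not from the codebase; only the path, auth header, and parameter shape come from these notes:

```typescript
// Illustrative helper (not from the codebase) for calling the backfill
// endpoint. Base URL and secret are placeholders.
interface BackfillParams {
  flowId?: string;
  sessionId?: string;
  limit?: number;     // 0 = dry run: discovery only, no processing
  startFrom?: number;
}

function buildBackfillRequest(baseUrl: string, cronSecret: string, params: BackfillParams) {
  return {
    url: `${baseUrl}/api/backfill-hls`,
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${cronSecret}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(params),
  };
}

// Dry run: discover unprocessed videos without encoding anything
const dryRun = buildBackfillRequest("https://admin.example.com", "secret", { limit: 0 });
```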
Video Processor (Cloud Run)
- Service: `video-processor-preview` in `europe-west1`
- Image: Express.js + FFmpeg, Docker on Cloud Run gen2
- HLS encode: 3 renditions (800k, 1500k, 3000k) of input video → HLS segments + playlists + master manifest
- Output: 7 files per video (3 segments + 3 playlists + 1 master) uploaded to Cloudflare R2
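The exact ffmpeg invocation isn't captured in these notes; a plausible argument builder for a 3-rendition VOD ladder is sketched below. The bitrates match the ladder above, but the rendition resolutions, segment length, and audio mapping are assumptions:

```typescript
// Sketch of a 3-rendition HLS encode (argument builder only; the real
// invocation in the video-processor may differ). Resolutions are assumed.
const renditions = [
  { bitrate: "800k", height: 360 },
  { bitrate: "1500k", height: 720 },
  { bitrate: "3000k", height: 1080 },
];

function buildHlsArgs(input: string, outDir: string): string[] {
  const args = ["-i", input];
  renditions.forEach((r, i) => {
    args.push(
      "-map", "0:v", "-map", "0:a?",          // "?" tolerates missing audio
      `-b:v:${i}`, r.bitrate,
      `-filter:v:${i}`, `scale=-2:${r.height}`,
    );
  });
  args.push(
    "-f", "hls",
    "-hls_time", "6",                          // assumed segment length
    "-hls_playlist_type", "vod",
    "-master_pl_name", "master.m3u8",          // 1 master manifest
    // 3 variant playlists + segments => the 7 files per video noted above
    "-var_stream_map", renditions.map((_, i) => `v:${i},a:${i}`).join(" "),
    `${outDir}/stream_%v.m3u8`,
  );
  return args;
}
```

Short clips (most of the preview dataset) fit in a single 6s segment per rendition, which is consistent with the 7-file output (3 segments + 3 playlists + 1 master).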
Data Flow
Admin endpoint → VideoProcessingClient.hlsEncode() → Cloud Run video-processor
→ Download .mp4 from Vercel Blob
→ ffprobe metadata
→ ffmpeg HLS encode (3 renditions)
→ Upload segments to R2
→ Return manifest URL
Admin endpoint → Write streamUrl back to Redis session
Dataset (Preview)
- Total sessions: 110
- Unprocessed videos: 137 (after initial test processing)
- Max file size: 34 MB
- Avg file size: 4.8 MB
- No videos over 50 MB in preview
Size Distribution
| Bucket | Count |
|---|---|
| <1 MB | 20 |
| 1–5 MB | 76 |
| 5–10 MB | 22 |
| 10–50 MB | 19 |
| 50–100 MB | 0 |
| >100 MB | 0 |
Memory Experiments
All tests with concurrency=1, max instances=20.
| Memory | CPU | Result |
|---|---|---|
| 512 Mi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 1 Gi | 1 | OOM — FFmpeg SIGKILL on all videos |
| 2 Gi | 1 | Success — stable for all tested videos (up to 11.4 MB) |
| 2 Gi | 2 | Success — stable, faster encoding |
Conclusion: 2 Gi is the minimum for HLS encoding with 3 renditions at 1080p.
CPU Benchmark: 1 CPU vs 2 CPU
1 CPU, 2 Gi (revision 00498)
| Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 0.97 MB | 2.7s | 33.8s | 12.5x realtime |
| 2.01 MB | 6.5s | 66.3s | 10.2x realtime |
2 CPU, 2 Gi (revision 00499)
| Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 1.77 MB | 5.5s | 23.5s | 4.3x realtime |
| 11.43 MB | 20.4s | 112.9s | 5.5x realtime |
| 6.92 MB | 6.9s | 27.0s | 3.9x realtime |
| 5.98 MB | 6.1s | 22.3s | 3.6x realtime |
| 2.01 MB | 6.5s | 31.4s | 4.8x realtime |
| 10.22 MB | 9.0s | 28.6s | 3.2x realtime |
Observations
- 2 CPU is ~2.5–3x faster per encode than 1 CPU
- ffprobe latency drops from ~1.8s to ~0.2s on warm instances
- Upload to R2 is negligible: consistently <1s
- Download from Vercel Blob is negligible: <0.5s even for 11 MB
- FFmpeg is 95%+ of total time
- Encoding scales roughly linearly with video duration
Processing Mode
- Original: sequential `for` loop, one video at a time
- Current: `Promise.allSettled`, all videos in the batch fire in parallel
- With concurrency=1 on Cloud Run, parallel requests trigger auto-scaling to multiple instances
- Cloud Run instance spin-up time observed: ~8 seconds (cold start with startup probe)
Estimated Backfill Time (137 remaining)
| Config | Avg encode (~6s video) | Batches (137 videos / 20 instances) | Est. total |
|---|---|---|---|
| 1 CPU, 2 Gi | ~50s | ~7 batches | ~6–7 min |
| 2 CPU, 2 Gi | ~25s | ~7 batches | ~3–4 min |
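The estimates above follow a simple wave model: with concurrency=1 each instance encodes one video at a time, so 137 videos across 20 instances take ceil(137 / 20) = 7 sequential waves. A minimal sketch (it ignores cold starts and per-video size variance):

```typescript
// Wave model for the backfill: with containerConcurrency=1, each Cloud Run
// instance processes one video at a time, so a batch of N videos completes
// in ceil(N / maxInstances) sequential waves of roughly avgEncodeSec each.
// (Ignores the ~8s cold start and per-video size variance.)
function estimateBackfillSeconds(
  videos: number,
  maxInstances: number,
  avgEncodeSec: number,
): number {
  return Math.ceil(videos / maxInstances) * avgEncodeSec;
}

const oneCpu = estimateBackfillSeconds(137, 20, 50); // 350s, ~6 min
const twoCpu = estimateBackfillSeconds(137, 20, 25); // 175s, ~3 min
```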
Issues Encountered
Clerk middleware blocking API route
- `/api/backfill-hls` was not in the `isPublicRoute` list in `apps/admin/middleware.ts`
- Clerk's `auth.protect()` intercepted unauthenticated API calls → 404
- Fix: added `/api/backfill-hls(.*)` to the public routes (the route has its own `Bearer <CRON_SECRET>` auth)
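The allowlist amounts to a route matcher checked before Clerk's protection kicks in. A minimal self-contained sketch of the matching logic; Clerk's actual `clerkMiddleware`/`createRouteMatcher` APIs are only paraphrased in the trailing comment:

```typescript
// Minimal equivalent of createRouteMatcher(["/api/backfill-hls(.*)"]).
// The route then enforces its own Bearer <CRON_SECRET> check internally.
const publicRoutes = [/^\/api\/backfill-hls(.*)$/];

function isPublicRoute(pathname: string): boolean {
  return publicRoutes.some((re) => re.test(pathname));
}

// In apps/admin/middleware.ts the check looks roughly like (paraphrased,
// exact signature depends on the @clerk/nextjs version):
//
//   export default clerkMiddleware(async (auth, req) => {
//     if (!isPublicRoute(new URL(req.url).pathname)) await auth.protect();
//   });
```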
Vercel admin not deploying
- `turbo-ignore` skipped admin builds on rebased branches (unreachable `VERCEL_GIT_PREVIOUS_SHA`)
- Fix: rebasing onto latest `main` gave turbo-ignore a valid comparison base
Sequential processing timeout
- 20 videos sequentially exceeded the 300s Vercel function timeout
- Fix: switched to `Promise.allSettled` for parallel processing
4 CPU Benchmark (revision 00498, 4 CPU / 2 Gi)
| Size | Duration | FFmpeg Time | Ratio |
|---|---|---|---|
| 6.92 MB | 6.9s | 14.2s | 2.1x realtime |
| 5.98 MB | 6.1s | 11.8s | 1.9x realtime |
| 10.22 MB | 9.0s | 16.3s | 1.8x realtime |
4 CPU is ~2x faster than 2 CPU, roughly halving encode time again.
Large Video OOM (4 CPU / 2 Gi)
- Video: 200+ MB, 179s duration (session `session_1772627716500`, flow `vf_nejiqtqdxtjc`)
- Result: OOM (FFmpeg SIGKILL after 12.4s of encoding)
- Log: `[encodeToHls] ffmpeg failed | dur=179.178667s, error=ffmpeg process exited with code null, signal SIGKILL. Possible OOM`
- HLS failure handled as non-fatal; thumbnails, audio extraction, and silence/quality detection still completed
- Conclusion: 2 Gi is NOT sufficient for 200 MB+ / 3-minute videos even with 4 CPU. 4 Gi or 8 Gi is needed for production.
4 Gi Memory Test (200MB+ video)
- Video: 200+ MB, 179s duration (session `session_1772627716500`)
- Config: 4 CPU, 4 Gi (revision 00502)
- Result: Success — HLS encoding completed without OOM
- Conclusion: 4 Gi is sufficient for 200MB+ / 3-minute videos
Memory Requirements Summary
| Memory | Small (<34 MB) | Large (200MB+, 179s) |
|---|---|---|
| 512 Mi | OOM | OOM |
| 1 Gi | OOM | OOM |
| 2 Gi | OK | OOM |
| 4 Gi | OK | OK |
Production Backfill — 2026-03-10
Dataset (Production)
- Total unprocessed videos: 2,019 (at start)
- After the `normalizeResponse` fix: 1,491 (452 had been processed, but `streamUrl` was being stripped; see bug below)
Bug: normalizeResponse dropping streamUrl
The `normalizeResponse` function in `packages/registries/src/server/video-processing-utils.ts` explicitly constructs its return object but did not include `streamUrl`, even though:
- the `QuestionResponse` type declares it: `streamUrl?: string | null` (video-processing-types.ts:111)
- it is correctly saved to Redis by `updateSession`
- it exists in the raw JSON when read from Redis

Impact: the backfill endpoint checks `!response.streamUrl` to find unprocessed videos, but since `normalizeResponse` strips it, every video always appeared unprocessed. The backfill re-encoded the same videos in an infinite loop. HLS playback was also broken everywhere, since no consumer could see `streamUrl`.

Fix: added `streamUrl` extraction to the return object in `normalizeResponse` (same PR).
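A minimal sketch of the bug and the fix. The function bodies are illustrative; only the `videoUrl`/`streamUrl` fields and the behavior come from these notes:

```typescript
// Illustrative shapes only; the real QuestionResponse has more fields.
interface QuestionResponse {
  videoUrl?: string | null;
  streamUrl?: string | null;
}

// Before: the explicitly constructed return object silently dropped
// streamUrl, so the backfill's !response.streamUrl check always matched
// and every video looked unprocessed.
function normalizeResponseBuggy(raw: Record<string, unknown>): QuestionResponse {
  return {
    videoUrl: (raw.videoUrl as string | undefined) ?? null,
    // streamUrl missing here
  };
}

// After: streamUrl is carried through from the raw Redis JSON.
function normalizeResponseFixed(raw: Record<string, unknown>): QuestionResponse {
  return {
    videoUrl: (raw.videoUrl as string | undefined) ?? null,
    streamUrl: (raw.streamUrl as string | undefined) ?? null,
  };
}
```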
Cloud Run Configuration (Production)
| Setting | Value | Effect |
|---|---|---|
| `containerConcurrency` | 1 | Each instance handles exactly one request at a time |
| `minScale` | 0 | No warm instances when idle; everything cold-starts |
| `maxScale` | 20 | Up to 20 instances can exist |
| `cpu-throttling` | true | CPU is throttled when an instance has no active request |
| `cpu` | 8 | 8 vCPU per instance |
| `memory` | 4 Gi | 4 GB RAM per instance |
| `startup-cpu-boost` | true | Full CPU during startup |
| Startup probe | 3 attempts, 5s apart, 3s timeout | ~16s max to pass the health check |
Issue: 429 Rate Limiting & 503 Startup Failures
When sending 20 parallel requests, 4 out of 20 failed:
- 3x 429 Too Many Requests ("Rate exceeded")
- 1x 503 Service Unavailable (readiness check failure)
Root Cause: Thundering Herd During Instance Recycling
With concurrency: 1 and minScale: 0, there's a critical gap between batches where no instances are available.
The instance recycling gap
With `concurrency: 1`, Cloud Run routes each request to a dedicated instance. (Diagrams omitted: request routing with concurrency=1, and the timeline reconstructed from Cloud Run logs.) The gap between batch N finishing and batch N+1 instances being ready is where the 429s happen. With `minScale: 0`, there is no warm instance to absorb requests during this transition.
Verified from logs
Exact timestamps from Cloud Run logs at 07:51:39:
- `07:51:39.580–.620`: 16 requests return 200 (previous batch finishing)
- `07:51:39.589`: instance `c404` fails its readiness check → 503
- `07:51:39.597–.599`: old instances start recycling
- `07:51:39.616–.618`: 3 requests aborted with 429 (no available instance)
- `07:51:39.625–.637`: 14 new instances begin starting up (too late for the 429'd requests)
All of this happens within ~40ms.
Fix: Increased Startup Probe Threshold
Changed `failure_threshold` from 3 to 5 in `terraform/modules/video-processor/main.tf`. This gives the container ~26s instead of ~16s to pass the health check, reducing 503s from startup-probe failures. Applies to all environments (prod, staging, preview).
Recommended Further Fixes
| Option | Fixes 429? | Fixes 503? | Risk | Effort |
|---|---|---|---|---|
| `minScale: 1` | Partially | No | Very low (~$0/month with cpu-throttling) | 1 line in Terraform |
| Sequential sub-batches of 5 in the endpoint | Yes | Reduces | None | Code change |
| `concurrency: 2–3` | Yes | No | FFmpeg memory pressure | 1 line in Terraform |
| Combo: `minScale: 1` + sub-batches of 5 | Yes | Yes | None | Both changes |
The sub-batch approach changes the endpoint from:

```ts
// Current: all 20 fire at once
await Promise.allSettled(toProcess.map(item => videoClient.hlsEncode(...)));
```

to:

```ts
// Recommended: process in sub-batches of 5
for (let i = 0; i < toProcess.length; i += 5) {
  const batch = toProcess.slice(i, i + 5);
  const results = await Promise.allSettled(
    batch.map(item => videoClient.hlsEncode(...))
  );
  // collect per-sub-batch results here
}
```
Pending Experiments
- Determine optimal CPU/memory config for production → 8 CPU / 4 Gi (done)
- Test a batch of 20+ parallel requests with auto-scaling → 429/503 failures, see above (done)
- Implement sub-batch processing (groups of 5) in the backfill endpoint
- Consider `minScale: 1` for production to eliminate the cold-start gap
Files Changed
- `apps/admin/app/api/backfill-hls/route.ts`: parallel processing, sessionId filter
- `apps/admin/middleware.ts`: bypass Clerk for the backfill route
- `apps/video-processor/src/index.ts`: sanitize videoUrl in logs (strip query params)
- `terraform/modules/video-processor/main.tf`: add startup probe
- `terraform/environments/preview.tfvars`: lightweight preview config