Make Failures Boring: Reliability at Trillion Scale

Part 2 of A Trillion-Token Operator’s Playbook

Uptime isn’t a dashboard. It’s a set of choices made before anything breaks.

R10 runs trillion-scale workloads on Vertex AI. Reliability comes from architecture and rituals, not heroics. Here’s how we keep the experience steady when the load is anything but steady.

Design for the tail, not the average

Hold two SLOs: P95 for how it feels, P999 for what can hurt you. Budget latency across every hop — retrieval, model call, post-filters — and enforce a global deadline that triggers degradation instead of denial. Never chain unbounded calls; cap fan-out and keep concurrency honest.
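
To make this concrete, here’s a minimal sketch of per-stage budgeting under a global deadline; the stage names, numbers, and `degrade` fallback are illustrative assumptions, not R10’s production values.

```python
import time

# Illustrative per-stage budgets (seconds); assumptions, not R10's real numbers.
STAGE_BUDGETS = {"retrieval": 0.15, "model_call": 1.20, "post_filters": 0.10}
GLOBAL_DEADLINE_S = 1.5  # hard cap across every hop

def run_pipeline(stages, deadline_s=GLOBAL_DEADLINE_S):
    """Run (name, fn) stages under one global deadline; degrade, don't deny."""
    start = time.monotonic()
    result = None
    for name, fn in stages:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            return degrade(result)  # deadline hit: step down, not fall over
        # Each stage gets min(its own budget, time left) -- no unbounded calls.
        result = fn(result, timeout_s=min(STAGE_BUDGETS.get(name, remaining), remaining))
    return result

def degrade(partial):
    # Stand-in for the degradation ladder described below.
    return partial if partial is not None else "fast-path fallback"
```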

The degradation ladder

When the clock or risk threshold is hit, step down — don’t fall over.

  1. Fast path: efficient model, cached features
  2. Enhanced path (on signals): add retrieval and structured tools
  3. Premium path (on ambiguity/risk): larger model with stricter safety
  4. Human review: queued with context and a decision checklist

Each rung is idempotent, resumable, and promoted only by explicit signals (uncertainty, impact, safety).
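
A minimal sketch of the routing logic behind the ladder; the signal names and thresholds are assumptions for illustration, not R10’s actual tuning.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    uncertainty: float  # e.g. model confidence gap
    impact: float       # e.g. blast radius of the requested action
    unsafe: bool        # tripped a safety screen

# Rungs in escalation order; each handler must be idempotent and resumable.
def fast_path(req):      return {"route": "fast", "answer": f"cached:{req}"}
def enhanced_path(req):  return {"route": "enhanced", "answer": f"rag:{req}"}
def premium_path(req):   return {"route": "premium", "answer": f"large:{req}"}
def human_review(req):   return {"route": "human", "queued": req}

def route(req, sig: Signals):
    """Promote up the ladder only on explicit signals; the default stays cheap."""
    if sig.unsafe:
        return human_review(req)           # rung 4: queued with context
    if sig.uncertainty > 0.7 or sig.impact > 0.8:
        return premium_path(req)           # rung 3: larger model, stricter safety
    if sig.uncertainty > 0.3:
        return enhanced_path(req)          # rung 2: add retrieval and tools
    return fast_path(req)                  # rung 1: efficient model, cached features
```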

Deadlines that propagate

Inner timeouts sit just below outer deadlines and carry “time-left” downstream. Retries are jittered and capped; only transient errors get another chance. Interactive flows keep strict budgets; bulk jobs checkpoint and breathe.

Two rules we live by:

  • If you didn’t set a timeout, you set infinity.
  • No retry without idempotency.
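
Both rules, in one minimal sketch; the three-attempt cap and 0.9 headroom factor are assumptions, not R10 constants.

```python
import random
import time

class Transient(Exception):
    """Only errors of this kind get another chance."""

def call_with_budget(fn, deadline_s, max_attempts=3):
    """Capped, jittered retries under a shrinking time-left budget."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        time_left = deadline_s - (time.monotonic() - start)
        if time_left <= 0:
            raise TimeoutError("budget exhausted")  # a timeout was set, not infinity
        try:
            # The inner call's timeout sits just below our remaining budget.
            return fn(timeout_s=time_left * 0.9)
        except Transient:
            if attempt == max_attempts:
                raise
            # Jittered backoff that never sleeps past the deadline.
            time.sleep(min(random.uniform(0, 0.1 * 2 ** attempt), time_left))
```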

Idempotency and backpressure

Give every write-like action an idempotency key. Use token-bucket limiters per route and tenant. Watch queue depth like an SLO and shed load before it becomes a pile-up. The goal is graceful degradation, not cascading failure.
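
A compact sketch of both mechanisms; the in-memory dict stands in for whatever shared store a real deployment would use, and the rate constants are illustrative.

```python
import time

_seen: dict[str, object] = {}  # stand-in for a shared idempotency store

def write_once(key: str, do_write):
    """Idempotent write: replay the stored result instead of re-executing."""
    if key in _seen:
        return _seen[key]
    result = do_write()
    _seen[key] = result
    return result

class TokenBucket:
    """Per-route, per-tenant limiter; shed load before queues pile up."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, burst
        self.tokens, self.updated = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller degrades or rejects rather than queueing forever
```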

Caching as stability

Caching isn’t a tweak — it’s structure.

  • Response caching for deterministic transforms and repeat asks
  • Embedding caching with TTLs that match content freshness
  • Feature caching for expensive signals you don’t want live on the hot path

We review cache hit rates like SLAs. When hits rise, tails shrink.
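
A minimal sketch of the three tiers; the TTLs below are illustrative, and the point is that each tier’s TTL tracks its content’s freshness.

```python
import time

class TTLCache:
    """Tiny TTL cache; embeddings get long TTLs, responses short ones."""
    def __init__(self, ttl_s: float):
        self.ttl_s, self._store = ttl_s, {}

    def get(self, key):
        hit = self._store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0]
        return None  # miss or expired

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

# Illustrative TTLs; match them to content freshness, not convenience.
response_cache = TTLCache(ttl_s=60)          # repeat asks, deterministic transforms
embedding_cache = TTLCache(ttl_s=24 * 3600)  # re-embed only when content changes
feature_cache = TTLCache(ttl_s=300)          # expensive signals off the hot path
```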

Isolation that actually isolates

Per-tenant circuit breakers contain abuse and runaway workflows. Separate traffic classes — interactive, batch, experimental — with their own queues and autoscaling. Roll out new prompts, models, and policies as canaries with automatic rollback on SLO violation. One tenant’s spike shouldn’t dim the rest.
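
A sketch of a per-tenant breaker; the failure threshold and cooldown are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Trip on consecutive failures, let one probe through after a cooldown."""
    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: permit a single probe
            return True
        return False

    def record(self, ok: bool):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

breakers: dict[str, CircuitBreaker] = {}  # one breaker per tenant

def guarded_call(tenant: str, fn):
    br = breakers.setdefault(tenant, CircuitBreaker())
    if not br.allow():
        raise RuntimeError("circuit open for tenant")  # fail fast, spare neighbors
    try:
        out = fn()
        br.record(ok=True)
        return out
    except Exception:
        br.record(ok=False)
        raise
```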

Safety gates are reliability gates

Pre-filters (PII/PHI detection, jailbreak screens) reduce surprises upstream. Post-filters (policy checks, type validators) catch drift before users do. Log rationales on sensitive actions and escalate to humans where it matters. When safety is code, incidents become contained events — not headlines.
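
A toy sketch of where the gates sit; the single regex stands in for real PII/jailbreak detection, which is far richer than this.

```python
import re

# Illustrative screen only: matches a US-SSN-shaped string.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def pre_filter(prompt: str):
    if PII_PATTERN.search(prompt):
        return False, "pii_detected"
    return True, None

def post_filter(output: dict):
    # Type validation: schema drift gets caught before users see it.
    if not isinstance(output.get("answer"), str):
        return False, "schema_violation"
    return True, None

def guarded_generate(prompt: str, generate):
    ok, reason = pre_filter(prompt)
    if not ok:
        return {"blocked": True, "rationale": reason}  # log rationale, escalate if needed
    output = generate(prompt)
    ok, reason = post_filter(output)
    if not ok:
        return {"blocked": True, "rationale": reason}
    return output
```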

Observability tied to outcomes

For every call, log the route, prompt version, model/version, region, tokens in/out, cache hits, and a latency breakdown. Label the outcome — solved, policy block, escalation — and attach evals for accuracy, safety, and satisfaction.
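
A sketch of the per-call record; the field names and identifiers are illustrative, and `print` stands in for a real log sink.

```python
import json
import time

def log_call(**fields):
    """One structured record per call; outcome labels make evals queryable."""
    record = {"ts": time.time(), **fields}
    print(json.dumps(record))  # stand-in for a real log sink

log_call(
    route="enhanced",
    prompt_version="v42",            # illustrative identifiers
    model="example-model@003",
    region="us-central1",
    tokens_in=512, tokens_out=128,
    cache_hit=True,
    latency_ms={"retrieval": 38, "model": 610, "post": 12},
    outcome="solved",                # or "policy_block", "escalation"
)
```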

Rituals that keep us honest:

  • Weekly Top 10 slow/costly/brittle scenarios — refactor, don’t admire
  • Tail report on P999 regressions with owners and fixes
  • Change review: what shipped and what it did to SLOs

Change without chaos

Everything ships behind flags — prompts, tools, policies. Shadow and replay new versions on mirrored traffic before exposure. Keep schema contracts between retrieval → model → post-processors so “helpful” changes don’t break reality. Moving fast is safe when changes are observable, reversible, and scoped.
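
A sketch of flag-gated canary routing with stable hash bucketing; the flag table and percentages are illustrative, and in practice the table lives in a config service.

```python
import hashlib

# Illustrative flag table; rollback is a config change, not a deploy.
FLAGS = {
    "prompt_v43": {"enabled": True, "canary_pct": 5},  # 5% canary exposure
}

def in_canary(flag: str, request_id: str) -> bool:
    """Stable bucketing: the same request id always lands in the same arm."""
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < cfg["canary_pct"]

# On SLO violation, set enabled=False (or canary_pct=0) and the canary is gone.
prompt_version = "v43" if in_canary("prompt_v43", "req-123") else "v42"
```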

Drill before disaster

We practice outages: kill a region, degrade a model, poison a cache. Runbooks are one page, copy-pastable. Comms are templated and rehearsed. Practice shortens incident half-life more than any tool.
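
A drill can start as small as a fault-injecting wrapper around one dependency; the rates below are illustrative.

```python
import random
import time

def with_chaos(fn, error_rate=0.2, max_extra_latency_s=0.5):
    """Wrap a dependency for a game day: inject failures and latency on purpose."""
    def wrapped(*args, **kwargs):
        if random.random() < error_rate:
            raise ConnectionError("injected fault: dependency down")
        time.sleep(random.uniform(0, max_extra_latency_s))  # injected slowness
        return fn(*args, **kwargs)
    return wrapped
```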

R10 x Vertex AI: practical notes

  • Keep inference and data services co-located; fail over across regions behind the same control plane.
  • Treat quotas as product constraints — soft limits per route and scheduled batch windows.
  • Pin model versions; “latest” is not a strategy.
  • Stream when users benefit; batch when throughput rules.
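
For illustration, a minimal sketch of version pinning and streaming, assuming the Vertex AI Python SDK (google-cloud-aiplatform); the project, region, and model id below are placeholders, and the model versions available to you will differ.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project", location="us-central1")  # placeholders

# Pin an explicit version suffix; "latest" aliases shift under your SLOs.
model = GenerativeModel("gemini-1.5-pro-002")

# Stream when users benefit from early tokens...
for chunk in model.generate_content("hello", stream=True):
    print(chunk.text, end="")

# ...plain calls in scheduled batch windows when throughput rules.
response = model.generate_content("hello")
```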

Ship with this: the R10 checklist

☐ P95 and P999 SLOs with a per-stage latency budget
☐ Documented, tested degradation ladder
☐ Timeouts, capped jittered retries, backoff in code
☐ Idempotency keys + dead-letter queues
☐ Per-tenant rate limits and circuit breakers
☐ Response/embedding/feature caching with target hit rates
☐ Safety gates (pre/post) and rationale logging
☐ Route metrics, evals, and outcome labels wired
☐ Flags, canaries, rollback plan for prompts/models/policies
☐ Runbook + scheduled game day

Reliability at this scale is a posture, not a promise. R10 assumes things will fail — and chooses how.