SRE Principles: From Error Budgets to Everyday Reliability

How to define, measure, and improve reliability with concrete metrics, playbooks, and examples you can apply this week.

In a world where users expect instant, uninterrupted access, reliability is a feature. Site Reliability Engineering (SRE) brings engineering discipline to operations with a toolkit built on error budgets, SLIs/SLOs, and automation. This post turns those ideas into specifics: exact metrics, alert rules, dashboards, code and infra changes, and a lightweight maturity model you can use to track progress.


1) What Is SRE Culture?

1.1 Error Budgets: A Contract Between Speed and Stability

An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.

  • Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
  • Translation: Over 30 days (43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
  • Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until the burn rate normalizes (the arithmetic is sketched below).
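
As a rough illustration of the bullets above (plain Java; the observed downtime is a hypothetical input, not data from a real service), the budget arithmetic looks like this:

import java.time.Duration;

// Minimal sketch: turn an SLO target and window into an error budget and apply
// the freeze policy described above. The observed downtime is a made-up number.
public class ErrorBudget {
    public static void main(String[] args) {
        double slo = 0.999;                       // 99.9% availability target
        Duration window = Duration.ofDays(30);    // 30-day SLO window

        double budgetFraction = 1.0 - slo;                           // 0.1% of the window
        double budgetMinutes = window.toMinutes() * budgetFraction;  // 43.2 minutes

        double downtimeSoFarMinutes = 30.0;       // hypothetical downtime observed so far
        double spentFraction = downtimeSoFarMinutes / budgetMinutes; // ~69%

        System.out.printf("Error budget: %.1f minutes per %d days%n", budgetMinutes, window.toDays());
        System.out.printf("Budget spent: %.0f%%%n", spentFraction * 100);

        if (spentFraction > 0.60) {               // policy threshold from the bullet above
            System.out.println("Freeze risky deployments; ship reliability fixes only.");
        }
    }
}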

1.2 SLIs & SLOs: A Common Language

SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.

Domain | SLI (what we measure) | Example SLO (target) | Notes
Availability | % successful requests (non-5xx + within timeout) | 99.9% over 30 days | Define failure modes clearly (timeouts, 5xx, dependency errors).
Latency | p95 end-to-end latency (ms) | ≤ 300 ms (p95), ≤ 800 ms (p99) | Track server time and total time (incl. downstream calls).
Error Rate | Failed / total requests | < 0.1% (rolling 30 days) | Include client-cancel/timeouts if user-impacting.
Durability | Data loss incidents | 0 incidents / year | Backups + restore drills must be part of policy.

1.3 Automation Over Manual Ops

  • Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach.
  • Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
  • Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.

2) How Do You Measure Reliability?

2.1 Availability (“The Nines”)

SLO | Max Downtime / Year | Max Downtime / 30 Days
99.0% | ~3d 15h | ~7h 12m
99.9% | ~8h 46m | ~43m
99.99% | ~52m 34s | ~4m 19s
99.999% | ~5m 15s | ~26s

2.2 Latency (Percentiles, Not Averages)

Track p50/p90/p95/p99; averages hide tail pain. Tie your alerting to user-impacting percentiles (a small percentile sketch follows the examples below).

  • API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
  • Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.
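
To see why percentiles matter, here is a minimal sketch (plain Java, nearest-rank percentiles over an in-memory sample; production services should derive these from histogram metrics instead):

import java.util.Arrays;

// Minimal sketch: nearest-rank percentiles over a latency sample (ms).
public class LatencyPercentiles {
    static double percentile(double[] sortedMs, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMs.length);   // nearest-rank method
        return sortedMs[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Hypothetical sample: mostly fast requests plus a slow tail.
        double[] latenciesMs = {80, 90, 95, 100, 110, 120, 130, 150, 900, 1200};
        Arrays.sort(latenciesMs);

        double avg = Arrays.stream(latenciesMs).average().orElse(0);
        System.out.printf("avg=%.0f ms  p50=%.0f ms  p95=%.0f ms  p99=%.0f ms%n",
                avg, percentile(latenciesMs, 50), percentile(latenciesMs, 95),
                percentile(latenciesMs, 99));
        // The average (~298 ms) looks acceptable; p95/p99 expose the 900+ ms tail.
    }
}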

2.3 Error Rate

Define “failed” precisely: HTTP 5xx, timeouts, and domain-level errors. For example, “payment declined” may be a success from the platform’s perspective but a failure for a specific business flow; track both views separately, as in the sketch below.
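
A minimal counting sketch for that distinction (plain Java with hypothetical names; a real service would publish these through its metrics library):

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch: count platform failures (HTTP 5xx) and business failures
// (e.g., payment declined) separately so each can feed its own SLI.
public class ErrorCounters {
    static final AtomicLong totalRequests = new AtomicLong();
    static final AtomicLong platformFailures = new AtomicLong();   // HTTP 5xx
    static final AtomicLong businessFailures = new AtomicLong();   // e.g., payment declined

    static void record(int httpStatus, boolean paymentDeclined) {
        totalRequests.incrementAndGet();
        if (httpStatus >= 500) {
            platformFailures.incrementAndGet();        // feeds the platform error-rate SLI
        } else if (paymentDeclined) {
            businessFailures.incrementAndGet();        // feeds the business-flow error-rate SLI
        }
    }

    static double platformErrorRate() {
        long total = totalRequests.get();
        return total == 0 ? 0.0 : (double) platformFailures.get() / total;
    }
}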

2.4 Example SLI Formulas

# Availability SLI
availability = successful_requests / total_requests

# Latency SLI
latency_p95 = percentile(latency_ms, 95)

# Error Rate SLI
error_rate = failed_requests / total_requests

2.5 SLO-Aware Alerting (Burn-Rate Alerts)

Alert on error budget burn rate, not just raw thresholds.

  • Fast burn: 2% of the budget in 1 hour → page immediately (at that rate the monthly budget is gone in roughly two days).
  • Slow burn: 10% of the budget in 24 hours → open a ticket and investigate within business hours.

3) How Do You Improve Reliability?

3.1 Code Fixes (Targeted, Measurable)

  • Database hot paths: Add missing indexes, rewrite N+1 queries, reduce chatty patterns; measure the p95 improvement before/after (a batched-query sketch follows this list).
  • Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
  • Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.
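
As an example of the database hot-path item, here is a sketch of replacing an N+1 pattern with one batched query (plain JDBC; the orders table and column names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: fetch all orders for a batch of customers in one round trip
// instead of one query per customer (the classic N+1 pattern).
public class BatchedOrderLookup {
    static Map<Long, List<Long>> orderIdsByCustomer(Connection conn, List<Long> customerIds)
            throws SQLException {
        // Before (N+1): for each id -> "SELECT id FROM orders WHERE customer_id = ?"
        String placeholders = String.join(",", Collections.nCopies(customerIds.size(), "?"));
        String sql = "SELECT customer_id, id FROM orders WHERE customer_id IN (" + placeholders + ")";
        Map<Long, List<Long>> result = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < customerIds.size(); i++) {
                ps.setLong(i + 1, customerIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.computeIfAbsent(rs.getLong("customer_id"), k -> new ArrayList<>())
                          .add(rs.getLong("id"));
                }
            }
        }
        return result;
    }
}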

3.2 Infrastructure Changes

  • Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency.
  • Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
  • Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail (see the fallback sketch after this list).
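
A minimal graceful-degradation sketch (the recommendation client and the 300 ms timeout are hypothetical) that serves the last known result when the dependency is slow or down:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Minimal sketch: call the dependency with a hard timeout and fall back to a cached
// (possibly stale) result instead of failing the whole request.
public class RecommendationsWithFallback {
    interface RecommendationClient { List<String> fetch(String userId); }   // hypothetical dependency

    private final ExecutorService executor = Executors.newCachedThreadPool();
    private volatile List<String> lastKnownGood = List.of();                // refreshed on success

    List<String> recommendationsFor(String userId, RecommendationClient client) {
        Future<List<String>> call = executor.submit(() -> client.fetch(userId));
        try {
            List<String> fresh = call.get(300, TimeUnit.MILLISECONDS);      // per-dependency timeout
            lastKnownGood = fresh;
            return fresh;
        } catch (Exception e) {                  // timeout, dependency failure, or interruption
            call.cancel(true);                   // stop waiting on the slow dependency
            return lastKnownGood;                // degrade: stale but user-visible content
        }
    }
}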

3.3 Observability Enhancements

  • Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
  • Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
  • Logging: Structure logs as JSON; include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, and release labels (a logging sketch follows this list).
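
For the logging item, a minimal sketch using SLF4J’s MDC to attach correlation fields (the field names mirror the labels above; emitting them as JSON fields depends on how the logging backend’s encoder is configured):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Minimal sketch: put correlation fields into the MDC so every log line written
// while handling the request carries request_id, tenant_id, and release.
public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    void handle(String requestId, String tenantId) {
        MDC.put("request_id", requestId);
        MDC.put("tenant_id", tenantId);
        MDC.put("release", System.getenv().getOrDefault("RELEASE", "unknown"));
        try {
            log.info("checkout started");        // MDC fields are attached automatically
            // ... business logic ...
            log.info("checkout completed");
        } finally {
            MDC.clear();                         // avoid leaking context into the next request
        }
    }
}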

3.4 Reliability Improvement Playbook (Weekly Cadence)

  1. Review SLO attainment & burn-rate charts.
  2. Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
  3. Propose one code fix and one infra/observability change.
  4. Deploy via canary; compare SLI before/after; document result.
  5. Close the loop: update runbooks, tests, alerts.

4) Incident Response: From Page to Postmortem

4.1 During the Incident

  • Own the page: acknowledge within minutes; post initial status (“investigating”).
  • Stabilize first: roll back the most recent release; fail over; enable a feature-flag fallback.
  • Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
  • Comms: update stakeholders every 15–30 minutes until stable.

4.2 After the Incident (Blameless Postmortem)

  • Facts first: timeline, impact, user-visible symptoms, SLIs breached.
  • Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
  • Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
  • Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.

5) Common Anti-Patterns (and What to Do Instead)

  • Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
  • Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
  • Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
  • Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
  • Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.

6) A 30-Day Quick Start

  1. Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
  2. Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
  3. Week 3: Add tracing to the top 3 endpoints; add a circuit breaker + timeouts for the noisiest dependency.
  4. Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.

7) Concrete Examples & Snippets

7.1 Example Prometheus SLI Queries (pseudo-metrics)

# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

7.2 Burn-Rate Alert (illustrative)

# Fast-burn: page if 2% of monthly budget is burned in 1 hour
# Slow-burn: ticket if 10% burned over 24 hours
# (Use your SLO window and target to compute rates)
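
To make those thresholds concrete, here is a minimal burn-rate calculation sketch (plain Java; the observed error rates are hypothetical, and the SLO is the 99.9%/30-day example used throughout this post):

// Minimal sketch: burn rate = observed error rate / allowed error rate.
// A burn rate of 1.0 spends exactly the whole budget over the SLO window.
public class BurnRate {
    public static void main(String[] args) {
        double slo = 0.999;                        // 99.9% over 30 days
        double allowedErrorRate = 1.0 - slo;       // 0.1%
        double sloWindowHours = 30 * 24;           // 720 h

        double errorRate1h = 0.02;                 // hypothetical: 2% of requests failing right now
        double errorRate24h = 0.0005;              // hypothetical: 0.05% over the last day

        double fastBurn = errorRate1h / allowedErrorRate;          // 20x
        double slowBurn = errorRate24h / allowedErrorRate;         // 0.5x

        // Budget consumed over a lookback window = burn rate * window / SLO window.
        double budgetSpent1h = fastBurn * 1.0 / sloWindowHours;    // ~2.8% of the monthly budget
        double budgetSpent24h = slowBurn * 24.0 / sloWindowHours;  // ~1.7% of the monthly budget

        System.out.printf("fast burn %.1fx (%.1f%% of budget in 1h) -> %s%n",
                fastBurn, budgetSpent1h * 100, budgetSpent1h > 0.02 ? "PAGE" : "ok");
        System.out.printf("slow burn %.1fx (%.1f%% of budget in 24h) -> %s%n",
                slowBurn, budgetSpent24h * 100, budgetSpent24h > 0.10 ? "TICKET" : "ok");
    }
}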

7.3 Resilience Config (Java + Resilience4j sketch)

// Circuit breaker + retry with jittered exponential backoff
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
  .failureRateThreshold(50f)                        // open once >=50% of recorded calls fail
  .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open 30s before probing again
  .permittedNumberOfCallsInHalfOpenState(5)
  .slidingWindowSize(100)
  .build();

RetryConfig retry = RetryConfig.custom()
  .maxAttempts(3)
  // exponential backoff from 200 ms, x2 multiplier, 20% randomization (the jitter);
  // the interval function replaces waitDuration, which must not be set alongside it
  .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2))
  .build();
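
To wire these configs into a call path, a minimal usage sketch (the "inventory-service" name, the inventoryClient.fetchStock call, and the cachedStock fallback are hypothetical):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import java.util.function.Supplier;

// Minimal usage sketch: decorate the remote call with the breaker and retry built above.
CircuitBreaker breaker = CircuitBreaker.of("inventory-service", cb);
Retry retryPolicy = Retry.of("inventory-service", retry);

Supplier<String> guarded = Retry.decorateSupplier(retryPolicy,
    CircuitBreaker.decorateSupplier(breaker, () -> inventoryClient.fetchStock(sku)));

String stock;
try {
    stock = guarded.get();
} catch (RuntimeException e) {    // includes CallNotPermittedException while the breaker is open
    stock = cachedStock;          // graceful degradation: serve the last known value
}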

7.4 Kubernetes Health Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5

8) Lightweight SRE Maturity Model

Level | Practices | What to Add Next
Level 1: Awareness | Basic monitoring, ad-hoc on-call, manual deployments | Define SLIs/SLOs, create SLO dashboard, add canary deploys
Level 2: Control | Burn-rate alerts, incident runbooks, partial automation | Tracing, circuit breakers, chaos drills, auto-rollback
Level 3: Optimization | Error budget policy enforced, game days, automated rollbacks | Multi-region resilience, SLO-gated releases, org-wide error budgets

9) Sample Reliability OKRs

  • Objective: Improve checkout service reliability without slowing delivery.
    • KR1: Availability SLO from 99.5% → 99.9% (30-day window).
    • KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
    • KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
    • KR4: Implement canary + auto-rollback for 100% of releases.

Conclusion

Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.

Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.