1) What Is SRE Culture?
1.1 Error Budgets: A Contract Between Speed and Stability
An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.
- Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
- Translation: Over 30 days (43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
- Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until burn rate normalizes.
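To make the arithmetic and the freeze policy concrete, here is a minimal sketch. The 99.9% / 30-day target comes from the example above; the 30 minutes of downtime is a hypothetical figure.

```java
// Sketch: error-budget math for a 99.9% SLO over a 30-day window.
public class ErrorBudget {
    public static void main(String[] args) {
        double slo = 0.999;                        // availability target
        double windowMinutes = 30 * 24 * 60;       // 43,200 minutes in 30 days

        double budgetMinutes = (1.0 - slo) * windowMinutes;     // 43.2 minutes of allowed downtime
        double downtimeMinutes = 30.0;                          // hypothetical downtime spent so far

        double spentFraction = downtimeMinutes / budgetMinutes; // ~0.69
        System.out.printf("Budget: %.1f min, spent: %.0f%%%n", budgetMinutes, spentFraction * 100);

        // Policy above: once >60% of the budget is spent, restrict deploys to reliability fixes.
        if (spentFraction > 0.60) {
            System.out.println("Freeze risky changes; ship reliability fixes only.");
        }
    }
}
```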
1.2 SLIs & SLOs: A Common Language
SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.
| Domain | SLI (what we measure) | Example SLO (target) | Notes |
|---|---|---|---|
| Availability | % successful requests (non-5xx + within timeout) | 99.9% over 30 days | Define failure modes clearly (timeouts, 5xx, dependency errors). |
| Latency | p95 end-to-end latency (ms) | ≤ 300 ms (p95), ≤ 800 ms (p99) | Track server time and total time (incl. downstream calls). |
| Error Rate | Failed / total requests | < 0.1% rolling 30 days | Include client-cancel/timeouts if user-impacting. |
| Durability | Data loss incidents | 0 incidents / year | Backups + restore drills must be part of policy. |
1.3 Automation Over Manual Ops
- Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach (sketched after this list).
- Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
- Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.
2) How Do You Measure Reliability?
2.1 Availability (“The Nines”)
| SLO | Max Downtime / Year | Per 30 Days |
|---|---|---|
| 99.0% | ~3d 15h | ~7h 12m |
| 99.9% | ~8h 46m | ~43m |
| 99.99% | ~52m 34s | ~4m 19s |
| 99.999% | ~5m 15s | ~26s |
2.2 Latency (Percentiles, Not Averages)
Track p50/p90/p95/p99. Averages hide tail pain. Tie your alerting to user-impacting percentiles.
- API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
- Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.
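To see why averages mislead, a short sketch with made-up latency samples: a handful of very slow requests barely move the mean but dominate the tail percentiles.

```java
import java.util.Arrays;

// Sketch: averages hide tail pain. The latency samples below are made up.
public class TailLatency {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 94, 120.0);    // 94 requests at 120 ms
        Arrays.fill(latencies, 94, 100, 2500.0); // 6 requests at 2.5 s

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0f ms, p95=%.0f ms, p99=%.0f ms%n",
                avg, percentile(latencies, 95), percentile(latencies, 99));
        // Prints avg=263 ms (looks fine), p95=2500 ms, p99=2500 ms (the pain users feel).
    }
}
```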
2.3 Error Rate
Define “failed” precisely: HTTP 5xx, timeouts, and domain-level errors. For example, “payment declined” may be a success from the platform’s perspective but a failure for a specific business flow; track both views.
2.4 Example SLI Formulas
# Availability SLI
availability = successful_requests / total_requests
# Latency SLI
latency_p95 = percentile(latency_ms, 95)
# Error Rate SLI
error_rate = failed_requests / total_requests
2.5 SLO-Aware Alerting (Burn-Rate Alerts)
Alert on error budget burn rate, not just raw thresholds.
- Fast burn: 2% of the monthly budget burned in 1 hour → page immediately (at that rate the entire budget is gone in roughly two days).
- Slow burn: 10% of the monthly budget burned in 24 hours → open a ticket and investigate within business hours.
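To turn these rules into alert thresholds, here is a minimal sketch assuming the 99.9% SLO over 30 days used earlier; it derives the burn-rate multipliers and the error ratios the alerts would compare against.

```java
// Sketch: derive burn-rate alert thresholds from an SLO and the policy above.
public class BurnRateThresholds {
    public static void main(String[] args) {
        double slo = 0.999;            // 99.9% availability over 30 days
        double budget = 1.0 - slo;     // error budget as a ratio (0.001)
        double windowHours = 30 * 24;  // 720 hours in the SLO window

        double fastBurnRate = 0.02 / (1.0 / windowHours);   // 2% in 1 h   -> 14.4x
        double slowBurnRate = 0.10 / (24.0 / windowHours);  // 10% in 24 h -> 3x

        // Page when the 1 h error ratio exceeds 14.4 * budget; ticket at 3 * budget over 24 h.
        System.out.printf("page if 1h error ratio    > %.4f%n", fastBurnRate * budget);
        System.out.printf("ticket if 24h error ratio > %.4f%n", slowBurnRate * budget);
    }
}
```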
3) How Do You Improve Reliability?
3.1 Code Fixes (Targeted, Measurable)
- Database hot paths: Add missing indexes, rewrite N+1 queries (sketched after this list), reduce chatty patterns; measure p95 improvement before/after.
- Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
- Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.
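The N+1 fix mentioned above, sketched with plain JDBC; the customers table, column names, and calling code are hypothetical. Instead of issuing one query per row, the IDs are batched into a single IN query.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: replace an N+1 pattern (one customers query per order) with a single batched lookup.
public class BatchedLookup {
    static Map<Long, String> customerNames(Connection conn, List<Long> ids) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT id, name FROM customers WHERE id IN (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i)); // JDBC parameters are 1-indexed
            }
            Map<Long, String> names = new HashMap<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.put(rs.getLong("id"), rs.getString("name"));
                }
            }
            return names;
        }
    }
}
```

Compare the calling endpoint’s p95 before and after the change, as the weekly playbook in 3.4 prescribes.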
3.2 Infrastructure Changes
- Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency.
- Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
- Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail.
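A minimal sketch of the graceful-degradation idea: time-box a dependency call and fall back to a cached result instead of failing the whole request. fetchRecommendations() and the cached fallback list are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: serve a degraded (cached) result when a dependency is slow or failing.
public class DegradedReads {
    static final List<String> CACHED_FALLBACK = List.of("bestseller-1", "bestseller-2");

    static List<String> recommendationsFor(String userId) {
        return CompletableFuture
                .supplyAsync(() -> fetchRecommendations(userId))                // remote call
                .completeOnTimeout(CACHED_FALLBACK, 300, TimeUnit.MILLISECONDS) // degrade after 300 ms
                .exceptionally(ex -> CACHED_FALLBACK)                           // degrade on error
                .join();
    }

    static List<String> fetchRecommendations(String userId) {
        return List.of("personalized-1", "personalized-2"); // placeholder for the real call
    }
}
```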
3.3 Observability Enhancements
- Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
- Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
- Logging: Structure logs (JSON); include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, and release labels.
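For the logging item, a sketch using SLF4J’s MDC so every log line carries request_id, tenant_id, and release; it assumes a JSON encoder (for example, logstash-logback-encoder) is configured separately in the logging backend.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Sketch: attach correlation fields to every log line via the MDC.
// The JSON layout itself lives in the logging backend configuration (not shown).
public class RequestLogging {
    private static final Logger log = LoggerFactory.getLogger(RequestLogging.class);

    void handle(String requestId, String tenantId) {
        MDC.put("request_id", requestId);
        MDC.put("tenant_id", tenantId);
        MDC.put("release", System.getenv().getOrDefault("RELEASE", "unknown"));
        try {
            log.info("checkout started"); // emitted as JSON with the MDC fields attached
            // ... handle the request ...
        } finally {
            MDC.clear();                  // avoid leaking fields across pooled threads
        }
    }
}
```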
3.4 Reliability Improvement Playbook (Weekly Cadence)
- Review SLO attainment & burn-rate charts.
- Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
- Propose one code fix and one infra/observability change.
- Deploy via canary; compare SLI before/after; document result.
- Close the loop: update runbooks, tests, alerts.
4) Incident Response: From Page to Postmortem
4.1 During the Incident
- Own the page: acknowledge within minutes; post initial status (“investigating”).
- Stabilize first: roll back most recent release; fail over; enable feature flag fallback.
- Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
- Comms: update stakeholders every 15–30 minutes until stable.
4.2 After the Incident (Blameless Postmortem)
- Facts first: timeline, impact, user-visible symptoms, SLIs breached.
- Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
- Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
- Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.
5) Common Anti-Patterns (and What to Do Instead)
- Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
- Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
- Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
- Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
- Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.
6) A 30-Day Quick Start
- Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
- Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
- Week 3: Add tracing to the top 3 endpoints; wrap the noisiest dependency in a circuit breaker and timeouts.
- Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.
7) Concrete Examples & Snippets
7.1 Example SLI Queries in Prometheus (pseudo-metrics)
# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
7.2 Burn-Rate Alert (illustrative)
# Fast burn: page if 2% of the monthly budget is burned in 1 hour.
#   For a 30-day window that is a 14.4x burn rate: alert when the 1h error ratio exceeds 14.4 * (1 - SLO).
# Slow burn: ticket if 10% of the budget is burned over 24 hours.
#   That is a 3x burn rate: alert when the 24h error ratio exceeds 3 * (1 - SLO).
# (Use your own SLO window and target to compute the multipliers.)
7.3 Resilience Config (Java + Resilience4j sketch)
// Circuit breaker + retry with jittered exponential backoff (illustrative sketch)
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)                        // open once >=50% of recent calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open 30 s before probing
    .permittedNumberOfCallsInHalfOpenState(5)         // trial calls allowed while half-open
    .slidingWindowSize(100)                           // judge health over the last 100 calls
    .build();

RetryConfig retry = RetryConfig.custom()
    .maxAttempts(3)
    // 200 ms base, 2x multiplier, 20% randomization (jitter); the interval function supplies the wait time
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2))
    .build();
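A short usage sketch for the configs above; callPaymentService() is a hypothetical downstream call, and the extra imports (io.github.resilience4j.circuitbreaker.CircuitBreaker, io.github.resilience4j.retry.Retry, java.util.function.Supplier) are assumed.

```java
// Usage sketch: guard a remote call with the circuit breaker and retry configured above.
CircuitBreaker circuitBreaker = CircuitBreaker.of("payments", cb);
Retry retryPolicy = Retry.of("payments", retry);

Supplier<String> guarded = Retry.decorateSupplier(retryPolicy,
        CircuitBreaker.decorateSupplier(circuitBreaker, () -> callPaymentService()));

String result;
try {
    result = guarded.get();
} catch (Exception e) {
    result = "fallback-response"; // degrade gracefully when the circuit is open or retries are exhausted
}
```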
7.4 Kubernetes Health Probes
livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5
8) Lightweight SRE Maturity Model
| Level | Practices | What to Add Next |
|---|---|---|
| Level 1: Awareness | Basic monitoring, ad-hoc on-call, manual deployments | Define SLIs/SLOs, create SLO dashboard, add canary deploys |
| Level 2: Control | Burn-rate alerts, incident runbooks, partial automation | Tracing, circuit breakers, chaos drills, auto-rollback |
| Level 3: Optimization | Error budget policy enforced, game days, automated rollbacks | Multi-region resilience, SLO-gated releases, org-wide error budgets |
9) Sample Reliability OKRs
- Objective: Improve checkout service reliability without slowing delivery.
- KR1: Availability SLO from 99.5% → 99.9% (30-day window).
- KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
- KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
- KR4: Implement canary + auto-rollback for 100% of releases.
Conclusion
Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.
Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.