1) What Is SRE Culture?
1.1 Error Budgets: A Contract Between Speed and Stability
An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.
- Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
- Translation: Over 30 days (43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
- Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until burn rate normalizes.
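To make the arithmetic and the freeze policy concrete, here is a minimal sketch. The 99.9% / 30-day target comes from the example above; the 30 minutes of downtime is a hypothetical figure.

```java
// Sketch: error-budget math for a 99.9% SLO over a 30-day window.
public class ErrorBudget {
    public static void main(String[] args) {
        double slo = 0.999;                        // availability target
        double windowMinutes = 30 * 24 * 60;       // 43,200 minutes in 30 days

        double budgetMinutes = (1.0 - slo) * windowMinutes;     // 43.2 minutes of allowed downtime
        double downtimeMinutes = 30.0;                          // hypothetical downtime spent so far

        double spentFraction = downtimeMinutes / budgetMinutes; // ~0.69
        System.out.printf("Budget: %.1f min, spent: %.0f%%%n", budgetMinutes, spentFraction * 100);

        // Policy above: once >60% of the budget is spent, restrict deploys to reliability fixes.
        if (spentFraction > 0.60) {
            System.out.println("Freeze risky changes; ship reliability fixes only.");
        }
    }
}
```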
1.2 SLIs & SLOs: A Common Language
SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.
| Domain | SLI (what we measure) | Example SLO (target) | Notes |
|---|---|---|---|
| Availability | % successful requests (non-5xx + within timeout) | 99.9% over 30 days | Define failure modes clearly (timeouts, 5xx, dependency errors). |
| Latency | p95 end-to-end latency (ms) | ≤ 300 ms (p95), ≤ 800 ms (p99) | Track server time and total time (incl. downstream calls). |
| Error Rate | Failed / total requests | < 0.1% rolling 30 days | Include client-cancel/timeouts if user-impacting. |
| Durability | Data loss incidents | 0 incidents / year | Backups + restore drills must be part of policy. |
1.3 Automation Over Manual Ops
- Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach (sketched after this list).
- Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
- Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.
2) How Do You Measure Reliability?
2.1 Availability (“The Nines”)
| SLO | Max Downtime / Year | Per 30 Days |
|---|---|---|
| 99.0% | ~3d 15h | ~7h 12m |
| 99.9% | ~8h 46m | ~43m |
| 99.99% | ~52m 34s | ~4m 19s |
| 99.999% | ~5m 15s | ~26s |
2.2 Latency (Percentiles, Not Averages)
Track p50/p90/p95/p99. Averages hide tail pain. Tie your alerting to user-impacting percentiles.
- API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
- Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.
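To see why averages mislead, a short sketch with made-up latency samples: a handful of very slow requests barely move the mean but dominate the tail percentiles.

```java
import java.util.Arrays;

// Sketch: averages hide tail pain. The latency samples below are made up.
public class TailLatency {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 94, 120.0);    // 94 requests at 120 ms
        Arrays.fill(latencies, 94, 100, 2500.0); // 6 requests at 2.5 s

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0f ms, p95=%.0f ms, p99=%.0f ms%n",
                avg, percentile(latencies, 95), percentile(latencies, 99));
        // Prints avg=263 ms (looks fine), p95=2500 ms, p99=2500 ms (the pain users feel).
    }
}
```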
2.3 Error Rate
Define “failed” precisely: HTTP 5xx, timeouts, and domain-level errors. For example, “payment declined” may be a success from the platform’s perspective but a failure for a specific business flow; track both views.
2.4 Example SLI Formulas
# Availability SLI
availability = successful_requests / total_requests
# Latency SLI
latency_p95 = percentile(latency_ms, 95)
# Error Rate SLI
error_rate = failed_requests / total_requests
2.5 SLO-Aware Alerting (Burn-Rate Alerts)
Alert on error budget burn rate, not just raw thresholds.
- Fast burn: 2% of the monthly budget burned in 1 hour → page immediately (at that rate the entire budget is gone in roughly two days).
- Slow burn: 10% of the monthly budget burned in 24 hours → open a ticket and investigate within business hours.
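To turn these rules into alert thresholds, here is a minimal sketch assuming the 99.9% SLO over 30 days used earlier; it derives the burn-rate multipliers and the error ratios the alerts would compare against.

```java
// Sketch: derive burn-rate alert thresholds from an SLO and the policy above.
public class BurnRateThresholds {
    public static void main(String[] args) {
        double slo = 0.999;            // 99.9% availability over 30 days
        double budget = 1.0 - slo;     // error budget as a ratio (0.001)
        double windowHours = 30 * 24;  // 720 hours in the SLO window

        double fastBurnRate = 0.02 / (1.0 / windowHours);   // 2% in 1 h   -> 14.4x
        double slowBurnRate = 0.10 / (24.0 / windowHours);  // 10% in 24 h -> 3x

        // Page when the 1 h error ratio exceeds 14.4 * budget; ticket at 3 * budget over 24 h.
        System.out.printf("page if 1h error ratio    > %.4f%n", fastBurnRate * budget);
        System.out.printf("ticket if 24h error ratio > %.4f%n", slowBurnRate * budget);
    }
}
```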
3) How Do You Improve Reliability?
3.1 Code Fixes (Targeted, Measurable)
- Database hot paths: Add missing indexes, rewrite N+1 queries (sketched after this list), reduce chatty patterns; measure p95 improvement before/after.
- Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
- Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.
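The N+1 fix mentioned above, sketched with plain JDBC; the customers table, column names, and calling code are hypothetical. Instead of issuing one query per row, the IDs are batched into a single IN query.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: replace an N+1 pattern (one customers query per order) with a single batched lookup.
public class BatchedLookup {
    static Map<Long, String> customerNames(Connection conn, List<Long> ids) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT id, name FROM customers WHERE id IN (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i)); // JDBC parameters are 1-indexed
            }
            Map<Long, String> names = new HashMap<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.put(rs.getLong("id"), rs.getString("name"));
                }
            }
            return names;
        }
    }
}
```

Compare the calling endpoint’s p95 before and after the change, as the weekly playbook in 3.4 prescribes.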
3.2 Infrastructure Changes
- Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency.
- Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
- Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail.
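A minimal sketch of the graceful-degradation idea: time-box a dependency call and fall back to a cached result instead of failing the whole request. fetchRecommendations() and the cached fallback list are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: serve a degraded (cached) result when a dependency is slow or failing.
public class DegradedReads {
    static final List<String> CACHED_FALLBACK = List.of("bestseller-1", "bestseller-2");

    static List<String> recommendationsFor(String userId) {
        return CompletableFuture
                .supplyAsync(() -> fetchRecommendations(userId))                // remote call
                .completeOnTimeout(CACHED_FALLBACK, 300, TimeUnit.MILLISECONDS) // degrade after 300 ms
                .exceptionally(ex -> CACHED_FALLBACK)                           // degrade on error
                .join();
    }

    static List<String> fetchRecommendations(String userId) {
        return List.of("personalized-1", "personalized-2"); // placeholder for the real call
    }
}
```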
3.3 Observability Enhancements
- Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
- Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
- Logging: Structure logs (JSON); include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, and release labels.
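For the logging item, a sketch using SLF4J’s MDC so every log line carries request_id, tenant_id, and release; it assumes a JSON encoder (for example, logstash-logback-encoder) is configured separately in the logging backend.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Sketch: attach correlation fields to every log line via the MDC.
// The JSON layout itself lives in the logging backend configuration (not shown).
public class RequestLogging {
    private static final Logger log = LoggerFactory.getLogger(RequestLogging.class);

    void handle(String requestId, String tenantId) {
        MDC.put("request_id", requestId);
        MDC.put("tenant_id", tenantId);
        MDC.put("release", System.getenv().getOrDefault("RELEASE", "unknown"));
        try {
            log.info("checkout started"); // emitted as JSON with the MDC fields attached
            // ... handle the request ...
        } finally {
            MDC.clear();                  // avoid leaking fields across pooled threads
        }
    }
}
```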
3.4 Reliability Improvement Playbook (Weekly Cadence)
- Review SLO attainment & burn-rate charts.
- Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
- Propose one code fix and one infra/observability change.
- Deploy via canary; compare SLI before/after; document result.
- Close the loop: update runbooks, tests, alerts.
4) Incident Response: From Page to Postmortem
4.1 During the Incident
- Own the page: acknowledge within minutes; post initial status (“investigating”).
- Stabilize first: roll back most recent release; fail over; enable feature flag fallback.
- Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
- Comms: update stakeholders every 15–30 minutes until stable.
4.2 After the Incident (Blameless Postmortem)
- Facts first: timeline, impact, user-visible symptoms, SLIs breached.
- Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
- Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
- Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.
5) Common Anti-Patterns (and What to Do Instead)
- Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
- Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
- Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
- Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
- Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.
6) A 30-Day Quick Start
- Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
- Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
- Week 3: Add tracing to the top 3 endpoints; wrap the noisiest dependency in a circuit breaker and timeouts.
- Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.
7) Concrete Examples & Snippets
7.1 Example SLI Queries in Prometheus (pseudo-metrics)
# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
7.2 Burn-Rate Alert (illustrative)
# Fast burn: page if 2% of the monthly budget is burned in 1 hour.
#   For a 30-day window that is a 14.4x burn rate: alert when the 1h error ratio exceeds 14.4 * (1 - SLO).
# Slow burn: ticket if 10% of the budget is burned over 24 hours.
#   That is a 3x burn rate: alert when the 24h error ratio exceeds 3 * (1 - SLO).
# (Use your own SLO window and target to compute the multipliers.)
7.3 Resilience Config (Java + Resilience4j sketch)
// Circuit breaker + retry with jittered exponential backoff (illustrative sketch)
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)                        // open once >=50% of recent calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open 30 s before probing
    .permittedNumberOfCallsInHalfOpenState(5)         // trial calls allowed while half-open
    .slidingWindowSize(100)                           // judge health over the last 100 calls
    .build();

RetryConfig retry = RetryConfig.custom()
    .maxAttempts(3)
    // 200 ms base, 2x multiplier, 20% randomization (jitter); the interval function supplies the wait time
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2))
    .build();
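A short usage sketch for the configs above; callPaymentService() is a hypothetical downstream call, and the extra imports (io.github.resilience4j.circuitbreaker.CircuitBreaker, io.github.resilience4j.retry.Retry, java.util.function.Supplier) are assumed.

```java
// Usage sketch: guard a remote call with the circuit breaker and retry configured above.
CircuitBreaker circuitBreaker = CircuitBreaker.of("payments", cb);
Retry retryPolicy = Retry.of("payments", retry);

Supplier<String> guarded = Retry.decorateSupplier(retryPolicy,
        CircuitBreaker.decorateSupplier(circuitBreaker, () -> callPaymentService()));

String result;
try {
    result = guarded.get();
} catch (Exception e) {
    result = "fallback-response"; // degrade gracefully when the circuit is open or retries are exhausted
}
```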
7.4 Kubernetes Health Probes
livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5
8) Lightweight SRE Maturity Model
| Level | Practices | What to Add Next |
|---|---|---|
| Level 1: Awareness | Basic monitoring, ad-hoc on-call, manual deployments | Define SLIs/SLOs, create SLO dashboard, add canary deploys |
| Level 2: Control | Burn-rate alerts, incident runbooks, partial automation | Tracing, circuit breakers, chaos drills, auto-rollback |
| Level 3: Optimization | Error budget policy enforced, game days, automated rollbacks | Multi-region resilience, SLO-gated releases, org-wide error budgets |
9) Sample Reliability OKRs
- Objective: Improve checkout service reliability without slowing delivery.
- KR1: Availability SLO from 99.5% → 99.9% (30-day window).
- KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
- KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
- KR4: Implement canary + auto-rollback for 100% of releases.
Conclusion
Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.
Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.