Posts Tagged ‘Resilience’

SRE Principles: From Error Budgets to Everyday Reliability

How to define, measure, and improve reliability with concrete metrics, playbooks, and examples you can apply this week.

In a world where users expect instant, uninterrupted access, reliability is a feature. Site Reliability Engineering (SRE) brings engineering discipline to operations with a toolkit built on error budgets, SLIs/SLOs, and automation. This post turns those ideas into specifics: exact metrics, alert rules, dashboards, code and infra changes, and a lightweight maturity model you can use to track progress.


1) What Is SRE Culture?

1.1 Error Budgets: A Contract Between Speed and Stability

An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.

  • Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
  • Translation: Over 30 days (~43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
  • Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until the burn rate normalizes. (A short sketch of this arithmetic follows.)
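
A minimal sketch of the arithmetic above, assuming a 30-day window and an availability SLO expressed as a fraction (the class and method names are illustrative):

// Error-budget arithmetic: minutes of allowed downtime for a given SLO and window.
public final class ErrorBudget {

    // Allowed downtime in minutes for an availability SLO over a window of N days.
    static double allowedDowntimeMinutes(double sloTarget, int windowDays) {
        double totalMinutes = windowDays * 24 * 60;   // e.g. 30 days -> 43,200 minutes
        return (1.0 - sloTarget) * totalMinutes;      // e.g. 0.1% of 43,200 -> 43.2 minutes
    }

    // Fraction of the budget already spent, given observed downtime so far.
    static double budgetSpent(double downtimeMinutesSoFar, double sloTarget, int windowDays) {
        return downtimeMinutesSoFar / allowedDowntimeMinutes(sloTarget, windowDays);
    }

    public static void main(String[] args) {
        System.out.printf("Budget: %.1f minutes%n", allowedDowntimeMinutes(0.999, 30)); // 43.2
        System.out.printf("Spent:  %.0f%%%n", 100 * budgetSpent(30, 0.999, 30));        // ~69%, past the 60% freeze threshold
    }
}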

1.2 SLIs & SLOs: A Common Language

SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.

Domain | SLI (what we measure) | Example SLO (target) | Notes
Availability | % successful requests (non-5xx and within timeout) | 99.9% over 30 days | Define failure modes clearly (timeouts, 5xx, dependency errors).
Latency | p95 end-to-end latency (ms) | ≤ 300 ms (p95), ≤ 800 ms (p99) | Track server time and total time (incl. downstream calls).
Error Rate | Failed / total requests | < 0.1% (rolling 30 days) | Include client cancels/timeouts if user-impacting.
Durability | Data loss incidents | 0 incidents / year | Backups + restore drills must be part of policy.

1.3 Automation Over Manual Ops

  • Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach.
  • Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
  • Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.

2) How Do You Measure Reliability?

2.1 Availability (“The Nines”)

SLO | Max Downtime per Year | Max Downtime per 30 Days
99.0% | ~3d 15h | ~7h 12m
99.9% | ~8h 46m | ~43m
99.99% | ~52m 34s | ~4m 19s
99.999% | ~5m 15s | ~26s

2.2 Latency (Percentiles, Not Averages)

Track p50/p90/p95/p99. Averages hide tail pain. Tie your alerting to user-impacting percentiles.

  • API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
  • Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.

2.3 Error Rate

Define “failed” precisely: HTTP 5xx responses and domain-level errors. For example, “payment declined” may be a success from the platform’s perspective but a failure for a specific business flow; track both.

2.4 Example SLI Formulas

# Availability SLI
availability = successful_requests / total_requests

# Latency SLI
latency_p95 = percentile(latency_ms, 95)

# Error Rate SLI
error_rate = failed_requests / total_requests
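
A minimal Java sketch of these formulas, assuming you already collect request counters and per-request latencies for the window (all names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative SLI calculations over one measurement window.
public final class SliMath {

    static double availability(long successfulRequests, long totalRequests) {
        return totalRequests == 0 ? 1.0 : (double) successfulRequests / totalRequests;
    }

    static double errorRate(long failedRequests, long totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) failedRequests / totalRequests;
    }

    // Nearest-rank percentile, e.g. percentile(latenciesMs, 95) for p95; assumes a non-empty list.
    static double percentile(List<Double> latenciesMs, double pct) {
        List<Double> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }
}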

2.5 SLO-Aware Alerting (Burn-Rate Alerts)

Alert on error budget burn rate, not just raw thresholds; a short sketch of the computation follows the examples below.

  • Fast burn: 2% budget in 1 hour → page immediately (could exhaust daily budget).
  • Slow burn: 10% budget in 24 hours → open a ticket, investigate within business hours.
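
A hedged sketch of how the burn-rate check itself might be computed, assuming a 99.9% SLO over a 30-day (720-hour) window and error rates observed over the two lookback windows above (values are illustrative):

// Burn rate = observed error rate / allowed error rate (1 - SLO target).
// A burn rate of 1.0 spends the budget exactly over the full SLO window.
public final class BurnRateCheck {

    static double burnRate(double observedErrorRate, double sloTarget) {
        return observedErrorRate / (1.0 - sloTarget);
    }

    public static void main(String[] args) {
        double slo = 0.999;                       // 99.9% over 30 days = 720 hours
        double lastHour = burnRate(0.0144, slo);  // 1.44% errors over the last hour -> 14.4
        double lastDay  = burnRate(0.003, slo);   // 0.30% errors over the last 24 h -> 3.0

        // 2% of a 720 h budget in 1 h  => burn-rate threshold 0.02 * 720 / 1  = 14.4
        if (lastHour >= 14.4) System.out.println("Fast burn: page now");
        // 10% of the budget in 24 h    => burn-rate threshold 0.10 * 720 / 24 = 3.0
        if (lastDay >= 3.0)   System.out.println("Slow burn: open a ticket");
    }
}

In practice the observed error rates come from your metrics backend; the point is that the paging decision is expressed in budget terms rather than raw error counts.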

3) How Do You Improve Reliability?

3.1 Code Fixes (Targeted, Measurable)

  • Database hot paths: Add missing indexes, rewrite N+1 queries, reduce chatty patterns; measure p95 improvement before/after (see the sketch after this list).
  • Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
  • Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.
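
As a concrete illustration of the N+1 item above, here is a hedged sketch; the repository interfaces and method names are hypothetical, not from any particular framework:

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical data-access interfaces, for illustration only.
interface OrderRepository {
    List<Order> findRecent(int limit);
}
interface CustomerRepository {
    Customer findById(long id);                   // one query per call
    Map<Long, Customer> findByIds(Set<Long> ids); // single batched query
}
record Order(long id, long customerId) {}
record Customer(long id, String name) {}

class OrderService {
    private final OrderRepository orders;
    private final CustomerRepository customers;

    OrderService(OrderRepository orders, CustomerRepository customers) {
        this.orders = orders;
        this.customers = customers;
    }

    // N+1 pattern: 1 query for the orders, plus 1 query per order for its customer.
    List<String> summariesSlow() {
        return orders.findRecent(100).stream()
                .map(o -> customers.findById(o.customerId()).name())
                .collect(Collectors.toList());
    }

    // Batched pattern: 2 queries in total, regardless of how many orders are returned.
    List<String> summariesFast() {
        List<Order> recent = orders.findRecent(100);
        Set<Long> ids = recent.stream().map(Order::customerId).collect(Collectors.toSet());
        Map<Long, Customer> byId = customers.findByIds(ids);
        return recent.stream()
                .map(o -> byId.get(o.customerId()).name())
                .collect(Collectors.toList());
    }
}

Measure the endpoint's p95 before and after switching to the batched path, as the bullet suggests, so the improvement shows up against the SLI rather than as an anecdote.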

3.2 Infrastructure Changes

  • Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency.
  • Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
  • Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail (see the sketch after this list).
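
A minimal sketch of the graceful-degradation bullet, assuming a simple in-process cache of the last good result per key (all names are illustrative):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Serve the last successful result when the dependency call fails.
class FallbackCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    Optional<V> getOrFallback(K key, Supplier<V> dependencyCall) {
        try {
            V fresh = dependencyCall.get();   // e.g. an HTTP call already guarded by a timeout
            lastGood.put(key, fresh);
            return Optional.of(fresh);
        } catch (RuntimeException dependencyFailure) {
            // Degrade gracefully: stale-but-usable beats a hard error for many read paths.
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}

A production version would bound the cache and attach TTLs, but the shape of the fallback is the point here.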

3.3 Observability Enhancements

  • Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
  • Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
  • Logging: Structure logs (JSON); include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, and release labels (a sketch follows this list).
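
A hedged sketch of the logging bullet using SLF4J's MDC for correlation fields; rendering the output as JSON is assumed to be handled by the logging backend's encoder configuration:

import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    void handle(String tenantId) {
        MDC.put("request_id", UUID.randomUUID().toString()); // or propagate an incoming header
        MDC.put("tenant_id", tenantId);
        MDC.put("release", System.getenv().getOrDefault("RELEASE", "unknown"));
        try {
            log.info("checkout started"); // MDC fields ride along on every log line in this scope
            // ... business logic ...
        } finally {
            MDC.clear(); // avoid leaking context across pooled threads
        }
    }
}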

3.4 Reliability Improvement Playbook (Weekly Cadence)

  1. Review SLO attainment & burn-rate charts.
  2. Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
  3. Propose one code fix and one infra/observability change.
  4. Deploy via canary; compare SLI before/after; document result.
  5. Close the loop: update runbooks, tests, alerts.

4) Incident Response: From Page to Postmortem

4.1 During the Incident

  • Own the page: acknowledge within minutes; post initial status (“investigating”).
  • Stabilize first: roll back most recent release; fail over; enable feature flag fallback.
  • Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
  • Comms: update stakeholders every 15–30 minutes until stable.

4.2 After the Incident (Blameless Postmortem)

  • Facts first: timeline, impact, user-visible symptoms, SLIs breached.
  • Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
  • Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
  • Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.

5) Common Anti-Patterns (and What to Do Instead)

  • Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
  • Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
  • Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
  • Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
  • Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.

6) A 30-Day Quick Start

  1. Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
  2. Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
  3. Week 3: Add tracing to the top 3 endpoints; add a circuit breaker and timeouts for the noisiest dependency.
  4. Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.

7) Concrete Examples & Snippets

7.1 Example Prometheus SLI Queries (pseudo-metrics)

# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

7.2 Burn-Rate Alert (illustrative)

# Fast burn: page if 2% of the monthly budget is burned in 1 hour
#   (for a 30-day window: 0.02 * 720 h / 1 h  = burn rate 14.4)
# Slow burn: ticket if 10% of the budget is burned over 24 hours
#   (for a 30-day window: 0.10 * 720 h / 24 h = burn rate 3.0)
# Use your own SLO window and target to compute the thresholds.

7.3 Resilience Config (Java + Resilience4j sketch)

// Circuit breaker + retry with jittered exponential backoff (Resilience4j)
import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
  .failureRateThreshold(50f)                       // open once >= 50% of sampled calls fail
  .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open before probing again
  .permittedNumberOfCallsInHalfOpenState(5)
  .slidingWindowSize(100)
  .build();

RetryConfig retry = RetryConfig.custom()
  .maxAttempts(3)
  // randomized (jittered) exponential backoff: 200 ms initial interval, x2 growth, 20% jitter
  .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2))
  .build();

7.4 Kubernetes Health Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5

8) Lightweight SRE Maturity Model

Level | Practices | What to Add Next
Level 1: Awareness | Basic monitoring, ad-hoc on-call, manual deployments | Define SLIs/SLOs, create an SLO dashboard, add canary deploys
Level 2: Control | Burn-rate alerts, incident runbooks, partial automation | Tracing, circuit breakers, chaos drills, auto-rollback
Level 3: Optimization | Error budget policy enforced, game days, automated rollbacks | Multi-region resilience, SLO-gated releases, org-wide error budgets

9) Sample Reliability OKRs

  • Objective: Improve checkout service reliability without slowing delivery.
    • KR1: Availability SLO from 99.5% → 99.9% (30-day window).
    • KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
    • KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
    • KR4: Implement canary + auto-rollback for 100% of releases.

Conclusion

Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.

Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.

[SpringIO2022] How to foster a Culture of Resilience

Benjamin Wilms, founder of Steadybit, delivered a compelling session at Spring I/O 2022, exploring how to build a culture of resilience through chaos engineering. Drawing from his experience and the evolution of chaos engineering since his 2019 Spring I/O talk, Benjamin emphasized proactive strategies to enhance system reliability. His presentation combined practical demonstrations with a framework for integrating resilience into development workflows, advocating for collaboration and automation.

Understanding Resilience and Chaos Engineering

Benjamin began by defining resilience as the outcome of well-architected, automated, and thoroughly tested systems capable of recovering from faults while delivering customer value. Unlike traditional stability, resilience involves handling partial outages with fallbacks or alternatives, ensuring service continuity. He introduced chaos engineering as a method to test this resilience by intentionally injecting faults—latency, exceptions, or service outages—to build confidence in system capabilities.

Chaos engineering involves defining a steady state (e.g., successful Netflix play button clicks), forming hypotheses (e.g., surviving a payment service outage), and running experiments to verify outcomes. Benjamin highlighted its evolution from a niche practice at Netflix to a growing community discipline, but noted its time-intensive nature often deters teams. He stressed that resilience extends beyond systems to organizational responsiveness, such as detecting incidents in seconds rather than minutes.

Pitfalls of Ad-Hoc Chaos Engineering

To illustrate common mistakes, Benjamin demonstrated a flawed approach using a Kubernetes-based microservice system with a gateway and three backend services. Running a random “delete pod” attack on the hotel service caused errors in the gateway’s product list aggregation, visible in a demo UI. However, the experiment yielded little insight, as it only confirmed the attack’s impact without actionable learnings. He critiqued such ad-hoc attacks—using tools like Pumbaa—for disrupting workflows and requiring expertise in CI/CD integration, diverting focus from core development.

This approach fails to generate knowledge or improve systems, often becoming a “rabbit hole” of additional work. Benjamin argued that starting with tools or attacks, rather than clear objectives, undermines the value of chaos engineering, leaving teams with vague results and no clear path to enhancement.

Building a Culture of Resilience

Benjamin proposed a structured approach to foster resilience, starting with the “why”: understanding motivations like surviving AWS zone outages or ensuring checkout services handle payment downtimes. The “what” involves defining specific capabilities, such as maintaining 95% request success during pod failures or implementing retry patterns. He advocated encoding these capabilities as policies—code-based checks integrated into the development pipeline.

In a demo, Benjamin showed how to define a policy for the gateway service, specifying pod redundancy and steady-state checks via a product list endpoint. The policy, stored in the codebase, runs in a CI/CD pipeline (e.g., GitHub Actions) on a staging environment, verifying resilience after each commit. This automation ensures continuous validation without manual intervention, embedding resilience into daily workflows. Policies include pre-built experiments from communities (e.g., Zalando) or static weak spot checks, like missing Kubernetes readiness probes, making resilience accessible to all developers.

Organizational Strategies and Community Impact

Benjamin addressed organizational adoption, suggesting a central component to schedule experiments and avoid overlapping tests in shared environments. For consulting scenarios, he recommended analyzing past incidents to demonstrate resilience gaps, such as running experiments to recreate outages. He shared a case where a client’s system collapsed during a rolling update under load, underscoring the need for combined testing scenarios.

He encouraged starting with static linters to identify configuration risks and replaying past incidents to prevent recurrence. By integrating resilience checks into pipelines, teams can focus on feature delivery while maintaining reliability. Benjamin’s vision of a resilience culture—where proactive testing is instinctive—resonates with developers seeking to balance velocity and stability.


[DevoxxFR2014] Reactive Programming with RxJava: Building Responsive Applications

Lecturer

Ben Christensen is a software engineer at Netflix, where he leads the development of reactive libraries for the JVM and is a core contributor to RxJava. He has extensive experience building resilient, low-latency systems for streaming platforms, with a focus on applying functional reactive programming principles to microservices architectures.

Abstract

This article provides an in-depth exploration of RxJava, Netflix’s implementation of Reactive Extensions for the JVM. It analyzes the Observable pattern as a foundation for composing asynchronous and event-driven programs. The discussion covers essential operators for data transformation and composition, schedulers for concurrency management, and advanced error handling strategies. Through concrete Netflix use cases, the article demonstrates how RxJava enables non-blocking, resilient applications and contrasts this approach with traditional callback-based paradigms.

The Observable Pattern and Push vs. Pull Models

RxJava revolves around the Observable, which functions as a push-based, composable iterator. Unlike the traditional pull-based Iterable, Observables emit items asynchronously to subscribers. This fundamental duality enables uniform treatment of synchronous and asynchronous data sources:

Observable<String> greeting = Observable.just("Hello", "RxJava");
greeting.subscribe(System.out::println);

The Observer interface defines three callbacks: onNext for data emission, onError for exceptions, and onCompleted for stream termination. RxJava enforces strict contracts for backpressure—ensuring producers respect consumer consumption rates—and cancellation through unsubscribe operations.
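
A short sketch of those three callbacks using the RxJava 1.x Subscriber type (values are illustrative):

import rx.Observable;
import rx.Subscriber;

Observable<String> words = Observable.just("Hello", "RxJava");
words.subscribe(new Subscriber<String>() {
    @Override public void onNext(String value) { System.out.println("next: " + value); }
    @Override public void onError(Throwable t) { System.err.println("error: " + t); }
    @Override public void onCompleted()        { System.out.println("done"); }
});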

Operator Composition and Declarative Programming

RxJava provides over 100 operators that transform, filter, and combine Observables in a declarative manner. These operators form a functional composition pipeline:

Observable.range(1, 10)
          .filter(n -> n % 2 == 0)
          .map(n -> n * n)
          .subscribe(square -> System.out.println("Square: " + square));

The flatMap operator proves particularly powerful for concurrent operations, such as parallel API calls:

Observable<String> userIds = getUserIds();
userIds.flatMap(userId -> userService.getDetails(userId), 5) // at most 5 concurrent calls
       .subscribe(user -> process(user));

This approach eliminates callback nesting (callback hell) while maintaining readability and composability. Marble diagrams visually represent operator behavior, illustrating timing, concurrency, and error propagation.

Concurrency Control with Schedulers

RxJava decouples computation from threading through Schedulers, which abstract thread pools:

Observable.just(1, 2, 3)
          .subscribeOn(Schedulers.io())
          .observeOn(Schedulers.computation())
          .map(this::cpuIntensiveTask)
          .subscribe(result -> display(result));

Common schedulers include:

  • Schedulers.io() for I/O-bound operations (network, disk).
  • Schedulers.computation() for CPU-bound tasks.
  • Schedulers.newThread() for fire-and-forget operations.

This abstraction enables non-blocking I/O without manual thread management or blocking queues.

Error Handling and Resilience Patterns

RxJava treats errors as first-class citizens in the data stream:

Observable<String> risky = Observable.create(subscriber -> {
    subscriber.onNext(computeRiskyValue());
    subscriber.onError(new RuntimeException("Failed"));
});
risky.onErrorResumeNext(throwable -> Observable.just("Default"))
     .subscribe(value -> System.out.println(value));

Operators like retry, retryWhen, and onErrorReturn implement resilience patterns such as exponential backoff and circuit breakers—critical for microservices in failure-prone networks.
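
For example, a hedged RxJava 1.x sketch of retryWhen with exponential backoff; callRemoteService() is a placeholder assumed to return an Observable<String>, and the delays are arbitrary:

import java.util.concurrent.TimeUnit;
import rx.Observable;

// Retry up to 3 times, waiting 1 s, 2 s, then 4 s between attempts.
Observable<String> resilient = callRemoteService()
    .retryWhen(errors -> errors
        .zipWith(Observable.range(1, 3), (error, attempt) -> attempt)
        .flatMap(attempt -> Observable.timer(1L << (attempt - 1), TimeUnit.SECONDS)));
// Caveat: once the range is exhausted, this simple version completes instead of re-emitting the last error.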

Netflix Production Use Cases

Netflix employs RxJava across its entire stack. The UI layer composes multiple backend API calls for personalized homepages:

Observable<Recommendation> recs = userIdObservable
    .flatMap(this::fetchUserProfile)
    .flatMap(profile -> Observable.zip(
        fetchTopMovies(profile),
        fetchSimilarUsers(profile),
        this::combineRecommendations));

The API gateway uses RxJava for timeout handling, fallbacks, and request collapsing. Backend services leverage it for event processing and data aggregation.
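
A small sketch of the timeout-plus-fallback idea in the same RxJava 1.x style; fetchRecommendations(userId) and Recommendation.fallback() are placeholders:

import java.util.concurrent.TimeUnit;
import rx.Observable;

// Bound the dependency call and degrade to a default when it is slow or failing.
Observable<Recommendation> safeRecs = fetchRecommendations(userId)
    .timeout(500, TimeUnit.MILLISECONDS)               // fail fast on a slow dependency
    .onErrorReturn(error -> Recommendation.fallback()); // serve a default instead of an error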

Broader Impact on Software Architecture

RxJava embodies the Reactive Manifesto principles: responsive, resilient, elastic, and message-driven. It eliminates common concurrency bugs like race conditions and deadlocks. For JVM developers, RxJava offers a functional, declarative alternative to imperative threading models, enabling cleaner, more maintainable asynchronous code.
