Posts Tagged ‘SRE’

[DevoxxFR2025] Alert, Everything’s Burning! Mastering Technical Incidents

In the fast-paced world of technology, technical incidents are an unavoidable reality. When systems fail, the ability to quickly detect, diagnose, and resolve issues is paramount to minimizing impact on users and the business. Alexis Chotard, Laurent Leca, and Luc Chmielowski from PayFit shared their invaluable experience and strategies for mastering technical incidents, even as a rapidly scaling “unicorn” company. Their presentation went beyond just technical troubleshooting, delving into the crucial aspects of defining and evaluating incidents, effective communication, product-focused response, building organizational resilience, managing on-call duties, and transforming crises into learning opportunities through structured post-mortems.

Defining and Responding to Incidents

The first step in mastering incidents is having a clear understanding of what constitutes an incident and its severity. Alexis, Laurent, and Luc discussed how PayFit defines and categorizes technical incidents based on their impact on users and business operations. This often involves established severity levels and clear criteria for escalation. Their approach emphasized a rapid and coordinated response involving not only technical teams but also product and communication stakeholders to ensure a holistic approach. They highlighted the importance of clear internal and external communication during an incident, keeping relevant parties informed about the status, impact, and expected resolution time. This transparency helps manage expectations and build trust during challenging situations.

Technical Resolution and Product Focus

While quick technical mitigation to restore service is the immediate priority during an incident, the PayFit team stressed the importance of a product-focused approach. This involves understanding the user impact of the incident and prioritizing resolution steps that minimize disruption for customers. They discussed strategies for effective troubleshooting, leveraging monitoring and logging tools to quickly identify the root cause. Beyond immediate fixes, they highlighted the need to address the underlying issues to prevent recurrence. This often involves implementing technical debt reduction measures or improving system resilience as a direct outcome of incident analysis. Their experience showed that a strong collaboration between engineering and product teams is essential for navigating incidents effectively and ensuring that the user experience remains a central focus.

Organizational Resilience and Learning

Mastering incidents at scale requires building both technical and organizational resilience. The presenters discussed how PayFit has evolved its on-call rotation models to ensure adequate coverage while maintaining a healthy work-life balance for engineers. They touched upon the importance of automation in detecting and mitigating incidents faster. A core tenet of their approach was the implementation of structured post-mortems (or retrospectives) after every significant incident. These post-mortems are blameless, focusing on identifying the technical and process-related factors that contributed to the incident and defining actionable steps for improvement. By transforming crises into learning opportunities, PayFit continuously strengthens its systems and processes, reducing the frequency and impact of future incidents. Their journey over 18 months demonstrated that investing in these practices is crucial for any growing organization aiming to build robust and reliable systems.


Building Resilient Architectures: Patterns That Survive Failure

How to design systems that gracefully degrade, recover quickly, and scale under pressure.

1) Patterns for Graceful Degradation

When dependencies fail, your system should still provide partial service. Examples:

  • Show cached product data if the pricing service is down (sketched after this list).
  • Allow “read-only” mode if writes are failing.
  • Provide degraded image quality if the CDN is unavailable.
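
As a rough sketch of the first pattern, the lookup below falls back to the last cached value when the pricing service is unreachable (the Price type, pricingClient, and priceCache are illustrative assumptions; the cache could be Caffeine):

// Serve the last known price instead of failing the whole page when pricing is down.
public Price getPrice(String productId) {
    try {
        Price fresh = pricingClient.fetchPrice(productId);   // hypothetical remote client
        priceCache.put(productId, fresh);                     // keep the fallback value warm
        return fresh;
    } catch (Exception e) {
        Price cached = priceCache.getIfPresent(productId);    // e.g. Caffeine's Cache#getIfPresent
        if (cached != null) {
            return cached;                                    // possibly flagged as stale in the UI
        }
        throw e;                                              // nothing cached: surface the failure
    }
}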

2) Circuit Breakers

Prevent cascading failures with a circuit breaker, for example Resilience4j (Hystrix is now in maintenance mode):

// Requires the Resilience4j Spring Boot starter; the annotation is
// io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker.
@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
public Inventory getInventory(String productId) {
    return restTemplate.getForObject("/inventory/" + productId, Inventory.class);
}

// The fallback must match the original signature plus a trailing Throwable.
public Inventory fallbackInventory(String productId, Throwable t) {
    return new Inventory(productId, 0); // degrade to "unavailable" instead of failing the whole page
}

3) Retries with Backoff

Retries should be bounded and spaced out:

// Declarative retry via Resilience4j's @Retry annotation (Spring Boot starter).
@Retry(name = "paymentService", fallbackMethod = "fallbackPayment")
public PaymentResponse processPayment(PaymentRequest req) {
    return restTemplate.postForObject("/pay", req, PaymentResponse.class);
}

// Programmatic config: bounded attempts with exponential backoff and jitter.
// waitDuration is omitted because the interval function already defines the spacing.
RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.5)) // 200 ms initial, x2, ±50% jitter
    .build();

4) Scaling Microservices in Kubernetes/ECS

Scaling is not just replicas—it’s smart policies:

  • Kubernetes HPA: Scale pods based on CPU or custom metrics (e.g., p95 latency).
    kubectl autoscale deployment api --cpu-percent=70 --min=3 --max=10
  • ECS: Use Service Auto Scaling with CloudWatch alarms on queue depth.
  • Pre-warm caches: Scale up before big events (e.g., Black Friday).

SRE Principles: From Error Budgets to Everyday Reliability

How to define, measure, and improve reliability with concrete metrics, playbooks, and examples you can apply this week.

In a world where users expect instant, uninterrupted access, reliability is a feature. Site Reliability Engineering (SRE) brings engineering discipline to operations with a toolkit built on error budgets, SLIs/SLOs, and automation. This post turns those ideas into specifics: exact metrics, alert rules, dashboards, code and infra changes, and a lightweight maturity model you can use to track progress.


1) What Is SRE Culture?

1.1 Error Budgets: A Contract Between Speed and Stability

An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.

  • Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
  • Translation: Over 30 days (~43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
  • Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until burn rate normalizes.

1.2 SLIs & SLOs: A Common Language

SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.

Domain | SLI (what we measure) | Example SLO (target) | Notes
Availability | % successful requests (non-5xx + within timeout) | 99.9% over 30 days | Define failure modes clearly (timeouts, 5xx, dependency errors).
Latency | p95 end-to-end latency (ms) | ≤ 300 ms (p95), ≤ 800 ms (p99) | Track server time and total time (incl. downstream calls).
Error Rate | Failed / total requests | < 0.1% rolling 30 days | Include client cancels/timeouts if user-impacting.
Durability | Data loss incidents | 0 incidents / year | Backups + restore drills must be part of policy.

1.3 Automation Over Manual Ops

  • Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach.
  • Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
  • Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.

2) How Do You Measure Reliability?

2.1 Availability (“The Nines”)

SLO | Max Downtime / Year | Per 30 Days
99.0% | ~3d 15h | ~7h 12m
99.9% | ~8h 46m | ~43m
99.99% | ~52m 34s | ~4m 19s
99.999% | ~5m 15s | ~26s

2.2 Latency (Percentiles, Not Averages)

Track p50/p90/p95/p99. Averages hide tail pain. Tie your alerting to user-impacting percentiles.

  • API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
  • Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.

2.3 Error Rate

Define “failed” precisely: HTTP 5xx, domain-level errors (e.g., “payment declined” may be success from a platform perspective but failure for a specific business flow—track both).

2.4 Example SLI Formulas

# Availability SLI
availability = successful_requests / total_requests

# Latency SLI
latency_p95 = percentile(latency_ms, 95)

# Error Rate SLI
error_rate = failed_requests / total_requests

2.5 SLO-Aware Alerting (Burn-Rate Alerts)

Alert on error budget burn rate, not just raw thresholds; a short sketch of the arithmetic follows the examples below.

  • Fast burn: 2% budget in 1 hour → page immediately (could exhaust daily budget).
  • Slow burn: 10% budget in 24 hours → open a ticket, investigate within business hours.
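
A minimal sketch of the burn-rate arithmetic, assuming you can query failed and total request counts for the window (failedRequests, totalRequests, and pagerClient are placeholders):

// Burn rate = observed error ratio divided by the error ratio the SLO allows.
double sloTarget = 0.999;                                             // 99.9% availability SLO
double allowedErrorRatio = 1.0 - sloTarget;                           // 0.1% error budget
double observedErrorRatio = failedRequests / (double) totalRequests;  // from your metrics store
double burnRate = observedErrorRatio / allowedErrorRatio;

// Over a 30-day (720 h) window, a 1-hour burn rate of ~14.4 consumes ~2% of the budget
// in that hour (14.4 * 1/720 ≈ 0.02); a 24-hour burn rate of ~3 consumes ~10%.
if (burnRate > 14.4) {
    pagerClient.page("Fast burn: error budget at risk");
}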

3) How Do You Improve Reliability?

3.1 Code Fixes (Targeted, Measurable)

  • Database hot paths: Add missing indexes, rewrite N+1 queries (a fetch-join sketch follows this list), reduce chatty patterns; measure p95 improvement before/after.
  • Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
  • Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.
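
For the N+1 item above, a sketch with Spring Data JPA (the Order entity, its items association, and the repository are hypothetical):

// One query loads orders together with their items, instead of 1 query + N lazy loads.
public interface OrderRepository extends JpaRepository<Order, Long> {

    @Query("select distinct o from Order o join fetch o.items where o.status = :status")
    List<Order> findWithItemsByStatus(@Param("status") OrderStatus status);
}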

3.2 Infrastructure Changes

  • Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency (timeout and bulkhead sketch after this list).
  • Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
  • Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail.
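
Circuit breaker and retry configs appear in section 7.3 below; for the timeout and bulkhead pieces, a minimal Resilience4j sketch (names and values are illustrative):

// Per-dependency timeout: no single slow dependency should hold request threads indefinitely.
TimeLimiterConfig timeoutConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(2))
    .build();

// Bulkhead: cap concurrent calls to one dependency so it cannot exhaust the whole pool.
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(20)
    .maxWaitDuration(Duration.ofMillis(50))   // fail fast instead of queueing callers
    .build();

TimeLimiter paymentTimeout = TimeLimiter.of("paymentService", timeoutConfig);
Bulkhead paymentBulkhead = Bulkhead.of("paymentService", bulkheadConfig);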

3.3 Observability Enhancements

  • Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
  • Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
  • Logging: Structure logs (JSON); include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, release labels (MDC sketch after this list).
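
For the logging bullet, a minimal sketch with SLF4J's MDC (field names follow the labels above; the JSON output is assumed to come from a structured encoder such as logstash-logback-encoder):

// Attach correlation fields so every log line emitted while handling this request carries them.
MDC.put("request_id", requestId);
MDC.put("tenant_id", tenantId);
MDC.put("release", releaseVersion);
try {
    log.info("checkout started");   // emitted as JSON fields by the structured encoder
    // ... handle the request ...
} finally {
    MDC.clear();                    // avoid leaking context across pooled threads
}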

3.4 Reliability Improvement Playbook (Weekly Cadence)

  1. Review SLO attainment & burn-rate charts.
  2. Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
  3. Propose one code fix and one infra/observability change.
  4. Deploy via canary; compare SLI before/after; document result.
  5. Close the loop: update runbooks, tests, alerts.

4) Incident Response: From Page to Postmortem

4.1 During the Incident

  • Own the page: acknowledge within minutes; post initial status (“investigating”).
  • Stabilize first: roll back most recent release; fail over; enable feature flag fallback.
  • Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
  • Comms: update stakeholders every 15–30 minutes until stable.

4.2 After the Incident (Blameless Postmortem)

  • Facts first: timeline, impact, user-visible symptoms, SLIs breached.
  • Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
  • Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
  • Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.

5) Common Anti-Patterns (and What to Do Instead)

  • Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
  • Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
  • Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
  • Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
  • Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.

6) A 30-Day Quick Start

  1. Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
  2. Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
  3. Week 3: Add tracing to top 3 endpoints; implement circuit breaker + timeouts for the noisiest dependency.
  4. Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.

7) Concrete Examples & Snippets

7.1 Example SLI Prometheus (pseudo-metrics)

# Availability SLI (success = non-5xx, matching the definition above)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

7.2 Burn-Rate Alert (illustrative)

# Fast-burn: page if 2% of monthly budget is burned in 1 hour
# Slow-burn: ticket if 10% burned over 24 hours
# (Use your SLO window and target to compute rates)

7.3 Resilience Config (Java + Resilience4j sketch)

// Circuit breaker + retry with jittered backoff
CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
  .failureRateThreshold(50f)
  .waitDurationInOpenState(Duration.ofSeconds(30))
  .permittedNumberOfCallsInHalfOpenState(5)
  .slidingWindowSize(100)
  .build();

RetryConfig retry = RetryConfig.custom()
  .maxAttempts(3)
  .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2)) // 200 ms initial, x2, ±20% jitter
  .build();
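
To wire these configs into an actual call, one option is the Decorators helper from resilience4j-all (inventoryClient is a hypothetical remote client; Inventory is the type from the earlier snippet):

CircuitBreaker inventoryCb = CircuitBreaker.of("inventoryService", cb);
Retry inventoryRetry = Retry.of("inventoryService", retry);

// Retry wraps the circuit-breaker-protected call, so each attempt goes through the breaker.
Supplier<Inventory> guarded = Decorators
    .ofSupplier(() -> inventoryClient.getInventory(productId))
    .withCircuitBreaker(inventoryCb)
    .withRetry(inventoryRetry)
    .decorate();

Inventory inventory = guarded.get();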

7.4 Kubernetes Health Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5

8) Lightweight SRE Maturity Model

Level | Practices | What to Add Next
Level 1: Awareness | Basic monitoring, ad-hoc on-call, manual deployments | Define SLIs/SLOs, create SLO dashboard, add canary deploys
Level 2: Control | Burn-rate alerts, incident runbooks, partial automation | Tracing, circuit breakers, chaos drills, auto-rollback
Level 3: Optimization | Error budget policy enforced, game days, automated rollbacks | Multi-region resilience, SLO-gated releases, org-wide error budgets

9) Sample Reliability OKRs

  • Objective: Improve checkout service reliability without slowing delivery.
    • KR1: Availability SLO from 99.5% → 99.9% (30-day window).
    • KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
    • KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
    • KR4: Implement canary + auto-rollback for 100% of releases.

Conclusion

Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.

Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.

Leading Through Reliability: Coaching, Mentoring, and Decision-Making Under Pressure

SRE leadership isn’t only about systems—it’s about people, processes, and resilience under fire.

1) Coaching Team Members Through Debugging

When junior engineers struggle with incidents, I walk them through the scientific method of debugging:

  1. Reproduce the problem.
  2. Collect evidence (logs, metrics, traces).
  3. Form a hypothesis.
  4. Test, measure, refine.

For example, in a memory leak case, I let a junior take the heap dump and explain findings, stepping in only to validate conclusions.

2) Introducing SRE Practices to New Teams

In teams without SRE culture, I start small:

  • Define a single SLO for a critical endpoint.
  • Introduce a burn-rate alert tied to that SLO.
  • Run a blameless postmortem after the first incident.

This creates buy-in without overwhelming the team with jargon.

3) Prioritizing and Delegating in High-Pressure Situations

During outages, prioritization is key:

  • Delegate evidence gathering (thread dumps, logs) to one engineer.
  • Keep communication flowing with stakeholders (status every 15 minutes).
  • Focus leadership on mitigation and rollback decisions.

After stabilization, I lead the postmortem, ensuring learnings feed back into automation, monitoring, and runbooks.

Observability for Modern Systems: From Metrics to Traces

Good monitoring doesn’t just tell you when things are broken—it explains why.

1) White-Box vs Black-Box Monitoring

White-box: metrics from inside the system (CPU, memory, app metrics). Example: http_server_requests_seconds from Spring Actuator.

Black-box: synthetic probes simulating user behavior (ping APIs, load test flows). Example: periodic “buy flow” test in production.
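
A minimal black-box probe sketch with Java's built-in HttpClient (the URL, 30-second schedule, and the SLF4J log field names are assumptions):

// Synthetic probe: exercise a user-facing endpoint on a schedule and record the outcome.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
HttpClient client = HttpClient.newHttpClient();

scheduler.scheduleAtFixedRate(() -> {
    HttpRequest probe = HttpRequest.newBuilder(URI.create("https://shop.example.com/api/buy-flow/health"))
        .timeout(Duration.ofSeconds(2))
        .GET()
        .build();
    try {
        HttpResponse<Void> response = client.send(probe, HttpResponse.BodyHandlers.discarding());
        log.info("probe_result endpoint=buy-flow status={} success={}",
                 response.statusCode(), response.statusCode() < 500);
    } catch (Exception e) {
        log.warn("probe_result endpoint=buy-flow success=false error={}", e.toString());
    }
}, 0, 30, TimeUnit.SECONDS);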

2) Tracing Distributed Transactions

Use OpenTelemetry to propagate context across microservices:

// Gradle dependency for the OTLP exporter (alongside the OpenTelemetry SDK/starter you already use)
implementation "io.opentelemetry:opentelemetry-exporter-otlp:1.30.0"

// Create a span around the checkout flow; the tracer comes from the SDK,
// e.g. openTelemetry.getTracer("checkout-service")
Span span = tracer.spanBuilder("checkout").startSpan();
try (Scope scope = span.makeCurrent()) {
    paymentService.charge(card);
    inventoryService.reserve(item);
} catch (Exception e) {
    span.recordException(e);   // failed spans stay useful for debugging
    throw e;
} finally {
    span.end();
}

These traces flow into Jaeger or Grafana Tempo to visualize bottlenecks across services.

3) Example Dashboard for a High-Value Service

  • Availability: % successful requests (SLO vs actual).
  • Latency: p95/p99 end-to-end response times.
  • Error Rate: 4xx vs 5xx breakdown.
  • Dependency Health: DB latency, cache hit ratio, downstream service SLOs.
  • User metrics: active sessions, checkout success rate.

Java/Spring Troubleshooting: From Memory Leaks to Database Bottlenecks

Practical strategies and hands-on tips for diagnosing and fixing performance issues in production Java applications.

1) Approaching Memory Leaks

Memory leaks in Java often manifest as OutOfMemoryError exceptions or rising heap usage visible in monitoring dashboards. My approach:

  1. Reproduce in staging: Apply the same traffic profile (e.g., JMeter load test).
  2. Collect a heap dump:
    jmap -dump:format=b,file=heap.hprof <PID>
  3. Analyze with tools: Eclipse MAT, VisualVM, or YourKit to detect uncollected references.
  4. Fix common causes:
    • Unclosed streams or ResultSets.
    • Static collections holding references.
    • Caches without eviction policies (e.g., replace a plain HashMap with Caffeine; see the sketch below).
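
A minimal sketch of the Caffeine replacement (cache size, TTL, and the loader are illustrative):

// Bounded, self-evicting cache instead of an unbounded static HashMap.
Cache<String, Product> productCache = Caffeine.newBuilder()
    .maximumSize(10_000)                         // cap entries so the heap cannot grow without bound
    .expireAfterWrite(Duration.ofMinutes(10))    // drop stale entries automatically
    .build();

Product product = productCache.get(productId, id -> productRepository.load(id)); // hypothetical loader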

2) Profiling and Fixing High CPU Usage

High CPU can stem from tight loops, inefficient queries, or excessive logging.

  • Step 1: Sample threads
    jstack <PID> > thread-dump.txt

    Identify “hot” threads consuming CPU.

  • Step 2: Profile with a low-overhead profiler such as async-profiler or Java Flight Recorder.
    java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar
  • Step 3: Refactor:
    • Replace String concatenation in loops with StringBuilder.
    • Optimize regex (reuse a compiled Pattern instead of String.matches(); sketch below).
    • Review logging level (DEBUG inside loops is expensive).
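
For the regex item, a sketch of reusing a compiled Pattern instead of String.matches(), which recompiles the pattern on every call (the pattern itself is illustrative):

// Compile once, reuse for every call on the hot path.
private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

boolean isEmail(String input) {
    return EMAIL.matcher(input).matches();
}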

3) Tuning GC for Low-Latency Services

Garbage collection (GC) can cause pauses. For trading, gaming, or API services, tuning matters:

  • Choose the right collector:
    • G1GC for balanced throughput and latency (default in recent JDKs).
    • ZGC or Shenandoah for ultra-low latency workloads (<10ms pauses).
  • Sample configs:
    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled
  • Monitor GC logs (enable with -Xlog:gc*) using tools such as GCToolKit or Grafana dashboards.

4) Handling Database Bottlenecks

Spring apps often hit bottlenecks in DB queries rather than CPU.

  1. Enable SQL logging: in application.properties
    spring.jpa.show-sql=true
  2. Profile queries: Use p6spy or database-level reports (e.g., Oracle AWR).
  3. Fixes:
    • Add missing indexes (EXPLAIN ANALYZE is your friend).
    • Batch inserts (saveAll() in Spring Data with hibernate.jdbc.batch_size).
    • Introduce caching (Spring Cache, Redis) for hot reads (see the sketch after this list).
    • Use connection pools like HikariCP with tuned settings:
      spring.datasource.hikari.maximum-pool-size=30
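
As a sketch of the caching fix with Spring's cache abstraction (ProductService and productRepository are hypothetical; the backing store could be Caffeine or Redis, and @EnableCaching must be declared on a configuration class):

@Service
public class ProductService {

    // Hot reads are served from the "products" cache; the database is hit only on a miss.
    @Cacheable(cacheNames = "products", key = "#id")
    public Product findProduct(long id) {
        return productRepository.findById(id).orElseThrow();
    }

    // Evict on update so cached reads stay consistent.
    @CacheEvict(cacheNames = "products", key = "#product.id")
    public Product updateProduct(Product product) {
        return productRepository.save(product);
    }
}
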
Bottom line: Troubleshooting is both art and science—measure, hypothesize, fix, and validate with metrics.