[SpringIO2022] How to foster a Culture of Resilience
Benjamin Wilms, founder of Steadybit, delivered a compelling session at Spring I/O 2022, exploring how to build a culture of resilience through chaos engineering. Drawing from his experience and the evolution of chaos engineering since his 2019 Spring I/O talk, Benjamin emphasized proactive strategies to enhance system reliability. His presentation combined practical demonstrations with a framework for integrating resilience into development workflows, advocating for collaboration and automation.
Understanding Resilience and Chaos Engineering
Benjamin began by defining resilience as the outcome of well-architected, automated, and thoroughly tested systems capable of recovering from faults while delivering customer value. Unlike traditional stability, resilience involves handling partial outages with fallbacks or alternatives, ensuring service continuity. He introduced chaos engineering as a method to test this resilience by intentionally injecting faults—latency, exceptions, or service outages—to build confidence in system capabilities.
Chaos engineering involves defining a steady state (e.g., successful Netflix play button clicks), forming hypotheses (e.g., surviving a payment service outage), and running experiments to verify outcomes. Benjamin highlighted its evolution from a niche practice at Netflix to a growing community discipline, but noted its time-intensive nature often deters teams. He stressed that resilience extends beyond systems to organizational responsiveness, such as detecting incidents in seconds rather than minutes.
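The steady state, hypothesis, and experiment loop described above can be sketched in a few lines of Python. This is a toy model, not anything shown in the talk: `ToyCheckout`, `run_experiment`, and the 95% threshold used as the steady-state criterion are hypothetical names and values chosen for illustration, and the "fault" is a flag flipped in memory rather than a real outage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool  # was the system healthy before the fault?
    steady_during: bool  # did it stay healthy while the fault was active?

    @property
    def hypothesis_holds(self) -> bool:
        return self.steady_before and self.steady_during


def success_rate(request: Callable[[], bool], samples: int = 100) -> float:
    """Fraction of sampled requests that succeed."""
    return sum(1 for _ in range(samples) if request()) / samples


def run_experiment(
    request: Callable[[], bool],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    threshold: float = 0.95,  # assumed steady-state criterion
) -> ExperimentResult:
    # 1. Verify the steady state first; never attack an unhealthy system.
    before = success_rate(request) >= threshold
    if not before:
        return ExperimentResult(False, False)
    # 2. Inject the fault, 3. re-check the steady state, 4. always roll back.
    inject_fault()
    try:
        during = success_rate(request) >= threshold
    finally:
        remove_fault()
    return ExperimentResult(before, during)


class ToyCheckout:
    """Hypothetical checkout service that depends on a payment service."""

    def __init__(self, has_fallback: bool) -> None:
        self.has_fallback = has_fallback
        self.payment_up = True

    def request(self) -> bool:
        # With a fallback the request still succeeds during a payment outage.
        return self.payment_up or self.has_fallback

    def payment_outage(self) -> None:
        self.payment_up = False

    def payment_restore(self) -> None:
        self.payment_up = True


def demo(has_fallback: bool) -> ExperimentResult:
    shop = ToyCheckout(has_fallback)
    return run_experiment(shop.request, shop.payment_outage, shop.payment_restore)
```

Calling `demo(True)` models the hypothesis "checkout survives a payment outage" holding, while `demo(False)` models it failing. The useful output of a real experiment is the same kind of yes/no answer against a stated hypothesis.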
Pitfalls of Ad-Hoc Chaos Engineering
To illustrate common mistakes, Benjamin demonstrated a flawed approach using a Kubernetes-based microservice system with a gateway and three backend services. Running a random “delete pod” attack on the hotel service caused errors in the gateway’s product list aggregation, visible in a demo UI. However, the experiment yielded little insight: it confirmed that the attack had an impact but produced no actionable learnings. He critiqued such ad-hoc attacks, run with tools like Pumba, for disrupting workflows and requiring expertise in CI/CD integration, diverting focus from core development.
This approach fails to generate knowledge or improve systems, often becoming a “rabbit hole” of additional work. Benjamin argued that starting with tools or attacks, rather than clear objectives, undermines the value of chaos engineering, leaving teams with vague results and no clear path to enhancement.
Building a Culture of Resilience
Benjamin proposed a structured approach to foster resilience, starting with the “why”: understanding motivations like surviving AWS zone outages or ensuring checkout services handle payment downtimes. The “what” involves defining specific capabilities, such as maintaining 95% request success during pod failures or implementing retry patterns. He advocated encoding these capabilities as policies—code-based checks integrated into the development pipeline.
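As a rough illustration of one such capability, the sketch below pairs a simple retry pattern with a measurement of request success against a 95% target. Everything here is an assumption made for the example: `FlakyBackend` is a hypothetical stand-in that fails every other call, and `call_with_retry` and `measure` are illustrative helpers, not part of any tool mentioned in the talk.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.0,  # seconds; 0 keeps the example fast
) -> T:
    """Retry on ConnectionError with exponential backoff between attempts."""
    last_error: Optional[ConnectionError] = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    assert last_error is not None
    raise last_error


class FlakyBackend:
    """Hypothetical backend: every odd-numbered call fails."""

    def __init__(self) -> None:
        self.calls = 0

    def call(self) -> str:
        self.calls += 1
        if self.calls % 2 == 1:
            raise ConnectionError("pod restarting")
        return "ok"


def measure(request: Callable[[], object], total: int = 100) -> float:
    """Fraction of requests that complete without a ConnectionError."""
    ok = 0
    for _ in range(total):
        try:
            request()
            ok += 1
        except ConnectionError:
            pass
    return ok / total


plain = measure(FlakyBackend().call)
backend = FlakyBackend()
with_retry = measure(lambda: call_with_retry(backend.call))
print(f"without retry: {plain:.0%}, with retry: {with_retry:.0%}")
```

In this toy setup the bare client falls well short of the 95% target, while the retrying client meets it. Whether retries are appropriate in a real service depends on factors the sketch ignores, such as idempotency and load amplification.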
In a demo, Benjamin showed how to define a policy for the gateway service, specifying pod redundancy and steady-state checks via a product list endpoint. The policy, stored in the codebase, runs in a CI/CD pipeline (e.g., GitHub Actions) on a staging environment, verifying resilience after each commit. This automation ensures continuous validation without manual intervention, embedding resilience into daily workflows. Policies include pre-built experiments from communities (e.g., Zalando) or static weak spot checks, like missing Kubernetes readiness probes, making resilience accessible to all developers.
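A static weak spot check of the kind mentioned above can be approximated with a small linter over a Deployment manifest. The sketch below is an assumption-laden illustration, not the policy format from the demo: it takes a manifest already parsed into a Python dict, `find_weak_spots` is a hypothetical helper name, and the minimum of two replicas is an assumed redundancy rule. The fields it reads (`spec.replicas`, `spec.template.spec.containers[].readinessProbe`) are standard Kubernetes Deployment fields.

```python
from typing import Any, Dict, List


def find_weak_spots(deployment: Dict[str, Any], min_replicas: int = 2) -> List[str]:
    """Return human-readable findings for a parsed Deployment manifest."""
    findings: List[str] = []
    name = deployment.get("metadata", {}).get("name", "<unnamed>")
    spec = deployment.get("spec", {})

    # Kubernetes defaults to a single replica when the field is omitted.
    replicas = spec.get("replicas", 1)
    if replicas < min_replicas:
        findings.append(
            f"{name}: replicas={replicas}, expected at least {min_replicas}"
        )

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        if "readinessProbe" not in container:
            cname = container.get("name", "<unnamed>")
            findings.append(f"{name}/{cname}: no readinessProbe configured")
    return findings


# Hypothetical manifest for the gateway service, with both weak spots present.
gateway = {
    "kind": "Deployment",
    "metadata": {"name": "gateway"},
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {"containers": [{"name": "gateway", "image": "example/gateway"}]}
        },
    },
}

for finding in find_weak_spots(gateway):
    print(finding)
```

A check like this could run in a pipeline step and fail the build when the list of findings is non-empty, which requires no running cluster and no fault injection.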
Organizational Strategies and Community Impact
Benjamin addressed organizational adoption, suggesting a central component to schedule experiments and avoid overlapping tests in shared environments. For consulting scenarios, he recommended analyzing past incidents and running experiments that recreate those outages to demonstrate resilience gaps. He shared a case where a client’s system collapsed during a rolling update under load, underscoring the need for combined testing scenarios.
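The idea of a central component that prevents overlapping experiments can be sketched as a simple booking service. This is one possible minimal design under stated assumptions, not the component described in the talk: `Slot` and `ExperimentScheduler` are hypothetical names, time is modeled as integer minutes, and two experiments conflict only when they target the same environment during overlapping half-open intervals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slot:
    team: str
    environment: str
    start: int  # minutes from an arbitrary origin, inclusive
    end: int    # exclusive


class ExperimentScheduler:
    """Accepts an experiment slot only if the environment is free."""

    def __init__(self) -> None:
        self._booked: List[Slot] = []

    def request(self, slot: Slot) -> bool:
        if slot.end <= slot.start:
            raise ValueError("slot must have positive duration")
        for other in self._booked:
            same_env = other.environment == slot.environment
            # Half-open intervals overlap when each starts before the other ends.
            overlaps = slot.start < other.end and other.start < slot.end
            if same_env and overlaps:
                return False
        self._booked.append(slot)
        return True


scheduler = ExperimentScheduler()
print(scheduler.request(Slot("checkout", "staging", 0, 30)))   # accepted
print(scheduler.request(Slot("search", "staging", 20, 40)))    # rejected: overlaps
print(scheduler.request(Slot("search", "staging", 30, 60)))    # accepted: back to back
```

A production version would also need persistence, authentication, and a way to cancel or time out bookings, all of which are out of scope for this sketch.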
He encouraged starting with static linters to identify configuration risks and replaying past incidents to prevent recurrence. By integrating resilience checks into pipelines, teams can focus on feature delivery while maintaining reliability. Benjamin’s vision of a resilience culture—where proactive testing is instinctive—resonates with developers seeking to balance velocity and stability.