
Breaking Things on Purpose: The Real Power of Chaos Engineering with Benjamin Wilms

July 16, 2025
24:22
Ole Lensmar, CTO, Testkube
Benjamin Wilms, Co-Founder & CEO, Steadybit


Transcript

Welcome to the Cloud Native Testing podcast. I'm Ole Lensmar, your host, and I've spent years building testing tools like SoapUI and Swagger, and now I'm the CTO of Testkube, where we're redefining testing for the cloud native landscape. In this podcast we invite people in the space to discuss with us the challenges, solutions, and best practices for testing in Kubernetes and beyond, covering automation, CI/CD, GitOps, and more.

This podcast is proudly sponsored by Testkube. If you're interested in how to test cloud native applications and infrastructure, subscribe and stay tuned.

Now let's dive into the episode. Thank you for joining.

---

Ole Lensmar:
Hello and welcome to another episode of the Cloud Native Testing Podcast. I'm super thrilled to be joined by Benjamin Wilms today from Steadybit. Ben, welcome to the show.

Benjamin Wilms:
Thank you very much for having me.

Ole Lensmar:
It's a pleasure and an honor. Tell us a bit about yourself—where you are and what you're doing.

Benjamin Wilms:
I started my career around 25 years ago as a software engineer. I worked in logistics, moved into consulting, and learned a lot—especially from failure. If you're surrounded by the right people and culture, failure becomes a learning opportunity.
About nine years ago, I discovered chaos engineering—intentionally causing failure to learn from it. Now, I'm the founder and CEO of Steadybit, a chaos engineering platform.

Ole Lensmar:
Very cool. As you were talking, I thought about how chaos engineering really formalized the idea of failing forward. Instead of just reacting to failure, it's about provoking it to evolve systems.

Benjamin Wilms:
Exactly. It’s a mindset shift—intentionally injecting failure to validate whether your system and organization can handle it. Hope is not a strategy.

Ole Lensmar:
No, it's not. So, tell us more about Steadybit and how it's different in the chaos engineering space.

Benjamin Wilms:
During my time as a consultant, I helped companies across Germany adopt chaos engineering. I found existing tools—open source or otherwise—were time-consuming to set up. People were spending too much time configuring tools instead of running meaningful experiments.
We founded Steadybit in 2019 to change that. Our mission is to make chaos engineering accessible to non-experts so they can improve reliability without getting lost in tooling.

Ole Lensmar:
Makes sense. At first, chaos engineering felt like something for companies at massive scale. But now it seems like smaller teams can benefit too?

Benjamin Wilms:
Definitely. There’s no minimum scale. Some of our customers are teams of six or seven, just starting out. They want to trust the system they’re building.
In smaller teams, people have to build and operate everything, so surprises in production are costly.
Of course, large organizations use chaos engineering too—Netflix was an early pioneer. But everyone can and should do it. You don’t even need a tool. Just SSH into a VM and shut it down.

Ole Lensmar:
That’s a very manual, but real, way to start. Chaos engineering seems like an ops responsibility, but in testing, we talk a lot about shift left. Can chaos engineering shift left too? When does it add value in the SDLC?

Benjamin Wilms:
Good question. When I talk about a "system," I mean more than just technical components—it's the software, the pipeline, the people, the processes. Chaos engineering is a team sport.
SREs feel the pain the most, but they can’t fix everything. They need support from the whole organization to make systems more resilient.

Ole Lensmar:
Let’s say I’m a developer working on a few microservices. I might assume chaos engineering comes later. How could I get involved earlier?

Benjamin Wilms:
Start by stopping happy path testing. Your unit or integration tests might pass, but production is messy—latency, network issues, and corrupted packets happen.
You need to design components that can still function under those conditions. That’s what resilience is about.
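
A minimal Go sketch of that idea is shown below: a downstream call wrapped in a per-attempt timeout and a few retries, falling back to a degraded answer instead of failing outright. The endpoint, time budgets, and fallback are illustrative assumptions, not a prescribed implementation.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// callWithTimeout makes a single attempt with its own deadline and drains the
// response body before returning, so the context can be cancelled safely.
func callWithTimeout(url string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.Status, nil
}

func main() {
	var status string
	var err error

	// Retry a few times with a growing backoff instead of assuming the happy path.
	for attempt := 1; attempt <= 3; attempt++ {
		// The URL is a placeholder for a hypothetical internal dependency.
		status, err = callWithTimeout("http://inventory.internal/items", 500*time.Millisecond)
		if err == nil {
			break
		}
		time.Sleep(time.Duration(attempt) * 200 * time.Millisecond)
	}
	if err != nil {
		// Degrade gracefully instead of failing the whole request chain.
		fmt.Println("inventory unavailable, serving cached data:", err)
		return
	}
	fmt.Println("inventory responded:", status)
}
```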

Ole Lensmar:
Especially in cloud-native environments where infrastructure is fluid, it’s hard to know what conditions your app will face.

Benjamin Wilms:
Exactly. The system is constantly evolving—new features, more users, changes in traffic. You need to design for unpredictability.

Ole Lensmar:
Is there a way for developers to simulate that locally? Like adding network latency?

Benjamin Wilms:
Yes. Tools like Toxiproxy can simulate latency or limit request rates. You can build this into your integration tests.
But let’s be honest—developers often don’t think about security or reliability until something breaks. They’re measured on how fast they can ship features. Eventually, though, that approach creates a monster in production.
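
As a rough sketch of what that can look like in practice, an integration test can route its traffic through a local Toxiproxy instance and add a latency toxic before asserting on behavior. The proxy name, ports, latency budget, and the checkout stub below are assumptions, and the client import path depends on the Toxiproxy version you run.

```go
package checkout_test

import (
	"net"
	"testing"
	"time"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client" // import path varies by Toxiproxy version
)

// checkout stands in for the real code under test; here it just dials the
// proxied address so the example stays self-contained.
func checkout() error {
	conn, err := net.DialTimeout("tcp", "localhost:26379", 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func TestCheckoutSurvivesSlowDatabase(t *testing.T) {
	// Assumes a Toxiproxy server is already running on its default port 8474.
	client := toxiproxy.NewClient("localhost:8474")

	// Route test traffic through the proxy instead of straight to the database.
	proxy, err := client.CreateProxy("postgres", "localhost:26379", "localhost:5432")
	if err != nil {
		t.Fatalf("creating proxy: %v", err)
	}
	defer proxy.Delete()

	// Inject 500 ms of latency on responses coming back from the database.
	if _, err := proxy.AddToxic("db_latency", "latency", "downstream", 1.0, toxiproxy.Attributes{
		"latency": 500,
	}); err != nil {
		t.Fatalf("adding latency toxic: %v", err)
	}

	// The hypothesis: the code under test still answers within its budget.
	start := time.Now()
	if err := checkout(); err != nil {
		t.Fatalf("checkout failed under latency: %v", err)
	}
	if time.Since(start) > 2*time.Second {
		t.Fatalf("checkout exceeded its latency budget")
	}
}
```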

Ole Lensmar:
We’ve talked in other episodes about mocking and injecting errors in test environments. That’s a form of chaos testing too, right?

Benjamin Wilms:
It's a good start. But the real value of chaos engineering comes when you test the full system as deployed—not just individual components.

Ole Lensmar:
How do you foster a culture of chaos testing? Is it through automated testing, or team-wide participation?

Benjamin Wilms:
Before you start automating, your system should be ready for it. Some customers run two or three game days a year and find major issues that take months to fix.
If you automate too early, you’ll be overwhelmed with issues. So, first, raise your system's reliability, then automate.

Ole Lensmar:
Game day? Can you explain that term?

Benjamin Wilms:
A game day is a chaos experiment run with everyone in the same room—SREs, engineers, product, and even decision-makers.
Why? Because decision-makers often push for features over fixing invisible technical debt.
In a game day, you simulate a scenario—like a zone outage in Kubernetes—and discuss how your system will behave. If someone already knows it will fail, don’t run the experiment. Fix it first. Then run the experiment to validate.

Ole Lensmar:
So the goal is not failure—it’s findings.

Benjamin Wilms:
Exactly. You inject failure to learn how your system reacts. The outcome should be insights you can act on.

Ole Lensmar:
Is there a case for fully automated chaos testing, or does that miss the point?

Benjamin Wilms:
It’s useful—once you’ve validated an experiment during a game day and it passes, you can include it in your CI/CD pipeline as a regression test. That way, every deployment re-validates resilience.

Ole Lensmar:
What kinds of experiments are common?

Benjamin Wilms:
Start simple: kill a pod or shut down a node.
Before running the experiment, define your hypothesis—for example, "If one node goes down, the system should still serve traffic."
You can also simulate DNS outages, corrupt packets, or throttle bandwidth—using tools like Pumba, Toxiproxy, or even iptables.
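
A very small version of that first experiment, sketched against the Kubernetes client-go API, might look like the following. The namespace, label selector, and health endpoint are placeholders, and a platform or a game day would normally wrap much more safety around this.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()

	// Load kubeconfig the way kubectl does; everything below targets placeholders.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Hypothesis: if one checkout pod is killed, the service keeps serving traffic.
	pods, err := clientset.CoreV1().Pods("shop").List(ctx, metav1.ListOptions{
		LabelSelector: "app=checkout",
	})
	if err != nil || len(pods.Items) == 0 {
		panic(fmt.Sprintf("no target pods found: %v", err))
	}

	victim := pods.Items[rand.Intn(len(pods.Items))].Name
	fmt.Println("killing pod:", victim)
	if err := clientset.CoreV1().Pods("shop").Delete(ctx, victim, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// Validate the hypothesis while the replacement pod is still being scheduled.
	time.Sleep(5 * time.Second)
	resp, err := http.Get("http://checkout.shop.svc.cluster.local/healthz")
	if err != nil {
		fmt.Println("hypothesis failed, service unreachable:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("hypothesis failed, unexpected status:", resp.Status)
		return
	}
	fmt.Println("hypothesis held: service stayed available")
}
```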

Ole Lensmar:
Do these experiments have to run in production?

Benjamin Wilms:
Eventually, yes—but start earlier. In finance, running in production might not be allowed. But in gaming, where there's no staging system at scale, production is the only option.
Start on your local machine. Learn how your service behaves under stress. Then, when you're confident, run in production.

Ole Lensmar:
Is Kubernetes an enabler for chaos engineering?

Benjamin Wilms:
Yes, it helps implement resilience patterns by default. But there are risks—like zones in your cluster being managed by a cloud provider.
If you’re not aware of how your pods are distributed across zones, a zone outage could take down your whole system. You need to account for that.
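
One way to check that assumption, again sketched with client-go under assumed names, is to count a workload's replicas per zone using the standard topology.kubernetes.io/zone node label; the namespace and selector are placeholders.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Map each node to its zone using the well-known topology label.
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	nodeZone := map[string]string{}
	for _, n := range nodes.Items {
		nodeZone[n.Name] = n.Labels["topology.kubernetes.io/zone"]
	}

	// Count the workload's pods per zone; "shop" and "app=checkout" are placeholders.
	pods, err := clientset.CoreV1().Pods("shop").List(ctx, metav1.ListOptions{
		LabelSelector: "app=checkout",
	})
	if err != nil {
		panic(err)
	}
	perZone := map[string]int{}
	for _, p := range pods.Items {
		perZone[nodeZone[p.Spec.NodeName]]++
	}
	fmt.Println("pods per zone:", perZone)
	if len(perZone) < 2 {
		fmt.Println("warning: all replicas sit in a single zone; a zone outage takes them all down")
	}
}
```

In practice, topology spread constraints or pod anti-affinity on the Deployment are the usual way to enforce that the replicas actually land in more than one zone.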

Ole Lensmar:
Let’s talk AI—it’s everywhere. Does it play a role in chaos engineering?

Benjamin Wilms:
Right now, most teams don’t trust AI to inject failure in real time. But AI can help analyze experiments and highlight recurring issues.
If you’re running 60,000 experiments a year, AI can identify patterns and help prioritize what to fix.

Ole Lensmar:
What about chaos testing for AI systems themselves?

Benjamin Wilms:
Not for the models, but for the infrastructure underneath—like GPUs—you can simulate load or faults there. I haven’t seen chaos testing directly for LLMs yet.

Ole Lensmar:
What’s the future of chaos testing look like?

Benjamin Wilms:
Tools will help more in the pre-experiment phase—by analyzing your system and pointing out weak spots before you even run an experiment.
In Steadybit, we have a feature called "Advice" that highlights risks based on system data. It reduces the need for constant game days and lets you improve proactively.

Ole Lensmar:
If someone’s just starting out, how can they avoid getting overwhelmed?

Benjamin Wilms:
Start with a platform that makes setup easy. Focus on the experiment itself, not on infrastructure. There are good tools out there—Steadybit and others—that let you get going quickly.

Ole Lensmar:
And culture matters too, right?

Benjamin Wilms:
Yes, you need a culture where incidents are learning opportunities—not about blame. The whole system and team should grow from failure.

Ole Lensmar:
Absolutely. Benjamin, it’s been great to have you on. I’ve learned so much. Thanks for joining, and thanks to everyone listening. See you next time.

Benjamin Wilms:
Thank you very much.