Chaos engineering: Stress-testing the cloud


Cloud architectures are so complex that it is often all but impossible to predict and prevent every failure scenario. Enter the emerging discipline of chaos engineering, which aims to uncover failure points in production systems before they become disasters.

Chaos engineering: Finding cloud bugs before they become disasters

As cloud architectures have grown more complex and distributed, failures and outages have become all too frequent, which can erode executive confidence in the value of cloud computing. Yet architecting for the cloud is vastly different from architecting traditional, monolithic systems, and it is maddeningly difficult to predict, and plan for, every way a system can fail. In this episode of the podcast, Mike Kavis and his guest, Kolton Andrus of Gremlin, discuss the emerging discipline of chaos engineering and how it is used to stress-test cloud-based systems in production, which can increase uptime, reduce risk, and accelerate the time-to-value of cloud deployments.

We're taking the engineering discipline that we've developed, and we're applying it to tame the chaos that's inherent in our systems. And the outcome is we would love a boring, predictable system that just works.

Kolton Andrus, co-founder and CEO of Gremlin, was previously a chaos engineer at Netflix improving streaming reliability and operating the edge services. Prior to that, he improved the performance and reliability of the Amazon Retail website.
