Podcast
Amplify software resilience and learn from failures by asking how, not why
Part of the Architecting the Cloud podcast series
When systems fail, the blame game often begins, and, surprisingly, it begins with a simple question: "Why?" Instead, to build more resilient systems, as well as more trust among team members, ask what happened and how.
Build resilience: When systems fail, don't ask why. Ask what happened and how
Critical failures happen to software systems; it’s just a fact. Learning from those failures and focusing on building resilience is more important than ferreting out who’s to blame. In this episode of the podcast, Mike Kavis and his guest, Netflix’s Jessica DeVita, discuss resilience engineering and why asking "why" something happened is not the best way to deal with system failures. Instead, Jessica introduces the concepts of human factors thinking and local rationality and recommends that software teams ask what happened and how. That keeps blame out of the equation and lets teams build a culture of trust, which in turn helps them build more resilient systems. She also offers advice on how companies can begin their own resilience engineering journey, starting with unlearning the idea that human error is a cause.
Disclaimer: As referenced in this podcast, “Amazon” refers to AWS (Amazon Web Services) and “Google” refers to GCP (Google Cloud Platform).
Fundamentally, if you move away from the why question, you have a better shot at learning more from this incident.
Jessica DeVita is a senior applied resilience engineer at Netflix, where she investigates incidents and outages in production software. She has 20 years of experience in IT operations across industries such as medical devices, entertainment, and large-scale cloud computing.
How can SRE help organizations achieve better and faster results?
Site Reliability Engineering (SRE) offers many benefits, but it has to be specific to the organization implementing it to harness its full value.
Chaos engineering: Stress-testing the cloud
Cloud architectures are incredibly complex and, often, it's nearly impossible to predict failures. Enter chaos engineering, which aims to discover cloud failure points before they become disasters.
Put Cloud in context with the future of business and technology
Because cloud is never just about cloud, a podcast about cloud isn’t either. Our two hosts deliver two unique perspectives to help bring you closer to achieving what matters most—your possible.
For Cloud Professionals, hosted by David Linthicum, provides an enterprise-level, strategic look at key issues impacting clients’ businesses. David, ranked as the #1 cloud influencer in a recent Apollo Research report, has published 13 books on computing, written over 5,000 published articles and performed over 500 conference presentations, making his specialization in the power of cloud simply undeniable.
As a pioneer in cloud computing, Mike Kavis leads Architecting the Cloud, which offers insights from the POV of those who’ve had hands-on experience with cloud technology. Mike’s personal cloud journey includes leading the team that built the world's first high-speed transaction network in Amazon's public cloud—a project that ultimately won the 2010 AWS Global Startup Challenge.
With two leaders in your ear, you’ll have the content you need to drive the next conversation around cloud. Check out both talk tracks within the Deloitte On Cloud podcast to get the compelling stories on your schedule to help you understand the topics that are reshaping today’s market.
Contact us at cloud@deloitte.com for information on this or any other On Cloud podcasts.
Or visit the On Cloud library for the full collection of episodes.
Subscribe now on: iTunes | SoundCloud | Stitcher | Google Play | Spotify