Posted: 11 January 2023 | 10 min read

DevOps, SRE, and the reliability life cycle

A blog post by Eric A. Marks, specialist leader, Cloud Strategy, Deloitte Consulting LLP


The winds of change are blowing fiercely in the world of information technology (IT). Cloud computing, containers, serverless, and other new technical capabilities are disrupting how enterprises develop, test, and release software products and services, and then support and maintain those applications and services in production. Critical to these dramatic changes in technology are the workforce demands, roles, and skills necessary to succeed in high-change scenarios. Two roles in particular are increasingly in demand: DevOps engineers and site reliability engineers (SREs). While both roles have been around for several years, there is confusion around what each is accountable for in a modern software delivery model.

DevOps and site reliability engineering (SRE) are really two sides of the same coin. They not only complement each other but are indispensable components for an organization to reach the highest level of operational excellence maturity. There are several interpretations of what constitutes DevOps as contrasted with SRE, and of the scope and key roles and responsibilities of each discipline.

While DevOps and SRE have both seen dramatic growth in their respective disciplines, and engineers with these skills are difficult to find, the two remain irrevocably linked in IT history. It is historically interesting that SRE as a concept preceded the discipline of DevOps; today they are tightly coupled, co-evolving as demand for these skills and capabilities increases.

The reliability life cycle: SRE and DevOps interaction models

One useful model to help explain the way in which SRE and DevOps engineers interact is the reliability life cycle model, which is based on analysis of performance metrics over defined time periods. The reliability life cycle accurately models the behavior of a new system over time from its launch into production until it reaches performance stability. It is also based on the continuous ingestion of key performance indicators (KPIs) and logging and tracing data into an observability platform that monitors a system from its initial launch, with new features or releases being deployed continuously.

The reliability life cycle describes the process that every new application or service undergoes when it is launched. Initially there may be reliability or stability issues that arise as the system or service is consumed. In this initial launch phase, the SRE teams, working closely with the application developers, DevOps engineers, and operations teams, will troubleshoot system issues and resolve them. Early in the launch there typically will be multiple stability issues to remediate. Eventually, the rate of system issues will decrease, and the system will begin to demonstrate a level of performance reliability that can be managed within an upper and lower performance threshold, also known as an error budget.

Over time, the system will encounter reliability issues, which are identified and resolved by the SRE or system operations/support team. The SRE team investigates the issues to determine root causes and potential solutions and then rectifies the issues in collaboration with DevOps engineers and the application development team. As new and unexpected issues arise, SRE engineers will follow the same troubleshooting process and resolve issues accordingly.

The resulting system performance pattern is illustrated via the concept of a stabilization ladder (figure 1).


The stabilization ladder

New systems or services typically follow the reliability life cycle and climb the stabilization ladder over time, eventually settling into a stable performance range with a defined upper and lower limit of availability. Once this stability threshold is reached, where the application is performing consistently within these established tolerances, an error budget process can be implemented based on real-world performance.

The stabilization ladder can be characterized as a series of rungs: application disruptions, or SLA breaches, that may occur after release into production and are climbed, one by one, on the way to application stability. The number of rungs in the stabilization ladder, the duration of each rung, and the total time to achieve stability are useful ways to measure application quality and post-release performance.

In figure 1, the depicted application or service requires four rungs to achieve stability. Others may require more rungs or fewer rungs to reach stability. The final rung pushes the application into the target stability performance envelope, and from that point forward an error budget process can be used to maintain the application or service and sustain its stability, reliability, and overall performance as measured by its service level agreement (SLA), service level objective (SLO), and relevant service level indicator (SLI).
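The ladder metrics described above, the number of rungs, the duration of each rung, and the total time to stability, can be computed directly from an application's incident history. The following is a minimal sketch; the function name and the four-rung incident timeline are illustrative assumptions, not part of any particular observability product.

```python
from datetime import datetime, timedelta

def ladder_metrics(breaches, launch, stable_at):
    """Summarize a stabilization ladder from SLA-breach timestamps.

    breaches  -- datetimes at which an SLA breach (a "rung") began
    launch    -- datetime the application was released to production
    stable_at -- datetime the application entered its stability envelope
    """
    rungs = len(breaches)
    time_to_stability = stable_at - launch
    # Duration of each rung: time from one breach to the next;
    # the final rung runs until stability is reached.
    edges = breaches + [stable_at]
    durations = [edges[i + 1] - edges[i] for i in range(rungs)]
    return rungs, time_to_stability, durations

# Hypothetical incident history: four rungs, as in figure 1.
launch = datetime(2023, 1, 1)
breaches = [launch + timedelta(days=d) for d in (2, 9, 20, 35)]
stable_at = launch + timedelta(days=55)

rungs, ttl, durations = ladder_metrics(breaches, launch, stable_at)
# 4 rungs; 55 days from launch to stability
```

Tracked release over release, a shrinking rung count and a shorter time to stability are concrete evidence of improving application quality.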

The continuous identification and resolution of issues on an individual basis, as well as in the aggregate, reflects the maturity and capability of SRE teams and their troubleshooting and problem resolution skills. Through a continuous loop of fixing bugs and errors and incorporating optimization feedback, the system eventually reaches the stability phase. Once a system is stabilized and operating consistently within target availability and stability tolerances—the SLOs and SLAs—an error budget can be implemented as described above.

The reliability life cycle model and the stabilization ladder concepts largely reflect the focus and scope of SREs when they are working to ensure an application or service is performing as required and operates within its defined SLA and SLO. These activities reflect the focus of SREs on supporting an application's stability, reliability, and availability once it is released into production, and should consume 50% of an SRE's time allocation.

Once the application is stable and performing well, the SRE pivots to working upstream in the SDLC and partners with the DevOps engineers to automate processes, eliminate manual toil, and optimize both operations tasks and SDLC capabilities, as prioritized in the SRE backlog. These operations-excellence tasks should consume the other 50% of an SRE's time allocation.

SREs in action: Error budgets, SLAs, SLOs, and SLIs

SREs harness many metrics, tools, and capabilities in their day-to-day work. Central to SRE practice are the concepts of SLAs, SLOs, and SLIs, and how they support the use of "error budgets." Before error budgets can be addressed, it's necessary to define the SLA, SLO, and SLI concepts.

A service level agreement (SLA) is a defined agreement between two parties, typically contractually binding, that specifies the availability levels for an in-scope service or application, as well as the amount of allowable downtime before the consumer may request remedies or financial remuneration. A service level objective (SLO) is an internal availability target for the same service or application; an SLO is more stringent than an SLA, so that the objective is breached, and corrective action taken, before the contractual agreement is.
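An availability percentage translates directly into an allowance of downtime per period, which makes the SLA/SLO distinction concrete. The following is a simple sketch assuming a 30-day month; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_pct, period_minutes=30 * 24 * 60):
    """Allowable downtime per period implied by an availability target."""
    return period_minutes * (1 - availability_pct / 100)

# A 99.9% SLA permits roughly 43.2 minutes of downtime in a 30-day month;
# a stricter 99.99% SLO permits only about 4.3 minutes.
sla_allowance = allowed_downtime_minutes(99.9)
slo_allowance = allowed_downtime_minutes(99.99)
```

Each additional "nine" of availability cuts the permitted downtime by a factor of ten, which is why SLO targets drive such different operational postures.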

Service level indicators (SLIs) are the metrics used to monitor system performance, stability, and downtime. So, putting it all together: an error budget is the amount of downtime a system is allowed to experience, as defined by its SLO (and SLA), while SLIs are the metrics used to monitor production systems through observability capabilities and to trigger alerts based on SLI values, performance thresholds, and other key performance metrics.
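As a concrete illustration, a common availability SLI is the fraction of requests served successfully over a measurement window, with an alert fired when it drops below the SLO threshold. This is a minimal sketch; the function names and request counts are hypothetical.

```python
def availability_sli(successful_requests, total_requests):
    """Request-based availability SLI: percentage of requests served successfully."""
    return 100.0 * successful_requests / total_requests

def breaches_slo(sli_pct, slo_pct):
    """True when the measured SLI has fallen below the SLO threshold."""
    return sli_pct < slo_pct

# Hypothetical window: 999,700 of 1,000,000 requests succeeded -> 99.97%.
sli = availability_sli(999_700, 1_000_000)
alert = breaches_slo(sli, slo_pct=99.99)  # below the 99.99% SLO, so alert fires
```

In practice an observability platform evaluates such SLIs continuously and routes the resulting alerts to the SRE team.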

Error budgets are a critical tool used by the SRE community. An error budget is defined as the amount of downtime a system is allowed to experience as defined by its SLO. The error budget for a given system is calculated as follows:

Error budget = Actual availability - SLO

where actual availability is the percentage of service uptime per month. So, if the SLO is set at 99.99% availability and actual performance is 99.97%, the error budget has been exhausted: performance has dipped below the SLO, so new feature development ceases and the SRE focuses on the stability and reliability of the service or application. When the application is performing within its defined error budget tolerances, new features may be released.
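The formula and the release-gating rule it drives can be sketched in a few lines. This is an illustrative sketch of the calculation described above; the function names and figures are assumptions for the example.

```python
def error_budget(actual_availability_pct, slo_pct):
    # Per the formula above: remaining budget is actual availability minus the SLO.
    # Positive -> headroom remains; negative -> the budget is exhausted.
    return actual_availability_pct - slo_pct

def new_features_allowed(actual_availability_pct, slo_pct):
    # Gate feature releases on the error budget: ship only while budget remains.
    return error_budget(actual_availability_pct, slo_pct) >= 0

# With a 99.99% SLO and 99.97% actual availability, the budget is -0.02
# percentage points, so feature work stops in favor of reliability work.
budget = error_budget(99.97, 99.99)
shipping = new_features_allowed(99.97, 99.99)
```

The gate is deliberately simple: the same number that measures reliability headroom also decides whether the team's next sprint is features or stability.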

In this manner, an error budget helps guide the allocation of resources toward two broad categories of work: adding new features to the application or service and ensuring stability and reliability for the application or service. 

If the system is performing within its error budget, new features can be developed and released as prioritized by the product manager. If a new feature or set of new features causes the system to fail and fall out of its defined error budget tolerances, new feature development work stops and the SRE team instead focuses on system stability and reliability features and tasks to bring the system back to operating within its error budget window. Once the system performance is operating within its defined error budget tolerances, new features can again be developed and deployed.

In practice, SREs typically focus 50% of their time proactively working with application teams and DevOps engineers to automate and improve the overall SDLC and implement new operations automation and tooling. In proactive mode, the SRE works from a backlog to implement automation, eliminate manual toil, and complete other SRE-specific tasks.

The other 50% of an SRE's time is usually spent on more reactive tasks to address stability and reliability issues prioritized by Level 1 and Level 2 support tickets. These might involve incident support or critical issues such as a breached error budget. In reactive mode, the SRE primarily focuses on operations activities to bring the system back within its error budget tolerances, and may spend 100% of their time regaining application stability and returning the application to its SLO. Once the system is stabilized, the SRE regains the 50-50 balance between reactive operations tasks and proactive stability improvements, toil elimination, removal of technical debt, and other tasks.

The SRE team enables an application to operate within its defined stability window under its defined error budget, based on the defined SLA, SLO, and SLIs. An error budget process establishes a clear SLO-based metric describing the allowable downtime for a given application, thereby governing SRE time allocation by balancing new features against application stability and reliability. This process helps weigh the potential risk introduced by adding new features against the goals of application reliability and stability.

The bottom line

SREs and DevOps engineers are critical resources in a modern software-delivery enterprise. They are also scarce. It is essential to optimize how SRE teams and DevOps teams are organized and how they operate to deliver high-quality software. SRE and DevOps are complementary skill sets that must be aligned to a shared operating model and interaction model that drives desired productivity, software quality, and ultimately revenue and customer satisfaction.


David Linthicum


Managing Director | Chief Cloud Strategy Officer

As the chief cloud strategy officer for Deloitte Consulting LLP, David is responsible for building innovative technologies that help clients operate more efficiently while delivering strategies that enable them to disrupt their markets. David is widely respected as a visionary in cloud computing—he was recently named the number one cloud influencer in a report by Apollo Research. For more than 20 years, he has inspired corporations and start-ups to innovate and use resources more productively. As the author of more than 13 books and 5,000 articles, David's thought leadership has appeared in InfoWorld, Wall Street Journal, Forbes, NPR, and Gigaom. Prior to joining Deloitte, David served as senior vice president at Cloud Technology Partners, where he grew the practice into a major force in the cloud computing market. Previously, he led Blue Mountain Labs, helping organizations find value in cloud and other emerging technologies. He is a graduate of George Mason University.