How do you engineer your systems to be highly reliable, especially in a regulated environment like healthcare? Dr Vladyslav Ukis joins us to explain the concepts of Site Reliability Engineering, how to choose and manage Service Level Objectives and Error Budgets, and how to combine it with Continuous Delivery.
Dr Vladyslav Ukis is the author of “Establishing SRE Foundations: A Step by Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.” Dr Ukis is Head of R&D for Siemens Healthineers, where he is running a software delivery organization of 250 distributed team members who develop and operate the Siemens Healthineers teamplay digital health platform. He has led Continuous Delivery, DevOps, Site Reliability Engineering, and Cloud implementations in regulated environments. He holds a Ph.D. in Computer Science from the University of Manchester, and joined us for this episode from Bavaria, Germany.
To learn more about how to structure your service level commitments and provide a reliable application, be sure to check out this episode of the Scaling Tech Podcast!
Show Notes from Episode 27 – Dr Vladyslav Ukis on Site Reliability Engineering
- 00:00 Vlad’s opening quote: “Each potential roll out to production can cause incidents. Or it can just happen that the services have been running for some time and something changed. That means that you need to be an organization that is ready to respond, even if nothing happens for a long period of time, you need to be ready to respond. When something happens, you need to be able to roll out your incident response plan. That means you need to be able to mobilize people quickly, and mobilize the right people quickly. There are a small number of people who are able to fix the incident quickly. Ideally you caught it before the users actually saw it, because of your SLOs, and your infrastructure also alerted you in a timely manner.”
- After the opening quote, Arin encourages listeners to think about how they define success when launching a new software application or a new feature in an existing application. Any deployment involves risk, and there are always going to be some failures in an application, either at launch or during normal operation. Minimizing those and responding to them properly is what Site Reliability Engineering is all about – maintaining an acceptable level of failure, or an “Error Budget” as our guest Vlad will talk about today.
- Arin introduces Vlad: Dr Vladyslav Ukis is the author of “Establishing SRE Foundations: A Step by Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.” Dr Ukis is Head of R&D for Siemens Healthineers, where he is running a software delivery organization of 250 distributed team members who develop and operate the Siemens Healthineers teamplay digital health platform. He has led Continuous Delivery, DevOps, Site Reliability Engineering, and Cloud implementations in regulated environments. He holds a Ph.D. in Computer Science from the University of Manchester, and joins us today from Bavaria, Germany.
- The origin of Site Reliability Engineering
- 02:52 Vlad works at Siemens Healthineers, a part of the company that takes care of healthcare providers and produces hardware and software products. He started the teamplay digital health platform, which is a platform for digital services deployed in the cloud. Along the way, Vlad learned that you need a well-defined structure for operating your services reliably. This was especially important since most of the team came from an on-premises background and they were learning how to operate reliably in the cloud.
- That led him to learn more about Site Reliability Engineering, which is a discipline that was invented by Google. It enables you to run services reliably at scale, and one of the discipline’s principles is that you treat operations as a software problem. Therefore it’s all inspired by software engineering, which is quite a departure from past system management principles, which were based on more manual configuration.
- Key Concepts of Site Reliability Engineering
- 08:19 Vlad talks about the terminology used in Site Reliability Engineering (SRE). Service Level Indicators (SLIs) are non-functional attributes that we would be familiar with even without SRE, things like availability, latency, and throughput. Together, the set of SLIs that apply to a service defines its reliability.
- Availability is one important example: a team might define that the service must be available and responsive in 99.5% of requests within a four-week period. Latency is another: if a service takes 10 minutes to return, it will not be considered reliable.
- Once you define the SLIs that you will track, you can roll them up into Service Level Objectives (SLOs), which are the higher-level objectives your service needs to reach in order to provide good reliability to its users. You could define an SLO for a set of APIs that says these operations need to return within 500 milliseconds in 95% of requests over a four-week period. And there you go, you already have two SLOs: one for availability, another for latency.
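To make the two example SLOs above concrete, here is a minimal sketch of how they could be checked against a window of request records. The `Request` type, the function names, and the toy data are assumptions made for illustration; they are not taken from the episode or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request seen in the measurement window."""
    succeeded: bool     # True if the service answered without an error
    latency_ms: float   # time taken to respond, in milliseconds


def availability_sli(requests):
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def slo_met(sli_value, target):
    """An SLO is met when the measured SLI reaches the target ratio."""
    return sli_value >= target


# Toy data standing in for a four-week window (assumed numbers):
# 900 fast successes, 96 slow successes, 4 fast failures.
window = (
    [Request(True, 120.0)] * 900
    + [Request(True, 800.0)] * 96
    + [Request(False, 50.0)] * 4
)

print(slo_met(availability_sli(window), 0.995))  # True: 99.6% of requests succeeded
print(slo_met(latency_sli(window), 0.95))        # False: only 90.4% returned within 500 ms
```

In this toy window the availability SLO holds while the latency SLO is breached, which shows why the two objectives are tracked separately.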
- Balancing Service Level Objectives
- 13:00 Arin brings up the tradeoffs that may be necessary, and how the importance of different measurements may vary between types of applications. For instance, in an application to record a podcast, audio latency is very important so that the participants don’t talk over each other during the recording and can have a natural conversation. In a more static website, perhaps availability is more important than latency. How do you balance these?
- Vlad talks about how SRE as a discipline shines here, because it brings together people of different roles into the team and they can figure out the right tradeoffs to make. Vlad stresses the importance of bringing the product manager responsible for the service, the developers responsible for building it, and the DevOps people to the same table.
- Ultimately what the team comes up with initially is a best guess – that X.X% for this metric is enough to have a useful application. But this is only theory until you put it in production and see if that service level commitment is sufficient for users or not. So it becomes an iterative process with the team and the system over time.
- Iterating on Service Level Objectives
- 18:00 Arin suggests that it’s common for engineers to overestimate their services’ capabilities and sometimes underestimate their users. So Vlad and Arin agree that it’s important the SLOs are defined from the user’s perspective, not the engineer’s.
- Vlad points out that a common pitfall is that some people on the team may not be comfortable with setting a number that everybody agrees on without a lot of research, so they want more scientific backing. He thinks it’s important to recognize the people on the team and to be more data-driven when setting initial SLOs when possible, and this can require building in more monitoring like uniform system logging.
- Vlad states that breaches of your SLO will inevitably create resistance with your users. Once you have set your initial SLOs, you need to incorporate the insights from that user and system feedback into your calibrations. Getting into the habit of working with feedback from production is something that needs to be trained within the team and be part of your process, because these feedback loops and iterations are core to successful SRE. Learning loops are important because it’s hard to build experience in these concepts, especially for more junior engineers, who did not learn this at university.
- Incident Response Processes
- 23:25 Vlad explains that any Software as a Service will need an incident response process because every potential rollout to production can cause an incident. When something happens, an organization needs to be ready to respond and be able to roll out an incident response quickly. The person investigating the incident can be either an operations or a development person, and ideally they catch the breach before users notice it, thanks to their system monitoring.
- You need a way in the organization to define what a Priority One issue is, and what that decision is based on, as opposed to a lower-level Priority Two issue. It doesn’t have to be an exact science, but it needs to be reasonable. If nobody can log on, that is obviously priority one. After you define the priority of the issue, you need to be able to mobilize people. In more severe cases you are mobilizing people outside the core team.
- The most severe issues require the fastest response time and the highest degree of communication with users about what is happening and when it is expected to be resolved. Once you’ve fixed the incident, you again need to inform everyone that it is now fixed and settled. Your status page needs to reflect that there was an incident and now there isn’t anymore. And hopefully all that happened before the customer actually noticed! Afterwards you need to be able to run an efficient postmortem: quickly assemble the people and information and try to understand what happened so you can prevent it from happening again. All of this together is the incident response process, and it’s essential in a software service organization.
- Vlad and Arin also discuss how you might have different SLOs and response plans based on the severity of the incident. Vlad points out that this all depends on understanding the user journeys in your application. What are the most important user journeys in order to make sure that the application is useful?
- A message from our sponsor: WebRTC.ventures
- 29:08 Building custom WebRTC video applications is hard, but your go live doesn’t have to be stressful. WebRTC.ventures is here to help you build, integrate, assess, test, and manage your live video application for web or mobile!
- SREs in Regulated Industries
- 30:01 Vlad works in healthcare, and that industry is heavily regulated. There are different classifications of software, and non-medical device software can become medical device software if certain features are added, but you don’t want to have to revamp your whole process if that happens. With medical device software, they have to do clinical evaluations before they can put the software on the market. However, they follow a lot of the same regulations even when building non-medical device software, which helps produce high quality software.
- SRE and Agile Engineering
- 34:00 Arin asks how site reliability engineering overlaps with other agile engineering techniques like continuous delivery. Vlad has been thinking about that overlap for a long time, and wants to find a more direct correlation between the continuous delivery indicators and the SRE indicators.
- Arin talks about how in the ideal Continuous Delivery environment, any developer can deploy any piece of code to production at any time, as long as they’ve followed the process and automated the tests and the code passes the system automation checks. But that ideal is harder to pull off in a heavily regulated environment, and Vlad agrees that this is an area of tension between agile engineering principles and SRE practices, especially in a regulated environment.
- Vlad suggests that you can still implement more continuous delivery concepts in your internal environments, and benefit from the developer freedom of rapid deployments there. But eventually you will need to institute the additional checks that SRE and a regulated industry require, and that does slow down the final deployment to the users. Vlad points out an additional benefit of this approach: the team is now thinking about automating as much of the system as possible in order to support the internal continuous delivery environment, and this mindset of automation ultimately benefits the quality of the SRE practices at the external deployment layer.
- The Role of AI in SRE
- 39:20 Arin asks Vlad whether he foresees AI and machine learning having an impact on site reliability engineering. Vlad certainly hopes so, and expects that the impact will be large. AI could be used to assist in monitoring the system, and to help the team find error logs and data about an incident and decide how to handle the response.
- Error Budgets
- 41:03 An Error Budget is another concept from the SRE jargon, and it’s calculated automatically once you’ve got the SLOs. If you have a Service Level Objective for Availability of 99%, for example, that means that your service should be available for 99% of requests, say within a four-week period. Therefore in 1% of requests, your service is allowed by definition of the SLO to be unavailable. This is the budget that you’ve got in order to make mistakes, which is why it’s called the error budget. If that budget is too large, and users are unhappy with the number of errors or the availability, then the availability target needs to be increased and your error budget goes down.
- Once you have an agreed-upon error budget, you can consider what to do with it. The budget is not just for mistakes; it also tells you how much room you have to do experiments in production, to do configuration changes or deployments that require downtime, etc. Over time you can also look at how your error budget is spent in order to see areas of the system that need improvement.
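The error budget arithmetic described above can be sketched in a few lines. This is an illustrative example only: the function names, the request counts, and the choice to pass the SLO target as a decimal string (so the subtraction is exact) are assumptions, not something specified in the episode.

```python
from fractions import Fraction

FOUR_WEEKS_IN_MINUTES = 28 * 24 * 60  # 40,320 minutes


def error_budget_requests(slo_target: str, total_requests: int) -> int:
    """Requests allowed to fail in the window, e.g. slo_target='0.99'."""
    allowed_failure_ratio = 1 - Fraction(slo_target)  # exact, no float rounding
    return int(allowed_failure_ratio * total_requests)


def budget_remaining(slo_target: str, total_requests: int, failed_requests: int) -> int:
    """Budget left after the failures seen so far (negative means the SLO is breached)."""
    return error_budget_requests(slo_target, total_requests) - failed_requests


def downtime_budget_minutes(slo_target: str, window_minutes: int = FOUR_WEEKS_IN_MINUTES) -> float:
    """The same budget expressed as allowed downtime over the window."""
    return float((1 - Fraction(slo_target)) * window_minutes)


# Assumed traffic: one million requests in the four-week window, 2,500 of them failed.
print(error_budget_requests("0.99", 1_000_000))     # 10000 requests may fail at 99%
print(budget_remaining("0.99", 1_000_000, 2_500))   # 7500 still available
print(budget_remaining("0.995", 1_000_000, 2_500))  # 2500: a stricter SLO shrinks the budget
print(downtime_budget_minutes("0.99"))              # 403.2 minutes over four weeks
```

Tracking the remaining budget over the window is one way a team could decide whether there is still room for a risky deployment or a production experiment.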
- Getting Started with SRE
- 45:03 How should an engineering manager get started with Site Reliability Engineering, especially when you have a legacy system to deal with?
- Vlad’s book is written with managing legacy systems in mind, assuming you already have a cloud-based system. Teams will need to reach some internal alignment on the value of and need for SRE around that legacy system, since you’ll need involvement from a variety of roles in order to improve the application. Those roles are product management, product development, and product operations.
- Next you pick a small area of the application to focus on initially, with a tight feedback loop for the features to be built in that area. As you build up SRE in that area, you need to define the SLOs, track them over time, and show the improvement. This allows you to then move from team to team across the organization and implement SRE to make the overall system more reliable.
- 48:31 To get a copy of Vlad’s book, you can buy it on Amazon (link below) or from other major booksellers. To learn more about Vlad’s work and to see his writings and sample chapters from the book, follow him on LinkedIn, where he regularly posts content and chapter summaries from the book.