Toil
Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1
Slide 2
@DivineOps
SRE teams usually aim for under 50% toil
Slide 3
@DivineOps
If an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? - Byron Miller
Slide 4
The eternal sunshine of the toil-less prod Sasha Rosenbaum @DivineOps
Slide 5
Dev
B.Sc. in C.S.
Ops
MBA
Dev + Ops Cloud Consulting DevRel Technical Sales Sasha Rosenbaum @DivineOps
Slide 6
@DivineOps
And you?
Slide 7
Slide 8
Red Hat OpenShift - Kubernetes Experience 2014
2015
2016
Nearly 100 Kubernetes Distributions In the Market
OpenShift v4 (2019-2020)
OpenShift v3 (2015)
Data Center 8
Industry Leader 2nd Largest Contributor Most Enterprise Customers Most Multi-Cloud Deployments
Slide 9
A consistent experience no matter where you run it Developer Efficiency
Business Productivity
Enterprise Ready
Red Hat OpenShift On-premises
Red Hat OpenShift Service on AWS
Azure Red Hat OpenShift
Red Hat Managed Joint offerings with Cloud Provider
9
Red Hat OpenShift on IBM Cloud
Red Hat OpenShift Dedicated
Red Hat Managed
OpenShift Container Platform
OCP Customer Managed
Slide 10
Our Scale
Our Experience 4 Public Clouds
100,000+ Clusters
60+ Regions
2M+ Develop ers
6000+ New Users
2 Million+ Hours of SRE
Every Week
experience
Slide 11
@DivineOps
Products to Services
Slide 12
@DivineOps
Products and Services
Slide 13
@DivineOps
SRE
Slide 14
@DivineOps
What is the most important and innovative thing about SRE discipline?
Slide 15
@DivineOps
SRE is about explicit agreements that align incentives
Slide 16
@DivineOps
SLA, SLI, SLO
Slide 17
@DivineOps
SLA = Financially-backed availability
Slide 18
@DivineOps
Monthly downtime > 1.5 days means 100% refund
Slide 19
@DivineOps
SLAs are about aligning incentives between Vendor & Customer
Slide 20
@DivineOps
99% => 99.99%
Slide 21
@DivineOps
• SLA usually includes a single metric • For financial and reputational reasons, companies prefer to under promise and overdeliver
Slide 22
@DivineOps
SLO = Targeted reliability
Slide 23
Our Journey
Service Level Objectives: What we care about? Availability
OAuth Server
Availability
Registry Success rate Cluster installation time (includes external dependencies)
Builds
Cluster Provisioning Success rate Critical rollout time
Availability
Cluster Upgrade
Availability measured from customer perspective (closed box monitoring) API Server
Can customers run their CI/CD? Periodically run synthetic builds
Router
OpenShift Console
Availability measured from customer perspective (closed box monitoring)
Slide 24
@DivineOps
SLI = Actual reliability
Slide 25
@DivineOps
Monitoring
Slide 26
@DivineOps
Without monitoring, you have no way to tell whether your service even works!
Slide 27
@DivineOps
Good Monitoring
Slide 28
@DivineOps
Without good monitoring, you don’t know that the service does what users expect it to do!
Slide 29
@DivineOps
Signal to noise ratio
Slide 30
@DivineOps
Early on, one of the major monitoring problems we had is alerts on customer clusters that were intentionally taken offline
Slide 31
@DivineOps
Without good monitoring, your SRE is potentially overloaded with unwarranted emergencies and blindsided by real incidents
Slide 32
@DivineOps
Periodically incidents may be caught by internal users, rather than the monitoring system We aim to implement monitoring improvements that will catch future problems of the same kind
@DivineOps
Error budgets are about aligning incentives between Dev & Ops
Slide 42
@DivineOps
If developers are measured on the same SLO, then when the error budget is drained developers shift focus from delivering new features to improving reliability
Slide 43
@DivineOps
So, we’ve written things down
Slide 44
@DivineOps
Are we there yet?
Slide 45
The future is already here. It’s just not evenly distributed ~ William Gibson
Slide 46
@DivineOps
Awkward Segue
Slide 47
@DivineOps
What we all got wrong
Slide 48
@DivineOps
Slide 49
@DivineOps
“SRE is what happens when you ask a software engineer to design an operations team.” - Google SRE book, 2017
2000 27% of server market
41% of server market
• File-based OS • Maintains configuration in files • Every device is a file
• Executable-based OS • Maintains configuration in registry • Every device has a different driver mechanism
Slide 58
@DivineOps
PowerShell (Windows) configuration management framework, CLI, and scripting language GA: 2006
Jeffrey Snover
Slide 59
@DivineOps
DevOpsDays Seattle 2019: Thriving Through Transitions by Jeffrey Snover
Slide 60
@DivineOps
Every wave of automation Enables the next wave of automation
Slide 61
@DivineOps
Infrastructure-level APIs
Slide 62
@DivineOps
Borg cluster manager
“Central to its success - and its conception - was the notion of turning cluster management into an entity for which API calls could be issued” - Google SRE book, 2017
@DivineOps
We did NOT suddenly get the idea of infrastructure & platform automation
65
Slide 66
@DivineOps
We did NOT suddenly get the idea of infrastructure & platform automation We gradually built the tools required to make it happen 66
Slide 67
@DivineOps
Why does this matter?
Slide 68
@DivineOps
If we get the origin story wrong, we end up working to solve the wrong problem!
Slide 69
@DivineOps
Corollary 1: Hiring developers to do operations work ≠ effective SRE
Slide 70
@DivineOps
We have seen success from hiring the skillsets across the entire landscape, hiring well-rounded folks with understanding of Ops and Dev concepts, as opposed to just Dev experience
Slide 71
@DivineOps
Corollary 2: The desire to automate the infrastructure & platform operations is insufficient
Slide 72
@DivineOps
Corollary 2: The desire to automate the infrastructure & platform operations is insufficient We need consistent APIs and reliable monitoring to unblock the automation
Slide 73
@DivineOps
Early on, we had to move the Cloud Services Build system from on-prem to the Cloud, because it was not meeting our reliability and agility targets
Slide 74
@DivineOps
What we all got wrong
Slide 75
@DivineOps
Toil
Slide 76
Toil
Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 76
Slide 77
@DivineOps
SRE teams usually aim for under 50% toil
Slide 78
@DivineOps
So, are we striving for a human-less system?
Slide 79
@DivineOps
Second law of thermodynamics
Slide 80
@DivineOps
With time, the net entropy (degree of disorder) of any isolated system will increase
Slide 81
@DivineOps
Slide 82
Entropy always wins
Slide 83
Slide 84
People working above the line of representation continuously build and refresh their models of what lies below the line. That activity is critical to the resilience of Internetfacing systems and the principal source of adaptive capacity. - Dr. Richard Cook
Slide 85
Slide 86
Resilience
Velocity 2012: Richard Cook, “How Complex Systems Fail”
Slide 87
@DivineOps
What we call toil is a major part of resilience and adaptive capacity
Slide 88
@DivineOps
Perhaps we need a better way to look at toil
Slide 89
@DivineOps
SRE folks worry that if they spend significant parts of their day focusing on toil, it will negatively affect their bonuses, chances of promotions etc.
Slide 90
@DivineOps
If an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? - Byron Miller
Slide 91
@DivineOps
SRE Work Allocation
Slide 92
@DivineOps
Work Allocations: SRE P and SRE O
Slide 93
Traditional IT
dev
ops
wall of confusion
Slide 94
@DivineOps
Work Allocations: On-call once a month
Slide 95
@DivineOps
SRE teams asked management for more on-call, as they were losing their “Ops muscle”
Slide 96
@DivineOps
Work Allocations: Rotate engineers working on toilreduction tasks
Slide 97
@DivineOps
Lack of continuity severely impacted team’s ability to deliver
Slide 98
@DivineOps
Work Allocations: The search for the perfect system is still in progress!
Slide 99
@DivineOps
Where do we go from here?
Slide 100
@DivineOps
Let’s look at some of the insights from the talk:
@DivineOps
Cloud provides an industry standard for consistent infrastructure-level APIs
Slide 104
@DivineOps
Are you in the datacenter management business?
Slide 105
@DivineOps
Kubernetes
Slide 106
85% of global IT leaders agree that Kubernetes is key to cloud-native application strategies
Source: Red Hat State of Open Source Report 2021
Source: Red Hat State of Open Source Report 2021
Slide 107
@DivineOps
Kubernetes could provide the industry standard for consistent platform-level APIs
Slide 108
@DivineOps
If building PaaS isn’t your company’s core business
Slide 109
@DivineOps
If building PaaS isn’t your company’s core business Allow your provider to toil for you
Slide 110
A consistent experience no matter where you run it Developer Efficiency
Business Productivity
Enterprise Ready
Red Hat OpenShift On-premises
Red Hat OpenShift Service on AWS
Azure Red Hat OpenShift
Red Hat OpenShift on IBM Cloud
Joint offerings with Cloud Provider Offered as a Native Console offering on equal parity with cloud provider Kubernetes service 110
or OCP Customer Managed
Red Hat OpenShift Dedicated
Red Hat Managed
OpenShift Container Platform
OCP Customer Managed
Slide 111
@DivineOps
Software Services
You build it, you run it
Platform Services
Operated
Infrastructure Services
Operated
Slide 112
@DivineOps
Company A
Company B
Toil
Toil
Automated
Automated
Slide 113
@DivineOps
Get your skills above the API! Image Source: Hans Moravec’s illustration of the rising tide of the AI capacity. From Max Tegmark (2017)
Slide 114
@DivineOps
If building PaaS IS your company’s core business
Slide 115
@DivineOps
Remember that SRE is about explicit agreements that align incentives
Slide 116
@DivineOps
Focus your toil where your business value is
Slide 117
@DivineOps
Last, but not least
Slide 118
@DivineOps
Ideas are open source
Slide 119
@DivineOps
Operate First A concept of incorporating operational experience into software development