A presentation at KubeCon Europe 2020 by Kat Cosgrove
DevOps Patterns & Antipatterns for Continuous Software Updates Kat Cosgrove
Kat Cosgrove IoT Engineer Developer Advocate @Dixie3Flatline katc@jfrog.com jfrog.com/shownotes
Why do we update software?
@dixie3flatline #LiquidSoftware #KubeCon http://jfrog.com/shownotes
WHO ARE WE? USERS! WHAT DO WE WANT? FEATURES!
“As every company becomes a software company, security vulnerabilities are the new oil spills”
Identify → Fix → Deploy
OS Update: Identify: Immediate • Fix: OS Update • Deploy: Years
Struts Upgrade: Identify: 2 Months • Fix: Struts Upgrade • Deploy: 2 Months
Identify: As Fast as Possible • Fix: As Fast as Possible • Deploy: As Fast as Possible
This is not a new idea!
• XP: short feedback
• Scrum: reducing cycle time to absolute minimum
• TPS: Decide as late as possible and Deliver as fast as possible
• Kanban: Incremental change
How do we update?
[Flowchart: “Update available”. Decision questions: “Do we want it?”, “Are there any high risks?”, “Do we trust the update?”. Outcomes: “Let’s update!”, “Why not?”, “How about no”.]
Number of artifacts as a symptom of complexity
[Chart: timeline from 2000 to today: Agile, Continuous Integration, Continuous Delivery, Infrastructure as Code, Microservices, Docker, Serverless, IoT]
The problem is not the code, it’s the data. Big data.
[Flowchart: “Update available”. Decision questions: “Do we want it?”, “Are there any high risks?”, “Do we trust the update?”, “Can we verify the update?”. Outcomes: “Time consuming verification”, “Let’s update!”, “How about no”.]
Features that we want / Acceptance test costs
• Your browser
• Twitter in your browser
• Twitter on your smartphone
• Your smartphone OS?!
[Flowchart: “Update available”. Decision questions: “Do we want it?” (answered “No one asked you (auto update)”), “Are there any high risks?”. Outcome: “Let’s update!”.]
What could possibly go wrong?
Continuous updates pattern: Local Rollback
• Problem: An update went catastrophically wrong and an over-the-air patch can’t reach the device.
• Solution: Save the previous version on the device before updating. Roll back if a problem occurs.
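The Local Rollback pattern can be sketched roughly as follows. This is a minimal illustration, not the talk's implementation: the function name `update_with_local_rollback`, the `.previous` backup directory, and the `health_check` callback are all assumptions made for the example, and a real device would also need to handle power loss mid-copy.

```python
import shutil
from pathlib import Path


def update_with_local_rollback(install_dir: Path, new_version: Path, health_check) -> bool:
    """Install new_version into install_dir; restore the saved copy if it is unhealthy.

    Returns True if the update was kept, False if it was rolled back.
    All names here are hypothetical and chosen for illustration.
    """
    backup_dir = install_dir.with_name(install_dir.name + ".previous")

    # Save the currently installed version on the device before touching it.
    if backup_dir.exists():
        shutil.rmtree(backup_dir)
    shutil.copytree(install_dir, backup_dir)

    try:
        shutil.rmtree(install_dir)
        shutil.copytree(new_version, install_dir)
        healthy = bool(health_check(install_dir))
    except Exception:
        # Any failure during install or health check counts as a bad update.
        healthy = False

    if healthy:
        return True

    # Roll back from the local copy: no network access is needed here.
    if install_dir.exists():
        shutil.rmtree(install_dir)
    shutil.copytree(backup_dir, install_dir)
    return False
```

The point of the design is that the rollback path depends only on what is already on the device, so it still works when the update has broken connectivity.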
Continuous updates pattern: OTA Software Updates
• Problem: Physical recalls are costly. Extremely costly. Also, you can’t force an upgrade.
• Solution: Implement over-the-air software updates, preferably continuous updates.
Continuous OTA updates are like normal OTA updates, but better
Continuous updates pattern: Continuous Updates
• Problem: In batch updates, important features wait for unimportant features.
• Solution: Implement continuous updates.
You thought your problems were hard?
Things under your control: Server-side Updates vs. IoT (Mobile, Automotive, Edge) Updates
• The availability of the target: Server-side ✓, IoT ✕
• The state of the target: Server-side ✓, IoT ✕
• The version on the target: Server-side ✓, IoT ✕
• The access to the target: Server-side ✓, IoT ✕
KNIGHT-MARE
• The new system reused old APIs
• 1 out of 8 servers was not updated
• New clients sent requests to the machine that still contained old code
• Engineers removed working code from the updated servers, increasing the load on the un-updated server
• No monitoring, no alerting, no debugging
Continuous updates pattern: Automated Deployment
• Problem: People suck at repetitive tasks.
• Solution: Automate everything.
Continuous updates pattern: Frequent Updates
• Problem: Infrequent deployments generate anxiety and stress, leading to errors.
• Solution: Update frequently to develop skill and habit.
Continuous updates pattern: State awareness
• Problem: Target state can affect the update process and the behavior of the system after the update.
• Solution: Know and consider target state when updating. Reverting might require reverting the state.
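One way to read the State awareness pattern is that every update ships with both a forward state migration and its inverse, and checks the target's version before touching it. The sketch below assumes a hypothetical `Target` with a version number and a dictionary of state; none of these names come from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Target:
    """Hypothetical update target: a version number plus persisted state."""
    version: int
    state: Dict[str, object] = field(default_factory=dict)


@dataclass
class Update:
    """An update that knows how to move state forward and back again."""
    from_version: int
    to_version: int
    migrate: Callable[[dict], dict]
    revert: Callable[[dict], dict]


def apply_update(target: Target, update: Update) -> None:
    # Refuse to update a target whose version is not the one we expect.
    if target.version != update.from_version:
        raise ValueError(f"expected version {update.from_version}, found {target.version}")
    target.state = update.migrate(dict(target.state))
    target.version = update.to_version


def revert_update(target: Target, update: Update) -> None:
    # Reverting the code also reverts the state it depends on.
    if target.version != update.to_version:
        raise ValueError(f"expected version {update.to_version}, found {target.version}")
    target.state = update.revert(dict(target.state))
    target.version = update.from_version


def rename_key(old: str, new: str) -> Callable[[dict], dict]:
    """Build a state transform that renames one key and leaves the rest alone."""
    return lambda state: {(new if k == old else k): v for k, v in state.items()}


# Example update: version 2 stores the device name under a different key.
v1_to_v2 = Update(1, 2, rename_key("name", "device_name"), rename_key("device_name", "name"))
```

Keeping `migrate` and `revert` side by side makes it harder to ship an update whose rollback leaves the old code reading state it no longer understands.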
Cloud-dark
• New rules are deployed frequently to battle attacks
• A single misconfigured rule was deployed
• The rule included a regex that spiked CPU to 100%
• “Affected region: Earth”
Continuous updates pattern: Progressive Delivery
• Problem: Releasing a bug affects ALL the users.
• Solution: Release to a small number of users first, effectively reducing the blast radius, and observe. If a problem occurs, stop the release, then revert or update the affected users.
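A minimal sketch of Progressive Delivery, assuming users are assigned to a release by hashing their ID into one of 100 buckets and that an observed error rate decides whether to widen or stop the rollout. The function names, the stage percentages, and the 1% error threshold are illustrative assumptions, not values from the talk.

```python
import hashlib


def in_rollout(user_id: str, release: str, percent: int) -> bool:
    """Deterministically decide whether user_id receives the given release.

    The same user always lands in the same bucket for a release, so widening
    the percentage only ever adds users; it never removes earlier ones.
    """
    digest = hashlib.sha256(f"{release}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def next_stage(current_percent: int, error_rate: float,
               max_error_rate: float = 0.01,
               stages: tuple = (1, 5, 25, 100)) -> int:
    """Return the next rollout percentage, or 0 to stop the release."""
    if error_rate > max_error_rate:
        # Problem observed: stop so that only the small group is affected.
        return 0
    for stage in stages:
        if stage > current_percent:
            return stage
    return current_percent  # Already fully rolled out.
```

For example, `next_stage(1, 0.0)` moves the release from 1% to 5% of users, while `next_stage(5, 0.5)` returns 0 and halts it.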
Continuous updates pattern: Observability
• Problem: Some problems are hard to trace relying on user feedback only.
• Solution: Implement tracing, monitoring and logging.
Continuous updates pattern: Rollbacks
• Problem: Fixes might take time; users suffer in the interim.
• Solution: Implement rollback, the ability to deploy a previous version without delay.
Continuous updates pattern: Feature Flags
• Problem: Rollbacks are not always supported by the deployment target platform.
• Solution: Embed 2 versions of the feature in the app itself and switch between them with API calls.
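The Feature Flags pattern might look like the sketch below: both versions of a feature ship in the same build, and a flag that can be flipped at runtime selects between them. The `FeatureFlags` class, the `new_checkout` flag, and the two checkout functions are hypothetical; in a real app the `set` call would be driven by a remote API rather than invoked directly.

```python
class FeatureFlags:
    """Hypothetical in-app flag store; a remote API call would drive set()."""

    def __init__(self, defaults: dict):
        self._flags = dict(defaults)

    def set(self, name: str, enabled: bool) -> None:
        # Only known flags can be toggled, to catch typos early.
        if name not in self._flags:
            raise KeyError(name)
        self._flags[name] = bool(enabled)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as off.
        return self._flags.get(name, False)


def checkout_v1(total: float) -> str:
    """Old version of the feature, kept in the app as the fallback."""
    return f"Total: {total:.2f}"


def checkout_v2(total: float) -> str:
    """New version of the feature, shipped dark behind the flag."""
    return f"Total due: {total:.2f} (new flow)"


def checkout(flags: FeatureFlags, total: float) -> str:
    # Both versions are embedded; the flag picks one at call time.
    if flags.is_enabled("new_checkout"):
        return checkout_v2(total)
    return checkout_v1(total)
```

Turning the flag off again acts as a rollback even on platforms that cannot redeploy an older build.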
Continuous updates pattern: Zero Downtime Updates
• Problem: You will probably lose all your users if you shut down for 5 weeks to perform an update.
• Solution: Perform small, frequent, zero-downtime continuous OTA updates.
Continuous updates
• Frequent
• Automatic
• Tested
• Progressively delivered
• State-aware
• Observability
• *Local Rollbacks
[Flowchart: “Update available”. Decision questions: “Do we want it?” (answered “Sure, why not? (auto update)”), “Are there any high risks?”, “Do we trust the update?”. Outcome: “Let’s update!”.]
So, you want to update the software for your users, be it the nodes in your K8s cluster, a browser on a user’s desktop, an app on a user’s smartphone, or even a user’s car. What can possibly go wrong?
In this talk, we’ll analyze real-world software update failures and show how multiple DevOps patterns that fit a variety of scenarios could have saved the developers. Manually making sure that everything works before sending an update, and expecting users to run acceptance tests before they update, is most definitely not on the list of such patterns.
Join us for some awesome and scary continuous update horror stories and some obvious (and some not so obvious) proven ideas for improvement and best practices you can start following tomorrow.