A presentation at SAEC Days 2020 by Alexander Reelsen
Understand, Visualize & Improve Continuous Integration
  Alexander Reelsen
  Community Advocate
  alex@elastic.co | @spinscale
How do you act on your CI data?
Everything is a search problem!
Ecommerce
Social networks
File Search
Location Search
Observability
Elasticsearch in one minute
  Search Engine (FTS, Analytics, Geo), near realtime
  Distributed, scalable, highly available, resilient
  Interface: HTTP & JSON
  Heart of the Elastic Stack (Kibana, Logstash, Beats)
Kibana in one minute
  The window into the Elastic Stack
  Visualization & Dashboarding
  Management
  Monitoring
Survey time!
CI Tooling
  Jenkins, JenkinsX, TeamCity, Bamboo
  Travis, CircleCI, GitLab CI
  Azure Pipelines, AWS CI/CD, Google CI/CD
  Others?
CI infrastructure
  Self Hosted on own infrastructure
  Self Hosted in the cloud
  CI-as-a-service
CI as a role
  CI is part of ops role
  CI is part of dev role
  Infrastructure Engineer
  Build Engineer
Test test test
  dev branches (master, 7.x)
  release branches (7.8, 6.8)
  feature branches
  PRs
  BWC
  Benchmarking
  Packaging tests
  JVM versions
  Garbage collectors
  Operating systems
Do you act on CI results? If so, how?
What is CI data?
  Time series
  Recency
  Locality
  Fragmentation
CI output
  unstructured
  huge
  needle in the haystack (TRACE loglevel)
  requires postprocessing
  security
Analyzing CI results
  Detect seemingly random bugs
  Centralised lookup ability for the team
  Emails are not a good CI status medium
  Long term trends (failures, test count, coverage)
Requirements
  meta data enrichment - per branch, per test run, per test method run, failures only
  data model should be based on search requirements (query optimized)
  define search ability: by timestamp, by branch, by class, by test method, by success/failure
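One way to read these requirements is as a flat, query-oriented document per test method run. The sketch below is only an illustration: the field names (`branch`, `os`, `classname`, `test`, `status`, and so on) and the helper `build_test_doc` are assumptions, not the schema used in the talk.

```python
from datetime import datetime, timezone

# Hypothetical index mapping: keyword fields for exact filtering,
# a date field for time-range queries. All field names are assumptions.
TEST_RESULT_MAPPING = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "branch": {"type": "keyword"},
            "os": {"type": "keyword"},
            "classname": {"type": "keyword"},
            "test": {"type": "keyword"},
            "status": {"type": "keyword"},  # "success" or "failed"
            "duration_seconds": {"type": "float"},
            "failure_message": {"type": "text"},
        }
    }
}


def build_test_doc(timestamp, branch, os_name, classname, test,
                   failed, duration_seconds, failure_message=None):
    """Build one document per test method run, enriched with run metadata."""
    doc = {
        "@timestamp": timestamp.isoformat(),
        "branch": branch,
        "os": os_name,
        "classname": classname,
        "test": test,
        "status": "failed" if failed else "success",
        "duration_seconds": duration_seconds,
    }
    if failure_message is not None:
        doc["failure_message"] = failure_message
    return doc


# Illustrative values only; the OS and message are made up for the example.
example = build_test_doc(
    datetime(2020, 7, 14, 11, 36, 54, tzinfo=timezone.utc),
    "master", "macos", "co.elastic.community.AdminServiceTests",
    "testDefaultAdminService()", True, 0.021, "example failure message",
)
```

Keeping every searchable attribute on the document itself, rather than in a separate run record, is what makes the later "by branch, by class, by test method" filters cheap.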
Ask your data Did this test fail earlier this month?
Ask your data Show me all failed test runs of this class in this branch in the last month
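A question like this maps fairly directly onto an Elasticsearch bool query with filters. The following is a minimal sketch, assuming one document per test method run with the hypothetical fields `classname`, `branch`, `status` and `@timestamp`; the function only builds the request body and does not talk to a cluster.

```python
def failed_runs_query(classname, branch, since="now-1M/d"):
    """Build a search body for failed runs of one class on one branch.

    Field names are assumptions about the document model, not taken
    from the talk. `since` uses Elasticsearch date math.
    """
    return {
        "query": {
            "bool": {
                # Filters do not score, which suits exact yes/no matching.
                "filter": [
                    {"term": {"classname": classname}},
                    {"term": {"branch": branch}},
                    {"term": {"status": "failed"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


body = failed_runs_query("co.elastic.community.AdminServiceTests", "master")
```

The resulting dictionary can be sent as JSON to the `_search` endpoint of whatever index holds the test results.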
Ask your data Is this test failing only under this OS?
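If each test document carries the operating system it ran on, this becomes a terms aggregation with a nested filter. The sketch below assumes hypothetical `os`, `classname`, `test` and `status` keyword fields and again only builds the request body.

```python
def failures_by_os(classname, test):
    """Build a search body that buckets runs of one test by operating system.

    Each bucket's doc_count is the total number of runs on that OS, and the
    nested "failed" filter counts how many of those failed. Field names are
    assumptions about the document model.
    """
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"classname": classname}},
                    {"term": {"test": test}},
                ]
            }
        },
        "aggs": {
            "by_os": {
                "terms": {"field": "os"},
                "aggs": {
                    "failed": {"filter": {"term": {"status": "failed"}}}
                },
            }
        },
    }


os_body = failures_by_os("spec.custom_log_handler_spec",
                         "CustomLogHandler should log access to any other endpoint")
```

A test that fails on only one OS would show a non-zero `failed` count in exactly one bucket.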
Ask your data Is our test count increasing compared to our SLOC count?
Ask your data Are our successful test runs decreasing since we doubled the team size?
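Trend questions like this one are a natural fit for a date histogram. The sketch below assumes one summary document per CI run with hypothetical `branch`, `status` and `@timestamp` fields; `calendar_interval` is the parameter name used by Elasticsearch 7.2 and later.

```python
def weekly_success_trend(branch, since="now-1y/d"):
    """Build a search body that counts total and successful runs per week.

    Assumes one document per CI run; field names are illustrative only.
    """
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"branch": branch}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "per_week": {
                "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": "week",
                },
                "aggs": {
                    # Compare this count with the bucket's doc_count
                    # to get a success ratio per week.
                    "successful": {"filter": {"term": {"status": "success"}}}
                },
            }
        },
    }


trend_body = weekly_success_trend("master")
```

Plotting the per-week success ratio next to a marker for the date the team grew would answer the question visually, for example in a Kibana dashboard.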
Ask your data Do the holidays/OOO hours have impact on our CI? Can we reduce costs?
crystal spec
Structure test output
  xUnit
  Test Anything Protocol
TAP
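TAP output is line oriented, which makes it easy to turn into structured records. The parser below is a minimal sketch that handles only plain `ok` / `not ok` result lines and `# SKIP` directives; plan lines, diagnostics, YAML blocks and subtests are ignored. The function name `parse_tap` and the output keys are assumptions made for this example.

```python
import re

# Matches e.g. "ok 1 - description" or "not ok 2 - description # SKIP reason".
TAP_LINE = re.compile(
    r"^(?P<status>not ok|ok)\b\s*(?P<number>\d+)?\s*-?\s*"
    r"(?P<description>[^#]*)(?:#\s*(?P<directive>.*))?$"
)


def parse_tap(text):
    """Return one dict per TAP result line found in `text`."""
    results = []
    for line in text.splitlines():
        match = TAP_LINE.match(line.strip())
        if match is None:
            continue  # plan lines ("1..N"), diagnostics ("# ...") and the rest
        directive = (match.group("directive") or "").strip()
        number = match.group("number")
        results.append({
            "number": int(number) if number else None,
            "description": match.group("description").strip(),
            "passed": match.group("status") == "ok",
            "skipped": directive.upper().startswith("SKIP"),
        })
    return results


SAMPLE = """1..3
ok 1 - parses config
not ok 2 - handles timeout
ok 3 - uses network # SKIP no network
"""
tap_results = parse_tap(SAMPLE)
```

Each resulting dict can then be enriched with run metadata and indexed like any other test document.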
xUnit output
<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="0" time="0.261869582" timestamp="2020-07-21T12:43:50Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to '/' endpoint" time="3.6394e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000250577"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve html_url" time="7.589e-6"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve region" time="7.438e-6"/>
That’s it - right?
  Map standard output/error to tests
  Map logger output to tests
  Add run specific meta data (branch, OS, JVM)
  Indexing/Storing strategy
  Preaggregate data (test count, success/failure, failed tests)
  Code coverage metrics
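The preaggregation item above could look roughly like the following: next to the per-test documents, store one summary document per run so that trend queries do not have to scan every test result. This is a sketch under assumptions; the function `summarize_run` and all field names are hypothetical.

```python
def summarize_run(test_docs, meta):
    """Collapse per-test documents into one summary document for the run.

    `test_docs` is a list of dicts with at least "classname", "test" and
    "status" keys; `meta` holds run-level metadata such as branch or OS.
    """
    failed = [doc for doc in test_docs if doc["status"] == "failed"]
    summary = dict(meta)  # copy so the caller's metadata is not modified
    summary.update({
        "test_count": len(test_docs),
        "failed_count": len(failed),
        "success_count": sum(1 for doc in test_docs if doc["status"] == "success"),
        "status": "failed" if failed else "success",
        "failed_tests": sorted(f'{doc["classname"]}.{doc["test"]}' for doc in failed),
        "duration_seconds": sum(doc.get("duration_seconds", 0.0) for doc in test_docs),
    })
    return summary


run_summary = summarize_run(
    [
        {"classname": "spec.objects_spec", "test": "can retrieve region",
         "status": "success", "duration_seconds": 0.25},
        {"classname": "spec.custom_log_handler_spec", "test": "should log access",
         "status": "failed", "duration_seconds": 0.5},
    ],
    {"branch": "master", "os": "macos"},  # illustrative metadata
)
```

Questions such as "is our test count increasing" then only need the small summary documents.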
Index data into Elasticsearch
  Massage output data into proper JSON
  Logstash / xml2json / self-written tool?
  Integration with CI
Java Output (file per class)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="0" errors="0" timestamp="2020-07-14T11:32:02" hostname="rhincodon" time="0.016">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.015"/>
  <testcase name="testCustomAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.001"/>
  <system-out><![CDATA[TEST TEST TEST ]]></system-out>
  <system-err><![CDATA[]]></system-err>
</testsuite>
Java Output (test failure)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="1" errors="0" timestamp="2020-07-14T11:36:54" hostname="rhincodon" time="0.022">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.021">
    <failure message="org.opentest4j.AssertionFailedError: Expecting: <true> to be equal to: <false> but was not." type="org.opentest4j.AssertionFailedError">org.opentest4j.AssertionFailedError: Expecting: <true> to be equal to: <false> but was not.
      at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
      at co.elastic.community.AdminServiceTests.testDefaultAdminService(AdminServiceTests.java:13)
      …
Crystal Output (single XML file)
<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="1" time="0.258457318" timestamp="2020-07-14T11:26:23Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to '/' endpoint" time="3.1523e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000282161">
    <failure message="Expected: 2 got: 1">../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/methods.cr:76:5 in 'fail'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/expectations.cr:447:9 in 'should'
      spec/custom_log_handler_spec.cr:33:5 in '->'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/example.cr:33:16 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:330:7 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:147:7 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/dsl.cr:270:7 in '->'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:45:14 in 'main'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:114:3 in 'main'</failure>
  </testcase>
  …
</testsuite>
Demo
  Elastic Build Stats
  Elasticsearch CI
Summary
  Analytics use-case
  Not needed for operational tasks
  ML: Time series anomaly detection (network outage)
  Test triage!
Test triage
  Daily single person owner for test failures
  Check failure
  Assign team
  Disable tests to stabilize
The future[tm]
  Delivery pipeline as a service (K8s: Tekton)
  Much more automated canary deployments
  Smarter test execution (run the test cases affected by a code change first)
  More specialized roles around CI/CD
  Analytics built into pipelines
Thanks for listening Q&A Alexander Reelsen Community Advocate alex@elastic.co | @spinscale
Elastic Cloud
Elastic Support Subscriptions
Getting more help
Discuss Forum https://discuss.elastic.co
Community & Meetups https://community.elastic.co
Official Elastic Training https://training.elastic.co
This talk is about the challenges of making sense of your CI data. Getting the data into a format that lets you extract information from it is the first big step. We’ll talk about why CI data is useful and how to derive long-term trends from it.