Understand, Visualize & Improve Continuous Integration using the Elastic Stack

A presentation at SAEC Days 2020 by Alexander Reelsen

Understand, Visualize & Improve Continuous Integration
Alexander Reelsen, Community Advocate
alex@elastic.co | @spinscale

How do you act on your CI data?

Everything is a search problem!

Ecommerce

Social networks

File Search

Location Search

Observability

Elasticsearch in one minute
- Search engine (FTS, analytics, geo), near real-time
- Distributed, scalable, highly available, resilient
- Interface: HTTP & JSON
- Heart of the Elastic Stack (Kibana, Logstash, Beats)

Kibana in one minute
- The window into the Elastic Stack
- Visualization & dashboarding
- Management
- Monitoring

Survey time!

CI Tooling
- Jenkins, JenkinsX, TeamCity, Bamboo
- Travis, CircleCI, GitLab CI
- Azure Pipelines, AWS CI/CD, Google CI/CD
- Others?

CI infrastructure
- Self-hosted on own infrastructure
- Self-hosted in the cloud
- CI-as-a-service

CI as a role
- CI is part of ops role
- CI is part of dev role
- Infrastructure Engineer
- Build Engineer

Test test test
- dev branches (master, 7.x)
- release branches (7.8, 6.8)
- feature branches
- PRs
- BWC
- Benchmarking
- Packaging tests
- JVM versions
- Garbage collectors
- Operating systems

Do you act on CI results? If so, how?

What is CI data?
- Time series
- Recency
- Locality
- Fragmentation

CI output
- unstructured
- huge
- needle in the haystack (TRACE loglevel)
- requires postprocessing
- security

Analyzing CI results
- Detect seemingly random bugs
- Centralised lookup ability for the team
- Emails are not a good CI status medium
- Long-term trends (failures, test count, coverage)

Requirements
- metadata enrichment: per branch, per test run, per test method run, failures only
- data model should be based on search requirements (query optimized)
- define searchability: by timestamp, by branch, by class, by test method, by success/failure
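One way to read these requirements is a flat document per test method run, where every attribute you want to search by is its own top-level field. The sketch below is an assumption, not the schema shown in the talk; the class name `CiTestRun` and all field names are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CiTestRun:
    """One document per test method run (hypothetical field names)."""
    timestamp: str                          # ISO 8601 time of the test run
    branch: str                             # e.g. "master" or "7.x"
    classname: str                          # test class
    name: str                               # test method
    status: str                             # "success" or "failure"
    duration_seconds: float
    failure_message: Optional[str] = None   # only set for failures


def to_document(run: CiTestRun) -> dict:
    """Convert to a plain dict, dropping optional fields that are unset."""
    return {key: value for key, value in asdict(run).items() if value is not None}
```

Keeping the document flat means each search requirement (timestamp, branch, class, test method, success/failure) becomes a simple filter on one field.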

Ask your data Did this test fail earlier this month?

Ask your data Show me all failed test runs of this class in this branch in the last month
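This question maps onto a filtered search. A minimal sketch of the request body as a Python dict, assuming the hypothetical fields `classname`, `branch`, `status` and `timestamp` exist in the index as keyword and date fields; the body would be sent to the `_search` endpoint of whichever index holds the test runs.

```python
def failed_runs_query(classname: str, branch: str, since: str = "now-1M/d") -> dict:
    """Build a search body for failed runs of one class on one branch.

    Field names are assumptions about the data model, not taken from the talk.
    """
    return {
        "query": {
            "bool": {
                # Filters do not score documents, they only include or exclude.
                "filter": [
                    {"term": {"classname": classname}},
                    {"term": {"branch": branch}},
                    {"term": {"status": "failure"}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
        # Most recent failures first.
        "sort": [{"timestamp": {"order": "desc"}}],
    }
```

The default `now-1M/d` uses Elasticsearch date math: one month back from now, rounded down to the start of the day.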

Ask your data Is this test failing only under this OS?
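A terms aggregation can answer this by counting failures per operating system. The sketch below assumes a hypothetical keyword field `os` that was added to each test document as run metadata, alongside the assumed `classname`, `name` and `status` fields. It only shows where failures occurred; to conclude "only under this OS" you would also need to confirm that the test ran on the other systems.

```python
from typing import Dict


def failures_by_os_query(classname: str, name: str) -> dict:
    """Search body that buckets the failures of one test by operating system."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"classname": classname}},
                    {"term": {"name": name}},
                    {"term": {"status": "failure"}},
                ]
            }
        },
        "aggs": {"by_os": {"terms": {"field": "os"}}},
    }


def failure_counts(response: dict) -> Dict[str, int]:
    """Turn the aggregation part of a search response into {os: failure count}."""
    buckets = response["aggregations"]["by_os"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```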

Ask your data Is our test count increasing compared to our SLOC count?

Ask your data Are our successful test runs decreasing since we doubled the team size?

Ask your data Do the holidays/OOO hours have an impact on our CI? Can we reduce costs?

crystal spec

Structure test output
- xUnit
- Test Anything Protocol

TAP
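TAP output is line oriented: a plan line such as `1..3`, then one `ok` or `not ok` line per test, optionally followed by a `# SKIP` directive. A minimal sketch of a parser for the test lines, assuming that plain subset of the format; the function name `parse_tap` and the output field names are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Matches e.g. "ok 1 - description" or "not ok 2 - description # SKIP reason".
TEST_LINE = re.compile(
    r"^(?P<status>ok|not ok)\b"
    r"(?:\s+(?P<number>\d+))?"
    r"(?:\s+-?\s*(?P<description>[^#]*))?"
    r"(?:#\s*(?P<directive>.*))?$"
)


def parse_tap(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per TAP test line, ignoring plan and diagnostic lines."""
    for line in lines:
        match = TEST_LINE.match(line.strip())
        if match is None:
            continue  # plan ("1..3"), diagnostics ("# ...") or unrelated output
        directive = (match["directive"] or "").strip()
        if directive.upper().startswith("SKIP"):
            status = "skipped"
        elif match["status"] == "ok":
            status = "success"
        else:
            status = "failure"
        yield {
            "number": int(match["number"]) if match["number"] else None,
            "description": (match["description"] or "").strip(),
            "status": status,
        }
```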

xUnit output
<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="0" time="0.261869582" timestamp="2020-07-21T12:43:50Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to '/' endpoint" time="3.6394e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000250577"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve html_url" time="7.589e-6"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve region" time="7.438e-6"/>
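Because xUnit output is plain XML, turning it into one JSON-ready dict per test case takes only the standard library. A minimal sketch, assuming a single `testsuite` root element as in the output above; the function name `parse_xunit` and the output field names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def parse_xunit(xml_text: str) -> List[dict]:
    """Return one dict per <testcase>, carrying over suite-level attributes."""
    suite = ET.fromstring(xml_text)
    docs = []
    for case in suite.iter("testcase"):
        docs.append({
            "timestamp": suite.get("timestamp"),   # time of the whole run
            "hostname": suite.get("hostname"),
            "classname": case.get("classname"),
            "name": case.get("name"),
            # Durations may be in scientific notation, e.g. "7.438e-6".
            "duration_seconds": float(case.get("time", "0")),
        })
    return docs
```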

That’s it - right?
- Map standard output/error to tests
- Map logger output to tests
- Add run-specific metadata (branch, OS, JVM)
- Indexing/storing strategy
- Preaggregate data (test count, success/failure, failed tests)
- Code coverage metrics
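Two of these steps, adding run-specific metadata and preaggregating, can be sketched in a few lines. This assumes each test document is a dict with hypothetical `name` and `status` fields, and that the CI job supplies the run metadata (branch, OS, JVM) as a dict; none of these names come from the talk.

```python
from typing import List


def enrich(test_docs: List[dict], run_metadata: dict) -> List[dict]:
    """Copy run-level metadata (branch, OS, JVM, ...) onto every test document."""
    return [{**doc, **run_metadata} for doc in test_docs]


def summarize(test_docs: List[dict], run_metadata: dict) -> dict:
    """Build one pre-aggregated summary document for the whole run."""
    failed = [doc["name"] for doc in test_docs if doc.get("status") == "failure"]
    return {
        **run_metadata,
        "test_count": len(test_docs),
        "failure_count": len(failed),
        "success": not failed,
        "failed_tests": failed,
    }
```

The summary document makes long-term trend questions (test count, success rate) cheap to answer, while the enriched per-test documents stay available for drilling into a single failure.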

Index data into Elasticsearch
- Massage output data into proper JSON
- Logstash / xml2json / self-written tool?
- Integration with CI
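For the self-written tool route, the Elasticsearch bulk API accepts newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end of the body. A minimal sketch that only builds that body; the index name `ci-test-runs` in the usage note is a hypothetical example, and sending the request is left to whichever HTTP client the CI job already uses.

```python
import json
from typing import Iterable


def to_bulk_body(docs: Iterable[dict], index_name: str) -> str:
    """Serialize documents into an NDJSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))                                 # document line
    # Every line, including the last one, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

Calling `to_bulk_body(docs, "ci-test-runs")` gives a string that can be POSTed to `/_bulk` with the `Content-Type: application/x-ndjson` header.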

Java Output (file per class)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="0" errors="0" timestamp="2020-07-14T11:32:02" hostname="rhincodon" time="0.016">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.015"/>
  <testcase name="testCustomAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.001"/>
  <system-out><![CDATA[TEST TEST TEST ]]></system-out>
  <system-err><![CDATA[]]></system-err>
</testsuite>

Java Output (test failure)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="1" errors="0" timestamp="2020-07-14T11:36:54" hostname="rhincodon" time="0.022">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.021">
    <failure message="org.opentest4j.AssertionFailedError: 
Expecting:
 &lt;true&gt;
to be equal to:
 &lt;false&gt;
but was not." type="org.opentest4j.AssertionFailedError">org.opentest4j.AssertionFailedError: Expecting: &lt;true&gt; to be equal to: &lt;false&gt; but was not.
      at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
      at co.elastic.community.AdminServiceTests.testDefaultAdminService(AdminServiceTests.java:13)
      …

Crystal Output (single XML file)
<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="1" time="0.258457318" timestamp="2020-07-14T11:26:23Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to '/' endpoint" time="3.1523e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000282161">
    <failure message="Expected: 2 got: 1">../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/methods.cr:76:5 in 'fail'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/expectations.cr:447:9 in 'should'
      spec/custom_log_handler_spec.cr:33:5 in '->'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/example.cr:33:16 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:330:7 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:147:7 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/dsl.cr:270:7 in '->'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'run'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:45:14 in 'main'
      ../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:114:3 in 'main'</failure>
  </testcase>
  …
</testsuite>
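In both the Java and the Crystal output, a failed test is a `testcase` element with a nested `failure` element whose `message` attribute holds the summary and whose text holds the stack trace. A minimal sketch for pulling that out, assuming the element names shown in these outputs plus the common `error` and `skipped` child elements; the function name and output field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def classify_testcase(case: ET.Element) -> dict:
    """Derive status and failure details from one <testcase> element."""
    # Compare against None explicitly: an Element without children is falsy.
    problem = case.find("failure")
    if problem is None:
        problem = case.find("error")
    if problem is not None:
        return {
            "status": "failure",
            "failure_message": problem.get("message"),
            "failure_type": problem.get("type"),  # missing in the Crystal output
            "stacktrace": (problem.text or "").strip(),
        }
    if case.find("skipped") is not None:
        return {"status": "skipped"}
    return {"status": "success"}
```

Storing the message and the stack trace as separate fields keeps the short message usable for grouping similar failures, while the full trace remains searchable.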

Demo
- Elastic Build Stats
- Elasticsearch CI

Summary
- Analytics use-case
- Not needed for operational tasks
- ML: Time series anomaly detection (network outage)
- Test triage!

Test triage
- Daily single-person owner for test failures
- Check failure
- Assign team
- Disable tests to stabilize

The future[tm]
- Delivery pipeline as a service (K8s: Tekton)
- Much more automated canary deployments
- Smarter test execution (run the test cases affected by a code change first)
- More specialized roles around CI/CD
- Analytics built into pipelines

Thanks for listening
Q&A
Alexander Reelsen, Community Advocate
alex@elastic.co | @spinscale

Elastic Cloud

Elastic Support Subscriptions

Getting more help

Discuss Forum https://discuss.elastic.co

Community & Meetups https://community.elastic.co

Official Elastic Training https://training.elastic.co

This talk is about the challenges of making sense of your CI data. Getting the data into a format that lets you extract information from it is the first big step. We’ll talk about why CI data is useful and how to identify long-term trends in your data.