Understand, Visualize & Improve Continuous Integration using the Elastic Stack

A presentation at SAEC Days 2020 in July 2020 by Alexander Reelsen

Slide 1

Understand, Visualize & Improve Continuous Integration
Alexander Reelsen
Community Advocate
alex@elastic.co | @spinscale

Slide 2

How do you act on your CI data?

Slide 3

Everything is a search problem!

Slide 4

Ecommerce

Slide 5

Social networks

Slide 6

File Search

Slide 7

Location Search

Slide 8

Observability

Slide 9

Slide 10

Elasticsearch in one minute
Search Engine (FTS, Analytics, Geo), near realtime
Distributed, scalable, highly available, resilient
Interface: HTTP & JSON
Heart of the Elastic Stack (Kibana, Logstash, Beats)

Slide 11

Kibana in one minute
The window into the Elastic Stack
Visualization & Dashboarding
Management
Monitoring

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Survey time!

Slide 19

CI Tooling
Jenkins, JenkinsX, TeamCity, Bamboo
Travis, CircleCI, GitLab CI
Azure Pipelines, AWS CI/CD, Google CI/CD
Others?

Slide 20

CI infrastructure
Self Hosted on own infrastructure
Self Hosted in the cloud
CI-as-a-service

Slide 21

CI as a role
CI is part of ops role
CI is part of dev role
Infrastructure Engineer
Build Engineer

Slide 22

Test test test
dev branches (master, 7.x)
release branches (7.8, 6.8)
feature branches
PRs
BWC
Benchmarking
Packaging tests
JVM versions
Garbage collectors
Operating systems

Slide 23

Do you act on CI results? If so, how?

Slide 24

What is CI data?
Time series
Recency
Locality
Fragmentation

Slide 25

CI output
unstructured
huge
needle in the haystack (TRACE loglevel)
requires postprocessing
security

Slide 26

Analyzing CI results
Detect seemingly random bugs
Centralised lookup ability for the team
Emails are not a good CI status medium
Long term trends (failures, test count, coverage)

Slide 27

Requirements
meta data enrichment - per branch, per test run, per test method run, failures only
data model should be based on search requirements (query optimized)
define search ability: by timestamp, by branch, by class, by test method, by success/failure
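
One way to read these requirements is a flat document per test method run, with the run-level metadata copied onto every document so that each listed search dimension becomes a plain field lookup. The sketch below is only an illustration: the class name, function name, and every field name are assumptions, not something taken from the talk.

```python
from dataclasses import dataclass, asdict


@dataclass
class ResultDoc:
    """One document per test method run (all field names are hypothetical)."""
    timestamp: str     # ISO 8601 time of the run
    branch: str        # e.g. "master" or "7.x"
    run_id: str        # identifier of the CI run
    classname: str     # test class
    name: str          # test method
    status: str        # "success" or "failure"
    duration_s: float  # runtime in seconds
    os: str            # run-level metadata, repeated on every document


def make_doc(run_meta: dict, classname: str, name: str,
             status: str, duration_s: float) -> dict:
    """Denormalize run-level metadata into a single searchable document."""
    doc = ResultDoc(
        timestamp=run_meta["timestamp"],
        branch=run_meta["branch"],
        run_id=run_meta["run_id"],
        classname=classname,
        name=name,
        status=status,
        duration_s=duration_s,
        os=run_meta["os"],
    )
    return asdict(doc)
```

Repeating the run metadata on each document costs storage, but it means filtering by timestamp, branch, class, test method, or success/failure needs no joins at query time.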

Slide 28

Ask your data
Did this test fail earlier this month?

Slide 29

Ask your data
Show me all failed test runs of this class in this branch in the last month
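
A question like this maps fairly directly onto an Elasticsearch bool query made only of filter clauses. The following is a minimal sketch of the request body as a Python dict. It assumes one document per test run with hypothetical fields `status`, `classname`, `branch` (all mapped as keyword) and `timestamp` (mapped as date); none of these names come from the talk.

```python
def failed_runs_query(classname: str, branch: str, since: str = "now-1M") -> dict:
    """Build a search body for failed runs of one class in one branch.

    Only filter clauses are used, since relevance scoring is not needed here.
    "now-1M" is Elasticsearch date math for "one month ago".
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": "failure"}},       # hypothetical field
                    {"term": {"classname": classname}},    # hypothetical field
                    {"term": {"branch": branch}},          # hypothetical field
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
        # Newest failures first.
        "sort": [{"timestamp": {"order": "desc"}}],
    }
```

The resulting dict can be sent as the JSON body of a search request against whichever index holds the test results.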

Slide 30

Ask your data
Is this test failing only under this OS?
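
This kind of question can be approached with a terms aggregation: filter to the failures of one test and count them per operating system. The sketch below assumes hypothetical keyword fields `classname`, `name`, `status` and `os`, and an aggregation named `by_os`; the helper reads the standard `aggregations.<name>.buckets` shape of a search response.

```python
def failures_by_os_query(classname: str, name: str) -> dict:
    """Search body that counts failures of one test per operating system."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"classname": classname}},  # hypothetical field
                    {"term": {"name": name}},            # hypothetical field
                    {"term": {"status": "failure"}},     # hypothetical field
                ]
            }
        },
        "aggs": {"by_os": {"terms": {"field": "os"}}},
    }


def failing_operating_systems(response: dict) -> list:
    """Return the OS keys that have at least one failure in the response."""
    buckets = response["aggregations"]["by_os"]["buckets"]
    return [bucket["key"] for bucket in buckets if bucket["doc_count"] > 0]
```

If the returned list has a single entry, the failures seen so far all happened under that one OS. That is a hint rather than proof, because the test may simply have run less often elsewhere.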

Slide 31

Ask your data
Is our test count increasing compared to our SLOC count?

Slide 32

Ask your data
Are our successful test runs decreasing since we doubled the team size?

Slide 33

Ask your data
Do holidays/OOO hours have an impact on our CI? Can we reduce costs?

Slide 34

crystal spec

Slide 35

Structure test output
xUnit
Test Anything Protocol

Slide 36

TAP
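
For orientation, TAP (the Test Anything Protocol) is a line-based format: a plan line such as `1..2`, followed by one `ok` or `not ok` line per test. The sketch below parses only those result lines into dicts. It is a simplified illustration, not a complete TAP parser: directives like `# SKIP` and YAML diagnostic blocks are not interpreted, and the function name and output keys are made up for this example.

```python
import re

# "ok 1 - description" or "not ok 2 - description"; number and dash are optional.
TAP_LINE = re.compile(r"^(not ok|ok)\b\s*(\d+)?\s*(?:-\s*)?(.*)$")


def parse_tap(text: str) -> list:
    """Turn TAP result lines into a list of dicts, skipping everything else."""
    results = []
    for line in text.splitlines():
        match = TAP_LINE.match(line.strip())
        if match is None:
            continue  # plan line ("1..N"), "# ..." diagnostics, or other output
        status, number, description = match.groups()
        results.append({
            "ok": status == "ok",
            "number": int(number) if number else None,
            "description": description,
        })
    return results
```

Feeding it the text `1..2`, `ok 1 - adds numbers`, `not ok 2 - handles empty input` (one per line) yields two entries, the second with `"ok": False`.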

Slide 37

xUnit output

<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="0" time="0.261869582" timestamp="2020-07-21T12:43:50Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to &#39;/&#39; endpoint" time="3.6394e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000250577"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve html_url" time="7.589e-6"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/objects_spec.cr" classname="spec.objects_spec" name="Objects Event can retrieve region" time="7.438e-6"/>

Slide 38

That's it - right?
Map standard output/error to tests
Map logger output to tests
Add run specific meta data (branch, OS, JVM)
Indexing/Storing strategy
Preaggregate data (test count, success/failure, failed tests)
Code coverage metrics
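
The metadata and preaggregation items can be illustrated with a small helper that folds the per-test results of one run into a single summary document. This is only a sketch under assumptions: the function name, the input shape (dicts with `classname`, `name` and `status`) and the output field names are all hypothetical.

```python
def summarize_run(run_meta: dict, test_docs: list) -> dict:
    """Preaggregate one CI run into a single summary document.

    run_meta carries run-specific metadata (for example branch, OS, JVM);
    test_docs is a list of per-test dicts with "classname", "name", "status".
    """
    failed = [
        doc["classname"] + "." + doc["name"]
        for doc in test_docs
        if doc["status"] == "failure"
    ]
    summary = dict(run_meta)  # copy, so the caller's metadata is not mutated
    summary.update({
        "test_count": len(test_docs),
        "failure_count": len(failed),
        "success": not failed,   # True only if no test failed
        "failed_tests": failed,
    })
    return summary
```

Storing such a summary next to the per-test documents keeps long-term trend dashboards cheap, since they no longer need to aggregate over every single test result.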

Slide 39

Index data into Elasticsearch
Massage output data into proper JSON
logstash / xml2json / self-written tool?
Integration with CI

Slide 40

Java Output (file per class)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="0" errors="0" timestamp="2020-07-14T11:32:02" hostname="rhincodon" time="0.016">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.015"/>
  <testcase name="testCustomAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.001"/>
  <system-out><![CDATA[TEST TEST TEST ]]></system-out>
  <system-err><![CDATA[]]></system-err>
</testsuite>

Slide 41

Java Output (test failure)
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="co.elastic.community.AdminServiceTests" tests="2" skipped="0" failures="1" errors="0" timestamp="2020-07-14T11:36:54" hostname="rhincodon" time="0.022">
  <properties/>
  <testcase name="testDefaultAdminService()" classname="co.elastic.community.AdminServiceTests" time="0.021">
    <failure message="org.opentest4j.AssertionFailedError: &#10;Expecting:&#10; &lt;true&gt;&#10;to be equal to:&#10; &lt;false&gt;&#10;but was not." type="org.opentest4j.AssertionFailedError">org.opentest4j.AssertionFailedError:
Expecting:
 <true>
to be equal to:
 <false>
but was not.
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
    at co.elastic.community.AdminServiceTests.testDefaultAdminService(AdminServiceTests.java:13)
    …

Slide 42

Crystal Output (single XML file)
<?xml version="1.0"?>
<testsuite tests="62" skipped="0" errors="0" failures="1" time="0.258457318" timestamp="2020-07-14T11:26:23Z" hostname="rhincodon">
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should not log access to &#39;/&#39; endpoint" time="3.1523e-5"/>
  <testcase file="/Users/alr/devel/elastic/community/slack-command-community/spec/custom_log_handler_spec.cr" classname="spec.custom_log_handler_spec" name="CustomLogHandler should log access to any other endpoint" time="0.000282161">
    <failure message="Expected: 2 got: 1">../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/methods.cr:76:5 in 'fail'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/expectations.cr:447:9 in 'should'
spec/custom_log_handler_spec.cr:33:5 in '->'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'internal_run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/example.cr:33:16 in 'run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:330:7 in 'run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:18:23 in 'internal_run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/context.cr:147:7 in 'run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/spec/dsl.cr:270:7 in '->'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/primitives.cr:255:3 in 'run'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:45:14 in 'main'
../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:114:3 in 'main'</failure>
  </testcase>
  …
</testsuite>

Slide 43

Demo
Elastic Build Stats
Elasticsearch CI

Slide 44

Slide 45

Slide 46

Slide 47

Summary
Analytics use-case
Not needed for operational tasks
ML: Time series anomaly detection (network outage)
Test triage!

Slide 48

Test triage
Daily single person owner for test failures
Check failure
Assign team
Disable tests to stabilize

Slide 49

The future[tm]
Delivery pipeline as a service (K8s: Tekton)
Much more automated canary deployments
Smarter test execution (run test cases affected by code changes first)
More specialized roles around CI/CD
Analytics built into pipelines

Slide 50

Thanks for listening
Q&A
Alexander Reelsen
Community Advocate
alex@elastic.co | @spinscale

Slide 51

Elastic Cloud

Slide 52

Elastic Support Subscriptions

Slide 53

Getting more help

Slide 54

Discuss Forum https://discuss.elastic.co

Slide 55

Community & Meetups https://community.elastic.co

Slide 56

Official Elastic Training https://training.elastic.co

Slide 57

Thanks for listening
Q&A
Alexander Reelsen
Community Advocate
alex@elastic.co | @spinscale