Performance Testing Chrome

A presentation at Automationeers Assemble in April 2019 by Annie Sullivan

Slide 1

Performance Testing Chrome: Lessons learned from 4 years in the trenches

Slide 2

Hi, I’m Annie! @anniesullie Led performance testing on Chrome from 2015-2018 after contributing since 2011. Previously worked on web performance for Google Search and Google Docs. Now work on metrics for Chrome.

Slide 3

Wait, how do we define performance?
● Speed (page load speed, JavaScript speed, etc.)
● Smoothness (animations, scrolling, etc.)
● Memory usage
● Battery usage
● Binary size

Slide 4

Why do we do performance testing, anyway?
● First and foremost, to detect regressions.
  ○ Improving performance doesn’t help if you always regress.
  ○ Lab tests can narrow down regressions to specific commits.
● Performance tests can sometimes be used to measure improvements.
  ○ Most useful early in development.
  ○ In most cases, we prefer A/B testing with end users to measure improvements.

Slide 5

Competing Goals
● Reproducibility
● Realism
● Understandability

Slide 6

Reproducibility
● Is this a real regression, or just noise?
● Can we repeat the test to narrow down to a commit?
● Can a developer reproduce the problem locally?

Slide 7

Realism

Slide 8

Reality changes over time
● Web content changes
● Devices users browse the web on change
● Network quality and speed changes

Slide 9

Understandability
● What does the test measure?
● What diagnostic information is available?
● What tools are available for local debugging?

Slide 10

Our Performance Testing Stack
● Bisect
● Dashboard
● Benchmarks
● Test Framework
● Hardware

Slide 11

Hardware: Realism
● Test on real devices
  ○ Android
  ○ Windows
  ○ Mac
  ○ ChromeOS
  ○ Linux
● Use official release builds (unsigned)

Slide 12

Hardware: Reproducibility
● Each run of a test case is on the exact same device
● Run a “reference build” (Chrome stable) side-by-side with the build under test
● Turn off things running in the background as much as possible
● On Android, wait for the device to cool between runs
● On Android, often throttle the CPU

Slide 13

Test framework: Telemetry
● We call it Telemetry.
● Source code: https://github.com/catapult-project/catapult/tree/master/telemetry
● Cross-platform
  ○ Android
  ○ Windows
  ○ Mac
  ○ ChromeOS
  ○ Linux

Slide 14

Test framework: Telemetry
Two parts of a benchmark
● Metrics
  ○ Performance measurements
  ○ Generally independent of the story
● Stories
  ○ The test case
  ○ Usually a web page
  ○ Supports user input, multi-page navigations, multi-app on Android
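To make the story/metric split concrete, here is a minimal sketch loosely modeled on Telemetry's Python API; the class and method names (story.StorySet, page.Page, RunPageInteractions, ScrollPage) are as I remember them and may differ slightly from the current source tree.

```python
# Sketch of a Telemetry-style story: the test case (a page plus simulated
# user interactions), defined independently of the metrics computed over it.
from telemetry import story
from telemetry.page import page as page_module


class ExampleScrollStory(page_module.Page):
  def __init__(self, story_set):
    super(ExampleScrollStory, self).__init__(
        url='https://example.com', page_set=story_set, name='example_scroll')

  def RunPageInteractions(self, action_runner):
    # The story only describes what the user does; measurements come from
    # metrics computed over the trace recorded while this runs.
    action_runner.ScrollPage()


class ExampleStorySet(story.StorySet):
  def __init__(self):
    super(ExampleStorySet, self).__init__()
    self.AddStory(ExampleScrollStory(self))
```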

Slide 15

Telemetry: Realism
● Use WprGo to record/replay real web pages
● End-to-end testing of thousands of real web pages
● Simulate network conditions like 3G
● Simulate user input, multi-page navigations

Slide 16

Telemetry: Reproducibility
Provides a deterministic environment for repeatable results.
● Set up browser profile, caching, etc. the same on each run
● Replay recorded sites instead of live sites
● Network simulation for a consistent network speed
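As a rough sketch of how a story set opts into recorded pages and a simulated network profile (rather than live sites and a real network), it can look something like the following; the parameter names archive_data_file, cloud_storage_bucket, and traffic_setting are from my recollection of Telemetry and may not match the current API exactly.

```python
from telemetry import story
from telemetry.page import page as page_module
from telemetry.page import traffic_setting as traffic_setting_module


class ReplayStorySet(story.StorySet):
  def __init__(self):
    # Point at a recorded WPR archive instead of the live site, so every
    # run sees identical responses.
    super(ReplayStorySet, self).__init__(
        archive_data_file='data/example.json',
        cloud_storage_bucket=story.PARTNER_BUCKET)
    self.AddStory(page_module.Page(
        url='https://example.com', page_set=self, name='example_3g',
        # Simulated network profile for a consistent network speed.
        traffic_setting=traffic_setting_module.REGULAR_3G))
```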

Slide 17

Telemetry: Understandability
● Metrics are generated from Chrome traces, which are available to developers
● Metrics can be broken down so it’s clear which components contributed
● Benchmark owner, documentation, and bug component required in the definition
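Telemetry's real metrics are computed by its trace-processing pipeline, but as a hedged illustration of "metrics generated from Chrome traces", here is a tiny ad-hoc script that sums the duration of a named event in a Chrome JSON trace; the trace file and event name in the usage comment are purely illustrative.

```python
import json


def total_event_duration_ms(trace_path, event_name):
  """Sum wall-clock time of matching complete ('X') events in a Chrome trace.

  Chrome JSON traces carry a 'traceEvents' list whose 'ts' and 'dur' fields
  are in microseconds. This is an ad-hoc illustration, not how Telemetry's
  metrics are actually implemented.
  """
  with open(trace_path) as f:
    trace = json.load(f)
  events = trace['traceEvents'] if isinstance(trace, dict) else trace
  total_us = sum(e.get('dur', 0) for e in events
                 if e.get('name') == event_name and e.get('ph') == 'X')
  return total_us / 1000.0


# Hypothetical usage:
# print(total_event_duration_ms('pageload_trace.json', 'ParseHTML'))
```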

Slide 18

Telemetry: Understandability

Slide 19

Benchmarks: harnesses
● User-facing: System Health, Loading, Memory, Rendering, Power, Media, Startup, WebRTC
● In-page Micro (JavaScript/DOM, cross-browser): blink_perf, V8 Runtime
● C++ Micro: C++ microbenchmarks

Slide 20

User-facing benchmarks focus on realism
● Measure key metrics on thousands of real web pages
● Get end-to-end performance measurements across the entire codebase

Slide 21

In-page microbenchmarks focus on usability
● Easy to compare results between browsers
● Can easily read the source and inspect in devtools
● Easy to profile/trace

Slide 22

C++ microbenchmarks have reproducibility and usability
● Easy to profile
● Smallest amount of code under test

Slide 23

Perf Dashboard
● Source code: https://github.com/catapult-project/catapult/tree/master/dashboard
● Benchmark results uploaded after every run
● Automatically detects and groups regressions in timeseries
● Integrates with the bug tracker
● Integrates with the bisect tool

Slide 24

Perf Dashboard Reproducibility: Regression detection

Slide 25

Perf Dashboard Reproducibility: Regression Detection
● Uses a sliding-window step detection algorithm
  ○ Runs each time a new data point is added
  ○ Divides each window into two possible segments, trying every possible division
  ○ Finds the greatest difference between the two segments
  ○ Checks whether the greatest difference passes the filters
● Filters are user-configurable
  ○ Segment size
  ○ Absolute change
  ○ Relative change
  ○ Multiple of standard deviation
  ○ Steppiness
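The dashboard's actual implementation lives in the catapult repository, but a toy version of the sliding-window idea described above might look like the sketch below; the threshold values are made up and the steppiness filter is omitted.

```python
import statistics


def find_step(window, min_segment_size=3, min_absolute_change=1.0,
              min_relative_change=0.01, min_std_devs=2.0):
  """Toy sliding-window step detection over one window of a timeseries.

  Tries every division of the window into two segments, keeps the division
  with the greatest difference between segment means, then checks that
  difference against configurable filters. Returns the split index, or None.
  """
  best_split, best_delta = None, 0.0
  for i in range(min_segment_size, len(window) - min_segment_size + 1):
    delta = abs(statistics.mean(window[i:]) - statistics.mean(window[:i]))
    if delta > best_delta:
      best_split, best_delta = i, delta

  if best_split is None:
    return None
  before = window[:best_split]
  mean_before = statistics.mean(before)
  std_before = statistics.stdev(before) if len(before) > 1 else 0.0
  if best_delta < min_absolute_change:
    return None
  if mean_before and best_delta / abs(mean_before) < min_relative_change:
    return None
  if std_before and best_delta < min_std_devs * std_before:
    return None
  return best_split


# Hypothetical usage: a step appears after the sixth point, so this prints 6.
# print(find_step([100, 101, 99, 100, 101, 100, 110, 111, 109, 110]))
```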

Slide 26

Perf Dashboard: Understandability
● Automatically links traces generated by Telemetry
● Allows users to re-run a test on the same bot with additional trace categories
● Links test ownership and documentation
● Shows the list of performance regressions/improvements at each commit

Slide 27

Bisection: Pinpoint
● Source code: https://github.com/catapult-project/catapult/tree/master/dashboard/dashboard/pinpoint
● Bisection kicked off by the perf dashboard for regressions
● Also supports test failure/flakiness

Slide 28

Pinpoint Reproducibility: Bisection algorithm
● Algorithm explainer
● Considers the set of results from all runs at a given revision as a distribution of samples
● Uses multiple hypothesis tests to compare distributions; a t-test isn’t appropriate because the data is not normally distributed
● Has a custom sharding algorithm to make use of multiple devices
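The slide doesn't name the specific hypothesis tests, but as an illustration of comparing two revisions' sample distributions without assuming normality, here is a sketch using SciPy's Mann-Whitney U test; Pinpoint's actual comparison logic is in the linked source, and the alpha here is arbitrary.

```python
from scipy import stats


def samples_look_different(samples_a, samples_b, alpha=0.01):
  """Decide whether results from two revisions look like different distributions.

  Uses the non-parametric Mann-Whitney U test, which does not assume the
  samples are normally distributed (benchmark timings usually are not).
  Illustrative stand-in only, not Pinpoint's actual comparison code.
  """
  _, p_value = stats.mannwhitneyu(samples_a, samples_b, alternative='two-sided')
  return p_value < alpha


# Hypothetical usage: repeated runs of one benchmark at two revisions.
# before = [101.2, 99.8, 100.5, 102.0, 100.9, 101.5]
# after = [110.3, 109.7, 111.2, 108.9, 110.5, 109.9]
# print(samples_look_different(before, after))
```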

Slide 29

Pinpoint Reproducibility: Bisection Results

Slide 30

Pinpoint Understandability
● Links to traces from all runs
● Allows re-running with more tracing categories
● Integrates with the bugs database
● Can run an A/B test on any benchmark/bot on an unsubmitted change