Go Green, Stay Green Fixing the intermittent failures in your CI - - PowerPoint PPT Presentation


slide-1
SLIDE 1

https://undo.io

Go Green, Stay Green

— Fixing the intermittent failures in your CI

Greg Law, co-founder and CTO

slide-2
SLIDE 2
slide-3
SLIDE 3

From 1990 - 2005 development hardly changed

slide-4
SLIDE 4

In the last ten years everything has changed

Test OK?

What does this mean??

100% test coverage?

(obviously not.)

100% reliable test-suite? Absolutely!

slide-5
SLIDE 5

The productivity vs quality tradeoff

Quality Productivity

slide-6
SLIDE 6

The productivity vs quality tradeoff

Quality Productivity

slide-7
SLIDE 7
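  • Scenario 1: 50 tests, run once per week
  • Half solid, half 99% reliable
  • 25 flaky tests x 4 runs per month = 100 flaky test runs; 100 x 0.01
  • = 1 failure per month
  • Scenario 2: 50,000 tests per hour
  • Half solid, half 99% reliable: 25,000 x 0.01
  • = 250 failures per hour
  • 250 x (24 x 7 x 4.333 hours per month)
  • = 182,000 failures per month
  • Scenario 3: 50,000 tests per hour
  • 99% solid, 1% of tests 99.9% reliable
  • 50,000 x 0.01 x 0.001
  • = 0.5 failures per hour, i.e. 1 failure every 2 hours
  • 168 hours per week / 2
  • = 84 failures per week
  • 84 x 4.333
  • = 364 failures per month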

Arithmetic lesson
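The slide's arithmetic can be checked with a short Python sketch. The function name `failures_per_month` and its parameters are illustrative assumptions, not something from the talk; the model simply multiplies the number of flaky tests by their per-run failure rate and by the number of runs per month, using the slide's 4.333 weeks-per-month approximation.

```python
HOURS_PER_WEEK = 24 * 7
WEEKS_PER_MONTH = 4.333  # approximation used on the slide


def failures_per_month(tests_per_run, flaky_fraction, failure_rate, runs_per_month):
    """Expected spurious failures per month (hypothetical helper).

    tests_per_run  -- number of tests executed in each run
    flaky_fraction -- share of those tests that are flaky (0..1)
    failure_rate   -- chance that a flaky test fails spuriously in one run
    runs_per_month -- how many times the suite runs in a month
    """
    flaky_tests = tests_per_run * flaky_fraction
    return flaky_tests * failure_rate * runs_per_month


# 50 tests once per week, half of them 99% reliable (4 runs per month).
weekly_small = failures_per_month(50, 0.5, 0.01, 4)

# 50,000 tests every hour, half of them 99% reliable.
hourly_runs = HOURS_PER_WEEK * WEEKS_PER_MONTH
hourly_half_flaky = failures_per_month(50_000, 0.5, 0.01, hourly_runs)

# 50,000 tests every hour, only 1% flaky and those 99.9% reliable.
hourly_rare_flaky = failures_per_month(50_000, 0.01, 0.001, hourly_runs)

print(f"weekly, 50 tests:        {weekly_small:,.0f} per month")
print(f"hourly, half flaky:      {hourly_half_flaky:,.0f} per month")
print(f"hourly, 1% flaky @99.9%: {hourly_rare_flaky:,.0f} per month")
```

Even in the most optimistic hourly case the result is in the hundreds of spurious failures per month, which is the point of the slide.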

slide-8
SLIDE 8

The productivity vs quality tradeoff

Quality Productivity

Go green, stay green

  • Going green is the hard bit.
  • But it’s essential.

Step 1: exclude flaky tests

Ever-growing backlog of tests that are flaky where no one understands why.

slide-9
SLIDE 9

  • Remove the flaky tests? Viable, but has the obvious flaw of reducing coverage. The flaky tests are often the most interesting.
  • Write only deterministic tests? Not viable, because deterministic tests are unable to catch non-deterministic errors (e.g. race conditions). It also excludes fuzz testing and other powerful techniques.
  • Fix the flaky tests? Gee thanks, great advice(!)

CI/CD vision assumes reliable/repeatable testing - what to do?

slide-10
SLIDE 10

The intermittent test failures kill us

  • 1000s more tests every hour.
  • Even a 0.1% failure rate is very bad news.
  • Most of them probably don’t really matter.
  • So we’ll come back to them later; it should be less hectic next week.

Ever-growing backlog of tests that are flaky where no one understands why.
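A rough way to see why even a 0.1% failure rate is bad news: if each test fails spuriously and independently with probability p, the chance that a run of n tests is fully green is (1 - p)^n. The independence assumption and the function name `green_probability` are mine, not the talk's, so treat this as a back-of-the-envelope sketch.

```python
def green_probability(num_tests, failure_rate):
    """Chance that every test passes, assuming independent spurious failures."""
    return (1.0 - failure_rate) ** num_tests


# Illustrative suite sizes; each test has a 0.1% spurious failure rate.
for num_tests in (100, 1_000, 10_000):
    chance = green_probability(num_tests, 0.001)
    print(f"{num_tests:>6} tests: {chance:.1%} chance of a green run")
```

Under these assumptions a 1,000-test suite is green only about 37% of the time, so a red build stops carrying much information.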

slide-11
SLIDE 11
slide-12
SLIDE 12

Continuous Integration Stress Testing of SAP HANA

  • SAP HANA as an enterprise-class, in-memory database management system
  • OLTP and OLAP, relational and NoSQL functionality in a single system
  • Complex codebase
  • Very strict quality and governance processes
  • Sophisticated continuous integration platform
  • Large functional and performance test harness (see Rehmann@RDSS 2014)
  • “Regular” tests plus highly parallel, multi-user stress tests (PMUT)
  • Arbitrary database operations (DML, DDL, etc) in parallel
  • High amount of stress for system resources
  • Complements other tests with explorative/non-deterministic testing
  • Similar approaches with other systems (“chaos monkey”)
slide-13
SLIDE 13

  • Record program’s execution
  • Replay at any time
  • Freeze-frame
  • Single-step backwards
  • Single-step forwards
  • Find out why the program made the decisions it did

Software Flight Recording Technology
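The record-and-replay idea can be illustrated with a toy Python sketch. This is not how Live Recorder or UndoDB work internally (they record a real process's execution); it only shows the principle that logging every nondeterministic input lets a failing run be replayed identically. All names here (`Recorder`, `Replayer`, `run_job`, `record_until_failure`) are hypothetical.

```python
import random


class Recorder:
    """Wraps a source of nondeterminism and logs every value it hands out."""

    def __init__(self, source):
        self._source = source
        self.log = []

    def next_value(self):
        value = self._source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds back a previously recorded log instead of fresh values."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def run_job(nondet):
    """Toy program under test: an update is lost on rare 'unlucky timing'."""
    counter = 0
    history = []
    for step in range(5):
        timing = nondet.next_value()
        if timing > 0.9:
            history.append((step, timing, "update lost"))
        else:
            counter += 1
            history.append((step, timing, "updated"))
    return counter == 5, history


def record_until_failure(rng, max_runs=1000):
    """Run the job under recording until it fails; keep that recording."""
    for _ in range(max_runs):
        recorder = Recorder(rng.random)
        ok, history = run_job(recorder)
        if not ok:
            return recorder.log, history
    return None


recording = record_until_failure(random.Random(2024))
if recording is not None:
    log, failing_history = recording
    ok, replayed_history = run_job(Replayer(log))
    # The replay fails in exactly the same way, every time it is run.
    print("replay reproduces failure:", (not ok) and replayed_history == failing_history)
```

Once the recording exists, the sporadic failure can be stepped through as often as needed without waiting for the unlucky timing to recur.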

slide-14
SLIDE 14
slide-15
SLIDE 15

The solution

  • SAP uses Live Recorder from Undo to record multi-user stress test (PMUT) runs
  • When a failure occurs the recording is kept and handed over to developers to diagnose
  • Turns the sporadic problem into a 100% reproducible one
  • SAP developers use Live Recorder’s interactive reversible debugger – UndoDB – on the recording to diagnose the root cause of the problem

slide-16
SLIDE 16

Captured in test and diagnosed with Live Recorder

  • A number of sporadic memory leaks and memory corruption defects
  • Several issues in the networking code, including the incorrect flushing of a receive buffer and sporadically releasing channels in cases of timeout, resulted in queries incorrectly aborting
  • Incorrect parallel access to a shared data-structure which resulted in very subtle sporadic problems which were hard to reproduce
  • Very sporadic race condition in SAP HANA’s asynchronous garbage collection for in-memory table structures during table unloads under heavy system load
  • A race condition in SAP HANA’s transaction management cache with the potential of incorrectly reusing cached session data

slide-17
SLIDE 17

@gregthelaw https://undo.io/resources/gdb-watchpoint/

Questions