SLIDE 1

Testing atomicity

Finding race conditions by random testing John Hughes

SLIDE 2

"We know there is a lurking bug somewhere in the dets code. We have got 'bad object' and 'premature eof' every other month the last year. We have not been able to track the bug down since the dets files is repaired automatically next time it is opened."
Tobbe Törnqvist, Klarna, 2007

SLIDE 3

What is it?

  • Application: invoicing services for web shops
  • Mnesia: distributed database (transactions, distribution, replication)
  • Dets: tuple storage
  • File system

700+ people in 6 years

Race conditions?

SLIDE 4

QuickCheck

  • 1999: invented by Koen Claessen and me (ICFP 2000), in Haskell
  • 2006: Quviq founded, marketing an Erlang version
  • 2009: race condition testing method (ICFP)
  • Real successes and further developments

SLIDE 5

Imagine Testing This…

dispenser:take_ticket()
dispenser:reset()

SLIDE 6

A Unit Test in Erlang

test_dispenser() ->
    ok = reset(),
    1 = take_ticket(),
    2 = take_ticket(),
    3 = take_ticket(),
    ok = reset(),
    1 = take_ticket().

Expected results are on the left of each match.

Side effects require a sequence of calls to test.

SLIDE 7

State Machine Specifications

[Diagram: a chain of API calls, each leading from one model state to the next, with postconditions attached to the calls.]

SLIDE 8

Modelling the dispenser

reset

take take take 1 2

ok 1 2 3
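
The dispenser model on this slide can be sketched in plain Erlang. This is only an illustration, not the QuickCheck specification from the talk: the module name `dispenser_model` and its functions are hypothetical, and the model state is assumed to be the number of tickets handed out since the last reset.

```erlang
-module(dispenser_model).
-export([initial_state/0, next_state/2, expected/2, check_sequence/1]).

%% Assumed model state: number of tickets handed out since the last reset.
initial_state() -> 0.

%% How each call changes the model state.
next_state(_N, reset) -> 0;
next_state(N, take_ticket) -> N + 1.

%% The result the model predicts for a call made in state N.
expected(_N, reset) -> ok;
expected(N, take_ticket) -> N + 1.

%% Postcondition check over a sequence of {Call, ObservedResult} pairs:
%% true only if every observed result matches the model's prediction.
check_sequence(Trace) ->
    check_sequence(initial_state(), Trace).

check_sequence(_State, []) ->
    true;
check_sequence(State, [{Call, Result} | Rest]) ->
    case expected(State, Call) of
        Result -> check_sequence(next_state(State, Call), Rest);
        _ -> false
    end.
```

Feeding it the trace reset, take, take, take with results ok, 1, 2, 3 is accepted; a trace in which two consecutive takes both return 1 is rejected.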
SLIDE 9

A Parallel Unit Test

  • Three possible correct outcomes!

reset take_ticket take_ticket take_ticket 1 2 3 1 3 2 1 2 1

ok
SLIDE 10

Another Parallel Test

  • 42 possible correct outcomes!

reset take_ticket take_ticket take_ticket take_ticket reset

SLIDE 11

Deciding a Parallel Test

reset ok

take 1 take 3 take 2 1 2

Atomic operations: an important special case
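
Deciding a parallel test amounts to searching for some interleaving of the parallel calls that the sequential model can explain. The sketch below does this by brute force for the ticket dispenser. The module `serial_check`, the `step/2` model, and the assumption that the dispenser starts at 0 are all illustrative, not taken from the talk or from QuickCheck.

```erlang
-module(serial_check).
-export([serializable/2]).

%% Hypothetical dispenser model: step(State, Call) -> {ExpectedResult, NextState}.
step(_N, reset) -> {ok, 0};
step(N, take_ticket) -> {N + 1, N + 1}.

%% Prefix is a list of {Call, Result} run sequentially; Threads is a list of
%% such lists run in parallel. True if some interleaving explains all results.
serializable(Prefix, Threads) ->
    case run(0, Prefix) of
        {ok, State} -> explain(State, Threads);
        error -> false
    end.

%% Replay a sequential trace against the model.
run(State, []) -> {ok, State};
run(State, [{Call, Result} | Rest]) ->
    case step(State, Call) of
        {Result, Next} -> run(Next, Rest);
        _ -> error
    end.

%% Try the next call of each unfinished thread in turn; backtrack on mismatch.
explain(State, Threads) ->
    case [T || T <- Threads, T =/= []] of
        [] -> true;
        Live ->
            lists:any(
              fun(Thread) ->
                  [{Call, Result} | Rest] = Thread,
                  case step(State, Call) of
                      {Result, Next} ->
                          explain(Next, [Rest | lists:delete(Thread, Live)]);
                      _ -> false
                  end
              end, Live)
    end.
```

With a prefix of reset and the threads [take, take] and [take], the result triples (1,2 | 3), (1,3 | 2) and (2,3 | 1) are all accepted, while two parallel takes that both return 1 are rejected. The search is exponential in the worst case, which is acceptable here only because the test cases are short.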

SLIDE 12

Prefix:

Parallel:

  • 1. dispenser:take_ticket() --> 1
  • 2. dispenser:take_ticket() --> 1

Result: no_possible_interleaving

take_ticket() ->
    N = read(),
    write(N+1),
    N+1.
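
The failure arises because read() and write() are separate steps, so two callers can both read the same N. One possible repair, shown purely as an illustration and not as the fix used in the talk, is to let a single process own the counter so that each take_ticket is handled as one message. The module name `atomic_dispenser` and its API are hypothetical.

```erlang
-module(atomic_dispenser).
-export([start/0, take_ticket/1, reset/1, stop/1]).

%% Start a process that owns the counter; all access goes through messages.
start() ->
    spawn(fun() -> loop(0) end).

take_ticket(Pid) -> call(Pid, take_ticket).
reset(Pid) -> call(Pid, reset).
stop(Pid) -> call(Pid, stop).

%% Synchronous request/reply, tagged with a unique reference.
%% Sketch only: there is no timeout, so this blocks if the process is gone.
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} -> Reply
    end.

%% The owner handles one request at a time, so read-increment-reply
%% cannot be interleaved with another caller's request.
loop(N) ->
    receive
        {From, Ref, take_ticket} ->
            From ! {Ref, N + 1},
            loop(N + 1);
        {From, Ref, reset} ->
            From ! {Ref, ok},
            loop(0);
        {From, Ref, stop} ->
            From ! {Ref, ok}
    end.
```

Because the owning process serialises requests, any set of parallel calls has results that match some sequential order, which is exactly the property the parallel test checks.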

SLIDE 13

dets

  • Tuple store:

{Key, Value1, Value2…}

  • Operations:

– insert(Table,ListOfTuples)
– delete(Table,Key)
– insert_new(Table,ListOfTuples)
– …

  • Model:

– List of tuples (almost)

SLIDE 14

QuickCheck Specification

Specification: <100 LOC

dets code: >6,000 LOC

SLIDE 15

Bug #1

Prefix:

open_file(dets_table,[{type,bag}]) --> dets_table

Parallel:

  • 1. insert(dets_table,[]) --> ok
  • 2. insert_new(dets_table,[]) --> ok

Result: no_possible_interleaving

insert_new(Name, Objects) -> Bool
Types:
    Name = name()
    Objects = object() | [object()]
    Bool = bool()

SLIDE 16

Bug #2

Prefix:

open_file(dets_table,[{type,set}]) --> dets_table

Parallel:

  • 1. insert(dets_table,{0,0}) --> ok
  • 2. insert_new(dets_table,{0,0}) --> …time out…

=ERROR REPORT==== 4-Oct-2010::17:08:21 ===
** dets: Bug was found when accessing table dets_table

SLIDE 17

Bug #3

Prefix:

open_file(dets_table,[{type,set}]) --> dets_table

Parallel:

  • 1. open_file(dets_table,[{type,set}]) --> dets_table
  • 2. insert(dets_table,{0,0}) --> ok

get_contents(dets_table) --> []

Result: no_possible_interleaving

!

SLIDE 18

Is the file corrupt?

SLIDE 19

Bug #4

Prefix:

open_file(dets_table,[{type,bag}]) --> dets_table
close(dets_table) --> ok
open_file(dets_table,[{type,bag}]) --> dets_table

Parallel:

  • 1. lookup(dets_table,0) --> []
  • 2. insert(dets_table,{0,0}) --> ok
  • 3. insert(dets_table,{0,0}) --> ok

Result: ok

premature eof

SLIDE 20

Bug #5

Prefix:

open_file(dets_table,[{type,set}]) --> dets_table
insert(dets_table,[{1,0}]) --> ok

Parallel:

  • 1. lookup(dets_table,0) --> []

delete(dets_table,1) --> ok

  • 2. open_file(dets_table,[{type,set}]) --> dets_table

Result: ok false

bad object

SLIDE 21

"We know there is a lurking bug somewhere in the dets code. We have got 'bad object' and 'premature eof' every other month the last year."
Tobbe Törnqvist, Klarna, 2007

Each bug was fixed the day after the failing case was reported.

SLIDE 22

Testing a Worker Pool

  • Check out a worker
  • Check in a worker
  • Handle workers crashing
  • Handle clients crashing while holding a worker

SLIDE 23
  • Loads and loads of bugs found
  • 80 unit tests passed throughout!
  • Parallel testing found no race conditions

Problem: checking out a worker blocks if there isn't one!

SLIDE 24

Blocking operations

Test deadlocks?

In practice, lock times out

SLIDE 25

Should this test pass?

But a blocked operation should not run before an unblocked one!

SLIDE 26

Serializability with Blocking

  • Specify when an atomic operation should block
  • When exploring interleavings, never choose a blocked operation when there is an unblocked alternative
  • We rule out some interleavings, potentially making tests fail that would otherwise have passed
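
The rule above can be sketched as a small change to an interleaving search: at each step, only unblocked operations are candidates unless every pending operation is blocked. Everything in this example is an assumption made for illustration: the module `blocking_check`, a pool model whose state is the number of free workers, results abstracted to `ok`, and a blocked checkout that is expected to return `timeout`.

```erlang
-module(blocking_check).
-export([serializable/2]).

%% Hypothetical pool model: the state is the number of free workers.
%% A checkout blocks when no worker is free; a checkin never blocks.
blocked(Free, checkout) -> Free =:= 0;
blocked(_Free, checkin) -> false.

%% step(State, Call) -> {ExpectedResult, NextState}.
%% Assumption: a checkout that runs while blocked times out.
step(0, checkout) -> {timeout, 0};
step(Free, checkout) -> {ok, Free - 1};
step(Free, checkin) -> {ok, Free + 1}.

%% Threads is a list of lists of {Call, ObservedResult}. True if some
%% interleaving that respects the blocking rule explains all results.
serializable(Free, Threads) ->
    case [T || T <- Threads, T =/= []] of
        [] -> true;
        Live ->
            Unblocked = [T || [{Call, _} | _] = T <- Live,
                              not blocked(Free, Call)],
            %% Never choose a blocked operation if an unblocked one exists.
            Candidates = case Unblocked of
                             [] -> Live;
                             _ -> Unblocked
                         end,
            lists:any(fun(T) -> try_thread(Free, T, Live) end, Candidates)
    end.

try_thread(Free, [{Call, Result} | Rest] = Thread, Live) ->
    case step(Free, Call) of
        {Result, Next} ->
            serializable(Next, [Rest | lists:delete(Thread, Live)]);
        _ -> false
    end.
```

With no free workers, a checkout returning `timeout` in parallel with a checkin is rejected: the checkin is unblocked and must be chosen first, after which the model expects the checkout to succeed. Without the rule, the same trace would pass by ordering the checkout first, which illustrates how the rule can make a test fail that would otherwise have passed.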

SLIDE 27

A race condition in Poolboy?

Start the worker pool (1 worker) checkout checkout checkout checkin checkin

SLIDE 28

Conclusion

  • Serializability is a

– simple condition
– that is surprisingly effective
– at revealing bugs in real industrial code

Not quite done…

SLIDE 29

Provoking races

  • We've used:

– Repeated execution on a multicore processor
– Random scheduling
– "Procrastination": repeating a test, but reordering message deliveries to the same process
– Model checking: all possible schedules