SLIDE 1

Unit Testing Tool Competition Round Four

Urko Rueda, René Just, Juan P. Galeotti, Tanja E. J. Vos

The 9th International Workshop on Search-Based Software Testing

SLIDE 2

Contents

1. About the Tool Competition
2. The Tools
3. The Methodology
4. The Results
5. Lessons learned


SLIDE 3

About the Tool Competition

Year | Edition | Coverage / Mutation metrics | CUTs / Projects / Tools | Tools (SBST & non-SBST)
2012 | Round One (ICST'13; FITTEST: crest.cs.ucl.ac.uk/fittest) | Cobertura / Javalanche | 77 / 5 / 2 | Manual & Randoop (baselines)
2013 | Round Two (FITTEST'13) | JaCoCo / PITest | 63 / 9 / 4 | 1st edition tools + T3 & Evosuite
2014 | Round Three (SBST'15) | JaCoCo / PITest | 63 / 9 / 8 | 2nd edition tools + Commercial & GRT & jTexPert & Mosa(Evosuite)
2015 | Round Four (SBST'16) | Defects4J (github.com/rjust/defects4j) + real fault finding metric | 68 / 5 / 4 | Randoop (baseline) & T3 & Evosuite & jTexPert

Benchmarked Java unit testing at the class level.


SLIDE 4

About the Tool Competition

§ Why?
  § Towards testing field maturity – this is just Java …
  § Insight into tool improvements and future developments
§ What is new in the 4th edition?
  § Benchmark infrastructure – split into:
    § Test generation
    § Test execution & test assessment (Defects4J)
  § Benchmark subjects (from the Defects4J dataset)
  § Time budgets (1, 2, 4 & 8 minutes)
  § Flaky tests (non-compilable, non-reliable passes)


SLIDE 5

The Tools

Tool | Technique | Static analysis | 2012 | 2013 | 2014 | 2015
Randoop (baseline) | Random | ✗ | ✓ | ✓ | ✓ | ✓
T3 | Random | ✗ | ✗ | ✓ | ✓ | ✓
jTexPert | Random (guided) | ✓ | ✗ | ✗ | ✓ | ✓
Evosuite | Evolutionary algorithm | ✓ | ✗ | ✓ | ✓ | ✓

§ SBST and non-SBST tools
§ Command-line tools
§ Fully automated – no human intervention


SLIDE 6

The Methodology

§ Tool deployment
  § Installation – Linux environment
  § Wrapper implementation – runtool script
    § Std. IN/OUT communication protocol
    § 4th edition has a time budget
  § Tune-up cycle – setup, run, resolve issues
§ Benchmark infrastructure
  § Defects4J integration
  § Decoupling test generation from test execution/assessment
§ Tool – run over non-contest benchmark samples


SLIDE 7

The Methodology

(Sequence diagram: runtool for tool T vs. the benchmark framework)

Preparation:
1. The framework sends "BENCHMARK", the Src Path / Bin Path / ClassPath, and the ClassPath for JUnit compilation.
2. The tool answers "READY".

Loop, per CUT, under the time budget:
3. The framework sends the name of the CUT.
4. The tool generates a test file in ./temp/testcases and answers "READY".
5. The framework compiles, executes, and measures the test case (see the wrapper sketch below).


SLIDE 8

The Methodology

§ Benchmark infrastructure
  § Two HP Z820 workstations – each:
    § 2 CPU sockets, for a total of 20 cores
    § 256 GB RAM
  § 32 virtual machines (16 per workstation)
§ Test generation
  § 1 core – controls tool multi-threading capability
  § 8 GB RAM
§ Test execution/assessment (tool independent)
  § 2 cores
  § 16 GB RAM – resolves out-of-memory issues


SLIDE 9

The Methodology

(Diagram: benchmark setup replicated across 32 VMs)

§ The benchmark tool drives T3, jTexpert, EvoSuite and Randoop through runtool, replicated over the 32 VMs
§ 80 CUTs; RUNs 1, 2, 3 on one HP Z820 (16 VMs), RUNs 4, 5, 6 on the other
§ Per workstation: 20-core CPU, 256 GB RAM
§ Per VM: 1-core CPU / 8 GB RAM for test generation; 2-core CPU / 16 GB RAM for metrics collection
§ Time budgets: 1, 2, 4 & 8 minutes
§ Generated test cases and collected metrics feed an aggregator that calculates the score

SLIDE 10

The Methodology

(Diagram: test generation and assessment flow)

1. The benchmark tool drives each generator (Randoop, T3, EvoSuite, jTexpert) through runtool with a time budget (1, 2, 4, 8 min) against the fixed CUT version.
2. Generated test classes that do not compile are discarded.
3. Compilable test classes are run to detect and remove flaky tests.
4. The remaining, flaky-free test classes are run against the CUT with 1 real fault and the mutated CUT to collect metrics.
5. The score is calculated.


SLIDE 11

The Methodology

§ Flaky tests
  § Pass during generation
  § But might fail during execution/assessment
  § False-positive warnings:
    § Non-reliable fault detection
    § Non-reliable mutation analysis
§ Defects4J flaky-test sanity checks remove:
  § Non-compiling test classes
  § Tests failing over 5 executions on the fixed CUT versions (see the sketch below)


SLIDE 12

The Methodology

§ The Metrics – test effectiveness
  § Code coverage (fixed benchmark versions)
    § Defects4J <- Cobertura
    § Statement coverage
    § Condition coverage
  § Mutation score
    § Defects4J <- Major framework (all mutation operators)
  § Real fault detection (buggy benchmark versions)
    § 1 real fault per benchmark
    § Score of 0 or 1, independent of how many tests reveal it (formalized below)
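One way to write the 0-or-1 real-fault metric, assuming the standard Defects4J criterion that a detecting test passes on the fixed version and fails on the faulty one:

\[
\mathit{faultFound}(T,L,C,r) =
\begin{cases}
1 & \text{if some generated test passes on the fixed } C \text{ and fails on the faulty } C \\
0 & \text{otherwise}
\end{cases}
\]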


SLIDE 13

The Methodology

§ The Scoring formula

covScore(T,L,C,r) := wi · covi + wb · covb + wm · covm + (real fault found ? wf : 0)

T = tool; L = time budget; C = CUT; r = run (1..6)
Coverages: covi = statement coverage; covb = condition coverage; covm = mutant kill ratio
Weights: wi = 1; wb = 2; wm = 4; wf = 4
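The same formula as a minimal Java sketch; the weights and terms are from the slide, the method and parameter names are assumptions:

```java
// Per-run coverage score: weighted coverage plus a bonus for finding the real fault.
static double covScore(double covI, double covB, double covM, boolean faultFound) {
    final double wI = 1, wB = 2, wM = 4, wF = 4;  // contest weights from the slide
    return wI * covI                              // statement coverage
         + wB * covB                              // condition coverage
         + wM * covM                              // mutant kill ratio
         + (faultFound ? wF : 0);                 // real fault bonus
}
```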


SLIDE 14

The Methodology

§ The Scoring formula – time penalty
  § Test generation slot: L .. 2 · L
  § No penalty if genTime <= L
  § Penalty for the extra time taken (genTime – L)
  § Half covScore if the tool must be killed (> 2 · L) – see the sketch below
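Only the endpoints are fixed by the slide (no penalty up to L, half score past 2 · L); a sketch that, purely as an assumption, interpolates linearly in between:

```java
// Time-penalized score; the linear decay over (L, 2L] is an assumed shape,
// only the two endpoints come from the slide.
static double timePenalized(double covScore, double genTime, double L) {
    if (genTime <= L) {
        return covScore;                          // within budget: no penalty
    }
    if (genTime > 2 * L) {
        return covScore / 2.0;                    // tool had to be killed: half score
    }
    double extra = genTime - L;                   // extra time taken, in (0, L]
    return covScore * (1.0 - 0.5 * extra / L);    // assumed linear decay towards half score
}
```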


SLIDE 15

The Methodology

§ The Scoring formula – tests penalty (illustration below)

#Classes = generated test classes; #uClasses = uncompilable test classes
#Tests = generated test cases; #fTests = flaky tests
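The penalty formula itself did not survive extraction; purely as an illustrative assumption (not the contest's actual definition), a ratio-based penalty over these four quantities could look like:

\[
\mathrm{penalty}(T,L,C,r) = \frac{\#\mathrm{uClasses}}{\#\mathrm{Classes}} + \frac{\#\mathrm{fTests}}{\#\mathrm{Tests}}
\]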


SLIDE 16

The Methodology

§ The Scoring formula – tool score

Score(T,L,C,r) := tScore(T,L,C,r) – penalty(T,L,C,r)
Score(T,L,C) := avg(Score(T,L,C,r)) over all r executions
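Written out in LaTeX, with the average expanded over the six runs of slide 13:

\[
\mathrm{Score}(T,L,C,r) = \mathrm{tScore}(T,L,C,r) - \mathrm{penalty}(T,L,C,r),
\qquad
\mathrm{Score}(T,L,C) = \frac{1}{6}\sum_{r=1}^{6}\mathrm{Score}(T,L,C,r)
\]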


SLIDE 17

The Methodology

§ Conclusion validity
  § Reliability of treatment implementation
    § Tool deployment instructions EQUAL for all participants
  § Reliability of measures
    § Efficiency: wall-clock time via Java System.currentTimeMillis()
    § Effectiveness: Defects4J
    § Tools' non-deterministic nature: 6 runs (limited by HW capacity)


SLIDE 18

The Methodology

§ Internal validity
  § CUTs from Defects4J (uniform and arbitrary selection from 5 open-source projects)
  § Tools and benchmark infrastructure
    § Tune-up samples vs. contest benchmarks
  § Wrappers (runtool): implemented by the tools' side
§ Construct validity
  § Scoring formula weights – value of the quality indicators
  § Empirical studies – correlation of proxy metrics for test effectiveness and fault-finding capability


SLIDE 19

The Results

The contest ran for ~1 week of test generation, execution and assessment across the 32 VMs. A single virtual machine would have needed 8 CPU months!
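A rough sanity check on the generation part alone, using the numbers from slide 9 (4 tools, 80 CUTs, 6 runs, budgets of 1, 2, 4 and 8 minutes):

\[
4 \times 80 \times 6 \times (1+2+4+8)\ \text{min} = 28\,800\ \text{min} = 480\ \text{h} = 20\ \text{CPU days}
\]

The remaining months are presumably spent on compilation, the 5-fold flaky re-runs, and the coverage and mutation analyses.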


SLIDE 20

Lessons learned

§ Testing tool improvements
  § Automation, test effectiveness, comparability
§ Benchmarking infrastructure improvements
  § Decoupling test generation from execution/assessment
  § Flaky test identification and sanity checks
  § Fault-finding capability measurement
  § Test effectiveness as a function of test generation time
§ What next?
  § Automated parallelization of the benchmark contest
  § More tools, new languages? (e.g. C#?)


SLIDE 21

Contact us

Universidad Politécnica de Valencia, ES: urueda@pros.upv.es, tvos@dsic.upv.es
Open Universiteit Heerlen, NL: tanja.vos@ou.nl
University of Massachusetts Amherst, MA, USA: rjust@cs.umass.edu
University of Buenos Aires, Argentina: jgaleotti@dc.uba.ar
web: http://sbstcontest.dsic.upv.es/
