Identifying Patch Correctness in Test-based Program Repair Yingfei - - PowerPoint PPT Presentation



SLIDE 1

Identifying Patch Correctness in Test-based Program Repair

Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, Gang Huang Peking University

SLIDE 2

Test-based Program Repair

[Figure: a Patch turns Program (Buggy) into Program’ (Fixed). Buggy program: passing test, passing test, failing test. Fixed program: passing test, passing test, passing test.]

SLIDE 3

Program repair: The cure

  • Bug = Disease
  • Test = Symptom
  • Patch = Therapy

SLIDE 4

Workflow: Program repair & hospital

  • Feel bad = Bug discovered
  • Go to hospital = Program repair
  • Feel better = Test passed
  • Cured? = Correct?

SLIDE 5

Symptoms are gone == cured?

Plausible patches

  • Pass all the tests
  • Can still be incorrect (overfit)

Therapy

  • Makes you free of pain
  • Disease may still be there

SLIDE 6

Tools: Hospitals

  • Precision: Correct / (Correct + Incorrect)

[Bar chart: precision of Prophet, Angelix, Nopol, Kali, and GenProg; axis from 0% to 45%]
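
The precision metric defined on this slide can be computed directly from the number of correct and incorrect patches a tool produces. The sketch below is only an illustration: the function name and the counts in the usage line are hypothetical and are not taken from the talk.

```python
def precision(correct: int, incorrect: int) -> float:
    """Precision = correct / (correct + incorrect), as defined on the slide."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("the tool generated no patches")
    return correct / total


# Hypothetical counts, for illustration only.
print(f"{precision(3, 7):.2%}")
```

A tool that generates many plausible but incorrect patches drives the denominator up, which is why low precision is the motivating problem here.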

SLIDE 7

Approach overview

Test suite + Buggy program → Test-based program repair → Patch (low precision) → Identifying patch correctness → High-quality patch (high precision)

SLIDE 8

Plausible patches: Wrong cure

An incorrect patch produced by jKali [1].

A test checking for a null dataset. Test oracle: function draw returns normally (without exception).

[1] Martinez M, Durieux T, Sommerard R, et al. Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset. Empirical Software Engineering, 2017, 22(4): 1936-1964.

SLIDE 9

Bad therapy: What’s wrong here?

The patched draw

  • Passing test: nothing is done
  • Failing test (null dataset): exception not thrown

The original draw

  • Passing test: something is drawn
  • Failing test (null dataset): exception thrown

On the patched draw, the passing test should fail!

SLIDE 10

Wrong cure

  • All symptoms are cured but in a bad way
  • Problems are solved but not in a satisfying way
  • “My leg is wounded”
  • “Cut it off so you no longer have a hurt leg”
  • Weak test oracle

[Figure labels: No exception; Directly return]

SLIDE 11

Weak test oracle

  • No exception ≠ correct patch

Weak test oracle
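
To make the weak-oracle problem concrete, here is a minimal Python sketch. The bug discussed in the talk is in a Java program; `draw_original`, `draw_patched`, and the two check functions below are hypothetical stand-ins, not the actual code. The checks only require that `draw` returns without raising, so a patch that returns directly satisfies both of them.

```python
def draw_original(dataset, canvas):
    """Buggy stand-in: raises TypeError when dataset is None."""
    for item in dataset:
        canvas.append(item)


def draw_patched(dataset, canvas):
    """Overfitting stand-in patch: return directly, so nothing is ever drawn."""
    return


def check_null_dataset(draw):
    # Weak oracle: passes as long as draw returns without raising.
    draw(None, [])


def check_regular_dataset(draw):
    # Equally weak: the canvas is never inspected.
    draw([1, 2, 3], [])


def passes(check, draw):
    """Run one check against one implementation and report pass/fail."""
    try:
        check(draw)
        return True
    except Exception:
        return False


for draw in (draw_original, draw_patched):
    results = [passes(c, draw) for c in (check_null_dataset, check_regular_dataset)]
    print(draw.__name__, results)
```

The patched version passes every check even though it no longer draws anything, which is exactly the situation the slide summarizes as "no exception ≠ correct patch".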

SLIDE 12

Plausible patches: Incomplete cure

An incorrect patch with a wrong condition, generated by Nopol [1].

The correct developer patch, with the correct null guard.

[1] Xuan J, Martinez M, Demarco F, et al. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Transactions on Software Engineering, 2017, 43(1): 34-55.

SLIDE 13

Bad therapy: What’s wrong here?

[Figure: behavior of the patched program and of the original program on three tests: a passing test (repeat=true), a failing test (repeat=false), and a passing test (repeat=false) that is not in the test suite. Labels on the patched program: "Same as original program", "The whole loop is skipped, increase = 0", "The whole loop skipped". Labels on the original program: "increase calculated", "increase should be 0", "Expecting: increase=0, Get: Exception thrown".]

SLIDE 14

Incomplete cure

  • Incomplete cure: the symptoms under examination are cured, but some other symptoms are not.
  • Bugs that are covered by tests are fixed, while others are not
  • “We cured your left leg and cut off your right leg”
  • “So what about my right leg?”
  • “Well, we only care about your left leg”
  • Weak test input

[Figure labels: Wrong condition; Missing test inputs; Existing test inputs]
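
The weak-test-input problem can be illustrated with a small Python sketch. Everything in it is hypothetical (the function names, the `repeat` flag semantics, and the test values are made up for illustration and only loosely follow the slide's example): a patch with a wrong condition passes every existing test, yet fails on an input the test suite never exercises.

```python
def total_original(values, repeat):
    """Buggy stand-in: raises TypeError when values is None."""
    return sum(values) * (2 if repeat else 1)


def total_wrong_condition(values, repeat):
    """Overfitting stand-in patch: guards on the wrong condition."""
    if not repeat:
        return 0  # skips the whole computation whenever repeat is False
    return sum(values) * 2


def total_null_guard(values, repeat):
    """Stand-in for a correct patch: guards on the actual cause."""
    if values is None:
        return 0
    return sum(values) * (2 if repeat else 1)


# Each test is (values, repeat, expected result).
existing_tests = [
    ([1, 2], True, 6),   # passes on the original program
    (None, False, 0),    # fails on the original program
]
missing_test = ([1, 2], False, 3)  # an input the test suite does not contain


def passes(fn, test):
    values, repeat, expected = test
    try:
        return fn(values, repeat) == expected
    except TypeError:
        return False


for fn in (total_original, total_wrong_condition, total_null_guard):
    on_existing = [passes(fn, t) for t in existing_tests]
    print(fn.__name__, on_existing, passes(fn, missing_test))
```

Only the missing input separates the wrong-condition patch from the null-guard patch, which is why generating additional test inputs matters.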

SLIDE 15

Test suites and heuristics

  • Test suites are weak on both input and oracle.
  • Two heuristics to save weak test suites:
  • PATCH-SIM: compensate for weak test oracle
  • TEST-SIM: compensate for weak test input

Test = Test Input + Test Oracle

SLIDE 16

PATCH-SIM: heuristic for test oracle

  • Passing tests: behavior on original program and behavior on patched program should be similar. “Well, you should keep my legs (which were good) as good as before”
  • Failing tests: behavior on original program and behavior on patched program should be different. “What’s more, the wound (which was bad) should be cured”

SLIDE 17

Bad cure identified!

  • Passing test on the patched draw: nothing happens
  • Passing test on the original draw: something is drawn

Different!

“Well, you should keep my legs (which were good) as good as before”

SLIDE 18

TEST-SIM: heuristic for test input

  • PATCH-SIM on newly generated tests: should each new test pass or fail?
  • TEST-SIM classifies the new tests; the classification result is then used by PATCH-SIM

  • Behavior of the new test is similar to the behavior of a passing test → the new test probably passes
  • Behavior of the new test is similar to the behavior of a failing test → the new test probably fails

“My left leg is just like my right leg. My right leg is good, so my left leg is also good”
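
The classification idea above can be sketched as a nearest-neighbour rule over execution traces. This is a simplified illustration under stated assumptions, not the exact procedure from the talk: traces are lists of executed statement ids, `jaccard_distance` is a stand-in distance chosen for brevity, ties are resolved toward "passing", and all names and traces are hypothetical.

```python
from typing import Callable, Sequence

Trace = Sequence[int]


def jaccard_distance(a: Trace, b: Trace) -> float:
    """Stand-in distance over the sets of executed statements (an assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def classify_new_test(
    new_trace: Trace,
    passing_traces: Sequence[Trace],
    failing_traces: Sequence[Trace],
    distance: Callable[[Trace, Trace], float] = jaccard_distance,
) -> str:
    """Label a generated test by whichever group contains its closest trace."""
    to_passing = min(distance(new_trace, t) for t in passing_traces)
    to_failing = min(distance(new_trace, t) for t in failing_traces)
    return "failing" if to_failing < to_passing else "passing"


# Hypothetical traces, for illustration only.
passing = [[1, 2, 3, 4]]
failing = [[1, 2, 5]]
print(classify_new_test([1, 2, 3], passing, failing))
print(classify_new_test([1, 2, 5, 6], passing, failing))
```

The resulting labels play the role of a test oracle for the generated inputs, so PATCH-SIM can then decide whether behavior on each new test should stay similar or become different.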

SLIDE 19

Bad cure identified!

[Figure labels: Classified as passing test; The whole loop skipped; Passing test (repeat=false); Passing test (repeat=true)]

“Check my left leg, it’s good and I want it as good as before”

Different from original behavior

SLIDE 20

Workflow

  • “Check my left leg, it’s good and I want it as good as before”

Steps: Test generation; Classification by TEST-SIM; Oracle of PATCH-SIM

Test generation → New test inputs → TEST-SIM → Classification → PATCH-SIM → Correctness

SLIDE 21

Similar? Different?

  • Test oracle: output (not so reliable)
  • The result is not everything: the process is also important
  • Runtime information: behavior similarity

SLIDE 22

Details for ‘Behavior similarity’

  • Complete-path spectrum[1]: the sequence of executed statements
  • Distance and similarity:

{1,2,3,2,3,2,3,2,4}
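
The distance formula shown on the slide is not preserved in this transcript. One common way to compare two statement sequences, assumed here purely for illustration, is a distance based on the longest common subsequence (LCS), normalized by the length of the longer sequence. The function names and the patched trace below are hypothetical; only the first sequence comes from the slide.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def distance(a, b):
    """Assumed distance: 1 - |LCS(a, b)| / max(|a|, |b|), in the range [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def similarity(a, b):
    """Similarity is taken as the complement of the distance."""
    return 1.0 - distance(a, b)


original_trace = [1, 2, 3, 2, 3, 2, 3, 2, 4]  # sequence from the slide
patched_trace = [1, 2, 4]                     # hypothetical: the loop body is skipped
print(round(distance(original_trace, patched_trace), 3))  # prints 0.667
```

Because the spectrum is a sequence rather than a set, repeated loop iterations count toward the distance, so skipping a loop is visible even when the same statements are covered.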

[1] Harrold M J, Rothermel G, Wu R, et al. An empirical investigation of program spectra. ACM SIGPLAN Notices. ACM, 1998, 33(7): 83-90.

SLIDE 23

‘Similar’ is relative, not absolute

  • Behaviors on passing tests should be more similar than behaviors on failing tests

Common cold

  • Easy cure
  • Slightly affects your body

Cancer

  • Big surgery
  • Greatly affects your body

Simple bug

  • Small patch
  • Slightly affects original program behavior

Complex bug

  • Big patch
  • Greatly affects original program behavior
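
One way to turn "similar is relative" into a decision rule is to compare the behavior change on passing tests against the behavior change on failing tests of the same patch, instead of using a fixed cutoff. The sketch below is an assumption-laden illustration, not the rule from the talk: the choice of the maximum over passing tests and the mean over failing tests, the function name, and the distances are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def looks_incorrect(
    passing_distances: Sequence[float],
    failing_distances: Sequence[float],
) -> bool:
    """Flag a patch whose passing-test behavior changes at least as much
    as its failing-test behavior (assumed rule, for illustration only)."""
    if not passing_distances or not failing_distances:
        return False  # nothing to compare, so keep the patch
    return max(passing_distances) >= mean(failing_distances)


# Hypothetical original-vs-patched distances per test.
print(looks_incorrect([0.0, 0.6], [0.5]))    # large change on a passing test
print(looks_incorrect([0.0, 0.05], [0.4]))   # passing tests barely change
```

Under this rule a big patch for a complex bug is not penalized for changing behavior a lot, as long as the passing tests change less than the failing ones.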

SLIDE 24

Effectiveness

  • Dataset: 139 patches from jGenProg, Nopol, jKali, ACS and HDRepair
  • Defects4J benchmark
  • 56.3% of incorrect patches filtered out without losing any of the correct patches.

Anti-pattern: pre-defined patterns

Opad: patches shouldn’t introduce crashes or memory safety problems (designed for C)

[Bar chart: percentage of incorrect patches filtered and correct patches filtered by Ours, Anti-pattern, and Opad; axis from 0% to 60%]

SLIDE 25

Summary

  • Many program repair tools have low precision
  • Patch correctness can be identified based on behavior similarity
  • 2 heuristics: PATCH-SIM and TEST-SIM
  • 56.3% of incorrect patches filtered, with no loss of correct patches

SLIDE 26

Discussion: complicated patches

  • Patches from APR are simple (for now).
  • Will our approach still be effective in the future?
  • E.g., on more complicated patches

SLIDE 27

Developer patches

  • 194 correct patches from Defects4J benchmark
  • 178 (91.75%) still classified as correct
  • Reasons for misclassification:
  • Significant behavior change
  • Calling a different method with the same functionality