
Identifying Patch Correctness in Test-based Program Repair


  1. Identifying Patch Correctness in Test-based Program Repair
     Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, Gang Huang
     Peking University

  2. Test-based Program Repair
     • Program (buggy): passing tests and a failing test
     • Patch: turns Program into Program’
     • Program’ (fixed): all tests passing

  3. Program repair: The cure
     • Bug ↔ Disease
     • Test ↔ Symptom
     • Patch ↔ Therapy

  4. Workflow: Program repair & hospital
     • Feel bad ↔ Bug discovered
     • Go to hospital ↔ Program repair
     • Feel better ↔ Test passed
     • Cured? ↔ Correct?

  5. Symptoms are gone == cured?
     Therapy
     • Makes you free of pain
     • Disease may still be there
     Plausible patches
     • Pass all the tests
     • Can still be incorrect (overfit)

  6. Tools: Hospitals
     • Precision: Correct / (Correct + Incorrect)
     [Bar chart: precision of Prophet, Angelix, Nopol, Kali and GenProg; vertical axis from 0% to 45%]
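
The precision formula on this slide can be written out as a short function. This is only an illustrative sketch: the function name and the counts in the usage line are made up and are not figures from the chart.

```python
def precision(correct: int, incorrect: int) -> float:
    """Fraction of generated patches that are correct."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no patches were generated")
    return correct / total


# Hypothetical counts, not taken from the slide: 4 correct, 6 incorrect patches.
print(f"{precision(4, 6):.2%}")  # prints 40.00%
```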

  7. Approach overview
     • Test suite + Buggy program → Test-based program repair → Patch (low precision)
     • Patch → Identifying patch correctness → High-quality patch (high precision)

  8. Plausible patches: Wrong cure
     • An incorrect patch produced by jKali [1]
     • A test checking for a null dataset. Test oracle: function draw returns normally (without exception)
     [1] Martinez M, Durieux T, Sommerard R, et al. Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset. Empirical Software Engineering, 2017, 22(4): 1936-1964.

  9. Bad therapy: What’s wrong here?
     The original draw
     • Passing test: something is drawn
     • Failing test (null dataset): exception thrown
     With the patch
     • Passing test: nothing is done (should fail!)
     • Failing test (null dataset): exception not thrown

  10. Wrong cure
      • All symptoms are cured but in a bad way
      • Problems are solved but not in a satisfying way
      • “My leg is wounded”
      • “Cut it off so you no longer have a hurt leg” (directly return: no exception)
      • Weak test oracle

  11. Weak test oracle
      • No exception ≠ correct patch

  12. Plausible patches: Incomplete cure
      • An incorrect patch with a wrong condition generated by Nopol [1]
      • Correct developer patch with correct null guard
      [1] Xuan J, Martinez M, Demarco F, et al. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Transactions on Software Engineering, 2017, 43(1): 34-55.

  13. Bad therapy: What’s wrong here?
      The original program
      • Passing test (repeat=true): increase calculated
      • Failing test (repeat=false, increase should be 0): expecting increase=0, got an exception
      With the patch
      • Passing test (repeat=true): same as original program
      • Failing test (repeat=false, increase should be 0): the whole loop is skipped, increase = 0
      • Passing test (repeat=false): the whole loop is skipped. This test is not in the test suite!

  14. Incomplete cure
      • Incomplete cure: the symptoms of concern are cured, but some other symptoms are not.
      • Bugs that are covered by tests are fixed while others are not
      • “We cured your left leg and cut off your right leg” (wrong condition)
      • “So what about my right leg?” (missing test inputs)
      • “Well, we only care about your left leg” (existing test inputs)
      • Weak test input

  15. Test suites and heuristics
      • Test = Test Input + Test Oracle
      • Test suites are weak on both input and oracle.
      • Two heuristics to save weak test suites:
        • PATCH-SIM: compensate for weak test oracle
        • TEST-SIM: compensate for weak test input

  16. PATCH-SIM: heuristic for test oracle
      • Passing tests: behavior on the patched program should be similar to behavior on the original program
        “Well, you should keep my legs (which were good) as good as before”
      • Failing tests: behavior on the patched program should be different from behavior on the original program
        “What’s more, the wound (which was bad) should be cured”
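
The PATCH-SIM idea can be sketched as a small decision function over per-test behavior distances. This is a rough illustration, not the authors' implementation: the function name, the choice of "worst passing test versus average failing test", and the `max_passing_change` threshold are all assumptions made for the example. Distances are assumed to lie in [0, 1], where 0 means identical behavior.

```python
from statistics import mean


def patch_looks_correct(passing_dists, failing_dists, max_passing_change=0.25):
    """Apply the PATCH-SIM intuition to precomputed behavior distances.

    passing_dists: distance between original and patched behavior, one per passing test
    failing_dists: the same, one per failing test
    max_passing_change: hypothetical cap on how much a passing test may change
    """
    if not failing_dists:
        raise ValueError("a repair scenario needs at least one failing test")
    worst_passing = max(passing_dists, default=0.0)
    avg_failing = mean(failing_dists)
    # Passing tests should behave almost as before.
    if worst_passing >= max_passing_change:
        return False
    # Failing tests should change more than passing tests do.
    if worst_passing >= avg_failing:
        return False
    return True


# A patch that makes draw return immediately changes passing tests a lot.
print(patch_looks_correct([0.9, 0.8], [1.0]))    # False
# A patch that barely touches passing tests but changes the failing one.
print(patch_looks_correct([0.0, 0.05], [0.6]))   # True
```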

  17. Bad cure identified!
      • Passing test, the original draw: something is drawn
      • Passing test, with the patch: nothing happens
      • Different!
      “Well, you should keep my legs (which were good) as good as before”

  18. TEST-SIM: heuristic for test input
      • PATCH-SIM on newly generated tests: pass or fail?
      • Behavior of the new test is similar to behavior of a passing test: the new test probably passes
        “My left leg is just like my right leg. My right leg is good, so my left leg is also good”
      • Behavior of the new test is similar to behavior of a failing test: the new test probably fails
      • Classification result can be used by PATCH-SIM
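
One way to picture TEST-SIM is as a nearest-neighbor vote over execution traces. The sketch below is an assumption-laden illustration: `classify_new_test` is a hypothetical name, traces are plain lists of statement ids, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure the authors actually use.

```python
from difflib import SequenceMatcher


def similarity(trace_a, trace_b):
    """Stand-in similarity in [0, 1] between two statement traces."""
    return SequenceMatcher(None, trace_a, trace_b, autojunk=False).ratio()


def classify_new_test(new_trace, passing_traces, failing_traces):
    """Label a generated test by the most similar existing test."""
    best_pass = max((similarity(new_trace, t) for t in passing_traces), default=0.0)
    best_fail = max((similarity(new_trace, t) for t in failing_traces), default=0.0)
    # Ties go to "passing"; this tie-break is an arbitrary choice for the sketch.
    return "passing" if best_pass >= best_fail else "failing"


passing = [[1, 2, 3, 2, 3, 2, 4]]   # hypothetical trace of a passing test
failing = [[1, 5]]                  # hypothetical trace of a failing test

print(classify_new_test([1, 2, 3, 2, 4], passing, failing))  # passing
print(classify_new_test([1, 5], passing, failing))           # failing
```

The resulting label tells PATCH-SIM which expectation to apply to the new test: behavior should stay similar for a test labeled passing, and should change for a test labeled failing.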

  19. Bad cure identified!
      • The test with repeat=false is classified as a passing test, like the passing test with repeat=true
      • With the patch the whole loop is skipped: different from the original behavior
      “Check my left leg, it’s good and I want it as good as before”

  20. Workflow
      • “Check my left leg, it’s good and I want it as good as before”
      • Test generation → New test inputs → TEST-SIM → Classification → PATCH-SIM → Correctness
      • Stages: test generation; classification by TEST-SIM; oracle of PATCH-SIM

  21. Similar? Different?
      • Test oracle: output (not so reliable)
      • Result is not all: the process is also important
      • Runtime information: behavior similarity

  22. Details for ‘Behavior similarity’
      • Complete-path spectrum [1]: the sequence of executed statements, e.g. {1,2,3,2,3,2,3,2,4}
      • Distance and similarity:
      [1] Harrold M J, Rothermel G, Wu R, et al. An empirical investigation of program spectra. ACM SIGPLAN Notices, 1998, 33(7): 83-90.
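
The distance formula itself did not survive in this transcript. As an illustration only, one plausible way to compare two complete-path spectra is a distance based on their longest common subsequence (LCS), normalized by the longer trace. The choice of LCS and the function names below are assumptions, not something this slide text confirms.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def distance(a, b):
    """Assumed distance in [0, 1]: 0 for identical traces, 1 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def similarity(a, b):
    return 1.0 - distance(a, b)


original = [1, 2, 3, 2, 3, 2, 3, 2, 4]   # the spectrum shown on the slide
patched = [1, 2, 3, 2, 4]                # hypothetical trace with fewer loop iterations

print(f"{distance(original, patched):.3f}")  # prints 0.444
```

Here the patched trace is a subsequence of the original, so the LCS has length 5 and the distance is 1 - 5/9.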

  23. ‘Similar’ is relative, not absolute
      Common cold ↔ Simple bug
      • Easy cure ↔ Small patch
      • Slightly affect your body ↔ Slightly affect original program behavior
      Cancer ↔ Complex bug
      • Big surgery ↔ Big patch
      • Greatly affect your body ↔ Greatly affect original program behavior
      • Behaviors on passing tests should be more similar

  24. Effectiveness
      • Dataset: 139 patches from jGenProg, Nopol, jKali, ACS and HDRepair
      • Defects4J benchmark
      • 56.3% of incorrect patches filtered out without losing any of the correct patches.
      • Anti-pattern: pre-defined patterns
      • Opad: patches shouldn’t introduce crash or memory safety problem (designed for C)
      [Bar chart: incorrect patches filtered and correct patches filtered for Ours, Anti-pattern and Opad; vertical axis from 0% to 60%]

  25. Summary
      • Many program repair tools have low precision
      • Patch correctness can be identified based on behavior similarity
      • 2 heuristics: PATCH-SIM and TEST-SIM
      • 56.3% of incorrect patches filtered, 0 loss on correct patches

  26. Discussion: complicated patches
      • Patches from APR are simple (for now).
      • Will our approach still be effective in the future?
      • E.g. on more complicated patches

  27. Developer patches
      • 194 correct patches from Defects4J benchmark
      • 178 (91.75%) still classified as correct
      • Reasons for misclassification:
        • Significant behavior change
        • Calling a different method with the same functionality
