Using ML to Design a Flexible LOC Counter

MaLTeSQuE2017, Feb 21st, 2017, Klagenfurt
Using ML to Design a Flexible LOC Counter
Mirosław Ochodek, Miroslaw Staron, Dominik Bargowski, Wilhelm Meding, Regina Hebig
Workshop on Machine Learning Techniques for Software Quality Evaluation

Software size
• Cost prediction
• Productivity
• Metrics normalization, e.g., Defects density = #Defects / Size

The Problem
[Slide background: first page of the paper "Improving Measurement Certainty by Using Calibration to Find Systematic Measurement Error: A Case of Lines-of-Code Measure" by Miroslaw Staron, Darko Durisic, and Rakesh Rana.]
Excerpt from the paper's introduction: With the introduction of the measurement information model in the international ISO/IEC 15939 standard for measurement processes, the discipline of software engineering evolved from discussing metrics in general to categorizing them into three categories: base measures, derived measures and indicators. The use of base measures is fundamental for the construction of derived measures and indicators. The base measures are also the types of measures which are collected directly and are a result of a measurement method. In many cases this measurement method is an automated algorithm (e.g. a script), which we can refer to as the measurement instrument, which quantifies an attribute of interest into a number. Since in software engineering we do not have reference measurement etalons as we do in other disciplines (e.g. kilogram or meter for physics), we often rely on arbitrary definitions of the base quantities. One of such quantities is the size of programs measured as the number of lines of code. Even though the number of lines of code of a given program is a deterministic and fully quantifiable …
Slide annotations:
• Four tools
• Output: 2512 LOC
• Error (vs. median): up to ~20%
• Introduces (unknown) measurement error, problems with reliability of the measurement, difficulties in measuring multi-language code base…

Potential solutions
A tool based on Programming Language (PL) parsers
• Explicitly known rules for counting that can be somehow formulated
• 100% accurate according to the rules
• Requires implementation for each PL
• Can be also implemented to allow for some configuration of rules (however, probably somehow limited)
A machine learning (ML) approach
• It is difficult to explicitly define the rules (either not known or too complex)
• Learns from examples (requires a training set)
• Classification error depending on the quality of the training set
• Doesn't require a new implementation for a new language (however, may require a new training set)

Idea of the solution
• Flexible lines of code counter (CCFlex)
– A user teaches the tool which lines should be counted based on a sample (a training set)
[Figure labels: 10 LOC; Justification]
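The "teach by example" idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three toy features, the function names (`features`, `train`, `classify`, `count_loc`), and the 1-nearest-neighbour classifier, which is a stand-in chosen for brevity and not the algorithm used by CCFlex.

```python
# Toy illustration: a user labels a few sample lines, and the counter
# generalizes those decisions to unseen lines.
# The features and the 1-nearest-neighbour rule are hypothetical stand-ins.

def features(line):
    """Map a line to a small numeric feature vector (illustrative only)."""
    trimmed = line.strip()
    return (
        1.0 if trimmed else 0.0,                                # non-empty line
        1.0 if trimmed.startswith(("//", "/*", "*")) else 0.0,  # looks like a comment
        1.0 if trimmed in ("{", "}") else 0.0,                  # lone curly bracket
    )

def train(examples):
    """'Training' for 1-NN just stores the labelled feature vectors."""
    return [(features(line), label) for line, label in examples]

def classify(model, line):
    """Return the label of the closest training example (squared Euclidean distance)."""
    target = features(line)
    nearest = min(model, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], target)))
    return nearest[1]

def count_loc(model, lines):
    """Count the lines that the model classifies as 'count'."""
    return sum(1 for line in lines if classify(model, line) == "count")

# The user's training sample: this user chooses to ignore lone closing brackets.
training_set = [
    ("int x = 1;", "count"),
    ("// a comment", "ignore"),
    ("", "ignore"),
    ("}", "ignore"),
]
model = train(training_set)

source = ["/* header */", "", "class Foo {", "    return 1;", "}", "  // done"]
loc = count_loc(model, source)
```

A different user could label the lone `}` as "count" instead, and the same code would then produce a different total, which is the flexibility the slide describes.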

Idea of the solution [diagram only; not preserved in this transcript]

Feature acquisition
Each line is characterized by a set of features and its decision class (count or ignore). We parse the text to extract those features.

File type   #Characters   If     …   Decision class
java        25            TRUE   …   Count
…           …             …      …   …

Feature acquisition
• Plain text (F01-F04):
– File extension
– Full and trimmed length (characters)
– Tokens
• Programming language (F05-F19):
– Assignment,
– Brackets,
– Class,
– Comment,
– Semicolons,
– …

ID   Name             Type     Description
F01  File extension   Nominal  The extension of the file (e.g., java, cpp, etc.).
F02  Full length      Numeric  The number of characters in the line.
F03  Length           Numeric  The number of characters in the line after removing all leading and trailing white characters.
F04  Tokens           Numeric  The number of tokens in the line (the line is split based on white characters).
F05  Semicolons       Numeric  The number of semicolons in the line.
F06  Comments         Boolean  The line includes any of //, /*, */ or after trimming starts with *.
F07  Assignments      Numeric  The number of single assignment signs in the line (=).
F08  Brackets         Numeric  The number of brackets: ( , ) in the line.
F09  Square brackets  Numeric  The number of square brackets: [ , ] in the line.
F10  Curly brackets   Numeric  The number of curly brackets: { , } in the line.
F11  Class            Boolean  The word "class" appears in the line.
F12  For              Boolean  The word "for" appears in the line.
F13  If               Boolean  The word "if" appears in the line.
F14  While            Boolean  The word "while" appears in the line.
F15  Case             Boolean  The word "case" appears in the line.
F16  Try              Boolean  The word "try" appears in the line.
F17  Catch            Boolean  The word "catch" appears in the line.
F18  Expect           Boolean  The word "expect" appears in the line.
F19  Member access    Numeric  Counts member accessors: . or ->
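The feature table above can be sketched as a single extraction function. This is a rough illustration, not the tool's actual code: the function name `extract_features` and the dictionary keys are hypothetical, and several details are assumptions where the slide is not specific (the trailing newline is excluded from F02, keywords are matched as whole words, and F07 counts `=` signs that are not part of `==`, `!=`, `<=` or `>=`).

```python
import re
from pathlib import Path

# Keywords for the Boolean features F11..F18, in table order.
KEYWORDS = ("class", "for", "if", "while", "case", "try", "catch", "expect")

# Assumption: a "single assignment sign" is an '=' that is not part of
# '==', '!=', '<=' or '>='.
ASSIGNMENT_RE = re.compile(r"(?<![=!<>])=(?!=)")

def extract_features(line, filename):
    """Compute features F01-F19 for one line of a source file (illustrative sketch)."""
    raw = line.rstrip("\n")   # assumption: the line terminator is not counted
    trimmed = raw.strip()
    features = {
        "F01_file_extension": Path(filename).suffix.lstrip("."),
        "F02_full_length": len(raw),
        "F03_length": len(trimmed),
        "F04_tokens": len(raw.split()),   # split on white characters
        "F05_semicolons": raw.count(";"),
        "F06_comments": any(m in raw for m in ("//", "/*", "*/")) or trimmed.startswith("*"),
        "F07_assignments": len(ASSIGNMENT_RE.findall(raw)),
        "F08_brackets": raw.count("(") + raw.count(")"),
        "F09_square_brackets": raw.count("[") + raw.count("]"),
        "F10_curly_brackets": raw.count("{") + raw.count("}"),
        "F19_member_access": raw.count(".") + raw.count("->"),
    }
    # F11..F18: whole-word keyword presence (assumption: word boundaries apply).
    for number, word in enumerate(KEYWORDS, start=11):
        features[f"F{number}_{word}"] = re.search(rf"\b{word}\b", raw) is not None
    return features

example = extract_features("    if (x == 1) { y = a.b; } // set", "Foo.java")
```

Each resulting dictionary corresponds to one row of the feature table on the earlier slide, to which the decision class (count or ignore) is attached for training.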

Feature acquisition
• Bag of words approach (automatic)
– Tokenize: ()[]{}!@#$%^&*-=;:'"\|`~,.<>/?
– Treat split character as a token
– Calculate thresholds:
  • Frequencies of tokens in the code base (min. 5)
  • % of files a token is present in (min. 25%)
– If thresholds are met:
  • F_i: the number of times the token i occurs in a line
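The bag-of-words steps above could look roughly like the following sketch. The names `tokenize`, `select_vocabulary` and `line_features` are hypothetical, whitespace is assumed to separate tokens in addition to the listed split characters, and both thresholds are treated as inclusive minimums (an interpretation of "min. 5" and "min. 25%").

```python
import re
from collections import Counter

# Split characters listed on the slide; each one is also kept as a token.
SPLIT_CHARS = "()[]{}!@#$%^&*-=;:'\"\\|`~,.<>/?"
_ESCAPED = re.escape(SPLIT_CHARS)
# A token is either one split character or a run of other non-whitespace characters.
TOKEN_RE = re.compile(r"[" + _ESCAPED + r"]|[^\s" + _ESCAPED + r"]+")

def tokenize(line):
    """Split a line into tokens, treating each split character as its own token."""
    return TOKEN_RE.findall(line)

def select_vocabulary(files, min_count=5, min_file_share=0.25):
    """Keep tokens that are frequent enough in the code base and spread over enough files.

    `files` maps a file name to its list of lines.
    """
    total = Counter()          # occurrences of each token in the whole code base
    file_presence = Counter()  # number of files each token appears in
    for lines in files.values():
        seen_in_file = set()
        for line in lines:
            tokens = tokenize(line)
            total.update(tokens)
            seen_in_file.update(tokens)
        file_presence.update(seen_in_file)
    n_files = len(files)
    return sorted(
        token for token in total
        if total[token] >= min_count and file_presence[token] / n_files >= min_file_share
    )

def line_features(line, vocabulary):
    """F_i: the number of times vocabulary token i occurs in the line."""
    counts = Counter(tokenize(line))
    return {token: counts[token] for token in vocabulary}

code_base = {
    "A.java": ["int x = 1;", "int y = 2;", "x = y;"],
    "B.java": ["int z = 3;", "z = z;"],
}
vocabulary = select_vocabulary(code_base)
row = line_features("x = y = 1;", vocabulary)
```

In this tiny code base only `;` and `=` reach five occurrences, so they are the only automatically acquired features; a real code base would yield a much larger vocabulary.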

Preliminary validation
• RQ1: What level of prediction quality can be achieved by the proposed approach?
• RQ2: How does the automatic feature acquisition affect the classification quality?
• RQ3: How does the choice of classification algorithm affect the classification quality?

Code databases
• 2402 physical lines of code in total
– Eclipse: 475 LOC,
– Jasper Reports: 757 LOC,
– Spring MVC: 1170 LOC
• ELOC (Count 1492 / Ignore 910)
• Subjective (Count 1237 / Ignore 1165)
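As a quick consistency check on the figures above, the short sketch below sums the per-project sizes and computes the share of the larger class in each labelling. That share is the accuracy a trivial "always predict the majority class" rule would reach, which is a useful reference point when reading classification results. The variable and function names are illustrative, and the baseline is an added reference point, not a result reported in the slides.

```python
# Figures taken from the slide; names are illustrative.
projects = {"Eclipse": 475, "Jasper Reports": 757, "Spring MVC": 1170}
labelings = {
    "ELOC": {"count": 1492, "ignore": 910},
    "Subjective": {"count": 1237, "ignore": 1165},
}

total_lines = sum(projects.values())

def majority_share(labels):
    """Fraction of lines in the larger class (accuracy of a majority-class guess)."""
    return max(labels.values()) / sum(labels.values())

for name, labels in labelings.items():
    print(f"{name}: {sum(labels.values())} lines, majority share {majority_share(labels):.1%}")
```

Both labellings cover all 2402 lines; the ELOC labelling is noticeably more imbalanced than the subjective one.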
