Automated Topic Naming to Support Cross-project Analysis of Software Maintenance Activities



SLIDE 1

Automated Topic Naming to Support Cross-project Analysis of Software Maintenance Activities

  • Abram Hindle, Dept. of Computer Science, University of California, Davis, Davis, CA, USA, abram@softwareprocess.es
  • Neil A. Ernst, Dept. of Computer Science, University of Toronto, Toronto, Ontario, CANADA, nernst@cs.toronto.edu
  • Michael W. Godfrey, David Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, CANADA, migod@uwaterloo.ca
  • John Mylopoulos, Dept. Information Eng. and Computer Science, University of Trento, Trento, ITALY, jm@disi.unitn.it

US NSF SHF Medium 0964703

SLIDE 2

Who Cares About Quality?

  • Managers
  • Investors
  • New Developers
  • Developers
  • Customers

SLIDE 3

What is this commit about?

Added a test for bug #1326 on OSX

SLIDE 4

What is this commit about?

Added a test for bug #1326 on OSX

SLIDE 5

What is this commit about?

Added a test for bug #1326 on OSX

  • Reliability
  • Maintainability
  • Portability

SLIDE 6

But we have many commits...

  • Reliability
  • Maintainability
  • Portability

SLIDE 7

[Diagram: Commits, LDA / LSI, Developer Topics. What is the purpose of each topic, e.g. Maintainability or Reliability?]

SLIDE 8

Cross Project Relevance

[Diagram: four Version Control repositories connected through Shared Concepts]

Shared Concepts:
  • portability
  • efficiency
  • maintainability
  • usability
  • reliability
  • functionality (includes correctness)

SLIDE 9

Quality-related Non Functional Requirements (NFRs)

  • portability
  • maintainability
  • reliability
  • functionality (includes correctness)
  • efficiency
  • usability

[iso9126] [cleland-huang03] [ernst10]

SLIDE 10

Can't we just summarize quality related efforts within this project?

[Diagram: Software Repositories (Source Code, Documentation, Build / Configuration, Tests, Revisions) related over time to Non-Functional Requirements: Maintainability, Functionality, Portability, Efficiency, Usability, Reliability]

SLIDE 11

Labelled Developer Topics

[Plot: Unique Topics against Time (months)]

SLIDE 12

Labelled Developer Topics

[Plot: Unique Topics against Time (months), annotated with Linux Kernel, Windows, AMD64]

SLIDE 13

Labelled Developer Topics

[Plot: Unique Topics against Time (months), with topics labelled maintainability, portability, reliability, efficiency, functionality]

SLIDE 14

Example

(apologies to those with prior LDA/LSI experience)

[Blei]

SLIDE 15

[Newspaper sections: Arts, International News, Opinion]

SLIDE 16

[Diagram: Articles grouped under Sections such as Arts and International News]

SLIDE 17

What if we didn't know what section the articles were in?

[Diagram: a collection of Articles with no section labels]

SLIDE 18

LDA / LSI

[Diagram: the unlabelled Articles and LDA / LSI]

SLIDE 19

LDA / LSI

[Diagram: Articles on either side of LDA / LSI]

SLIDE 20

LDA / LSI

Word Distribution

Documents are represented as word distributions (word counts).

[Diagram: an Article reduced to counts of words such as cat, dog, car, city, pound, festival, street, mischief]
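
A minimal sketch of the word-count representation described on this slide. The tokenizer, the function name word_counts and the sample sentence are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
import re


def word_counts(text):
    """Return a bag-of-words Counter for one document (lowercased tokens)."""
    tokens = re.findall(r"[a-z0-9#']+", text.lower())
    return Counter(tokens)


# Hypothetical article text that reuses some of the words shown on the slide.
article = "The dog chased the cat down the street; the cat escaped to the pound."
counts = word_counts(article)

# Word order is discarded; only how often each word occurs is kept.
print(counts["cat"], counts["dog"], counts["street"])
```

A word that never occurs, such as "festival" here, simply has a count of zero.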

SLIDE 21

LDA / LSI

Topics: Independent Word Distributions

LDA finds independent word distributions that the documents are related to. Documents can be associated with more than one topic.

SLIDE 22

Topics: Independent Word Distributions

[Diagram: topics Sports and Entertainment; articles "Baseball", "Movie Theatre Review" and "Athlete and Actor Award Nominees" (the Original Article)]

SLIDE 23

Topics: Independent Word Distributions

c₁ × Sports + c₂ × Entertainment ≈ Athlete and Actor

Documents are represented as a linear combination of independent topics.
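
The linear-combination idea can be illustrated with a toy calculation. The vocabulary, the topic probabilities and the weights 0.6 and 0.4 below are invented for illustration; real LDA output would be estimated from a corpus.

```python
# Toy vocabulary and two hypothetical topics (one word probability per vocabulary entry).
vocab = ["game", "player", "movie", "theatre"]
sports = [0.5, 0.5, 0.0, 0.0]
entertainment = [0.0, 0.0, 0.5, 0.5]


def mix(weights, topics):
    """Weighted sum of topic word distributions: sum over k of c_k * topic_k."""
    return [
        sum(c * topic[i] for c, topic in zip(weights, topics))
        for i in range(len(topics[0]))
    ]


# Assume an "Athlete and Actor" article is 60% sports and 40% entertainment.
c1, c2 = 0.6, 0.4
approx_doc = mix([c1, c2], [sports, entertainment])
print(dict(zip(vocab, approx_doc)))
```

Because the weights sum to one and each topic is a probability distribution, the mixed vector is again a word distribution.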

SLIDE 24

LDA / LSI

Here are two topics. I don't know what they are about!

Topic 1
  • play
  • game
  • inning
  • player
  • quarter
  • opponent
  • ...

Topic 2
  • gambling
  • play
  • night life
  • comedy
  • movie
  • theatre
  • ...

These word lists look like: Sports and Entertainment!

SLIDE 25

SLIDE 26

Word bag analysis

  • Portability
  • Usability
  • Reliability
  • Efficiency
  • Maintainability

SLIDE 27

Word Bag Examples

  • Portability: portability, transferability, interoperability, documentation, internationalization, i18n, ...
  • Reliability: reliability, failure, error, redundancy, fails, bug, ...
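
One way word bags like these could be used to name a topic is sketched below. The matching rule (a topic receives an NFR label when at least one of its top words appears in that NFR's bag) and the names WORD_BAGS and label_topic are assumptions for illustration; the slides do not spell out the exact rule.

```python
# Word bags copied from the slide; the slide elides the rest of each list with "...".
WORD_BAGS = {
    "portability": {"portability", "transferability", "interoperability",
                    "documentation", "internationalization", "i18n"},
    "reliability": {"reliability", "failure", "error", "redundancy",
                    "fails", "bug"},
}


def label_topic(top_words, word_bags=WORD_BAGS):
    """Return the sorted NFR labels whose word bag shares a word with the topic."""
    words = {w.lower() for w in top_words}
    return sorted(nfr for nfr, bag in word_bags.items() if words & bag)


# Hypothetical top words of two extracted topics.
both = label_topic(["bug", "test", "i18n", "osx"])   # matches both word bags
none = label_topic(["refactor", "cleanup"])          # matches neither word bag
print(both, none)
```

A topic can receive several labels or none, which fits the multi-label topics shown on the later timeline slides.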

SLIDE 28

Labelled Topics of MaxDB 7.500

[Plot: Unique Topics against Time (months), with topics labelled maintainability, portability, reliability, efficiency, functionality]

SLIDE 29

MaxDB 7.500 Timeline

[Timeline: Maintainability, Portability, Maintainability, Portability, Reliability, Efficiency, Maintainability, Efficiency]

SLIDE 30

Topics of MySQL 3.23

[Plot: Unique Topics against Time (months), with tags: functionality, functionality, usability, functionality/portability, efficiency, maintainability/reliability/portability, portability, reliability, portability, portability, portability, functionality/portability, reliability/usability, usability]

SLIDE 31

MySQL 3.23 Timeline

[Timeline: Maintainability, Functionality, Reliability, Portability, Maintainability, Functionality, Efficiency, Portability, Functionality]

SLIDE 32

ROC Values of Semi-Supervised Word Bags

[Chart: ROC (axis from 0.4 to 0.8) per NFR (portability, efficiency, reliability, functionality, maintainability, usability, total) for MaxDB exp 3, MySQL exp 3, MaxDB exp 2 and MySQL exp 2]

SLIDE 33

Supervised Tags

SLIDE 34

Supervised Multitag Classifiers: MySQL and MaxDB

[Charts: MaxDB Classifiers and MySQL Classifiers, comparing BR, CLR and HOMER on micro and macro scores (axis from 0.4 to 1)]

SLIDE 35

Conclusions

  • Developer Topic Analysis and Labelling: Version Control revisions, LDA / LSI, and topics labelled portability, reliability, efficiency, usability, maintainability, functionality [Hindle09ICSM]
  • Who cares: Managers, Investors and Acquisitions, New Developers, Core Developers, Customers
  • Shared Concepts across Version Control repositories: portability, efficiency, maintainability, usability, reliability, and functionality (includes correctness)

http://softwareprocess.es/name/

SLIDE 36

F-1 Measure of Semi-Supervised Word Bags

[Chart: F-1 (axis from 0.1 to 0.8) per NFR (portability, efficiency, reliability, functionality, maintainability, usability, total) for MaxDB exp 3, MySQL exp 3, MaxDB exp 2 and MySQL exp 2]
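
For readers unfamiliar with the F-1 measure reported here, the sketch below computes it for a single NFR label as the harmonic mean of precision and recall. The predicted and actual label vectors are invented; they are not results from the talk.

```python
def f1_score(predicted, actual):
    """F-1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # labelled and correct
    fp = sum(1 for p, a in pairs if p and not a)    # labelled but wrong
    fn = sum(1 for p, a in pairs if not p and a)    # missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: does each of five topics carry the "portability" label?
predicted = [True, True, False, True, False]   # labels assigned via word bags
actual = [True, False, False, True, True]      # labels assigned by an annotator
score = f1_score(predicted, actual)
print(round(score, 3))
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so the F-1 score is also 2/3.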

SLIDE 37

[Diagram: Topic 1, Topic 20 and Topic 10 placed between Few Documents and Many Documents]

SLIDE 38

Annotation: Stop Words

MaxDB 7.500 Case Study

  • topics joined due to similarity
  • 2 long trends instead of one

SLIDE 39

Annotation: Training Sets

[Diagram: Version Control, with revisions marked Maintainability+ or Maintainability-]

SLIDE 40

Annotation: Stop Words

[Word cloud of English stop words, e.g. already, thoroughly, might, beside, perhaps, necessary, clearly, almost, between, because, everything, nevertheless, elsewhere, meanwhile, ...]

Used in topic analysis or to reduce # of features for learners.
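
A minimal sketch of stop-word removal as a feature-reduction step. The stop-word set is a small sample of the words in the cloud above, and the commit message is hypothetical; a real stop list would be far longer.

```python
# A few stop words recovered from the slide's word cloud; a real list is much longer.
STOP_WORDS = {"already", "might", "perhaps", "because", "between", "almost"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical commit message.
message = "This might already fix the bug because of the POSIX change"
tokens = message.split()
filtered = remove_stop_words(tokens)

# Fewer tokens means fewer features for a topic model or classifier.
print(len(tokens), len(filtered))
print(filtered)
```

Domain words such as "bug" and "POSIX" survive the filter, while generic words that carry little topical signal are dropped.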

SLIDE 41

Annotation: Training Sets

[Diagram: Version Control revisions labelled Maintainability+ or Maintainability-, with a MANUAL step (sample and correct) and an AUTO step]

SLIDE 42

Message, Word Distribution, Topic, Trend

Top 10 Words:
  • perforce
  • bug #
  • POSIX
  • Opteron
  • ...