slide-1
SLIDE 1

Latent Social Structure in Open Source Projects

Philipp Brüschweiler

ETH Zürich

March 16, 2010

Paper by Christian Bird, David Pattison, Raissa D’Souza, Vladimir Filkov and Premkumar Devanbu

slide-4
SLIDE 4

Research Question

Number of interactions grows quadratically with team size

Divide and conquer

Open Source Software (OSS) projects not formally organized

Latent social structure?

Not explicit, but observable
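
The quadratic growth mentioned on this slide comes from counting pairs: a team of n people has n(n-1)/2 possible pairwise interactions. The sketch below illustrates this and the effect of "divide and conquer", under the simplifying assumption that only pairs within the same group interact. The function names are illustrative and do not come from the paper.

```python
def potential_interactions(team_size: int) -> int:
    """Number of distinct pairs in a team: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2


def within_group_pairs(team_size: int, groups: int) -> int:
    """Pairs left when the team is split into near-equal groups
    and only within-group pairs are counted (an assumption)."""
    base, extra = divmod(team_size, groups)
    sizes = [base + 1] * extra + [base] * (groups - extra)
    return sum(potential_interactions(size) for size in sizes)


whole_team = potential_interactions(20)
split_team = within_group_pairs(20, 4)
print(whole_team, split_team)
```

With 20 people there are 190 possible pairs, while four groups of five leave only 40 within-group pairs.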

slide-5
SLIDE 5

Studied Projects

Ant, Apache, Python, Perl, PostgreSQL

slide-10
SLIDE 10

Project Selection Criteria

Well-known and stable projects

Complex codebases with several subsystems

Different governance structures:

Foundation (Apache and Ant)
Community (PostgreSQL)
Monarchist (Python and Perl)

slide-14
SLIDE 14

Data Mining

Build social network of mailing list participants

Download and parse mailing list archives
Reconstruct threads of conversation
Answers to emails ⇔ create link between authors

Extract code information

Author and time of commit
File names, file contents

Time intervals of 3 months
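
The network construction above can be sketched as follows. This is a minimal illustration, not the paper's tooling: the message fields (`id`, `author`, `in_reply_to`, `date`) are hypothetical, calendar quarters stand in for the 3-month intervals, and real archives would also need header parsing and alias resolution, which are not shown.

```python
from collections import Counter
from datetime import datetime


def quarter_of(timestamp):
    """Map a timestamp to a (year, quarter) bucket; an assumed 3-month interval."""
    return (timestamp.year, (timestamp.month - 1) // 3 + 1)


def build_reply_networks(messages):
    """Return {(year, quarter): Counter({frozenset({a, b}): reply_count})}.

    `messages` is a list of dicts with the hypothetical keys
    'id', 'author', 'in_reply_to' and 'date'.
    """
    author_of = {m["id"]: m["author"] for m in messages}
    networks = {}
    for m in messages:
        parent_author = author_of.get(m["in_reply_to"])
        # Skip thread starters, replies to unknown messages, and self-replies.
        if parent_author is None or parent_author == m["author"]:
            continue
        edge = frozenset((m["author"], parent_author))
        networks.setdefault(quarter_of(m["date"]), Counter())[edge] += 1
    return networks


messages = [
    {"id": "m1", "author": "alice", "in_reply_to": None, "date": datetime(2009, 1, 5)},
    {"id": "m2", "author": "bob", "in_reply_to": "m1", "date": datetime(2009, 1, 6)},
    {"id": "m3", "author": "alice", "in_reply_to": "m2", "date": datetime(2009, 4, 2)},
]
networks = build_reply_networks(messages)
```

Each reply adds weight to the undirected link between the replier and the author of the message being answered, in the interval in which the reply was sent.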

slide-17
SLIDE 17

Finding Community Structure – Modularity

Network partitioned into groups

Modularity:

$$\sum_i \left[ \frac{\mathit{inside}_i}{n} - \left( \frac{\mathit{all}_i}{n} \right)^2 \right]$$

inside_i: number of connections inside group i
all_i: number of all connections to or from group i (including inside_i)
n: total number of connections

Intuition: ratio of connections inside groups vs. between groups
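
A small sketch of the modularity computation, assuming the slide's formula reads as the sum over groups of inside_i / n minus (all_i / n) squared, with the three quantities defined as above. The function name and the example graph are illustrative, and other normalizations of modularity exist that give different numbers.

```python
from collections import Counter


def modularity(edges, group_of):
    """Sum over groups of inside_i / n - (all_i / n) ** 2.

    edges: list of (u, v) pairs, one per connection.
    group_of: dict mapping each node to its group label.
    """
    n = len(edges)
    if n == 0:
        return 0.0
    inside = Counter()    # connections with both ends in the group
    touching = Counter()  # all connections to or from the group (all_i)
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        touching[gu] += 1
        if gu == gv:
            inside[gu] += 1
        else:
            touching[gv] += 1
    return sum(inside[g] / n - (touching[g] / n) ** 2 for g in touching)


# Two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Putting every node into a single group gives 0 under this formula, while the two-triangle split gives 2 * (3/7 - (4/7)^2) = 10/49, or roughly 0.20.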

slide-21
SLIDE 21

Finding Community Structure – Modularity

Values between 0 and 1

0: not modular
1: disconnected complete graphs

Modularity of known modular networks ranges from 0.3 to 0.7

Find partition of the network that yields highest modularity

NP-complete, approximation used
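
The slides do not say which approximation was used. As one possible illustration, the sketch below applies a simple greedy merge: every node starts in its own group, and the pair of groups whose merge raises modularity the most is merged until no merge helps. This is a swapped-in technique, not necessarily the paper's algorithm, and the modularity function assumes the formula sum of inside_i / n minus (all_i / n) squared.

```python
from collections import Counter
from itertools import combinations


def modularity(edges, group_of):
    """Sum over groups of inside_i / n - (all_i / n) ** 2 (assumed formula)."""
    n = len(edges)
    inside, touching = Counter(), Counter()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        touching[gu] += 1
        if gu == gv:
            inside[gu] += 1
        else:
            touching[gv] += 1
    return sum(inside[g] / n - (touching[g] / n) ** 2 for g in touching)


def greedy_partition(edges):
    """Greedy agglomerative search; returns (group_of, modularity)."""
    nodes = sorted({node for edge in edges for node in edge})
    group_of = {node: node for node in nodes}  # each node starts alone
    best = modularity(edges, group_of)
    while True:
        best_trial = None
        for a, b in combinations(sorted(set(group_of.values())), 2):
            # Try merging group b into group a.
            trial = {node: (a if g == b else g) for node, g in group_of.items()}
            q = modularity(edges, trial)
            if q > best + 1e-12:
                best, best_trial = q, trial
        if best_trial is None:  # no merge improves modularity
            return group_of, best
        group_of = best_trial


edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
groups, q = greedy_partition(edges)
```

On this toy graph of two triangles joined by one edge, the greedy search ends with the two triangles as separate groups. It recomputes modularity from scratch for every candidate merge, so it is only suitable for small examples.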

slide-22
SLIDE 22

Example Network

This network has modularity 0.39

slide-23
SLIDE 23

Spontaneous Formation of Subcommunities

Hypothesis 1: Mailing list participants spontaneously form subcommunities, and the modularity values of these subcommunities will be significant.

slide-24
SLIDE 24

Strong Community Structure

Very significant when compared to a random network

Hypothesis 1 confirmed

slide-27
SLIDE 27

Product and Process Messages

Product messages

About code

Process messages

Everything else, e.g., high-level architecture discussions

Automatic classification by scanning for names of files, functions, classes, ...
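
A minimal sketch of such a classifier, assuming a set of known code names (files, functions, classes) has already been extracted from the repository. The tokenization rule and the example names are illustrative assumptions, not the paper's exact procedure.

```python
import re


def classify_message(body, known_names):
    """Return 'product' if the message mentions a known code name, else 'process'."""
    # Tokens may contain word characters, dots, slashes and hyphens,
    # so file names like 'http_request.c' survive as one token.
    tokens = {t.strip(".,;:!?'\"") for t in re.findall(r"[\w./-]+", body)}
    return "product" if tokens & set(known_names) else "process"


# Hypothetical names, standing in for identifiers mined from the source tree.
known_names = {"http_request.c", "ap_read_request", "RequestParser"}

product_example = classify_message(
    "The crash happens in ap_read_request() when the header is empty.",
    known_names,
)
process_example = classify_message(
    "Should we move to a six-month release cycle?",
    known_names,
)
```

A message counts as a product message as soon as one known name appears in it; everything else falls into the process category, matching the "everything else" definition on the slide.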

slide-28
SLIDE 28

Higher Modularity of Product Messages

Hypothesis 2: Modularity values of networks constructed from only product messages will be higher than when only process messages or all messages are used.

slide-29
SLIDE 29

Hypothesis Confirmed

Hypothesis 2 confirmed

Successful projects form focused subcommunities for product-related work

slide-30
SLIDE 30

Subcommunities Signify Collaboration

Hypothesis 3: Pairs of developers within the same subcommunity will have more files in common than pairs of developers from different subcommunities.

slide-31
SLIDE 31

Defining Collaboration

Compare the number of files worked on in common by pairs of developers in

the same subcommunity
different subcommunities
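
The comparison can be sketched as below: for every pair of developers, count the files both have worked on, and collect the counts separately for pairs inside one subcommunity and pairs across subcommunities. The data layout and names are hypothetical, and the sketch only compares means, without the statistical test a real analysis would need.

```python
from itertools import combinations
from statistics import mean


def shared_file_counts(files_by_dev, community_of):
    """Return (same, different): shared-file counts per developer pair."""
    same, different = [], []
    for a, b in combinations(sorted(files_by_dev), 2):
        shared = len(files_by_dev[a] & files_by_dev[b])
        if community_of[a] == community_of[b]:
            same.append(shared)
        else:
            different.append(shared)
    return same, different


# Toy data: sets of files touched by each developer, and their subcommunity.
files_by_dev = {
    "ann": {"a.c", "b.c"},
    "bob": {"a.c", "b.c", "c.c"},
    "cy": {"x.py"},
    "dee": {"x.py", "a.c"},
}
community_of = {"ann": 0, "bob": 0, "cy": 1, "dee": 1}

same, different = shared_file_counts(files_by_dev, community_of)
mean_same, mean_different = mean(same), mean(different)
```

In this toy data, pairs within a subcommunity share 1.5 files on average and pairs across subcommunities share 0.5.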

slide-32
SLIDE 32

Hypothesis Confirmed

Hypothesis 3 confirmed

Social interaction linked with programming collaboration

slide-33
SLIDE 33

Subcommunities are Focused

Hypothesis 4: Subcommunities focus their attention on small parts of the system, so the average directory distance of files worked on by a subcommunity will be small.

slide-36
SLIDE 36

Directory Distance

Directory distance is the tree distance in the directory tree

Find average directory distance of files that were worked on by a subcommunity

Compare to random samples of developers
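
One way to compute this measure is sketched below. It assumes the distance is taken between the directories that contain the two files (steps up to the deepest common ancestor plus steps down), which is one reading of "tree distance"; the paper may count nodes differently. Function names and paths are illustrative.

```python
from itertools import combinations
from pathlib import PurePosixPath
from statistics import mean


def directory_distance(file_a, file_b):
    """Tree distance between the directories containing two files."""
    dir_a = PurePosixPath(file_a).parent.parts
    dir_b = PurePosixPath(file_b).parent.parts
    common = 0
    for part_a, part_b in zip(dir_a, dir_b):
        if part_a != part_b:
            break
        common += 1
    # Steps up from dir_a to the common ancestor, then down to dir_b.
    return (len(dir_a) - common) + (len(dir_b) - common)


def average_directory_distance(files):
    """Mean pairwise directory distance over a set of files."""
    pairs = list(combinations(sorted(files), 2))
    return mean(directory_distance(a, b) for a, b in pairs) if pairs else 0


files = ["src/core/http.c", "src/core/util.c", "src/modules/ssl/ssl.c"]
avg = average_directory_distance(files)
```

Here the two files in `src/core` are at distance 0 from each other and at distance 3 from the file in `src/modules/ssl`, so the average is 2. The same function applied to files touched by randomly sampled developers would give the baseline for the comparison.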

slide-39
SLIDE 39

Hypothesis Not Confirmed

No significantly lower directory distance

Hypothesis 4 not confirmed

Possible explanations:

Hypothesis incorrect
Directory distance is not a good measure of task focus

slide-42
SLIDE 42

Conclusion

OSS projects have strong social structures

Code discussion more modular than general discussion

Social interaction is linked with programming collaboration