Dependence Communities in Software

SLIDE 1

Dependence Communities in Software

Crest COW, UCL

Sebastian Danicic and James Hamilton

Goldsmiths, University of London

30th April 2012

1 / 58

SLIDE 2

Communities in Graphs

A network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally, with few connections to the rest of the network.

2 / 58

SLIDE 7

Communities in Real World Graphs

Many real-world networks are known to have community structure:

  • Social networks
  • Biological networks
  • Computer networks

Not all networks have community structure, e.g. random graphs.

7 / 58

SLIDE 8

Communities in Real World Graphs

“Graphical representation of the network of communities extracted from a Belgian mobile phone network. About 2M customers are represented on this network. The size of a node is proportional to the number of individuals in the corresponding community and its colour on a red-green scale represents the main language spoken in the community (red for French and green for Dutch). Only the communities composed of more than 100 customers have been plotted. Notice the intermediate community of mixed colours between the two main language clusters. A zoom at higher resolution reveals that it is made of several sub-communities with less apparent language separation.”

(Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. doi:10.1088/1742-5468/2008/10/P10008)

8 / 58

SLIDE 13

Does Software have Community Structure?

It depends on how you turn the software into a graph.

Consider the graph G1(P), in which n1 → n2 if and only if n1 and n2 are in the same function in program P.

Clearly G1 has community structure, but it’s not very interesting!

Previous work has shown that community structure exists in class dependence graphs.

13 / 58

SLIDE 18

‘Interesting’ Communities in Software

We are looking for communities which reflect semantic properties of programs. Where do we start?

  • We have to choose graphs which reflect semantic properties of programs.
  • We then find communities in these graphs.
  • Finally, we see if these communities reflect anything semantic.

18 / 58

SLIDE 24

Slice Graphs

In the slice graph S(P), n1 → n2 if and only if n2 is in the slice of P with respect to n1. In other words, n1 → n2 if and only if n1 depends on n2 in P.

Clearly, slice graphs can be considered ‘semantic’.

Question: Do slice graphs have community structure, and if so, are the communities ‘interesting’ or ‘useful’?

Intuitively, a community in a slice graph is a part of a program where there is strong internal inter-dependence. Perhaps dependence communities will highlight different semantic concerns within a program.

24 / 58
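The slice-graph definition can be made concrete with a small sketch. It assumes that the direct (data and control) dependences of a program are already available as a boolean matrix, and that the slice with respect to a node is everything reachable along those dependences, including the node itself. Both are simplifying assumptions for illustration, not a description of the slicer used for the talk. The names `MAX_NODES`, `Graph` and `slice_graph` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64  /* hypothetical upper bound on program nodes */

/* Boolean adjacency matrix: edge[a][b] means "a depends on b". */
typedef struct {
    int n;
    bool edge[MAX_NODES][MAX_NODES];
} Graph;

/* Turn direct dependences into a slice graph: afterwards edge[a][b] holds
 * exactly when b is in the backward slice of a, i.e. a depends on b
 * directly or transitively. Uses Warshall's transitive-closure algorithm. */
void slice_graph(const Graph *direct, Graph *slices)
{
    assert(direct->n >= 0 && direct->n <= MAX_NODES);
    int n = direct->n;
    slices->n = n;

    /* Start from the direct dependences; a node is in its own slice. */
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            slices->edge[a][b] = direct->edge[a][b] || a == b;

    /* If a reaches k and k reaches b, then a reaches b. */
    for (int k = 0; k < n; k++)
        for (int a = 0; a < n; a++)
            if (slices->edge[a][k])
                for (int b = 0; b < n; b++)
                    if (slices->edge[k][b])
                        slices->edge[a][b] = true;
}
```

For a chain of direct dependences 0 → 1 → 2, the resulting slice graph also contains the edge 0 → 2, while node 2 still depends only on itself.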

SLIDE 26

Modularity

Given a partition of a network, modularity is a measure of the ‘strength’ of the community structure of this partition:

\[
Q = (\text{fraction of edges that fall within communities in the given graph}) - (\text{expected fraction of edges within those communities in the null model}) \tag{1}
\]

The modularity of a weighted undirected graph is defined as

\[
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - E_{ij} \right] \delta(c_i, c_j) \tag{2}
\]

where $A_{ij}$ is the weight of the edge between $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges incident to vertex $i$, $c_i$ is the community to which vertex $i$ is assigned, $\delta(c_i, c_j)$ is 1 if $i$ and $j$ are in the same community and 0 otherwise, and $m = \frac{1}{2} \sum_{i,j} A_{ij}$. $E_{ij}$ is the expected number of edges between $i$ and $j$ in a random graph with the same degree distribution, which can be calculated as $E_{ij} = \frac{k_i k_j}{2m}$.

26 / 58
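Equation (2) translates almost directly into code. The sketch below assumes the graph is small enough to store as a dense symmetric weight matrix; `MAX_NODES`, `WeightedGraph` and `modularity` are hypothetical names chosen for this illustration.

```c
#include <assert.h>

#define MAX_NODES 64  /* hypothetical upper bound on vertices */

/* Weighted undirected graph: w[i][j] == w[j][i], 0 means "no edge". */
typedef struct {
    int n;
    double w[MAX_NODES][MAX_NODES];
} WeightedGraph;

/* Modularity of a partition, following equation (2):
 *   Q = 1/(2m) * sum_{i,j} (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)
 * community[i] is the label of the community containing vertex i. */
double modularity(const WeightedGraph *g, const int community[])
{
    assert(g->n > 0 && g->n <= MAX_NODES);
    double k[MAX_NODES] = {0};  /* k_i: weighted degree of vertex i */
    double two_m = 0.0;         /* 2m: sum of all entries of A */

    for (int i = 0; i < g->n; i++)
        for (int j = 0; j < g->n; j++) {
            k[i] += g->w[i][j];
            two_m += g->w[i][j];
        }
    assert(two_m > 0.0);        /* Q is undefined for an edgeless graph */

    double q = 0.0;
    for (int i = 0; i < g->n; i++)
        for (int j = 0; j < g->n; j++)
            if (community[i] == community[j])   /* delta(c_i, c_j) = 1 */
                q += g->w[i][j] - k[i] * k[j] / two_m;
    return q / two_m;
}
```

As a check, take two triangles joined by a single edge, all weights 1. With each triangle as one community the function gives Q = 5/14 ≈ 0.357; with every vertex in a single community it gives Q = 0.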

SLIDE 31

Algorithms for Finding Communities

Finding the partition with the best modularity is NP-hard, but tractable algorithms exist that approximate the best possible partition.

The Louvain method is a fast algorithm for detecting communities in large networks based upon modularity maximisation.

The algorithm combines neighbouring nodes until a local maximum of modularity is reached and then creates a new network of communities; these two steps are repeated until there is no further increase in modularity.

This creates a hierarchical decomposition of the network: at the lowest level all nodes are in their own community, and at the highest level nodes are in the communities which give the highest gain in modularity.

This technique is simple, fast and has good accuracy, and has been tested on networks with millions of vertices/edges.

31 / 58
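The first of the two Louvain steps, combining neighbouring nodes until modularity stops improving, can be sketched as follows. This is a simplified illustration only: it omits the second step (building the new network of communities and repeating), assumes a dense symmetric weight matrix without self-loops, and visits vertices in index order. The names `MAX_NODES` and `local_moving` are hypothetical. A vertex i with weighted degree k_i is moved to the neighbouring community C that maximises k_{i,C} − Σ_tot(C)·k_i / 2m, where k_{i,C} is the weight from i into C and Σ_tot(C) is the total degree of C; this quantity is proportional to the modularity gain of the move.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64  /* hypothetical upper bound on vertices */

/* One Louvain-style local-moving phase. w is a symmetric weight matrix
 * without self-loops. On return community[i] is the label of vertex i.
 * Returns the number of moves performed. */
int local_moving(int n, double w[MAX_NODES][MAX_NODES], int community[])
{
    assert(n > 0 && n <= MAX_NODES);
    double k[MAX_NODES];    /* weighted degree of each vertex */
    double tot[MAX_NODES];  /* total degree of each community */
    double two_m = 0.0;

    for (int i = 0; i < n; i++) {
        k[i] = 0.0;
        for (int j = 0; j < n; j++)
            k[i] += w[i][j];
        community[i] = i;   /* every vertex starts in its own community */
        tot[i] = k[i];
        two_m += k[i];
    }
    assert(two_m > 0.0);

    int moves = 0;
    bool improved = true;
    while (improved) {
        improved = false;
        for (int i = 0; i < n; i++) {
            double to[MAX_NODES] = {0};  /* weight from i into each community */
            for (int j = 0; j < n; j++)
                if (j != i)
                    to[community[j]] += w[i][j];

            int old = community[i], best = old;
            tot[old] -= k[i];            /* take i out of its community */
            double best_gain = to[old] - tot[old] * k[i] / two_m;

            for (int j = 0; j < n; j++) {
                if (j == i || w[i][j] <= 0.0)
                    continue;            /* only neighbours' communities */
                int c = community[j];
                double gain = to[c] - tot[c] * k[i] / two_m;
                if (gain > best_gain + 1e-12) {
                    best_gain = gain;
                    best = c;
                }
            }
            tot[best] += k[i];           /* put i into the chosen community */
            community[i] = best;
            if (best != old) {
                moves++;
                improved = true;
            }
        }
    }
    return moves;
}
```

On two triangles joined by a single edge, this phase ends with each triangle in its own community. A full implementation would then collapse each community into one node and run the phase again.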

SLIDE 32

Example of Communities in the Slice Graph: Sum/Product

    #include <stdio.h>

    int main() {
        const int N = 10;
        int sum = 0;       /* running sum of 1..N-1 */
        int product = 1;   /* running product of 1..N-1 */
        int i = 1;
        while (i < N) {
            sum = sum + i;
            product = product * i;
            i = i + 1;
        }
        printf("%d\n", sum);
        printf("%d\n", product);
    }

32 / 58
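To see why the sum and product computations can fall into different communities, it helps to compute two slices of this program by hand. The sketch below numbers the statements 0 to 9 in source order (0: `N`, 1: `sum = 0`, 2: `product = 1`, 3: `i = 1`, 4: the `while` test, 5: `sum = sum + i`, 6: `product = product * i`, 7: `i = i + 1`, 8: `printf` of `sum`, 9: `printf` of `product`). The dependence table was derived by hand for this illustration and is an assumption: it records ordinary data and control dependences and ignores any dependence of the `printf` calls on loop termination. `N_STMTS`, `dep` and `backward_slice` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define N_STMTS 10

/* dep[a][b] is true when statement a directly depends on statement b
 * (hand-derived data and control dependences of the sum/product program). */
static const bool dep[N_STMTS][N_STMTS] = {
    [4] = { [0] = true, [3] = true, [7] = true },              /* i < N        */
    [5] = { [1] = true, [3] = true, [4] = true, [7] = true },  /* sum += i     */
    [6] = { [2] = true, [3] = true, [4] = true, [7] = true },  /* product *= i */
    [7] = { [3] = true, [4] = true },                          /* i = i + 1    */
    [8] = { [1] = true, [5] = true },                          /* print sum    */
    [9] = { [2] = true, [6] = true },                          /* print product*/
};

/* Backward slice: mark every statement reachable from the criterion
 * by following dependence edges (depth-first, explicit stack). */
void backward_slice(int criterion, bool in_slice[N_STMTS])
{
    assert(criterion >= 0 && criterion < N_STMTS);
    int stack[N_STMTS], top = 0;  /* each statement is pushed at most once */

    for (int s = 0; s < N_STMTS; s++)
        in_slice[s] = false;
    in_slice[criterion] = true;
    stack[top++] = criterion;

    while (top > 0) {
        int a = stack[--top];
        for (int b = 0; b < N_STMTS; b++)
            if (dep[a][b] && !in_slice[b]) {
                in_slice[b] = true;
                stack[top++] = b;
            }
    }
}
```

Under these assumptions the slice of statement 8 is {0, 1, 3, 4, 5, 7, 8} and the slice of statement 9 is {0, 2, 3, 4, 6, 7, 9}. The two slices share only the loop control (0, 3, 4, 7); the sum statements never appear in the product slice and vice versa.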

SLIDE 33

Example of Communities in the Slice Graph: Word Count Program

The community detection separates out the code that does the counting from the code that does the I/O.

33 / 58

SLIDE 37

More Examples of Communities in the Slice Graph

  • GNU Chess: frontend, adapter and engine
  • GNU bc: parser, calculator
  • GNU robots: many communities due to low coupling

37 / 58

SLIDE 38

Applications - Detecting Dynamic Watermarks in Java Code

The red bits are the dynamic watermark we injected.

38 / 58

SLIDE 39

Applications - Detecting Dynamic Watermarks in Java Code

The red bits are the bits of the watermark discovered by the communities algorithm.

39 / 58

SLIDE 43

Dependence Clusters vs. Dependence Communities

A Dependence Cluster is a maximal set of mutually dependent vertices, i.e. a maximal clique in the slice graph.

Finding maximal cliques is also NP-hard.

A clique is a fully connected subgraph. This may be an overly strict requirement (Harman et al.).

Perhaps Dependence Communities are a ‘good enough’ approximation for what is required for Dependence Clusters.

43 / 58

SLIDE 47

Dependence Clusters vs. Dependence Communities

Because it is hard to compute Dependence Clusters, Harman et al. approximate them by saying: two nodes are in the same Dependence Cluster if and only if they have the same slice.

These are cliques, but not, in general, maximal ones.

It turns out that if we apply the Louvain algorithm to the same graphs, we get a partition with higher modularity. In other words, it produces ‘clusters’ with a stronger ‘internal inter-dependence’ than those produced by Harman’s approximation.

47 / 58
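The same-slice approximation is easy to state in code, which also shows why it yields a partition that can be compared with Louvain's by modularity. The sketch assumes the slice graph is given as a boolean matrix whose row a is the slice of node a; `MAX_NODES`, `same_slice` and `same_slice_clusters` are hypothetical names, and this is not taken from any existing tool.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64  /* hypothetical upper bound on program nodes */

/* True when nodes a and b have exactly the same slice (identical rows). */
static bool same_slice(int n, bool slice[MAX_NODES][MAX_NODES], int a, int b)
{
    for (int x = 0; x < n; x++)
        if (slice[a][x] != slice[b][x])
            return false;
    return true;
}

/* Label nodes so that two nodes share a label if and only if their slices
 * are identical. cluster[] receives labels 0..count-1; returns count. */
int same_slice_clusters(int n, bool slice[MAX_NODES][MAX_NODES], int cluster[])
{
    assert(n >= 0 && n <= MAX_NODES);
    int count = 0;
    for (int a = 0; a < n; a++) {
        cluster[a] = -1;
        for (int b = 0; b < a; b++)      /* reuse the label of an earlier twin */
            if (same_slice(n, slice, a, b)) {
                cluster[a] = cluster[b];
                break;
            }
        if (cluster[a] < 0)
            cluster[a] = count++;        /* first node seen with this slice */
    }
    return count;
}
```

The resulting `cluster` array has the same shape as a community assignment, so the two partitions of one slice graph can be scored with the same modularity measure.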

SLIDE 49

What does this mean?

It could be argued, therefore, that Dependence Communities may be a better approximation to Dependence Clusters...

... or at least a better approximation to the properties of programs that authors are trying to capture using Dependence Clusters!

49 / 58

SLIDE 51

Programs with Large Dependence Clusters are bad!

Is there a correlation between large Dependence Clusters and large Dependence Communities?

R = 0.8, p < 0.00001

51 / 58

SLIDE 57

Conclusions

  • We have introduced the concept of Dependence Communities.
  • We find these by using an algorithm that attempts to partition the slice graph in order to maximise the modularity.
  • There is a strong correlation between Dependence Communities and Dependence Clusters.
  • Programs that we investigated have a positive but, in most cases, not high modularity.
  • Dependence Communities reflect semantic properties of a program.

57 / 58

SLIDE 58

Thanks

Thanks for listening. Any questions?

58 / 58