[PPT] - Community Detection & Modularity PowerPoint Presentation

SLIDE 1

Community Detection & Modularity

The search for clustered and overlapping nodes

SLIDE 2

“In the end, more than they wanted freedom, they wanted security. They wanted a comfortable life and they lost it all -- security, comfort and freedom.... When the Athenians finally wanted not to give to society but for society to give to them, when the freedom they wished for most was freedom from responsibility, then Athens ceased to be free.”

- Edward Gibbon

SLIDE 3

Community Detection:

› Community Detection is the process of seeking out community structures within a network

But what is a community?

› Community structure is the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network
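The "more densely connected internally" criterion can be checked numerically. The sketch below is an illustration only: the function name `internal_external_density` and the toy graph (two triangles joined by one bridge edge) are assumptions, not part of the original slides.

```python
def internal_external_density(edges, group, n_nodes):
    """Return (internal density, external density) of `group`.

    Assumes an undirected, unweighted graph without self-loops.
    Internal density: links inside the group / possible links inside it.
    External density: links leaving the group / possible links leaving it.
    """
    group = set(group)
    n_in = len(group)
    n_out = n_nodes - n_in
    internal = sum(1 for u, v in edges if u in group and v in group)
    boundary = sum(1 for u, v in edges if (u in group) != (v in group))
    max_internal = n_in * (n_in - 1) / 2
    max_boundary = n_in * n_out
    return (
        internal / max_internal if max_internal else 0.0,
        boundary / max_boundary if max_boundary else 0.0,
    )


# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
dens_in, dens_out = internal_external_density(edges, {0, 1, 2}, 6)
print(dens_in, dens_out)
```

Here the group {0, 1, 2} uses every possible internal link but only one of nine possible links to the rest of the network, which is the pattern the definition describes.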

SLIDE 4

A community is essentially a subgraph selected from within a network

While this makes sense as a simplification, a community may not be a complete graph or may overlap with other communities

SLIDE 5

Barabasi’s Hypotheses

› A network’s community structure is uniquely encoded in its wiring diagram
› A community corresponds to a connected subgraph (connectedness)
› Communities correspond to locally dense neighborhoods of a network (density)
› Randomly wired networks are not expected to have a community structure.

SLIDE 6

As networks become larger and more complex, it is harder to detect defined communities

And thus we must employ algorithms to detect them where mere inference fails

SLIDE 7

Basic Partitioning

› In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized.
› The Kernighan-Lin algorithm attempts to find an optimal series of interchange operations between elements of A and B which maximizes the reduction in the total weight of the edges running between A and B

SLIDE 8

Kernighan-Lin Algorithm

In order to create partitions A and B, let $I_a$ be the internal cost of a, that is, the sum of the costs of edges between a and other nodes in A, and let $E_a$ be the external cost of a, that is, the sum of the costs of edges between a and nodes in B. Furthermore, let $D_a = E_a - I_a$ be the difference between the external and internal costs of a. If a and b are interchanged, then the reduction in cost is $T_{old} - T_{new} = D_a + D_b - 2c_{a,b}$, where $c_{a,b}$ is the cost of the possible edge between a and b.
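The swap gain used by Kernighan-Lin (external cost minus internal cost for each node, minus twice the cost of the edge between the swapped pair) can be sanity-checked against a direct recount of the cut. This is a minimal sketch; the helper names `cut_cost`, `d_value`, `swap_gain` and the unit-cost toy graph are assumptions made for illustration.

```python
def cut_cost(weights, A, B):
    """Total cost of edges with one endpoint in A and the other in B."""
    return sum(weights.get(frozenset((a, b)), 0) for a in A for b in B)


def d_value(weights, node, own, other):
    """D = external cost - internal cost of `node` (no self-loops assumed)."""
    external = sum(weights.get(frozenset((node, x)), 0) for x in other)
    internal = sum(weights.get(frozenset((node, x)), 0) for x in own if x != node)
    return external - internal


def swap_gain(weights, a, b, A, B):
    """Reduction in cut cost if a (in A) and b (in B) are interchanged."""
    c_ab = weights.get(frozenset((a, b)), 0)
    return d_value(weights, a, A, B) + d_value(weights, b, B, A) - 2 * c_ab


# Hypothetical toy graph: two triangles joined by edge (2,3), unit costs,
# deliberately started from a poor balanced partition.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = {frozenset(e): 1 for e in edges}
A, B = {0, 1, 3}, {2, 4, 5}

gain = swap_gain(weights, 3, 2, A, B)
cut_before = cut_cost(weights, A, B)
cut_after = cut_cost(weights, (A - {3}) | {2}, (B - {2}) | {3})
print(gain, cut_before, cut_after)
```

The predicted gain matches the difference between the cut before and after the swap, which is what lets the algorithm rank candidate swaps without recounting the whole cut each time.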

SLIDE 9

Kernighan-Lin Algorithm

    function Kernighan-Lin(G(V, E)):
        determine a balanced initial partition of the nodes into sets A and B

        do
            compute D values for all a in A and b in B
            let gv, av, and bv be empty lists
            for (n := 1 to |V| / 2)
                find a from A and b from B, such that g = D[a] + D[b] - 2*c(a, b) is maximal
                remove a and b from further consideration in this pass
                add g to gv, a to av, and b to bv
                update D values for the elements of A = A \ a and B = B \ b
            end for
            find k which maximizes g_max, the sum of gv[1],...,gv[k]
            if (g_max > 0) then
                Exchange av[1],av[2],...,av[k] with bv[1],bv[2],...,bv[k]
        until (g_max <= 0)
        return G(V, E)

SLIDE 10

Kernighan-Lin Algorithm

› Partition a network into two groups of predefined size. This partition is called a cut.
› Inspect each pair of nodes, one from each group. Identify the pair that results in the largest reduction of the cut size (links between the two groups) if we swap them.
› Swap them.
› If no pair reduces the cut size, we swap the pair that increases the cut size the least.
› The process is repeated until each node is moved once.

SLIDE 11

Kernighan-Lin Algorithm

Bipartitioning market data

SLIDE 12

Kernighan-Lin Algorithm

Mixed Market Highs

SLIDE 13

Kernighan-Lin Algorithm

Stock Market Indexes

SLIDE 14

Kernighan-Lin Algorithm

Cryptocurrency Partition

SLIDE 15

Hierarchical Clustering

Divisive Clustering
Divisive algorithms split communities by removing links that connect nodes with low similarity.
› Girvan-Newman algorithm

Agglomerative Clustering
Agglomerative algorithms merge nodes and communities with high similarity.
› Clauset-Newman-Moore algorithm
› Louvain algorithm

SLIDE 16

Girvan-Newman Algorithm

› The Girvan–Newman algorithm focuses on edges that are most likely "between" communities
› Vertex betweenness is an indicator of highly central nodes in networks
› The Girvan–Newman algorithm extends this definition to the case of edges, defining the "edge betweenness" of an edge as the number of shortest paths between pairs of nodes that run along it.

SLIDE 17

Girvan-Newman Algorithm

1. The betweenness of all existing edges in the network is calculated first.
2. The edge with the highest betweenness is removed.
3. The betweenness of all edges affected by the removal is recalculated.
4. Steps 2 and 3 are repeated until no edges remain.
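The steps above can be sketched in plain Python. This is an illustrative sketch, not the code used for the figures: it assumes an unweighted, undirected graph stored as a dict of neighbor sets with no self-loops, computes edge betweenness with a Brandes-style breadth-first accumulation, recomputes all betweenness values after each removal (simpler than updating only affected edges), and stops at the first split instead of continuing until no edges remain. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (fractional when paths tie)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths, record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path counts onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every node pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def first_split(adj):
    """Remove highest-betweenness edges until the component count grows."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy network: two triangles joined by the bridge (2,3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(adj)[frozenset((2, 3))]
parts = sorted(sorted(c) for c in first_split(adj))
print(bridge_score, parts)
```

On this toy graph the bridge carries all nine shortest paths between the two triangles, so it is removed first and the two triangles fall out as communities.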

SLIDE 18

Girvan-Newman Algorithm

As applied to stock indexes

SLIDE 19

Girvan-Newman Algorithm

Partitions stock indexes

SLIDE 20

Girvan-Newman Algorithm

High Currency Values

SLIDE 21

Girvan-Newman Algorithm

Cryptocurrency partitioned by volume

SLIDE 22

Divisive Clustering

Cryptocurrency and Foreign Exchange

SLIDE 23

Agglomerative Clustering

› Modularity is a scalar value between -1/2 and 1 that measures the density of edges inside communities relative to edges outside communities
› For a weighted graph, modularity is defined as:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where

$A_{ij}$ represents the edge weight between nodes $i$ and $j$;
$k_i$ and $k_j$ are the sum of the weights of the edges attached to nodes $i$ and $j$, respectively;
$m$ is the sum of all of the edge weights in the graph;
$c_i$ and $c_j$ are the communities of the nodes; and
$\delta$ is a simple delta function.

SLIDE 24

Clauset-Newman-Moore Algorithm

Partitioning Currencies by Optimizing Modularity

SLIDE 25

Louvain Algorithm

› First, each node in the network is assigned to its own community
› Then for each node i, the change in modularity is calculated for removing i from its own community and moving it into the community of each neighbor j of i

SLIDE 26

Louvain Algorithm

    # The slide shows only the function body; the signature below is assumed.
    def modularity(partition, graph, weight='weight'):
        # partition maps node -> community; graph is an undirected networkx-style graph
        inc = dict([])  # per community: weight of links inside it
        deg = dict([])  # per community: total weighted degree of its nodes
        links = graph.size(weight=weight)
        if links == 0:
            raise ValueError("A graph without link has an undefined modularity")

        for node in graph:
            com = partition[node]
            deg[com] = deg.get(com, 0.) + graph.degree(node, weight=weight)
            for neighbor, datas in graph[node].items():
                edge_weight = datas.get(weight, 1)
                if partition[neighbor] == com:
                    if neighbor == node:
                        inc[com] = inc.get(com, 0.) + float(edge_weight)
                    else:
                        # an internal edge is seen from both endpoints, so count half each time
                        inc[com] = inc.get(com, 0.) + float(edge_weight) / 2.

        res = 0.
        for com in set(partition.values()):
            res += (inc.get(com, 0.) / links) - \
                   (deg.get(com, 0.) / (2. * links)) ** 2
        return res

SLIDE 27

Louvain Algorithm

› By utilizing this formula for modularity we can build communities by optimizing the modularity value
› These communities are organized hierarchically in a dendrogram

SLIDE 28

Louvain Algorithm

    # Excerpt: Status, __one_level, __modularity, __renumber, induced_graph and
    # __MIN are defined elsewhere in the code this fragment was taken from.
    current_graph = graph.copy()
    status = Status()
    status.init(current_graph, weight, part_init)
    status_list = list()

    # First level: move nodes between communities, then collapse communities into nodes.
    __one_level(current_graph, status, weight, resolution, random_state)
    new_mod = __modularity(status)
    partition = __renumber(status.node2com)
    status_list.append(partition)
    mod = new_mod
    current_graph = induced_graph(partition, current_graph, weight)
    status.init(current_graph, weight)

    # Further levels: repeat until modularity stops improving by at least __MIN.
    while True:
        __one_level(current_graph, status, weight, resolution, random_state)
        new_mod = __modularity(status)
        if new_mod - mod < __MIN:
            break
        partition = __renumber(status.node2com)
        status_list.append(partition)
        mod = new_mod
        current_graph = induced_graph(partition, current_graph, weight)
        status.init(current_graph, weight)
    return status_list[:]

SLIDE 29

A set of Dendrograms

In hierarchical levels

SLIDE 30

A set of Dendrograms

In hierarchical levels

SLIDE 31

A set of Dendrograms

In hierarchical levels

SLIDE 32

The best partition

SLIDE 33

Louvain Algorithm

Cryptocurrency under Agglomerative partitioning

SLIDE 34

Louvain Algorithm

Cryptocurrency under Agglomerative partitioning

SLIDE 35

Louvain Algorithm

Cryptocurrency by volume

SLIDE 36

Louvain partitioning of the markets

SLIDE 37

Louvain partitioning of the coins and currency

SLIDE 38

Cliques

› Cliques are subgraphs in which every node is connected to every other node in the clique.
› Cliques allow nodes to be in multiple communities at a time
› One might choose cliques of a fixed size k or find the maximal cliques
› Maximal cliques are typically calculated through the Bron-Kerbosch algorithm

SLIDE 39

Cliques

    BronKerbosch1(R, P, X):
        if P and X are both empty:
            report R as a maximal clique
        for each vertex v in P:
            BronKerbosch1(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
            P := P \ {v}
            X := X ⋃ {v}
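The Bron-Kerbosch pseudocode translates almost line for line into Python. The sketch below is the basic variant without pivoting; the generator name `bron_kerbosch` and the toy graph are assumptions for illustration, and the graph is taken to be a dict mapping each node to its set of neighbors.

```python
def bron_kerbosch(adj, R=None, P=None, X=None):
    """Yield every maximal clique of an undirected graph (no pivoting).

    R: nodes in the clique being built, P: candidates that could extend it,
    X: nodes already handled, used to rule out non-maximal cliques.
    """
    R = set() if R is None else R
    P = set(adj) if P is None else P
    X = set() if X is None else X
    if not P and not X:
        yield frozenset(R)
        return
    for v in list(P):  # iterate over a snapshot because P shrinks in the loop
        yield from bron_kerbosch(adj, R | {v}, P & adj[v], X & adj[v])
        P.remove(v)
        X.add(v)


# Hypothetical toy network: two triangles joined by the edge (2,3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
cliques = set(bron_kerbosch(adj))
print(sorted(sorted(c) for c in cliques))
```

Nodes 2 and 3 each appear in two maximal cliques here, which is the overlap property that makes cliques usable for overlapping communities.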

SLIDE 40

The Index Graph is too small

SLIDE 41

Cryptocurrency

By volume

SLIDE 42

Cryptocurrency

Price network

SLIDE 43

Forex Averages

Maximal Clique Communities

SLIDE 44

Currency with Crypto

Maximal Clique Communities

SLIDE 45

Data Acquisition

Forex
› The OANDA API supports price streaming
› Posts prices on their website

Crypto
› CoinMarketCap provides us with easily scrapable prices

Stock
› Supposedly the Yahoo API is discontinued

SLIDE 47

Our process seemed straightforward

› Download Data
› Build Database
› Analyze Data

SLIDE 48

MARKET DATA

Requires curation, which must be automated with bash scripts

SLIDE 49

MARKET DATA

Data from different types of prices was joined in the database

SLIDE 50

Automated Downloader Turnkey VM with Database SSH Tunnel & Psycopg2 Internet Port forwarded SSH Tunnel to Notebook

SLIDE 51

Queries to Correlations

We can query which currencies we want to look at during specific intervals to generate a data matrix, and then compute the pairwise correlations between the currencies' price series
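A minimal sketch of the correlation step, assuming the query result has already been loaded into a dict of equal-length price lists. The prices below are made up for illustration, and Pearson correlation is an assumption, since the slides do not say which correlation measure was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(series):
    """Map every ordered pair of names to the correlation of their series."""
    names = list(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}


# Hypothetical data matrix: one price series per currency over the same interval.
prices = {
    "EUR": [1.0, 2.0, 3.0, 4.0],
    "GBP": [2.0, 4.0, 6.0, 8.0],
    "JPY": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(prices)
print(corr[("EUR", "GBP")], corr[("EUR", "JPY")])
```

The diagonal entries are always 1, so they carry no information and would normally be dropped before building a network from the matrix.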

SLIDE 52

Correlations to Networks

› Create a weighted network between every pair of currencies, with the correlations as weights
› Use a Maximum Spanning Tree to trim the network, keeping high correlations
› Add labels on nodes
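The trimming step can be sketched with Kruskal's algorithm run on edges sorted from highest to lowest weight. The function name `maximum_spanning_tree` and the correlation values are hypothetical; the slides do not say which spanning-tree routine was actually used.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm, taking the heaviest edges first.

    weighted_edges: iterable of (weight, u, v) tuples.
    Returns the kept edges as (u, v, weight) tuples.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the set representative, shortening the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two separate pieces, so keep it
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical correlations between four instruments (fully connected network).
nodes = ["EUR", "GBP", "JPY", "BTC"]
correlations = [
    (0.9, "EUR", "GBP"),
    (0.4, "GBP", "JPY"),
    (0.3, "JPY", "BTC"),
    (0.2, "EUR", "JPY"),
    (0.1, "EUR", "BTC"),
    (0.05, "GBP", "BTC"),
]
tree = maximum_spanning_tree(nodes, correlations)
print(tree)
```

The tree keeps one fewer edge than there are nodes, so the fully connected correlation network shrinks to its strongest backbone while every node stays reachable.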

SLIDE 53

Pitfalls

› A fully connected network has trivial metrics
› Self-loops always carry the highest correlation, since each currency correlates perfectly with itself
› Hard to find truly important nodes and relationships

SLIDE 54

Significance of Metrics

› Weighted degree: nodes with the most positive correlations
› Shortest path: chain of most negatively correlated currencies
› Betweenness centrality: currencies used as intermediary exchanges

SLIDE 55

Use for trading

› Find currency / index of interest
› Use max spanning tree as well as min

SLIDE 56

Intro to Modularity

Network: $N$ nodes, $L$ links, and a partition into $n_c$ communities, each having $N_c$ nodes connected to each other by $L_c$ links, where $c = 1, \ldots, n_c$. If $L_c$ is larger than the expected number of links between the $N_c$ nodes given the network’s degree sequence, then the nodes of the subgraph $C_c$ could be a true community

SLIDE 57

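Modularity Formulas

We therefore measure the difference between the network’s real wiring diagram ($A_{ij}$) and the expected number of links between i and j if the network is randomly wired ($p_{ij}$):

$$M_c = \frac{1}{2L} \sum_{(i,j) \in C_c} \left( A_{ij} - p_{ij} \right), \qquad p_{ij} = \frac{k_i k_j}{2L}$$

Summing over all $n_c$ communities gives

$$M = \sum_{c=1}^{n_c} \left[ \frac{L_c}{L} - \left( \frac{k_c}{2L} \right)^2 \right]$$

where $k_c$ is the total degree of the nodes in community $C_c$.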

SLIDE 58

Modularity Examples

SLIDE 59

Modularity: Key Properties

› Higher Modularity Implies Better Partition
The higher M is for a partition, the better the corresponding community structure. The partition with the maximum modularity (M = 0.41) accurately captures the two obvious communities. In general, M ≤ 1.

SLIDE 60

Key Properties cont.

› Zero and Negative Modularity
By taking the whole network as a single community we obtain M = 0, as in this case the two terms in the parentheses are equal. If each node belongs to a separate community, we have $L_c = 0$ and the sum has $n_c$ negative terms, hence M is negative
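These properties can be verified on a small example. The sketch assumes the per-community form of modularity, in which each community contributes (links inside it / L) minus (its total degree / 2L) squared, applied to an unweighted, undirected edge list; the function name and the toy graph of two triangles joined by a bridge are illustrative assumptions.

```python
def modularity(edges, community):
    """Sum over communities of L_c / L - (k_c / 2L)^2.

    edges: list of (u, v) pairs, community: dict mapping node -> label.
    """
    L = len(edges)
    links_inside, total_degree = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        total_degree[cu] = total_degree.get(cu, 0) + 1
        total_degree[cv] = total_degree.get(cv, 0) + 1
        if cu == cv:
            links_inside[cu] = links_inside.get(cu, 0) + 1
    return sum(
        links_inside.get(c, 0) / L - (k / (2 * L)) ** 2
        for c, k in total_degree.items()
    )


# Hypothetical toy network: two triangles joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

whole = modularity(edges, {v: 0 for v in range(6)})      # one big community
singles = modularity(edges, {v: v for v in range(6)})    # every node alone
triangles = modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(whole, singles, triangles)
```

The single-community partition gives zero, the all-singletons partition gives a negative value, and the two-triangle partition scores highest of the three.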

SLIDE 61

Limits of Modularity

Given the important role modularity plays in community identification, we must be aware of some of its limitations.

SLIDE 62

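Resolution Limit

Modularity maximization forces small communities into larger ones. If we merge communities A and B into a single community, the network’s modularity changes with

$$\Delta M_{AB} = \frac{l_{AB}}{L} - \frac{k_A k_B}{2L^2}$$

where $l_{AB}$ is the number of links between A and B, and $k_A$ and $k_B$ are the total degrees of the two communities.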

SLIDE 63

Resolution Limit highlighted

› Consider the case when $k_A k_B / (2L) < 1$
› The change in modularity from merging, $\Delta M_{AB}$, is then positive if there is at least one link between the two communities ($l_{AB} \geq 1$).
› We must therefore merge A and B to maximize modularity. Assume $k_A \sim k_B = k$; if the total degree of the communities satisfies $k \leq \sqrt{2L}$,
› then modularity increases by merging A and B into a single community, even if they are distinct.
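A short numerical check of the resolution limit, assuming the merge gain has the form (links between A and B) / L minus (degree of A times degree of B) / (2 L squared). The function names and the numbers are made up for illustration.

```python
from math import sqrt


def merge_gain(l_ab, k_a, k_b, L):
    """Change in modularity when communities A and B are merged."""
    return l_ab / L - (k_a * k_b) / (2 * L ** 2)


def resolution_limit(L):
    """Total degree below which two linked communities get merged."""
    return sqrt(2 * L)


# Hypothetical network with L = 1000 links and one link between A and B.
L = 1000
limit = resolution_limit(L)
gain_small = merge_gain(1, 20, 20, L)  # both communities below the limit
gain_large = merge_gain(1, 50, 50, L)  # both communities above the limit
print(limit, gain_small, gain_large)
```

With these numbers the limit is about 44.7: two communities of total degree 20 gain modularity by merging even though only one link joins them, while two communities of total degree 50 do not.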

SLIDE 64

Resolution Limit consequences

› Modularity maximization can’t detect communities smaller than the resolution limit. For the WWW sample, modularity maximization will have difficulties resolving communities with total degree $k_C \lesssim 1{,}730$.
› Real networks contain numerous small communities

SLIDE 65

Modularity Maxima