Community Detection & Modularity - PowerPoint PPT Presentation



slide-1
SLIDE 1

Community Detection & Modularity

The search for clustered and overlapping nodes

slide-2
SLIDE 2

“In the end, more than they wanted freedom, they wanted security. They wanted a comfortable life and they lost it all -- security, comfort and freedom.... When the Athenians finally wanted not to give to society but for society to give to them, when the freedom they wished for most was freedom from responsibility, then Athens ceased to be free.”

- Edward Gibbon

slide-3
SLIDE 3

Community Detection:

› Community detection is the process of seeking out community structures within a network. But what is a community?
› Community structure is the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network
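The "more densely connected internally" criterion can be checked by counting links. The following is a minimal sketch, assuming an undirected graph given as a list of node pairs; the function name `internal_external_links` and the toy edge list are hypothetical and not taken from the slides.

```python
def internal_external_links(edges, community):
    """Count links with both ends inside `community` and links that leave it."""
    internal = sum(1 for u, v in edges if u in community and v in community)
    # A link leaves the community when exactly one endpoint is a member.
    external = sum(1 for u, v in edges if (u in community) != (v in community))
    return internal, external


# Toy graph (made up): two triangles joined by the single link 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
inside, leaving = internal_external_links(edges, {1, 2, 3})
print(inside, leaving)
```

A candidate group with many internal links and few leaving links matches the definition above; this count alone does not say whether the group is the best possible community.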

slide-4
SLIDE 4

A community is essentially a subgraph selected from within a network

While this makes sense as a simplification, a community may not be a complete graph and may overlap with other communities

slide-5
SLIDE 5

Barabasi’s Hypotheses

› A network’s community structure is uniquely encoded in its wiring diagram
› A community corresponds to a connected subgraph (connectedness)
› Communities correspond to locally dense neighborhoods of a network (density)
› Randomly wired networks are not expected to have a community structure.

slide-6
SLIDE 6

As networks become larger and more complex, it is harder to detect defined communities

And thus we must employ algorithms to detect them where mere inference fails

slide-7
SLIDE 7

Basic Partitioning

› In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized.
› The Kernighan-Lin algorithm attempts to find an optimal series of interchange operations between elements of A and B which maximizes the reduction in the total weight of the edges crossing between the two sets

slide-8
SLIDE 8

Kernighan-Lin Algorithm

In order to create partitions A and B, let $I_a$ be the internal cost of a, that is, the sum of the costs of edges between a and other nodes in A, and let $E_a$ be the external cost of a, that is, the sum of the costs of edges between a and nodes in B. Furthermore, let $D_a = E_a - I_a$ be the difference between the external and internal costs of a. If a and b are interchanged, then the reduction in cost is $D_a + D_b - 2c_{a,b}$, where $c_{a,b}$ is the cost of the possible edge between a and b.

slide-9
SLIDE 9

Kernighan-Lin Algorithm

    function Kernighan-Lin(G(V,E)):
        determine a balanced initial partition of the nodes into sets A and B

        do
            compute D values for all a in A and b in B
            let gv, av, and bv be empty lists
            for (n := 1 to |V|/2)
                find a from A and b from B, such that g = D[a] + D[b] - 2*c(a, b) is maximal
                remove a and b from further consideration in this pass
                add g to gv, a to av, and b to bv
                update D values for the elements of A = A \ a and B = B \ b
            end for
            find k which maximizes g_max, the sum of gv[1],...,gv[k]
            if (g_max > 0) then
                Exchange av[1],av[2],...,av[k] with bv[1],bv[2],...,bv[k]
        until (g_max <= 0)
        return G(V,E)

slide-10
SLIDE 10

Kernighan-Lin Algorithm

› Partition a network into two groups of predefined size. This partition is called a cut.
› Inspect each pair of nodes, one from each group. Identify the pair whose swap results in the largest reduction of the cut size (the number of links between the two groups).
› Swap them.
› If no pair reduces the cut size, swap the pair that increases the cut size the least.
› The process is repeated until each node has been moved once.
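The gain used by Kernighan-Lin, g = D[a] + D[b] - 2*c(a, b), can be checked on a small graph. This is a minimal sketch of the D values and the gain of a single swap, not the full multi-pass algorithm; the helper names (`build_adjacency`, `d_values`, `swap_gain`, `cut_size`) and the toy graph are hypothetical.

```python
from collections import defaultdict


def build_adjacency(weighted_edges):
    """Undirected weighted adjacency: adj[u][v] = cost of edge u-v."""
    adj = defaultdict(dict)
    for u, v, w in weighted_edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def d_values(adj, A, B):
    """D[v] = external cost minus internal cost of v for the partition (A, B)."""
    D = {}
    for v in A | B:
        own = A if v in A else B
        internal = sum(w for u, w in adj[v].items() if u in own)
        external = sum(w for u, w in adj[v].items() if u not in own)
        D[v] = external - internal
    return D


def swap_gain(adj, D, a, b):
    """Reduction in cut cost if a (in A) and b (in B) are interchanged."""
    return D[a] + D[b] - 2 * adj[a].get(b, 0)


def cut_size(adj, A, B):
    """Total cost of edges with one end in A and the other in B."""
    return sum(w for a in A for u, w in adj[a].items() if u in B)


# Toy graph (made up): two triangles joined by the edge 3-4, all costs 1.
edges = [(1, 2, 1), (1, 3, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1), (4, 6, 1), (5, 6, 1)]
adj = build_adjacency(edges)
A, B = {1, 2, 4}, {3, 5, 6}          # deliberately poor starting partition
D = d_values(adj, A, B)
gain = swap_gain(adj, D, 4, 3)
before = cut_size(adj, A, B)
after = cut_size(adj, (A - {4}) | {3}, (B - {3}) | {4})
print(gain, before, after)
```

The predicted gain equals the observed drop in cut size, which is the property the algorithm relies on when it ranks candidate swaps.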

slide-11
SLIDE 11

Kernighan-Lin Algorithm

Bipartitioning market data

slide-12
SLIDE 12

Kernighan-Lin Algorithm

Mixed Market Highs

slide-13
SLIDE 13

Kernighan-Lin Algorithm

Stock Market Indexes

slide-14
SLIDE 14

Kernighan-Lin Algorithm

Cryptocurrency Partition

slide-15
SLIDE 15

Hierarchical Clustering

Divisive Clustering
Divisive algorithms split communities by removing links that connect nodes with low similarity.
› Girvan-Newman algorithm

Agglomerative Clustering
Agglomerative algorithms merge nodes and communities with high similarity.
› Clauset-Newman-Moore algorithm
› Louvain algorithm

slide-16
SLIDE 16

Girvan-Newman Algorithm

› The Girvan–Newman algorithm focuses on edges that are most likely "between" communities
› Vertex betweenness is an indicator of highly central nodes in networks
› The Girvan–Newman algorithm extends this definition to the case of edges, defining the "edge betweenness" of an edge as the number of shortest paths between pairs of nodes that run along it.

slide-17
SLIDE 17

Girvan-Newman Algorithm

1. The betweenness of all existing edges in the network is calculated first.
2. The edge with the highest betweenness is removed.
3. The betweenness of all edges affected by the removal is recalculated.
4. Steps 2 and 3 are repeated until no edges remain.
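The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the graph is unweighted, undirected and free of self-loops, and is stored as a dict of neighbor sets. It recomputes every edge's betweenness after each removal rather than only the affected ones, and it stops at the first split instead of removing every edge. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths running along each edge (ties split fractionally)."""
    score = {}
    for s in adj:
        # Breadth-first search from s, counting shortest paths to every node.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path counts onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Every pair of nodes was counted once from each endpoint.
    return {edge: total / 2 for edge, total in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the number of components grows."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph (made up): two triangles joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
parts = girvan_newman_split(graph)
print(parts)
```

Calling `girvan_newman_split` again on each resulting component would continue the division and produce the hierarchy that divisive clustering is meant to expose.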

slide-18
SLIDE 18

Girvan-Newman Algorithm

As applied to stock indexes

slide-19
SLIDE 19

Girvan-Newman Algorithm

Partitions stock indexes

slide-20
SLIDE 20

Girvan-Newman Algorithm

High Currency Values

slide-21
SLIDE 21

Girvan-Newman Algorithm

Cryptocurrency partitioned by volume

slide-22
SLIDE 22

Divisive Clustering

Cryptocurrency and Foreign Exchange

slide-23
SLIDE 23

Agglomerative Clustering

› Modularity is a scalar value between -1/2 and 1 that measures the density of edges inside communities relative to edges between communities
› For a weighted graph, modularity is defined as:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where
  • $A_{ij}$ represents the edge weight between nodes $i$ and $j$;
  • $k_i$ and $k_j$ are the sums of the weights of the edges attached to nodes $i$ and $j$, respectively;
  • $m$ is the sum of all of the edge weights in the graph;
  • $c_i$ and $c_j$ are the communities of the nodes; and
  • $\delta$ is a simple delta function.
slide-24
SLIDE 24

Clauset-Newman-Moore Algorithm

Partitioning Currencies by Optimizing Modularity

slide-25
SLIDE 25

Louvain Algorithm

› First, each node in the network is assigned to its own community
› Then for each node i, the change in modularity is calculated for removing i from its own community and moving it into the community of each neighbor j of i

slide-26
SLIDE 26

Louvain Algorithm

    # Signature assumed; the slide shows only the function body.
    def modularity(partition, graph, weight="weight"):
        inc = dict([])   # weight of links inside each community
        deg = dict([])   # total weighted degree of each community
        links = graph.size(weight=weight)
        if links == 0:
            raise ValueError("A graph without link has an undefined modularity")

        for node in graph:
            com = partition[node]
            deg[com] = deg.get(com, 0.) + graph.degree(node, weight=weight)
            for neighbor, datas in graph[node].items():
                edge_weight = datas.get(weight, 1)
                if partition[neighbor] == com:
                    if neighbor == node:
                        inc[com] = inc.get(com, 0.) + float(edge_weight)
                    else:
                        # each internal link is visited from both endpoints
                        inc[com] = inc.get(com, 0.) + float(edge_weight) / 2.

        res = 0.
        for com in set(partition.values()):
            res += (inc.get(com, 0.) / links) - \
                   (deg.get(com, 0.) / (2. * links)) ** 2
        return res
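The first Louvain phase described above (start with singletons, then try moving each node into a neighbor's community) can be sketched without any library. This is a deliberately naive illustration: it recomputes the full modularity for every trial move instead of using an incremental gain formula, it assumes an unweighted, undirected graph stored as a dict of neighbor sets, and it omits the later aggregation phase. The names `modularity` and `local_moving` here are hypothetical stand-ins, not the library's functions.

```python
def modularity(adj, community):
    """Sum over communities of L_c / L - (k_c / 2L)^2 for an unweighted graph."""
    L = sum(len(nbrs) for nbrs in adj.values()) / 2
    inside, degree = {}, {}
    for v, nbrs in adj.items():
        c = community[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        # Each internal link is seen from both endpoints, hence the 0.5.
        inside[c] = inside.get(c, 0) + 0.5 * sum(1 for u in nbrs if community[u] == c)
    return sum(inside[c] / L - (degree[c] / (2 * L)) ** 2 for c in degree)


def local_moving(adj):
    """Move nodes into neighboring communities while modularity strictly improves."""
    community = {v: v for v in adj}            # every node starts alone
    improved = True
    while improved:
        improved = False
        for i in adj:
            best_c, best_m = community[i], modularity(adj, community)
            for j in adj[i]:
                trial = dict(community)
                trial[i] = community[j]
                m = modularity(adj, trial)
                if m > best_m + 1e-12:         # require a real improvement
                    best_c, best_m = community[j], m
            if best_c != community[i]:
                community[i] = best_c
                improved = True
    return community


# Toy graph (made up): two triangles joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
singletons = {v: v for v in graph}
result = local_moving(graph)
print(result, modularity(graph, result))
```

Because a move is accepted only when modularity strictly increases, the loop terminates, but the partition it reaches depends on the order in which nodes are visited.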

slide-27
SLIDE 27

Louvain Algorithm

› By utilizing this formula for modularity we can build communities by optimizing the modularity value
› These communities are organized hierarchically in a dendrogram

slide-28
SLIDE 28

Louvain Algorithm

    # Function name and signature are assumed from the names used in the body;
    # Status, __one_level, __modularity, __renumber, induced_graph and __MIN
    # are defined elsewhere and are not shown on the slide.
    def generate_dendrogram(graph, part_init, weight, resolution, random_state):
        current_graph = graph.copy()
        status = Status()
        status.init(current_graph, weight, part_init)
        status_list = list()

        # First pass: local moves, then collapse communities into nodes.
        __one_level(current_graph, status, weight, resolution, random_state)
        new_mod = __modularity(status)
        partition = __renumber(status.node2com)
        status_list.append(partition)
        mod = new_mod
        current_graph = induced_graph(partition, current_graph, weight)
        status.init(current_graph, weight)

        # Repeat on the induced graph until modularity stops improving.
        while True:
            __one_level(current_graph, status, weight, resolution, random_state)
            new_mod = __modularity(status)
            if new_mod - mod < __MIN:
                break
            partition = __renumber(status.node2com)
            status_list.append(partition)
            mod = new_mod
            current_graph = induced_graph(partition, current_graph, weight)
            status.init(current_graph, weight)
        return status_list[:]

slide-29
SLIDE 29

A Set of Dendrograms

In hierarchical levels

slide-30
SLIDE 30

A Set of Dendrograms

In hierarchical levels

slide-31
SLIDE 31

A Set of Dendrograms

In hierarchical levels

slide-32
SLIDE 32

The best partition

slide-33
SLIDE 33

Louvain Algorithm

Cryptocurrency under Agglomerative partitioning

slide-34
SLIDE 34

Louvain Algorithm

Cryptocurrency under Agglomerative partitioning

slide-35
SLIDE 35

Louvain Algorithm

Cryptocurrency by volume

slide-36
SLIDE 36

Louvain partitioning of the markets

slide-37
SLIDE 37

Louvain partitioning of the coins and currency

slide-38
SLIDE 38

Cliques

› Cliques are subgraphs in which every node is connected to every other node in the clique.
› Cliques allow nodes to be in multiple communities at a time
› One might choose cliques of a fixed size k or find the maximal cliques
› Maximal cliques are typically calculated through the Bron-Kerbosch algorithm

slide-39
SLIDE 39

Cliques

    BronKerbosch1(R, P, X):
        if P and X are both empty:
            report R as a maximal clique
        for each vertex v in P:
            BronKerbosch1(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
            P := P \ {v}
            X := X ⋃ {v}
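The Bron-Kerbosch pseudocode translates almost line for line into Python. This is a minimal sketch of the basic variant without pivoting, assuming the graph is a dict that maps each node to the set of its neighbors; the `out` list used to collect results and the toy graph are illustrative choices, not part of the slides.

```python
def bron_kerbosch(R, P, X, neighbors, out):
    """Append every maximal clique that extends R to `out`.

    R: nodes in the current clique, P: candidates, X: already-processed nodes.
    """
    if not P and not X:
        out.append(set(R))
        return
    for v in list(P):                 # iterate over a snapshot; P changes below
        bron_kerbosch(R | {v}, P & neighbors[v], X & neighbors[v], neighbors, out)
        P = P - {v}
        X = X | {v}


# Toy graph (made up): a triangle 1-2-3 with node 4 attached to node 3.
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = []
bron_kerbosch(set(), set(neighbors), set(), neighbors, cliques)
print(cliques)
```

Node 3 appears in both maximal cliques, which illustrates how clique-based communities let a node belong to more than one community.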

slide-40
SLIDE 40

The Index Graph is too small

slide-41
SLIDE 41

Cryptocurrency

By volume

slide-42
SLIDE 42

Cryptocurrency

Price network

slide-43
SLIDE 43

Forex Averages

Maximal Clique Communities

slide-44
SLIDE 44

Currency with Crypto

Maximal Clique Communities

slide-45
SLIDE 45

Data Acquisition

Forex
› The OANDA API supports price streaming
› Posts prices on their website

Crypto
› CoinMarketCap provides us with easily scrapable prices

Stock
› Supposedly the Yahoo API is discontinued


slide-47
SLIDE 47

Our process seemed straightforward

› Download Data
› Build Database
› Analyze Data

slide-48
SLIDE 48

MARKET DATA

Requires curation, which must be automated with bash scripts

slide-49
SLIDE 49

MARKET DATA

Data from different types of prices was joined in the database

slide-50
SLIDE 50

[Architecture diagram. Labels: Automated Downloader; Turnkey VM with Database; SSH Tunnel & Psycopg2; Internet; Port-forwarded SSH Tunnel to Notebook]

slide-51
SLIDE 51

Queries to Correlations

We can query the currencies we want to look at over specific intervals to generate a data matrix, and then compute the correlations between its columns
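Turning a queried data matrix into pairwise correlations might look like the following sketch. It assumes Pearson correlation, which the slides do not specify, and uses made-up price series under placeholder names (`AAA`, `BBB`, `CCC`) rather than real market data; the `pearson` helper is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Illustrative data matrix: one price series per (placeholder) instrument.
prices = {
    "AAA": [1.0, 2.0, 3.0, 4.0],
    "BBB": [2.0, 4.0, 6.0, 8.0],   # moves with AAA
    "CCC": [4.0, 3.0, 2.0, 1.0],   # moves against AAA
}
names = sorted(prices)
corr = {
    (a, b): pearson(prices[a], prices[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, value in corr.items():
    print(pair, round(value, 3))
```

Only pairs of distinct instruments are computed here, which sidesteps the trivial correlation of a series with itself.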

slide-52
SLIDE 52

Correlations to Networks

› Create a weighted network between every pair of currencies, with the correlations as weights
› Use a Maximum Spanning Tree to trim the network, keeping high correlations
› Add labels on nodes
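Trimming the fully connected correlation network to a maximum spanning tree can be sketched with Kruskal's algorithm run on edges sorted from highest to lowest weight. Kruskal's algorithm is an assumption here, since the slides do not say which method was used, and the correlation values below are made up for illustration.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges sorted by descending weight."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the set representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two separate pieces
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Illustrative correlations (made-up values, not market data).
nodes = ["EUR", "GBP", "JPY", "BTC"]
correlations = [
    ("EUR", "GBP", 0.9), ("EUR", "JPY", 0.4), ("GBP", "JPY", 0.5),
    ("EUR", "BTC", 0.1), ("GBP", "BTC", 0.2), ("JPY", "BTC", 0.3),
]
tree = maximum_spanning_tree(nodes, correlations)
print(tree)
```

The tree keeps one fewer edge than there are nodes, so every instrument stays connected through its strongest available correlations.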

slide-53
SLIDE 53

Pitfalls

› A fully connected network has trivial metrics
› Self loops will always be the highest correlations
› It is hard to find truly important nodes and relationships

slide-54
SLIDE 54

Significance of Metrics

› Weighted degree: nodes with the most positive correlations
› Shortest path: chain of the most negatively correlated currencies
› Betweenness centrality: currencies used as intermediary exchanges

slide-55
SLIDE 55

Use for trading

› Find the currency or index of interest
› Use the maximum spanning tree as well as the minimum spanning tree

slide-56
SLIDE 56

Intro to Modularity

Network: N nodes, L links, and a partition into $n_c$ communities, each having $N_c$ nodes connected to each other by $L_c$ links, where $c = 1, \ldots, n_c$. If $L_c$ is larger than the expected number of links between the $N_c$ nodes given the network’s degree sequence, then the nodes of the subgraph $C_c$ could be a true community

slide-57
SLIDE 57

Modularity Formulas

We therefore measure the difference between the network’s real wiring diagram ($A_{ij}$) and the expected number of links between i and j if the network is randomly wired ($p_{ij}$).
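The equations for this slide did not survive extraction. As a hedged reconstruction, assuming the usual degree-preserving null model $p_{ij} = k_i k_j / 2L$ (an assumption, since the slide text does not state it), the comparison of $A_{ij}$ with $p_{ij}$ leads to the per-community and total modularity below, where $k_c$ denotes the total degree of the nodes in community $C_c$:

```latex
% Modularity of a single community C_c
M_c = \frac{1}{2L} \sum_{(i,j) \in C_c} \left( A_{ij} - p_{ij} \right),
\qquad p_{ij} = \frac{k_i k_j}{2L}

% Summing A_{ij} over pairs inside C_c gives 2 L_c, and summing k_i k_j gives k_c^2, so
M_c = \frac{L_c}{L} - \left( \frac{k_c}{2L} \right)^2

% Modularity of the whole partition
M = \sum_{c=1}^{n_c} \left[ \frac{L_c}{L} - \left( \frac{k_c}{2L} \right)^2 \right]
```

The two terms in the bracket are the ones compared in the zero and negative modularity cases: the first is the fraction of links inside the community, the second the fraction expected under random wiring.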

slide-58
SLIDE 58

Modularity Examples

slide-59
SLIDE 59

Modularity: Key Properties

› Higher Modularity Implies Better Partition
The higher M is for a partition, the better the corresponding community structure. The partition with the max modularity (M=0.41) accurately captures the two obvious communities. M <= 1 in general

slide-60
SLIDE 60

Key Properties cont.

› Zero and Negative Modularity
By taking the whole network as a single community we obtain M=0, as in this case the two terms in the parentheses are equal. If each node belongs to a separate community, we have $L_c = 0$ and the sum has $n_c$ negative terms, hence M is negative
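These properties can be checked numerically. The sketch below assumes the partition form of modularity, M = Σ_c [L_c/L − (k_c/2L)²], for an unweighted graph given as a list of links; the toy graph is made up and is not the example shown on the slides.

```python
def modularity(edges, communities):
    """M = sum over communities of L_c / L - (k_c / (2 L)) ** 2."""
    L = len(edges)
    total = 0.0
    for nodes in communities:
        # L_c: links with both endpoints inside the community.
        L_c = sum(1 for u, v in edges if u in nodes and v in nodes)
        # k_c: total degree of the community's nodes (one count per link endpoint).
        k_c = sum((u in nodes) + (v in nodes) for u, v in edges)
        total += L_c / L - (k_c / (2 * L)) ** 2
    return total


# Toy graph (made up): two triangles joined by the link 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
whole = modularity(edges, [{1, 2, 3, 4, 5, 6}])
split = modularity(edges, [{1, 2, 3}, {4, 5, 6}])
singles = modularity(edges, [{n} for n in range(1, 7)])
print(whole, split, singles)
```

The single-community partition gives zero, the all-singletons partition is negative, and the split into the two triangles scores highest of the three.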

slide-61
SLIDE 61

Limits of Modularity

Given the important role modularity plays in community identification, we must be aware of some of its limitations.

slide-62
SLIDE 62

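Resolution Limit

Modularity maximization forces small communities into larger ones. If we merge communities A and B into a single community, the network’s modularity changes by

$$\Delta M_{AB} = \frac{l_{AB}}{L} - \frac{k_A k_B}{2L^2}$$

where $l_{AB}$ is the number of links between A and B, and $k_A$ and $k_B$ are the total degrees of the nodes in A and B.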

slide-63
SLIDE 63

Resolution Limit highlighted

› Consider the case when $k_A k_B / (2L) < 1$
› Then $\Delta M_{AB} > 0$ if there is at least one link between the two communities ($l_{AB} \geq 1$).
› We must therefore merge A and B to maximize modularity. Assume $k_A \sim k_B = k$: the condition holds if the total degree of the communities satisfies $k \leq \sqrt{2L}$
› Then modularity increases by merging A and B into a single community, even if they’re distinct.

slide-64
SLIDE 64

Resolution Limit consequences

› Modularity maximization can’t detect communities smaller than the resolution limit. For the WWW sample, modularity maximization will have difficulties resolving communities with total degree $k_C \lesssim 1{,}730$.
› Real networks contain numerous small communities
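A quick numerical check of the resolution limit, as a sketch under one assumption: that merging A and B changes modularity by ΔM_AB = l_AB/L − k_A·k_B/(2L²), the form consistent with the condition k_A·k_B/(2L) < 1 quoted on the slides. The function names and the sample numbers are hypothetical.

```python
from math import sqrt


def delta_modularity(l_ab, k_a, k_b, L):
    """Change in modularity when communities A and B are merged (assumed form)."""
    return l_ab / L - (k_a * k_b) / (2 * L ** 2)


def resolution_limit(L):
    """Total degree below which two linked, equal-sized communities get merged."""
    return sqrt(2 * L)


L = 50                                             # made-up network size
small = delta_modularity(l_ab=1, k_a=5, k_b=5, L=L)     # below the limit
large = delta_modularity(l_ab=1, k_a=20, k_b=20, L=L)   # above the limit
print(resolution_limit(L), small, large)
```

With a single link between them, the two small communities gain modularity by merging while the two large ones lose it, which is the behavior the resolution limit describes.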

slide-65
SLIDE 65

Modularity Maxima