Community Detection & Modularity
The search for clustered and overlapping nodes
“In the end, more than they wanted freedom, they wanted security. They wanted a comfortable life and they lost it all -- security, comfort and freedom.... When the Athenians finally wanted not to give to society but for society to give to them, when the freedom they wished for most was freedom from responsibility, then Athens ceased to be free.”
Edward Gibbon
Community Detection:
› Community detection is the process of seeking out community structures within a network. But what is a community?
› Community structure is the occurrence of groups of nodes that are more densely connected internally than with the rest of the network.
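A minimal sketch of the "denser inside than outside" idea, assuming an unweighted, undirected edge list. The toy graph (two triangles joined by one bridge) and the function name are hypothetical and not part of the slides.

```python
def internal_external_density(edges, community, n_nodes):
    """Return (internal density, external density) for a candidate community."""
    community = set(community)
    n_c = len(community)
    # Links with both endpoints inside the candidate community.
    internal = sum(1 for u, v in edges if u in community and v in community)
    # Links with exactly one endpoint inside the candidate community.
    external = sum(1 for u, v in edges if (u in community) != (v in community))
    max_internal = n_c * (n_c - 1) / 2      # possible links inside
    max_external = n_c * (n_nodes - n_c)    # possible links to the rest
    return internal / max_internal, external / max_external

# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
d_in, d_out = internal_external_density(edges, {0, 1, 2}, n_nodes=6)
print(d_in, d_out)
```

Here the candidate {0, 1, 2} uses every possible internal link but only one of its nine possible external links, which is the pattern the definition describes.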
A community is essentially a subgraph selected from within a network
While this makes sense as a simplification, a community need not be a complete graph and may overlap with other communities.
Barabási's Hypotheses
› A network's community structure is uniquely encoded in its wiring diagram
› A community corresponds to a connected subgraph (connectedness)
› Communities correspond to locally dense neighborhoods of a network (density)
› Randomly wired networks are not expected to have a community structure
As networks become larger and more complex, it is harder to detect defined communities
And thus we must employ algorithms to detect them where mere inference fails
Basic Partitioning
› In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized.
› The Kernighan-Lin algorithm attempts to find a series of interchanges between elements of A and B that maximizes the reduction in the total weight of edges crossing the partition.
Kernighan-Lin Algorithm
In order to create partitions A and B, let $I_a$ be the internal cost of a, that is, the sum of the costs of edges between a and other nodes in A, and let $E_a$ be the external cost of a, that is, the sum of the costs of edges between a and nodes in B. Furthermore, let $D_a = E_a - I_a$ be the difference between the external and internal costs of a. If a and b are interchanged, then the reduction in cost is $T_{old} - T_{new} = D_a + D_b - 2c_{a,b}$, where $c_{a,b}$ is the cost of the possible edge between a and b.
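The cost definitions can be checked on a small example. The sketch below assumes edge costs stored in a dict keyed by unordered node pairs. The four-node graph, its weights and the helper names are made up for illustration.

```python
def d_value(node, own, other, cost):
    """D = external cost minus internal cost of `node`."""
    internal = sum(cost.get(frozenset((node, x)), 0) for x in own if x != node)
    external = sum(cost.get(frozenset((node, x)), 0) for x in other)
    return external - internal

def swap_gain(a, b, A, B, cost):
    """Reduction in cut cost if a (in A) and b (in B) are interchanged."""
    c_ab = cost.get(frozenset((a, b)), 0)
    return d_value(a, A, B, cost) + d_value(b, B, A, cost) - 2 * c_ab

def cut_cost(A, B, cost):
    """Total cost of edges running between A and B."""
    return sum(cost.get(frozenset((a, b)), 0) for a in A for b in B)

# Hypothetical weighted graph: a-c costs 3, b-d costs 2, a-b costs 1.
cost = {frozenset("ac"): 3, frozenset("bd"): 2, frozenset("ab"): 1}
A, B = {"a", "b"}, {"c", "d"}

gain = swap_gain("b", "c", A, B, cost)
before = cut_cost(A, B, cost)
after = cut_cost({"a", "c"}, {"b", "d"}, cost)
print(gain, before - after)
```

The gain predicted from the D values matches the drop in cut cost measured directly, which is what the interchange formula claims.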
› Partition a network into two groups of predefined size. This partition is called a cut.
› Inspect each pair of nodes, one from each group. Identify the pair that results in the largest reduction of the cut size (links between the two groups) if we swap them.
› Swap them.
› If no pair reduces the cut size, swap the pair that increases the cut size the least.
› The process is repeated until each node is moved once.
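The steps above can be sketched as a single pass over an unweighted graph. This is a brute-force illustration, not an efficient implementation: it recomputes the cut size for every candidate swap. Keeping the best partition seen during the pass is the usual Kernighan-Lin convention but is an addition to what the slide states. The toy graph and names are hypothetical.

```python
def cut_size(edges, A):
    """Number of links with exactly one endpoint in A."""
    return sum(1 for u, v in edges if (u in A) != (v in A))

def kernighan_lin_pass(edges, A, B):
    """Swap unlocked pairs until every node has moved once; keep the best cut."""
    A, B = set(A), set(B)
    locked = set()
    best_cut, best = cut_size(edges, A), (set(A), set(B))
    while (A - locked) and (B - locked):
        # Evaluate every unlocked pair and pick the swap with the smallest
        # resulting cut (largest reduction, or least increase if none reduce).
        candidates = []
        for a in sorted(A - locked):
            for b in sorted(B - locked):
                swapped = (A - {a}) | {b}
                candidates.append((cut_size(edges, swapped), a, b))
        new_cut, a, b = min(candidates)
        A = (A - {a}) | {b}
        B = (B - {b}) | {a}
        locked |= {a, b}          # each node moves at most once per pass
        if new_cut < best_cut:
            best_cut, best = new_cut, (set(A), set(B))
    return best, best_cut

# Hypothetical toy graph: two triangles joined by the bridge (2, 3),
# started from a deliberately poor split.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
(best_A, best_B), best_cut = kernighan_lin_pass(edges, {0, 1, 3}, {2, 4, 5})
print(best_A, best_B, best_cut)
```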
Kernighan-Lin Algorithm
Bipartitioning market data
[Figure: Mixed Market Highs]
[Figure: Stock Market Indexes]
[Figure: Cryptocurrency Partition]
Hierarchical Clustering
Divisive Clustering
Divisive algorithms split communities by removing links that connect nodes with low similarity.
› Girvan-Newman algorithm
Agglomerative Clustering
Agglomerative algorithms merge nodes and communities with high similarity.
› Clauset-Newman-Moore algorithm
› Louvain algorithm
Girvan-Newman Algorithm
› The Girvan–Newman algorithm focuses on edges that are most likely "between" communities
› Vertex betweenness is an indicator of highly central nodes in a network
› The Girvan–Newman algorithm extends this definition to the case of edges, defining the "edge betweenness" of an edge as the number of shortest paths between pairs of nodes that run along it
1. The betweenness of all existing edges in the network is calculated first.
2. The edge with the highest betweenness is removed.
3. The betweenness of all edges affected by the removal is recalculated.
4. Steps 2 and 3 are repeated until no edges remain.
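The steps above can be sketched in plain Python for a small unweighted graph. Two simplifications are assumed here: betweenness is recomputed for all edges after each removal (not only the affected ones), and the loop stops at the first split instead of running until no edges remain. Self-loops are assumed absent. The function names and toy graph are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path counts back from the farthest nodes toward s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_first_split(adj):
    """Remove highest-betweenness edges until the graph first splits."""
    adj = {u: set(vs) for u, vs in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical toy graph: two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = girvan_newman_first_split(adj)
print(communities)
```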
Girvan-Newman Algorithm
As applied to stock indexes
[Figure: Partitions stock indexes]
[Figure: High Currency Values]
[Figure: Cryptocurrency partitioned by volume]
Divisive Clustering
[Figure: Cryptocurrency and Foreign Exchange]
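Agglomerative Clustering
› Modularity is a scale value between -1/2 and 1 that measures the density of edges inside communities relative to edges between communities
› For a weighted graph, modularity is defined as:
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where
› $A_{ij}$ is the edge weight between nodes $i$ and $j$;
› $k_i$ and $k_j$ are the sums of the weights of the edges attached to nodes $i$ and $j$, respectively;
› $m$ is the sum of all of the edge weights in the graph;
› $c_i$ and $c_j$ are the communities of the nodes; and
› $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise.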
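A minimal sketch of the weighted modularity $Q = \frac{1}{2m}\sum_{ij}[A_{ij} - k_i k_j / 2m]\,\delta(c_i, c_j)$, assuming each undirected edge is listed once with no self-loops. The toy graph, unit weights and function name are hypothetical.

```python
def modularity(weights, community):
    """weights: {(i, j): w} with each undirected edge listed once (i != j).
    community: {node: community label}."""
    m = sum(weights.values())
    # Weighted degree k_i of every node.
    k = {}
    for (i, j), w in weights.items():
        k[i] = k.get(i, 0.0) + w
        k[j] = k.get(j, 0.0) + w
    # Edge term: A_ij appears twice in the double sum (as ij and as ji).
    q = sum(2 * w for (i, j), w in weights.items()
            if community[i] == community[j])
    # Null-model term: the sum of k_i * k_j over same-community ordered pairs
    # equals the squared total degree of each community.
    totals = {}
    for node, deg in k.items():
        totals[community[node]] = totals.get(community[node], 0.0) + deg
    q -= sum(t * t for t in totals.values()) / (2 * m)
    return q / (2 * m)

# Hypothetical toy graph: two triangles joined by a bridge, unit weights.
weights = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
           (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
whole = {n: "a" for n in range(6)}

q_split = modularity(weights, split)
q_whole = modularity(weights, whole)
print(q_split, q_whole)
```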
Clauset-Newman-Moore Algorithm
[Figure: Partitioning Currencies by Optimizing Modularity]
Louvain Algorithm
› First, each node in the network is assigned to its own community
› Then, for each node i, the change in modularity is calculated for removing i from its own community and moving it into the community of each neighbor j of i
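A rough sketch of this local-moving phase only, assuming an unweighted graph. For clarity it recomputes modularity from scratch for each trial move instead of using an incremental change formula, and it omits the later phase in which communities are collapsed into super-nodes. The names and toy graph are hypothetical.

```python
def modularity(edges, comm):
    """Unweighted modularity: fraction of internal links minus expectation."""
    m = len(edges)
    deg, inside = {}, 0
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        inside += comm[u] == comm[v]
    tot = {}
    for node, d in deg.items():
        tot[comm[node]] = tot.get(comm[node], 0) + d
    return inside / m - sum(t * t for t in tot.values()) / (4 * m * m)

def local_moving(edges):
    """Move nodes into neighbouring communities while modularity improves."""
    nodes = sorted({n for e in edges for n in e})
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    comm = {n: n for n in nodes}          # every node starts alone
    improved = True
    while improved:
        improved = False
        for i in nodes:
            best_c, best_q = comm[i], modularity(edges, comm)
            for j in sorted(neighbors[i]):
                trial = dict(comm)
                trial[i] = comm[j]        # try joining neighbour j's community
                q = modularity(edges, trial)
                if q > best_q + 1e-12:    # accept strict improvements only
                    best_c, best_q = comm[j], q
            if best_c != comm[i]:
                comm[i] = best_c
                improved = True
    return comm

# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
comm = local_moving(edges)
print(comm)
```

Because only strictly improving moves are accepted, modularity rises with every move and the loop terminates.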
› Using this formula for modularity, we can build communities by optimizing the modularity value
› These communities are organized hierarchically in a dendrogram
A set of Dendrograms
In hierarchical levels
[Figure: The best partition]
Louvain Algorithm
[Figure: Cryptocurrency under Agglomerative partitioning]
[Figure: Cryptocurrency by volume]
[Figure: Louvain partitioning of the markets]
[Figure: Louvain partitioning of the coins and currency]
Cliques
› Cliques are subgraphs in which every node is connected to every other node
› Cliques allow nodes to be in multiple communities at a time
› One might choose cliques of a fixed size k or find the maximal cliques
› Maximal cliques are typically calculated through the Bron-Kerbosch algorithm
BronKerbosch1(R, P, X):
    if P and X are both empty:
        report R as a maximal clique
    for each vertex v in P:
        BronKerbosch1(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P := P \ {v}
        X := X ⋃ {v}
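The pseudocode above translates almost line for line into Python. This sketch assumes the graph is given as a dict of neighbor sets. The toy graph and the `out` accumulator are illustrative choices, not part of the slides.

```python
def bron_kerbosch(R, P, X, neighbors, out):
    """Collect all maximal cliques (basic version, no pivoting).
    R: current clique, P: candidate nodes, X: already-processed nodes."""
    if not P and not X:
        out.append(set(R))                # R cannot be extended: maximal
        return
    for v in list(P):                     # iterate over a snapshot of P
        bron_kerbosch(R | {v}, P & neighbors[v], X & neighbors[v],
                      neighbors, out)
        P = P - {v}
        X = X | {v}

# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
cliques = []
bron_kerbosch(set(), set(neighbors), set(), neighbors, cliques)
print(sorted(sorted(c) for c in cliques))
```

Nodes 2 and 3 each appear in two maximal cliques, which illustrates how clique-based communities can overlap.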
The Index Graph is too small
[Figure: Cryptocurrency, by volume]
[Figure: Cryptocurrency, price network]
Maximal Clique Communities
[Figure: Forex Averages]
[Figure: Currency with Crypto]
Data Acquisition
Forex
› The OANDA API supports price streaming
› Posts prices on their website
Crypto
› CoinMarketCap provides us with easily scrapable prices
Stock
› Supposedly the Yahoo API is discontinued
Our process seemed straightforward
Download Data → Build Database → Analyze Data
MARKET DATA
› Requires curation, which must be automated with bash scripts
› Data from different types of prices was joined in the database
[Diagram: Automated Downloader; Turnkey VM with Database; SSH Tunnel & Psycopg2; Internet; Port-forwarded SSH Tunnel to Notebook]
Queries to Correlations
We can query which currencies we want to look at during specific intervals to generate a data matrix, from which we can then compute pairwise correlations
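As a sketch of the last step, the snippet below computes pairwise Pearson correlations from a small data matrix. It assumes the query result has already been loaded as equal-length price series keyed by symbol. The symbols and numbers are invented for illustration and are not real prices.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(series):
    """Return {(name_a, name_b): correlation} for every ordered pair."""
    names = sorted(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}

# Hypothetical data matrix: one price series per symbol over the same interval.
series = {
    "EUR": [1.0, 2.0, 3.0, 4.0],
    "GBP": [2.0, 4.0, 6.0, 8.0],   # moves with EUR
    "JPY": [4.0, 3.0, 2.0, 1.0],   # moves against EUR
}
corr = correlation_matrix(series)
print(corr[("EUR", "GBP")], corr[("EUR", "JPY")])
```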
Correlations to Networks
› Create a weighted network between every pair of currencies, with the correlations as weights
› Use a maximum spanning tree to trim the network, keeping high correlations
› Add labels on nodes
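One way to trim the network is Kruskal's algorithm run on edges sorted from highest to lowest weight. The sketch below assumes correlations are supplied as (weight, node, node) tuples. The currency labels and values are made up.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm, taking the heaviest edges first.
    weighted_edges: iterable of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the component root (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                 # keep the edge only if it joins two parts
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical correlation weights between three currencies.
nodes = ["EUR", "GBP", "JPY"]
weighted_edges = [(0.9, "EUR", "GBP"), (0.2, "EUR", "JPY"), (0.6, "GBP", "JPY")]
tree = maximum_spanning_tree(nodes, weighted_edges)
print(tree)
```

A self-loop never joins two different components, so it is skipped automatically even though its correlation is the highest possible.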
Pitfalls
› A fully connected network has trivial metrics
› Self-loops will always have the highest correlations, since every series correlates perfectly with itself
› Hard to find the truly important nodes and relationships
Significance of Metrics
› Weighted degree: nodes with the most positive correlations
› Shortest path: chain of the most negatively correlated currencies
› Betweenness centrality: currencies used as intermediary exchanges
Use for trading
› Find the currency / index of interest
› Use the maximum spanning tree as well as the minimum spanning tree
Intro to Modularity
Network: $N$ nodes, $L$ links, and a partition into $n_c$ communities, each having $N_c$ nodes connected to each other by $L_c$ links, where $c = 1, \ldots, n_c$. If $L_c$ is larger than the expected number of links between the $N_c$ nodes given the network's degree sequence, then the nodes of the subgraph $C_c$ could be a true community.
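Modularity Formulas
We therefore measure the difference between the network's real wiring diagram ($A_{ij}$) and the expected number of links between i and j if the network is randomly wired ($p_{ij}$),
$$M_c = \frac{1}{2L} \sum_{(i,j) \in C_c} \left( A_{ij} - p_{ij} \right)$$
With $p_{ij} = k_i k_j / 2L$ from degree-preserving randomization, summing over all communities gives
$$M = \sum_{c=1}^{n_c} \left[ \frac{L_c}{L} - \left( \frac{k_c}{2L} \right)^2 \right]$$
where $k_c$ is the total degree of the nodes in community $C_c$.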
Modularity Examples
Modularity: Key Properties
› Higher Modularity Implies Better Partition: The higher M for a partition, the better the corresponding community structure. The partition with the max modularity (M = 0.41) accurately captures the two obvious communities [...] in general
Key Properties cont.
› Zero and Negative Modularity: By taking the whole network as a single community we obtain M = 0, as in this case the two terms in the parentheses are equal. If each node belongs to a separate community, we have $L_c = 0$ and the sum has $n_c$ negative terms, hence M is negative.
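Both properties can be checked numerically. The sketch assumes the per-community form $M = \sum_c [L_c / L - (k_c / 2L)^2]$, where $k_c$ is the total degree of community c, and an unweighted edge list. The toy graph and function name are hypothetical.

```python
def partition_modularity(edges, communities):
    """M = sum over communities of L_c / L - (k_c / 2L)^2."""
    L = len(edges)
    M = 0.0
    for c in communities:
        # L_c: links with both endpoints inside community c.
        L_c = sum(1 for u, v in edges if u in c and v in c)
        # k_c: total degree of the nodes in c (each endpoint in c counts once).
        k_c = sum((u in c) + (v in c) for u, v in edges)
        M += L_c / L - (k_c / (2 * L)) ** 2
    return M

# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

m_whole = partition_modularity(edges, [{0, 1, 2, 3, 4, 5}])
m_single = partition_modularity(edges, [{n} for n in range(6)])
m_split = partition_modularity(edges, [{0, 1, 2}, {3, 4, 5}])
print(m_whole, m_single, m_split)
```

The single-community partition gives exactly zero, the all-singletons partition is negative, and the natural two-triangle split scores highest of the three.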
Limits of Modularity
Given the important role modularity plays in community identification, we must be aware of some of its limitations.
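Resolution Limit
Modularity maximization forces small communities into larger ones. If we merge communities A and B into a single community, the network's modularity changes by
$$\Delta M_{AB} = \frac{l_{AB}}{L} - \frac{k_A k_B}{2L^2}$$
where $l_{AB}$ is the number of links between A and B, and $k_A$ and $k_B$ are the total degrees of the two communities.
Resolution Limit highlighted
› Consider the case when $k_A k_B / 2L < 1$
› The expression for $\Delta M_{AB}$ then predicts $\Delta M_{AB} > 0$ if there is at least one link between the two communities ($l_{AB} \geq 1$)
› We must therefore merge A and B to maximize modularity. Assume $k_A \sim k_B = k$; if the total degree of the communities satisfies $k \leq \sqrt{2L}$
› then modularity increases by merging A and B into a single community, even if they are distinct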
Resolution Limit consequences
› Modularity maximization can't detect communities smaller than the resolution limit. For the WWW sample, modularity maximization will have difficulties resolving communities with total degree $k_C \lesssim 1{,}730$.
› Real networks contain numerous small communities
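A small arithmetic check of the resolution limit, assuming the merge gain $\Delta M_{AB} = l_{AB}/L - k_A k_B / 2L^2$ and the threshold $k \leq \sqrt{2L}$. The link count L = 1,497,134 for the WWW sample is an assumption recalled from Barabási's Network Science, since the slides do not state it. The community degrees of 1,000 and 2,000 are invented.

```python
from math import sqrt

def merge_gain(l_ab, k_a, k_b, L):
    """Change in modularity when communities A and B are merged."""
    return l_ab / L - k_a * k_b / (2 * L ** 2)

# Assumed link count for the WWW sample (not given in the slides).
L = 1_497_134
threshold = sqrt(2 * L)

# Hypothetical pairs of communities joined by a single link.
gain_small = merge_gain(1, 1000, 1000, L)   # total degrees below the threshold
gain_large = merge_gain(1, 2000, 2000, L)   # total degrees above the threshold
print(threshold, gain_small > 0, gain_large > 0)
```

Under these assumptions, two communities of total degree 1,000 are rewarded for merging on the strength of a single shared link, while two of total degree 2,000 are not.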
Modularity Maxima