DiCeS: Detecting Communities in Network Streams Over the Cloud


SLIDE 1

DiCeS: Detecting Communities in Network Streams Over the Cloud

Panagiotis Liakos†, Katia Papakonstantinopoulou‡, Alexandros Ntoulas†, Alex Delis†

†University of Athens
‡Athens University of Economics and Business

12th IEEE International Conference on Cloud Computing, Milan, Italy

July 8th–13th, 2019

SLIDE 5

Belgian Mobile Phone Network

Fast unfolding of community hierarchies in large networks: Blondel et al.

two large clusters of communities
limited interaction between clusters!
Brussels acts as a bridge!

UoA Panagiotis Liakos DiCeS-• Motivation 2/24

SLIDE 9

Climate change conversation on Twitter

carbonbrief.org

Real-world networks:
  • are massive!
  • change rapidly!
  • exhibit community structure!

SLIDE 10

Motivation

We want to extract the community structure of nodes in a network that changes rapidly.

Many useful applications:

  • we can launch accurate and successful advertising campaigns
  • we can provide more informative and engaging social network feeds
  • we can gain insights into the evolution of large real-world networks

Size of graph data appears to be ever-increasing:

  • Facebook has more than 2 billion registered users
  • Google indexes more than 1 trillion unique URLs

SLIDE 11

Prior Contribution: CoEuS [LND17], IEEE Big Data 2017

A novel community detection algorithm that operates on a graph stream, using space sublinear in the number of edges.

Additionally:
  • A PageRank-like Variation
  • A Novel Clustering Technique for Community Size Determination
  • Edge Quality

SLIDE 13

CoEuS’ context

[Figure: a graph stream and communities initialized with seed-sets]

centralized by design

SLIDE 14

DiCeS’ context

[Figure: graph stream processed by multiple worker nodes]

SLIDE 15

Our Contribution

  • We propose DiCeS, a novel distributed community detection algorithm for network streams.
  • We implement DiCeS as a cloud application that handles streams of real-world networks at impressive rates.
  • Using just 8 workers, we can handle almost 50 million edges per hour.
  • We achieve horizontal scalability that is close to linear.
  • We offer significant improvements with regard to accuracy.

SLIDE 16

Apache Storm

Apache Storm: stream processing framework with broad use in production environments.
  • Tuple: fundamental data unit
  • Spout: source of tuples
  • Bolt: responsible for transforming streams into the desired result
  • Grouping: determines how the tuples are exchanged
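The four Storm concepts can be illustrated with a toy, single-process model. This is a sketch in plain Python, not the Storm API: the names `edge_spout`, `round_robin_grouping`, `DegreeBolt`, and `run_topology` are hypothetical, and the round-robin routing rule is an assumption standing in for whatever grouping the actual topology uses.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]  # one "tuple" in Storm terms: here, an edge (u, v)


def edge_spout(edges: Iterable[Edge]) -> Iterator[Edge]:
    """Spout: the source of tuples (here it simply replays a list of edges)."""
    yield from edges


def round_robin_grouping(num_bolts: int) -> Callable[[int, Edge], int]:
    """Grouping: decides which bolt instance receives each tuple.

    Round-robin is an assumed, deterministic stand-in for a shuffle.
    """
    return lambda index, _edge: index % num_bolts


class DegreeBolt:
    """Bolt: consumes tuples and updates a result (here, node degrees)."""

    def __init__(self, degrees: Dict[int, int]) -> None:
        self.degrees = degrees  # shared dict, standing in for a key-value store
        self.processed = 0

    def execute(self, edge: Edge) -> None:
        u, v = edge
        self.degrees[u] += 1
        self.degrees[v] += 1
        self.processed += 1


def run_topology(edges: Iterable[Edge], num_bolts: int) -> Tuple[Dict[int, int], List[int]]:
    """Wire one spout to num_bolts bolts and return (degrees, per-bolt load)."""
    degrees: Dict[int, int] = defaultdict(int)
    bolts = [DegreeBolt(degrees) for _ in range(num_bolts)]
    route = round_robin_grouping(num_bolts)
    for index, edge in enumerate(edge_spout(edges)):
        bolts[route(index, edge)].execute(edge)
    return dict(degrees), [bolt.processed for bolt in bolts]


degrees, load = run_topology([(1, 2), (2, 3), (3, 1), (3, 4)], num_bolts=2)
```

Because every edge is handled independently and the state lives outside the bolts, the same edges could be routed to any number of bolt instances without changing the resulting degrees.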

SLIDE 17

Redis

Redis: in-memory key-value data store.
  • Ultra-fast read/write operations
  • Complex data types: Strings, Sets, Sorted Sets
  • Redis Cluster
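To make the three data types concrete, here is a minimal in-memory stand-in. It is not a Redis client: `TinyStore` is a hypothetical class whose method names mirror the Redis commands INCR, SADD, ZADD, and ZREVRANGE, and the key names and the role given to each type (counters for degrees, sets for a node's communities, sorted sets for ranked community members) are assumptions made for illustration only.

```python
from collections import defaultdict


class TinyStore:
    """Hypothetical in-memory stand-in, NOT a Redis client.

    Method names mirror the Redis commands INCR, SADD, ZADD and ZREVRANGE.
    """

    def __init__(self):
        self.strings = defaultdict(int)       # key -> integer counter (String)
        self.sets = defaultdict(set)          # key -> members (Set)
        self.sorted_sets = defaultdict(dict)  # key -> {member: score} (Sorted Set)

    def incr(self, key):
        self.strings[key] += 1
        return self.strings[key]

    def sadd(self, key, member):
        self.sets[key].add(member)

    def zadd(self, key, member, score):
        self.sorted_sets[key][member] = score

    def zrevrange(self, key, start, stop):
        """Members from highest to lowest score; stop is inclusive.

        Only non-negative indices are handled in this sketch.
        """
        ranked = sorted(
            self.sorted_sets[key].items(),
            key=lambda item: (item[1], item[0]),
            reverse=True,
        )
        return [member for member, _score in ranked[start:stop + 1]]


store = TinyStore()
degree = store.incr("degree:u")        # assumed key name: a per-node counter
store.sadd("nc:v", "C1")               # assumed key name: communities of node v
store.zadd("community:C1", "u", 0.9)   # assumed key name: members ranked by score
store.zadd("community:C1", "v", 0.4)
store.zadd("community:C1", "w", 0.7)
top_two = store.zrevrange("community:C1", 0, 1)
```

A sorted set keeps members ordered by score, which is what makes it convenient to read off the highest-scoring members of a community on demand.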

SLIDE 18

Design Principles

Scalability
  • Isolate the processing for every edge
  • Distributed key-value store

Fault Tolerance
  • All edges must be processed
  • Failing nodes must be restored

Interactivity
  • Updating the target communities
  • Obtaining results on demand

SLIDE 19

DiCeS’ Spout

  • Community initialization
  • Stream ingestion

SLIDE 20

DiCeS’ Bolts

  • Stream processing
  • Community expansion
  • Community pruning

SLIDE 22

Our topology

[Figure: topology with a Spout (fed by the network stream and the community seed-sets), multiple Processing Bolts, a Pruning Bolt, and a distributed key-value store (Redis Cluster)]

$ storm rebalance topology-name [-n new-num-workers] [-e component=parallelism]*

SLIDE 23

DiCeS’ Bolt

Algorithm 1: DiCeS

input: A tuple emitted from the spout.
begin
    if tuple.length == 1 then
        // renewed set of communities
        communities ← tuple[0];
    else
        // handling of an edge
        u ← tuple[0]; v ← tuple[1];
        degrees[u] += 1; degrees[v] += 1;
        foreach C ∈ {nc[u] ∪ nc[v]} do
            if u ∈ C then
                cDegrees[C][v] += cDegrees[C][u] / degrees[u];
            if v ∈ C then
                cDegrees[C][u] += cDegrees[C][v] / degrees[v];
            if u ∈ C then
                communities[C].put(v, cDegrees[C][v] / degrees[v]);
                nc[v].add(C);
            if v ∈ C then
                communities[C].put(u, cDegrees[C][u] / degrees[u]);
                nc[u].add(C);
        emit(1);
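The bolt logic of Algorithm 1 can be sketched as a single-process Python class. This is an illustration under stated assumptions, not the authors' implementation: state lives in local dicts rather than a distributed store, `BoltSketch` is a hypothetical name, the pseudocode's `emit(1)` is omitted, and the slides do not say how seed nodes are initialized, so the sketch assumes each seed starts with community degree 1.0 and score 1.0.

```python
from collections import defaultdict


class BoltSketch:
    """Toy, single-process rendering of Algorithm 1 (state kept in dicts)."""

    def __init__(self):
        self.degrees = defaultdict(int)  # node -> degree
        self.c_degrees = defaultdict(lambda: defaultdict(float))  # community -> node -> community degree
        self.communities = {}            # community -> {member: score}
        self.nc = defaultdict(set)       # node -> communities it participates in

    def execute(self, tup):
        if len(tup) == 1:
            # Renewed set of communities: {community id: iterable of seeds}.
            # Simplification: previous community state is discarded.
            self.communities = {c: {} for c in tup[0]}
            self.c_degrees.clear()
            self.nc.clear()
            for c, seeds in tup[0].items():
                for s in seeds:
                    # ASSUMPTION (not in the slides): seeds start with
                    # community degree 1.0 and score 1.0.
                    self.communities[c][s] = 1.0
                    self.c_degrees[c][s] = 1.0
                    self.nc[s].add(c)
            return
        # Handling of an edge (u, v).
        u, v = tup
        self.degrees[u] += 1
        self.degrees[v] += 1
        # The union builds a new set, so nc can be mutated inside the loop.
        for c in self.nc[u] | self.nc[v]:
            members = self.communities[c]
            cd = self.c_degrees[c]
            if u in members:
                cd[v] += cd[u] / self.degrees[u]
            if v in members:
                cd[u] += cd[v] / self.degrees[v]
            if u in members:
                members[v] = cd[v] / self.degrees[v]
                self.nc[v].add(c)
            if v in members:
                members[u] = cd[u] / self.degrees[u]
                self.nc[u].add(c)
        # emit(1) from the pseudocode is intentionally left out of this sketch.


bolt = BoltSketch()
bolt.execute(({"C1": [1]},))  # length-1 tuple: community C1 seeded with node 1
bolt.execute((1, 2))          # edge touching the seed pulls node 2 into C1
bolt.execute((2, 3))          # node 3 joins through node 2 with a lower score
bolt.execute((4, 5))          # edge unrelated to any community: only degrees change
```

In this run, node 3 ends with a lower score than the seed because it is reached only through node 2, whose contribution is divided by its degree.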

SLIDE 24

Dataset

Graph        Type           Nodes       Edges          Av. Degree  Av. Community Size
DBLP         Co-authorship  317,080     1,049,866      3.31        22.45
Amazon       Co-purchasing  334,863     925,872        2.76        13.49
Youtube      Social         1,134,890   2,987,624      2.63        14.59
LiveJournal  Social         3,997,962   34,681,189     8.67        27.80
Orkut        Social         3,072,441   117,185,083    38.14       215.72
Friendster   Social         65,608,366  1,806,067,135  27.53       46.81

Networks with up to 1.8 billion links
Accompanying ground-truth communities allow for the evaluation of accuracy
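A quick check suggests how the average-degree column relates to the other two: the reported values match edges divided by nodes, not twice that ratio. The snippet below recomputes the column from the node and edge counts in the dataset table; the helper name `edges_per_node` is hypothetical.

```python
# (nodes, edges, reported average degree), copied from the dataset table.
datasets = {
    "DBLP": (317_080, 1_049_866, 3.31),
    "Amazon": (334_863, 925_872, 2.76),
    "Youtube": (1_134_890, 2_987_624, 2.63),
    "LiveJournal": (3_997_962, 34_681_189, 8.67),
    "Orkut": (3_072_441, 117_185_083, 38.14),
    "Friendster": (65_608_366, 1_806_067_135, 27.53),
}


def edges_per_node(nodes, edges):
    """Edges divided by nodes, rounded to two decimals like the table."""
    return round(edges / nodes, 2)


recomputed = {
    name: edges_per_node(nodes, edges)
    for name, (nodes, edges, _reported) in datasets.items()
}
matches = {name: recomputed[name] == datasets[name][2] for name in datasets}
```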

SLIDE 26

Performance

[Figure: average processing time per edge for each network (Amazon, DBLP, Youtube, LiveJournal, Orkut, Friendster) with 2, 4, and 8 bolts]

We can reduce the processing time by adding bolts.

SLIDE 28

Scalability

[Figure: execution time (in s) for 2, 4, and 8 worker nodes and 5 to 20 thousand pending tuples]

The maximum number of allowed pending tuples impacts the performance.

SLIDE 29

Scalability

DiCeS offers near-linear scaling!

SLIDE 31

Fault Tolerance

[Figure: processing time (in sec) and total edges processed (in thousands)]

DiCeS recovers its speed almost immediately

SLIDE 33

Average Degree & Number of Communities

[Figure: average processing time per edge for CoEuS and DiCeS (8 bolts) under four settings: Degree:10, Comm:2K; Degree:10, Comm:4K; Degree:20, Comm:2K; Degree:20, Comm:4K]

Average degree and number of communities have less impact on DiCeS than on CoEuS.

SLIDE 35

F1–score comparison

[Figure: F1-score of CoEuS, DiCeS, and LEMON on each network (Amazon, DBLP, Youtube, LiveJournal, Orkut, Friendster)]

DiCeS outperforms CoEuS
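For readers unfamiliar with the metric, the sketch below computes the standard F1-score between one detected community and one ground-truth community, treated as sets of node ids. The slides do not state how scores are matched or averaged across communities, so this covers only the per-pair formula; `f1_score` is a hypothetical helper name.

```python
def f1_score(detected, truth):
    """Harmonic mean of precision and recall between two node sets."""
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = overlap / len(detected)  # share of detected nodes that are correct
    recall = overlap / len(truth)        # share of true members that were found
    return 2 * precision * recall / (precision + recall)


# 3 of the 4 detected nodes are correct, and 3 of the 5 true members are found.
score = f1_score({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

With precision 0.75 and recall 0.6, the score in this example is 2/3.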

SLIDE 36

Conclusion

  • DiCeS is a streaming community detection virtual infrastructure for large-scale networks that evolve rapidly.
  • DiCeS distributes load to worker nodes in the cloud.
  • We can process almost 50 million edges per hour using only 8 worker nodes.
  • DiCeS is shown to scale almost linearly.
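As a rough back-of-the-envelope reading of the throughput figure, the snippet below converts it to per-second and per-worker rates. It treats 50 million edges per hour as an upper bound (the slide says "almost") and assumes the load is split evenly across the 8 workers, which the slides do not state.

```python
EDGES_PER_HOUR = 50_000_000  # upper bound: the slide says "almost" 50 million
WORKERS = 8
SECONDS_PER_HOUR = 3600

edges_per_second = EDGES_PER_HOUR / SECONDS_PER_HOUR
# ASSUMPTION: load is evenly split across workers.
edges_per_second_per_worker = edges_per_second / WORKERS
# Average time budget per edge for the whole system, in microseconds.
microseconds_per_edge = SECONDS_PER_HOUR / EDGES_PER_HOUR * 1_000_000
```

Under these assumptions the system handles at most about 13,900 edges per second, or roughly 1,700 per worker, which corresponds to an overall budget of about 72 microseconds per edge.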

SLIDE 37

References

[LND17] Panagiotis Liakos, Alexandros Ntoulas, and Alex Delis. COEUS: community detection via seed-set expansion on graph streams. In 2017 IEEE International Conference on Big Data, BigData 2017, Boston, MA, USA, December 11-14, 2017, pages 676–685, 2017.

SLIDE 38

thank you!

https://github.com/panagiotisl/DiCeS
For further details, email me at: p.liakos@di.uoa.gr

hive.di.uoa.gr/network-analysis/
www.madgik.di.uoa.gr/