Big Data Era 1 1 https://vimeo.com/102998774 The big problem: - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Big Data Era

https://vimeo.com/102998774

1

slide-2
SLIDE 2

2

The big problem: Scalability

Hardware Algorithm Visualization

slide-3
SLIDE 3

3

The big problem: Scalability

Hardware Algorithm Visualization

https://upload.wikimedia.org/wikipedia/commons/0/05/Sna_large.png
https://upload.wikimedia.org/wikipedia/commons/9/9b/Social_Network_Analysis_Visualization.png
https://c1.staticflickr.com/5/4033/4520018121_6dd39e8d7e_z.jpg
https://c1.staticflickr.com/1/1/916142_ddc2fd0140.jpg

slide-4
SLIDE 4

4

  • Randomly pick nodes/edges to construct a subgraph that represents the original unfiltered graph

Graph Sampling

slide-5
SLIDE 5

5

Which sampling strategy to use?

slide-6
SLIDE 6

6

[Leskovec and Faloutsos, KDD 2006]

Graph Sampling Evaluation

Random Walk (RW) vs. Forest Fire (FF)

slide-7
SLIDE 7

7

Graph Sampling Evaluation in Visualization

Original Graph

  • Avg. node degree: 2.4

Power-law degree distribution

  • Avg. node degree: 2.4

Power-law degree distribution

Random Walk (RW) Forest Fire (FF) Distinct Visual Result!
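The two summary statistics quoted on this slide (average node degree and the degree distribution) can be computed as in the minimal sketch below. It assumes an undirected graph stored as a dictionary from node to neighbour set; the function names and the toy path graph are hypothetical and do not come from the slides.

```python
from collections import Counter

def average_degree(adj):
    """Mean number of neighbours per node in an undirected graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def degree_histogram(adj):
    """Map each degree value to the number of nodes that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Hypothetical toy graph: the path 0-1-2-3-4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
avg = average_degree(path)     # 8 / 5 = 1.6
hist = degree_histogram(path)  # two nodes of degree 1, three of degree 2
```

Two samples can agree on both statistics and still look very different when drawn, which is the point the slide makes.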

slide-8
SLIDE 8

8

Graph Sampling Evaluation in Visualization

Statistical Features: Hub Inclusion, Clustering Coeff., Discovery Quotient, … ?

Data Mining Visualization

Similarity Measurements
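As an illustration of one of the statistical features listed here, the sketch below computes the average local clustering coefficient. It assumes an undirected graph stored as a dictionary from node to neighbour set and treats nodes with fewer than two neighbours as having coefficient 0 (one common convention); the function names and toy graph are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, u):
    """Fraction of pairs of u's neighbours that are themselves connected."""
    nbrs = adj[u]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (d * (d - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = average_clustering(toy)  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```

Comparing such a value between the original graph and a sample is a metric-based similarity measurement, in contrast to the perception-based evaluation studied in this talk.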

slide-9
SLIDE 9

9

Graph Sampling Evaluation in Visualization

Statistical Features: Hub Inclusion, Clustering Coeff., Discovery Quotient, …

Visual Factors: ?

Data Mining Visualization

Similarity Measurements

G1: Identify the key visual factors that make the sampled graphs representative
G2: Evaluate the performance of different sampling algorithms on these visual factors

Goals Procedure

Pilot Study Formal Studies

slide-10
SLIDE 10

10

  • Selected Sampling Methods
  • Pilot Study
  • Formal Studies
  • Perception of High Degree Nodes
  • Perception of Cluster Quality
  • Perception of Coverage Area

Outline

slide-11
SLIDE 11

11

Node-Based Sampling

Original Graph Random Node Sampling

slide-12
SLIDE 12

12

Node-Based Sampling

Original Graph Random Node Sampling

slide-13
SLIDE 13

13

Node-Based Sampling

Original Graph Random Node Sampling

slide-14
SLIDE 14

14

Node-Based Sampling

Original Graph Random Node Sampling
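A minimal sketch of Random Node (RN) sampling, assuming an undirected graph stored as a dictionary from node to neighbour set and assuming the sample is the subgraph induced by the chosen nodes. The function name, the `rate` and `seed` parameters, and the toy ring graph are hypothetical; the slides do not specify implementation details.

```python
import random

def random_node_sample(adj, rate, seed=None):
    """Induced subgraph on a uniformly random subset of nodes.

    adj  : dict mapping node -> set of neighbours (undirected graph)
    rate : fraction of nodes to keep, 0 < rate <= 1
    """
    rng = random.Random(seed)
    k = max(1, round(rate * len(adj)))
    kept = set(rng.sample(sorted(adj), k))
    # Keep an edge only if both of its endpoints were sampled.
    return {u: adj[u] & kept for u in kept}

# Hypothetical toy graph: a ring of 6 nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
sample = random_node_sample(ring, rate=0.5, seed=1)
```

Because nodes are picked independently of the structure, many sampled nodes can end up with few or no edges in a sparse graph.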

slide-15
SLIDE 15

15

Edge-Based Sampling

Original Graph Random Edge Sampling

slide-16
SLIDE 16

16

Edge-Based Sampling

Original Graph Random Edge Sampling

slide-17
SLIDE 17

17

Edge-Based Sampling

Original Graph Random Edge Sampling
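A minimal sketch of Random Edge (RE) sampling under the same assumptions: an undirected graph stored as a dictionary from node to neighbour set, with edges chosen uniformly at random and their endpoints added to the sample. The function name, parameters, and toy graph are hypothetical, and the variant used in the study may differ.

```python
import random

def random_edge_sample(adj, rate, seed=None):
    """Subgraph formed by a uniformly random subset of edges and their endpoints."""
    rng = random.Random(seed)
    # List each undirected edge once, as an ordered pair (small, large).
    edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    k = max(1, round(rate * len(edges)))
    sub = {}
    for u, v in rng.sample(edges, k):
        sub.setdefault(u, set()).add(v)
        sub.setdefault(v, set()).add(u)
    return sub

# Hypothetical toy graph: a ring of 6 nodes (6 edges).
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
sample = random_edge_sample(ring, rate=0.5, seed=1)
```

Every sampled node has at least one edge by construction, and high degree nodes are more likely to be included because they are endpoints of more edges.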

slide-18
SLIDE 18

18

Traversal-Based Sampling: Random Walk

Original Graph Random Walk

slide-19
SLIDE 19

19

Traversal-Based Sampling: Random Walk

Original Graph Random Walk
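A minimal sketch of Random Walk (RW) sampling. It assumes the common variant in which the walker returns to the start node with a small probability at every step; the value 0.15, the function name, the `max_steps` safety limit, and the choice to keep only traversed edges are assumptions for illustration, not details taken from the slides.

```python
import random

def random_walk_sample(adj, rate, restart_prob=0.15, seed=None, max_steps=100_000):
    """Collect nodes visited by a random walk with restarts to the start node."""
    rng = random.Random(seed)
    target = max(1, round(rate * len(adj)))
    start = rng.choice(sorted(adj))
    current = start
    nodes, edges = {start}, set()
    for _ in range(max_steps):  # safety limit in case the walk cannot reach the target
        if len(nodes) >= target:
            break
        # Restart at dead ends or with probability restart_prob (assumed value).
        if not adj[current] or rng.random() < restart_prob:
            current = start
            continue
        nxt = rng.choice(sorted(adj[current]))
        nodes.add(nxt)
        edges.add((current, nxt))
        current = nxt
    # Assumption: the sample keeps only the edges the walker actually traversed.
    sub = {u: set() for u in nodes}
    for u, v in edges:
        sub[u].add(v)
        sub[v].add(u)
    return sub

# Hypothetical toy graph: a ring of 6 nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
sample = random_walk_sample(ring, rate=0.5, seed=1)
```

The sample stays within the connected component of the start node, which is why the walk explores one region of the graph densely.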

slide-20
SLIDE 20

20

Traversal-Based Sampling: Random Jump

Original Graph Random Jump

slide-21
SLIDE 21

21

Traversal-Based Sampling: Random Jump

Original Graph Random Jump
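A minimal sketch of Random Jump (RJ) sampling. It assumes the usual difference from Random Walk: instead of returning to the start node, the walker jumps to a uniformly random node of the whole graph with a small probability. The value 0.15, the function name, and the toy graph are assumptions for illustration.

```python
import random

def random_jump_sample(adj, rate, jump_prob=0.15, seed=None, max_steps=100_000):
    """Random walk that occasionally jumps to a uniformly random node."""
    rng = random.Random(seed)
    all_nodes = sorted(adj)
    target = max(1, round(rate * len(adj)))
    current = rng.choice(all_nodes)
    nodes, edges = {current}, set()
    for _ in range(max_steps):  # safety limit
        if len(nodes) >= target:
            break
        # Jump at dead ends or with probability jump_prob (assumed value).
        if not adj[current] or rng.random() < jump_prob:
            current = rng.choice(all_nodes)
            nodes.add(current)
            continue
        nxt = rng.choice(sorted(adj[current]))
        nodes.add(nxt)
        edges.add((current, nxt))
        current = nxt
    # Assumption: the sample keeps only the edges the walker actually traversed.
    sub = {u: set() for u in nodes}
    for u, v in edges:
        sub[u].add(v)
        sub[v].add(u)
    return sub

# Hypothetical toy graph: two disconnected triangles.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
sample = random_jump_sample(two_triangles, rate=0.5, seed=1)
```

Unlike a plain random walk, the jumps let the sample reach disconnected or distant regions of the graph.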

slide-22
SLIDE 22

22

Traversal-Based Sampling: Forest Fire

Original Graph Forest Fire

slide-23
SLIDE 23

23

Traversal-Based Sampling: Forest Fire

Original Graph Forest Fire
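A minimal sketch of Forest Fire (FF) sampling for an undirected graph. It assumes the usual formulation in which each burning node ignites a geometrically distributed number of its unburned neighbours, and a new random seed is ignited whenever the fire dies out before the target size is reached. The forward burning probability 0.7, the function name, and the toy graph are assumptions for illustration.

```python
import random
from collections import deque

def forest_fire_sample(adj, rate, p_forward=0.7, seed=None):
    """Burn outward from random seeds until the target number of nodes is reached."""
    rng = random.Random(seed)
    all_nodes = sorted(adj)
    target = max(1, round(rate * len(adj)))
    burned, edges = set(), set()
    while len(burned) < target:
        # Ignite a fresh, unburned seed whenever the previous fire has died out.
        start = rng.choice([n for n in all_nodes if n not in burned])
        burned.add(start)
        frontier = deque([start])
        while frontier and len(burned) < target:
            u = frontier.popleft()
            # Geometric draw: successes before the first failure,
            # with mean p_forward / (1 - p_forward).
            n_burn = 0
            while rng.random() < p_forward:
                n_burn += 1
            candidates = sorted(adj[u] - burned)
            rng.shuffle(candidates)
            for v in candidates[:n_burn]:
                if len(burned) >= target:
                    break
                burned.add(v)
                edges.add((u, v))
                frontier.append(v)
    # Assumption: the sample keeps only the edges along which the fire spread.
    sub = {u: set() for u in burned}
    for u, v in edges:
        sub[u].add(v)
        sub[v].add(u)
    return sub

# Hypothetical toy graph: a ring of 10 nodes.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
sample = forest_fire_sample(ring, rate=0.4, seed=1)
```

A higher `p_forward` makes each fire spread further before dying out, so fewer separate seeds contribute to the sample.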

slide-24
SLIDE 24

24

  • Selected Sampling Methods
  • Pilot Study
  • Formal Studies
  • Perception of High Degree Nodes
  • Perception of Cluster Quality
  • Perception of Coverage Area

Outline

slide-25
SLIDE 25

25

  • Task:
  • Identify the visual factors that strongly influence the representativeness of sampled graphs

  • We also determine the sampling rate used in the formal studies.

Pilot Study

Dataset: 5 Real-World Graphs Visual Factor Candidates

slide-26
SLIDE 26

26

High Degree Nodes Cluster Quality Coverage Area

Pilot Study

Results (key visual factors)

  • Task:
  • Identify the visual factors that strongly influence the representativeness of sampled graphs

  • We also determine the sampling rate used in the formal studies.

Visual Factor Candidates

slide-27
SLIDE 27

27

  • Selected Sampling Methods
  • Pilot Study
  • Formal Studies
  • Perception of High Degree Nodes
  • Perception of Cluster Quality
  • Perception of Coverage Area

Outline

slide-28
SLIDE 28

28

Formal Study I: High Degree Nodes

Original Graph

20 high degree nodes

Sampled Graph

8 high degree nodes?

A B A B

slide-29
SLIDE 29

29

Formal Study I: High Degree Nodes

slide-30
SLIDE 30

30

Formal Study I: High Degree Nodes

20 high degree nodes

N: 1024, D: S; N: 2048, D: S; N: 1024, D: L; N: 2048, D: L

Experiment Setting Data Generation

slide-31
SLIDE 31

31

  • Discussions:
  • It is easier to perceive high degree nodes in the RW Samples
  • It is more difficult to perceive high degree nodes in RN Samples
  • Above results hold across datasets

Formal Study I: High Degree Nodes Results

slide-32
SLIDE 32

32

Number of high degree nodes perceived (Visualization): +

  • Discussions:
  • It will be easier to perceive high degree nodes in the RW Samples
  • It will be more difficult to perceive high degree nodes in RN Samples.
  • Above results hold across datasets

Formal Study I: High Degree Nodes Results

Number of high degree nodes remaining (Data Mining): *

Contradiction with metric-based results!

RW FF

slide-33
SLIDE 33

33

Formal Study I: High Degree Nodes Results

Random Walk (RW): 16 high degree nodes remained

Forest Fire (FF): 7 high degree nodes remained

slide-34
SLIDE 34

34

Formal Study I: High Degree Nodes Results

Random Walk (RW): 16 high degree nodes remained, 6 high degree nodes perceived

Forest Fire (FF): 7 high degree nodes remained, 3 high degree nodes perceived
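The metric-based count of high degree nodes that remain in a sample can be sketched as follows. The sketch assumes that "high degree nodes" means the k nodes of highest degree in the original graph and that a node counts as remaining if it merely appears in the sample; the function names and toy graphs are hypothetical. The perceived count, in contrast, comes from the user study and cannot be computed this way.

```python
def high_degree_nodes(adj, k):
    """The k nodes with the highest degree (ties broken by node id)."""
    return set(sorted(adj, key=lambda n: (-len(adj[n]), n))[:k])

def hubs_remaining(original, sample, k):
    """How many of the original graph's top-k degree nodes appear in the sample."""
    return len(high_degree_nodes(original, k) & set(sample))

# Hypothetical toy graphs: node 0 (degree 4) and node 5 (degree 3) are the hubs.
original = {
    0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0},
    4: {0, 5}, 5: {4, 6, 7}, 6: {5}, 7: {5},
}
sample = {0: {1, 2}, 1: {0}, 2: {0}}
remaining = hubs_remaining(original, sample, k=2)  # only node 0 survives -> 1
```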

slide-35
SLIDE 35

35

  • Selected Sampling Methods
  • Pilot Study
  • Formal Studies
  • Perception of High Degree Nodes (more high degree nodes are perceived in RW)
  • Perception of Cluster Quality
  • Perception of Coverage Area

Outline

slide-36
SLIDE 36

36

Formal Study II: Cluster Quality

slide-37
SLIDE 37

37

Formal Study II: Cluster Quality

Experiment Setting Data Generation

slide-38
SLIDE 38

38

Formal Study II: Cluster Quality Results

  • Discussions:
  • RE and RJ best preserve the perceived cluster quality in samples
  • RN and FF struggle to preserve the perceived cluster quality
  • The performance of RW and FF depends on graph modularity
slide-39
SLIDE 39

39

Formal Study II: Cluster Quality Results

The number of clusters remaining in the sample is important for perceiving cluster quality in visualization!

slide-40
SLIDE 40

40

  • Selected Sampling Methods
  • Pilot Study
  • Formal Studies
  • Perception of High Degree Nodes (more high degree nodes are perceived in RW)
  • Perception of Cluster Quality (cluster number is important)
  • Perception of Coverage Area

Outline

slide-41
SLIDE 41

41

Formal Study III: Coverage Area

slide-42
SLIDE 42

42

Formal Study III: Coverage Area

Experiment Setting Data Generation

N: 1024, D: S; N: 2048, D: S; N: 1024, D: L; N: 2048, D: L

slide-43
SLIDE 43

43

Formal Study III: Coverage Area Results

  • Discussions:
  • RE and RJ have the largest perceived coverage area
  • RW has the smallest perceived coverage area in most cases
  • The performance of RW and FF varies depending on graph properties

Datasets: G1: (N: 1024, D: S); G2: (N: 1024, D: L); G3: (N: 2048, D: S); G4: (N: 2048, D: L); G5: (N: 1024, M: L); G6: (N: 1024, M: H); G7: (N: 2048, M: L); G8: (N: 2048, M: H)

[Results tables: per-dataset and overall scores for RN, REN, RW, RJ, and FF on G1 to G8, with χ²(4) significance tests and pairwise comparisons]

Contradiction with metric-based results!

slide-44
SLIDE 44

44

Formal Study III: Coverage Area Results

RW RN

slide-45
SLIDE 45

45

  • We provided the first study of how graph sampling strategies can influence the perception of node-link visualizations

  • Important visual factors: high degree nodes, cluster quality, and coverage area
  • Recommendations for sampling network visualizations:
  • Recommend Random Edge and Random Jump for global structure and cluster quality

  • Recommend Random Walk for perceived high degree nodes
  • Avoid Random Node unless specific requirements call for it
  • Random Walk and Forest Fire are modularity sensitive

Conclusion

Graph sampling performance in visualization may VARY from previous metric-based results!

slide-46
SLIDE 46

Evaluation of Graph Sampling: A Visualization Approach

Yanhong Wu, Nan Cao, Daniel Archambault, Qiaomu Shen, Huamin Qu, and Weiwei Cui

Q&A

yanhong.wu@ust.hk http://yhwu.me