SLIDE 1

Mining Large Dynamic Graphs and Tensors

Kijung Shin Ph.D. Student (kijungs@cs.cmu.edu)

SLIDE 2

Thesis Committee

  • Prof. Christos Faloutsos (Chair)
  • Prof. Tom M. Mitchell
  • Prof. Leman Akoglu
  • Prof. Philip S. Yu

Mining Large Dynamic Graphs and Tensors (by Kijung Shin)

SLIDE 3


Mining Large Dynamic Graphs and Tensors

SLIDE 4

Graphs: Social Networks

SLIDE 5

Graphs: Purchase History

SLIDE 6

Graphs: Many More

SLIDE 7

Properties of Real-world Graphs

  • Large: many nodes, more edges (e.g., 2B+ active users, 500M+ products, 40B+ web pages, 5M+ articles)
  • Dynamic: additions/deletions of nodes and edges

SLIDE 8

Properties of Real-world Graphs

  • Rich with Attributes: timestamps, scores, text, etc.

SLIDE 9

Matrices for Graphs

[Figure: a graph and its adjacency matrix, with entry 1 marking each edge]
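The adjacency-matrix view of a graph can be sketched in a few lines of Python (an illustrative sketch; the helper name and toy node labels are ours, not from the slides):

```python
def adjacency_matrix(nodes, edges):
    """Adjacency matrix of an undirected graph: entry [i][j] is 1
    iff nodes i and j are connected by an edge."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[index[u]][index[v]] = 1
        A[index[v]][index[u]] = 1  # undirected: the matrix is symmetric
    return A

# a path graph a - b - c
A = adjacency_matrix(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

For an undirected graph the resulting matrix is symmetric, with one pair of 1-entries per edge.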

SLIDE 10

Tensors for Rich Graphs

  • Tensor: a multi-dimensional array

[Figure: a 3-order tensor (3-dimensional array); adding stars gives a 4-order tensor, and adding text a 5-order tensor]

SLIDE 11

Research Goal and Tasks

  • Goal: to understand large dynamic graphs, tensors, and user behavior
  • Tasks:
  • T1. Structure Analysis
  • T2. Anomaly Detection
  • T3. Behavior Modeling

SLIDE 12

Tasks

[Figure: the three tasks — structure analysis, anomaly & fraud detection, and behavior modeling — and their contrast]

SLIDE 13

Completed Work by Topics

                     Graphs                                Tensors
  T1. Structure      Triangle Count                        Summarization
      Analysis       [ICDM17][PAKDD18][submitted to KDD]   [WSDM17]
                     Degeneracy [ICDM16]*[KAIS18]*
  T2. Anomaly        Anomalous Subgraph                    Dense Subtensors
      Detection      [ICDM16]*[KAIS18]*                    [PKDD16][WSDM17][KDD17][TKDD18]
  T3. Behavior       Purchase Behavior                     Progressive Behavior
      Modeling       [IJCAI17]                             [WWW18]

  * Duplicated

SLIDE 14

Approaches (Tools)

  • A1. Distributed or external-memory algorithms
  • A2. Streaming algorithms based on sampling
  • A3. Approximation algorithms
  • and their combinations

SLIDE 15

Roadmap

  • Overview
  • Completed Work <<
  • T1. Structure Analysis
  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 16

Completed Work by Topics

(Same table as Slide 13.)

SLIDE 17

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis

▪T1.1 Waiting-Room Sampling <<
▪T1.2-T1.3 Related Completed Work

  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017

SLIDE 18

Graph Stream Model

  • Widely-used data model for graphs
  • Sequence of edges:
  • the graph is given over time as a sequence of edges
  • appropriate for dynamic graphs
  • Limited memory:
  • cannot store all edges in the stream; only samples or summaries are kept
  • appropriate for large graphs

[Figure: edges streaming from source nodes to a destination node]

T1.1 / T1.2 / T1.3 Completed / Proposed

SLIDE 19

Relaxed Graph Stream Model

  • Chronological order
  • edges are streamed in the order that they are created
  • natural for dynamic graphs
  • temporal patterns can exist
  • algorithms can exploit the patterns

[Figure: edges streaming in chronological order, e.g., created at 9:02, 9:08, and 9:21 AM]

SLIDE 20

Triangles in a Graph

  • A triangle is 3 nodes connected to each other
  • The count of triangles has many applications
  • Community detection, spam detection, query optimization

  • Global triangle count: the count of all triangles in the graph
  • Local triangle count: the count of triangles incident to each node

[Figure: a graph annotated with each node’s local triangle count]
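Both counts can be computed exactly on a small in-memory graph; a brute-force sketch (the function name and toy graph are ours):

```python
from itertools import combinations

def triangle_counts(edges):
    """Global and local (per-node) triangle counts of an undirected graph.
    A triangle is a set of 3 nodes that are pairwise connected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    local = {v: 0 for v in adj}
    global_count = 0
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            global_count += 1            # one more global triangle
            for node in (u, v, w):
                local[node] += 1         # incident to all three nodes
    return global_count, local

# toy graph: a 4-clique contains 4 triangles, 3 incident to each node
g, local = triangle_counts([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
```

This O(n³) enumeration is only for intuition; the streaming algorithms in the following slides exist precisely because such exact counting does not scale.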

SLIDE 21

Problem Definition

  • Given:
  • a sequence of edges in the chronological order
  • memory budget 𝑙 (i.e., up to 𝑙 edges can be stored)
  • Estimate: count of global triangles
  • To Minimize: estimation error


“What are temporal patterns in real graph streams?” “How can we exploit the patterns for accurate triangle counting?”

SLIDE 22

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis

▪T1.1 Waiting-Room Sampling

  • Temporal Pattern <<
  • Algorithm
  • Experiments

▪T1.2-T1.3 Related Completed Work

  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 23

Time Interval of a Triangle

  • Time interval of a triangle: (arrival order of its last edge) − (arrival order of its first edge)

[Figure: eight edges arriving in order 1-8; a triangle whose first and last edges arrive 2nd and 7th has time interval 7 − 2 = 5]
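The definition is a one-liner; here it is on the slide's example (the middle arrival position, 4, is made up for illustration):

```python
def time_interval(arrival_orders):
    """Time interval of a triangle: the arrival order of its last edge
    minus the arrival order of its first edge."""
    return max(arrival_orders) - min(arrival_orders)

# the slide's example: the triangle's first and last edges arrive
# 2nd and 7th (we assume the middle edge arrives 4th)
interval = time_interval([2, 4, 7])
```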

SLIDE 24

Time Interval Distribution

  • Temporal Locality: the average time interval is about 2X shorter in the chronological order than in a random order

[Figure: time-interval distributions under random vs. chronological arrival order]

SLIDE 25

Temporal Locality

  • One interpretation: edges are more likely to form triangles with edges close in time than with edges far in time
  • Another interpretation: new edges are more likely to form triangles with recent edges than with old edges

“How can we exploit temporal locality for accurate triangle counting?”

SLIDE 26

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis

▪T1.1 Waiting-Room Sampling

  • Temporal Pattern
  • Algorithm <<
  • Experiments

▪T1.2-T1.3 Related Completed Work

  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 27

Algorithm Overview

  • ∆: estimate of triangle count
  • 𝑞𝑣𝑤𝑥: probability that triangle (𝑣, 𝑤, 𝑥) is discovered

[Figure: processing a new edge (𝑣, 𝑤): (1) arrival step, (2) counting step updating ∆ ← ∆ + 1/𝑞𝑣𝑤𝑧 per discovered triangle, (3) sampling step]

SLIDE 28

Algorithm Overview (cont.)

  • ∆: estimate of triangle count
  • 𝑞𝑣𝑤𝑥: probability that triangle (𝑣, 𝑤, 𝑥) is discovered

[Figure: (1) arrival step — a new edge (𝑣, 𝑤) arrives while sampled edges are kept in memory]

SLIDE 29

Algorithm Overview (cont.)

  • ∆: estimate of triangle count
  • 𝑞𝑣𝑤𝑥: probability that triangle (𝑣, 𝑤, 𝑥) is discovered

[Figure: (2) counting step — the new edge (𝑣, 𝑤) closes a triangle with stored edges; on discovery, ∆ ← ∆ + 1/𝑞𝑣𝑤𝑦]

SLIDE 30

Algorithm Overview (cont.)

  • ∆: estimate of triangle count
  • 𝑞𝑣𝑤𝑥: probability that triangle (𝑣, 𝑤, 𝑥) is discovered

[Figure: another discovery in the counting step: ∆ ← ∆ + 1/𝑞𝑣𝑤𝑧]

SLIDE 31

Algorithm Overview (cont.)

  • ∆: estimate of triangle count
  • 𝑞𝑣𝑤𝑥: probability that triangle (𝑣, 𝑤, 𝑥) is discovered

[Figure: (3) sampling step — the new edge (𝑣, 𝑤) may be stored, possibly replacing an edge in memory]
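The arrival/counting/sampling loop of these slides can be sketched with the simplest sampling rule — keep each edge independently with probability p, so a triangle's two stored edges are discovered with probability q = p². This is a MASCOT-style simplification for illustration, not WRS itself:

```python
import random

def estimate_triangles(stream, p, seed=0):
    """Streaming triangle-count estimate: each arriving edge first closes
    triangles with already-sampled edges (counting step, weight 1/q with
    q = p*p), then is itself kept with probability p (sampling step)."""
    rng = random.Random(seed)
    adj = {}            # adjacency over *sampled* edges only
    estimate = 0.0
    for u, v in stream:
        # counting step: common neighbors among sampled edges
        common = adj.get(u, set()) & adj.get(v, set())
        estimate += len(common) / (p * p)   # Delta <- Delta + 1/q per discovery
        # sampling step: keep the new edge with probability p
        if rng.random() < p:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return estimate
```

With p = 1 every edge is stored and the estimate is exact; for p < 1 the 1/q weighting keeps the estimate unbiased in expectation, which is the property the next slide states.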

SLIDE 32

Goal of Sampling Step

  • to maximize the discovering probability 𝑞𝑣𝑤𝑥
  • Theorem [Unbiasedness]: Bias[∆] = Exp[∆] − (true count) = 0
  • Theorem [Variance]: Var[∆] ≈ Σ(𝑣,𝑤,𝑥) (1/𝑞𝑣𝑤𝑥 − 1)
  • Estimation Error = Bias + Variance

SLIDE 33

Increasing Discovering Prob.

“How can we increase the discovering probabilities of triangles?”

  • Recall Temporal Locality: new edges are more likely to form triangles with recent edges than with old edges
  • Waiting-Room Sampling (WRS) treats recent edges better than old edges to exploit temporal locality

SLIDE 34

Waiting-Room Sampling (WRS)

  • Divides memory space into two parts
  • Waiting Room: latest edges are always stored
  • Reservoir: the remaining edges are sampled

[Figure: the memory budget split into a Waiting Room (FIFO, 𝛽% of the budget) holding the latest edges and a Reservoir (random replacement, (100 − 𝛽)% of the budget) holding sampled older edges]

SLIDE 35

WRS: Sampling Steps (Step 1)

[Figure: step 1 — a new edge enters the Waiting Room (FIFO), popping its oldest edge]

SLIDE 36

WRS: Sampling Steps (Step 2)

[Figure: step 2 — the popped edge either replaces a random edge in the Reservoir, is stored in a free slot, or is discarded]
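The two-part memory of slides 34-36 can be sketched as follows. This is an illustrative simplification: the reservoir here keeps a uniform sample of the edges popped from the waiting room, the class name and parameters are ours, and the triangle-counting logic is omitted:

```python
import random
from collections import deque

class WaitingRoomSampler:
    """WRS memory layout sketch: a fraction beta of the budget is a FIFO
    waiting room that always stores the latest edges; edges popped from
    it feed a reservoir sample (random replacement) over older edges."""

    def __init__(self, budget, beta, seed=0):
        self.wr_cap = max(1, int(budget * beta))   # waiting-room slots
        self.res_cap = budget - self.wr_cap        # reservoir slots
        self.waiting_room = deque()
        self.reservoir = []
        self.n_popped = 0        # edges that have left the waiting room
        self.rng = random.Random(seed)

    def add(self, edge):
        self.waiting_room.append(edge)             # newest edge: always kept
        if len(self.waiting_room) > self.wr_cap:
            popped = self.waiting_room.popleft()
            self.n_popped += 1
            if len(self.reservoir) < self.res_cap:
                self.reservoir.append(popped)      # store in a free slot
            elif self.rng.random() < self.res_cap / self.n_popped:
                idx = self.rng.randrange(self.res_cap)
                self.reservoir[idx] = popped       # replace a random edge
            # otherwise: discard the popped edge

s = WaitingRoomSampler(budget=10, beta=0.4, seed=0)
for t in range(100):
    s.add(("u%d" % t, "v%d" % t))
```

After any number of arrivals, the waiting room holds exactly the latest edges and the total memory never exceeds the budget, which is what lets WRS exploit temporal locality.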

SLIDE 37

Summary of Algorithm

[Figure: full pipeline — (1) arrival step, (2) discovery step (∆ ← ∆ + 1/𝑞𝑣𝑤𝑦 per discovered triangle), (3) sampling step via Waiting-Room Sampling]

SLIDE 38

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis

▪T1.1 Waiting-Room Sampling

  • Temporal Pattern
  • Algorithm
  • Experiments <<

▪T1.2-T1.3 Related Completed Work

  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 39

Experimental Results: Accuracy

  • Datasets:
  • WRS is the most accurate (reduces estimation error by up to 58%)

SLIDE 40

Discovering Probability

  • WRS increases discovering probability 𝑞𝑣𝑤𝑥
  • WRS discovers up to 3 × more triangles

[Figure: WRS discovers more triangles than Triest-IMPR and MASCOT]

SLIDE 41

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis

▪T1.1 Waiting-Room Sampling
▪T1.2-T1.3 Related Completed Work <<

  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 42

T1.2 Distributed Counting of Triangles

  • Goal: to utilize multiple machines for triangle counting in a graph stream

[Figure: Tri-Fly [PAKDD18] (sources → workers → aggregators via broadcast and shuffle) vs. DiSLR [submitted to KDD] (via multicast and shuffle)]

Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018

SLIDE 43

T1.2 Performance of Tri-Fly and DiSLR

  • Estimation Error = Bias + Variance

[Figure: DiSLR and Tri-Fly reduce estimation error by up to 30-40X]

SLIDE 44

T1.3 Estimation of Degeneracy

  • Goal: to estimate the degeneracy* in a graph stream
  • Core-Triangle Pattern
  • 3:1 power law between the triangle count and the degeneracy

*degeneracy: the maximum 𝑙 such that the graph contains a subgraph in which every node has degree at least 𝑙.

Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018 (previously ICDM 2016)
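The degeneracy defined in the footnote can be computed exactly (offline, not in the streaming setting of this slide) by the standard peeling procedure; a small sketch:

```python
def degeneracy(edges):
    """Exact degeneracy by peeling: repeatedly remove a minimum-degree
    node; the largest degree observed at removal time equals the maximum
    l such that some subgraph has minimum degree at least l."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best = 0
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))  # a minimum-degree node
        best = max(best, len(adj[v]))
        for w in adj[v]:
            adj[w].discard(v)                    # remove v from the graph
        del adj[v]
    return best

# a 4-clique: every node has degree 3, so the degeneracy is 3
d = degeneracy([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
```

Peeling needs the whole graph in memory, which is exactly what the streaming setting forbids — hence Core-D's estimation approach on the next slide.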
SLIDE 45

T1.3 Core-D Algorithm

  • Core-D: one-pass streaming algorithm for degeneracy

  • Estimated degeneracy: 𝑑̂ = exp(𝛽 ⋅ log(∆̂) + 𝛾), where ∆̂ is the estimated triangle count (obtained by WRS, etc.)

[Figure: Core-D estimates the degeneracy more accurately than baselines]

SLIDE 46

Structure Analysis of Graphs

Models:

  • Relaxed graph stream model
  • Distributed graph stream model

Patterns:

  • Temporal locality
  • Core-Triangle pattern

Algorithms:

  • WRS, Tri-Fly, and DiSLR
  • Core-D

Analyses: bias and variance

SLIDE 47

Completed Work by Topics

(Same table as Slide 13.)

SLIDE 48

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection

▪T2.1 M-Zoom <<
▪T2.2-T2.3 Related Completed Work

  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018 (previously ECML/PKDD 2016)

SLIDE 49

Motivation: Review Fraud

[Figure: account “Alice” posting fake reviews for restaurants]

SLIDE 50

Fraud Forms Dense Block

[Figure: fraudulent accounts reviewing the same restaurants form a dense block in the adjacency matrix]

SLIDE 51

Problem: Natural Dense Subgraphs

  • Natural dense blocks (cores, communities, etc.) also exist
  • Question: how can we distinguish the suspicious dense blocks formed by fraudsters from the natural ones?

SLIDE 52

Solution: Tensor Modeling

  • Along the time axis:
  • natural dense blocks are sparse (formed gradually)
  • suspicious dense blocks are dense (synchronized behavior)
  • In the tensor model, suspicious dense blocks therefore become denser than natural dense blocks

SLIDE 53

Solution: Tensor Modeling (cont.)

  • High-order tensor modeling: any side information (IP address, keywords, number of stars, etc.) can be used additionally

“Given a large-scale high-order tensor, how can we find dense blocks in it?”

SLIDE 54

Problem Definition

  • Given: (1) 𝑺: an 𝑂-order tensor, (2) 𝝇: a density measure, (3) 𝒍: the number of blocks we aim to find
  • Find: 𝒍 distinct dense blocks maximizing 𝝇

SLIDE 55

Density Measures

  • How should we define “density” (i.e., 𝜍)?
  • no single absolute answer: it depends on the data, the types of anomalies, etc.
  • Goal: a flexible algorithm working well with various reasonable measures, e.g.:
  • arithmetic avg. degree ρ𝐵
  • geometric avg. degree ρ𝐻
  • suspiciousness (KL divergence) ρ𝑇
  • The traditional density EntrySum(𝐶)/Vol(𝐶) is a poor choice: it is maximized by a single entry with the maximum value
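For intuition, the arithmetic average degree of a block can be sketched as its mass divided by the arithmetic mean of its side lengths (a common formulation; the helper name and toy block are ours):

```python
def rho_arithmetic(block_entries, modes):
    """Arithmetic average mass of a block: the sum of its entries divided
    by the arithmetic mean of its side lengths.
    block_entries: dict mapping index tuples to (nonnegative) values;
    modes: one set of kept indices per dimension."""
    mass = sum(block_entries.values())
    avg_side = sum(len(m) for m in modes) / len(modes)
    return mass / avg_side

# a 2x2 block of a matrix holding 4 entries of value 1
block = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
rho = rho_arithmetic(block, [{0, 1}, {0, 1}])   # 4 / 2
```

Unlike EntrySum/Vol, this measure rewards blocks that are both heavy and small on average, rather than a single maximal entry.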

SLIDE 56

Clarification of Blocks (Subtensors)

  • The concept of blocks (subtensors) is independent of the order of rows and columns
  • Entries in a block do not need to be adjacent

SLIDE 57

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection

▪T2.1 M-Zoom [PKDD 16]

  • Algorithm <<
  • Experiments

▪T2.2-T2.3 Related Completed Work

  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 58

Single Dense Block Detection

  • Greedy search
  • Starts from the entire tensor

[Figure: the greedy search starts from the entire tensor; 𝜍 = 2.9]

SLIDE 59

Single Dense Block Detection (cont.)

  • Remove a slice to maximize density 𝜍

[Figure: after removing a slice, 𝜍 = 3]

SLIDE 60

Single Dense Block Detection (cont.)

  • Remove a slice to maximize density 𝜍

[Figure: 𝜍 = 3.3]

SLIDE 61

Single Dense Block Detection (cont.)

  • Remove a slice to maximize density 𝜍

[Figure: 𝜍 = 3.6]

SLIDE 62

Single Dense Block Detection (cont.)

  • Continue until all slices are removed (finally 𝜍 = 0)

[Figure: density 𝜍 over iterations]

SLIDE 63

Single Dense Block Detection (cont.)

  • Output: return the densest block seen so far

[Figure: the returned block; 𝜍 = 3.6]

SLIDE 64

Speeding Up Process

  • Lemma 1 [Remove Minimum Sum First]

Among slices in the same dimension, removing the slice with smallest sum of entries increases 𝜍 most

[Figure: three slices with entry sums 12 > 9 > 2 — the slice with sum 2 is removed first]
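The greedy search of slides 58-63, with Lemma 1's remove-minimum-sum-first rule, can be sketched for a matrix (an order-2 tensor). The toy matrix and the density used here (mass over the arithmetic mean of side lengths) are our illustration, not the deck's exact example:

```python
def greedy_dense_block(matrix):
    """M-Zoom-style greedy search on a matrix: repeatedly remove the row
    or column (slice) with the smallest entry sum (Lemma 1), returning
    the densest block seen over all iterations."""
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))

    def density(r, c):
        if not r or not c:
            return 0.0
        mass = sum(matrix[i][j] for i in r for j in c)
        return mass / ((len(r) + len(c)) / 2)   # arithmetic avg. degree

    best_rho, best = density(rows, cols), (set(rows), set(cols))
    while len(rows) + len(cols) > 1:
        sums = {("row", i): sum(matrix[i][j] for j in cols) for i in rows}
        sums.update({("col", j): sum(matrix[i][j] for i in rows) for j in cols})
        kind, idx = min(sums, key=sums.get)     # Lemma 1: smallest slice sum
        (rows if kind == "row" else cols).discard(idx)
        rho = density(rows, cols)
        if rho > best_rho:                      # track the densest block so far
            best_rho, best = rho, (set(rows), set(cols))
    return best_rho, best

rho, (block_rows, block_cols) = greedy_dense_block(
    [[5, 3, 0],
     [4, 6, 1],
     [2, 0, 0]])
```

On this toy matrix the search peels off the light column and row and returns the heavy 2×2 block in the top-left corner.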

SLIDE 65

Accuracy Guarantee

  • Theorem 1 [Approximation Guarantee]: the returned block 𝑪 satisfies 𝝇𝑩(𝑪) ≥ 𝝇𝑩(𝑪∗)/(𝟐𝑶), where 𝑪∗ is the densest block and 𝑶 is the order
  • Theorem 2 [Near-linear Time Complexity]: O(𝑶𝑵 log 𝑴), where 𝑵 is the number of non-zeros and 𝑴 is the number of entries in each mode

SLIDE 66

Optional Post Process

  • Local search
  • grow or shrink until a local maximum is reached

[Figure: starting from the result of the previous greedy search, each grow/shrink move is evaluated by its density 𝜍]

SLIDE 67

Optional Post Process (cont.)

  • Local search
  • grow or shrink until a local maximum is reached

[Figure: the move that most increases 𝜍 is applied, and the candidate moves are re-evaluated]

SLIDE 68

Optional Post Process (cont.)

  • Local search
  • grow or shrink until a local maximum is reached

[Figure: another grow/shrink step further increases 𝜍]

SLIDE 69

Optional Post Process (cont.)

  • Local search
  • grow or shrink until a local maximum is reached
  • Return the local maximum

[Figure: when no grow or shrink move improves 𝜍, the local maximum is returned]

SLIDE 70

Multiple Block Detection

  • Deflation: Remove found blocks before finding others

[Figure: find a block, remove it, find the next, and so on; the removed blocks are restored at the end]

SLIDE 71

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection

▪T2.1 M-Zoom [PKDD 16]

  • Algorithm
  • Experiments <<

▪T2.2-T2.3 Related Completed Work

  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 72

Speed & Accuracy

  • Datasets: ….

[Figure: speed and accuracy under density metrics 𝜍𝐵, 𝜍𝐻, and 𝜍𝑇; M-Zoom is 2-3X faster]

SLIDE 73

Discoveries in Practice

  • Korean Wikipedia: 11 accounts revised 10 pages 2,305 times within 16 hours
  • English Wikipedia: 8 accounts revised 12 pages 2.5 million times

SLIDE 74

Discoveries in Practice (cont.)

  • App Market (4-order): 9 accounts gave 1 product 369 reviews with the same rating within 22 hours
  • TCP Dump (7-order): a block whose volume = 2 and mass = 2 million

SLIDE 75

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection

▪T2.1 M-Zoom
▪T2.2-T2.3 Related Completed Work <<

  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion

SLIDE 76

T2.2 Extension to Web-scale Tensors

  • Goal: to find dense blocks in a disk-resident or distributed tensor
  • D-Cube: gives the same accuracy guarantee as M-Zoom with far fewer iterations (handles 100B non-zeros in 5 hours)

Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017

SLIDE 77

T2.3 Extension to Dynamic Tensors

  • Goal: to maintain a dense block in a dynamic tensor that changes over time
  • DenseStream: incrementally computes a dense block with the same accuracy guarantee as M-Zoom

Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017

SLIDE 78

Anomaly Detection in Tensors

  • Algorithms:
  • M-Zoom, D-Cube, and DenseStream
  • Analyses: approximation guarantees
  • Discoveries:
  • Edit war, vandalism, and bot activities
  • Network intrusion
  • Spam reviews

SLIDE 79

Completed Work by Topics

(Same table as Slide 13.)

SLIDE 80

Motivation

[Figure: a new user's journey through a service — from “Welcome” and profile setup toward a goal]

Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018

SLIDE 81

Problem Definition

  • Given:
  • behavior log
  • number of desired latent stages: 𝑙
  • Find: 𝑙 progression stages
  • types of actions
  • frequency of actions
  • transitions to other stages
  • To best describe the given behavior log

[Figure: a behavior log — users performing actions of various types over time]

SLIDE 82

Behavior Model

  • Generative process:
  • Θ𝑡: action-type distribution in stage 𝑡
  • 𝜚𝑡: time-gap distribution in stage 𝑡
  • 𝜔𝑡: next-stage distribution in stage 𝑡
  • Constraint: “no decline” (progression but no cyclic patterns)

[Figure: a user's actions (e.g., connect, message, jobs) generated stage by stage, with transitions governed by the 𝜔 distributions]

SLIDE 83

Optimization Algorithm

  • Goal: to fit our model to the given data
  • parameters: the distributions (i.e., Θ𝑡, 𝜚𝑡, 𝜔𝑡 for each stage 𝑡) and the latent stage assignments
  • Repeat until convergence:
  • assignment step: assign latent stages while fixing the distributions (the “no decline” constraint makes this solvable by dynamic programming)
  • update step: update the distributions while fixing the latent stages (e.g., Θ𝑡 ← ratio of the types of actions in stage 𝑡)
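The assignment step under "no decline" can be sketched as a dynamic program: given a score (e.g., log-likelihood) for each action under each stage, find the best nondecreasing stage sequence. This is an illustrative simplification of the step, with names and inputs of our own making:

```python
def best_stage_assignment(scores):
    """scores[t][s]: log-likelihood of the user's t-th action under
    stage s. Returns the nondecreasing ("no decline") stage sequence
    maximizing the total score, via dynamic programming."""
    n, k = len(scores), len(scores[0])
    NEG = float("-inf")
    dp = [[NEG] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    dp[0] = list(scores[0])
    for t in range(1, n):
        best_prev, arg = NEG, 0
        for s in range(k):              # running max over stages <= s
            if dp[t - 1][s] > best_prev:
                best_prev, arg = dp[t - 1][s], s
            dp[t][s] = scores[t][s] + best_prev
            back[t][s] = arg            # best earlier stage to come from
    s = max(range(k), key=lambda x: dp[n - 1][x])
    path = [s]
    for t in range(n - 1, 0, -1):       # follow back-pointers
        s = back[t][s]
        path.append(s)
    return path[::-1]

# each action clearly prefers the next stage
path = best_stage_assignment([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
```

The running prefix maximum makes the whole assignment O(number of actions × number of stages), which is what allows the algorithm to scale to the trillion-action logs of the next slide.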

SLIDE 84

Scalability & Convergence

  • Three versions of our algorithm
  • In-memory
  • Out-of-core (or external-memory)
  • Distributed

[Figure: the distributed version handles 1 trillion actions in 2 hours; convergence shown for 5, 10, 15, and 20 latent stages]

SLIDE 85

Progression of Users in LinkedIn

[Figure: discovered stages, e.g., Join → Onboarding Process → Build one's Profile → Poke around the service → Grow one's Social Network → Have 30 connections → Consume Newsfeeds]

SLIDE 86

Completed Work by Topics

(Same table as Slide 13.)

SLIDE 87

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work <<
  • Conclusion

SLIDE 88

Proposed Work by Topics

  • P1. Triangle Counting in Fully Dynamic Streams (graphs)
  • P2. Fast and Scalable Tucker Decomposition (tensors)
  • P3. Polarization Modeling (graphs)

SLIDE 89

Proposed Work by Topics (same as Slide 88)

SLIDE 90

P1: Problem Definition

  • Given:
  • a fully dynamic graph stream, i.e., a list of edge insertions and edge deletions
  • a memory budget 𝑙
  • Estimate: the counts of global and local triangles
  • To Minimize: estimation error

P1 / P2 / P3 Completed / Proposed

SLIDE 91

P1: Goal

  Method       Accuracy  Handles Deletions?
  Triest-FD    Lowest    Yes
  MASCOT       Low       No
  Triest-IMPR  High      No
  WRS          Highest   No
  Proposed     Highest   Yes

SLIDE 92

Proposed Work by Topics (same as Slide 88)

SLIDE 93

P2: Problem Definition

  • Tucker Decomposition (a.k.a. high-order PCA)
  • Given: an 𝑂-order input tensor 𝒀
  • Find: 𝑂 factor matrices 𝐵(1)… 𝐵(𝑂) & core-tensor 𝒁
  • To satisfy: 𝒀 ≈ 𝒁 ×1 𝐵(1) ×2 𝐵(2) ⋯ ×𝑂 𝐵(𝑂)

[Figure: the input tensor 𝒀 decomposed into a core tensor 𝒁 and factor matrices 𝐵(1), 𝐵(2), 𝐵(3)]
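To make the "to satisfy" condition concrete, here is a brute-force order-3 reconstruction of 𝒀 from a core and factor matrices (a pure-Python sketch for intuition; real implementations use mode products on sparse data, not this triple sum):

```python
def tucker_reconstruct(core, A, B, C):
    """Order-3 Tucker reconstruction:
    Y[i][j][k] = sum_{p,q,r} core[p][q][r] * A[i][p] * B[j][q] * C[k][r]."""
    P, Q, R = len(core), len(core[0]), len(core[0][0])
    I, J, K = len(A), len(B), len(C)
    return [[[sum(core[p][q][r] * A[i][p] * B[j][q] * C[k][r]
                  for p in range(P) for q in range(Q) for r in range(R))
              for k in range(K)] for j in range(J)] for i in range(I)]

# sanity check: with identity factor matrices, the reconstruction
# equals the core tensor itself
core = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
eye = [[1, 0], [0, 1]]
Y = tucker_reconstruct(core, eye, eye, eye)
```

Fitting Tucker means choosing the core and factors so that this reconstruction approximates the (typically much larger) input tensor.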

SLIDE 94

P2: Standard Algorithms

[Figure: pipeline from the input (large & sparse) through materialized intermediate data (400GB-4TB — the scalability bottleneck, fed to SVD) to the output (dense, 2GB)]

SLIDE 95

P2: Completed Work

  • Our completed work [WSDM17]: compute the intermediate data on the fly (no materialization, but incurs repeated computation)

Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017

SLIDE 96

P2: Proposed Work

  • Proposed algorithm: partially materialize the intermediate data (a middle ground between full materialization and on-the-fly computation)

SLIDE 97

P2: Expected Performance Gain

  • Which part of the intermediate data should we materialize?
  • Exploit skewed degree distributions!

[Figure: materializing a small % of the data saves a large % of the computation]

SLIDE 98

Proposed Work by Topics (same as Slide 88)

SLIDE 99
P3: Polarization Modeling

  • Polarization in social networks: division into contrasting groups (e.g., “Use of marijuana should be: legal OR illegal”)
  • “How do people choose between the two ways of polarizing: change of beliefs or change of edges?”

SLIDE 100

P3: Problem Definition
  • Given: a time-evolving social network with nodes’ beliefs on controversial issues
beliefs on controversial issues

  • e.g., legalizing marijuana
  • Find: actor-based model with a utility function
  • depending on network features, beliefs, etc.
  • To best describe: the polarization in data
  • Applications:
  • predict future edges
  • predict the cascades of beliefs

SLIDE 101

Proposed Work by Topics (same as Slide 88)

SLIDE 102

Timeline

  • Mar-May 2018: P1. Triangle counting in fully dynamic graph streams
  • Jun-Aug 2018: P3. Polarization modeling
  • Sep-Oct 2018: P2. Fast and scalable Tucker decomposition
  • Nov 2018 - Apr 2019: thesis writing & job applications
  • May 2019: defense

SLIDE 103

Roadmap

  • Overview
  • Completed Work
  • T1. Structure Analysis
  • T2. Anomaly Detection
  • T3. Behavior Modeling
  • Proposed Work
  • Conclusion <<

SLIDE 104

Conclusion

  • Goal:

To Understand Large Dynamic Graphs and Tensors

  • Subtasks:
  • structure analysis
  • anomaly detection
  • behavior modeling
  • Approaches:
  • distributed or external-memory algorithms
  • streaming algorithms based on sampling
  • approximation algorithms

SLIDE 105

References (Completed work)

[1] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees”, ECML/PKDD 2016
[2] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “CoreScope: Graph Mining Using k-Core Analysis - Patterns, Anomalies and Algorithms”, ICDM 2016
[3] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
[4] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017
[6] Kijung Shin, Euiwoong Lee, Dhivya Eswaran, and Ariel D. Procaccia, “Why You Should Charge Your Friends for Borrowing Your Stuff”, IJCAI 2017
[7] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017
[8] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018
[9] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018
[10] Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018
[11] Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018

SLIDE 106

Thank You

  • Papers, software, data: http://www.cs.cmu.edu/~kijungs/proposal/
  • Email: kijungs@cs.cmu.edu
  • Thanks to:
  • Sponsors:
  • Admins:
  • Collaborators:
