Mining Large Dynamic Graphs and Tensors
Kijung Shin, Ph.D. Student (kijungs@cs.cmu.edu)
Thesis Committee
- Prof. Christos Faloutsos (Chair)
- Prof. Tom M. Mitchell
- Prof. Leman Akoglu
- Prof. Philip S. Yu
Graphs: Social Networks
Graphs: Purchase History
Graphs: Many More
Properties of Real-world Graphs
- Large: many nodes, even more edges (e.g., 2B+ active users, 500M+ products, 40B+ web pages, 5M+ articles)
- Dynamic: nodes and edges are added and deleted over time
Properties of Real-world Graphs
- Rich with Attributes: timestamps, scores, text, etc.
Matrices for Graphs
(Figure: a graph and its adjacency matrix, where entry (i, j) is 1 iff nodes i and j are connected)
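As a minimal illustration (not from the slides), the adjacency-matrix view of a graph can be built directly from an edge list; the function name and toy edges below are assumptions for the example.

```python
import numpy as np

def adjacency_matrix(edges, n_nodes):
    """Builds the adjacency matrix of an undirected graph:
    entry (i, j) is 1 iff nodes i and j are connected."""
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for u, v in edges:
        A[u, v] = A[v, u] = 1  # undirected: symmetric matrix
    return A
```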
Tensors for Rich Graphs
- Tensors: multi-dimensional arrays
- Example: (user, item, timestamp) data forms a 3-order tensor (a 3-dimensional array); adding star ratings gives a 4-order tensor, and adding review text a 5-order tensor
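A tiny sketch of the idea above (the event data here is hypothetical): interactions with two attributes beyond the user form a 3-order tensor; each extra attribute would add one more mode.

```python
import numpy as np

# Hypothetical toy interactions: (user, item, day) triples.
events = [(0, 1, 0), (0, 1, 2), (1, 0, 1)]
n_users, n_items, n_days = 2, 2, 3

# A 3-order tensor is a 3-dimensional array; extra attributes such as
# star ratings (4-order) or review text (5-order) would add more modes.
T = np.zeros((n_users, n_items, n_days))
for u, i, d in events:
    T[u, i, d] += 1  # count interactions per (user, item, day) cell
```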
Research Goal and Tasks
- Goal: to understand large dynamic graphs, tensors, and user behavior
- Tasks:
- T1. Structure Analysis
- T2. Anomaly Detection
- T3. Behavior Modeling
Tasks
(Figure: the three tasks: structure analysis, anomaly & fraud detection, and behavior modeling)
Completed Work by Topics
Graphs:
- T1. Structure Analysis: Triangle Count [ICDM17][PAKDD18][submitted to KDD]; Degeneracy [ICDM16]* [KAIS18]*
- T2. Anomaly Detection: Anomalous Subgraph [ICDM16]* [KAIS18]*
- T3. Behavior Modeling: Purchase Behavior [IJCAI17]
Tensors:
- T1. Structure Analysis: Summarization [WSDM17]
- T2. Anomaly Detection: Dense Subtensors [PKDD16][WSDM17][KDD17][TKDD18]
- T3. Behavior Modeling: Progressive Behavior [WWW18]
* the same work, listed under two topics
Approaches (Tools)
- A1. Distributed or external-memory algorithms
- A2. Streaming algorithms based on sampling
- A3. Approximation algorithms
- and their combinations
Roadmap
- Overview
- Completed Work <<
- T1. Structure Analysis
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
▪T1.1 Waiting-Room Sampling <<
▪T1.2-T1.3 Related Completed Work
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
Graph Stream Model
- Widely-used data model for graphs
- Sequence of edges
- graph is given over time as a sequence of edges
- appropriate for dynamic graphs
- Limited memory
- cannot store all edges in the stream
- only samples or summaries
- appropriate for large graphs
Relaxed Graph Stream Model
- Chronological order
- edges are streamed in the order that they are created
- natural for dynamic graphs
- temporal patterns can exist
- algorithms can exploit the patterns
Triangles in a Graph
- A triangle is a set of 3 nodes that are all connected to each other
- The count of triangles has many applications
- Community detection, spam detection, query optimization
- Global triangle count: the number of all triangles in the graph
- Local triangle count: the number of triangles incident to each node
Problem Definition
- Given:
- a sequence of edges in chronological order
- memory budget k (i.e., up to k edges can be stored)
- Estimate: count of global triangles
- To Minimize: estimation error
“What are temporal patterns in real graph streams?” “How can we exploit the patterns for accurate triangle counting?”
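For reference, the quantity being estimated can be computed exactly when memory is unlimited; this baseline (name and edge-list representation are assumptions) is what streaming algorithms approximate within the budget.

```python
def exact_global_triangles(edges):
    """Exact baseline: the global triangle count equals the number of
    common neighbors of u and v, summed over all edges (u, v), divided
    by 3 (each triangle is counted once per edge).  Streaming
    algorithms approximate this without storing every edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    undirected = {(min(u, v), max(u, v)) for u, v in edges}
    return sum(len(adj[u] & adj[v]) for u, v in undirected) // 3
```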
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
▪T1.1 Waiting-Room Sampling
- Temporal Pattern <<
- Algorithm
- Experiments
▪T1.2-T1.3 Related Completed Work
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Time Interval of a Triangle
- Time interval of a triangle: (arrival order of its last edge) − (arrival order of its first edge)
- Example: if a triangle's first edge is the 2nd edge to arrive and its last edge is the 7th, its time interval is 7 − 2 = 5
Time Interval Distribution
- Temporal Locality: the average time interval is 2X shorter in the chronological order than in a random order
(Figure: distributions of time intervals under the random and chronological arrival orders)
Temporal Locality
- One interpretation: edges are more likely to form triangles with edges close in time than with edges far in time
- Another interpretation: new edges are more likely to form triangles with recent edges than with old edges

“How can we exploit temporal locality for accurate triangle counting?”
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
▪T1.1 Waiting-Room Sampling
- Temporal Pattern
- Algorithm <<
- Experiments
▪T1.2-T1.3 Related Completed Work
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Algorithm Overview
- ∆: estimate of the triangle count
- p_uvw: probability that triangle (u, v, w) is discovered
(Figure: upon a new edge (u, v), the algorithm runs (1) an arrival step, (2) a counting step, which adds 1/p_uvw to ∆ for each triangle (u, v, w) discovered among the stored edges, and (3) a sampling step)
Goal of Sampling Step
- Goal: to maximize the discovery probability p_uvw
- Estimation Error = Bias + Variance
- Theorem [Unbiasedness of our estimate]: Bias[∆] = E[∆] − (true count) = 0
- Theorem [Variance of our estimate]: Var[∆] ≈ Σ_{(u,v,w): triangle} (1/p_uvw − 1)
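The unbiasedness of the 1/p weighting can be checked by simulation. This sketch is a simplification (it assumes one uniform discovery probability p for all triangles, whereas p_uvw varies per triangle; the function name is an assumption).

```python
import random

def ht_mean_estimate(n_triangles, p, trials=20000, seed=0):
    """Monte-Carlo check of unbiasedness: if each of n_triangles is
    discovered independently with probability p and each discovery
    contributes 1/p to the estimate (the counting step's
    Delta += 1/p_uvw), the estimate's mean equals the true count,
    while its variance is n_triangles * (1/p - 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(1.0 / p for _ in range(n_triangles) if rng.random() < p)
    return total / trials
```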
Increasing Discovering Prob.
“How can we increase the discovery probabilities of triangles?”

- Recall temporal locality: new edges are more likely to form triangles with recent edges than with old edges
- Waiting-Room Sampling (WRS) treats recent edges better than old edges, to exploit temporal locality
Waiting-Room Sampling (WRS)
- Divides memory space into two parts
- Waiting Room: latest edges are always stored
- Reservoir: the remaining edges are sampled
(Figure: a β fraction of the memory budget forms the waiting room, which always stores the latest edges (FIFO); the remaining (1 − β) fraction forms the reservoir, maintained by random replacement)
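A minimal sketch of the WRS memory layout described above (the class name, the β default, and the exact replacement rule are assumptions; the counting step and the probability bookkeeping needed for 1/p_uvw weighting are omitted):

```python
import random
from collections import deque

class WaitingRoomSampler:
    """Sketch of the two-part memory layout: the newest edges always
    occupy a FIFO waiting room; an edge popped from the waiting room
    either replaces a stored edge in the reservoir or is discarded,
    so old edges survive only with some probability."""

    def __init__(self, budget, beta=0.5, seed=0):
        self.room_size = max(1, int(budget * beta))  # beta share: waiting room
        self.res_size = budget - self.room_size      # remaining share: reservoir
        self.room = deque()
        self.reservoir = []
        self.n_popped = 0     # edges that have ever left the waiting room
        self.rng = random.Random(seed)

    def add(self, edge):
        self.room.append(edge)
        if len(self.room) > self.room_size:
            popped = self.room.popleft()
            self.n_popped += 1
            if len(self.reservoir) < self.res_size:
                self.reservoir.append(popped)
            else:
                # standard reservoir sampling over the popped edges
                j = self.rng.randrange(self.n_popped)
                if j < self.res_size:
                    self.reservoir[j] = popped

    def sampled_edges(self):
        return list(self.room) + self.reservoir
```

Feeding a stream of 100 edges with a budget of 10 keeps the 5 latest edges in the waiting room and a uniform sample of older edges in the reservoir.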
WRS: Sampling Steps
(Figures: each new edge enters the waiting room (FIFO); the oldest edge popped from the waiting room either replaces a randomly chosen edge in the reservoir or is discarded)
Summary of Algorithm
(Figure: for each new edge (u, v): (1) arrival step; (2) discovery step, adding 1/p_uvw to ∆ for each discovered triangle; (3) sampling step, using waiting-room sampling)
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
▪T1.1 Waiting-Room Sampling
- Temporal Pattern
- Algorithm
- Experiments <<
▪T1.2-T1.3 Related Completed Work
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Experimental Results: Accuracy
- Datasets:
- WRS is the most accurate, reducing estimation error by up to 58%
Discovering Probability
- WRS increases the discovery probability p_uvw
- WRS discovers up to 3× more triangles than Triest-IMPR and MASCOT
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
▪T1.1 Waiting-Room Sampling
▪T1.2-T1.3 Related Completed Work <<
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion
T1.2 Distributed Counting of Triangles
- Goal: to utilize multiple machines for triangle counting in a graph stream
(Figures: both systems pipeline edges from sources to workers to aggregators; Tri-Fly [PAKDD18] broadcasts edges to the workers, while DiSLR [submitted to KDD] uses a multicast)
Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018
T1.2 Performance of Tri-Fly and DiSLR
- Estimation Error = Bias + Variance
(Figures: DiSLR outperforms Tri-Fly, with improvements of up to 40X and 30X)
T1.3 Estimation of Degeneracy
- Goal: to estimate the degeneracy* of a graph stream
- Core-Triangle Pattern: a 3:1 power law between the triangle count and the degeneracy
*degeneracy: the maximum k such that the graph contains a subgraph in which every node has degree at least k
Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018 (previously ICDM 2016)
T1.3 Core-D Algorithm
- Core-D: one-pass streaming algorithm for degeneracy
- Estimated degeneracy: d̂ = exp(β · log ∆̂ + γ), where ∆̂ is the estimated triangle count (obtained by WRS, etc.)
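The regression form above is simple to apply once β and γ are fit. In this sketch, β = 1/3 (reflecting the 3:1 power law) and γ = 0 are illustrative placeholders, not the fitted values from the paper.

```python
import math

def estimate_degeneracy(triangle_estimate, beta=1/3, gamma=0.0):
    """Core-D's regression form: estimated degeneracy
    d = exp(beta * log(triangle_estimate) + gamma).
    With beta = 1/3 and gamma = 0 this is simply the cube root of
    the triangle count; the real method fits beta, gamma from data."""
    return math.exp(beta * math.log(triangle_estimate) + gamma)
```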
Structure Analysis of Graphs
Models:
- Relaxed graph stream model
- Distributed graph stream model
Patterns:
- Temporal locality
- Core-Triangle pattern
Algorithms:
- WRS, Tri-Fly, and DiSLR
- Core-D
Analyses: bias and variance
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
▪T2.1 M-Zoom <<
▪T2.2-T2.3 Related Completed Work
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018 (previously ECML/PKDD 2016)
Motivation: Review Fraud
Fraud Forms Dense Block
(Figure: fraudulent accounts reviewing the same restaurants form a dense block in the accounts × restaurants adjacency matrix)
Problem: Natural Dense Subgraphs
- Question: how can we distinguish the suspicious dense blocks formed by fraudsters from natural dense blocks (cores, communities, etc.)?
Solution: Tensor Modeling
- Along the time axis:
- natural dense blocks are sparse (formed gradually)
- suspicious dense blocks are dense (synchronized behavior)
- In the tensor model, suspicious dense blocks therefore become denser than natural dense blocks
Solution: Tensor Modeling (cont.)
- High-order tensor modeling: any side information (IP address, keywords, number of stars, etc.) can be used as additional modes

“Given a large-scale high-order tensor, how can we find dense blocks in it?”
Problem Definition
- Given: (1) R: an N-order tensor, (2) ρ: a density measure, (3) k: the number of blocks we aim to find
- Find: k distinct dense blocks maximizing ρ
Density Measures
- How should we define “density” (i.e., ρ)?
- no single absolute answer; it depends on the data, the types of anomalies, etc.
- Goal: a flexible algorithm that works well with various reasonable measures:
- Arithmetic average degree ρ_A
- Geometric average degree ρ_G
- Suspiciousness (KL divergence) ρ_S
- Traditional density ρ_T(B) = EntrySum(B) / Vol(B) is excluded, since it is maximized by a single entry with the maximum value
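As a concrete instance of one measure above, the arithmetic average degree of a block can be computed in a few lines (the function name is an assumption; the other measures would be separate functions of the same shape).

```python
import numpy as np

def rho_arithmetic(block):
    """Arithmetic average degree of a block B:
    rho_A(B) = mass(B) / (average mode size of B)."""
    block = np.asarray(block, dtype=float)
    return block.sum() / (sum(block.shape) / block.ndim)
```

For example, an all-ones 2 × 2 block has mass 4 and average mode size 2, so its density is 2.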
Clarification of Blocks (Subtensors)
- The concept of blocks (subtensors) is independent of
the orders of rows and columns
- Entries in a block do not need to be adjacent
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
▪T2.1 M-Zoom [PKDD 16]
- Algorithm <<
- Experiments
▪T2.2-T2.3 Related Completed Work
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Single Dense Block Detection
- Greedy search: start from the entire tensor (here, ρ = 2.9)
- Repeatedly remove the slice that maximizes the density ρ of the remaining block (ρ = 3 → 3.3 → 3.6 → …), until all slices are removed (ρ = 0)
- Output: the densest block seen during the process (here, the block with ρ = 3.6)
(Figure: density ρ over the removal iterations)
Speeding Up Process
- Lemma 1 [Remove Minimum Sum First]
Among slices in the same dimension, removing the slice with the smallest sum of entries increases ρ most
(Example: among slices with entry sums 12 > 9 > 2, the slice with sum 2 is removed first)
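The greedy search with Lemma 1 can be sketched for the 2-order (matrix) case as follows. This is an illustrative simplification, not M-Zoom itself (which handles N-order tensors and several density measures; the function name is an assumption), using the arithmetic average degree ρ_A = mass / (average mode size).

```python
import numpy as np

def greedy_dense_block(T):
    """Greedy peeling on a matrix: repeatedly remove the slice (row or
    column) with the smallest entry sum (Lemma 1), and return the
    densest block seen, measured by arithmetic average degree."""
    T = np.asarray(T, dtype=float)
    rows, cols = list(range(T.shape[0])), list(range(T.shape[1]))
    best_rho, best_block = -1.0, (tuple(rows), tuple(cols))
    while rows and cols:
        sub = T[np.ix_(rows, cols)]
        rho = sub.sum() / ((len(rows) + len(cols)) / 2)
        if rho > best_rho:
            best_rho, best_block = rho, (tuple(rows), tuple(cols))
        # Lemma 1: removing the min-sum slice increases density most
        row_sums, col_sums = sub.sum(axis=1), sub.sum(axis=0)
        if row_sums.min() <= col_sums.min():
            rows.pop(int(row_sums.argmin()))
        else:
            cols.pop(int(col_sums.argmin()))
    return best_rho, best_block
```

On a 4 × 4 matrix whose only non-zeros form an all-ones 2 × 2 block, the peeling recovers that block (density 2.0).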
Accuracy Guarantee
- Theorem 1 [Approximation Guarantee]: ρ_A(B) ≥ (1/N) · ρ_A(B*), where B is the block M-Zoom returns, B* is the densest block, and N is the order of the tensor
- Theorem 2 [Near-linear Time Complexity]: O(N M log L), where N is the order, M is the number of non-zeros, and L is the number of entries in each mode
Optional Post Process
- Local search, starting from the result of the previous greedy search
- grow or shrink the block, whichever increases the density ρ more
- repeat until a local maximum of ρ is reached, and return that block
Multiple Block Detection
- Deflation: Remove found blocks before finding others
(Figure: find a block, remove it from the tensor, and repeat; the removed entries are restored afterwards)
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
▪T2.1 M-Zoom [PKDD 16]
- Algorithm
- Experiments <<
▪T2.2-T2.3 Related Completed Work
- T3. Behavior Modeling
- Proposed Work
- Conclusion
Speed & Accuracy
- Datasets: ….
(Figures: speed and accuracy under the density metrics ρ_A, ρ_G, and ρ_S; M-Zoom is up to 2-3X faster than its competitors)
Discoveries in Practice
- Korean Wikipedia: 11 accounts revised 10 pages 2,305 times within 16 hours
- English Wikipedia: 8 accounts revised 12 pages 2.5 million times
Discoveries in Practice (cont.)
- App Market (4-order tensor): 9 accounts gave 1 product 369 reviews with the same rating within 22 hours
- TCP Dump (7-order tensor): a block with volume 2 and mass 2 million
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
▪T2.1 M-Zoom
▪T2.2-T2.3 Related Completed Work <<
- T3. Behavior Modeling
- Proposed Work
- Conclusion
T2.2 Extension to Web-scale Tensors
- Goal: to find dense blocks in a disk-resident or distributed tensor
- D-Cube: gives the same accuracy guarantee as M-Zoom with far fewer iterations
(Figure: D-Cube processes a tensor with 100B non-zeros in 5 hours)
Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017
T2.3 Extension to Dynamic Tensors
- Goal: to maintain a dense block in a dynamic tensor that changes over time
- DenseStream: incrementally computes a dense block with the same accuracy guarantee as M-Zoom
Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017
Anomaly Detection in Tensors
- Algorithms:
- M-Zoom, D-Cube, and DenseStream
- Analyses: approximation guarantees
- Discoveries:
- Edit war, vandalism, and bot activities
- Network intrusion
- Spam reviews
Motivation
(Figure: a new user's unknown path from “Start” to “Goal” after the welcome screen)
Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018
Problem Definition
- Given:
- a behavior log
- k: the number of desired latent stages
- Find: k progression stages, each described by
- its types of actions
- its frequency of actions
- its transitions to other stages
- To best describe the given behavior log
Behavior Model
- Generative process:
- Θ_s: action-type distribution in stage s
- ϱ_s: time-gap distribution in stage s
- ω_s: next-stage distribution in stage s
- Constraint: “no decline” (progression, but no cyclic patterns)
Optimization Algorithm
- Goal: to fit our model to the given data
- parameters: the distributions (i.e., {Θ_s, ϱ_s, ω_s}_s) and the latent stage assignments
- Repeat until convergence:
- assignment step: assign latent stages while fixing the probability distributions
▪the “no decline” constraint makes this step solvable by dynamic programming
- update step: update the probability distributions while fixing the latent stages
▪e.g., Θ_s ← ratio of the types of actions in stage s
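The dynamic program behind the assignment step can be sketched for a single user's action sequence (a simplification; the function name and the log-likelihood-matrix interface are assumptions, not the paper's implementation).

```python
import numpy as np

def assign_stages(loglik):
    """Assignment-step sketch: loglik[i, s] is the log-likelihood of
    the i-th action under stage s.  The "no decline" constraint means
    assigned stages must be non-decreasing, so the best assignment is
    found by dynamic programming over (action, stage)."""
    n, k = loglik.shape
    dp = np.full((n, k), -np.inf)        # dp[i, s]: best score ending in stage s
    back = np.zeros((n, k), dtype=int)   # best previous stage
    dp[0] = loglik[0]
    for i in range(1, n):
        prefix_best = np.maximum.accumulate(dp[i - 1])
        arg = np.zeros(k, dtype=int)
        for s in range(1, k):            # argmax over stages <= s of dp[i-1]
            arg[s] = arg[s - 1] if dp[i - 1, s] <= prefix_best[s - 1] else s
        dp[i] = prefix_best + loglik[i]
        back[i] = arg
    stages = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):
        stages.append(int(back[i, stages[-1]]))
    return stages[::-1]
```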
Scalability & Convergence
- Three versions of our algorithm
- In-memory
- Out-of-core (or external-memory)
- Distributed
(Figure: the distributed version handles 1 trillion actions in 2 hours; convergence is shown for 5, 10, 15, and 20 latent stages)
Progression of Users in LinkedIn
(Figure: discovered stages include the onboarding process after joining, building one's profile, poking around the service, growing one's social network up to 30 connections, and consuming newsfeeds)
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work <<
- Conclusion
Proposed Work by Topics
Graphs:
- P1. Triangle Counting in Fully Dynamic Streams
- P3. Polarization Modeling
Tensors:
- P2. Fast and Scalable Tucker Decomposition
P1: Problem Definition
- Given:
- a fully dynamic graph stream,
▪i.e., a list of edge insertions and edge deletions
- memory budget k
- Estimate: the counts of global and local triangles
- To Minimize: estimation error
(Figure: a stream of edges, each marked + for an insertion or − for a deletion)
P1: Goal
Method | Accuracy | Handles Deletions?
Triest-FD | Lowest | Yes
MASCOT | Low | No
Triest-IMPR | High | No
WRS | Highest | No
Proposed | Highest | Yes
P2: Problem Definition
- Tucker Decomposition (a.k.a. high-order PCA)
- Given: an N-order input tensor X
- Find: N factor matrices A(1) … A(N) and a core tensor Y
- To satisfy: X ≈ Y ×1 A(1) ×2 A(2) … ×N A(N)
(Figure: a 3-order input tensor X approximated by a small core tensor Y multiplied along each mode by factor matrices A(1), A(2), A(3))
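The mode-product form above can be made concrete for the 3-order case (the function name is an assumption; real decompositions find the core and factors, whereas this only reconstructs from them):

```python
import numpy as np

def tucker_reconstruct(core, factors):
    """Reconstructs a 3-order tensor from a Tucker core tensor and
    factor matrices A1, A2, A3 via mode products:
    X[i, j, k] = sum_{a,b,c} core[a, b, c] * A1[i, a] * A2[j, b] * A3[k, c]."""
    A1, A2, A3 = factors
    return np.einsum('abc,ia,jb,kc->ijk', core, A1, A2, A3)
```

With identity factor matrices, the reconstruction returns the core itself, which is a quick sanity check of the contraction.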
P2: Standard Algorithms
- Input (large & sparse) → Intermediate Data (large & dense) → Output (small & dense)
- Standard algorithms materialize the intermediate data (400GB - 4TB, vs. a 2GB output) and apply SVD to it; the materialized intermediate data is the scalability bottleneck
P2: Completed Work
- Our completed work [WSDM17] computes the intermediate data on the fly during the SVD, avoiding materialization but incurring repeated computation

Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017
P2: Proposed Work
- Input (large & sparse) → Intermediate Data (small & dense) → Output (small & dense)
- Proposed algorithm: partially materialize the intermediate data, combining materialization with on-the-fly computation
P2: Expected Performance Gain
- Which part of the intermediate data should we materialize?
- Exploit skewed degree distributions!
(Figure: % of computation saved vs. % of intermediate data materialized)
P3. Polarization Modeling
- Polarization in social networks: division into contrasting groups
- Example issue: “Use of marijuana should be: Legal OR Illegal”
- People react by changing their beliefs or by changing their edges

“How do people choose between two ways of polarization?”
P3. Problem Definition
- Given: a time-evolving social network with nodes' beliefs on controversial issues (e.g., legalizing marijuana)
- Find: an actor-based model with a utility function depending on network features, beliefs, etc.
- To best describe: the polarization in data
- Applications:
- predict future edges
- predict the cascades of beliefs
Timeline
- Mar-May 2018
- P1. Triangle counting in fully dynamic graph streams
- Jun-Aug 2018
- P3. Polarization modeling
- Sep-Oct 2018
- P2. Fast and scalable Tucker decomposition
- Nov 2018 - Apr 2019
- Thesis Writing & Job Application
- May 2019
- Defense
Roadmap
- Overview
- Completed Work
- T1. Structure Analysis
- T2. Anomaly Detection
- T3. Behavior Modeling
- Proposed Work
- Conclusion <<
Conclusion
- Goal:
To Understand Large Dynamic Graphs and Tensors
- Subtasks:
- structure analysis
- anomaly detection
- behavior modeling
- Approaches:
- distributed or external-memory algorithms
- streaming algorithms based on sampling
- approximation algorithms
References (Completed work)
[1] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees”, ECML/PKDD 2016
[2] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “CoreScope: Graph Mining Using k-Core Analysis - Patterns, Anomalies and Algorithms”, ICDM 2016
[3] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
[4] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017
[6] Kijung Shin, Euiwoong Lee, Dhivya Eswaran, and Ariel D. Procaccia, “Why You Should Charge Your Friends for Borrowing Your Stuff”, IJCAI 2017
[7] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017
[8] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018
[9] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018
[10] Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018
[11] Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018
Thank You
- Papers, software, data: http://www.cs.cmu.edu/~kijungs/proposal/
- Email: kijungs@cs.cmu.edu
- Thanks to:
- Sponsors:
- Admins:
- Collaborators: