Mining Big Data Streams: Better Algorithms or Faster Systems?
Gianmarco De Francisci Morales
gdfm@acm.org
QCRI
Agenda
API: SAMOA (Scalable Advanced Massive Online Analysis)
Algorithm: VHT (Vertical Hoeffding Tree)
System: PKG (Partial Key Grouping)
Apache SAMOA
Scalable Advanced Massive Online Analysis
- G. De Francisci Morales, A. Bifet
JMLR 2015
Taxonomy
Data Mining
  Distributed
    Batch: Hadoop, Mahout
    Stream: Storm, S4, Samza; SAMOA
  Non Distributed
    Batch: R, WEKA, …
    Stream: MOA
Architecture
[Figure: SAMOA architecture diagram]
Status
Parallel algorithms
  Classification (Vertical Hoeffding Tree)
  Clustering (CluStream)
  Regression (Adaptive Model Rules)
Execution engines
https://samoa.incubator.apache.org
VHT
Vertical Hoeffding Tree
- A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis
BigData 2016
Hoeffding Tree
A sample of the stream is enough for a near-optimal decision
Estimate the merit of alternatives from a prefix of the stream
Choose the sample size based on statistical principles
When to expand a leaf?
Let $x_1$ be the most informative attribute and $x_2$ the second most informative one
Hoeffding bound: split if
$$\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
- P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
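The split rule above can be sketched in a few lines. This is a minimal illustration, not code from SAMOA or MOA: the function names are hypothetical, and it assumes the usual reading of the symbols, with R the range of the merit measure, 1 - δ the desired confidence, n the number of instances seen at the leaf, and ΔG the difference in merit between the two best attributes.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best: float, gain_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the observed merit gap exceeds the Hoeffding bound."""
    delta_g = gain_best - gain_second
    return delta_g > hoeffding_bound(value_range, delta, n)
```

With a fixed merit gap, the bound shrinks as n grows, so a leaf that cannot be split after 100 instances may become splittable after 10,000.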
Parallel Decision Trees
Which kind of parallelism?
  Task
  Data
    Horizontal
    Vertical
[Figure: data matrix, attributes by instances]
Vertical Parallelism
[Figure: Stream, Model, and three Stats nodes, with Attributes and Splits exchanged between the model and the stats nodes]
Single attribute tracked in single node
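The idea that each attribute is tracked in a single node can be illustrated with a toy routing sketch. It is an assumption-laden simplification, not the VHT implementation in SAMOA: `StatsNode`, `route`, and `process_instance` are hypothetical names, routing is a plain modulo on the attribute index, and the per-leaf bookkeeping and the split computation sent back to the model are left out.

```python
from collections import defaultdict


class StatsNode:
    """Hypothetical local-statistics node that owns a subset of attributes."""

    def __init__(self):
        # counts[attr_index][(attr_value, label)] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, attr_index, attr_value, label):
        self.counts[attr_index][(attr_value, label)] += 1


def route(attr_index, num_nodes):
    """Map an attribute to exactly one node (assumed: simple modulo)."""
    return attr_index % num_nodes


def process_instance(nodes, attributes, label):
    """Model side: decompose the instance and send each attribute to its node."""
    for i, value in enumerate(attributes):
        nodes[route(i, len(nodes))].update(i, value, label)
```

Because statistics for one attribute never live on two nodes, no node needs a copy of the model, and each node can evaluate candidate splits for its own attributes independently.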
Advantages of Vertical
High number of attributes => high level of parallelism (e.g., documents)
Vs task parallelism
  Parallelism observed immediately
Vs horizontal parallelism
  Reduced memory usage (no model replication)
  Parallelized split computation
PKG
Partial Key Grouping
- M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini
ICDE 2015, ICDE 2016
[Figure: CCDF of key frequency (log-log axes) for words in tweets and Wikipedia links]
Systems Challenges
Skewed key distribution
Key Grouping and Skew
[Figure: two sources partitioning a stream across three workers]
Problem Statement
Input stream of messages: $m = \langle t, k, v \rangle$
Load of worker $i \in W$: $L_i(t) = |\{\langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t\}|$
Imbalance of the system: $I(t) = \max_i(L_i(t)) - \mathrm{avg}_i(L_i(t))$, for $i \in W$
Goal: partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
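As a small worked example of the load and imbalance definitions, the sketch below counts messages per worker and takes the maximum load minus the average load. The helper names `loads` and `imbalance` are hypothetical, and workers are assumed to be numbered 0 to W - 1.

```python
from collections import Counter


def loads(assignments, num_workers):
    """L_i: number of messages routed to each worker i so far."""
    counts = Counter(assignments)
    return [counts.get(i, 0) for i in range(num_workers)]


def imbalance(worker_loads):
    """I(t) = max_i L_i(t) - avg_i L_i(t)."""
    return max(worker_loads) - sum(worker_loads) / len(worker_loads)
```

For instance, if six messages are routed to workers `[0, 0, 0, 1, 2, 0]` out of three, the loads are 4, 1, and 1, the average is 2, and the imbalance is 2.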
Shuffle Grouping
[Figure: sources sending a stream to workers, followed by an aggregator (Aggr.)]
Existing Stream Partitioning
Key Grouping
  Memory and communication efficient :)
  Load imbalance :(
Shuffle Grouping
  Load balance :)
  Additional memory and aggregation phase :(
Solution: PKG
Fully distributed adaptation of the power of two choices (PoTC) that handles skew
Problem: consensus and state needed to remember each choice
  Key splitting: assign each key independently with PoTC
Problem: load information in a distributed system
  Local load estimation: estimate worker load locally at each source
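Key splitting and local load estimation can be sketched together as follows. This is a minimal illustration under stated assumptions, not the Storm or SAMOA implementation: the class name `PartialKeyGrouper` is hypothetical, the two hash functions are derived from MD5 with different seeds purely for convenience, and each source instance keeps only its own count of messages sent per worker.

```python
import hashlib


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Seeded hash of a key onto a worker index (assumed: MD5-based)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class PartialKeyGrouper:
    """One instance per source; no coordination between sources."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local load estimate: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def candidates(self, key: str):
        """The two candidate workers for a key, fixed by the hash functions."""
        return _hash(key, 1, self.num_workers), _hash(key, 2, self.num_workers)

    def route(self, key: str) -> int:
        """Send the message to the less loaded of the key's two candidates."""
        a, b = self.candidates(key)
        chosen = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[chosen] += 1
        return chosen
```

Since a key can only ever reach its two candidate workers, the choice does not have to be remembered or agreed upon, and downstream aggregation for a key touches at most two partial results.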
Power of Both Choices
[Figure: sources sending a stream to workers, followed by an aggregator (Aggr.)]
Comparison
Stream Grouping        Pros                             Cons
Key Grouping           Memory efficient                 Load imbalance
Shuffle Grouping       Load balance                     Memory overhead, Aggregation O(W)
Partial Key Grouping   Memory efficient, Load balance   Aggregation O(1)
Graph Streams
Betweenness centrality in evolving graphs (TKDE '15)
Dynamic graph summarization (BigData '16)
Top-k densest subgraph in evolving graphs (CIKM '17)
Mining frequent patterns in evolving graphs (w.i.p.)