Mining Big Data Streams: Better Algorithms or Faster Systems?



slide-1
SLIDE 1

Mining Big Data Streams

Better Algorithms or Faster Systems?


 
 Gianmarco De Francisci Morales
 gdfm@acm.org
 QCRI

slide-2
SLIDE 2

Agenda

API: SAMOA (Scalable Advanced Massive Online Analysis)
Algorithm: VHT (Vertical Hoeffding Tree)
System: PKG (Partial Key Grouping)

slide-3
SLIDE 3

Apache SAMOA

Scalable Advanced Massive Online Analysis


  • G. De Francisci Morales, A. Bifet


JMLR 2015


slide-4
SLIDE 4

Taxonomy

Data Mining
  Distributed
    Batch: Hadoop, Mahout
    Stream: Storm, S4, Samza, SAMOA
  Non-distributed
    Batch: R, WEKA, …
    Stream: MOA

slide-5
SLIDE 5

Architecture

[Figure: SAMOA architecture]

slide-11
SLIDE 11

Status

Parallel algorithms
  Classification (Vertical Hoeffding Tree)
  Clustering (CluStream)
  Regression (Adaptive Model Rules)
Execution engines

https://samoa.incubator.apache.org

slide-12
SLIDE 12

VHT

Vertical Hoeffding Tree


  • A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis


BigData 2016


slide-13
SLIDE 13

Hoeffding Tree

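A sample of the stream is enough for a near-optimal decision
Estimate merit of alternatives from a prefix of the stream
Choose sample size based on statistical principles
When to expand a leaf?
Let x1 be the most informative attribute, x2 the second most informative one
Hoeffding bound: split if

$$\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

where R is the range of the merit function G, 1 − δ is the desired confidence, and n is the number of examples seen at the leaf.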

  • P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
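
The split rule can be made concrete with a few lines of Python. This is a minimal sketch, not code from SAMOA or MOA: the function names and the numbers (R = 1, δ = 1e-7, the merit values) are assumptions chosen only to illustrate how ε shrinks as more examples reach a leaf.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(merit_best: float, merit_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the observed merit gap between x1 and x2 exceeds epsilon."""
    gap = merit_best - merit_second
    return gap > hoeffding_bound(value_range, delta, n)


# Assumed illustrative setting: R = 1, delta = 1e-7.
eps_200 = hoeffding_bound(1.0, 1e-7, 200)    # roughly 0.20
eps_5000 = hoeffding_bound(1.0, 1e-7, 5000)  # roughly 0.04

# A gap of 0.05 is too small after 200 examples but sufficient after 5000.
print(should_split(0.30, 0.25, 1.0, 1e-7, 200))
print(should_split(0.30, 0.25, 1.0, 1e-7, 5000))
```

The bound depends only on R, δ and n, so a leaf can decide to split without storing the examples themselves, only the statistics needed to compute the merit.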
slide-19
SLIDE 19

Parallel Decision Trees

Which kind of parallelism?
  Task
  Data
    Horizontal (partition by instances)
    Vertical (partition by attributes)

[Figure labels: Data, Attributes, Instances]

slide-29
SLIDE 29

Vertical Parallelism

Single attribute tracked in single node

[Figure labels: Stream, Model, Stats (×3), Attributes, Splits]
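
The idea that each attribute is tracked by exactly one node can be sketched as follows. This is a toy illustration under stated assumptions, not the SAMOA implementation: `StatsNode`, `route`, and `process_instance` are hypothetical names, routing by attribute index modulo the parallelism is an assumed rule, and plain counts stand in for the sufficient statistics.

```python
from collections import defaultdict


class StatsNode:
    """Holds statistics only for the attributes routed to this node."""

    def __init__(self):
        # (leaf_id, attr_index, attr_value, label) -> count
        self.counts = defaultdict(int)

    def update(self, leaf_id, attr_index, attr_value, label):
        self.counts[(leaf_id, attr_index, attr_value, label)] += 1


def route(attr_index: int, parallelism: int) -> int:
    """Assumed routing rule: attribute index modulo the number of stats nodes."""
    return attr_index % parallelism


def process_instance(nodes, leaf_id, attributes, label):
    """Model side: decompose one instance into per-attribute messages."""
    for attr_index, attr_value in enumerate(attributes):
        target = nodes[route(attr_index, len(nodes))]
        target.update(leaf_id, attr_index, attr_value, label)


nodes = [StatsNode() for _ in range(3)]
process_instance(nodes, leaf_id=0, attributes=["a", "b", "c", "d"], label=1)
process_instance(nodes, leaf_id=0, attributes=["a", "x", "c", "d"], label=0)

# Each attribute index lives on exactly one node.
for i, node in enumerate(nodes):
    print(i, sorted({key[1] for key in node.counts}))
```

Because the statistics for an attribute are never replicated, each node can evaluate candidate splits for its own attributes independently of the others.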

slide-30
SLIDE 30

Advantages of Vertical

High number of attributes => high level of parallelism (e.g., documents)
Vs task parallelism
  Parallelism observed immediately
Vs horizontal parallelism
  Reduced memory usage (no model replication)
  Parallelized split computation

slide-31
SLIDE 31

PKG

Partial Key Grouping


  • M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini


ICDE 2015, ICDE 2016


slide-32
SLIDE 32

Systems Challenges

Skewed key distribution

[Figure: CCDF vs. key frequency on log-log axes, for words in tweets and Wikipedia links]

slide-33
SLIDE 33

Key Grouping and Skew

[Figure labels: Source (×2), Worker (×3), Stream]

slide-34
SLIDE 34

Problem Statement

Input stream of messages: $m = \langle t, k, v \rangle$
Load of worker $i \in W$: $L_i(t) = |\{\langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t\}|$
Imbalance of the system: $I(t) = \max_i (L_i(t)) - \operatorname{avg}_i (L_i(t))$, for $i \in W$
Goal: partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
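
To make the load and imbalance definitions tangible, the sketch below computes them for a toy stream partitioned by hashing each key to one worker. The stream contents, the worker count, and the use of CRC32 as the hash are assumptions for illustration only.

```python
import zlib
from collections import Counter


def key_grouping(key: str, num_workers: int) -> int:
    """P(k): hash the key to a single worker (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def worker_loads(keys, num_workers, partition):
    """L_i(t): number of messages routed to worker i so far."""
    counts = Counter(partition(k, num_workers) for k in keys)
    return [counts.get(i, 0) for i in range(num_workers)]


def imbalance(loads):
    """I(t) = max_i L_i(t) - avg_i L_i(t)."""
    return max(loads) - sum(loads) / len(loads)


# Assumed toy stream: one hot key (60% of messages) plus a tail of unique keys.
stream = ["hot"] * 600 + [f"k{i}" for i in range(400)]
loads = worker_loads(stream, 4, key_grouping)
print(loads, imbalance(loads))
```

Whatever the hash function, all 600 messages for the hot key land on the same worker, so the imbalance here is at least 600 − 250 = 350.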

slide-35
SLIDE 35

Shuffle Grouping

[Figure labels: Source (×2), Worker (×3), Aggr., Stream]

slide-36
SLIDE 36

Existing Stream Partitioning

Key Grouping
  Memory and communication efficient :)
  Load imbalance :(
Shuffle Grouping
  Load balance :)
  Additional memory and aggregation phase :(

slide-37
SLIDE 37

Solution: PKG

Fully distributed adaptation of PoTC (power of two choices), handles skew
Challenge: consensus and state to remember choice
  Key splitting: assign each key independently with PoTC
Challenge: load information in distributed system
  Local load estimation: estimate worker load locally at each source
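
Key splitting and local load estimation can be sketched together in a small partitioner. This is a minimal illustration under assumptions, not the code used in the ICDE papers or in any stream engine: the class name `PartialKeyGrouping`, the salted BLAKE2b hashes, and the toy stream are all hypothetical choices.

```python
import hashlib


class PartialKeyGrouping:
    """One instance per source; load is estimated from this source's own traffic."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.local_load = [0] * num_workers  # messages this source sent to each worker

    def _hash(self, key: str, seed: int) -> int:
        # Two independent hash functions are obtained by salting BLAKE2b (assumed choice).
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=bytes([seed])).digest()
        return int.from_bytes(digest, "big") % self.num_workers

    def candidates(self, key: str):
        return self._hash(key, 1), self._hash(key, 2)

    def route(self, key: str) -> int:
        """Send each message to the less loaded of the key's two candidate workers."""
        first, second = self.candidates(key)
        choice = first if self.local_load[first] <= self.local_load[second] else second
        self.local_load[choice] += 1
        return choice


pkg = PartialKeyGrouping(num_workers=4)
stream = ["hot"] * 600 + [f"k{i}" for i in range(400)]
seen = {}
for key in stream:
    seen.setdefault(key, set()).add(pkg.route(key))
print(pkg.local_load, seen["hot"])
```

The choice is made per message rather than per key, so no routing table has to be remembered or agreed upon, and each source needs only its own counters.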

slide-38
SLIDE 38

Power of Both Choices

[Figure labels: Source (×2), Worker (×3), Aggr., Stream]

slide-39
SLIDE 39

Comparison

Stream Grouping       | Pros                            | Cons
Key Grouping          | Memory efficient                | Load imbalance
Shuffle Grouping      | Load balance                    | Memory overhead, aggregation O(W)
Partial Key Grouping  | Memory efficient, load balance  | Aggregation O(1)
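
The aggregation costs in the comparison follow from how many workers can end up holding partial state for a single key. The sketch below counts that fan-out for the three schemes on a toy stream; the function names, the round-robin model of shuffle grouping, the salted BLAKE2b hashes, and the stream itself are assumptions for illustration.

```python
import hashlib
from itertools import count


def h(key: str, seed: int, num_workers: int) -> int:
    """Salted hash of a key onto a worker index (illustrative choice)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=bytes([seed])).digest()
    return int.from_bytes(digest, "big") % num_workers


def replicas_per_key(stream, num_workers, scheme):
    """Return {key: set of workers that hold partial state for it}."""
    load = [0] * num_workers
    round_robin = count()
    placed = {}
    for key in stream:
        if scheme == "key":
            worker = h(key, 1, num_workers)
        elif scheme == "shuffle":
            worker = next(round_robin) % num_workers
        else:  # "partial": two candidates, pick the less loaded one
            a, b = h(key, 1, num_workers), h(key, 2, num_workers)
            worker = a if load[a] <= load[b] else b
        load[worker] += 1
        placed.setdefault(key, set()).add(worker)
    return placed


stream = ["hot"] * 600 + [f"k{i}" for i in range(400)]
fanout = {scheme: max(len(ws) for ws in replicas_per_key(stream, 8, scheme).values())
          for scheme in ("key", "shuffle", "partial")}
print(fanout)
```

With 8 workers, the hot key is spread over all 8 under shuffle grouping, over at most 2 under partial key grouping, and over exactly 1 under key grouping, which mirrors the O(W) versus O(1) aggregation entries.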

slide-40
SLIDE 40

Graph Streams

Betweenness centrality in evolving graphs (TKDE '15)
Dynamic graph summarization (BigData '16)
Top-k densest subgraph in evolving graphs (CIKM '17)
Mining frequent patterns in evolving graphs (w.i.p.)