SAMOA: A Platform for Mining Big Data Streams



SLIDE 1

SAMOA: A Platform for Mining Big Data Streams

Nicolas Kourtellis Associate Researcher Telefonica I+D, Barcelona

SAMOA: Scalable Advanced Massive Online Analysis 30/09/15 1

SLIDE 2

What is Big Data?

• Search queries
• Facebook posts
• Emails
• Tweets
• Photo shares
• Clicks on ads
• …

SLIDE 3

How BIG is your data?

• Volume (+ Variety)
  • Too large for RAM of single commodity server
• Velocity
  • Too fast for CPU of single commodity server

SLIDE 4

What is the Streaming Paradigm?

• High amount of data, high speed of arrival
• Models updated in “real” time
• Potentially infinite sequence of data
• Change over time (concept drift)

SLIDE 5

Mining Big Data Streams

• Approximation algorithms:
  • Single pass, one data item at a time
  • Sub-linear space and time per data item
  • Small error with high probability
• A platform solution:
  • Support different algorithms & processing engines
  • Distributed
  • Scalable
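
The "small error with high probability" property listed above is commonly formalized with the Hoeffding bound, which is also the statistical tool behind Hoeffding trees. The following is a minimal sketch of that bound, not code from SAMOA; the class and method names are illustrative assumptions.

```java
/** Illustrative helper (not part of SAMOA): Hoeffding bound for a bounded random variable. */
public class HoeffdingBound {

    /**
     * With probability at least 1 - delta, the true mean of a variable whose
     * values span `range` lies within epsilon of the mean observed over n samples:
     * epsilon = sqrt(range^2 * ln(1/delta) / (2n)).
     */
    public static double epsilon(double range, double delta, long n) {
        if (range <= 0 || delta <= 0 || delta >= 1 || n <= 0) {
            throw new IllegalArgumentException("need range > 0, 0 < delta < 1, n > 0");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // The bound depends only on the count n, so no data item has to be stored.
        for (long n : new long[] {100, 10_000, 1_000_000}) {
            System.out.printf("n=%d epsilon=%.5f%n", n, epsilon(1.0, 1e-7, n));
        }
    }
}
```

Because epsilon shrinks with the square root of n, seeing 100 times more items tightens the bound by a factor of 10, which is what lets a single-pass algorithm commit to a decision after a bounded number of items.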

SLIDE 6

What is SAMOA?

• Scalable Advanced Massive Online Analysis
• A platform for mining big data streams
  • Framework for developing new distributed stream mining algorithms
  • Framework for deploying algorithms on new distributed stream processing engines

SLIDE 7

Taxonomy

Machine Learning
  • Distributed
    • Batch: Hadoop, Mahout
    • Stream: S4, Storm, SAMOA
  • Non-distributed
    • Batch: R, WEKA, …
    • Stream: MOA

SLIDE 8

SAMOA Architecture

[Diagram: SAMOA layered between Machine Learning Algorithms and Distributed Stream Processing Engines (Flink shown)]

SLIDE 9

Why is SAMOA important?

• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
  • No system downtime
  • No complex backup/update process
  • No need to select update frequency

SLIDE 10

ML Developer API

[Diagram labels: Processing Item, Processor, Stream]

SLIDE 11

ML Developer API

TopologyBuilder builder; // initialization not shown on the slide

// First source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);

// Second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);

// Join processor: streamOne arrives shuffled, streamTwo grouped by key
Processor join = new JoinProcessor();
builder.addProcessor(join)
       .connectInputShuffle(streamOne)
       .connectInputKey(streamTwo);

SLIDE 12

Deployment

• SAMOA-API.jar: API. Algorithm developer depends only on this
• SAMOA-S4.jar: S4 bindings, packaged as samoa-s4-deployable.s4r, to S4 cluster
• SAMOA-Storm.jar: Storm bindings, packaged as samoa-storm-deployable.jar, to Storm cluster

SLIDE 13

Easy to get!

SLIDE 14

Easy to get!

SLIDE 15

Easy to get!

SLIDE 16

Easy to test!

bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation
  -d /tmp/dump.csv
  -i 1000000 -f 100000
  -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k)
  -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
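
The PrequentialEvaluation task named in the command follows the test-then-train scheme: each arriving instance is first used to test the current model and only afterwards to train it. The sketch below illustrates that loop in isolation. It does not use SAMOA classes, and the `MajorityClass` learner and the synthetic label stream are assumptions made for the example.

```java
import java.util.Random;

/** Illustrative test-then-train loop; not SAMOA code. */
public class PrequentialSketch {

    /** Hypothetical toy learner: predicts the binary class seen most often so far. */
    static class MajorityClass {
        private final long[] counts = new long[2];

        int predict() {
            return counts[1] > counts[0] ? 1 : 0;
        }

        void train(int label) {
            counts[label]++;
        }
    }

    /** Returns the fraction of instances predicted correctly before the model trained on them. */
    public static double evaluate(int[] labels) {
        MajorityClass model = new MajorityClass();
        long correct = 0;
        for (int label : labels) {
            if (model.predict() == label) {
                correct++;          // test first
            }
            model.train(label);     // then train on the same instance
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] labels = new int[10_000];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = rnd.nextDouble() < 0.7 ? 1 : 0; // synthetic stream, about 70% class 1
        }
        System.out.printf("prequential accuracy: %.3f%n", evaluate(labels));
    }
}
```

Since every instance is scored before the model sees its label, no separate hold-out set is needed, which suits a potentially infinite stream.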

SLIDE 17

Case study: Decision Trees

• VHT: Vertical Hoeffding Tree

[Diagram: Task parallelism]

SLIDE 18

Case study: VHT

[Diagram: Horizontal Parallelism. Labels: Stream, Instances, Stats (×3), Histograms, Model, Model Updates]

SLIDE 19

Case study: VHT

[Diagram: Vertical Parallelism. Labels: Stream, Model, Attributes, Stats (×3), Splits]

SLIDE 20

Benefits of Vertical Parallelism

• High number of attributes:
  • high level of parallelism (e.g., documents)
• vs. task parallelism:
  • parallelism is readily observed
• vs. horizontal parallelism:
  • reduced memory usage (no model replication)
  • parallelized split computation

SLIDE 21

Vertical Hoeffding Tree

[Diagram: VHT topology. Processors: Source (n), Model (n), Stats (n), Evaluator (1). Stream labels: Instance, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping]
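
The three groupings named in the topology decide which parallel instance of a destination processor receives each event. The sketch below shows the routing rules these names usually denote in stream processing engines: shuffle spreads events evenly (round-robin here), key sends equal keys to the same instance, and all broadcasts to every instance. It is an illustration under those assumptions, not SAMOA's implementation, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative routing rules for stream groupings; not SAMOA's implementation. */
public class GroupingSketch {
    private final int parallelism; // number of destination instances
    private int next = 0;          // round-robin cursor for shuffle grouping

    public GroupingSketch(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        this.parallelism = parallelism;
    }

    /** Shuffle grouping: spread events evenly, here by round-robin. */
    public int shuffle() {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** Key grouping: events with the same key always reach the same instance. */
    public int key(String key) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: every instance receives a copy of the event. */
    public List<Integer> all() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        GroupingSketch g = new GroupingSketch(4);
        System.out.println("shuffle -> " + g.shuffle() + ", " + g.shuffle());
        System.out.println("key(attr-7) -> " + g.key("attr-7"));
        System.out.println("all -> " + g.all());
    }
}
```

In a vertically parallel design, keying on an attribute identifier would keep all statistics for one attribute on a single Stats instance, while an all grouping suits control messages that every instance must see.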

SLIDE 22

Preliminary results: Tweets

• Zipf skew: 1.5
• Bag of words: 100, 1000, 10000 (attributes)
• Size of tweet: ~15 words
• Instances: 1,000,000
• Class: positive or negative (Gaussian random variable)
• 10 runs
• Local vs. Storm virtual cluster

SLIDE 23

Results: Accuracy

[Figure: Classification Accuracy vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16, local); y-axis: Correct Classification % (20 to 100); series: 100 words, 1000 words, 10000 words]

SLIDE 24

Results: Speedup

[Figure: Speedup vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16); y-axis: Speedup (1 to 5); series: 100 words, 1000 words, 10000 words]

SLIDE 25

Is SAMOA for you?

• Are you dealing with:
  • Big fast data?
  • Possibly endless streams of data?
  • Evolving data?
• Do you need updated models in real time?
• Do you want to test an algorithm on different DSPEs?

SLIDE 26

SAMOA Team

Albert Bifet
Gianmarco De Francisci Morales
Nicolas Kourtellis
Matthieu Morel
Arinto Murdopo
Olivier Van Laere

SLIDE 27

Status

• Apache Incubator
• Released version 0.3.0 in July
• Execution Engines
  • Heron?
• Input:
  • Local FS
  • HDFS
  • Kafka [pending]

SLIDE 28

Algorithms in SAMOA

• Existing:
  • Vertical Hoeffding Tree (classification)
  • CluStream (clustering)
  • Adaptive Model Rules (regression)
• Pending:
  • Distributed Naïve Bayes
  • Stochastic Gradient Descent
  • Adaptive + Boosting VHT
  • Parallelized Gradient Boosted Decision Tree
  • PARMA (frequent pattern mining)
  • …
• Check the SAMOA Roadmap for more

Looking for contributors!

SLIDE 29

SAMOA: A Platform for Mining Big Data Streams

@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa

Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com
