
SAMOA: A Platform for Mining Big Data Streams, by Nicolas Kourtellis



  1. SAMOA: Scalable Advanced Massive Online Analysis, 30/09/15
     SAMOA: A Platform for Mining Big Data Streams
     Nicolas Kourtellis, Associate Researcher, Telefonica I+D, Barcelona

  2. What is Big Data?
     - Search queries
     - Facebook posts
     - Emails
     - Tweets
     - Photo shares
     - Clicks on ads
     - …

  3. How BIG is your data?
     - Volume (+ Variety)
       - Too large for the RAM of a single commodity server
     - Velocity
       - Too fast for the CPU of a single commodity server

  4. What is the Streaming Paradigm?
     - Large amounts of data arriving at high speed
     - Models updated in "real" time
     - Potentially infinite sequence of data
     - Data changes over time (concept drift)

  5. Mining Big Data Streams
     - Approximation algorithms:
       - Single pass, one data item at a time
       - Sub-linear space and time per data item
       - Small error with high probability
     - A platform solution:
       - Support different algorithms & processing engines
       - Distributed
       - Scalable
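The approximation-algorithm properties listed on slide 5 (single pass, sub-linear space, small error with high probability) can be illustrated with a Count-Min sketch. This is a minimal sketch and not part of SAMOA: the class name `CountMinSketch`, its parameters, and the bit-mixing hash are assumptions chosen for illustration, and the hash is not the pairwise-independent family that the formal error bound assumes.

```java
import java.util.Random;

/** Illustrative Count-Min sketch: approximate item counts in fixed memory. */
final class CountMinSketch {
    private final int depth;       // number of hash rows
    private final int width;       // counters per row
    private final long[][] table;  // depth x width counters
    private final int[] seeds;     // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = rng.nextInt();
        }
    }

    /** Maps an item to a counter index in the given row (illustrative mixing). */
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    /** Processes one stream item: a single pass, O(depth) time per item. */
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    /** Returns an estimate that never undercounts; collisions can only inflate it. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Memory stays at `depth * width` counters no matter how many items arrive, which is the trade the slide describes: bounded space in exchange for a small, one-sided error.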

  6. What is SAMOA?
     - Scalable Advanced Massive Online Analysis
     - A platform for mining big data streams
     - Framework for developing new distributed stream mining algorithms
     - Framework for deploying algorithms on new distributed stream processing engines

  7. Taxonomy (figure)
     - Machine Learning
       - Non Distributed
         - Batch: R, WEKA, …
         - Stream: MOA
       - Distributed
         - Batch: Hadoop, Mahout
         - Stream: S4, Storm, SAMOA

  8. SAMOA Architecture (figure)
     - Top layer: Machine Learning Algorithms
     - Middle layer: SAMOA
     - Bottom layer: Distributed Stream Processing Engines (Flink is among those shown)

  9. Why is SAMOA important?
     - Program once, run everywhere
     - Reuse existing infrastructure
     - Avoid deploy cycles
     - No system downtime
     - No complex backup/update process
     - No need to select update frequency

  10. ML Developer API (figure)
      - Building blocks shown: Processing Item, Processor, Stream

  11. ML Developer API

          // SourceProcessor and JoinProcessor stand for user-defined Processor classes.
          // The builder's setup is not shown on the slide.
          TopologyBuilder builder;

          // First source and its output stream
          Processor sourceOne = new SourceProcessor();
          builder.addProcessor(sourceOne);
          Stream streamOne = builder.createStream(sourceOne);

          // Second source and its output stream
          Processor sourceTwo = new SourceProcessor();
          builder.addProcessor(sourceTwo);
          Stream streamTwo = builder.createStream(sourceTwo);

          // The join consumes streamOne with shuffle grouping and streamTwo with key grouping
          Processor join = new JoinProcessor();
          builder.addProcessor(join)
                 .connectInputShuffle(streamOne)
                 .connectInputKey(streamTwo);

  12. Deployment (figure)
      - SAMOA-API.jar: the API. The algorithm developer depends only on this.
      - SAMOA-S4.jar (S4 bindings) -> samoa-s4-deployable.s4r -> to S4 cluster
      - SAMOA-Storm.jar (Storm bindings) -> samoa-storm-deployable.jar -> to Storm cluster

  13. Easy to get!

  14. Easy to get!

  15. Easy to get!

  16. Easy to test!

          bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
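The `PrequentialEvaluation` task named in the command on slide 16 follows the prequential (test-then-train) scheme: each arriving instance is first used to test the current model and only then to train it. The following is a rough sketch of that loop under stated assumptions. `OnlineClassifier`, `MajorityClassifier`, and `Prequential` are hypothetical names introduced here and are not SAMOA classes; the majority-class learner is only a stand-in for a real classifier such as the Vertical Hoeffding Tree.

```java
/** Hypothetical online learner interface; not a SAMOA class. */
interface OnlineClassifier {
    int predict(double[] features);
    void train(double[] features, int label);
}

/** Trivial baseline: always predicts the most frequent label seen so far. */
final class MajorityClassifier implements OnlineClassifier {
    private final int[] counts;

    MajorityClassifier(int numClasses) {
        this.counts = new int[numClasses];
    }

    @Override
    public int predict(double[] features) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    @Override
    public void train(double[] features, int label) {
        counts[label]++;
    }
}

final class Prequential {
    /** Returns accuracy when every instance is tested first, then trained on. */
    static double evaluate(OnlineClassifier model, double[][] xs, int[] ys) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            if (model.predict(xs[i]) == ys[i]) {
                correct++;                  // test on the unseen instance
            }
            model.train(xs[i], ys[i]);      // then learn from it
        }
        return xs.length == 0 ? 0.0 : (double) correct / xs.length;
    }
}
```

Because every instance is scored before the model sees its label, no separate hold-out set is needed, which suits a potentially endless stream.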

  17. Case study: Decision Trees
      - VHT: Vertical Hoeffding Tree
      - Figure: task parallelism
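The slides do not spell out the split rule, so as background for the case study: a Hoeffding tree usually decides when to split a leaf with the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split criterion, delta the allowed error probability, and n the number of instances seen at the leaf. The sketch below assumes that standard rule. The class and method names are hypothetical, and refinements such as tie-breaking are left out.

```java
/** Illustrative Hoeffding-bound split test; names are hypothetical. */
final class HoeffdingBound {
    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** True when the best attribute beats the runner-up by more than epsilon. */
    static boolean shouldSplit(double bestMerit, double secondMerit,
                               double range, double delta, long n) {
        return bestMerit - secondMerit > epsilon(range, delta, n);
    }
}
```

The bound shrinks as n grows, so a leaf that cannot separate two candidate attributes yet may be able to after more instances arrive.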

  18. Case study: VHT
      - Figure: horizontal parallelism
      - Labels: Stream, Instances, Model, Stats, Histograms, Model Updates

  19. Case study: VHT
      - Figure: vertical parallelism
      - Labels: Stream, Model, Attributes, Stats, Splits

  20. Benefits of Vertical Parallelism
      - High number of attributes:
        - high level of parallelism (e.g., documents)
      - vs. task parallelism:
        - obvious parallelism observed
      - vs. horizontal parallelism:
        - reduced memory usage (no model replication)
        - parallelized split computation

  21. Vertical Hoeffding Tree (figure)
      - Components: Source (n), Model (n), Stats (n), Evaluator (1)
      - Streams: Instance, Control, Split, Result
      - Groupings: Shuffle Grouping, Key Grouping, All Grouping
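The three groupings named on slide 21 decide which parallel instance of a downstream component receives each message. A minimal sketch of the usual semantics follows; it is not SAMOA's or Storm's routing code. The `Groupings` class is hypothetical, and round robin is used as a simple stand-in for shuffle grouping, which an engine may implement with random assignment instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative routing rules for the three groupings; not SAMOA's code. */
final class Groupings {
    private int next = 0;  // round-robin cursor used by shuffle()

    /** Shuffle grouping: spread messages evenly across instances. */
    int shuffle(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** Key grouping: the same key always reaches the same instance. */
    static int key(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: broadcast to every instance. */
    static List<Integer> all(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

Key grouping is what makes per-attribute state possible: if attribute identifiers are used as keys, all statistics for one attribute accumulate on a single instance.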

  22. Preliminary results: Tweets
      - Zipf skew: 1.5
      - Bag of words: 100, 1000, 10000 (attributes)
      - Size of tweet: ~15 words
      - Instances: 1,000,000
      - Class: positive or negative (Gaussian random variable)
      - 10 runs
      - Local vs. Storm virtual cluster

  23. Results: Accuracy
      - Figure: classification accuracy vs. parallelism level vs. number of attributes
      - Series: 100 words, 1000 words, 10000 words
      - Y-axis: Correct Classification % (0 to 100)
      - X-axis: Parallelism Level (4, 8, 16, local)

  24. Results: Speedup
      - Figure: speedup vs. parallelism level vs. number of attributes
      - Series: 100 words, 1000 words, 10000 words
      - Y-axis: Speedup (0 to 5)
      - X-axis: Parallelism Level (4, 8, 16)

  25. Is SAMOA for you?
      - Are you dealing with:
        - Big fast data?
        - Possibly endless streams of data?
        - Evolving data?
      - Do you need updated models in real time?
      - Do you want to test an algorithm on different DSPEs?

  26. SAMOA Team
      - Albert Bifet
      - Matthieu Morel
      - Gianmarco De Francisci Morales
      - Arinto Murdopo
      - Olivier Van Laere
      - Nicolas Kourtellis

  27. Status
      - Apache Incubator
      - Released version 0.3.0 in July
      - Execution Engines: Heron?
      - Input:
        - Local FS
        - HDFS
        - Kafka [pending]

  28. Algorithms in SAMOA
      - Existing:
        - Vertical Hoeffding Tree (classification)
        - CluStream (clustering)
        - Adaptive Model Rules (regression)
      - Pending (looking for contributors!):
        - Distributed Naïve Bayes
        - Stochastic Gradient Descent
        - Adaptive + Boosting VHT
        - Parallelized Gradient Boosted Decision Tree
        - PARMA (frequent pattern mining)
        - …
      - Check the SAMOA Roadmap for more

  29. SAMOA: A Platform for Mining Big Data Streams
      - @ApacheSAMOA
      - http://samoa.incubator.apache.org/
      - https://github.com/apache/incubator-samoa
      - Nicolas Kourtellis, @kourtellis, nicolas.kourtellis@telefonica.com
