Data Stream Mining. Albert Bifet (@abifet). Dagstuhl, 31 October 2017


SLIDE 1

Data Stream Mining

Albert Bifet @abifet

Dagstuhl, 31 October 2017

SLIDE 2

Projects

  • MOA (University of Waikato): 10,000 downloads/year, 50 contributors

  • Apache SAMOA (Yahoo Labs)

SLIDE 3

Analytic Standard Approach

Finite training sets
 Static models

[Diagram: Data Set → Classifier Algorithm builds Model → Model]

SLIDE 4

Data Stream Approach

Infinite training sets
 Dynamic models

[Diagram: a sequence of data items D, each one updating the model M ("Update Model")]

SLIDE 5

Data Stream Mining

  • Maintain models online
  • Incorporate data on the fly
  • Unbounded training sets
  • Resource efficient
  • Detect changes and adapt
  • Dynamic models
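
A minimal sketch of what "maintain models online" and "resource efficient" can mean in practice: the model below is a running mean and variance that absorbs one value at a time in constant memory (Welford's method). The names `RunningStats` and `consume` are illustrative assumptions, not part of MOA or SAMOA, and the sketch does not cover change detection.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance maintained online in O(1) memory (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Incorporate a single observation; no past data is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until two observations have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def consume(stream, model):
    """Feed a (possibly unbounded) iterable to the model, one item at a time."""
    for x in stream:
        model.update(x)
    return model
```

The model is usable after every single update, so it can be queried at any point of the stream without waiting for a finite training set to be complete.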

SLIDE 6

MOA

  • Massive Online Analysis (MOA) is a framework for online learning from data streams.

  • It is closely related to WEKA
  • It includes a collection of offline and online algorithms as well as tools for evaluation:

  • classification, regression
  • clustering, frequent pattern mining
  • Easy to extend, design and run experiments
SLIDE 7

SLIDE 8

HOEFFDING TREE

  • Sample of stream enough for near optimal decision
  • Estimate merit of alternatives from prefix of stream
  • Choose sample size based on statistical principles
  • When to expand a leaf?
  • Let x1 be the most informative attribute and x2 the second most informative one

  • Hoeffding bound: split if G(x1) - G(x2) > ε
  • P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00

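$\epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$, where R is the range of G, n is the number of examples seen at the leaf, and 1 - δ is the desired confidence.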
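
The split rule on this slide can be sketched in a few lines, assuming the bound from Domingos and Hulten, ε = sqrt(R² ln(1/δ) / (2n)), with R the range of the merit function G, n the number of examples seen at the leaf, and δ the allowed error probability. The function names and the sample numbers are illustrative assumptions, not MOA's API.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best: float, g_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the observed merit gap exceeds the Hoeffding bound."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

For example, with R = 1, δ = 1e-7 and n = 1000 the bound is roughly 0.09, so a gap of 0.15 between the two best attributes would justify a split while a gap of 0.05 would not. Because ε shrinks as 1/sqrt(n), seeing four times as many examples halves the bound.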

SLIDE 9

Adaptive Random Forest

  • Why Random Forests?
  • Off-the-shelf learner
  • Good learning performance
  • Based on the original Random Forest by Breiman

Related publication: Adaptive random forests for evolving data stream classification. Gomes, H M; Bifet, A; Read, J; Barddal, J P; Enembreck, F; Pfahringer, B; Holmes, G; Abdessalem, T. Machine Learning, Springer, 2017.

SLIDE 10

ADWIN

SLIDE 11

ADWIN

SLIDE 12

SLIDE 13

APACHE SAMOA


http://samoa-project.net

  • G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
SLIDE 14

Summary

  • Data streaming is useful for finding approximate solutions in a reasonable amount of time and with limited resources

  • MOA: Massive Online Analysis
  • Available and open-source
  • http://moa.cms.waikato.ac.nz/
  • SAMOA: A Platform for Mining Big Data Streams
  • Available and open-source (incubating @ASF)
  • http://samoa.incubator.apache.org
