SLIDE 1

Data Stream Classification using Random Feature Functions and Novel Method Combinations

Jesse Read(1), Albert Bifet(2)

Department of Information and Computer Science (1) Aalto University and HIIT, Finland (2) Huawei Noah’s Ark Lab, Hong Kong

August 21, 2015

SLIDE 2

Classification in Data Streams

Setting:

  • sequence is potentially infinite
  • high speed of arrival
  • stream is one-way, can’t ‘go back’

Implications:

  • memory is limited
  • adapt to concept drift

SLIDE 3

Classification in Data Streams

SLIDE 4

Methods for Data Streams

  • Naive Bayes
  • Stochastic Gradient Descent
  • Incremental Decision Trees (Hoeffding Tree)
  • Lazy/prototype methods (e.g., kNN)
  • Batch learning (e.g., SVMs, decision trees)

SLIDE 5

Stochastic Gradient Descent (SGD)

update weights incrementally

[Figure: model with inputs x1, x2, x3 and output y]

SLIDE 6

Stochastic Gradient Descent (SGD)

update weights incrementally

[Figure: model with inputs x1, x2, x3 and output y]

  • relatively poor performance
  • fiddly hyperparameters, particularly with multiple layers
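To make the update concrete: below is a minimal sketch of one SGD step for logistic regression on a single arriving instance. This illustrates the general technique, not MOA's implementation; the log-loss objective and the learning rate eta are assumptions.

```python
import numpy as np

def sgd_update(w, x, y, eta=0.01):
    """One incremental SGD step for logistic regression (log-loss).

    w   : current weight vector, shape (d,)
    x   : feature vector of the arriving instance, shape (d,)
    y   : true label in {0, 1}
    eta : learning rate (one of the 'fiddly' hyperparameters)
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x)))  # predicted P(y = 1 | x)
    grad = (p - y) * x                  # gradient of log-loss w.r.t. w
    return w - eta * grad               # step against the gradient

# each instance is seen exactly once, in arrival order
w = np.zeros(3)
for x, y in [(np.array([1.0, 0.2, -0.5]), 1),
             (np.array([0.1, 1.3, 0.4]), 0)]:
    w = sgd_update(w, x, y)
```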

SLIDE 7

Hoeffding Trees

Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on

[Figure: example decision tree with splits on x1 (>0.3 vs ≤0.3), x3 (>−2.9 vs ≤−2.9), and x2 (=A vs =B); leaves predict y]

SLIDE 8

Hoeffding Trees

Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on

[Figure: example decision tree with splits on x1 (>0.3 vs ≤0.3), x3 (>−2.9 vs ≤−2.9), and x2 (=A vs =B); leaves predict y]

  • tree may grow conservatively, although using Naive Bayes at the leaves mitigates this
  • need to start again when the concept changes
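For reference, the Hoeffding bound says that after n observations of a variable with range R, the observed mean is within ε = √(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. Below is a sketch of how a split decision can use this; the function names are illustrative, not MOA's API.

```python
import math

def hoeffding_bound(R, delta, n):
    """Deviation epsilon such that, with probability 1 - delta, the
    true mean of a variable with range R lies within epsilon of the
    mean observed over n examples."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n):
    """Split once the best attribute's gain beats the runner-up by
    more than the bound: we are then confident the choice would not
    change if we waited for more data."""
    return (g_best - g_second) > hoeffding_bound(R, delta, n)

# e.g. information gain on a 2-class problem has range R = 1 bit
print(should_split(0.30, 0.20, R=1.0, delta=1e-7, n=2000))  # True
```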

SLIDE 9

k-Nearest Neighbours (kNN)

maintain a dynamic buffer of instances, compare each test instance to these

[Figure: buffered instances plotted in a 2-D feature space (x1, x2), classes c1–c6, with a test instance ‘?’ to be classified]

SLIDE 10

k-Nearest Neighbours (kNN)

maintain a dynamic buffer of instances, compare each test instance to these

[Figure: buffered instances plotted in a 2-D feature space (x1, x2), classes c1–c6, with a test instance ‘?’ to be classified]

  • automatically adapts to concept drift
  • limited buffer size; sensitive to the number of attributes
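A minimal sketch of this buffered kNN, assuming a fixed window size and Euclidean distance (MOA's implementation differs in the details):

```python
from collections import Counter, deque
import numpy as np

class WindowKNN:
    """kNN over a sliding window of the most recent instances.

    Evicting the oldest instances is what gives the automatic
    adaptation to drift; the bounded buffer is also the weakness
    noted above."""
    def __init__(self, k=3, window=1000):
        self.k = k
        self.buffer = deque(maxlen=window)   # (x, y) pairs, oldest evicted

    def learn_one(self, x, y):
        self.buffer.append((np.asarray(x, dtype=float), y))

    def predict_one(self, x):
        if not self.buffer:                  # nothing seen yet
            return None
        x = np.asarray(x, dtype=float)
        nearest = sorted(self.buffer,
                         key=lambda item: np.linalg.norm(item[0] - x))
        votes = Counter(y for _, y in nearest[:self.k])
        return votes.most_common(1)[0][0]
```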

SLIDE 11

Preliminary Comparison

Table: Results under prequential evaluation

Dataset      #Att.  #Ins.      HT    SGD   kNN   LB-HT
Electricity  8      45,312     79.2  57.6  78.4  89.8
CoverType    54     581,012    80.3  60.7  92.2  91.7
SUSY         8      5,000,000  78.2  76.5  67.5  78.7

  • kNN performs relatively poorly on larger streams
  • Hoeffding Tree works well, particularly in the Leveraging Bagging ensemble (LB-HT)
  • SGD performs poorly overall, less so on larger streams
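Prequential evaluation here means interleaved test-then-train: each arriving instance is first used to test the model and only then to train it, so the whole stream doubles as a test set. A sketch, assuming the learn_one/predict_one interface used in the other sketches:

```python
def prequential_accuracy(model, stream):
    """Interleaved test-then-train over a one-pass stream.

    model  : anything with predict_one(x) and learn_one(x, y)
    stream : iterable of (x, y) pairs, consumed exactly once
    """
    correct = total = 0
    for x, y in stream:
        if model.predict_one(x) == y:   # test on the instance first...
            correct += 1
        model.learn_one(x, y)           # ...then train on it
        total += 1
    return correct / total
```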

SLIDE 12

A Modular View of Data Stream Classification

[Figure: a layered pipeline; input attributes x1…x5 are mapped to intermediate features z1…z4, which feed base models whose outputs y(1), y(2), y(3) are combined into the final prediction y]

e.g., x → filter → ensemble of HT → filter → SGD → y
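A hedged sketch of the plumbing behind this modular view: filters and classifiers share a streaming interface, so a pipeline is just a composition. The transform_one/learn_one/predict_one names are assumptions for illustration, not MOA's actual API.

```python
class Pipeline:
    """Chain of filters ending in a classifier; every stage sees
    each instance exactly once, in arrival order."""
    def __init__(self, *filters, classifier):
        self.filters = filters
        self.classifier = classifier

    def _transform(self, x):
        for f in self.filters:
            x = f.transform_one(x)      # e.g. a random projection
        return x

    def learn_one(self, x, y):
        self.classifier.learn_one(self._transform(x), y)

    def predict_one(self, x):
        return self.classifier.predict_one(self._transform(x))
```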

SLIDE 13

Random Projections as a Filter

  • Back-propagation is often too fiddly for data streams
  • Can do a random projection instead

[Figure: a random projection mapping the input attributes to a set of random features]

Not a new idea, but it fits nicely into MOA: implemented as a filter, it can be used with any classifier.
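A minimal sketch of such a filter, assuming Gaussian projection weights and an optional tanh squashing (one common choice of random feature function; the paper may use others):

```python
import numpy as np

class RandomProjectionFilter:
    """Maps d input attributes to h random features: z = phi(W x).

    W is drawn once at random and never trained, which is what
    makes this cheap enough for streams (no back-propagation)."""
    def __init__(self, d, h, nonlinear=True, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((h, d)) / np.sqrt(d)  # fixed weights
        self.nonlinear = nonlinear

    def transform_one(self, x):
        z = self.W @ np.asarray(x, dtype=float)
        return np.tanh(z) if self.nonlinear else z

# composes with the Pipeline sketch above, e.g. for an 8-attribute stream:
# model = Pipeline(RandomProjectionFilter(d=8, h=64), classifier=WindowKNN())
```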

SLIDE 14

Results: Random Projections

[Figure: accuracy vs. the ratio h/d (log scale, 10^1 to 10^3) on ELEC and SUSY for HT, HTf, kNN, kNNf, SGD, SGDf]

  • SGD responds best to a randomly projected input space
  • For kNN, it depends on the stream
  • No advantage for HT; however, we can put, for example, kNN or SGDf (‘filtered’ SGD) in the leaves of HT (a toy sketch follows)
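To picture 'a model in the leaves': the tree routes each instance to a leaf, and a per-leaf incremental model does the actual learning and prediction there. A toy sketch with a fixed routing function standing in for the grown tree (a real Hoeffding Tree also splits and grows, which this omits), reusing the WindowKNN sketch from earlier:

```python
from collections import defaultdict

class LeafModelTree:
    """Toy stand-in for HT-kNN / HT-SGDf: each leaf owns its own
    incremental model, trained only on the instances routed to it."""
    def __init__(self, route, make_model):
        self.route = route                     # function: x -> leaf id
        self.leaves = defaultdict(make_model)  # leaf id -> fresh model

    def learn_one(self, x, y):
        self.leaves[self.route(x)].learn_one(x, y)

    def predict_one(self, x):
        return self.leaves[self.route(x)].predict_one(x)

# e.g. route on the first attribute, with a kNN model in each leaf
tree = LeafModelTree(route=lambda x: x[0] > 0.3,
                     make_model=lambda: WindowKNN(k=3, window=500))
```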
SLIDE 15

Figure: SUSY, first 50,000 examples, in 100 windows.

[Plot: accuracy per window for HT, kNN, kNNf, SGD, SGDf, HT-SGD, HT-SGDf, HT-kNN, LB-HT, LB-SGDf]

  • SGDf works well on larger data streams
  • …also at the leaves of a Hoeffding Tree (HT-SGDf)
  • the filter helps kNN (but is only worth it on smaller streams)

SLIDE 16

Figure: SUSY, 5,000,000 examples, in 100 windows.

[Plot: accuracy per window for HT, kNN, kNNf, SGD, SGDf, HT-SGD, HT-SGDf, HT-kNN, LB-HT, LB-SGDf]

  • SGDf works well on larger data streams
  • …also at the leaves of a Hoeffding Tree (HT-SGDf)
  • the filter helps kNN (but is only worth it on smaller streams)

SLIDE 17

Other results (see paper for details):

  • HT-kNN (kNN in the leaves) often outperforms the benchmark HT(-NB)
  • HT-kNN is competitive with LB-HT (tied best overall), which uses an ensemble of 10 HTs

Table: Running time (s) on SUSY over 5,000,000 examples

Method    Time (s)
SGD       25
SGDf      118
kNN       1464
kNNf      4714
HT        45
HT-SGDf   159
HT-kNN    1428
LB-HT     530
LB-SGDf   1040

SLIDE 18

Summary

  • A modular view of data stream classification allows a large number of method combinations to be trialled
  • New competitive combinations were found
  • Random feature functions can work well in a number of different contexts
  • No extra parameter calibration is necessary for these methods
  • Future work: prune and replace nodes over time

SLIDE 19

Data Stream Classification using Random Feature Functions and Novel Method Combinations

Jesse Read(1), Albert Bifet(2)

Department of Information and Computer Science (1) Aalto University and HIIT, Finland (2) Huawei Noah’s Ark Lab, Hong Kong

August 21, 2015