SLIDE 1

Data Stream Classification using Random Feature Functions and Novel Method Combinations

Jesse Read(1), Albert Bifet(2)

Department of Information and Computer Science (1) Aalto University and HIIT, Finland (2) Huawei Noah’s Ark Lab, Hong Kong

August 21, 2015

SLIDE 2

Classification in Data Streams

Setting:

  • sequence is potentially infinite
  • high speed of arrival
  • stream is one-way, can’t ‘go back’

Implications:

  • memory is limited
  • adapt to concept drift

SLIDE 3

Classification in Data Streams

SLIDE 4

Methods for Data Streams

  • Naive Bayes
  • Stochastic Gradient Descent
  • Incremental Decision Trees (Hoeffding Tree)
  • Lazy/prototype methods (e.g., kNN)
  • Batch learning (e.g., SVMs, decision trees)

SLIDE 5

Stochastic Gradient Descent (SGD)

update weights incrementally

[Figure: model with inputs x1, x2, x3 and output y]

SLIDE 6

Stochastic Gradient Descent (SGD)

update weights incrementally

[Figure: model with inputs x1, x2, x3 and output y]

  • relatively poor performance
  • fiddly hyperparameters, particularly with multiple layers
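To make the update concrete: below is a minimal sketch of one SGD step for logistic regression on a single arriving instance. This illustrates the general technique, not MOA's implementation; the log-loss objective and the learning rate eta are assumptions.

```python
import numpy as np

def sgd_update(w, x, y, eta=0.01):
    """One incremental SGD step for logistic regression (log-loss).

    w   : current weight vector, shape (d,)
    x   : feature vector of the arriving instance, shape (d,)
    y   : true label in {0, 1}
    eta : learning rate (one of the 'fiddly' hyperparameters)
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x)))  # predicted P(y = 1 | x)
    grad = (p - y) * x                  # gradient of log-loss w.r.t. w
    return w - eta * grad               # step against the gradient

# each instance is seen exactly once, in arrival order
w = np.zeros(3)
for x, y in [(np.array([1.0, 0.2, -0.5]), 1),
             (np.array([0.1, 1.3, 0.4]), 0)]:
    w = sgd_update(w, x, y)
```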

SLIDE 7

Hoeffding Trees

Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on

[Figure: example decision tree with splits on x1 (>0.3 vs ≤0.3), x3 (>−2.9 vs ≤−2.9), and x2 (=A vs =B); leaves predict y]

SLIDE 8

Hoeffding Trees

Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on

[Figure: example decision tree with splits on x1 (>0.3 vs ≤0.3), x3 (>−2.9 vs ≤−2.9), and x2 (=A vs =B); leaves predict y]

  • tree may grow conservatively, although using Naive Bayes at the leaves mitigates this
  • need to start again when the concept changes
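For reference, the Hoeffding bound says that after n observations of a variable with range R, the observed mean is within ε = √(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. Below is a sketch of how a split decision can use this; the function names are illustrative, not MOA's API.

```python
import math

def hoeffding_bound(R, delta, n):
    """Deviation epsilon such that, with probability 1 - delta, the
    true mean of a variable with range R lies within epsilon of the
    mean observed over n examples."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n):
    """Split once the best attribute's gain beats the runner-up by
    more than the bound: we are then confident the choice would not
    change if we waited for more data."""
    return (g_best - g_second) > hoeffding_bound(R, delta, n)

# e.g. information gain on a 2-class problem has range R = 1 bit
print(should_split(0.30, 0.20, R=1.0, delta=1e-7, n=2000))  # True
```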

SLIDE 9

k-Nearest Neighbours (kNN)

maintain a dynamic buffer of instances, compare each test instance to these

[Figure: buffered instances plotted in a 2-D feature space (x1, x2), classes c1–c6, with a test instance ‘?’ to be classified]

SLIDE 10

k-Nearest Neighbours (kNN)

maintain a dynamic buffer of instances, compare each test instance to these

[Figure: buffered instances plotted in a 2-D feature space (x1, x2), classes c1–c6, with a test instance ‘?’ to be classified]

  • automatically adapts to concept drift
  • limited buffer size; sensitive to the number of attributes
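A minimal sketch of this buffered kNN, assuming a fixed window size and Euclidean distance (MOA's implementation differs in the details):

```python
from collections import Counter, deque
import numpy as np

class WindowKNN:
    """kNN over a sliding window of the most recent instances.

    Evicting the oldest instances is what gives the automatic
    adaptation to drift; the bounded buffer is also the weakness
    noted above."""
    def __init__(self, k=3, window=1000):
        self.k = k
        self.buffer = deque(maxlen=window)   # (x, y) pairs, oldest evicted

    def learn_one(self, x, y):
        self.buffer.append((np.asarray(x, dtype=float), y))

    def predict_one(self, x):
        if not self.buffer:                  # nothing seen yet
            return None
        x = np.asarray(x, dtype=float)
        nearest = sorted(self.buffer,
                         key=lambda item: np.linalg.norm(item[0] - x))
        votes = Counter(y for _, y in nearest[:self.k])
        return votes.most_common(1)[0][0]
```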

SLIDE 11

Preliminary Comparison

Table: Results under prequential evaluation

Dataset      #Att.  #Ins.      HT    SGD   kNN   LB-HT
Electricity  8      45,312     79.2  57.6  78.4  89.8
CoverType    54     581,012    80.3  60.7  92.2  91.7
SUSY         8      5,000,000  78.2  76.5  67.5  78.7

  • kNN performs relatively poorly on larger streams
  • Hoeffding Tree works well, particularly in the Leveraging Bagging ensemble (LB-HT)
  • SGD performs poorly overall, less so on larger streams
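Prequential evaluation here means interleaved test-then-train: each arriving instance is first used to test the model and only then to train it, so the whole stream doubles as a test set. A sketch, assuming the learn_one/predict_one interface used in the other sketches:

```python
def prequential_accuracy(model, stream):
    """Interleaved test-then-train over a one-pass stream.

    model  : anything with predict_one(x) and learn_one(x, y)
    stream : iterable of (x, y) pairs, consumed exactly once
    """
    correct = total = 0
    for x, y in stream:
        if model.predict_one(x) == y:   # test on the instance first...
            correct += 1
        model.learn_one(x, y)           # ...then train on it
        total += 1
    return correct / total
```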

SLIDE 12

A Modular View of Data Stream Classification

[Figure: a layered pipeline; input attributes x1…x5 are mapped to intermediate features z1…z4, which feed base models whose outputs y(1), y(2), y(3) are combined into the final prediction y]

e.g., x → filter → ensemble of HT → filter → SGD → y
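A hedged sketch of the plumbing behind this modular view: filters and classifiers share a streaming interface, so a pipeline is just a composition. The transform_one/learn_one/predict_one names are assumptions for illustration, not MOA's actual API.

```python
class Pipeline:
    """Chain of filters ending in a classifier; every stage sees
    each instance exactly once, in arrival order."""
    def __init__(self, *filters, classifier):
        self.filters = filters
        self.classifier = classifier

    def _transform(self, x):
        for f in self.filters:
            x = f.transform_one(x)      # e.g. a random projection
        return x

    def learn_one(self, x, y):
        self.classifier.learn_one(self._transform(x), y)

    def predict_one(self, x):
        return self.classifier.predict_one(self._transform(x))
```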

SLIDE 13

Random Projections as a Filter

  • Back-propagation is often too fiddly for data streams
  • Can do a random projection instead

[Figure: a random projection mapping the input attributes to a set of random features]

Not a new idea, but it fits nicely into MOA: implemented as a filter, it can be used with any classifier.
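A minimal sketch of such a filter, assuming Gaussian projection weights and an optional tanh squashing (one common choice of random feature function; the paper may use others):

```python
import numpy as np

class RandomProjectionFilter:
    """Maps d input attributes to h random features: z = phi(W x).

    W is drawn once at random and never trained, which is what
    makes this cheap enough for streams (no back-propagation)."""
    def __init__(self, d, h, nonlinear=True, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((h, d)) / np.sqrt(d)  # fixed weights
        self.nonlinear = nonlinear

    def transform_one(self, x):
        z = self.W @ np.asarray(x, dtype=float)
        return np.tanh(z) if self.nonlinear else z

# composes with the Pipeline sketch above, e.g. for an 8-attribute stream:
# model = Pipeline(RandomProjectionFilter(d=8, h=64), classifier=WindowKNN())
```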

SLIDE 14

Results: Random Projections

[Figure: accuracy vs. the ratio h/d (log scale, 10^1 to 10^3) on ELEC and SUSY for HT, HTf, kNN, kNNf, SGD, SGDf]

  • SGD responds best to a randomly projected input space
  • For kNN, it depends on the stream
  • No advantage for HT; however, we can put, for example, kNN or SGDf (‘filtered’ SGD) in the leaves of HT (a toy sketch follows)
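To picture 'a model in the leaves': the tree routes each instance to a leaf, and a per-leaf incremental model does the actual learning and prediction there. A toy sketch with a fixed routing function standing in for the grown tree (a real Hoeffding Tree also splits and grows, which this omits), reusing the WindowKNN sketch from earlier:

```python
from collections import defaultdict

class LeafModelTree:
    """Toy stand-in for HT-kNN / HT-SGDf: each leaf owns its own
    incremental model, trained only on the instances routed to it."""
    def __init__(self, route, make_model):
        self.route = route                     # function: x -> leaf id
        self.leaves = defaultdict(make_model)  # leaf id -> fresh model

    def learn_one(self, x, y):
        self.leaves[self.route(x)].learn_one(x, y)

    def predict_one(self, x):
        return self.leaves[self.route(x)].predict_one(x)

# e.g. route on the first attribute, with a kNN model in each leaf
tree = LeafModelTree(route=lambda x: x[0] > 0.3,
                     make_model=lambda: WindowKNN(k=3, window=500))
```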
SLIDE 15

Figure: SUSY, first 50,000 examples, in 100 windows.

[Plot: accuracy per window for HT, kNN, kNNf, SGD, SGDf, HT-SGD, HT-SGDf, HT-kNN, LB-HT, LB-SGDf]

  • SGDf works well on larger data streams
  • …also at the leaves of a Hoeffding Tree (HT-SGDf)
  • the filter helps kNN (but is only worth it on smaller streams)

SLIDE 16

Figure: SUSY, 5,000,000 examples, in 100 windows.

[Plot: accuracy per window for HT, kNN, kNNf, SGD, SGDf, HT-SGD, HT-SGDf, HT-kNN, LB-HT, LB-SGDf]

  • SGDf works well on larger data streams
  • …also at the leaves of a Hoeffding Tree (HT-SGDf)
  • the filter helps kNN (but is only worth it on smaller streams)

SLIDE 17

Other results (see paper for details):

  • HT-kNN (kNN in the leaves) often outperforms the benchmark HT(-NB)
  • HT-kNN is competitive with LB-HT (tied best overall), which uses an ensemble of 10 HTs

Table: Running time (s) on SUSY over 5,000,000 examples

Method    Time (s)
SGD       25
SGDf      118
kNN       1464
kNNf      4714
HT        45
HT-SGDf   159
HT-kNN    1428
LB-HT     530
LB-SGDf   1040

SLIDE 18

Summary

  • A modular view of data stream classification allows a large number of method combinations to be trialled
  • New competitive combinations were found
  • Random feature functions can work well in a number of different contexts
  • No extra parameter calibration is necessary for these methods
  • Future work: prune and replace nodes over time

SLIDE 19

Data Stream Classification using Random Feature Functions and Novel Method Combinations

Jesse Read(1), Albert Bifet(2)

Department of Information and Computer Science (1) Aalto University and HIIT, Finland (2) Huawei Noah’s Ark Lab, Hong Kong

August 21, 2015