

SLIDE 1

Online machine learning with decision trees

Max Halford

University of Toulouse

Thursday 7th May, 2020

SLIDE 2

Decision trees

“Most successful general-purpose algorithm in modern times.” [HB12]

  • Sub-divide a feature space into partitions
  • Non-parametric and robust to noise
  • Allow both numeric and categorical features
  • Can be regularised in different ways
  • Good weak learners for bagging and boosting [Bre96]
  • See [BS16] for a modern review
  • Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]

Alas, they assume that the data can be scanned more than once, and thus can’t be used in an online context.

SLIDE 3

Toy example: the banana dataset¹

Figure: the banana dataset in the (x1, x2) plane. Panels: training set; decision function with 1 tree; decision function with 10 trees.

¹ Banana dataset on OpenML

SLIDE 4

Online (supervised) machine learning

Model learns from samples (x, y) ∈ ℝᵖ × ℝᵏ which arrive in sequence.

Online != out-of-core:

  • Online: samples are only seen once
  • Out-of-core: samples can be revisited

Progressive validation [BKL99]: ŷ can be obtained right before y is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation! (a sketch follows below)

Ideally, concept drift [GŽB+14] should be taken into account:

  • 1. Virtual drift: P(X) changes
  • 2. Real drift: P(Y ∣ X) changes
    ▶ Example: many 0s with sporadic bursts of 1s
    ▶ Example: a feature’s importance changes through time
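A minimal sketch of the progressive validation loop, in Python. The model and metric method names are hypothetical placeholders rather than a specific library’s API; the point is only that every sample is scored before the model trains on it.

    # Progressive validation: score each sample *before* learning from it,
    # so the stream doubles as a validation set.
    def progressive_validation(stream, model, metric):
        for x, y in stream:
            y_pred = model.predict_one(x)  # predict before seeing the label
            metric.update(y, y_pred)       # score against the true label
            model.learn_one(x, y)          # only then learn from the sample
        return metric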

SLIDE 5

Online decision trees

  • A decision tree involves enumerating split candidates
  • Each split is evaluated by scanning the data
  • This can’t be done online without storing data
  • Two approaches to circumvent this:

  • 1. Store and update feature distributions
  • 2. Build the trees without looking at the data (!!)

Bagging and boosting can be done online [OR01]

SLIDE 6

Consistency

  • Trees fall under the non-parametric regression framework
  • Goal: estimate a regression function g(x) = 𝔼[Y ∣ X = x]
  • We estimate g with an approximation gₙ trained with n samples
  • gₙ is consistent if 𝔼[(gₙ(X) − g(X))²] → 0 as n → +∞
  • Ideally, we also want our estimator to be unbiased
  • We also want regularisation mechanisms in order to generalise
  • Somewhat orthogonal to concept drift handling

SLIDE 7

Hoeffding trees

  • Split thresholds t are chosen by minimising an impurity criterion
  • The impurity looks at the distribution of Y in each child
  • An impurity criterion thus depends on P(Y ∣ X < t)
  • P(Y ∣ X < t) can be obtained via Bayes’ rule:

    P(Y ∣ X < t) = P(X < t ∣ Y) × P(Y) / P(X < t)

  • For classification, assuming X is numeric (see the sketch after this list):

  • P(Y) is a counter
  • P(X < t) can be represented with a histogram
  • P(X < t | Y) can be represented with one histogram per class
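As a rough illustration, here is one way these three quantities can be maintained online for a single numeric feature. The fixed-width histogram over an assumed [x_min, x_max] range is a simplification; real implementations use adaptive summaries.

    import collections

    # Sketch: online split statistics for one numeric feature.
    class SplitStats:

        def __init__(self, n_bins=10, x_min=0.0, x_max=1.0):
            self.n_bins, self.x_min, self.x_max = n_bins, x_min, x_max
            self.class_counts = collections.Counter()                   # P(Y)
            self.hists = collections.defaultdict(lambda: [0] * n_bins)  # P(X < t | Y)

        def _bin(self, x):
            i = int((x - self.x_min) / (self.x_max - self.x_min) * self.n_bins)
            return min(max(i, 0), self.n_bins - 1)

        def update(self, x, y):
            self.class_counts[y] += 1
            self.hists[y][self._bin(x)] += 1

        def p_y_given_x_lt_t(self, t, y):
            # Bayes' rule: P(Y | X < t) = P(X < t | Y) P(Y) / P(X < t)
            n = sum(self.class_counts.values())
            if n == 0:
                return 0.0
            p_y = self.class_counts[y] / n
            b = self._bin(t)
            p_x_lt_t_given_y = sum(self.hists[y][:b]) / max(self.class_counts[y], 1)
            p_x_lt_t = sum(sum(h[:b]) for h in self.hists.values()) / n
            return p_x_lt_t_given_y * p_y / max(p_x_lt_t, 1e-12)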

SLIDE 8

Hoeffding tree construction algorithm

  • A Hoeffding tree starts off as a leaf
  • P(Y), P(X < t), and P(X < t ∣ Y) are updated every time a sample arrives
  • Every so often, we enumerate some candidate splits and evaluate them
  • The best split is chosen if it is significantly better than the second best split
  • Significance is determined by the Hoeffding bound (sketched below)
  • Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
  • Introduced in [DH00]
  • Many variants, including revisiting split decisions when drift occurs [HSD01]
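The Hoeffding bound itself is short: after n observations of a statistic with range R, the true mean lies within a computable margin of the empirical mean with probability 1 − delta. A sketch of the split test:

    import math

    # Hoeffding bound: with probability 1 - delta, the true mean of a
    # statistic with range R differs from its empirical mean over n
    # observations by less than this margin.
    def hoeffding_bound(R, delta, n):
        return math.sqrt(R ** 2 * math.log(1 / delta) / (2 * n))

    # The leaf splits once the observed gap between the best and second
    # best candidate exceeds the bound:
    #     if best_gain - second_best_gain > hoeffding_bound(R, delta, n): split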

SLIDE 9

Hoeffding trees on the banana dataset

Figure: Hoeffding tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 10

Mondrian trees

  • Construction follows a Mondrian process [RT+08]
  • Split features and points are chosen without considering their predictive power
  • Hierarchical averaging is used to smooth leaf values
  • First introduced in [LRT14]
  • Improved in [MGS19]

Figure: Composition A by Piet Mondrian

SLIDE 11

The Mondrian process

  • Let uₖ and lₖ be the upper and lower bounds of feature k in a cell
  • Sample E ∼ Exp(∑ₖ₌₁ᵖ (uₖ − lₖ))
  • Split if E < λ
  • The chances of splitting decrease as the cells get smaller
  • λ acts as a soft maximum depth parameter
  • Features are chosen with probability proportional to uₖ − lₖ
  • More information in these slides
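A sketch of one recursion step, following the standard Mondrian process formulation (the bound arrays l, u and the lifetime budget are assumptions of this sketch, not any library’s API):

    import random

    # One step of the Mondrian process over a cell with bounds l[k], u[k].
    def mondrian_split(l, u, budget, rng=random):
        extents = [u[k] - l[k] for k in range(len(l))]
        rate = sum(extents)
        if rate == 0:
            return None                        # degenerate cell: no split
        E = rng.expovariate(rate)              # E ~ Exp(sum of side lengths)
        if E > budget:                         # small cells draw large E and
            return None                        # exhaust the lifetime budget
        k = rng.choices(range(len(l)), weights=extents)[0]  # feature ∝ extent
        t = rng.uniform(l[k], u[k])            # split point uniform in range
        return k, t, budget - E                # children recurse on the rest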

SLIDE 12

Mondrian trees on the banana dataset

Figure: Mondrian tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 13

Aggregated Mondrian trees on the banana dataset

Figure: aggregated Mondrian tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 14

Purely random trees

  • Features x are assumed to lie in [0, 1]ᵖ
  • Trees are constructed independently of the data, before it even arrives:

  • 1. Pick a feature at random
  • 2. Pick a split point at random
  • 3. Repeat until desired depth is reached

  • When a sample reaches a leaf, said leaf’s running average is updated (see the sketch after this list)
  • Easier to analyse because the tree structure doesn’t depend on Y
  • Consistency depends on:

  • 1. The height of a tree – denoted ℎ
  • 2. The amount of features that are “relevant”

  • Bias analysis performed in [AG14]
  • Word of caution: this is different from extremely randomised trees [GEW06]
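A minimal sketch of the idea, assuming features in [0, 1]ᵖ: the structure is drawn up front, and only the leaf averages ever touch the data.

    import random

    # Build the tree before any data arrives: random feature, random split.
    def build_tree(height, n_features, rng=random):
        if height == 0:
            return {"n": 0, "mean": 0.0}           # leaf with a running average
        return {
            "feature": rng.randrange(n_features),  # 1. pick a feature at random
            "threshold": rng.random(),             # 2. pick a split in [0, 1]
            "left": build_tree(height - 1, n_features, rng),   # 3. recurse
            "right": build_tree(height - 1, n_features, rng),
        }

    # Learning only updates the running average of the leaf a sample reaches.
    def learn_one(tree, x, y):
        node = tree
        while "feature" in node:
            node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
        node["n"] += 1
        node["mean"] += (y - node["mean"]) / node["n"]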

SLIDE 15

Uniform random trees

  • Features and split points are chosen completely at random
  • Let h be the height of the tree
  • Consistent when h → +∞ and h/n → 0 as n → +∞ [BDL08]

SLIDE 16

Uniform random trees

Figure: an example uniform random tree whose internal nodes split on x1, x2, x1 and x1.

SLIDE 17

Uniform random trees on the banana dataset

Figure: uniform random tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 18

Centered random trees

  • Features are chosen completely at random
  • Split points are the mid-points of a feature’s current range
  • Consistent when h → +∞ and 2ʰ/n → 0 as n → +∞ [Sco16]

SLIDE 19

Centered random trees

Figure: an example centered random tree whose internal nodes split on x1, x2, x1 and x1.

SLIDE 20

Centered random trees on the banana dataset

Figure: centered random tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 21

How about a compromise?

  • Let [a, b] be the chosen feature’s current range
  • Choose ε ∈ [0, 1/2]
  • Sample the split point t uniformly in [a + ε(b − a), b − ε(b − a)]
  • ε = 0 ⟹ t ∈ [a, b] (uniform)
  • ε = 1/2 ⟹ t = (a + b)/2 (centered)

Figure: an example padded tree on (x1, x2) with ε = 0.2.
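In code the compromise is tiny; a sketch, with a and b the current bounds of the chosen feature:

    import random

    # Padded split: eps = 0 recovers uniform splits, eps = 1/2 centered ones.
    def padded_split(a, b, eps, rng=random):
        pad = eps * (b - a)
        return rng.uniform(a + pad, b - pad)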

SLIDE 22

Some examples

Figure: example padded trees on (x1, x2) with ε = 0.1, ε = 0.25 and ε = 0.4.

SLIDE 23

Banana dataset with 𝜀 = 0.2

Figure: padded tree (ε = 0.2) decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 24

Impact of ε on performance

Figure: log loss as a function of ε ∈ [0, 0.5], for tree heights 1, 3, 5, 7 and 9.

SLIDE 25

Tree regularisation

A decision tree overfits when its leaves contain too few samples. There are many popular ways to regularise trees:

  • 1. Set a lower limit on the number of samples in each leaf
  • 2. Limit the maximum depth
  • 3. Discard irrelevant nodes after training (pruning)

None of these are designed to take into account the streaming aspect of online decision trees.

SLIDE 26

Hierarchical smoothing

  • Intuition: a leaf doesn’t contain enough samples... but its ancestors might!
  • Let H(x) be the nodes that go from the root to the leaf for a sample x
  • Curtailment [ZE01]: use the first node in H(x) with at least k samples, walking back from the leaf (sketched below)
  • Aggregated Mondrian trees [MGS19] use context tree weighting
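A sketch of curtailment, where path is assumed to hold the nodes of H(x) ordered from root to leaf, and the node attributes are hypothetical placeholders:

    # Curtailment: predict with the first node that has at least k samples,
    # walking from the leaf back towards the root.
    def curtailed_prediction(path, k):
        for node in reversed(path):    # leaf first, root last
            if node.n_samples >= k:
                return node.prediction
        return path[0].prediction      # fall back to the root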

SLIDE 27

A simple averaging scheme

Idea: make each node in H(x) contribute to a weighted average. Let

  • k be the number of samples in a node
  • d be the depth of a node

Then, the contribution of each node is weighted by:

w = k × (1 + δ)ᵈ

  • The more samples a node contains, the more it matters
  • The deeper a node is, the more it matters
  • δ ∈ ℝ controls the relative importance of both values
  • I like to call this path averaging
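A sketch of the resulting prediction, with path again holding H(x) from the root (depth 0) down to the leaf:

    # Path averaging: every node on the path contributes its prediction,
    # weighted by w = k * (1 + delta) ** d.
    def path_average(path, delta):
        num = den = 0.0
        for d, node in enumerate(path):            # d is the node's depth
            w = node.n_samples * (1 + delta) ** d
            num += w * node.prediction
            den += w
        return num / den if den else 0.0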

SLIDE 28

Averaging on the banana dataset

Figure: path averaging on the banana dataset. Axes: x1, x2. Panels: leaf predictions; smoothing with δ = 5; smoothing with δ = 20.

Notice what happens in the corners.

SLIDE 29

Impact of 𝛿 on predictive performance

Figure: log loss as a function of δ ∈ [3, 18], compared against raw leaf predictions.

SLIDE 30

Dealing with concept drift

  • Each node contains a running average of the y values it has seen
  • Instead, we can maintain an exponentially weighted moving average (EWMA):

    ȳₜ = β yₜ + (1 − β) ȳₜ₋₁

  • β determines the influence of the most recent values
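Swapping the running mean for an EWMA is a small change; a sketch:

    # Exponentially weighted moving average: recent y values dominate,
    # which lets node statistics track a drifting target.
    class EWMA:

        def __init__(self, beta):
            self.beta, self.value = beta, None

        def update(self, y):
            if self.value is None:
                self.value = y     # initialise with the first observation
            else:
                self.value = self.beta * y + (1 - self.beta) * self.value
            return self.value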

SLIDE 31

Hard drift: flip y values after 2000 samples

Figure: log loss over 4000 samples. Series: no smoothing; β = 0.3; β = 0.7.

SLIDE 32

Soft drift: slowly rotate samples around the barycenter

Figure: log loss over 4000 samples. Series: no smoothing; β = 0.3; β = 0.7.

SLIDE 33

Feature selection

Final paragraph from [MGS19]:

“A limitation of AMF, however, is that it does not perform feature selection. It would be interesting to develop an online feature selection procedure that could indicate along which coordinates the splits should be sampled in Mondrian trees, and prove that such a procedure performs dimension reduction in some sense. This is a challenging question in the context of online learning which deserves future investigations.”

Online feature selection is a difficult problem!

SLIDE 34

A solution?

  • 1. Initially, we don’t know the importance of each feature, so we pick them at random
  • 2. After some time, we can measure the quality of each split within each tree
  • 3. We can derive the feature importances from the splits each feature participates in
  • 4. Every so often we can build a new tree by sampling features relative to their importances (sketched below)
  • 5. The selection probabilities should be conditioned on the features already chosen

This is still work in progress, but there is hope.
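Since this is work in progress, the following only illustrates the weighted sampling of step 4; the importance estimates themselves (step 3) are assumed to exist:

    import random

    # Step 4, sketched: draw a split feature in proportion to measured
    # importances, given e.g. {"x1": 0.7, "x2": 0.3}.
    def sample_feature(importances, rng=random):
        features = list(importances)
        weights = [importances[f] for f in features]
        return rng.choices(features, weights=weights)[0]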

SLIDE 35

Parameters recap

  • n: number of trees
  • h: height of each tree
  • ε: the amount of padding
  • δ: determines how the path averaging works
  • β: exponentially weighted moving average parameter

SLIDE 36

Some useful Python libraries

  • scikit-garden – Mondrian trees
  • onelearn – Aggregated Mondrian trees
  • scikit-multiflow – Hoeffding trees
  • scikit-learn – General-purpose batch machine learning
  • creme – General-purpose online machine learning

SLIDE 37

Train/test benchmarks

Model                                  Moons   Noisy linear   Higgs   Higgs*
Batch log reg                          .324    .244           .640    .677
Batch log reg with Fourier features    .193    .213           .698    .641
Batch random forest                    .225    .210           .615    .639
NN with 2 layers                       .171    .196           .653    .637
Online log reg                         .334    .323           .662    .677
Mondrian forest                        .349    .316           .692    .905
Aggregated Mondrian forest             .205    .199           .671    .649
Hoeffding forest                       .330    .258           .664    .649
Padded trees (us)                      .185    .193           .678    .644

SLIDE 38

Streaming benchmarks

Work in progress!

SLIDE 39

Slides are available at maxhalford.github.io/slides/online-decision-trees.pdf

Feedback is more than welcome. Stay safe!

SLIDE 40

References

Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203–208, 1999.

SLIDE 41

References

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 785–794, 2016.

SLIDE 42

References

Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pages 71–80, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

SLIDE 43

References

Jeremy Howard and Mike Bowles. The two most important algorithms in predictive modeling today. In Strata Conference presentation, February, volume 28, 2012.

Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pages 97–106, 2001.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

SLIDE 44

References

Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. AMF: Aggregated Mondrian forests for online learning. arXiv preprint arXiv:1906.10529, 2019.

Nikunj Chandrakant Oza and Stuart Russell. Online ensemble learning. University of California, Berkeley, 2001.

SLIDE 45

References

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

SLIDE 46

References

Daniel M. Roy, Yee Whye Teh, et al. The Mondrian process. In NIPS, pages 1377–1384, 2008.

Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.
