

SLIDE 1

Online machine learning with decision trees

Max Halford

University of Toulouse

Thursday 7th May, 2020

SLIDE 2

Decision trees

“Most successful general-purpose algorithm in modern times.” [HB12]

  • Sub-divide a feature space into partitions
  • Non-parametric and robust to noise
  • Allow both numeric and categorical features
  • Can be regularised in different ways
  • Good weak learners for bagging and boosting [Bre96]
  • See [BS16] for a modern review
  • Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]

Alas, they assume that the data can be scanned more than once, and thus can’t be used in an online context.

SLIDE 3

Toy example: the banana dataset¹

Figure: the banana dataset in the (x1, x2) plane. Panels: training set; decision function with 1 tree; decision function with 10 trees.

¹ Banana dataset on OpenML

SLIDE 4

Online (supervised) machine learning

Model learns from samples (x, y) ∈ ℝᵖ × ℝᵏ which arrive in sequence.

Online != out-of-core:

  • Online: samples are only seen once
  • Out-of-core: samples can be revisited

Progressive validation [BKL99]: ŷ can be obtained right before y is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation! (a sketch follows below)

Ideally, concept drift [GŽB+14] should be taken into account:

  • 1. Virtual drift: P(X) changes
  • 2. Real drift: P(Y ∣ X) changes
    ▶ Example: many 0s with sporadic bursts of 1s
    ▶ Example: a feature’s importance changes through time
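A minimal sketch of the progressive validation loop, in Python. The model and metric method names are hypothetical placeholders rather than a specific library’s API; the point is only that every sample is scored before the model trains on it.

    # Progressive validation: score each sample *before* learning from it,
    # so the stream doubles as a validation set.
    def progressive_validation(stream, model, metric):
        for x, y in stream:
            y_pred = model.predict_one(x)  # predict before seeing the label
            metric.update(y, y_pred)       # score against the true label
            model.learn_one(x, y)          # only then learn from the sample
        return metric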

SLIDE 5

Online decision trees

  • A decision tree involves enumerating split candidates
  • Each split is evaluated by scanning the data
  • This can’t be done online without storing data
  • Two approaches to circumvent this:

  • 1. Store and update feature distributions
  • 2. Build the trees without looking at the data (!!)

Bagging and boosting can be done online [OR01]

SLIDE 6

Consistency

  • Trees fall under the non-parametric regression framework
  • Goal: estimate a regression function g(x) = 𝔼[Y ∣ X = x]
  • We estimate g with an approximation gₙ trained with n samples
  • gₙ is consistent if 𝔼[(gₙ(X) − g(X))²] → 0 as n → +∞
  • Ideally, we also want our estimator to be unbiased
  • We also want regularisation mechanisms in order to generalise
  • Somewhat orthogonal to concept drift handling

SLIDE 7

Hoeffding trees

  • Split thresholds t are chosen by minimising an impurity criterion
  • The impurity looks at the distribution of Y in each child
  • An impurity criterion thus depends on P(Y ∣ X < t)
  • P(Y ∣ X < t) can be obtained via Bayes’ rule:

    P(Y ∣ X < t) = P(X < t ∣ Y) × P(Y) / P(X < t)

  • For classification, assuming X is numeric (see the sketch after this list):

  • P(Y) is a counter
  • P(X < t) can be represented with a histogram
  • P(X < t | Y) can be represented with one histogram per class
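As a rough illustration, here is one way these three quantities can be maintained online for a single numeric feature. The fixed-width histogram over an assumed [x_min, x_max] range is a simplification; real implementations use adaptive summaries.

    import collections

    # Sketch: online split statistics for one numeric feature.
    class SplitStats:

        def __init__(self, n_bins=10, x_min=0.0, x_max=1.0):
            self.n_bins, self.x_min, self.x_max = n_bins, x_min, x_max
            self.class_counts = collections.Counter()                   # P(Y)
            self.hists = collections.defaultdict(lambda: [0] * n_bins)  # P(X < t | Y)

        def _bin(self, x):
            i = int((x - self.x_min) / (self.x_max - self.x_min) * self.n_bins)
            return min(max(i, 0), self.n_bins - 1)

        def update(self, x, y):
            self.class_counts[y] += 1
            self.hists[y][self._bin(x)] += 1

        def p_y_given_x_lt_t(self, t, y):
            # Bayes' rule: P(Y | X < t) = P(X < t | Y) P(Y) / P(X < t)
            n = sum(self.class_counts.values())
            if n == 0:
                return 0.0
            p_y = self.class_counts[y] / n
            b = self._bin(t)
            p_x_lt_t_given_y = sum(self.hists[y][:b]) / max(self.class_counts[y], 1)
            p_x_lt_t = sum(sum(h[:b]) for h in self.hists.values()) / n
            return p_x_lt_t_given_y * p_y / max(p_x_lt_t, 1e-12)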

SLIDE 8

Hoeffding tree construction algorithm

  • A Hoeffding tree starts off as a leaf
  • P(Y), P(X < t), and P(X < t ∣ Y) are updated every time a sample arrives
  • Every so often, we enumerate some candidate splits and evaluate them
  • The best split is chosen if it is significantly better than the second best split
  • Significance is determined by the Hoeffding bound (sketched below)
  • Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
  • Introduced in [DH00]
  • Many variants, including revisiting split decisions when drift occurs [HSD01]
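The Hoeffding bound itself is short: after n observations of a statistic with range R, the true mean lies within a computable margin of the empirical mean with probability 1 − delta. A sketch of the split test:

    import math

    # Hoeffding bound: with probability 1 - delta, the true mean of a
    # statistic with range R differs from its empirical mean over n
    # observations by less than this margin.
    def hoeffding_bound(R, delta, n):
        return math.sqrt(R ** 2 * math.log(1 / delta) / (2 * n))

    # The leaf splits once the observed gap between the best and second
    # best candidate exceeds the bound:
    #     if best_gain - second_best_gain > hoeffding_bound(R, delta, n): split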

SLIDE 9

Hoeffding trees on the banana dataset

Figure: Hoeffding tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 10

Mondrian trees

  • Construction follows a Mondrian process [RT+08]
  • Split features and points are chosen without considering their predictive power
  • Hierarchical averaging is used to smooth leaf values
  • First introduced in [LRT14]
  • Improved in [MGS19]

Figure: Composition A by Piet Mondrian

SLIDE 11

The Mondrian process

  • Let uₖ and lₖ be the upper and lower bounds of feature k in a cell
  • Sample E ∼ Exp(∑ₖ₌₁ᵖ (uₖ − lₖ))
  • Split if E < λ
  • The chances of splitting decrease as the cells get smaller
  • λ acts as a soft maximum depth parameter
  • Features are chosen with probability proportional to uₖ − lₖ
  • More information in these slides
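A sketch of one recursion step, following the standard Mondrian process formulation (the bound arrays l, u and the lifetime budget are assumptions of this sketch, not any library’s API):

    import random

    # One step of the Mondrian process over a cell with bounds l[k], u[k].
    def mondrian_split(l, u, budget, rng=random):
        extents = [u[k] - l[k] for k in range(len(l))]
        rate = sum(extents)
        if rate == 0:
            return None                        # degenerate cell: no split
        E = rng.expovariate(rate)              # E ~ Exp(sum of side lengths)
        if E > budget:                         # small cells draw large E and
            return None                        # exhaust the lifetime budget
        k = rng.choices(range(len(l)), weights=extents)[0]  # feature ∝ extent
        t = rng.uniform(l[k], u[k])            # split point uniform in range
        return k, t, budget - E                # children recurse on the rest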

SLIDE 12

Mondrian trees on the banana dataset

Figure: Mondrian tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 13

Aggregated Mondrian trees on the banana dataset

Figure: aggregated Mondrian tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 14

Purely random trees

  • Features x are assumed to lie in [0, 1]ᵖ
  • Trees are constructed independently of the data, before it even arrives:

  • 1. Pick a feature at random
  • 2. Pick a split point at random
  • 3. Repeat until desired depth is reached

  • When a sample reaches a leaf, said leaf’s running average is updated (see the sketch after this list)
  • Easier to analyse because the tree structure doesn’t depend on Y
  • Consistency depends on:

  • 1. The height of a tree – denoted ℎ
  • 2. The amount of features that are “relevant”

  • Bias analysis performed in [AG14]
  • Word of caution: this is different from extremely randomised trees [GEW06]
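A minimal sketch of the idea, assuming features in [0, 1]ᵖ: the structure is drawn up front, and only the leaf averages ever touch the data.

    import random

    # Build the tree before any data arrives: random feature, random split.
    def build_tree(height, n_features, rng=random):
        if height == 0:
            return {"n": 0, "mean": 0.0}           # leaf with a running average
        return {
            "feature": rng.randrange(n_features),  # 1. pick a feature at random
            "threshold": rng.random(),             # 2. pick a split in [0, 1]
            "left": build_tree(height - 1, n_features, rng),   # 3. recurse
            "right": build_tree(height - 1, n_features, rng),
        }

    # Learning only updates the running average of the leaf a sample reaches.
    def learn_one(tree, x, y):
        node = tree
        while "feature" in node:
            node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
        node["n"] += 1
        node["mean"] += (y - node["mean"]) / node["n"]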

SLIDE 15

Uniform random trees

  • Features and split points are chosen completely at random
  • Let h be the height of the tree
  • Consistent when h → +∞ and h/n → 0 as n → +∞ [BDL08]

SLIDE 16

Uniform random trees

Figure: an example uniform random tree whose internal nodes split on x1, x2, x1 and x1.

SLIDE 17

Uniform random trees on the banana dataset

Figure: uniform random tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 18

Centered random trees

  • Features are chosen completely at random
  • Split points are the mid-points of a feature’s current range
  • Consistent when h → +∞ and 2ʰ/n → 0 as n → +∞ [Sco16]

SLIDE 19

Centered random trees

Figure: an example centered random tree whose internal nodes split on x1, x2, x1 and x1.

SLIDE 20

Centered random trees on the banana dataset

Figure: centered random tree decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 21

How about a compromise?

  • Let [a, b] be the chosen feature’s current range
  • Choose ε ∈ [0, 1/2]
  • Sample the split point t uniformly in [a + ε(b − a), b − ε(b − a)]
  • ε = 0 ⟹ t ∈ [a, b] (uniform)
  • ε = 1/2 ⟹ t = (a + b)/2 (centered)

Figure: an example padded tree on (x1, x2) with ε = 0.2.
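In code the compromise is tiny; a sketch, with a and b the current bounds of the chosen feature:

    import random

    # Padded split: eps = 0 recovers uniform splits, eps = 1/2 centered ones.
    def padded_split(a, b, eps, rng=random):
        pad = eps * (b - a)
        return rng.uniform(a + pad, b - pad)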

SLIDE 22

Some examples

Figure: example padded trees on (x1, x2) with ε = 0.1, ε = 0.25 and ε = 0.4.

SLIDE 23

Banana dataset with 𝜀 = 0.2

Figure: padded tree (ε = 0.2) decision functions on the banana dataset. Axes: x1, x2. Panels: single tree; 10 trees.

SLIDE 24

Impact of ε on performance

Figure: log loss as a function of ε ∈ [0, 0.5], for tree heights 1, 3, 5, 7 and 9.

SLIDE 25

Tree regularisation

A decision tree overfits when its leaves contain too few samples. There are many popular ways to regularise trees:

  • 1. Set a lower limit on the number of samples in each leaf
  • 2. Limit the maximum depth
  • 3. Discard irrelevant nodes after training (pruning)

None of these are designed to take into account the streaming aspect of online decision trees.

SLIDE 26

Hierarchical smoothing

  • Intuition: a leaf doesn’t contain enough samples... but its ancestors might!
  • Let H(x) be the nodes that go from the root to the leaf for a sample x
  • Curtailment [ZE01]: use the first node in H(x) with at least k samples, walking back from the leaf (sketched below)
  • Aggregated Mondrian trees [MGS19] use context tree weighting
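A sketch of curtailment, where path is assumed to hold the nodes of H(x) ordered from root to leaf, and the node attributes are hypothetical placeholders:

    # Curtailment: predict with the first node that has at least k samples,
    # walking from the leaf back towards the root.
    def curtailed_prediction(path, k):
        for node in reversed(path):    # leaf first, root last
            if node.n_samples >= k:
                return node.prediction
        return path[0].prediction      # fall back to the root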

SLIDE 27

A simple averaging scheme

Idea: make each node in H(x) contribute to a weighted average. Let

  • k be the number of samples in a node
  • d be the depth of a node

Then, the contribution of each node is weighted by:

w = k × (1 + δ)ᵈ

  • The more samples a node contains, the more it matters
  • The deeper a node is, the more it matters
  • δ ∈ ℝ controls the relative importance of both values
  • I like to call this path averaging
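A sketch of the resulting prediction, with path again holding H(x) from the root (depth 0) down to the leaf:

    # Path averaging: every node on the path contributes its prediction,
    # weighted by w = k * (1 + delta) ** d.
    def path_average(path, delta):
        num = den = 0.0
        for d, node in enumerate(path):            # d is the node's depth
            w = node.n_samples * (1 + delta) ** d
            num += w * node.prediction
            den += w
        return num / den if den else 0.0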

SLIDE 28

Averaging on the banana dataset

Figure: path averaging on the banana dataset. Axes: x1, x2. Panels: leaf predictions; smoothing with δ = 5; smoothing with δ = 20.

Notice what happens in the corners.

SLIDE 29

Impact of 𝛿 on predictive performance

Figure: log loss as a function of δ ∈ [3, 18], compared against raw leaf predictions.

SLIDE 30

Dealing with concept drift

  • Each node contains a running average of the y values it has seen
  • Instead, we can maintain an exponentially weighted moving average (EWMA):

    ȳₜ = β yₜ + (1 − β) ȳₜ₋₁

  • β determines the influence of the most recent values
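Swapping the running mean for an EWMA is a small change; a sketch:

    # Exponentially weighted moving average: recent y values dominate,
    # which lets node statistics track a drifting target.
    class EWMA:

        def __init__(self, beta):
            self.beta, self.value = beta, None

        def update(self, y):
            if self.value is None:
                self.value = y     # initialise with the first observation
            else:
                self.value = self.beta * y + (1 - self.beta) * self.value
            return self.value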

SLIDE 31

Hard drift: flip y values after 2000 samples

Figure: log loss over 4000 samples. Series: no smoothing; β = 0.3; β = 0.7.

SLIDE 32

Soft drift: slowly rotate samples around the barycenter

Figure: log loss over 4000 samples. Series: no smoothing; β = 0.3; β = 0.7.

SLIDE 33

Feature selection

Final paragraph from [MGS19]:

“A limitation of AMF, however, is that it does not perform feature selection. It would be interesting to develop an online feature selection procedure that could indicate along which coordinates the splits should be sampled in Mondrian trees, and prove that such a procedure performs dimension reduction in some sense. This is a challenging question in the context of online learning which deserves future investigations.”

Online feature selection is a difficult problem!

SLIDE 34

A solution?

  • 1. Initially, we don’t know the importance of each feature, so we pick them at random
  • 2. After some time, we can measure the quality of each split within each tree
  • 3. We can derive the feature importances from the splits each feature participates in
  • 4. Every so often we can build a new tree by sampling features relative to their importances (sketched below)
  • 5. The selection probabilities should be conditioned on the features already chosen

This is still work in progress, but there is hope.
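Since this is work in progress, the following only illustrates the weighted sampling of step 4; the importance estimates themselves (step 3) are assumed to exist:

    import random

    # Step 4, sketched: draw a split feature in proportion to measured
    # importances, given e.g. {"x1": 0.7, "x2": 0.3}.
    def sample_feature(importances, rng=random):
        features = list(importances)
        weights = [importances[f] for f in features]
        return rng.choices(features, weights=weights)[0]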

SLIDE 35

Parameters recap

  • n: number of trees
  • h: height of each tree
  • ε: the amount of padding
  • δ: determines how the path averaging works
  • β: exponentially weighted moving average parameter

SLIDE 36

Some useful Python libraries

  • scikit-garden – Mondrian trees
  • onelearn – Aggregated Mondrian trees
  • scikit-multiflow – Hoeffding trees
  • scikit-learn – General-purpose batch machine learning
  • creme – General-purpose online machine learning

SLIDE 37

Train/test benchmarks

Model                                  Moons   Noisy linear   Higgs   Higgs*
Batch log reg                          .324    .244           .640    .677
Batch log reg with Fourier features    .193    .213           .698    .641
Batch random forest                    .225    .210           .615    .639
NN with 2 layers                       .171    .196           .653    .637
Online log reg                         .334    .323           .662    .677
Mondrian forest                        .349    .316           .692    .905
Aggregated Mondrian forest             .205    .199           .671    .649
Hoeffding forest                       .330    .258           .664    .649
Padded trees (us)                      .185    .193           .678    .644

SLIDE 38

Streaming benchmarks

Work in progress!

SLIDE 39

Slides are available at maxhalford.github.io/slides/online-decision-trees.pdf

Feedback is more than welcome. Stay safe!

SLIDE 40

References

Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203–208, 1999.

SLIDE 41

References

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 785–794, 2016.

SLIDE 42

References

Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pages 71–80, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

SLIDE 43

References

Jeremy Howard and Mike Bowles. The two most important algorithms in predictive modeling today. In Strata Conference presentation, February, volume 28, 2012.

Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pages 97–106, 2001.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

SLIDE 44

References

Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. AMF: Aggregated Mondrian forests for online learning. arXiv preprint arXiv:1906.10529, 2019.

Nikunj Chandrakant Oza and Stuart Russell. Online ensemble learning. University of California, Berkeley, 2001.

SLIDE 45

References

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

SLIDE 46

References

Daniel M. Roy, Yee Whye Teh, et al. The Mondrian process. In NIPS, pages 1377–1384, 2008.

Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.
