CSC 311: Introduction to Machine Learning
Lecture 5 - Decision Trees & Bias-Variance Decomposition
Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
University of Toronto, Fall 2020
Today: decision trees.
◮ Simple but powerful learning algorithm
◮ Used widely in Kaggle competitions
◮ Lets us motivate concepts from information theory (entropy, mutual information, etc.)
◮ Lets us motivate methods for combining different classifiers.
[Figure: an example decision tree; internal nodes test features, and Yes/No branches lead to further tests or to leaf predictions.]
◮ Discrete output (classification tree): leaf value y_m typically set to the most common value in {t^(m_1), . . . , t^(m_k)}, the targets of the training examples that fall into the leaf.
◮ Continuous output (regression tree): leaf value y_m typically set to the mean value in {t^(m_1), . . . , t^(m_k)}.
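As a small illustration (my own sketch, not from the slides; the helper names are hypothetical), the two leaf rules can be written directly:

```python
import numpy as np
from collections import Counter

def leaf_value_classification(targets):
    """Most common target among the training examples that reach this leaf."""
    return Counter(targets).most_common(1)[0][0]

def leaf_value_regression(targets):
    """Mean target among the training examples that reach this leaf."""
    return float(np.mean(targets))

print(leaf_value_classification(["spam", "ham", "spam"]))  # -> "spam"
print(leaf_value_regression([2.0, 4.0, 6.0]))              # -> 4.0
```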
◮ Decision trees are universal function approximators, but finding the smallest tree consistent with the data is NP-complete.
◮ If you are interested, check: Hyafil & Rivest '76, “Constructing Optimal Binary Decision Trees is NP-complete.”
We instead resort to a greedy heuristic:
◮ Start with the whole training set and an empty decision tree.
◮ Pick a feature and candidate split that would most reduce the loss.
◮ Split on that feature and recurse on the subpartitions.
◮ Let's see if misclassification rate is a good loss (scored in the sketch below).
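As a concrete illustration (a sketch of mine, not the course's code; misclassification_rate and split_score are hypothetical names), scoring a candidate binary split by the reduction in misclassification rate looks like this:

```python
import numpy as np

def misclassification_rate(targets):
    """Fraction of examples not in the majority class."""
    if len(targets) == 0:
        return 0.0
    _, counts = np.unique(targets, return_counts=True)
    return 1.0 - counts.max() / len(targets)

def split_score(targets, mask):
    """Reduction in size-weighted misclassification rate when the examples
    are split into targets[mask] and targets[~mask]."""
    targets = np.asarray(targets)
    mask = np.asarray(mask, dtype=bool)
    n = len(targets)
    left, right = targets[mask], targets[~mask]
    after = (len(left) / n) * misclassification_rate(left) \
          + (len(right) / n) * misclassification_rate(right)
    return misclassification_rate(targets) - after
```

As the following slides argue, this criterion often fails to distinguish between candidate splits, which motivates the entropy-based measures introduced next.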
◮ If all examples in a leaf have the same class: good, low uncertainty
◮ If each class has the same number of examples in a leaf: bad, high uncertainty
◮ If you’re interested, check: Information Theory by Robert Ash.
[Figure: entropy of a biased coin flip as a function of the probability p of heads; both axes run from 0 to 1, with the entropy peaking at 1 bit when p = 0.5.]
High entropy:
◮ Variable has a uniform-like distribution over many outcomes
◮ Flat histogram
◮ Values sampled from it are less predictable

Low entropy:
◮ Distribution is concentrated on only a few outcomes
◮ Histogram is concentrated in a few areas
◮ Values sampled from it are more predictable
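A quick numerical illustration of these two regimes (mine, not from the slides), using the definition H(X) = −Σ_x p(x) log2 p(x):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution; zero entries contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits (high entropy)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # concentrated on one outcome: ~0.24 bits (low entropy)
```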
◮ For example, X = sign(Y ).
◮ Or equivalently, the expected reduction in our uncertainty about Y after we observe X.
The expected conditional entropy averages the specific conditional entropies using the marginal p(x) = Σ_y p(x, y):

H(Y | X) = Σ_x p(x) H(Y | X = x) = −Σ_x Σ_y p(x, y) log2 p(y | x)
Some useful properties:
◮ H is always non-negative
◮ Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
◮ If X and Y are independent, then X does not affect our uncertainty about Y: H(Y|X) = H(Y)
◮ But knowing Y makes our knowledge of Y certain: H(Y|Y) = 0
◮ By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
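To make these properties concrete, here is a small check (my own example; the joint distribution is made up) that computes H(Y | X) = Σ_x p(x) H(Y | X = x) and verifies the chain rule and that information gain is non-negative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A made-up joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

p_x = joint.sum(axis=1)          # marginal p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)          # marginal p(y)

# Expected conditional entropy H(Y|X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x)))

print(H_Y_given_X + entropy(p_x))   # chain rule: H(Y|X) + H(X) ...
print(entropy(joint))               # ... equals H(X, Y)
print(entropy(p_y) - H_Y_given_X)   # information gain IG(Y|X) = H(Y) - H(Y|X) >= 0
```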
◮ Root entropy: H(Y) = −(2/7) log2(2/7) − (5/7) log2(5/7) ≈ 0.86
◮ Leaf entropies: ≈ 0.81 (4/7 of the examples) and ≈ 0.92 (3/7 of the examples)
◮ IG(split) ≈ 0.86 − (4/7 · 0.81 + 3/7 · 0.92) ≈ 0.006
◮ Not a very useful split.
◮ Root entropy: H(Y) = −(2/7) log2(2/7) − (5/7) log2(5/7) ≈ 0.86
◮ Leaf entropies: 0 (2/7 of the examples) and ≈ 0.97 (5/7 of the examples)
◮ IG(split) ≈ 0.86 − (2/7 · 0 + 5/7 · 0.97) ≈ 0.17!!
◮ A much more useful split.
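Checking the arithmetic in these two examples (my own sketch; the per-leaf class counts below are inferred from the rounded entropies on the slide, so treat them as an assumption):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits of the empirical class distribution given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(root_counts, leaf_counts):
    """Root entropy minus the size-weighted average of the leaf entropies."""
    n = sum(sum(c) for c in leaf_counts)
    expected = sum(sum(c) / n * entropy(c) for c in leaf_counts)
    return entropy(root_counts) - expected

# Root node: 2 examples of one class, 5 of the other (entropy ~0.86).
print(info_gain([2, 5], [[1, 3], [1, 2]]))   # first split:  ~0.006
print(info_gain([2, 5], [[0, 2], [2, 3]]))   # second split: ~0.17
```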
To construct the tree, we recursively split groups of examples. For each group produced by a split:
◮ if no examples – return the majority class from the parent
◮ else if all examples are in the same class – return that class
◮ else loop to step 1 (recurse: pick another feature to split this group on)
(A minimal sketch of this recursion follows below.)
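This is my own illustrative code, not the course's; it assumes binary 0/1 features (as numpy arrays) and uses information gain to pick splits:

```python
import numpy as np
from collections import Counter

def entropy(y):
    counts = np.array(list(Counter(y).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def build_tree(X, y, parent_majority=None, max_depth=5):
    """Greedily build a decision tree on binary features X (n x d), labels y."""
    X, y = np.asarray(X), np.asarray(y)
    if len(y) == 0:                            # no examples: use the parent's majority
        return {"leaf": parent_majority}
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or max_depth == 0:     # all one class (or depth limit): leaf
        return {"leaf": majority}
    def gain(j):                               # information gain of splitting on feature j
        mask = X[:, j] == 1
        n, nl = len(y), int(mask.sum())
        if nl in (0, n):
            return -np.inf
        return entropy(y) - (nl / n) * entropy(y[mask]) - ((n - nl) / n) * entropy(y[~mask])
    j = max(range(X.shape[1]), key=gain)
    if gain(j) <= 0:                           # no useful split: stop
        return {"leaf": majority}
    mask = X[:, j] == 1
    return {"feature": j,
            "yes": build_tree(X[mask], y[mask], majority, max_depth - 1),
            "no":  build_tree(X[~mask], y[~mask], majority, max_depth - 1)}
```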
[Figure from: Russell & Norvig]
What makes a good tree?
◮ Computational efficiency (avoid redundant, spurious attributes)
◮ Avoid over-fitting the training examples
◮ Human interpretability
Occam's Razor: find the simplest hypothesis that fits the observations.
◮ Useful principle, but hard to formalize (how to define simplicity?)
◮ See Domingos, 1999, “The Role of Occam's Razor in Knowledge Discovery”
Problems with greedy tree construction:
◮ You have exponentially less data at lower levels
◮ Too big of a tree can overfit the data
◮ Greedy algorithms don't necessarily yield the global optimum
Handling continuous attributes:
◮ Split based on a threshold, chosen to maximize information gain (a sketch follows below)
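A sketch of that idea for a single continuous feature (my own code; best_threshold is a hypothetical helper): candidate thresholds are the midpoints between consecutive sorted feature values, and we keep the one with the largest information gain.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold(x, y):
    """Return (threshold, information gain) for the split x <= threshold."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_thr, best_gain = None, -np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                       # identical feature values cannot be separated
        thr = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        gain = entropy(y) - (len(left) / len(y)) * entropy(left) \
                          - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain
```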
An ensemble combines the decisions of several classifiers to label new examples:
◮ E.g., (possibly weighted) majority vote (see the sketch below)
For this to be useful, the individual classifiers must differ somehow, e.g.:
◮ Different algorithm
◮ Different choice of hyperparameters
◮ Trained on different data
◮ Trained with different weighting of the training examples
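A small sketch of the (possibly weighted) majority vote itself (mine; majority_vote is a hypothetical helper):

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class predictions from several classifiers by (weighted) vote.
    `predictions` holds one predicted label per classifier."""
    weights = weights or [1.0] * len(predictions)
    totals = Counter()
    for label, w in zip(predictions, weights):
        totals[label] += w
    return totals.most_common(1)[0][0]

# Three classifiers vote on one example:
print(majority_vote(["cat", "dog", "cat"]))                   # -> "cat"
print(majority_vote(["cat", "dog", "dog"], [2.0, 0.5, 0.5]))  # weighted -> "cat"
```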
◮ This will help us understand ensembling methods.
◮ Bias and variance of what?
Imagine the following experiment (simulated in the sketch below):
◮ Fix a query point x.
◮ Repeat:
  ◮ Sample a random training dataset D i.i.d. from the data generating distribution p_data.
  ◮ Run the learning algorithm on D to get a prediction y at x.
  ◮ Sample the (true) target from the conditional distribution p(t|x).
  ◮ Compute the loss L(y, t).
This gives a distribution over the loss at x, with expectation E[L(y, t) | x].
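A small simulation of this experiment (entirely my own toy setup: the data-generating process, the linear-fit learner, and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                   # noise standard deviation (assumed)

def sample_dataset(n=20):
    """Sample a training set from the (made-up) data generating distribution."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)
    return x, t

def learn_and_predict(x_train, t_train, x_query):
    """A simple learner: fit a straight line by least squares, predict at x_query."""
    coeffs = np.polyfit(x_train, t_train, deg=1)
    return np.polyval(coeffs, x_query)

x_query = 0.25
losses, preds = [], []
for _ in range(5000):
    x_tr, t_tr = sample_dataset()                             # sample D ~ p_data
    y = learn_and_predict(x_tr, t_tr, x_query)                # prediction y at x
    t = np.sin(2 * np.pi * x_query) + rng.normal(0.0, sigma)  # sample t ~ p(t|x)
    losses.append((y - t) ** 2)       # squared error (the 1/2 in L just rescales all terms)
    preds.append(y)

y_star = np.sin(2 * np.pi * x_query)                          # E[t|x] for this process
print("expected loss:", np.mean(losses))
print("bias^2 + variance + Bayes error:",
      (y_star - np.mean(preds)) ** 2 + np.var(preds) + sigma ** 2)
```

Up to Monte Carlo error, the two printed numbers agree, which is exactly the decomposition derived on the following slides.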
For now, assume we are using squared error loss: L(y, t) = 1/2 (y − t)².
◮ Here, we are treating t as a random variable and choosing y.
◮ Claim: the best possible prediction is y⋆ = E[t | x], since E[(y − t)² | x] = (y − y⋆)² + Var[t | x], which is minimized by y = y⋆.
◮ This is the best we can ever hope to do with any learning algorithm; the residual error Var[t | x] is called the Bayes error (or irreducible error).
◮ Notice that this term doesn’t depend on y.
Now decompose the expected loss, treating y as random because it depends on the sampled training set (y⋆ = E[t | x] is a fixed constant):

E[(y − t)²] = E[(y − y⋆)²] + Var(t)
            = E[y⋆² − 2 y⋆ y + y²] + Var(t)
            = y⋆² − 2 y⋆ E[y] + E[y²] + Var(t)
            = y⋆² − 2 y⋆ E[y] + E[y]² + Var(y) + Var(t)      (using E[y²] = E[y]² + Var(y))
            = (y⋆ − E[y])² + Var(y) + Var(t)

The three terms are the (squared) bias, the variance, and the Bayes error, respectively.
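An exact sanity check of the decomposition with tiny discrete distributions (values made up for illustration; y and t are sampled independently, as in the setup above):

```python
import numpy as np

y_vals, y_probs = np.array([1.0, 2.0]), np.array([0.5, 0.5])              # distribution of y over training sets
t_vals, t_probs = np.array([0.0, 2.0, 4.0]), np.array([0.25, 0.5, 0.25])  # p(t | x)

E_y    = np.sum(y_probs * y_vals)
Var_y  = np.sum(y_probs * (y_vals - E_y) ** 2)
y_star = np.sum(t_probs * t_vals)                    # y_star = E[t | x]
Var_t  = np.sum(t_probs * (t_vals - y_star) ** 2)    # Bayes error

lhs = sum(py * pt * (yv - tv) ** 2                   # E[(y - t)^2], y and t independent
          for yv, py in zip(y_vals, y_probs)
          for tv, pt in zip(t_vals, t_probs))
rhs = (y_star - E_y) ** 2 + Var_y + Var_t            # bias^2 + variance + Bayes error
print(lhs, rhs)                                      # both equal 2.5 here
```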
To summarize, the expected loss decomposes into bias (squared) + variance + Bayes error:
◮ bias: how wrong the expected prediction is (corresponds to underfitting)
◮ variance: the amount of variability in the predictions (corresponds to overfitting)
◮ Bayes error: the inherent unpredictability of the targets
◮ To obtain the overall expected loss, we also average over points x from the data distribution.