 
Applied Machine Learning
Bootstrap, Bagging and Boosting
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
- bootstrap for uncertainty estimation
- bagging for variance reduction
- random forests
- boosting
  - AdaBoost
  - gradient boosting
  - relationship to L1 regularization
Bootstrap
a simple approach to estimate the uncertainty in prediction
non-parametric bootstrap
- given the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$
- subsample with replacement $B$ datasets of size $N$: $\mathcal{D}_b = \{(x^{(n,b)}, y^{(n,b)})\}_{n=1}^{N}, \quad b = 1, \ldots, B$
- train a model on each of these bootstrap datasets (called bootstrap samples)
- produce a measure of uncertainty from these models
  - for model parameters
  - for predictions
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100)
$\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
$y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$, where $\epsilon$ is the noise
(figure: the data before adding noise)

    import numpy as np
    import matplotlib.pyplot as plt

    # x: N inputs, y: N targets (assumed to be given)
    plt.plot(x, y, 'b.')
    phi = lambda x, mu: np.exp(-(x - mu)**2)     # Gaussian basis with s = 1
    mu = np.linspace(0, 10, 10)                  # centers of the 10 Gaussian bases
    Phi = phi(x[:, None], mu[None, :])           # N x 10
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # least-squares weights
    yh = np.dot(Phi, w)                          # fitted values
    plt.plot(x, yh, 'g-')

(figure: our fit to data using 10 Gaussian bases)
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
using B=500 bootstrap samples gives a measure of uncertainty of the parameters
(figure: histogram of the bootstrap weights; each color is a different weight $w_d$)

    # Phi: N x D
    # y: N
    N, D = Phi.shape
    B = 500
    ws = np.zeros((B, D))
    for b in range(B):
        inds = np.random.randint(N, size=N)   # N indices sampled with replacement
        Phi_b = Phi[inds, :]                  # N x D
        y_b = y[inds]                         # N
        # fit the subsampled data
        ws[b, :] = np.linalg.lstsq(Phi_b, y_b, rcond=None)[0]

    plt.hist(ws, bins=50)
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
using B=500 bootstrap samples also gives a measure of uncertainty of the predictions
(for each point we can get these across bootstrap model predictions)
(figure: the fit with its uncertainty band; the red lines are the 5% and 95% quantiles)

    # Phi: N x D
    # Phi_test: Nt x D
    # y: N
    # ws: B x D from previous code
    Nt = Phi_test.shape[0]
    y_hats = np.zeros((B, Nt))
    for b in range(B):
        wb = ws[b, :]
        y_hats[b, :] = np.dot(Phi_test, wb)   # predictions of bootstrap model b

    # get the 5% and 95% quantiles across the B models, for each test point
    y_5 = np.quantile(y_hats, .05, axis=0)
    y_95 = np.quantile(y_hats, .95, axis=0)

Winter 2020 | Applied Machine Learning (COMP551)
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
variance of sum of random variables
$$
\begin{aligned}
\mathrm{Var}(z_1 + z_2) &= \mathbb{E}[(z_1 + z_2)^2] - \mathbb{E}[z_1 + z_2]^2 \\
&= \mathbb{E}[z_1^2 + z_2^2 + 2 z_1 z_2] - (\mathbb{E}[z_1] + \mathbb{E}[z_2])^2 \\
&= \mathbb{E}[z_1^2] + \mathbb{E}[z_2^2] + \mathbb{E}[2 z_1 z_2] - \mathbb{E}[z_1]^2 - \mathbb{E}[z_2]^2 - 2\,\mathbb{E}[z_1]\,\mathbb{E}[z_2] \\
&= \mathrm{Var}(z_1) + \mathrm{Var}(z_2) + 2\,\mathrm{Cov}(z_1, z_2)
\end{aligned}
$$
for uncorrelated variables the covariance term $2\,\mathrm{Cov}(z_1, z_2)$ is zero
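The identity above can be checked numerically. The sketch below is only an illustration: the way `z1` and `z2` are constructed (and the sample size) is an assumption made for the example, chosen so that the covariance term is clearly non-zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables: z2 shares part of z1 (illustrative construction).
z1 = rng.normal(size=100_000)
z2 = 0.5 * z1 + rng.normal(size=100_000)

lhs = np.var(z1 + z2)
cov12 = np.cov(z1, z2, bias=True)[0, 1]   # bias=True divides by N, matching np.var
rhs = np.var(z1) + np.var(z2) + 2 * cov12

print(lhs, rhs)   # the two sides agree up to floating-point error
```

Dropping the `2 * cov12` term would only be valid if `z1` and `z2` were uncorrelated.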
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
average of uncorrelated random variables has a lower variance
- $z_1, \ldots, z_B$ are uncorrelated random variables with mean $\mu$ and variance $\sigma^2$
- the average $\bar{z} = \frac{1}{B} \sum_b z_b$ has mean $\mu$ and variance
  $\mathrm{Var}\left(\frac{1}{B} \sum_b z_b\right) = \frac{1}{B^2} \mathrm{Var}\left(\sum_b z_b\right) = \frac{1}{B^2} B \sigma^2 = \frac{1}{B} \sigma^2$
use this to reduce the variance of our models (bias remains the same)
- regression: average the model predictions $\hat{f}(x) = \frac{1}{B} \sum_b \hat{f}_b(x)$
- issue: model predictions are not uncorrelated (trained using the same data)
- bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
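A small simulation can illustrate the $\sigma^2 / B$ result. This is a sketch under assumed values: the Gaussian distribution, `sigma2 = 4`, `B = 25` and the number of trials are all arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, B, trials = 4.0, 25, 20_000

# Each row holds B uncorrelated draws with mean 1 and variance sigma2.
z = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(trials, B))
z_bar = z.mean(axis=1)          # one average per trial

var_single = z[:, 0].var()      # empirical variance of a single variable
var_avg = z_bar.var()           # empirical variance of the average
print(var_single, var_avg, sigma2 / B)
```

The mean of `z_bar` stays near 1 while its variance shrinks by roughly a factor of `B`, which mirrors the claim that averaging leaves the bias unchanged and reduces the variance.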
Bagging for classification
averaging makes sense for regression, how about classification?
wisdom of crowds
- $z_1, \ldots, z_B \in \{0, 1\}$ are IID Bernoulli random variables with mean $\mu = .5 + \epsilon$, for $\epsilon > 0$
- for $\bar{z} = \frac{1}{B} \sum_b z_b$ we have that $p(\bar{z} > .5)$ goes to 1 as $B$ grows
- the mode (majority vote) of IID classifiers that are better than chance is a better classifier: use voting
crowds are wiser when
- individuals are better than random
- votes are uncorrelated
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
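The wisdom-of-crowds claim can be made concrete by computing $p(\bar{z} > .5)$ exactly from the binomial distribution. The helper name `majority_correct_prob` and the value `mu = 0.55` are assumptions made for this sketch.

```python
from math import comb

def majority_correct_prob(mu, B):
    """P(z_bar > 0.5) for B IID Bernoulli(mu) votes, computed exactly."""
    # z_bar > 0.5 means strictly more than B/2 of the votes equal 1.
    k_min = B // 2 + 1
    return sum(comb(B, k) * mu**k * (1 - mu)**(B - k)
               for k in range(k_min, B + 1))

# Each voter is only slightly better than chance (mu = 0.55).
for B in (1, 11, 101, 501):
    print(B, majority_correct_prob(0.55, B))
```

The printed probability increases with `B`, provided the votes are independent; correlated votes would not enjoy the same gain.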
Bagging decision trees
example setup
- synthetic dataset
- 5 correlated features
- 1st feature is a noisy predictor of the label
Bootstrap samples create different decision trees (due to high variance)
compared to decision trees, no longer interpretable!
two ways to combine the B trees
- voting for the most probable class
- averaging probabilities
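The two aggregation rules can disagree. The sketch below assumes the B models' class probabilities are already stacked in an array `probs` of shape B x N x C; the function names and the toy numbers are hypothetical and not taken from the slides.

```python
import numpy as np

def aggregate_by_vote(probs):
    """probs: B x N x C class probabilities from B models; returns N labels."""
    B, N, C = probs.shape
    votes = probs.argmax(axis=2)                 # B x N hard predictions
    counts = np.zeros((N, C), dtype=int)
    for b in range(B):
        counts[np.arange(N), votes[b]] += 1      # tally one vote per model
    return counts.argmax(axis=1)

def aggregate_by_mean_prob(probs):
    """Average the probabilities over the B models, then pick the best class."""
    return probs.mean(axis=0).argmax(axis=1)

# Toy case: 3 models, 1 instance, 2 classes. Two models weakly prefer
# class 0, one model strongly prefers class 1.
probs = np.array([[[0.55, 0.45]],
                  [[0.55, 0.45]],
                  [[0.05, 0.95]]])
print(aggregate_by_vote(probs), aggregate_by_mean_prob(probs))
```

Voting ignores how confident each model is, while averaging probabilities lets one confident model outweigh two hesitant ones.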
Random forests
further reduce the correlation between decision trees
feature sub-sampling
- only a random subset of features is available for split at each step
- this further reduces the dependence between decision trees
- magic number for the subset size (a common choice is $\sqrt{D}$ of the $D$ features)? this is a hyper-parameter, can be optimized using CV
Out Of Bag (OOB) samples
- the instances not included in a bootstrap dataset can be used for validation
- simultaneous validation of decision trees in a forest
- no need to set aside data for cross validation
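Both ingredients are easy to sketch with NumPy. The values of `N`, `B` and `D` below are illustrative assumptions, and the $\sqrt{D}$ subset size is the common default rather than something fixed by the method. The OOB fraction is compared with $(1 - 1/N)^N$, the probability that a given instance is never drawn in N draws with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, D = 1000, 200, 57     # illustrative sizes

# Out-of-bag instances: those never drawn into a given bootstrap dataset.
oob_fraction = np.zeros(B)
for b in range(B):
    inds = rng.integers(N, size=N)          # bootstrap sample indices
    in_bag = np.zeros(N, dtype=bool)
    in_bag[inds] = True
    oob_fraction[b] = 1.0 - in_bag.mean()

print(oob_fraction.mean(), (1 - 1 / N) ** N)   # both close to exp(-1)

# Feature sub-sampling: candidate features for one split (assumed size sqrt(D)).
m = int(np.sqrt(D))
candidate_features = rng.choice(D, size=m, replace=False)
print(candidate_features)
```

Roughly a third of the data is out of bag for each tree, which is why OOB validation comes essentially for free.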
Example: spam detection
Dataset
- N=4601 emails
- binary classification task: spam - not spam
- D=57 features:
  - 48 words: percentage of words in the email that match these words, e.g., business, address, internet, free, George (customized per user)
  - 6 characters: again percentage of characters that match these: ch; , ch( , ch[ , ch! , ch$ , ch#
  - average, max, sum of length of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT (an example of feature engineering)
(table: average value of these features in the spam and non-spam emails)
Example: spam detection
decision tree after pruning
number of leaves (17) in optimal pruning decided based on cross-validation error
(figure: cv error and test error curves)
(figure: the pruned tree, with the misclassification rate on test data)
Example: spam detection
Bagging and Random Forests do much better than a single decision tree!
Out Of Bag (OOB) error can be used for parameter tuning (e.g., size of the forest)
Summary so far...
- Bootstrap is a powerful technique to get uncertainty estimates
- Bootstrap aggregation (Bagging) can reduce the variance of unstable models
- Random forests:
  - Bagging + further decorrelation of the trees by sub-sampling features at each split
  - OOB validation instead of CV
  - destroy interpretability of decision trees
  - perform well in practice
  - can fail if only few relevant features exist (due to feature-sampling)
Adaptive bases
several methods can be classified as learning these bases adaptively
$f(x) = \sum_d w_d \phi_d(x; v_d)$
- decision trees
- generalized additive models
- boosting
- neural networks
in boosting each basis is a classifier or regression function (weak learner, or base learner)
create a strong learner by sequentially combining weak learners
Forward stagewise additive modelling
model: $f(x) = \sum_{t=1}^{T} w^{\{t\}} \phi(x; v^{\{t\}})$
- $\phi$ is a simple model, such as decision stump (decision tree with one node)
cost: $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L(y^{(n)}, f(x^{(n)}))$
- so far we have seen L2 loss, log loss and hinge loss
- optimizing this cost is difficult given the form of f
optimization idea: add one weak-learner in each stage t, to reduce the error of previous stage
1. find the best weak learner
   $v^{\{t\}}, w^{\{t\}} = \arg\min_{v, w} \sum_{n=1}^{N} L(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w \phi(x^{(n)}; v))$
2. add it to the current model
   $f^{\{t\}}(x) = f^{\{t-1\}}(x) + w^{\{t\}} \phi(x; v^{\{t\}})$
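The two steps can be sketched for the L2 loss with one-dimensional inputs. Everything specific here is an assumption made for illustration: the weak learner is a stump that outputs +1 or -1 around a threshold `v`, the candidate thresholds are midpoints between sorted inputs, and the function names and toy data are hypothetical. For the L2 loss, step 1 reduces to fitting the residual of the previous stage.

```python
import numpy as np

def stump(x, v):
    """Weak learner phi(x; v): +1 to the right of threshold v, -1 otherwise."""
    return np.where(x > v, 1.0, -1.0)

def fit_stagewise(x, y, T=50):
    """Forward stagewise additive modelling with the L2 loss (1D inputs)."""
    xs = np.sort(x)
    # Candidates: one threshold below all points (constant learner) + midpoints.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2))
    f = np.zeros_like(y, dtype=float)        # f^{0} = 0
    model, losses = [], []
    for t in range(T):
        r = y - f                            # residual of the previous stage
        best = None
        for v in candidates:                 # step 1: search over v and w
            phi = stump(x, v)
            w = np.dot(r, phi) / len(x)      # optimal w for L2, since phi**2 == 1
            loss = np.sum((r - w * phi) ** 2)
            if best is None or loss < best[0]:
                best = (loss, v, w)
        loss, v, w = best
        f = f + w * stump(x, v)              # step 2: add it to the current model
        model.append((w, v))
        losses.append(loss)
    return model, losses

def predict(model, x):
    """Evaluate f(x) = sum_t w_t * phi(x; v_t)."""
    return sum(w * stump(x, v) for w, v in model)

# Toy data (illustrative): a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = np.sin(x) + 0.1 * rng.normal(size=100)
model, losses = fit_stagewise(x, y, T=50)
print(losses[0], losses[-1])
```

Because each stage may always choose `w = 0`, the training loss never increases from one stage to the next. Other losses change only the inner search in step 1.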