Lecture 5: Classification and dimension reduction
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
4 April 2019



  2. Random Forests
  1. Given a training sample with q features, do for c = 1, …, C:
     1.1 Draw a bootstrap sample of size n from the training data (with replacement).
     1.2 Grow a tree T_c on the bootstrapped data until each node reaches minimal node size n_min:
         1.2.1 Randomly select m variables from the q available.
         1.2.2 Find the best splitting variable among these m.
         1.2.3 Split the node.
  2. For a new y predict
     Classification: ĝ(y) = majority vote at y across trees
     Regression: ĝ(y) = (1/C) Σ_{c=1}^C T_c(y)
  Note: Step 1.2.1 leads to less correlation between the trees built.
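The procedure above can be sketched in a few lines of Python. This is a toy illustration, not the lecture's code: each "tree" is only a depth-1 decision stump, and the names (`train_stump`, `random_forest`, `predict`) are made up for this sketch. It does, however, show the three key ingredients: bootstrap resampling (1.1), a random feature subset of size m per tree (1.2.1), and the majority vote (step 2).

```python
import random
from collections import Counter

def train_stump(X, y, feat_idx):
    """Fit a decision stump: the single (feature, threshold) split among the
    allowed features feat_idx that minimises misclassifications (1.2.2-1.2.3)."""
    best = None
    for j in feat_idx:
        for t in sorted({xi[j] for xi in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: fall back to a constant prediction
        lab = Counter(y).most_common(1)[0][0]
        return lambda x, lab=lab: lab
    _, j, t, l_lab, r_lab = best
    return lambda x, j=j, t=t, a=l_lab, b=r_lab: a if x[j] <= t else b

def random_forest(X, y, n_trees=25, m=1, seed=0):
    """Steps 1.1-1.2: for each tree draw a bootstrap sample of size n (with
    replacement) and offer the tree only a random subset of m of the q features."""
    rng = random.Random(seed)
    q, n = len(X[0]), len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feat_idx = rng.sample(range(q), m)          # random feature subset
        trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feat_idx))
    return trees

def predict(trees, x):
    """Step 2, classification: majority vote across the trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]
```

On a small example with one informative feature and two uninformative (here constant) ones, trees that happen to draw only the uninformative features contribute noise votes, but the majority vote across the ensemble still recovers the signal.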

  3. Comparison of RF, Bagging and CART
  Toy example:
     y ∼ N(𝟏, Σ), y ∈ ℝ^5, where Σ_mm = 1, Σ_ml = 0.98 for m ≠ l
     z = y_1^2 + ε, where ε ∼ N(0, 1)
  Training and test data were sampled from the true model (n_tr = 50, n_te = 100).
  Results for RF, bagged CART and a single CART, using y_1, …, y_5 as predictor variables:
  [Figure: test error (≈ 1.5–2.1) against the number of trees (0–300).]
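The toy model above is easy to simulate. The following is a sketch of the data-generating step only (the `simulate` helper and the seed are my own choices, not from the lecture); the sampled predictors are strongly correlated (0.98) but only y_1 drives the response.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Draw y ~ N(1, Sigma) in R^5 with unit variances and pairwise
    covariance 0.98, and responses z = y_1^2 + eps with eps ~ N(0, 1)."""
    Sigma = np.full((5, 5), 0.98) + 0.02 * np.eye(5)
    Y = rng.multivariate_normal(np.ones(5), Sigma, size=n)
    z = Y[:, 0] ** 2 + rng.standard_normal(n)
    return Y, z

Y_tr, z_tr = simulate(50)    # training set, n_tr = 50
Y_te, z_te = simulate(100)   # test set, n_te = 100
```

Because the predictors are almost interchangeable, a single CART may split on any of y_1, …, y_5; randomizing the candidate features per split is exactly what decorrelates the trees here.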

  4. Variable importance
  1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure for variable importance.
  2. Out-of-bag error:
  ▶ During bootstrapping, for large enough n, each sample has a chance of about 63% to be selected.
  ▶ For bagging, the remaining samples are out-of-bag.
  ▶ These out-of-bag samples for tree T_c can be used as a test set for that particular tree, since they were not used during training. This results in test error E_0.
  ▶ Permute variable k in the out-of-bag samples and calculate the test error again, E_1^(k).
  ▶ The increase in error E_1^(k) − E_0 ≥ 0 serves as an importance measure for variable k.
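The permutation step can be sketched directly. The function below is a minimal illustration, assuming `model` is any fitted classifier callable on an array of samples (the name `permutation_importance` is my own; real forest implementations average this over per-tree out-of-bag sets rather than one held-out set):

```python
import numpy as np

def permutation_importance(model, X_oob, y_oob, k, rng):
    """Permutation importance for variable k on out-of-bag data:
    the error after permuting column k minus the baseline error E_0."""
    e0 = np.mean(model(X_oob) != y_oob)   # baseline out-of-bag error E_0
    Xp = X_oob.copy()
    rng.shuffle(Xp[:, k])                 # break the link between x_k and y
    e1 = np.mean(model(Xp) != y_oob)      # permuted error E_1^(k)
    return e1 - e0
```

Permuting a variable the model relies on inflates the error, while permuting an unused variable leaves the predictions, and hence the error, unchanged.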

  5. RF applied to cardiovascular dataset
  Monica dataset (http://thl.fi/monica, n = 6367, q = 11).
  Predicting whether or not patients survive a 10-year period given a number of cardiovascular risk factors (class ratio 1.25 alive : 1 dead).
  [Figure, left: error estimates (out-of-bag, alive, dead) against the number of trees (0–200), errors ≈ 0.05–0.25. Right: variable importance (mean decrease in accuracy and in Gini) for age, angina, diabetes, hichol, highbp, hosp, premi, sex, smstat, stroke, yronset.]
