Lecture 5: Classification and dimension reduction
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
4 April 2019



  2. Random Forests
  1. Given a training sample with q features, do for c = 1, …, C:
     1.1 Draw a bootstrap sample of size n from the training data (with replacement).
     1.2 Grow a tree T_c on the bootstrapped data until each node reaches minimal node size n_min:
         1.2.1 Randomly select m variables from the q available.
         1.2.2 Find the best splitting variable among these m.
         1.2.3 Split the node.
  2. For a new y predict
     Classification: ĝ(y) = majority vote at y across trees
     Regression: ĝ(y) = (1/C) Σ_{c=1}^C T_c(y)
  Note: Step 1.2.1 leads to less correlation between the trees built.
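The procedure above can be sketched in a few lines of Python. This is a toy illustration, not the lecture's code: each "tree" is only a depth-1 decision stump, and the names (`train_stump`, `random_forest`, `predict`) are made up for this sketch. It does, however, show the three key ingredients: bootstrap resampling (1.1), a random feature subset of size m per tree (1.2.1), and the majority vote (step 2).

```python
import random
from collections import Counter

def train_stump(X, y, feat_idx):
    """Fit a decision stump: the single (feature, threshold) split among the
    allowed features feat_idx that minimises misclassifications (1.2.2-1.2.3)."""
    best = None
    for j in feat_idx:
        for t in sorted({xi[j] for xi in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or err < best[0]:
                best = (err, j, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: fall back to a constant prediction
        lab = Counter(y).most_common(1)[0][0]
        return lambda x, lab=lab: lab
    _, j, t, l_lab, r_lab = best
    return lambda x, j=j, t=t, a=l_lab, b=r_lab: a if x[j] <= t else b

def random_forest(X, y, n_trees=25, m=1, seed=0):
    """Steps 1.1-1.2: for each tree draw a bootstrap sample of size n (with
    replacement) and offer the tree only a random subset of m of the q features."""
    rng = random.Random(seed)
    q, n = len(X[0]), len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feat_idx = rng.sample(range(q), m)          # random feature subset
        trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feat_idx))
    return trees

def predict(trees, x):
    """Step 2, classification: majority vote across the trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]
```

On a small example with one informative feature and two uninformative (here constant) ones, trees that happen to draw only the uninformative features contribute noise votes, but the majority vote across the ensemble still recovers the signal.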

  3. Comparison of RF, Bagging and CART
  Toy example:
     y ∼ N(𝟏, Σ), y ∈ ℝ^5, where Σ_mm = 1, Σ_ml = 0.98 for m ≠ l
     z = y_1^2 + ε, where ε ∼ N(0, 1)
  Training and test data were sampled from the true model (n_tr = 50, n_te = 100).
  Results for RF, bagged CART and a single CART, using y_1, …, y_5 as predictor variables:
  [Figure: test error (≈ 1.5–2.1) against the number of trees (0–300).]
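The toy model above is easy to simulate. The following is a sketch of the data-generating step only (the `simulate` helper and the seed are my own choices, not from the lecture); the sampled predictors are strongly correlated (0.98) but only y_1 drives the response.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Draw y ~ N(1, Sigma) in R^5 with unit variances and pairwise
    covariance 0.98, and responses z = y_1^2 + eps with eps ~ N(0, 1)."""
    Sigma = np.full((5, 5), 0.98) + 0.02 * np.eye(5)
    Y = rng.multivariate_normal(np.ones(5), Sigma, size=n)
    z = Y[:, 0] ** 2 + rng.standard_normal(n)
    return Y, z

Y_tr, z_tr = simulate(50)    # training set, n_tr = 50
Y_te, z_te = simulate(100)   # test set, n_te = 100
```

Because the predictors are almost interchangeable, a single CART may split on any of y_1, …, y_5; randomizing the candidate features per split is exactly what decorrelates the trees here.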

  4. Variable importance
  1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure for variable importance.
  2. Out-of-bag error:
  ▶ During bootstrapping, for large enough n, each sample has a chance of about 63% to be selected.
  ▶ For bagging, the remaining samples are out-of-bag.
  ▶ These out-of-bag samples for tree T_c can be used as a test set for that particular tree, since they were not used during training. This results in test error E_0.
  ▶ Permute variable k in the out-of-bag samples and calculate the test error again, E_1^(k).
  ▶ The increase in error E_1^(k) − E_0 ≥ 0 serves as an importance measure for variable k.
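The permutation step can be sketched directly. The function below is a minimal illustration, assuming `model` is any fitted classifier callable on an array of samples (the name `permutation_importance` is my own; real forest implementations average this over per-tree out-of-bag sets rather than one held-out set):

```python
import numpy as np

def permutation_importance(model, X_oob, y_oob, k, rng):
    """Permutation importance for variable k on out-of-bag data:
    the error after permuting column k minus the baseline error E_0."""
    e0 = np.mean(model(X_oob) != y_oob)   # baseline out-of-bag error E_0
    Xp = X_oob.copy()
    rng.shuffle(Xp[:, k])                 # break the link between x_k and y
    e1 = np.mean(model(Xp) != y_oob)      # permuted error E_1^(k)
    return e1 - e0
```

Permuting a variable the model relies on inflates the error, while permuting an unused variable leaves the predictions, and hence the error, unchanged.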

  5. RF applied to cardiovascular dataset
  Monica dataset (http://thl.fi/monica, n = 6367, q = 11).
  Predicting whether or not patients survive a 10-year period given a number of cardiovascular risk factors (class ratio 1.25 alive : 1 dead).
  [Figure, left: error estimates (out-of-bag, alive, dead) against the number of trees (0–200), errors ≈ 0.05–0.25. Right: variable importance (mean decrease in accuracy and in Gini) for age, angina, diabetes, hichol, highbp, hosp, premi, sex, smstat, stroke, yronset.]
