Linear Models: Comparing Variables (Stony Brook University, CSE545, Fall 2017)



SLIDE 1

Linear Models: Comparing Variables

Stony Brook University CSE545, Fall 2017

SLIDE 2

Statistical Preliminaries

Random Variables

SLIDE 3

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.

SLIDE 4

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.

Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, <HHHTT>, …}. We may just care about how many tails. Thus:
X(<HHHHH>) = 0, X(<HHHTH>) = 1, X(<TTTHT>) = 4, X(<HTTTT>) = 4

X only has 6 possible values: 0, 1, 2, 3, 4, 5.

What is the probability that we end up with k = 4 tails?
P(X = k) := P( {ω : X(ω) = k} ) where ω ∊ Ω

SLIDE 5

Random Variables

(continuing) X(ω) = 4 for 5 out of the 32 outcomes in Ω. Thus, assuming a fair coin, P(X = 4) = 5/32.

(Not a "variable", but a function that we end up notating a lot like a variable.)

SLIDE 6

Random Variables

X is a discrete random variable if it takes only a countable number of values.
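The 5/32 computation above is easy to verify by brute force. A minimal Python sketch enumerating Ω directly (the names are illustrative, not from the slides):

from itertools import product

# Sample space: all 2^5 = 32 sequences of five coin tosses.
omega = list(product("HT", repeat=5))

# The random variable X maps an outcome to its number of tails.
def X(outcome):
    return outcome.count("T")

# P(X = 4) under a fair coin: the fraction of outcomes with exactly 4 tails.
k = 4
print(sum(1 for w in omega if X(w) == k) / len(omega))  # 5/32 = 0.15625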

SLIDE 7

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice. X is a discrete random variable if it takes only a countable number of values. X is a continuous random variable if it can take on an infinite number of values between any two given values.

SLIDE 8

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.

Example: Ω = inches of snowfall = [0, ∞) ⊆ ℝ. X = amount of inches in a snowstorm: X(ω) = ω.

What is the probability we receive at least a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} )
What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} )

SLIDE 9

Random Variables

(continuing the snowfall example) P(X = i) := 0, for all i ∊ Ω

(probability of receiving exactly i inches of snowfall is zero)

SLIDE 10

Random Variables


How to model?

SLIDE 11

Continuous Random Variables


How to model? Discretize them!

(group into discrete bins)

SLIDE 12

Continuous Random Variables


But aren’t we throwing away information?

P(bin = 8) = 0.32, P(bin = 12) = 0.08

SLIDE 13

Continuous Random Variables

SLIDE 14

Continuous Random Variables


X is a continuous random variable if it can take on an infinite number of values between any two given values. More formally, X is a continuous random variable if there exists a function f_X such that P(a ≤ X ≤ b) = ∫ₐᵇ f_X(x) dx.

SLIDE 15

Continuous Random Variables

f_X : the "probability density function" (pdf)

SLIDE 16

Continuous Random Variables

SLIDE 17

Continuous Random Variables

SLIDE 18

Continuous Random Variables

Common Trap

  • f_X(x) does not yield a probability
  • ∫ₐᵇ f_X(x) dx does
  • f_X(x) itself may be any value in [0, ∞), and thus may be > 1

SLIDE 19

Continuous Random Variables

A Common Probability Density Function

SLIDE 20

Continuous Random Variables

Common pdfs: Normal(μ, σ²): f_X(x) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))

SLIDE 21

Continuous Random Variables

Common pdfs: Normal(μ, σ²), where μ is the mean (or "center") = expectation; σ² is the variance and σ the standard deviation.

SLIDE 22


Continuous Random Variables


Credit: Wikipedia

SLIDE 23

Continuous Random Variables

Common pdfs: Normal(μ, σ²)

X ~ Normal(μ, σ²), examples:

  • height
  • intelligence/ability
  • measurement error
  • averages (or sums) of lots of random variables

SLIDE 24

Continuous Random Variables

Common pdfs: Normal(0, 1) ("standard normal")

How to "standardize" any normal distribution:

  • subtract the mean, μ (aka “mean centering”)
  • divide by the standard deviation, σ

z = (x - μ) / σ, (aka “z score”)
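As a quick sketch of the standardization step on a small sample (assuming numpy; the values are made up):

import numpy as np

x = np.array([63.0, 66.0, 70.0, 73.0])  # e.g., heights in inches
z = (x - x.mean()) / x.std()            # subtract the mean, divide by the std
print(z, z.mean(), z.std())             # z-scores have mean 0 and std 1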


Credit: MIT Open Courseware: Probability and Statistics

SLIDE 25

Continuous Random Variables

Common pdfs: Normal(0, 1)


Credit: MIT Open Courseware: Probability and Statistics

SLIDE 26

Cumulative Distribution Function


For a given random variable X, the cumulative distribution function (CDF), F_X : ℝ → [0, 1], is defined by F_X(x) = P(X ≤ x).

[Figure: CDFs of the Normal and Uniform distributions]

SLIDE 27

Cumulative Distribution Function


[Figure: CDFs of the Exponential, Normal, and Uniform distributions]

Pro: the CDF yields a probability! Con: not intuitively interpretable.
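Because the CDF yields a probability directly, it is usually what library calls compute. A minimal sketch with scipy (the interval endpoints are made up):

from scipy import stats

# F_X(1) = P(X <= 1) for X ~ Normal(0, 1)
print(stats.norm.cdf(1.0, loc=0, scale=1))         # ≈ 0.841

# P(a <= X <= b) = F_X(b) - F_X(a)
print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))  # ≈ 0.683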

SLIDE 28

Random Variables, Revisited

X: A mapping from Ω to ℝ that describes the question we care about in practice. X is a discrete random variable if it takes only a countable number of values. X is a continuous random variable if it can take on an infinite number of values between any two given values.

SLIDE 29

Discrete Random Variables


X is a discrete random variable if it takes only a countable number of values. For a given random variable X, the cumulative distribution function (CDF), F_X : ℝ → [0, 1], is defined by F_X(x) = P(X ≤ x).

SLIDE 30

Discrete Random Variables


[Figure: CDF of Binomial(n, p) (similar in shape to the normal)]

SLIDE 31

Discrete Random Variables


For a given discrete random variable X, the probability mass function (pmf), f_X : ℝ → [0, 1], is defined by f_X(x) = P(X = x). [Figure: pmf of Binomial(n, p)]

SLIDE 32

Discrete Random Variables

Two Common Discrete Random Variables

  • Binomial(n, p)

example: number of heads after n coin flips (p, probability of heads)

  • Bernoulli(p) = Binomial(1, p)

example: one trial of success or failure

SLIDE 33

Hypothesis Testing

Hypothesis -- something one asserts to be true.

Classical Approach:
H0: null hypothesis -- some "default" value; "null": nothing changes
H1: the alternative -- the opposite of the null => a change or difference

SLIDE 34

Hypothesis Testing

Goal: Use probability to determine if we can "reject the null" (H0) in favor of H1: "if the null were true, there would be less than a 5% chance of seeing data like ours."

SLIDE 35

Hypothesis Testing

Example: Hypothesize a coin is biased.
H0: the coin is not biased (i.e. flipping n times results in a Binomial(n, 0.5))
H1: the coin is biased (i.e. flipping n times does not result in a Binomial(n, 0.5))

SLIDE 36

Hypothesis Testing

Hypothesis -- something one asserts to be true.
Classical Approach: H0: null hypothesis -- some "default" value (usually that one's hypothesis is false)

More formally: Let X be a random variable and let R be the range of X. Rreject ⊂ R is the rejection region. If X ∊ Rreject then we reject the null.

SLIDE 37

Hypothesis Testing


alpha : size of rejection region (e.g. 0.05, 0.01, 0.001)

SLIDE 38

Hypothesis Testing


In the biased coin example, if n = 1000 and alpha = 0.05, then Rreject = [0, 469] ∪ [531, 1000].
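The rejection region above can be reproduced from the Binomial CDF. A sketch with scipy, assuming the two-sided α = 0.05 is split evenly between the tails:

from scipy import stats

n, p, alpha = 1000, 0.5, 0.05
lo = stats.binom.ppf(alpha / 2, n, p)      # boundary of the lower tail
hi = stats.binom.ppf(1 - alpha / 2, n, p)  # boundary of the upper tail
print(lo, hi)  # ≈ 469, 531: Rreject = [0, 469] ∪ [531, 1000]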

SLIDE 39

Hypothesis Testing

Important logical question: Does failure to reject the null mean the null is true? No. Traditionally, one of the most common reasons to fail to reject the null: n is too small (not enough data).

Big Data problem: “everything” is significant. Thus, consider “effect size”

SLIDE 40

Hypothesis Testing

Thought experiment: If we have infinite data, can the null ever be true?

SLIDE 41

Type I, Type II Errors

(Orloff & Bloom, 2014)

SLIDE 42

Power

significance level ("p-value") = P(type I error) = P(Reject H0 | H0) (probability we are incorrect)
power = 1 − P(type II error) = P(Reject H0 | H1) (probability we are correct)

(Orloff & Bloom, 2014)

SLIDE 43

Multi-test Correction

If alpha = .05, and I run 40 variables through significance tests, then, by chance, how many are likely to be significant?

SLIDE 44

Multi-test Correction

2 (each test has a 5% chance of rejecting the null by chance: 0.05 × 40 = 2)

How to fix?

SLIDE 45

Multi-test Correction

What if all tests are independent? => "Bonferroni Correction" (use α/m). Better alternative: False Discovery Rate (Benjamini-Hochberg).
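A minimal sketch of both corrections over a set of p-values (plain Python; the p-values are made up):

# p-values from m separate significance tests (made-up numbers)
pvals = sorted([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])
m, alpha = len(pvals), 0.05

# Bonferroni: test each p-value against alpha / m
print([p <= alpha / m for p in pvals])

# Benjamini-Hochberg: find the largest rank k (1-indexed, ascending p-values)
# with p_(k) <= (k / m) * alpha, then reject the k smallest p-values
k = max((i + 1 for i, p in enumerate(pvals) if p <= (i + 1) / m * alpha),
        default=0)
print("reject the", k, "smallest p-values")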

SLIDE 46

Statistical Considerations in Big Data

1. Average multiple models (ensemble techniques)
2. Correct for multiple tests (Bonferroni's Principle)
3. Smooth data
4. "Plot" data (or figure out a way to look at a lot of it "raw")
5. Interact with data
6. Know your "real" sample size
7. Correlation is not causation
8. Define metrics for success (set a baseline)
9. Share code and data
10. The problem should drive the solution

(http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/)

SLIDE 47

Measures for Comparing Random Variables

  • Distance metrics
  • Linear Regression
  • Pearson Product-Moment Correlation
  • Multiple Linear Regression
  • (Multiple) Logistic Regression
  • Ridge Regression (L2 Penalized)
  • Lasso Regression (L1 Penalized)
SLIDE 48

Distance Metrics

Typical properties of a distance metric, d:
d(x, x) = 0
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)

(http://rosalind.info/glossary/euclidean-distance/)

SLIDE 49

Distance Metrics

  • Jaccard Distance (1 − Jaccard similarity)
  • Euclidean Distance
  • Cosine Distance
  • Edit Distance
  • Hamming Distance

(http://rosalind.info/glossary/euclidean-distance/)
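Sketches of a few of the distances above, on plain Python lists and sets (illustrative implementations, not from the slides):

import math

def euclidean(x, y):        # the "L2 norm" of x - y
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):  # 1 - cosine similarity
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) *
                      math.sqrt(sum(b * b for b in y)))

def jaccard_distance(s, t): # 1 - Jaccard similarity, on sets
    return 1 - len(s & t) / len(s | t)

print(euclidean([0, 0], [3, 4]))            # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3}))  # 1 - 2/3 ≈ 0.333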

SLIDE 50

Distance Metrics

Euclidean Distance (the "L2 Norm"): d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

(http://rosalind.info/glossary/euclidean-distance/)

SLIDE 51

Distance Metrics


SLIDE 52

Measures for Comparing Random Variables

  • Distance metrics
  • Linear Regression
  • Pearson Product-Moment Correlation
  • Multiple Linear Regression
  • (Multiple) Logistic Regression
  • Ridge Regression (L2 Penalized)
  • Lasso Regression (L1 Penalized)
SLIDE 53

Linear Regression

Finding a linear function based on X to best yield Y.
X = "covariate" = "feature" = "predictor" = "regressor" = "independent variable"
Y = "response variable" = "outcome" = "dependent variable"

Regression goal: estimate the function r, where r(x) = E(Y | X = x)

(the expected value of Y, given that the random variable X is equal to some specific value, x)

SLIDE 54

Linear Regression

Linear Regression (univariate version) goal: find β₀, β₁ such that E(Y | X = x) ≈ β₀ + β₁x

SLIDE 55

Linear Regression

Simple Linear Regression: Y ≈ β₀ + β₁X; more precisely: Yᵢ = β₀ + β₁Xᵢ + εᵢ

SLIDE 56

Linear Regression

Simple Linear Regression: Yᵢ = β₀ + β₁Xᵢ + εᵢ, where β₀ is the intercept, β₁ the slope, and εᵢ the error, with expected value 0 and variance σ²

SLIDE 57

Linear Regression

Estimated intercept and slope: Ŷᵢ = β̂₀ + β̂₁Xᵢ

Residual: eᵢ = Yᵢ − Ŷᵢ

SLIDE 58

Linear Regression

Least Squares Estimate: find the β̂₀ and β̂₁ which minimize the residual sum of squares: RSS = Σᵢ eᵢ² = Σᵢ (Yᵢ − (β̂₀ + β̂₁Xᵢ))²

SLIDE 59

Linear Regression via Gradient Descent: start with β̂₀ = β̂₁ = 0; repeat until convergence: calculate all updates β̂ⱼ ← β̂ⱼ − η · ∂RSS/∂β̂ⱼ

SLIDE 60

Linear Regression via Gradient Descent (continued): η is the learning rate; each update step is based on the derivative of the RSS.
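A sketch of this loop for the univariate case (numpy; η, the iteration count, and the data are arbitrary choices, not from the slides):

import numpy as np

def fit_gd(x, y, eta=0.01, iters=5000):
    b0 = b1 = 0.0                      # start with both estimates at 0
    for _ in range(iters):             # "repeat until convergence"
        resid = y - (b0 + b1 * x)      # residuals under the current fit
        # step each parameter against the RSS derivative
        # (constant factors folded into eta)
        b0 += eta * resid.mean()
        b1 += eta * (resid * x).mean()
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(fit_gd(x, y))  # ≈ (0.15, 1.94): estimated intercept and slope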

SLIDE 61

Linear Regression via Direct Estimates (normal equations): β̂₁ = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)², β̂₀ = Ȳ − β̂₁X̄
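The same fit via the closed-form estimates; a sketch with numpy on the same made-up data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ≈ 0.15, 1.94: matches the gradient descent fit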

SLIDE 62

Pearson Product-Moment Correlation

Covariance: cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]

SLIDE 63

Pearson Product-Moment Correlation

Correlation: r = cov(X, Y) / (σ_X σ_Y)

SLIDE 64

Pearson Product-Moment Correlation

If one standardizes X and Y (i.e. subtract the mean and divide by the standard deviation) before running linear regression, then β̂₀ = 0 and β̂₁ = r --- i.e. the slope is the Pearson correlation!
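A quick numerical check of this fact (numpy; same made-up data as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

zx = (x - x.mean()) / x.std()   # standardize both variables
zy = (y - y.mean()) / y.std()

slope = (zx * zy).mean()        # least-squares slope on the z-scores
print(slope, np.corrcoef(x, y)[0, 1])  # both ≈ 0.998: slope = Pearson r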

SLIDE 65

Measures for Comparing Random Variables

  • Distance metrics
  • Linear Regression
  • Pearson Product-Moment Correlation
  • Multiple Linear Regression
  • (Multiple) Logistic Regression
  • Ridge Regression (L2 Penalized)
  • Lasso Regression (L1 Penalized)
SLIDE 66

Suppose we have multiple X that we'd like to fit to Y at once: Ŷᵢ = β₀ + β₁X₁ᵢ + … + βₘXₘᵢ. If we include β₀ and X₀ᵢ = 1 for all i (i.e. adding the intercept to X), then we can say: Ŷᵢ = Σⱼ βⱼXⱼᵢ

Multiple Linear Regression

SLIDE 67

Or in vector notation across all i: Y = Xβ + ε, where Y, β, and ε are vectors and X is a matrix.

Multiple Linear Regression

SLIDE 68

Estimating β: β̂ = (XᵀX)⁻¹ XᵀY

Multiple Linear Regression
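A sketch of the estimate with numpy (made-up data; the first column of ones carries the intercept):

import numpy as np

# 5 observations: a column of ones (intercept) plus two predictors
X = np.array([[1, 0.5, 1.2],
              [1, 1.0, 0.7],
              [1, 1.5, 1.9],
              [1, 2.0, 0.2],
              [1, 2.5, 1.1]])
Y = np.array([1.9, 2.6, 4.4, 3.7, 4.9])

beta = np.linalg.inv(X.T @ X) @ X.T @ Y  # (X'X)^-1 X'Y
print(beta)  # [intercept, coefficient 1, coefficient 2]

(In practice np.linalg.lstsq or np.linalg.solve is preferred over an explicit inverse for numerical stability.)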

SLIDE 69


Multiple Linear Regression

To test for significance of an individual coefficient, βⱼ:

SLIDE 70

Multiple Linear Regression

To test for significance of an individual coefficient, βⱼ: t = β̂ⱼ / se(β̂ⱼ)

SLIDE 71

Multiple Linear Regression

T-Test for significance of the hypothesis:
1) Calculate t
2) Calculate degrees of freedom: df = N − (m + 1), where s² = RSS / df
3) Check probability in a t distribution

SLIDE 72

[Figure: t distribution (df = v), used to look up the probability of the observed t]
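A sketch of the whole coefficient test, continuing the numpy example above (scipy supplies the t distribution; the data are made up):

import numpy as np
from scipy import stats

X = np.array([[1, 0.5, 1.2],
              [1, 1.0, 0.7],
              [1, 1.5, 1.9],
              [1, 2.0, 0.2],
              [1, 2.5, 1.1]])
Y = np.array([1.9, 2.6, 4.4, 3.7, 4.9])

N, m_plus_1 = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ Y
resid = Y - X @ beta
df = N - m_plus_1                      # df = N - (m + 1)
s2 = (resid ** 2).sum() / df           # s^2 = RSS / df
se = np.sqrt(s2 * XtX_inv.diagonal())  # standard error of each coefficient
t = beta / se
p = 2 * stats.t.sf(np.abs(t), df)      # two-sided p-value from the t dist
print(t, p)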

SLIDES 73-78

(Recap of slides 39-45: hypothesis testing, Type I and II errors, power, and multi-test correction.)

SLIDE 79

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

SLIDE 80

Logistic Regression


SLIDE 81

Logistic Regression

What if Yᵢ ∊ {0, 1}? (i.e. we want "classification") Model P(Yᵢ = 1 | X = x). Note: this is a probability here; in simple linear regression we wanted an expectation, E(Y | X = x).

SLIDE 82

Logistic Regression

(i.e. if p > 0.5 we can confidently predict Yᵢ = 1)

SLIDE 83

Logistic Regression


SLIDE 84

Logistic Regression

P(Yᵢ = 1 | X = x) = e^(βx) / (1 + e^(βx)) and P(Yᵢ = 0 | X = x) = 1 / (1 + e^(βx)). Thus, 0 is class 0 and 1 is class 1.

SLIDE 85

Logistic Regression

What if Yᵢ ∊ {0, 1}? (i.e. we want "classification") We're still learning a linear separating hyperplane, but fitting it to a logit outcome.

(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
SLIDE 86

Logistic Regression

To estimate β, one can use iteratively reweighted least squares.

(Wasserman, 2005; Li, 2010)
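In practice one usually calls a library rather than hand-coding the reweighted least squares loop. A minimal sketch with scikit-learn (made-up data):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
Y = np.array([0, 0, 0, 1, 1, 1])       # Yi ∊ {0, 1}

model = LogisticRegression().fit(X, Y)
print(model.intercept_, model.coef_)   # the fitted β
print(model.predict_proba([[1.75]]))   # [P(Y=0|x), P(Y=1|x)]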

SLIDE 87

Uses of linear and logistic regression

1. Testing the relationship between variables given other variables: β is an "effect size" -- a score for the magnitude of the relationship; it can be tested for significance.

2. Building a predictive model that generalizes to new data: Ŷ is an estimated value of Y given X.

SLIDE 88

Uses of linear and logistic regression

However, unless |X| << the number of observations, the model might "overfit".

SLIDE 89

Overfitting (1-d non-linear example)

[Figure: underfit (high bias) vs. overfit (high variance) fits to the same data]

(image credit: Scikit-learn; in practice data are rarely this clear)

SLIDE 90

Overfitting (5-d linear example)

[Table: outcome vector Y and predictor matrix X for a handful of observations]

SLIDE 91

Overfitting (5-d linear example)

Fitting to the data above: logit(Y) = 1.2 − 63·X1 + 179·X2 + 71·X3 + 18·X4 − 59·X5 + 19·X6

SLIDE 92

Overfitting (5-d linear example)

Do we really think we found something generalizable?

SLIDE 93

Overfitting (2-d linear example)

What if only 2 predictors? Fitting the same Y with just X1 and X2: logit(Y) = 0 + 2·X1 + 2·X2. Do we really think we found something generalizable?

SLIDE 94

Common Goal: Generalize to new data

[Diagram: Original Data → Model → New Data? Does the model hold up?]

SLIDE 95

Common Goal: Generalize to new data

[Diagram: Training Data → Model → Testing Data. Does the model hold up?]

SLIDE 96

Common Goal: Generalize to new data

[Diagram: Training Data → Model, with a Development Data split used to set training parameters; Testing Data checks whether the model holds up.]

SLIDE 97

Feature Selection / Subset Selection

A (bad) solution to the overfit problem: use fewer features, chosen by Forward Stepwise Selection:

current_model = [intercept]              # start with just the intercept (mean)
remaining_predictors = list(all_predictors)
for i in range(k):
    # find the best p to add to current_model:
    best_p = min(remaining_predictors,
                 key=lambda p: rss(refit(current_model + [p])))
    current_model.append(best_p)         # add best p, based on RSS_p
    remaining_predictors.remove(best_p)  # remove p from remaining predictors
# (rss and refit are placeholders for the model-fitting step)

SLIDE 98

Regularization (Shrinkage)

[Figure: coefficient weights (beta) under no selection vs. forward stepwise selection]

Why just keep or discard features?

SLIDE 99

Regularization (L2, Ridge Regression)

Idea: impose a penalty on the size of the weights.
Ordinary least squares objective: minimize Σᵢ (Yᵢ − β·Xᵢ)²
Ridge regression: minimize Σᵢ (Yᵢ − β·Xᵢ)² + λ Σⱼ βⱼ²

SLIDE 100

Regularization (L2, Ridge Regression)


SLIDE 101

Regularization (L2, Ridge Regression)

In matrix form: β̂_ridge = (XᵀX + λI)⁻¹ XᵀY, where I is the m × m identity matrix
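A sketch of the closed-form ridge solution with numpy (λ is arbitrary; a common refinement, not shown, is to leave the intercept unpenalized):

import numpy as np

X = np.array([[1, 0.5, 1.2],
              [1, 1.0, 0.7],
              [1, 1.5, 1.9],
              [1, 2.0, 0.2],
              [1, 2.5, 1.1]])
Y = np.array([1.9, 2.6, 4.4, 3.7, 4.9])

lam = 0.5                                # penalty strength (arbitrary)
m = X.shape[1]
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(m)) @ X.T @ Y
print(beta_ridge)  # weights shrunk toward 0 relative to OLS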

SLIDE 102

Regularization (L1, The “Lasso”)

Idea: impose a penalty that zeroes out some weights.
The Lasso objective: minimize Σᵢ (Yᵢ − β·Xᵢ)² + λ Σⱼ |βⱼ|

SLIDE 103

Regularization (L1, The “Lasso”)

No closed-form matrix solution, but often solved with coordinate descent.

Application: p ≈ n or p >> n (p: features; n: observations)
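scikit-learn's Lasso is one such coordinate-descent implementation; a minimal sketch (alpha and the data are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))            # p close to n, as in the use case
Y = 2.0 * X[:, 0] + rng.normal(size=20)  # only one predictor truly matters

model = Lasso(alpha=0.1).fit(X, Y)
print(model.coef_)  # most weights driven exactly to 0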

SLIDE 104

Common Goal: Generalize to new data

[Diagram: Training Data → Model, with a Development Data split used to set parameters; Testing Data checks whether the model holds up.]

SLIDE 105

N-Fold Cross-Validation

Goal: Decent estimate of model accuracy

[Diagram: all data split into train / dev / test folds; each iteration (Iter 1, Iter 2, Iter 3, …) rotates which fold is held out for test (and dev), training on the rest.]
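A sketch of the fold rotation with scikit-learn's KFold (the model and data are placeholders):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = LinearRegression().fit(X[train_idx], Y[train_idx])
    scores.append(model.score(X[test_idx], Y[test_idx]))  # R^2 per fold
print(np.mean(scores))  # a decent estimate of out-of-sample accuracy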