Hypothesis Testing and Statistical Preliminaries -- Stony Brook University, CSE545, Spring 2019 (PowerPoint presentation)


SLIDE 1

Hypothesis Testing and statistical preliminaries

Stony Brook University CSE545, Spring 2019

SLIDE 2

Hypothesis Testing:

  • Random Variables
  • Distributions
  • Hypothesis Testing Framework

Comparing Variables:

  • Simple Linear Regression, Correlation, Multiple Linear Regression
  • Comparing Variables and Hypothesis Testing
  • Regularized Linear Regression
  • Multiple Hypothesis Testing
SLIDE 3

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.

SLIDE 4

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice. Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, <HHHTT>…}

SLIDE 5

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.
Example: Ω = 5 coin tosses = {<HHHHH>, <HHHHT>, <HHHTH>, <HHHTT>…}
We may just care about how many tails. Thus: X(<HHHHH>) = 0, X(<HHHTH>) = 1, X(<TTTHT>) = 4, X(<HTTTT>) = 4
X has only 6 possible values: 0, 1, 2, 3, 4, 5.
What is the probability that we end up with k = 4 tails? P(X = k) := P( {ω : X(ω) = k} ) where ω ∊ Ω

SLIDE 6

Random Variables

X(ω) = 4 for 5 of the 32 outcomes in Ω. Thus, assuming a fair coin, P(X = 4) = 5/32.
(X is not a “variable”, but a function that we end up notating a lot like a variable.)
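The 5/32 figure can be verified by brute-force enumeration of the 32-outcome sample space; a minimal Python sketch (function names are illustrative):

```python
from itertools import product
from fractions import Fraction

# Sample space: all 2^5 = 32 equally likely sequences of 5 tosses.
omega = list(product("HT", repeat=5))

# X maps each outcome to the number of tails.
def X(outcome):
    return outcome.count("T")

# P(X = k) := P({omega : X(omega) = k}), assuming a fair coin.
def p_X_equals(k):
    return Fraction(sum(1 for o in omega if X(o) == k), len(omega))

print(p_X_equals(4))  # 5/32
```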

SLIDE 7

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.
X is a discrete random variable if it takes only a countable number of values.

SLIDE 8

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice. X is a discrete random variable if it takes only a countable number of values. X is a continuous random variable if it can take on an infinite number of values between any two given values.

SLIDE 9

Random Variables

X: A mapping from Ω to ℝ that describes the question we care about in practice.
Example: Ω = inches of snowfall = [0, ∞) ⊆ ℝ; X: the amount of inches in a snowstorm, X(ω) = ω
What is the probability we receive (at least) a inches? P(X ≥ a) := P( {ω : X(ω) ≥ a} )
What is the probability we receive between a and b inches? P(a ≤ X ≤ b) := P( {ω : a ≤ X(ω) ≤ b} )

SLIDE 10

Random Variables

P(X = i) := 0, for all i ∊ Ω

(probability of receiving exactly i inches of snowfall is zero)

SLIDE 11

Random Variables



How to model?

SLIDE 12

Continuous Random Variables


How to model? Discretize them!

(group into discrete bins)

SLIDE 13

Continuous Random Variables


But aren’t we throwing away information?

P(bin=8) = .32, P(bin=12) = .08

SLIDE 14

Continuous Random Variables


SLIDE 15

Continuous Random Variables


X is a continuous random variable if it can take on an infinite number of values between any two given values. Equivalently, X is a continuous random variable if there exists a function fX such that: P(a ≤ X ≤ b) = ∫[a,b] fX(x) dx

SLIDE 16

Continuous Random Variables


fX is called the “probability density function” (pdf): P(a ≤ X ≤ b) = ∫[a,b] fX(x) dx

SLIDE 17

Continuous Random Variables

SLIDE 18

Continuous Random Variables

SLIDE 19

Continuous Random Variables

Common Trap

  • fX(𝓎) does not yield a probability

○ ∫[a,b] fX(x) dx does ○ fX(𝓎) may be anything in [0, ∞)

■ thus, fX(𝓎) may be > 1

SLIDE 20

Continuous Random Variables

A Common Probability Density Function

SLIDE 21

Continuous Random Variables

Common pdfs: Normal(μ, σ²): fX(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

SLIDE 22

Continuous Random Variables

Common pdfs: Normal(μ, σ²), where μ: mean (or “center”) = expectation; σ²: variance; σ: standard deviation

SLIDE 23


Continuous Random Variables


Credit: Wikipedia

SLIDE 24

Continuous Random Variables

Common pdfs: Normal(μ, σ²)

X ~ Normal(μ, σ²), examples in real life:

  • height
  • intelligence/ability
  • measurement error
  • averages (or sums) of lots of random variables

SLIDE 25

Continuous Random Variables

Common pdfs: Normal(0, 1) (“standard normal”). How to “standardize” any normal distribution:

1. subtract the mean, μ (aka “mean centering”)
2. divide by the standard deviation, σ

z = (x − μ) / σ (aka “z-score”)


Credit: MIT Open Courseware: Probability and Statistics
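The standardization recipe above is a one-liner per value; a small sketch using only the standard library (the `heights` data is made up for illustration):

```python
import statistics

def z_scores(xs):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

heights = [60, 65, 70, 75, 80]   # illustrative data
zs = z_scores(heights)
print(zs)  # centered at 0 with unit variance
```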

SLIDE 26

Continuous Random Variables

Common pdfs: Normal(0, 1)


Credit: MIT Open Courseware: Probability and Statistics

SLIDE 27

Cumulative Distribution Function

For a given random variable X, the cumulative distribution function (CDF), FX: ℝ → [0, 1], is defined by: FX(x) = P(X ≤ x)

(figure: Normal and Uniform CDFs)

SLIDE 28

Cumulative Distribution Function

For a given random variable X, the cumulative distribution function (CDF), FX: ℝ → [0, 1], is defined by: FX(x) = P(X ≤ x)

(figure: Exponential, Normal, and Uniform CDFs)

Pro: yields a probability! Con: not intuitively interpretable.
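The normal CDF has no closed form, but it can be computed from the error function in the standard library; a sketch:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F_X(x) = P(X <= x) for X ~ Normal(mu, sigma^2), via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(normal_cdf(0))                   # 0.5: half the mass lies below the mean
print(normal_cdf(1) - normal_cdf(-1))  # mass within one standard deviation (~0.68)
```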

SLIDE 29

Random Variables, Revisited

X: A mapping from Ω to ℝ that describes the question we care about in practice. X is a discrete random variable if it takes only a countable number of values. X is a continuous random variable if it can take on an infinite number of values between any two given values.

SLIDE 30

Discrete Random Variables

X is a discrete random variable if it takes only a countable number of values. For a given random variable X, the cumulative distribution function (CDF), FX: ℝ → [0, 1], is defined by: FX(x) = P(X ≤ x)

slide-31
SLIDE 31

Discrete Random Variables

X is a discrete random variable if it takes only a countable number of values. For a given random variable X, the cumulative distribution function (CDF), FX: ℝ → [0, 1], is defined by: FX(x) = P(X ≤ x)

(figure: Binomial(n, p) CDF -- similar in shape to the normal CDF)

slide-32
SLIDE 32

Discrete Random Variables

For a given discrete random variable X, the probability mass function (pmf), fX: ℝ → [0, 1], is defined by: fX(x) = P(X = x)

(figure: Binomial(n, p) pmf)

slide-33
SLIDE 33

Discrete Random Variables

Two Common Discrete Random Variables

  • Binomial(n, p)

example: number of heads after n coin flips (p, probability of heads)

  • Bernoulli(p) = Binomial(1, p)

example: one trial of success or failure

(figure: Binomial(n, p) pmf)
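Both distributions are easy to work with directly; a sketch of the Binomial pmf and a Bernoulli trial using only the standard library:

```python
import math
import random

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def bernoulli(p):
    """One trial of success (1) or failure (0): Bernoulli(p) = Binomial(1, p)."""
    return 1 if random.random() < p else 0

# Number of tails in 5 fair flips: matches the earlier 5/32 for k = 4.
print(binomial_pmf(4, 5, 0.5))  # 0.15625
```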

slide-34
SLIDE 34

Hypothesis Testing

Hypothesis -- something one asserts to be true.

slide-35
SLIDE 35

Hypothesis Testing

Hypothesis -- something one asserts to be true.

Classical Approach:
H0: null hypothesis -- some “default” value; “null”: nothing changes
H1: the alternative -- the opposite of the null => a change or difference

slide-36
SLIDE 36

Hypothesis Testing

Hypothesis -- something one asserts to be true.

Classical Approach:
H0: null hypothesis -- some “default” value; “null”: nothing changes
H1: the alternative -- the opposite of the null => a change or difference

Goal: Use probability to determine if we can “reject the null” (H0) in favor of H1: i.e., there is less than a 5% chance of seeing a result this extreme if the null were true.

slide-37
SLIDE 37

Hypothesis Testing

Example: Hypothesize a coin is biased. H0: the coin is not biased (i.e. flipping n times results in a Binomial(n, 0.5)) H1: the coin is biased (i.e. flipping n times does not result in a Binomial(n, 0.5))

slide-38
SLIDE 38

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

More formally: Let X be a random variable and let R be the range of X. Rreject ⊂ R is the rejection region. If X ∊ Rreject then we reject the null.

slide-39
SLIDE 39

Hypothesis Testing


alpha: size of the rejection region -- the probability, under H0, that X falls in Rreject (e.g. 0.05, 0.01, .001)

slide-40
SLIDE 40

Hypothesis Testing


alpha : size of rejection region (e.g. 0.05, 0.01, .001). In the biased coin example, if n = 1000, then Rreject = [0, 469] ∪ [531, 1000]
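The rejection region can be recovered from exact Binomial(1000, 0.5) tail sums; a sketch assuming a two-sided test at alpha = 0.05 (exact equal-tail cutoffs may differ by one from the normal-approximation interval quoted above):

```python
import math

def rejection_region(n, alpha=0.05, p=0.5):
    """Two-sided rejection region for H0: X ~ Binomial(n, 0.5), via exact tail sums."""
    cdf, k_low = 0.0, -1
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
        if cdf <= alpha / 2:
            k_low = k            # largest k with P(X <= k) <= alpha/2
        else:
            break
    return k_low, n - k_low      # reject if X <= k_low or X >= n - k_low

low, high = rejection_region(1000)
print(low, high)
```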

slide-41
SLIDE 41

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

slide-42
SLIDE 42

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

A general framework for answering (yes/no) questions!

slide-43
SLIDE 43

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

A general framework for answering (yes/no) questions!

slide-44
SLIDE 44

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

A general framework for answering (yes/no) questions!

slide-45
SLIDE 45

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

A general framework for answering (yes/maybe) questions!


Failing to “reject the null” does not mean the null is true.

slide-46
SLIDE 46

Hypothesis Testing

Hypothesis -- something one asserts to be true. Classical Approach: H0: null hypothesis -- some “default” value (usually that one’s hypothesis is false)

Why?

A general framework for answering (yes/maybe) questions!


Failing to “reject the null” does not mean the null is true. However, with a large enough sample, a failure to reject may suggest that the effect size (correlation, difference value, etc.) is not very meaningful.

slide-47
SLIDE 47

Hypothesis Testing

Important logical question: does failure to reject the null mean the null is true? No. Traditionally, one of the most common reasons to fail to reject the null: n is too small (not enough data).

Big Data problem: “everything” is significant. Thus, consider “effect size”

slide-48
SLIDE 48

Hypothesis Testing

Important logical question: does failure to reject the null mean the null is true? No. Traditionally, one of the most common reasons to fail to reject the null: n is too small (not enough data). Thought experiment: if we have infinite data, can the null ever be true?

Big Data problem: “everything” is significant. Thus, consider “effect size”

slide-49
SLIDE 49

Statistical Considerations in Big Data

1. Average multiple models (ensemble techniques)
2. Correct for multiple tests (Bonferroni’s Principle)
3. Smooth data
4. “Plot” data (or figure out a way to look at a lot of it “raw”)
5. Interact with data
6. Know your “real” sample size
7. Correlation is not causation
8. Define metrics for success (set a baseline)
9. Share code and data
10. The problem should drive the solution

(http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/)

slide-50
SLIDE 50

Measures for Comparing Random Variables

  • Distance metrics
  • Linear Regression
  • Pearson Product-Moment Correlation
  • Multiple Linear Regression
  • (Multiple) Logistic Regression
  • Ridge Regression (L2 Penalized)
  • Lasso Regression (L1 Penalized)
slide-51
SLIDE 51

Linear Regression

Finding a linear function based on X to best yield Y.
X = “covariate” = “feature” = “predictor” = “regressor” = “independent variable”
Y = “response variable” = “outcome” = “dependent variable”

Regression goal: estimate the function r, where r(x) = E(Y | X = x): the expected value of Y, given that the random variable X is equal to some specific value, x.

slide-52
SLIDE 52

Linear Regression

Finding a linear function based on X to best yield Y. Regression goal: estimate the function r. Linear Regression (univariate version) goal: find 𝛾0, 𝛾1 such that r(x) = 𝛾0 + 𝛾1·x

slide-53
SLIDE 53

Linear Regression

Simple Linear Regression, more precisely: Yi = 𝛾0 + 𝛾1·Xi + εi

slide-54
SLIDE 54

Linear Regression

Simple Linear Regression: Yi = 𝛾0 + 𝛾1·Xi + εi, with intercept 𝛾0, slope 𝛾1, and error εi, where E(εi | Xi) = 0 and Var(εi | Xi) = σ²

slide-55
SLIDE 55

Linear Regression

Simple Linear Regression: Yi = 𝛾0 + 𝛾1·Xi + εi (intercept 𝛾0, slope 𝛾1, error εi)

Estimated intercept and slope: 𝛾̂0, 𝛾̂1, giving fitted values Ŷi = 𝛾̂0 + 𝛾̂1·Xi

Residual: ε̂i = Yi − Ŷi

slide-56
SLIDE 56

Linear Regression

Simple Linear Regression: Yi = 𝛾0 + 𝛾1·Xi + εi (intercept 𝛾0, slope 𝛾1, error εi)

Estimated intercept and slope: Ŷi = 𝛾̂0 + 𝛾̂1·Xi; Residual: ε̂i = Yi − Ŷi

Least Squares Estimate: find the 𝛾̂0 and 𝛾̂1 which minimize the residual sum of squares: RSS = Σi ε̂i²

slide-57
SLIDE 57


Linear Regression

via Gradient Descent: start with 𝛾̂0 = 𝛾̂1 = 0; repeat until convergence: calculate all residuals ε̂i, then update 𝛾̂0 and 𝛾̂1 to decrease the RSS

slide-58
SLIDE 58


Linear Regression

via Gradient Descent: start with 𝛾̂0 = 𝛾̂1 = 0; repeat until convergence: calculate all residuals ε̂i, then update each coefficient by a step based on the derivative of the RSS, scaled by the learning rate
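The update loop can be sketched in plain Python for the univariate case (the learning rate, iteration count, and data are illustrative choices):

```python
def fit_gd(xs, ys, lr=0.01, iters=5000):
    """Least-squares fit of y = g0 + g1*x by gradient descent on the RSS."""
    g0 = g1 = 0.0                        # start with both coefficients at 0
    n = len(xs)
    for _ in range(iters):
        # residuals under the current estimates
        resid = [y - (g0 + g1 * x) for x, y in zip(xs, ys)]
        # step in the direction that decreases the RSS (mean-gradient updates)
        g0 += lr * sum(resid) / n
        g1 += lr * sum(r * x for r, x in zip(resid, xs)) / n
    return g0, g1

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]                     # exactly y = 1 + 2x
g0, g1 = fit_gd(xs, ys)
print(round(g0, 2), round(g1, 2))
```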

slide-59
SLIDE 59


Linear Regression


via Direct Estimates (normal equations): 𝛾̂1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)²; 𝛾̂0 = Ȳ − 𝛾̂1·X̄

slide-60
SLIDE 60

Pearson Product-Moment Correlation

Covariance: cov(X, Y) = E[(X − E[X])(Y − E[Y])]


slide-61
SLIDE 61

Pearson Product-Moment Correlation

Covariance: cov(X, Y) = E[(X − E[X])(Y − E[Y])]; Correlation: r = cov(X, Y) / (σX·σY)


slide-62
SLIDE 62

Pearson Product-Moment Correlation

Covariance: cov(X, Y) = E[(X − E[X])(Y − E[Y])]; Correlation: r = cov(X, Y) / (σX·σY). If one standardizes X and Y (i.e. subtract the mean and divide by the standard deviation) before running linear regression, then: 𝛾̂0 = 0 and 𝛾̂1 = r --- i.e. the slope is the Pearson correlation!

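The standardized-slope-equals-correlation fact can be checked numerically; a sketch with made-up data:

```python
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(X, Y) / (sd(X) * sd(Y))."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ols_slope(xs, ys):
    """Least-squares slope from the direct (normal-equation) estimate."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def standardize(v):
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(x - m) / s for x in v]

xs = [1.0, 2.0, 4.0, 5.0, 7.0]   # illustrative data
ys = [2.0, 3.5, 4.0, 6.5, 8.0]
# The slope on standardized variables equals the correlation.
print(round(pearson_r(xs, ys), 6), round(ols_slope(standardize(xs), standardize(ys)), 6))
```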

slide-63
SLIDE 63

Measures for Comparing Random Variables

  • Distance metrics
  • Linear Regression
  • Pearson Product-Moment Correlation
  • Multiple Linear Regression
  • (Multiple) Logistic Regression
  • Ridge Regression (L2 Penalized)
  • Lasso Regression (L1 Penalized)
slide-64
SLIDE 64

Measures for Comparing Random Variables

slide-65
SLIDE 65

Suppose we have multiple X that we’d like to fit to Y at once: Yi = 𝛾0 + 𝛾1·X1i + … + 𝛾m·Xmi + εi. If we include X0i = 1 for all i (i.e. adding the intercept to X), then we can say: Yi = Σj 𝛾j·Xji + εi

Multiple Linear Regression

slide-66
SLIDE 66

Suppose we have multiple X that we’d like to fit to Y at once. If we include X0i = 1 for all i, then we can say: Yi = Σj 𝛾j·Xji + εi

Or in vector notation across all i: Y = X𝛾 + ε, where Y, 𝛾, and ε are vectors and X is a matrix.

Multiple Linear Regression

slide-67
SLIDE 67

Suppose we have multiple X that we’d like to fit to Y at once. If we include X0i = 1 for all i, then we can say: Yi = Σj 𝛾j·Xji + εi

Or in vector notation across all i: Y = X𝛾 + ε, where Y, 𝛾, and ε are vectors and X is a matrix.

Estimating 𝛾 (least squares): 𝛾̂ = (XᵀX)⁻¹XᵀY

Multiple Linear Regression
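The estimate 𝛾̂ = (XᵀX)⁻¹XᵀY can be sketched with NumPy on synthetic data (solving the normal equations rather than forming the inverse explicitly; the coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
# X0i = 1 for all i: the first column is the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
true_gamma = np.array([2.0, 1.0, -0.5, 3.0])
Y = X @ true_gamma + 0.1 * rng.normal(size=n)   # Y = X·gamma + eps

# Least-squares estimate via the normal equations: (X^T X) gamma_hat = X^T Y
gamma_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(gamma_hat, 1))
```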

slide-68
SLIDE 68


Multiple Linear Regression

To test for significance of an individual coefficient 𝛾j:

slide-69
SLIDE 69


Multiple Linear Regression

To test for significance of an individual coefficient 𝛾j: t = 𝛾̂j / SE(𝛾̂j)

slide-70
SLIDE 70

T-Test for significance of hypothesis:

1) Calculate t
2) Calculate degrees of freedom: df = N − (m+1)
3) Check probability in a t distribution

To test for significance of individual coefficient, j:

Multiple Linear Regression

s² = RSS / df

slide-71
SLIDE 71

T-Test for significance of hypothesis:

1) Calculate t
2) Calculate degrees of freedom: df = N − (m+1)
3) Check probability in a t distribution (df = v)

To test for significance of an individual coefficient 𝛾j: t = 𝛾̂j / SE(𝛾̂j)

slide-72
SLIDE 72

Hypothesis Testing

Important logical question: Does failure to reject the null mean the null is true? no. Traditionally, one of the most common reasons to fail to reject the null: n is too small (not enough data) Thought experiment: If we have infinite data, can the null ever be true?

Big Data problem: “everything” is significant. Thus, consider “effect size”

slide-73
SLIDE 73

Type I, Type II Errors

(Orloff & Bloom, 2014)

slide-74
SLIDE 74

Power

significance level (α) = P(type I error) = P(Reject H0 | H0) (probability we incorrectly reject)
power = 1 − P(type II error) = P(Reject H0 | H1) (probability we correctly reject)

(Orloff & Bloom, 2014)

slide-75
SLIDE 75

Multi-test Correction

If alpha = .05, and I run 40 variables through significance tests, then, by chance, how many are likely to be significant?

slide-76
SLIDE 76

Multi-test Correction

On average, 2: each test has a 5% chance of rejecting the null by chance, and 40 × 0.05 = 2.

How to fix?

slide-77
SLIDE 77

Multi-test Correction

How to fix? If all tests are independent: the “Bonferroni Correction” -- use α/m as the per-test threshold. Better alternative: control the False Discovery Rate (Benjamini-Hochberg).
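Both corrections are a few lines each; a sketch with illustrative p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i only if p_i < alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= (k/m)*alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= (rank / m) * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # made-up p-values
print(sum(bonferroni(pvals)))          # strictest: controls family-wise error
print(sum(benjamini_hochberg(pvals)))  # less conservative: controls FDR
```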

slide-78
SLIDE 78

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

slide-79
SLIDE 79

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

slide-80
SLIDE 80

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”) Model a probability: P(Yi = 1 | X = x) = 1 / (1 + e^−(𝛾0 + 𝛾1·x)). Note: this is a probability here; in simple linear regression we wanted an expectation: E(Y | X = x) = 𝛾0 + 𝛾1·x

slide-81
SLIDE 81

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

Note: this is a probability here; in simple linear regression we wanted an expectation. (i.e. if p > 0.5 we can confidently predict Yi = 1)

slide-82
SLIDE 82

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

slide-83
SLIDE 83

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”) P(Yi = 0 | X = x) = 1 − P(Yi = 1 | X = x). Thus, 0 is class 0 and 1 is class 1.
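The two class probabilities come from one sigmoid; a sketch with hypothetical coefficients (𝛾0 = −3, 𝛾1 = 1.5 are made up, not fitted):

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def p_y1(x, g0, g1):
    """P(Yi = 1 | X = x) under a univariate logistic model."""
    return sigmoid(g0 + g1 * x)

g0, g1 = -3.0, 1.5   # hypothetical coefficients
x = 2.5
p1 = p_y1(x, g0, g1)
p0 = 1 - p1          # P(Yi = 0 | X = x): the two classes sum to 1
label = 1 if p1 > 0.5 else 0
print(round(p1, 3), label)
```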

slide-84
SLIDE 84

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”) We’re still learning a linear separating hyperplane, but fitting it to a logit outcome.

(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
SLIDE 85

Logistic Regression

What if Yi ∊ {0, 1}? (i.e. we want “classification”)

To estimate 𝛾, one can use reweighted least squares:

(Wasserman, 2005; Li, 2010)

slide-86
SLIDE 86

Uses of linear and logistic regression

1. Testing the relationship between variables given other variables. 𝛾 is an “effect size” -- a score for the magnitude of the relationship; can be tested for significance.
2. Building a predictive model that generalizes to new data. Ŷ is an estimated value of Y given X.

slide-87
SLIDE 87

Uses of linear and logistic regression

1. Testing the relationship between variables given other variables. 𝛾 is an “effect size” -- a score for the magnitude of the relationship; can be tested for significance.
2. Building a predictive model that generalizes to new data. Ŷ is an estimated value of Y given X. However, unless the number of features is much smaller than the number of observations, the model might “overfit”.

slide-88
SLIDE 88

Overfitting (1-d non-linear example)

(figure: underfit = high bias; overfit = high variance)

(image credit: Scikit-learn; in practice data are rarely this clear)

slide-89
SLIDE 89

Overfitting (5-d linear example)

(figure: example response vector Y and data matrix X)

slide-90
SLIDE 90

Overfitting (5-d linear example)

(figure: example response vector Y and data matrix X)

logit(Y) = 1.2 − 63·X1 + 179·X2 + 71·X3 + 18·X4 − 59·X5 + 19·X6

slide-91
SLIDE 91

Overfitting (5-d linear example)

(figure: example response vector Y and data matrix X) Do we really think we found something generalizable?

logit(Y) = 1.2 − 63·X1 + 179·X2 + 71·X3 + 18·X4 − 59·X5 + 19·X6

slide-92
SLIDE 92

Overfitting (2-d linear example)

(figure: example response vector Y and data matrix X with two predictors)

logit(Y) = 0 + 2·X1 + 2·X2

Do we really think we found something generalizable? What if only 2 predictors?

slide-93
SLIDE 93

Common Goal: Generalize to new data

(diagram: Original Data → Model → New Data: does the model hold up?)

slide-94
SLIDE 94

Common Goal: Generalize to new data

(diagram: Training Data → Model → Testing Data: does the model hold up?)

slide-95
SLIDE 95

Common Goal: Generalize to new data

(diagram: Training Data → Model, with Development Data used to set training parameters → Testing Data: does the model hold up?)

slide-96
SLIDE 96

Feature Selection / Subset Selection

(bad) solution to the overfit problem: use fewer features, chosen by Forward Stepwise Selection:

  • start with a current_model that has just the intercept (mean)

    remaining_predictors = all_predictors
    for i in range(k):
        # find best p to add to current_model:
        for p in remaining_predictors:
            refit current_model with p
        # add best p, based on RSS_p, to current_model
        # remove p from remaining_predictors
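The pseudocode above can be made runnable with NumPy's least-squares routine; a sketch on synthetic data where only two of six predictors matter:

```python
import numpy as np

def rss(X_cols, y):
    """Residual sum of squares of an OLS fit on the given columns."""
    X = np.column_stack(X_cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_stepwise(X, y, k):
    n, m = X.shape
    current = [np.ones(n)]               # start with just the intercept
    remaining = list(range(m))
    chosen = []
    for _ in range(k):
        # find the predictor whose addition most reduces the RSS
        best = min(remaining, key=lambda j: rss(current + [X[:, j]], y))
        current.append(X[:, best])
        remaining.remove(best)
        chosen.append(best)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.normal(size=200)  # only features 2 and 5 matter
print(sorted(forward_stepwise(X, y, 2)))
```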

slide-97
SLIDE 97

Regularization (Shrinkage)

(figure: coefficient weights under no selection vs. forward stepwise)

Why just keep or discard features?

slide-98
SLIDE 98

Regularization (L2, Ridge Regression)

Idea: impose a penalty on the size of the weights.
Ordinary least squares objective: min𝛾 Σi (Yi − Σj 𝛾j·Xji)²
Ridge regression: min𝛾 Σi (Yi − Σj 𝛾j·Xji)² + λ·Σj 𝛾j²

slide-99
SLIDE 99

Regularization (L2, Ridge Regression)


slide-100
SLIDE 100

Regularization (L2, Ridge Regression)

Idea: impose a penalty on the size of the weights. Ridge regression objective: min𝛾 Σi (Yi − Σj 𝛾j·Xji)² + λ·Σj 𝛾j²

In matrix form: 𝛾̂ = (XᵀX + λI)⁻¹XᵀY, where I is the m × m identity matrix
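The closed-form ridge solution is a one-line change from OLS; a sketch on synthetic data (centering X and Y so the unpenalized intercept can be dropped is an assumption of this example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """gamma_hat = (X^T X + lam*I)^{-1} X^T y, solved without forming the inverse."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)                      # center so no intercept column is needed
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
y -= y.mean()

w_ols = ridge_fit(X, y, lam=0.0)         # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=100.0)     # larger lam shrinks the weights
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```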

slide-101
SLIDE 101

Regularization (L1, The “Lasso”)

Idea: impose a penalty and zero-out some weights. The Lasso objective: min𝛾 Σi (Yi − Σj 𝛾j·Xji)² + λ·Σj |𝛾j|

slide-102
SLIDE 102

Regularization (L1, The “Lasso”)

Idea: impose a penalty and zero-out some weights. The Lasso objective: min𝛾 Σi (Yi − Σj 𝛾j·Xji)² + λ·Σj |𝛾j|. No closed-form matrix solution, but often solved with coordinate descent.

Application: p ≅ n or p >> n (p: features; n: observations)

slide-103
SLIDE 103

Common Goal: Generalize to new data

(diagram: Training Data → Model, with Development Data used to set parameters → Testing Data: does the model hold up?)

slide-104
SLIDE 104

N-Fold Cross-Validation

Goal: a decent estimate of model accuracy

(diagram: all data is split into train, dev, and test folds; the test fold rotates across Iter 1, Iter 2, Iter 3, …)
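A minimal fold-splitter (contiguous folds, no shuffling -- shuffling first is common in practice):

```python
def n_fold_splits(n_items, n_folds):
    """Yield (train_idx, test_idx) pairs; each item lands in the test fold exactly once."""
    idx = list(range(n_items))
    fold_size = n_items // n_folds
    for f in range(n_folds):
        start = f * fold_size
        end = start + fold_size if f < n_folds - 1 else n_items
        test = idx[start:end]
        train = idx[:start] + idx[end:]
        yield train, test

for train, test in n_fold_splits(10, 5):
    print(len(train), len(test))
```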

slide-105
SLIDE 105

Summary

Hypothesis Testing:

A framework for deciding which differences/relationships matter.

  • Random Variables
  • Distributions
  • Hypothesis Testing Framework

Comparing Variables:

Metrics to quantify the difference or relationship between variables.

  • Simple Linear Regression, Correlation, Multiple Linear Regression
  • Comparing Variables and Hypothesis Testing
  • Regularized Linear Regression
  • Multiple Hypothesis Testing