Gov 2000: 12. Troubleshooting the Linear Model
Matthew Blackwell
Fall 2016
1. Outliers, leverage points, and influential observations
2. Heteroskedasticity
3. Nonlinearity of the regression function
Where are we?

▶ OLS has good properties under the Gauss-Markov assumptions (and sometimes conditional Normality).
▶ What happens when these assumptions fail? How can we tell? Can we fix it?

The linear model: y_i = x_i'β + u_i, where (y_i, x_i) are an iid sample from the population, E[u_i | x_i] = 0, and V[u_i | x_i] = σ²_u.

Three issues today:

▶ Outliers and influential observations → possible inefficiency and bias
▶ Heteroskedasticity → SEs are biased (usually downward)
▶ Nonlinearity → biased/inconsistent estimates
Running example: election-day Buchanan votes across Florida counties in 2000 (flvote data).

[Figure: scatterplot of Buchanan votes against total votes by Florida county]

[Figure: the same scatterplot with county labels; one county sits far above the rest]

mod <- lm(edaybuchanan ~ edaytotal, data = flvote)
summary(mod)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333 on 65 degrees of freedom
## Multiple R-squared: 0.463, Adjusted R-squared: 0.455
## F-statistic: 56 on 1 and 65 DF,  p-value: 2.42e-10

▶ Outliers — observations far from the rest of the conditional distribution — can cause inefficiency and possibly bias
[Figure: regression fits with the full sample vs. without the leverage point]
Leverage and the hat matrix:

β̂ = (X'X)⁻¹X'y
û = y − Xβ̂ = y − X(X'X)⁻¹X'y ≡ y − Hy = (I − H)y
ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy

where H = X(X'X)⁻¹X' is the "hat matrix" (it puts the hat on y):

▶ H is an n × n symmetric matrix
▶ H is idempotent: HH = H

Each fitted value is a weighted sum of the outcomes,

ŷ_i = Σ_{j=1}^{n} h_ij y_j

with weights from the hat matrix. The diagonal entries h_i = h_ii are the hat values; for simple regression,

h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^{n} (x_j − x̄)²

▶ → h_i measures how far x_i is from the center of the x distribution
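The hat-matrix algebra above can be checked numerically. A minimal sketch in Python/NumPy (the slides use R; the data here are synthetic, not the Florida data):

```python
import numpy as np

# Build H = X(X'X)^{-1}X' for a simple regression and verify its properties.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
X = np.column_stack([np.ones(20), x])        # n x (k+1) design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix

# H is symmetric and idempotent (HH = H)
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)

# For simple regression, the hat values reduce to
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
n = len(x)
h_formula = 1/n + (x - x.mean())**2 / ((x - x.mean())**2).sum()
assert np.allclose(np.diag(H), h_formula)
print("hat matrix checks pass")
```

The assertions confirm both matrix properties and the closed-form hat values for the one-regressor case.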
head(hatvalues(mod), 5)
##       1       2       3       4       5
## 0.04179 0.02285 0.22066 0.01556 0.01493

[Figure: dotchart of hat values (roughly 0.05 to 0.25) by Florida county]
[Figure: regression fits with the full sample vs. without the outlier]
▶ Problem: the û_i are not identically distributed.
▶ Variance of the ith residual:

V[û_i | X] = σ²_u(1 − h_i)

▶ Standardized residuals rescale by this variance:

û′_i = û_i / (σ̂ √(1 − h_i))

▶ |û′_i| > 2 will be relatively rare.
▶ |û′_i| > 4–5 should definitely be checked.
std.resids <- rstandard(mod)

[Figure: index plot of the standardized residuals; Palm Beach stands out far above the rest]
▶ Leave-one-out approach: drop each observation and see how the regression behaves without it.
▶ Estimate the model omitting observation i:

β̂(−i) = (X'(−i)X(−i))⁻¹ X'(−i) y(−i)

▶ Out-of-sample prediction for the omitted observation: ỹ_i = x_i'β̂(−i)
▶ Leave-one-out residual: ũ_i = y_i − ỹ_i
▶ Handy identity linking it to the usual residual:

ũ_i = û_i / (1 − h_i)
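The leave-one-out identity can be verified by brute force — refit the model n times and compare. A Python/NumPy sketch on assumed synthetic data:

```python
import numpy as np

# Verify u~_i = u^_i / (1 - h_i) by actually dropping each observation.
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta                                        # usual residuals
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat values

# Refit without each observation and form the out-of-sample residual
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo[i] = y[i] - X[i] @ b_i

assert np.allclose(loo, resid / (1 - h))
print("LOO identity holds")
```

The identity is what makes leave-one-out diagnostics cheap: no refitting is actually required.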
[Figure: regression fits with the full sample vs. without the influence point]

▶ An influence point is an observation that is both an outlier and a leverage point.
▶ Influence can be measured by the difference between the fitted value and the predicted leave-one-out value: ŷ_i − ỹ_i
▶ This is equivalent to ũ_i h_i, which is just the "outlier-ness × leverage"
▶ Cook's distance normalizes this quantity:

D_i = ũ_i² / ((k+1)σ̂²) × h_i

▶ Basically: "normalized outlier-ness × leverage"
▶ D_i > 4/(n − k − 1) considered "large", but cutoffs are arbitrary

Influence plots:

▶ x-axis: hat values, h_i
▶ y-axis: standardized residuals, û′_i
plot(mod, which = 5, labels.id = flvote$county)

[Figure: standardized residuals vs. leverage for lm(edaybuchanan ~ edaytotal), with Cook's distance contours; Palm Beach, Miami-Dade, and Broward are flagged]
What to do about outliers and influential points?

▶ Fix the observation (obvious data entry errors)
▶ Remove the observation
▶ Be transparent either way
▶ Transform the dependent variable (log(y))
▶ Use a method that is robust to outliers (robust regression, least absolute deviations)
Heteroskedasticity and the variance of OLS:

β̂ = (X'X)⁻¹X'y
V[β̂ | X] = (X'X)⁻¹ X'ΣX (X'X)⁻¹

Under homoskedasticity this collapses to

V[β̂ | X] = σ²_u (X'X)⁻¹

so plugging in σ̂²_u will give us our estimate of the covariance matrix.

Homoskedasticity: a constant variance down the diagonal,

V[u | X] = σ²_u I = diag(σ²_u, σ²_u, …, σ²_u)

Heteroskedasticity: each unit can have its own error variance,

V[u | X] = diag(σ²_1, σ²_2, …, σ²_n)
[Figure: side-by-side scatterplots of Y on X under heteroskedasticity and homoskedasticity]

Consequences of heteroskedasticity:

▶ β̂ is still unbiased and consistent for β
▶ but the usual standard errors are wrong, so tests and confidence intervals will be misleading
Diagnosing heteroskedasticity with residual plots:

▶ Residuals vs. fitted values: in R, plot(mod, which = 1); the residuals should have the same variance across the x-axis
▶ Scale-location plot: y-axis is the square root of the absolute value of the standardized residuals, x-axis is the fitted values; usually has a loess trend curve, which should be flat. In R, plot(mod, which = 3)

plot(mod, which = 1, lwd = 3)
plot(mod, which = 3, lwd = 3)

[Figure: residuals vs. fitted and scale-location plots for mod, with observations 3, 9, and 64 flagged]
Two ways to deal with heteroskedasticity:

▶ Model the variance and reweight: weighted least squares (WLS)
▶ Use an estimator of V[β̂ | X] that is robust to heteroskedasticity
One fix: transform the dependent variable. Taking logs often stabilizes the variance:

mod2 <- lm(log(edaybuchanan) ~ log(edaytotal), data = flvote)
summary(mod2)
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  0.400          3.5e-09 ***
## log(edaytotal)    0.729      0.038   19.15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.469 on 65 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.847
## F-statistic: 367 on 1 and 65 DF,  p-value: <2e-16

plot(mod2, which = 3)

[Figure: scale-location plot for lm(log(edaybuchanan) ~ log(edaytotal)); the loess curve is now much flatter, with observations 39, 55, and 64 flagged]
Weighted least squares (WLS):

▶ Suppose the heteroskedasticity is known up to a multiplicative constant: V[u_i | X] = a_i σ²_u, where a_i = a_i(x_i) is a positive and known function of x_i
▶ Reweight each term in the model by 1/√a_i:

y_i/√a_i = β₀(1/√a_i) + β₁(x_i1/√a_i) + ⋯ + β_k(x_ik/√a_i) + u_i/√a_i

▶ The reweighted errors are homoskedastic:

V[(1/√a_i)u_i | X] = (1/a_i)V[u_i | X] = (1/a_i)a_i σ²_u = σ²_u

▶ This makes the reweighted model homoskedastic and, thus, BLUE again
In matrix form, define the weighting matrix

W = diag(1/√a_1, 1/√a_2, …, 1/√a_n)

and transform the model:

Wy = WXβ + Wu   ⟺   y* = X*β + u*

The transformed model satisfies the Gauss-Markov assumptions, so running OLS on it gives the WLS estimator of β:

β̂_W = (X'W'WX)⁻¹X'W'Wy
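The equivalence between "OLS on rescaled data" and the weighted normal equations can be demonstrated directly. A Python/NumPy sketch with assumed synthetic data, where the variance constant a_i is proportional to x_i (mimicking the ballots example):

```python
import numpy as np

# WLS two ways: (1) OLS on rows rescaled by 1/sqrt(a_i),
# (2) weighted normal equations (X'WX)^{-1} X'Wy with W = diag(1/a_i).
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, size=n)
a = x                                      # known variance constants a_i
y = 1 + 0.5 * x + np.sqrt(a) * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

w = 1 / np.sqrt(a)
Xs, ys = X * w[:, None], y * w             # transformed data X*, y*
beta_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

W = np.diag(1 / a)
beta_check = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
assert np.allclose(beta_wls, beta_check)
print(beta_wls.round(2))
```

Both routes give the same coefficients because X*'X* = X'WX and X*'y* = X'Wy.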
▶ In R, the weights argument of lm() takes the inverse of the standard deviation squared: 1/a_i
▶ Here, suppose the error variance is proportional to the total number of ballots cast:

mod.wls <- lm(edaybuchanan ~ edaytotal, weights = 1/edaytotal, data = flvote)
summary(mod.wls)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.06785    8.50723    3.18   0.0022 **
## edaytotal    0.00263    0.00025   10.50  1.2e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.565 on 65 degrees of freedom
## Multiple R-squared: 0.629, Adjusted R-squared: 0.624
## F-statistic: 110 on 1 and 65 DF,  p-value: 1.22e-15
plot(mod, which = 3, lwd = 2, sub = "")
plot(mod.wls, which = 3, lwd = 2, sub = "")

[Figure: scale-location plots for OLS (mod) and WLS (mod.wls) side by side, with observations 3, 9, and 64 flagged in each]
Heteroskedasticity-robust ("sandwich") standard errors:

Under arbitrary heteroskedasticity,

V[u | X] = Σ = diag(σ²_1, σ²_2, …, σ²_n)

and the sampling variance of OLS takes the sandwich form

V[β̂ | X] = (X'X)⁻¹ X'ΣX (X'X)⁻¹

If we have an estimate of Σ, we can plug it in:

V̂[β̂ | X] = (X'X)⁻¹ X'Σ̂X (X'X)⁻¹
▶ White's insight: estimate the diagonal of Σ with the squared residuals û²_i:

Σ̂ = diag(û²_1, û²_2, …, û²_n)

▶ Plug Σ̂ into the sandwich formula to obtain the HC/robust estimator:

V̂[β̂ | X] = (X'X)⁻¹ X'Σ̂X (X'X)⁻¹   (HC0)

▶ A degrees-of-freedom correction gives the HC1 version:

V̂[β̂ | X] = (n/(n − k − 1)) ⋅ (X'X)⁻¹ X'Σ̂X (X'X)⁻¹   (HC1)
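The sandwich estimator is a short computation once the residuals are in hand. A Python/NumPy sketch on assumed synthetic heteroskedastic data (the slides use R's vcovHC for this):

```python
import numpy as np

# HC0: plug Sigma-hat = diag(uhat_i^2) into (X'X)^{-1} X' Sigma X (X'X)^{-1}.
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1, 10, size=n)
y = 1 + 0.5 * x + x * rng.normal(size=n)   # error variance grows with x
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta

meat = X.T @ (X * u[:, None]**2)           # X' diag(u^2) X without forming diag
V_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(V_hc0))

# HC1 rescales the variance by n/(n - k - 1)
k = X.shape[1] - 1
se_hc1 = np.sqrt(n / (n - k - 1)) * se_hc0
print(se_hc0, se_hc1)
```

Note the estimate β̂ itself is untouched; only the standard errors change.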
In R, robust SEs come from the lmtest and sandwich packages:

library(lmtest)
library(sandwich)

coeftest(mod)
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(mod, vcovHC(mod, type = "HC0"))
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   40.61283    1.34   0.1864
## edaytotal    0.00232    0.00087    2.67   0.0096 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(mod, vcovHC(mod, type = "HC1"))
##
## t test of coefficients:
##
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.229453  41.232904    1.32    0.193
## edaytotal    0.002323   0.000884    2.63    0.011 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
WLS vs. robust SEs:

▶ With known weights, WLS is efficient and ŜE[β̂_WLS] is consistent
▶ but the weights usually aren't known
▶ Robust SEs don't change the estimate β̂
▶ Consistent for V[β̂] under any form of heteroskedasticity
▶ Because it relies on consistency, it is a large-sample result, best with large n
▶ For small n, performance might be poor
Nonlinearity: add the absentee Buchanan vote as a second covariate:

mod3 <- lm(edaybuchanan ~ edaytotal + absnbuchanan, data = flvote)
summary(mod3)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  55.19635               0.5969
## edaytotal     0.00110    0.00048    2.29   0.0253 *
## absnbuchanan  6.89546    2.12942    3.24   0.0019 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 317 on 61 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared: 0.536, Adjusted R-squared: 0.521
## F-statistic: 35.2 on 2 and 61 DF,  p-value: 6.71e-11

▶ Added-variable plots check linearity covariate by covariate: plot the residuals of y (given the other covariates) against the residuals of x_j
▶ If linearity holds, the points should scatter around a line with slope β̂_j and 0 intercept
par(mfrow = c(1, 2))
lines(loess.smooth(x = out$edaytotal[, 1], y = out$edaytotal[, 2]),
      col = "dodgerblue", lwd = 2)
lines(loess.smooth(x = out2$absnbuchanan[, 1], y = out2$absnbuchanan[, 2]),
      col = "dodgerblue", lwd = 2)

[Figure: added-variable plots, edaybuchanan | others against edaytotal | others and absnbuchanan | others, with loess smooths overlaid]
▶ Generalized additive models and splines allow the data to tell us what the functional form is.
▶ Complicated math, but important ideas.

Basis functions:

▶ Represent the regression function with transformations h_m(x_i) of x_i included in the model
▶ Examples we've seen: h_m(x_i) = x_i, h_m(x_i) = x_i², h_m(x_i) = log(x_i)
▶ Indicator ("bin") bases capture non-linearity by making the fit piecewise constant: h_1 = 1, h_2 = I(c_1 < x_i < c_2), h_3 = I(x_i > c_2)
[Figure: piecewise-constant (binned) fit to a scatterplot of y on x]
▶ We can let the slope vary within each bin by adding interactions: h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = I(c_1 < x_i < c_2), h_4(x_i) = x_i·I(c_1 < x_i < c_2), h_5(x_i) = I(x_i ≥ c_2), h_6(x_i) = x_i·I(x_i ≥ c_2)

[Figure: piecewise-linear fit with jumps at the bin boundaries]
▶ Linear splines force the fit to be a continuous piecewise-linear function of x_i: h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = (x_i − c_1)₊, h_4(x_i) = (x_i − c_2)₊, where (z)₊ = max(z, 0)
The linear spline regression is

y_i = β₀ + β₁x_i + β₂(x_i − c_1)₊ + β₃(x_i − c_2)₊ + u_i

At the first knot, the fit from the left is β₀ + β₁c_1, and from the right it is β₀ + β₁c_1 + β₂(c_1 − c_1)₊ = β₀ + β₁c_1, so the function doesn't jump — only the slope can change:

▶ β₁ = slope when x_i < c_1
▶ β₁ + β₂ = slope when c_1 < x_i < c_2
▶ β₁ + β₂ + β₃ = slope when x_i > c_2
▶ Function is continuous at cutpoints
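The continuity and segment-slope claims can be verified numerically. A Python/NumPy sketch with assumed synthetic data and knots at -1.5 and 1.5 (matching the R example on these slides):

```python
import numpy as np

# Linear spline basis with knots c1, c2: check continuity at the knots and
# slopes b1, b1+b2, b1+b2+b3 on the three segments.
c1, c2 = -1.5, 1.5
rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + 0.1 * rng.normal(size=300)

pos = lambda z: np.maximum(z, 0.0)                    # (z)_+ truncation
X = np.column_stack([np.ones_like(x), x, pos(x - c1), pos(x - c2)])
b = np.linalg.lstsq(X, y, rcond=None)[0]

def f(z):
    return b[0] + b[1]*z + b[2]*pos(z - c1) + b[3]*pos(z - c2)

# Continuity at the first knot: left and right limits agree
eps = 1e-9
assert abs(f(c1 - eps) - f(c1 + eps)) < 1e-6
# Slopes on each segment, from finite differences within the segment
assert np.isclose((f(-2.0) - f(-2.5)) / 0.5, b[1])
assert np.isclose((f(0.5) - f(0.0)) / 0.5, b[1] + b[2])
assert np.isclose((f(2.5) - f(2.0)) / 0.5, b[1] + b[2] + b[3])
print("spline basis checks pass")
```

The (z)₊ truncation is what makes the slope change while the level stays continuous.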
h2 <- x
h3 <- 1 * (x > -1.5) * (x - -1.5)
h4 <- 1 * (x > 1.5) * (x - 1.5)
reg <- lm(y ~ h2 + h3 + h4)

[Figure: linear spline fit with knots at -1.5 and 1.5]
▶ Piecewise-linear fits are continuous but have kinks — we probably want "smooth" functions.
▶ What does smooth mean? Continuous derivatives!
▶ → use higher-order polynomials in the basis functions

Cubic splines are piecewise cubics with continuous first and second derivatives:

h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = x_i², h_4(x_i) = x_i³, h_5(x_i) = (x_i − c_1)³₊, h_6(x_i) = (x_i − c_2)³₊

▶ The pieces have to connect and be smooth at the knots.
▶ Ensure this by allowing only the coefficient on the cubic term to change at the knot point.
h2 <- x
h3 <- x^2
h4 <- x^3
h5 <- 1 * (x > -1.5) * (x - -1.5)^3
h6 <- 1 * (x > 1.5) * (x - 1.5)^3
reg <- lm(y ~ h2 + h3 + h4 + h5 + h6)

[Figure: cubic spline fit with knots at -1.5 and 1.5]

For comparison, a global cubic polynomial has no knots:

h2 <- x
h3 <- x^2
h4 <- x^3
rr <- lm(y ~ h2 + h3 + h4)

[Figure: global cubic vs. local cubic spline fits on the same data]
▶ More knot points → "rougher" function, less in-sample bias, more variance.
▶ Fewer knot points → "smoother" function, more in-sample bias, less variance.
▶ Get this trade-off wrong and the out-of-sample predictions can be terrible.
▶ There are other ways of representing this trade-off other than knots.
Choosing the number of knots by cross-validation:

▶ Leave out observation i and fit the spline with m knots.
▶ Predict the left-out observation, ŷ^(−i)_m, and calculate the squared prediction error: (y_i − ŷ^(−i)_m)².
▶ Average these errors over the observations to get the MSE with m knots.
▶ Choose the number of knots that has the lowest MSE.
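The four steps above can be sketched directly. A Python/NumPy version with assumed synthetic data, evenly spaced knots, and a linear spline basis (all assumptions for illustration):

```python
import numpy as np

# Leave-one-out CV for choosing the number of knots in a linear spline.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)

def spline_design(x, knots):
    # basis: intercept, x, and one truncated term (x - c)_+ per knot
    cols = [np.ones_like(x), x] + [np.maximum(x - c, 0.0) for c in knots]
    return np.column_stack(cols)

def loo_mse(num_knots):
    knots = np.linspace(-2, 2, num_knots)
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Xk = spline_design(x[keep], knots)
        b = np.linalg.lstsq(Xk, y[keep], rcond=None)[0]
        pred = spline_design(np.array([x[i]]), knots) @ b
        err += (y[i] - pred[0])**2
    return err / n

scores = {m: loo_mse(m) for m in range(1, 6)}
best = min(scores, key=scores.get)
print("best number of knots:", best)
```

This is the brute-force version of the procedure; smooth.spline in R picks its smoothness by a generalized form of the same idea.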
smth <- smooth.spline(x, y)
plot(x, y, ylim = c(-3, 3), pch = 19, col = "grey50", bty = "n")
lines(smth, col = "indianred", lwd = 2)

[Figure: smoothing spline fit overlaid on the scatterplot of y on x]
Generalized additive models (GAMs):

▶ GAMs estimate the spline of any particular variable in the regression.
▶ Each spline is additive: y_i = f_1(x_i1) + f_2(x_i2) + u_i
▶ Very useful for checking nonlinearity of the functional form.
library(mgcv)  ## GAM package
out <- gam(edaybuchanan ~ s(edaytotal) + s(absnbuchanan), data = flvote,
           subset = county != "Palm Beach")
summary(out)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## edaybuchanan ~ s(edaytotal) + s(absnbuchanan)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   221.84       6.41    34.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                  edf Ref.df    F  p-value
## s(edaytotal)    6.85   7.82 10.6  1.6e-09 ***
## s(absnbuchanan) 2.95   3.64 22.6  1.6e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.95  Deviance explained = 95.8%
## GCV = 3129  Scale est. = 2592.3  n = 63

plot(out, shade = TRUE, residual = TRUE, pch = 1)

[Figure: estimated smooths s(edaytotal, 6.85) and s(absnbuchanan, 2.95) with shaded confidence bands and partial residuals]
Wrapping up:

▶ Check your data! summary(), plot(), etc.
▶ Use transformations to make assumptions more plausible
▶ Weaken linearity when you need to.