How to Lie with Statistics March 3, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter
Announcements
Today • Linear Regression Recap/Follow up • P-Hacking, Researcher Degrees of Freedom
Dummy Variables
X =
    20  31  0  1  1
    20   5  0  1  1
    20  40  0  1  1
    25  18  1  0  1
Labels shown on the slide: cholesterol, eucalyptus, meds, yes breakfast, no breakfast, constant
• Why do we have to do this?
• What about the pseudo-inverse?
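The two questions on this slide can be explored with a small NumPy sketch. It assumes synthetic data (the `dose` column, the coefficients, and the noise level are all made up for illustration, not taken from the course data). Because yes breakfast + no breakfast always equals the constant column, the full design matrix is rank deficient, so X^T X cannot be inverted. The pseudo-inverse still returns a solution, but the individual coefficients are no longer uniquely determined; dropping one dummy gives the same fitted values with interpretable coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical predictors: one numeric column and a breakfast indicator.
dose = rng.uniform(0, 40, size=n)
yes = rng.integers(0, 2, size=n).astype(float)  # 1 = ate breakfast
no = 1.0 - yes                                  # 1 = skipped breakfast
const = np.ones(n)

# Hypothetical outcome, invented only so there is something to fit.
y = 30 - 0.2 * dose + 5 * yes + rng.normal(0, 1, size=n)

X_full = np.column_stack([dose, yes, no, const])  # both dummies + constant
X_drop = np.column_stack([dose, yes, const])      # one dummy dropped

# yes + no == const, so the 4-column matrix only has rank 3.
rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

# The pseudo-inverse picks the minimum-norm solution among infinitely many.
beta_pinv = np.linalg.pinv(X_full, rcond=1e-10) @ y
beta_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

# Both parameterizations span the same column space, so predictions agree.
same_fit = np.allclose(X_full @ beta_pinv, X_drop @ beta_drop)

print("rank with both dummies:", rank_full, "of", X_full.shape[1], "columns")
print("rank with one dummy dropped:", rank_drop, "of", X_drop.shape[1], "columns")
print("same fitted values:", same_fit)
```

The design choice here is the usual one: keep the constant and drop one level of each categorical variable, so the remaining dummy coefficient reads as a difference from the dropped level.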
statsmodels
    import statsmodels.api as sm

    y, X = read_data()
    X = sm.add_constant(X)
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())
https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
statsmodels (formula interface)
    import statsmodels.formula.api as smf

    # M has column headers w/ names
    M = read_data()
    # no add_constant call here: the formula interface includes an intercept by default
    eq = "chol ~ eucalyptus + meds + breakfast"
    model = smf.ols(formula=eq, data=M)
    results = model.fit()
    print(results.summary())
statsmodels (formula interface): interaction term
    import statsmodels.formula.api as smf

    # M has column headers w/ names
    M = read_data()
    # eucalyptus:meds adds the interaction between the two columns
    eq = "chol ~ eucalyptus + meds + breakfast + eucalyptus:meds"
    model = smf.ols(formula=eq, data=M)
    results = model.fit()
    print(results.summary())
statsmodels (formula interface): squared terms
    import statsmodels.formula.api as smf

    # M has column headers w/ names
    M = read_data()
    # wrap arithmetic in I(...) so ** means "square"; a bare ^ is not a power operator in the formula
    eq = "chol ~ eucalyptus + meds + breakfast + I(eucalyptus**2)"
    model = smf.ols(formula=eq, data=M)
    results = model.fit()
    print(results.summary())
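To try the formula interface end to end, here is a minimal sketch that swaps `read_data()` for a synthetic DataFrame. The column names match the slides, but the data-generating numbers are assumptions made up for this example. In the formula language, arithmetic such as squaring goes inside `I(...)`, and `a:b` denotes an interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in for read_data(); values are hypothetical.
M = pd.DataFrame({
    "eucalyptus": rng.uniform(0, 40, size=n),
    "meds": rng.integers(0, 2, size=n),
    "breakfast": rng.integers(0, 2, size=n),
})
# Made-up relationship so the regression has something to recover.
M["chol"] = (
    200
    - 0.5 * M["eucalyptus"]
    - 10 * M["meds"]
    + 0.01 * M["eucalyptus"] ** 2
    + rng.normal(0, 5, size=n)
)

# Main effects, one interaction term, and one squared term.
eq = "chol ~ eucalyptus + meds + breakfast + eucalyptus:meds + I(eucalyptus**2)"
results = smf.ols(formula=eq, data=M).fit()

# One coefficient per term, plus the intercept the formula adds automatically.
print(results.params)
```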
statsmodels: reading the output of results.summary()
• overall fit of model (SSE)
• coefficients (i.e., effect sizes)
• p-values
Clicker Question!
Today • Linear Regression Recap/Follow up • P-Hacking, Researcher Degrees of Freedom
You can find almost anything if you look hard enough.
[Figure: Per capita cheese consumption correlates with number of people who died by becoming tangled in their bedsheets, 2000 to 2009, ρ = 0.95. Axes: bedsheet tanglings (deaths) and cheese consumed (lbs) by year. Source: tylervigen.com]
https://en.wikipedia.org/wiki/Data_dredging
http://www.tylervigen.com/spurious-correlations
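A quick simulation makes the point concrete. The sketch below assumes each "dataset" is an independent random walk over 10 yearly points (a stand-in for trending time series, not the actual cheese or bedsheet data) and then searches 1000 unrelated series for the one that correlates best with a target. None of the series has any real relationship to the target, yet the best match found by searching tends to look impressive.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(n, rng):
    """A series whose yearly change is pure Gaussian noise."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


rng = random.Random(0)
n_years = 10

target = random_walk(n_years, rng)
candidates = [random_walk(n_years, rng) for _ in range(1000)]

# "Looking hard enough": keep only the most extreme correlation we can find.
correlations = [pearson(target, c) for c in candidates]
best = max(correlations, key=abs)

print(f"strongest correlation among 1000 unrelated series: {best:.2f}")
```

Reporting only `best` while hiding the other 999 comparisons is the pattern that the data dredging article linked above describes.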
Recommended
More recommendations