Statistically-Significant Correlations, 11 Oct 2014 (2014 NNN4)



2014 NNN4 Statistically-Significant Correlations


Milo Schield Augsburg College Editor of www.StatLit.org

US Rep: International Statistical Literacy Project

Fall 2014 National Numeracy Network Conference

www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf

Statistically-Significant Correlations


Exact Solutions

For N random pairs from an uncorrelated bivariate normal distribution, the sampling distribution of the correlation coefficient r is not simple. Here are three common analytic approaches:

  • 1. Fisher transformation (using LN and Arctanh),
  • 2. an exact solution (using a Gamma function), or
  • 3. Student-t distribution: t = r*Sqrt[(n-2)/(1-r^2)]; df = n-2

  • For large n, the critical value of t (95% confidence) is 1.96.
  • For small n, the critical value of t increases as n decreases.

None of these are simple or memorable.
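To make two of these approaches concrete, here is a minimal Python sketch. It assumes a two-tailed 95% test with the normal critical value 1.96, and it treats Arctanh(r) as approximately normal with standard error 1/Sqrt(n-3), which is the usual Fisher approximation. The function names are illustrative only and do not come from the slides.

```python
import math

Z_95 = 1.96  # two-tailed 95% critical value of the standard normal


def t_statistic(r: float, n: int) -> float:
    """Student-t statistic for a sample correlation r from n pairs (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def min_r_fisher(n: int, z: float = Z_95) -> float:
    """Smallest r whose Fisher-transformed 95% interval just reaches zero.

    Assumes atanh(r) is roughly normal with standard error 1/sqrt(n - 3),
    so the lower end of the interval is zero when atanh(r) = z / sqrt(n - 3).
    """
    if n < 4:
        raise ValueError("the Fisher approximation needs n >= 4")
    return math.tanh(z / math.sqrt(n - 3))


for n in (10, 25, 100):
    r_min = min_r_fisher(n)
    # Print n, the minimum significant r, and the matching t statistic.
    print(n, round(r_min, 3), round(t_statistic(r_min, n), 2))
```

Neither function is hard to code, but neither is something most people can carry in their heads.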


Sufficient Condition

Approach: Find an equation generating a minimum correlation for statistical-significance given N.

  • 1. Given N, find the smallest value of r where the left end of a 95% confidence interval is non-negative. Use the calculator at www.vassarstats.net/rho.html or www.danielsoper.com/statcalc3/calc.aspx?id=44. For the Daniel Soper calculator, use the results for a two-tailed test.
  • 2. Generate the correlation coefficient with a simple model.
  • 3. Calculate the error (the difference between the calculated and exact values), using the exact value as the standard. If all errors are positive, then the model is sufficient.


Simple Model: 2/SQRT(n)

All errors positive means the model is sufficient.

Minimum Correlation for Statistical Significance

N     Exact   2/sqrt(n)   Error
400   0.10    0.10        3.0%
256   0.12    0.13        2.7%
100   0.20    0.20        2.0%
49    0.28    0.29        1.7%
25    0.40    0.40        1.3%
16    0.50    0.50        1.0%
12    0.57    0.58        0.6%
10    0.63    0.63        0.4%
7     0.75    0.76        0.4%
6     0.81    0.82        0.6%
5     0.88    0.89        1.4%
4     0.96    1.00        4.0%
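The sufficiency check can be sketched in Python as follows. This is an illustration under an assumption: the "exact" value is computed here from the Fisher-transformed 95% confidence interval rather than read from the online calculators, so the computed errors may differ slightly from the tabulated ones. The function names are hypothetical.

```python
import math


def exact_min_r(n: int) -> float:
    """Assumed stand-in for the calculator result: Fisher-based minimum r."""
    return math.tanh(1.96 / math.sqrt(n - 3))


def simple_min_r(n: int) -> float:
    """The simple model: 2 / sqrt(n)."""
    return 2 / math.sqrt(n)


def relative_error(n: int) -> float:
    """Error of the simple model, using the exact value as the standard."""
    exact = exact_min_r(n)
    return (simple_min_r(n) - exact) / exact


SAMPLE_SIZES = (400, 256, 100, 49, 25, 16, 12, 10, 7, 6, 5, 4)
errors = {n: relative_error(n) for n in SAMPLE_SIZES}

# The model is sufficient if it never falls below the exact value.
model_is_sufficient = all(e > 0 for e in errors.values())

for n, e in errors.items():
    print(f"{n:4d}  {exact_min_r(n):.2f}  {simple_min_r(n):.2f}  {e:.1%}")
print("sufficient:", model_is_sufficient)
```

A positive error means the simple model asks for a slightly higher correlation than the exact test, so any r that passes 2/sqrt(n) also passes the exact test.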


Solution

Minimum statistically-significant r = 2/Sqrt(n)
“n” is the number of pairs being correlated.
The formula overestimates the exact value by less than 5% for n between 5 and 4,000.
Simple and memorable for two variables.

It is similar to the formula for the maximum 95% Margin of Error in samples from a binary variable:
95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)]
95% ME < 1/Sqrt(n)
Simple and memorable for one binary variable.
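The margin-of-error bound rests on the fact that p*(1-p) is never larger than 1/4, which is reached at p = 0.5. A short Python check, with illustrative function names and an arbitrarily chosen n, shows the bound holding across the whole range of p:

```python
import math


def margin_of_error_95(p: float, n: int) -> float:
    """95% margin of error for a sample proportion p from n observations."""
    return 1.96 * math.sqrt(p * (1 - p) / n)


def simple_bound(n: int) -> float:
    """The memorable upper bound: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


n = 400  # arbitrary sample size chosen for illustration
proportions = [i / 100 for i in range(101)]
largest_me = max(margin_of_error_95(p, n) for p in proportions)

print(round(largest_me, 4), round(simple_bound(n), 4))
```

The largest margin of error occurs at p = 0.5 and still sits just under 1/Sqrt(n), because 1.96 is slightly less than 2.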


Time-Series Correlations

www.tylervigen.com

10 pairs; 2/Sqrt(10) = 0.63; Statistically significant

[Figure: Tangled bed-sheet Deaths vs. Skiing Revenues, by year. Axes: Revenues ($M); Deaths (US). Revenues: Blue line.]

Source: http://tylervigen.com/view_correlation?id=1864 Correlation: 0.969724


20 pairs; 2/Sqrt(20) = 0.45; Statistically significant

Correlation = -0.933 Bee colonies & MJ arrests

[Figure: Bee Colonies vs. Juvenile Marijuana Arrests, 1990-2009. Axes: Arrests; Colonies (K). Series: Honey Bee Colonies; Marijuana Arrests of Juveniles.]

Source: www.tylervigen.com/view_correlation?id=1582 Correlation: -0.933389


11 pairs; 2/Sqrt(11) = 0.60; Statistically significant!

Correlation = 0.666 Drownings & Cage films

[Figure: Pool Drownings vs. Films with Nicolas Cage, by year. Axes: Cage films; Drownings. Drownings: Red line. Cage films: Blue line.]

Source: www.tylervigen.com/view_correlation?id=359 Correlation: 0.666004
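Applying the rule to these time-series examples takes one comparison each. The sketch below uses the pair counts and correlations quoted on the slides; the function name and the labels are illustrative.

```python
import math


def is_significant(r: float, n: int) -> bool:
    """Rule of thumb: |r| must exceed 2 / sqrt(n) for n pairs."""
    return abs(r) > 2 / math.sqrt(n)


# (label, number of pairs, correlation) as quoted on the slides
examples = [
    ("Bed-sheet deaths vs. skiing revenues", 10, 0.969724),
    ("Bee colonies vs. juvenile marijuana arrests", 20, -0.933389),
    ("Pool drownings vs. Cage films", 11, 0.666004),
]

for label, n, r in examples:
    threshold = 2 / math.sqrt(n)
    print(f"{label}: |r| = {abs(r):.3f}, threshold = {threshold:.2f}, "
          f"significant = {is_significant(r, n)}")
```

The absolute value matters for the bee-colony example, where the correlation is strongly negative.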


Something Seems Wrong!

  • 1. There is nothing linear about these associations.
  • 2. These correlations seem unbelievably high.

#1: The correlation between two time-series eliminates the common factor: time. The question is whether their mutual association is linear. To see this, an XY-plot is generated.


[Figure: XY-plot of Drownings vs. Cage films. Correlation = 0.666]
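To see how pairing two time-series drops time from the picture, the sketch below matches two yearly series value by value and computes the Pearson correlation of the resulting XY pairs. The numbers are hypothetical and chosen only for illustration; they are not the drowning or film data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of paired values; the time order is irrelevant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical yearly values for two series (same years, same order).
series_x = [1, 2, 3, 4, 5]
series_y = [2, 4, 5, 4, 5]

r = pearson_r(series_x, series_y)

# Shuffling the years identically in both series leaves r unchanged,
# which is what "eliminates the common factor: time" means here.
order = [3, 0, 4, 1, 2]
r_shuffled = pearson_r([series_x[i] for i in order],
                       [series_y[i] for i in order])
print(round(r, 4), round(r_shuffled, 4))
```

Each (x, y) pair is one point on the XY-plot, and r measures how closely those points follow a straight line.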


#2: Very High Correlations. Three Explanations

  • 1. Association is causal. See Tyler Vigen’s video: www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs
  • 2. Association is spurious: just random chance. Five percent of random associations will be mistakenly classified as statistically significant.
  • 3. Association is cherry-picked after the fact. According to Tyler, “This server has generated 24,470 correlations.” Tyler just picked those with high or interesting correlations.
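The second and third explanations can be illustrated with a small simulation. The sketch below draws pairs of independent normal samples, counts how often |r| exceeds 2/Sqrt(n), and estimates how many of 24,470 unrelated correlations would pass at a 5% rate. The trial count, seed, and function names are arbitrary choices made for this illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def false_positive_rate(n_pairs: int, trials: int, seed: int = 0) -> float:
    """Share of independent random samples whose |r| exceeds 2/sqrt(n)."""
    rng = random.Random(seed)
    threshold = 2 / math.sqrt(n_pairs)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_pairs)]
        ys = [rng.gauss(0, 1) for _ in range(n_pairs)]
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / trials


rate = false_positive_rate(n_pairs=10, trials=2000)

# At a 5% false-positive rate, this many of 24,470 unrelated
# correlations would be expected to look statistically significant.
expected_false_positives = 0.05 * 24470
print(round(rate, 3), expected_false_positives)
```

With well over a thousand chance "hits" expected, picking the most striking ones after the fact says little about any real association.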


Conclusions

  • 1. Use 2/Sqrt(n) as the minimum correlation for statistical significance. This criterion is sufficient, fairly accurate (within 5%) and memorable.
  • 2. The correlation between two time-series eliminates time. Correlation determines the degree of linearity in their cross-sectional association.
  • 3. Do not use a test for statistical significance if the data pairs were selected after the fact, via data mining, solely because of their high correlation.


Correlation = 0.993 Divorce & Margarine Usage

10 pairs; 2/Sqrt(10) = 0.63; Statistically significant

[Figure: Divorce Rate vs. Margarine Consumption, 2000-2009. Series: Divorce Rate in Maine; Margarine Consumption per capita (US).]

Source: www.tylervigen.com/view_correlation?id=1703 Correlation: 0.992558