Modelling correlations with Python and SciPy
Eric Marsden
<eric.marsden@risk-engineering.org>
Process safety engineer: To what extent does increased process temperature and pressure increase the level of corrosion of my
▷ Analysis of causal effects is an important activity in risk analysis
▷ The simplest statistical technique for analyzing causal effects is
▷ Correlation analysis measures the extent to which two variables vary
▷ Linear correlation coefficient: a measure of the strength and direction
▷ $\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$
▷ Python: scipy.stats.pearsonr(X, Y)
▷ Excel / Google Docs spreadsheet: function CORREL
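The definition of the linear correlation coefficient (covariance divided by the product of the standard deviations) can be checked against scipy.stats.pearsonr. The following is a minimal sketch on synthetic data; the seed, sample size and noise level are arbitrary assumptions made for illustration.

```python
import numpy
import scipy.stats

# Synthetic data: the seed and the parameters are arbitrary choices
rng = numpy.random.default_rng(42)
X = rng.normal(10, 1, 1000)
Y = X + rng.normal(0, 0.3, 1000)

# Covariance of X and Y divided by the product of their standard deviations.
# ddof=1 is used throughout so that the sample covariance and the sample
# standard deviations use the same normalization.
cov_xy = numpy.cov(X, Y, ddof=1)[0, 1]
rho_manual = cov_xy / (numpy.std(X, ddof=1) * numpy.std(Y, ddof=1))

# The same quantity as computed by SciPy (which also returns a p-value)
rho_scipy, p_value = scipy.stats.pearsonr(X, Y)

print(rho_manual, rho_scipy)
```

The two values should agree up to floating-point rounding.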
▷ sign of ρ determines the relative directions that the variables move in
▷ value determines strength of the relative movements (ranging from -1
▷ ρ = 0.5: one variable moves in the same direction by half the amount that
▷ ρ = 0: variables are uncorrelated
Correlation ≠ dependency
Image source: Wikipedia
▷ Example: Yerkes–Dodson law
▷ Performance initially increases with
▷ Beyond a certain level of stress, performance
Source: wikipedia.org/wiki/Yerkes–Dodson_law
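An inverted-U relationship of this kind shows why a linear correlation coefficient near zero does not imply independence. The sketch below uses a purely illustrative quadratic curve (not fitted to any real data; the variable names and the peak position are assumptions) in which performance is fully determined by stress, yet the Pearson coefficient is essentially zero.

```python
import numpy
import scipy.stats

# Hypothetical stress levels, evenly spaced and symmetric around the peak at 5
stress = numpy.linspace(0, 10, 101)

# Illustrative inverted-U curve: performance rises, peaks at stress = 5, then falls
performance = 25 - (stress - 5) ** 2

# Performance depends entirely on stress, but not linearly
rho, _ = scipy.stats.pearsonr(stress, performance)
print(rho)
```

The rising and falling halves of the curve cancel out in the covariance, so the coefficient does not detect the dependence.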
In [3]: import numpy
        import matplotlib.pyplot as plt
        import scipy.stats

In [4]: X = numpy.random.normal(10, 1, 100)
        Y = X + numpy.random.normal(0, 0.3, 100)
        plt.scatter(X, Y)
Out[4]: <matplotlib.collections.PathCollection at 0x7f7443e3c438>

In [5]: scipy.stats.pearsonr(X, Y)
Out[5]: (0.9560266103379802, 5.2241043747083435e-54)
Exercise: show that when the error in Y decreases, the correlation coefficient increases
Exercise: produce data and a plot with a negative correlation coefficient
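One possible approach to both exercises is sketched below. It is not the author's solution: the seed, sample size and noise levels are assumptions chosen for illustration.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)  # arbitrary seed, for repeatability
X = rng.normal(10, 1, 1000)

# Exercise 1: shrink the standard deviation of the noise added to Y
# and record the correlation coefficient each time
correlations = []
for noise_sd in [1.0, 0.5, 0.3, 0.1]:
    Y = X + rng.normal(0, noise_sd, 1000)
    rho, _ = scipy.stats.pearsonr(X, Y)
    correlations.append(rho)
    print(f"noise sd = {noise_sd}: rho = {rho:.3f}")

# Exercise 2: Y decreases as X increases, so the coefficient is negative
Y_neg = -X + rng.normal(0, 0.3, 1000)
rho_neg, _ = scipy.stats.pearsonr(X, Y_neg)
print(f"negative example: rho = {rho_neg:.3f}")
```

Calling plt.scatter(X, Y_neg) followed by plt.show(), with matplotlib.pyplot imported as plt, displays the downward-sloping cloud of points.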
[Figure: four scatterplots, panels I, II, III and IV]
Each dataset has the same correlation coefficient!
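The four panels appear to correspond to Anscombe's quartet; that identification is an assumption, since the figure itself is not available here. Assuming so, the sketch below hard-codes the published quartet values and computes the correlation coefficient of each dataset.

```python
import numpy

# Anscombe's quartet (assumed to be the data shown in panels I to IV)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation coefficient of each dataset
coefficients = {}
for name, (x, y) in quartet.items():
    coefficients[name] = numpy.corrcoef(x, y)[0, 1]
    print(f"dataset {name}: rho = {coefficients[name]:.3f}")
```

The coefficients are nearly identical even though the scatterplots look very different, which is why plotting the data matters.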
▷ Scatterplot: use function plt.scatter
▷ Continuous plot or X-Y: function plt.plot

import matplotlib.pyplot as plt
import numpy

X = numpy.random.uniform(0, 10, 100)
Y = X + numpy.random.uniform(0, 2, 100)
plt.scatter(X, Y, alpha=0.5)
plt.show()
▷ A correlation matrix is used to investigate the dependence
▷ In Python, can use:
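As one possible illustration, a correlation matrix can be computed with numpy.corrcoef. The three variables below are synthetic and hypothetical, and this choice of function is an assumption rather than a reproduction of the options listed on the original slide.

```python
import numpy

rng = numpy.random.default_rng(1)  # arbitrary seed

# Three hypothetical variables: b depends on a, c is independent of both
a = rng.normal(0, 1, 500)
b = 2 * a + rng.normal(0, 0.5, 500)
c = rng.normal(0, 1, 500)

# Each row passed to corrcoef is treated as one variable;
# entry [i, j] is the correlation between variable i and variable j
cm = numpy.corrcoef([a, b, c])
print(numpy.round(cm, 2))
```

The matrix is symmetric with ones on the diagonal; here the (a, b) entry is close to 1 and the entries involving c are close to 0.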
[Figure: correlation heatmap of the casualty data, colour scale from −0.8 to 0.8. Rows and columns: Vehicle_Reference, Casualty_Reference, Casualty_Class, Sex_of_Casualty, Age_of_Casualty, Age_Band_of_Casualty, Casualty_Severity, Pedestrian_Location, Pedestrian_Movement, Car_Passenger, Bus_or_Coach_Passenger, Pedestrian_Road_Maintenance_Worker, Casualty_Type, Casualty_Home_Area_Type]
from pandas import read_csv
import matplotlib.pyplot as plt
import seaborn as sns

data = read_csv("casualties.csv")
cm = data.corr()
sns.heatmap(cm, square=True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
Data source: UK Department for Transport, data.gov.uk/dataset/road-accidents-safety-data
▷ Polio: an infectious disease causing paralysis, which primarily
▷ Largely eliminated today, but was once a worldwide concern
▷ Late 1940s: public health experts in the USA noticed that the
▷ Some suspected that ice cream caused polio… sales plummeted
▷ Polio incidence increases in hot summer weather
▷ Correlation is not causation: there may be a hidden, underlying
More info: Freakonomics, Steven Levitt and Stephen J. Dubner
▷ Statistical fact: the larger the number of fire-fighters attending
▷ More fire-fighters are sent to larger fires
▷ Larger fires lead to more damage
▷ Lurking (underlying) variable = fire size
▷ An instance of “Simpson’s paradox”
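The effect of the lurking variable can be reproduced with a small simulation. The sketch below assumes, purely for illustration, that fire size linearly drives both the number of fire-fighters and the damage, with no direct link between the two; all coefficients are hypothetical. It then correlates what is left of each variable after the linear effect of fire size is removed.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(2)  # arbitrary seed
n = 2000

# Hypothetical model: fire size drives both observed variables
fire_size = rng.uniform(1, 10, n)
firefighters = 3 * fire_size + rng.normal(0, 2, n)
damage = 10 * fire_size + rng.normal(0, 5, n)

# Raw correlation between fire-fighters and damage is strong
rho_raw, _ = scipy.stats.pearsonr(firefighters, damage)


def residuals(y, x):
    """Return y minus its least-squares straight-line fit on x."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation once the linear effect of fire size is removed from both variables
rho_partial, _ = scipy.stats.pearsonr(
    residuals(firefighters, fire_size), residuals(damage, fire_size)
)

print(f"raw: {rho_raw:.3f}, controlling for fire size: {rho_partial:.3f}")
```

Under these assumptions the raw coefficient is high while the coefficient controlling for fire size is close to zero, which is the signature of a lurking variable.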
▷ Statistical fact: low birth-weight children born to smoking mothers have
▷ In a given population, low birth weight babies have a significantly higher
▷ Babies of mothers who smoke are more likely to be of low birth weight
▷ Babies underweight because of smoking still have a lower mortality rate
▷ Lurking variable between smoking, birth weight and infant mortality
Source: Wilcox, A. (2001). On the importance — and the unimportance — of birthweight, International Journal of Epidemiology. 30:1233–1241
▷ In early 2004, the governor of the US state of Illinois R. Blagojevich
▷ Data underlying the plan: children in households where there are more
▷ Later studies showed that children from homes with many books did
▷ Lurking variable: homes where parents buy books have an environment
Source: freakonomics.com/2008/12/10/the-blagojevich-upside/
Source: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012, doi: 10.1056/NEJMon1211064
Source: tylervigen.com, with many more surprising correlations
[Diagram: smoking, lung cancer, hidden factor?]
In logic, this is called the “post hoc ergo propter hoc” fallacy
▷ To demonstrate the causality, you need a randomized controlled
▷ Assume we have the power to force people to smoke or not smoke
▷ Take a large group of people and divide them into two groups
▷ Observe whether smoker group develops more lung cancer than the
▷ We have eliminated any possible hidden factor causing both smoking and
▷ More information: read about design of experiments
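The role of random assignment can be illustrated with a simulation. In the sketch below every probability is a hypothetical assumption: a hidden factor raises the risk of the outcome and, in the observational setting, also makes people more likely to smoke. Randomly assigning the groups breaks the link between the hidden factor and smoking, so the difference in outcome rates between groups reflects only the assumed direct effect.

```python
import numpy

rng = numpy.random.default_rng(3)  # arbitrary seed
n = 200_000

# Hypothetical hidden factor, present in half of the population
hidden = rng.random(n) < 0.5


def outcome(smokes):
    """Simulate disease: assumed baseline 0.05, +0.10 if smoking, +0.15 if hidden factor."""
    p = 0.05 + 0.10 * smokes + 0.15 * hidden
    return rng.random(n) < p


def risk_difference(smokes, diseased):
    """Disease rate among smokers minus disease rate among non-smokers."""
    return diseased[smokes].mean() - diseased[~smokes].mean()


# Observational data: the hidden factor also influences who smokes
smokes_obs = rng.random(n) < numpy.where(hidden, 0.8, 0.2)
diff_obs = risk_difference(smokes_obs, outcome(smokes_obs))

# Randomized trial: group membership decided by a coin flip
smokes_rct = rng.random(n) < 0.5
diff_rct = risk_difference(smokes_rct, outcome(smokes_rct))

print(f"observational: {diff_obs:.3f}, randomized: {diff_rct:.3f}")
```

Under these assumptions the observational difference overstates the direct effect of 0.10 built into the model, while the randomized difference lands close to it.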
▷ Causality is an important — and complex — notion in risk analysis and
▷ Conservative approach used mostly in the physical sciences requires
▷ Relaxed approach used in the social sciences requires
▷ Alternative relaxed approach: a quasi-experimental “natural experiment”
▷ Natural experiment: an empirical study in which allocation
▷ Example: in testing whether military service subsequently affected
▷ Example: cholera outbreak in London in 1854 led to 616 deaths
▷ Medical doctor J. Snow discovered a strong association between
▷ Cholera outbreak stopped when the “bad” pumps were shut
Source: xkcd.com/552/ (CC BY-NC licence)
Source: xkcd.com/925/ (CC BY-NC licence)
▷ slides on linear regression modelling using Python, the simplest
▷ slides on copula and multivariate dependencies for risk models, a
▷ Eye (slide 21): Flood G. via flic.kr/p/aNpvLT, CC BY-NC-ND licence
▷ Map of cholera outbreaks (slide 23) by John Snow (1854) from Wikipedia
For more free content on risk engineering, visit risk-engineering.org
▷ Analysis of the “pay for performance” (correlation between a CEO’s pay
▷ Python notebook on a more sophisticated Bayesian approach to
Was some of the content unclear? Which parts were most useful to you? Your comments to feedback@risk-engineering.org (email) or @LearnRiskEng (Twitter) will help us to improve these
@LearnRiskEng fb.me/RiskEngineering
This presentation is distributed under the terms of the Creative Commons Attribution – Share Alike licence