Analyzing data using Python
Eric Marsden <eric.marsden@risk-engineering.org>

The purpose of computing is insight, not numbers. – Richard Hamming

Where does this fit into risk engineering?
[Diagram: data, curve fitting, probabilistic model, event probabilities, consequence model, event consequences, risks, costs, decision-making criteria; a "These slides" label marks the part of the chain covered here]

Descriptive statistics
▷ Descriptive statistics allow you to summarize information about observations
  • organize and simplify data to help understand it
▷ Inferential statistics use observations (data from a sample) to make inferences about the total population
  • generalize from a sample to a population

[Cartoon on descriptive statistics. Source: dilbert.com]

Measures of central tendency
▷ Central tendency (the “middle” of your data) is measured either by the median or the mean
▷ The median is the point in the distribution where half the values are lower, and half higher
  • it’s the 0.5 quantile
▷ The (arithmetic) mean (also called the average or the mathematical expectation) is the “center of mass” of the distribution
  • continuous case: $\mathbb{E}(X) = \int_a^b x f(x)\,dx$
  • discrete case: $\mathbb{E}(X) = \sum_i x_i P(x_i)$
▷ The mode is the element that occurs most frequently in your data
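The three measures can be compared on a small sample. The sketch below uses a made-up array (not data from the slides) and, for the discrete expectation formula, a fair six-sided die as an assumed example.

```python
import statistics

import numpy

# Hypothetical sample, chosen only for illustration
data = numpy.array([2, 3, 3, 5, 7, 10])

mean = data.mean()                      # arithmetic mean: sum / count
median = numpy.median(data)             # the 0.5 quantile
mode = statistics.mode(data.tolist())   # most frequent element

# Discrete expectation E(X) = sum_i x_i P(x_i) for a fair six-sided die
faces = numpy.arange(1, 7)
probabilities = numpy.full(6, 1 / 6)
expectation = numpy.sum(faces * probabilities)

print(mean, median, mode, expectation)
```

Here the mean (5.0) sits above the median (4.0) because the single large value 10 pulls it upward.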

Illustration: fatigue life of aluminium sheeting
Measurements of fatigue life (thousands of cycles until rupture) of strips of 6061-T6 aluminium sheeting, subjected to loads of 21 000 PSI. Data from Birnbaum and Saunders (1958).

> import numpy
> cycles = numpy.array([370, 1016, 1235, [...] 1560, 1792])
> cycles.mean()
1400.9108910891089
> numpy.mean(cycles)
1400.9108910891089
> numpy.median(cycles)
1416.0

Source: Birnbaum, Z. W. and Saunders, S. C. (1958), A statistical model for life-length of materials, Journal of the American Statistical Association, 53(281)

Aside: sensitivity to outliers
Note: the mean is quite sensitive to outliers, the median much less.
▷ the median is what’s called a robust measure of central tendency

> import numpy
> weights = numpy.random.normal(80, 10, 1000)
> numpy.mean(weights)
79.83294314806949
> numpy.median(weights)
79.69717178759265
> numpy.percentile(weights, 50)   # 50th percentile = 0.5 quantile = median
79.69717178759265
> weights = numpy.append(weights, [10001, 101010])   # outliers
> numpy.mean(weights)   # <-- big change
190.4630171138418
> numpy.median(weights)   # <-- almost unchanged
79.70768232050916

Measures of central tendency
If the distribution of data is symmetrical, then the mean is equal to the median. If the distribution is asymmetric (skewed), the mean is generally pulled further toward the long tail than the median. The degree of asymmetry is measured by skewness (Python: scipy.stats.skew()).
[Figure: two distributions, one with negative skew and one with positive skew]
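As a rough illustration of skewness, the sketch below draws from an exponential distribution, which has a long right tail. The distribution, seed and sample size are assumptions made for this example and are not taken from the slides.

```python
import numpy
import scipy.stats

# Seeded generator so the example is reproducible
rng = numpy.random.default_rng(42)

# Exponential data are right-skewed (long tail toward large values)
sample = rng.exponential(scale=1.0, size=10_000)

skewness = scipy.stats.skew(sample)
sample_mean = sample.mean()
sample_median = numpy.median(sample)

print(f"skewness = {skewness:.2f}")
print(f"mean = {sample_mean:.3f}, median = {sample_median:.3f}")
```

With positive skew, the mean lands above the median, because the long right tail drags the mean upward.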

Measures of variability
▷ Variance measures the dispersion (spread) of observations around the mean
  • $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$
  • continuous case: $\sigma^2 = \int (x - \mu)^2 f(x)\,dx$ where $f(x)$ is the probability density function of $X$
  • discrete case: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$
  • note: if observations are in metres, variance is measured in $m^2$
  • Python: array.var() or numpy.var(array) (these divide by n by default; pass ddof=1 for the n − 1 divisor)
▷ Standard deviation is the square root of the variance
  • it has the same units as the mean
  • Python: array.std() or numpy.std(array)
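One detail worth checking when computing variance in NumPy is the divisor. By default numpy.var and numpy.std divide by n, while the sample variance divides by n − 1, which is selected with ddof=1. The array below is a made-up example.

```python
import numpy

# Hypothetical observations, chosen so the arithmetic is easy to follow
x = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default (ddof=0): sum of squared deviations divided by n
population_variance = numpy.var(x)

# ddof=1: sum of squared deviations divided by n - 1 (sample variance)
sample_variance = numpy.var(x, ddof=1)

# Standard deviation is the square root of the variance
population_std = numpy.std(x)

print(population_variance, sample_variance, population_std)
```

The sum of squared deviations here is 32, so the two variances are 32/8 = 4.0 and 32/7 ≈ 4.57. The gap shrinks as the sample grows.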

Exercise: Simple descriptive statistics
Task: Choose randomly 1000 integers from a uniform distribution between 100 and 200. Calculate the mean, min, max, variance and standard deviation of this sample.

> import numpy
> obs = numpy.random.randint(100, 201, 1000)
> obs.mean()
149.49199999999999
> obs.min()
100
> obs.max()
200
> obs.var()
823.99793599999998
> obs.std()
28.705364237368595

Histograms: plots of variability
Histograms are a sort of bar graph that shows the distribution of data values. The vertical axis displays raw counts or proportions.

To build a histogram:
1 Subdivide the observations into several equal classes or intervals (called “bins”)
2 Count the number of observations in each interval
3 Plot the number of observations in each interval

Note: the width of the bins is important to obtain a “reasonable” histogram, but is subjective.

import matplotlib.pyplot as plt
# our Birnbaum and Saunders failure data
plt.hist(cycles)
plt.xlabel("Cycles until failure")

[Figure: histogram of cycles until failure]
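The first two steps of building a histogram (binning and counting) can be done without plotting, using numpy.histogram. Since the full fatigue dataset is elided in the slides, the sketch below uses synthetic normally distributed data as a stand-in. The mean, spread and sample size are assumptions.

```python
import numpy

# Synthetic stand-in for the fatigue data (hypothetical values)
rng = numpy.random.default_rng(0)
data = rng.normal(1400, 390, size=101)

# Steps 1 and 2: split the range into equal-width bins and count per bin
counts, edges = numpy.histogram(data, bins=10)

# The same data with fewer, wider bins gives a coarser picture
coarse_counts, coarse_edges = numpy.histogram(data, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:7.1f}, {right:7.1f}): {count}")
```

Comparing counts with coarse_counts shows how strongly the choice of bin width shapes the histogram.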

Quartiles
▷ A quartile is the value that marks one of the divisions that breaks a dataset into four equal parts
▷ The first quartile, at the 25th percentile, divides the first ¼ of cases from the latter ¾
▷ The second quartile, the median, divides the dataset in half
▷ The third quartile, the 75th percentile, divides the first ¾ of cases from the latter ¼
▷ The interquartile range (IQR) is the distance between the first and third quartiles
  • 25th percentile and the 75th percentile
[Figure: distribution split into four segments, each holding 25% of observations, with the interquartile range marked]
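Quartiles and the interquartile range can be computed with numpy.percentile. The nine-value array below is a made-up example, small enough to check by hand.

```python
import numpy

# Hypothetical data: the integers 1 to 9
data = numpy.arange(1, 10)

# First, second (median) and third quartiles
q1, q2, q3 = numpy.percentile(data, [25, 50, 75])

# Interquartile range: distance between the first and third quartiles
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

NumPy interpolates linearly between data points by default, so other tools that use a different quantile convention may report slightly different quartiles for small samples.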

Box and whisker plot
A “box and whisker” plot or boxplot shows the spread of the data
▷ the median (horizontal line)
▷ lower and upper quartiles Q1 and Q3 (the box)
▷ upper whisker: last datum < Q3 + 1.5×IQR
▷ the lower whisker: first datum > Q1 - 1.5×IQR
▷ any data beyond the whiskers are typically called outliers

import matplotlib.pyplot as plt
plt.boxplot(cycles)
plt.xlabel("Cycles until failure")

Note that some people plot whiskers differently, to represent the 5th and 95th percentiles for example, or even the min and max values…
[Figure: boxplot of cycles until failure]
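The whisker rule can also be applied numerically, which helps when you want to list the outliers instead of just seeing them on the plot. The sketch below uses a made-up array with one obviously extreme value. The variable names are illustrative only.

```python
import numpy

# Hypothetical data with one extreme value
data = numpy.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = numpy.percentile(data, [25, 75])
iqr = q3 - q1

# Limits beyond which a datum is flagged as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data strictly inside the limits
inside = (data > lower_limit) & (data < upper_limit)
lower_whisker = data[inside].min()
upper_whisker = data[inside].max()
outliers = data[~inside]

print(lower_whisker, upper_whisker, outliers)
```

Here the limits are −3.5 and 14.5, so the whiskers end at 1 and 9 and the value 100 is flagged as an outlier.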

Violin plot
Adds a kernel density estimation to a boxplot

import matplotlib.pyplot as plt
import seaborn as sns
sns.violinplot(cycles, orient="v")
plt.xlabel("Cycles until failure")

[Figure: violin plot of cycles until failure]

Bias and precision
[Figure: panels labelled Precise, Imprecise, Biased, Unbiased]
A good estimator should be unbiased, precise and consistent (converge as sample size increases).
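Bias can be made visible by simulation. The sketch below repeatedly draws small samples from a normal distribution with known variance and averages two variance estimators: one dividing by n (biased) and one dividing by n − 1 (unbiased). The distribution, sample size and number of trials are assumptions chosen for illustration.

```python
import numpy

rng = numpy.random.default_rng(1)

true_variance = 4.0          # population standard deviation is 2.0
n, trials = 5, 20_000

# Each row is one small sample of size n
samples = rng.normal(0.0, 2.0, size=(trials, n))

# Average each estimator over many repeated samples
biased_average = samples.var(axis=1, ddof=0).mean()     # divides by n
unbiased_average = samples.var(axis=1, ddof=1).mean()   # divides by n - 1

print(f"true variance:        {true_variance}")
print(f"average, divisor n:   {biased_average:.3f}")
print(f"average, divisor n-1: {unbiased_average:.3f}")
```

The estimator with divisor n averages close to (n − 1)/n times the true variance (about 3.2 here), while the n − 1 version averages close to 4.0.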

Estimating values
▷ In engineering, providing a point estimate is not enough: we also need to know the associated uncertainty
  • especially for risk engineering!
▷ One option is to report the standard error
  • $\hat{\sigma} / \sqrt{n}$, where $\hat{\sigma}$ is the sample standard deviation (an estimator for the population standard deviation) and $n$ is the size of the sample
  • difficult to interpret without making assumptions about the distribution of the error (often assumed to be normal)
▷ Alternatively, we might report a confidence interval
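The standard error can be computed directly from its definition, or with scipy.stats.sem, which uses the n − 1 divisor for the standard deviation by default. The sample below is a made-up example.

```python
import numpy
import scipy.stats

# Hypothetical sample
sample = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = sample.size

# Sample standard deviation (n - 1 divisor), then divide by sqrt(n)
sigma_hat = sample.std(ddof=1)
standard_error = sigma_hat / numpy.sqrt(n)

# The same quantity from SciPy
standard_error_scipy = scipy.stats.sem(sample)

print(standard_error, standard_error_scipy)
```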

Confidence intervals
▷ A two-sided confidence interval is an interval $[L, U]$ such that C% of the time, the parameter of interest will be included in that interval
  • most commonly, 95% confidence intervals are used
▷ Confidence intervals are used to describe the uncertainty in a point estimate
  • a wider confidence interval means greater uncertainty
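One common way to obtain a two-sided confidence interval for a mean is to use Student's t distribution, which assumes the estimation error is approximately normal. The sketch below does this with scipy.stats.t.interval on synthetic data. The data, seed and sample size are assumptions, and this is only one of several possible methods (bootstrapping is another).

```python
import numpy
import scipy.stats

# Synthetic sample (hypothetical weights)
rng = numpy.random.default_rng(7)
sample = rng.normal(80, 10, size=50)

mean = sample.mean()
sem = scipy.stats.sem(sample)   # standard error of the mean

# 95% two-sided interval [L, U] with n - 1 degrees of freedom
lower, upper = scipy.stats.t.interval(0.95, sample.size - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The interval is symmetric around the sample mean, and it narrows as the sample size grows because the standard error shrinks.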

Interpreting confidence intervals
A 90% confidence interval means that 10% of the time, the parameter of interest will not be included in that interval.
Here, for a two-sided confidence interval.
[Figure: repeated two-sided intervals around sample means m, compared with the population mean]
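The "10% of the time" interpretation can be checked by simulation: draw many samples from a population with a known mean, build a 90% two-sided interval from each, and count how often the interval contains the true mean. The population parameters and number of trials below are assumptions for this sketch, and the intervals use Student's t distribution.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(3)
true_mean, n, trials = 80.0, 30, 5000

# Each row is an independent sample of size n
samples = rng.normal(true_mean, 10.0, size=(trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / numpy.sqrt(n)

# Two-sided 90% interval: 5% of the probability in each tail
t_crit = scipy.stats.t.ppf(0.95, n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true population mean
coverage = numpy.mean((lower <= true_mean) & (true_mean <= upper))
print(f"coverage = {coverage:.3f}")
```

The observed coverage should be close to 0.90, with roughly one interval in ten missing the population mean.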

Interpreting confidence intervals
A 90% confidence interval means that 10% of the time, the parameter of interest will not be included in that interval.
Here, for a one-sided confidence interval.
[Figure: repeated one-sided intervals around sample means m, compared with the population mean]
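For a one-sided interval, the whole 10% miss probability sits on one side, so the bound uses the 0.90 quantile of the reference distribution, where a two-sided interval uses the 0.95 quantile. The sketch below computes an upper bound for a mean using Student's t distribution. The synthetic data and the choice of an upper (not lower) bound are assumptions.

```python
import numpy
import scipy.stats

# Synthetic sample (hypothetical)
rng = numpy.random.default_rng(5)
sample = rng.normal(80.0, 10.0, size=40)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / numpy.sqrt(n)

# One-sided 90% upper bound: all 10% in the upper tail
upper_one_sided = mean + scipy.stats.t.ppf(0.90, n - 1) * sem

# Upper end of the two-sided 90% interval: 5% in each tail
upper_two_sided = mean + scipy.stats.t.ppf(0.95, n - 1) * sem

print(f"one-sided upper bound: {upper_one_sided:.2f}")
print(f"two-sided upper end:   {upper_two_sided:.2f}")
```

At the same confidence level, the one-sided bound lies closer to the point estimate than the corresponding end of the two-sided interval.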

Illustration: fatigue life of aluminium sheeting
Confidence intervals can be displayed graphically on a barplot, as “error lines”.

import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(cycles, ci=95, capsize=0.1)
plt.xlabel("Cycles until failure (95% CI)")

Note however that this graphical presentation is ambiguous, because some authors represent the standard deviation on error bars. The caption should always state what the error bars represent.

Data from Birnbaum and Saunders (1958)
[Figure: bar plot of cycles until failure, with 95% confidence interval]

Statistical inference
Statistical inference means deducing information about a population by examining only a subset of the population (the sample). We use a sample statistic to estimate a population parameter.
[Figure: a sample drawn from a population]
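A minimal sketch of this idea: build a large synthetic "population", draw a sample without replacement, and compare the sample mean (the statistic) with the population mean (the parameter). The population values, sizes and seed are hypothetical.

```python
import numpy

rng = numpy.random.default_rng(11)

# Synthetic population (hypothetical); in practice it is rarely fully observed
population = rng.normal(1400, 390, size=100_000)

# Examine only a subset of the population
sample = rng.choice(population, size=500, replace=False)

population_mean = population.mean()   # population parameter
sample_mean = sample.mean()           # sample statistic estimating it

print(f"population mean = {population_mean:.1f}")
print(f"sample mean     = {sample_mean:.1f}")
```

The two values differ slightly. Quantifying that difference is the job of the standard error and of confidence intervals.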
