Computing for engineering simulation
Data analysis I, II and Experimental Thinking
Jin Yoon
Statistical Consulting Unit
The Australian National University
May 2020
What is "Statistics"?
Definitions
◮ In everyday usage, the term statistics refers to numerical facts or data, e.g., the unemployment rate is 9.2%, or the average smartphone price is $1000
◮ The field or study of statistics is more complex: collecting, summarising, analysing and interpreting data
◮ Both of the above involve the word 'data'
Data
Types of statistical data
In Statistics, data can be considered to be one of two types:
◮ categorical data: generally non-numeric or qualitative, e.g., color, gender, religion, etc.
  ◮ subdivided into two types: nominal and ordinal
◮ numerical data: quantitative and generally measurements, e.g., age, income, height, etc.
  ◮ subdivided into two types: discrete and continuous
Data
Types of statistical data
“Methods for viewing and summarizing data depend on which type of data it is.”
Data
Types of statistical data
Data can also be classified by the number of variables:
◮ univariate data: the dataset consists of a single variable, described with graphical and numeric summaries
◮ bivariate data: there are two variables in the dataset
◮ multivariate data: two or more variables in the dataset
Data
Population vs. sample
◮ 'Data' can refer either to the population or just to a sample selected from that population.
◮ It is very important to distinguish between numerical measures of a population and numerical measures of a sample.
◮ A parameter is a numerical measure of a population; a statistic is a numerical measure of a sample.
Data
Summarising data
◮ Summarise the (sample) data in order to present them in a more meaningful or more easily interpreted form.
◮ Descriptive statistics and graphical summaries: methods for summarising, describing and displaying data
◮ Statistical inference: using descriptive or summary measures of the sample in order to learn about characteristics of the population
Summarising Data
Example: machine breakdowns
The engineer in charge of the maintenance of a machine keeps records of the breakdown causes over a period of a year. Altogether there are 46 breakdowns, of which 9 have “electrical causes”, 24 have “mechanical causes”, and 13 are due to “operator misuse”. How should this data be summarised?
Summarising Data
Example: machine breakdowns
Is this categorical data or numerical data? How many variables? The data actually consists of 46 categorical observations, x1, ..., x46, of a single variable, with each observation taking one of the values {electrical, mechanical, misuse}.
Summarising Data
categorical data - table
◮ A table is the most common way to summarise categorical data

  Breakdown cause   Frequency
  Electrical                9
  Mechanical               24
  Misuse                   13
  Total                    46

◮ Categorical data can also be summarised graphically with barplots, dot charts and pie charts.
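The frequency table for the breakdown example can be reproduced in R. This is a minimal sketch that assumes the counts are typed in by hand; the names `breakdowns`, `total` and `rel_freq` are illustrative and do not come from the slides.

```r
# Counts from the machine breakdown example, entered as a named vector
breakdowns <- c(Electrical = 9, Mechanical = 24, Misuse = 13)

# Total number of breakdowns and the relative frequency of each cause
total <- sum(breakdowns)
rel_freq <- breakdowns / total

# Frequency table with proportions
print(cbind(Frequency = breakdowns, Proportion = round(rel_freq, 3)))

# Barplot with the frequency axis starting at 0
barplot(breakdowns, xlab = "Breakdown cause", ylab = "Frequency",
        ylim = c(0, 25))
```

Keeping the counts in a named vector means the same object feeds both the table and the plot.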
Summarising Data
categorical data - barplot
[Figure: barplot of Frequency (axis 0 to 25) by Breakdown cause: Electrical, Mechanical, Misuse]
Summarising Data
categorical data - misleading plots
When we look at a barplot, we are trying to visualize differences between the data presented, assuming the scale starts at 0. The reader can be deliberately misled by making a graphic non-zero-based. Such misleading barplots appear frequently in the media.
Summarising Data
categorical data - misleading plots
[Figure: two barplots of Web Preference (IE, Firefox, Chrome), in Thousands: one with axis ticks from 10 to 40, the other with axis ticks from 42 to 46]
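The effect of a non-zero-based axis can be tried out in R. The sketch below uses hypothetical counts (the values behind the slide's figure are not given) and draws the same data twice, once with the axis starting at 0 and once with a truncated axis.

```r
# Hypothetical counts (in thousands), chosen only to illustrate the effect
web <- c(IE = 43, Firefox = 45, Chrome = 44)

op <- par(mfrow = c(1, 2))
# Zero-based axis: bar heights are proportional to the values
barplot(web, ylim = c(0, 50), ylab = "Thousands", main = "Axis from 0")
# Truncated axis: xpd = FALSE clips the bars to the plot region
barplot(web, ylim = c(42, 46), xpd = FALSE, ylab = "Thousands",
        main = "Axis from 42")
par(op)

# Ratio of the tallest to the shortest bar: in the data, and as drawn
# on the truncated plot (heights measured from 42)
true_ratio <- max(web) / min(web)
drawn_ratio <- (max(web) - 42) / (min(web) - 42)
c(true = true_ratio, drawn = drawn_ratio)
```

With these made-up numbers the largest value is under 5% bigger than the smallest, yet its bar is drawn three times as tall on the truncated plot.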
Summarising Data
categorical data - others, Pie chart
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A barchart or dotchart is a preferable way of displaying this type of data.
Summarising Data
numerical data
Now consider numerical data. What we want to understand is the distribution of the data:
◮ what is the range of the data?
◮ what is the central tendency?
◮ how spread out are the values?
These questions can be answered graphically or numerically.
Summarising Data
measure of central tendency
A measure of central tendency: a numerical measure that locates the centre of a distribution of measurements or describes a 'typical value'
◮ Most common measures of centre: 3M (Mode, Median, Mean)
◮ They not only simplify the description of the data but also allow different datasets to be compared quantitatively
Summarising Data
measure of central tendency - 3M
The most common measures of centre, 3M:
◮ Mode: the observation in the dataset that occurs most often (i.e., has the highest frequency of occurrence)
◮ Median: the middle number in an ordered dataset (with an even number of observations, the average of the two middle numbers)
◮ Mean: the arithmetic average of all the measurements in the dataset
Summarising Data
measure of central tendency - example
Find the 3M of the following sample: X = {18, 19, 18, 20, 18, 18, 20, 21, 37, 18}
◮ Mode: 18
◮ Median: 18.5
◮ Mean: 20.7
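The three values can be checked in R. Base R has no built-in function for the statistical mode, so this sketch tabulates the data and picks the most frequent value; the names `freq`, `mode_x`, `median_x` and `mean_x` are illustrative.

```r
x <- c(18, 19, 18, 20, 18, 18, 20, 21, 37, 18)

# Mode: tabulate the values and take the one with the highest count
# (which.max returns the first maximum if several values tie)
freq <- table(x)
mode_x <- as.numeric(names(freq)[which.max(freq)])

# Median: with 10 observations, the average of the 5th and 6th ordered values
median_x <- median(x)

# Mean: arithmetic average
mean_x <- mean(x)

c(mode = mode_x, median = median_x, mean = mean_x)
```

The single large observation, 37, pulls the mean above the median, while the mode and median are unaffected by how large that value is.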
Summarising Data
measures of variability - how spread out
A measure of variability: a single value that measures the internal variation of the data, i.e., how much the data items vary from one another or from a central point.
◮ Three of the more commonly used measures of variability: Range, Variance, Standard deviation
Summarising Data
measure of variability
◮ Range: the difference between the largest and the smallest values in the data (the simplest one)
◮ Variance: a single value obtained by summing the squares of the deviations from the mean and dividing this sum by (n − 1), where n is the sample size
◮ Standard deviation: the square root of the variance
Summarising Data
measure of variability - variance and standard deviation
How to calculate the variance of the sample data x with sample size n:
◮ First, calculate the sample mean, x̄
◮ Then calculate the deviation from the mean (the residual), x − x̄, for each observation and square it: (x − x̄)²
◮ Finally, sum the squared residuals and divide by (n − 1)
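The steps above correspond to the usual formulas for the sample mean, sample variance and sample standard deviation. The symbols s² and s are conventional notation assumed here; the slides do not name them.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```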
Summarising Data
measures of variability - variance and standard deviation
Consider the following sample of data: x = {10, 12, 15, 17, 21}
◮ The sample mean is x̄ = (10 + 12 + ··· + 21)/5 = 75/5 = 15
◮ Deviations and squared deviations:

                                         Sum
  x            10   12   15   17   21     75
  x − x̄        −5   −3    0    2    6      0
  (x − x̄)²     25    9    0    4   36     74

◮ The variance of x is 74/(5 − 1) = 18.5 and the standard deviation is √18.5 = 4.30
Summarising Data
how to get the mean and variance (standard deviation) in R
> x = c(10,12,15,17,21)   # input data as a vector
> xbar = mean(x)          # calculate the mean directly
> x - xbar
[1] -5 -3  0  2  6
> (x-xbar)^2
[1] 25  9  0  4 36
> sum((x-xbar)^2)
[1] 74
> sum((x-xbar)^2) / (length(x) - 1)
[1] 18.5
> var(x)                  # calculate the variance directly
[1] 18.5
> sd(x)                   # standard deviation: square root of the variance
[1] 4.301163
Summarising Data
Five number summary
The five number summary gives a rough idea of what the dataset looks like and consists of 5 values:
◮ The minimum (min) and the maximum (max)
◮ The first quartile (25%), the median (50%), and the third quartile (75%)
Summarising Data
Five number summary
> x = c(10,12,15,17,21)
> length(x)
[1] 5
> sort(x)[3]   # the median the hard way: the middle of 5 ordered values
[1] 15
> median(x)    # easy way
[1] 15
> quantile(x, c(0.25,0.5,0.75))
25% 50% 75%
 12  15  17
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     10      12      15      15      17      21
Graphical summaries of the data
categorical data and numerical data
In general, a table of numbers is not very informative, whereas a picture or graphical representation can be quite informative:
◮ Categorical (or Qualitative) data: pie charts, bar charts and dotplots make it easy to grasp the distribution of the data quickly
◮ Numerical (or Quantitative) data: boxplot, histogram and density curve
Graphical summaries of the data
numerical (quantitative) data
Numerical data are summarized graphically with a histogram, boxplot, or density curve.
◮ Histogram: constructed by binning the data and counting the number of observations in each bin
◮ Density plot: can be thought of as a smoothed histogram
◮ Boxplot: a visualization of the five number summary (shown above)
Graphical summaries of the data
numerical data - example: histogram and density plot in R
A random sample of 50 milk containers is selected and their milk contents are weighed. The results are shown below:
1.958 1.951 2.107 2.092 1.955 2.162 2.168 2.134 1.971 2.072 2.049 2.017 2.117 1.977 2.034 2.062 2.110 1.974 1.992 2.018 2.135 2.107 2.084 2.169 2.085 2.018 1.977 2.116 1.988 2.066 2.126 2.167 1.969 2.198 2.078 2.119 2.088 2.172 2.133 2.112 2.066 2.128 2.142 2.042 2.050 2.102 2.000 2.188 1.960 2.128
Graphical summaries of the data
numerical data - example: histogram and density plot in R
[Figure: "Histogram of milkdata", shown twice over the range 1.95 to 2.20: once with Frequency on the vertical axis and once with Density on the vertical axis]
Graphical summaries of the data
numerical data - example: histogram and density plot in R
A histogram and density plot show
◮ Overall pattern: where is the centre of the data, what is its spread, what is the shape of the spread
◮ Unusual differences: a point lying away from the main part of the pattern is an outlier
Graphical summaries of the data
numerical data - example: histogram and density plot, R code
> milkdata = scan()
1: 1.958 1.951 2.107 2.092 1.955 2.162 2.168 2.134 1.971
10: 2.072 2.049 2.017 2.117 1.977 2.034 2.062 2.110 1.974
19: 1.992 2.018 2.135 2.107 2.084 2.169 2.085 2.018 1.977
28: 2.116 1.988 2.066 2.126 2.167 1.969 2.198 2.078 2.119
37: 2.088 2.172 2.133 2.112 2.066 2.128 2.142 2.042 2.050
46: 2.102 2.000 2.188 1.960 2.128
51:
Read 50 items
>
> hist(milkdata)
> hist(milkdata, breaks = 10, probability = TRUE)
> lines(density(milkdata), col="red")
Graphical summaries of the data
numerical data - example: boxplot in R
A boxplot is a schematic presentation of the sample median, the upper and lower quartiles, and the largest and smallest data observations. That is, it is a graphical representation of the five number summary!
Graphical summaries of the data
numerical data - example: boxplot in R
The inter-quartile range (IQR): the central box spans the quartiles, from Q1 to Q3, so IQR = Q3 − Q1. If the median is near the centre of the box, the distribution is reasonably symmetric. The whiskers extend at most to Q3 + 1.5 × IQR (upper) and Q1 − 1.5 × IQR (lower): values outside these limits are potential outliers.
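The 1.5 × IQR rule can be applied by hand in R. This sketch reuses the sample from the central tendency example, where 37 lies far from the other values; the names `q`, `iqr`, `lower`, `upper` and `outliers` are illustrative. It uses `quantile()` with R's default method, whereas `boxplot()` computes its hinges with `fivenum()`, so the limits drawn on a boxplot can differ slightly for some datasets.

```r
# Sample from the central tendency example
x <- c(18, 19, 18, 20, 18, 18, 20, 21, 37, 18)

# First and third quartiles (R's default quantile method)
q <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Limits beyond which an observation is flagged
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr

# Observations outside the limits are potential outliers
outliers <- x[x < lower | x > upper]
outliers
```

Flagged values are only candidates for a closer look, not automatically errors.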
Graphical summaries of the data
numerical data - example: boxplot in R
[Figure: two boxplots titled "MILK WEIGHT IN LITRE"; axis: Weight in litre, 1.95 to 2.20]
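The slides show the boxplot but not the call that produced it. A plausible way to draw it is sketched below; the `horizontal`, `main` and `xlab` arguments are guesses based on the figure labels, and the data are retyped from the milk container example.

```r
# Milk contents of the 50 sampled containers
milkdata <- c(
  1.958, 1.951, 2.107, 2.092, 1.955, 2.162, 2.168, 2.134, 1.971, 2.072,
  2.049, 2.017, 2.117, 1.977, 2.034, 2.062, 2.110, 1.974, 1.992, 2.018,
  2.135, 2.107, 2.084, 2.169, 2.085, 2.018, 1.977, 2.116, 1.988, 2.066,
  2.126, 2.167, 1.969, 2.198, 2.078, 2.119, 2.088, 2.172, 2.133, 2.112,
  2.066, 2.128, 2.142, 2.042, 2.050, 2.102, 2.000, 2.188, 1.960, 2.128
)

# Boxplot drawn horizontally so the axis reads like the histogram's x-axis
boxplot(milkdata, horizontal = TRUE, main = "MILK WEIGHT IN LITRE",
        xlab = "Weight in litre")

# The five numbers behind the plot: lower whisker end, lower hinge,
# median, upper hinge, upper whisker end
bp <- boxplot.stats(milkdata)
bp$stats
```

`boxplot.stats()` also returns any points beyond the whiskers in `bp$out`, which is a quick way to list potential outliers without reading them off the plot.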
Graphical summaries of the data
Pros and cons of histograms and boxplots

              Advantage               Disadvantage
  Histogram   Common                  Information lost due to grouping
              Can handle large data   Effect of bin width
  Boxplots    Easy                    Can't see multimodality
              Shape and outliers      Don't know the sample size
More than one dataset
Example 1
Nerve conductivity speeds
A neurologist is investigating how diseases of the peripheral nerves in humans influence the conductivity speed of the nervous system. The conductivity speed of nerves is determined by administering an electric shock to a patient's leg and measuring the time it takes to flex a muscle in the patient's foot. Nerve conductivity speed measurements are made on n = 32 healthy patients and on m = 27 patients who are known to have a peripheral nerve disorder.
Example 2
Drug allergies
Three drugs (A, B, C) are compared with respect to the types of allergic reaction that they cause to patients. A group of n = 300 patients is randomly split into three groups of 100 patients, each of which is given one of the three drugs. The patients are then categorized as being hyperallergic, allergic, mildly allergic, or as having no allergy.
Example 3
Heart rate reduction
A new drug for inducing a temporary reduction in a patient's heart rate is to be compared with a standard drug. The drugs are to be administered to a patient at rest, and the percentage reduction in the heart rate is to be measured after five minutes. The comparison between the two drugs is based on the differences for each patient in the percentage heart rate reductions achieved by the two drugs.
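Because each patient receives both drugs, the analysis works on one difference per patient. The slides give no data for this example and do not say which test is used, so the sketch below uses hypothetical numbers and a paired t-test as one common choice; `new_drug`, `standard` and `d` are illustrative names.

```r
# Hypothetical percentage heart rate reductions for 8 patients (not real data)
new_drug <- c(28.5, 26.6, 28.6, 22.1, 32.4, 33.2, 32.9, 27.1)
standard <- c(20.3, 25.9, 25.3, 20.0, 30.6, 28.9, 29.4, 26.2)

# One difference per patient: this is the quantity the comparison is based on
d <- new_drug - standard
mean(d)

# Paired t-test on the two columns ...
paired <- t.test(new_drug, standard, paired = TRUE)

# ... is the same as a one-sample t-test on the differences
one_sample <- t.test(d)

c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

Working on the differences removes the patient-to-patient variation in heart rate, which is the reason for measuring both drugs on the same patients.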
Bivariate data
two categorical data
Bivariate, categorical data is often presented in the form of a contingency table
◮ Count the occurrences of each possible pair of levels and place the frequencies in each cell
◮ Can focus on the relationships by comparing the rows or columns.
⇒ Later, statistical tests will be developed to determine whether there is any association between variables.
Bivariate data
two categorical data - example 2: drug allergies
           Hyperallergic   Allergic   Mildly allergic   No allergy   Sum
  Drug A              11         30                36           23   100
  Drug B               8         31                25           36   100
  Drug C              13         28                28           31   100
  Sum                 32         89                89           90   300
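The contingency table can be entered in R as a matrix and summarised with base functions. This is a sketch under the assumption that the counts are typed in by hand; the names `allergy` and `row_prop` are illustrative.

```r
# Drug allergy counts: one row per drug, one column per reaction type
allergy <- matrix(c(11, 30, 36, 23,
                     8, 31, 25, 36,
                    13, 28, 28, 31),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Drug = c("A", "B", "C"),
                                  Reaction = c("Hyperallergic", "Allergic",
                                               "Mildly allergic", "No allergy")))

# Table with row and column sums added
addmargins(allergy)

# Distribution of reaction types within each drug (each row sums to 1)
row_prop <- prop.table(allergy, margin = 1)
round(row_prop, 2)

# Grouped barplot: one group of bars per reaction type, one bar per drug
barplot(allergy, beside = TRUE, legend.text = TRUE, ylab = "Frequency")
```

Comparing the rows of `row_prop` is a direct way to see whether the drugs differ in the reactions they cause.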
Bivariate data
two categorical data - example 2: drug allergies
Does the table show that the three drugs are any different with respect to the types of allergic reaction that they cause? The distributions of reaction type within each drug (the three rows) look broadly similar.
Bivariate data
graphical summaries of two-way contingency table
Barplots can be used effectively to show the data in a two-way table.
[Figure: barplot of the two-way table by reaction type (Hyperallergic, Allergic, Mildly allergic, No allergy); count axis with ticks from 20 to 80]
Bivariate data
two numerical data - independent samples
In many situations we have two samples that may or may not come from the same population. When two samples are drawn from populations in such a manner that knowing the outcomes of one sample doesn't affect our knowledge of the distribution of the other sample, they are independent samples.
Bivariate data
two numerical data - independent samples
The stem-and-leaf plot and boxplot are very effective at summarizing a distribution.
◮ when the dataset is small: the stem-and-leaf plot
◮ for larger datasets: the boxplot
◮ comparisons of the two samples: putting them side by side (back to back)
Bivariate data
two numerical data - independent samples
We want to compare the means of two populations by selecting a random sample from each of the two populations: Sample 1 from Population 1 and Sample 2 from Population 2.
Bivariate data
two numerical data - matched or dependent samples
[Diagram: one sample measured Before (or Pre) and After (or Post)]
Bivariate data
two numerical data - independent samples: t-test
If the populations are normally distributed or nearly so, and we want to compare the mean of one population with the mean of another population, then a t-test can be used (cf. the nonparametric Wilcoxon test).
Bivariate data
Understanding of t-test
[Figure: horizontal axis from −2 to 6, vertical axis from 0.2 to 0.6]
Bivariate data
Understanding of t-test
[Figure: horizontal axis from −2 to 3, vertical axis from 0.2 to 0.6]
Bivariate data
two numerical data - independent samples: nerve conductivity speeds
Healthy patients data:
52.20 53.81 53.68 54.47 54.65 52.43 54.43 54.06 52.85 54.12 54.17 55.09 53.91 52.95 54.41 54.14 55.12 53.35 54.40 53.49 52.52 54.39 55.14 54.64 53.05 54.31 55.90 52.23 54.90 55.64 54.48 52.89
Nerve disorder patients data:
50.68 47.49 51.47 48.47 52.50 48.55 45.96 50.40 45.07 48.21 50.06 50.63 44.99 47.22 48.71 49.64 47.09 48.73 45.08 45.73 44.86 50.18 52.65 48.50 47.93 47.25 53.98
Bivariate data
two numerical data - independent samples: nerve conductivity speeds
[Figure: "Density plots" of the two samples; horizontal axis from 40 to 60 (N = 32, Bandwidth = 0.4181), vertical axis Density from 0.0 to 0.4]
Bivariate data
Result
> plot(density(healthpatient), xlim=c(40,60), main="Density plots")
> lines(density(nervedis), lty=2, col="red")
> t.test(healthpatient, nervedis)

	Welch Two Sample t-test

data:  healthpatient and nervedis
t = 10.608, df = 32.684, p-value = 4.064e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.364492 6.436850
sample estimates:
mean of x mean of y
 53.99437  48.59370

>
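To see where the numbers in the `t.test()` output come from, the Welch statistic and its degrees of freedom can be computed by hand. This sketch retypes the two samples from the nerve conductivity slides; the names `v1`, `v2`, `t_stat`, `df_welch` and `p_value` are illustrative.

```r
healthpatient <- c(
  52.20, 53.81, 53.68, 54.47, 54.65, 52.43, 54.43, 54.06, 52.85, 54.12,
  54.17, 55.09, 53.91, 52.95, 54.41, 54.14, 55.12, 53.35, 54.40, 53.49,
  52.52, 54.39, 55.14, 54.64, 53.05, 54.31, 55.90, 52.23, 54.90, 55.64,
  54.48, 52.89
)
nervedis <- c(
  50.68, 47.49, 51.47, 48.47, 52.50, 48.55, 45.96, 50.40, 45.07, 48.21,
  50.06, 50.63, 44.99, 47.22, 48.71, 49.64, 47.09, 48.73, 45.08, 45.73,
  44.86, 50.18, 52.65, 48.50, 47.93, 47.25, 53.98
)

n1 <- length(healthpatient)
n2 <- length(nervedis)

# Estimated variance of each sample mean
v1 <- var(healthpatient) / n1
v2 <- var(nervedis) / n2

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(healthpatient) - mean(nervedis)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

# Compare with the built-in test
res <- t.test(healthpatient, nervedis)
c(t = t_stat, df = df_welch, p = p_value)
```

The non-integer degrees of freedom arise because the Welch test does not assume equal variances in the two populations, which is the default behaviour of `t.test()`.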
Practice
Hands-on practice