Computing for engineering simulation Data analysis I, II and - - PowerPoint PPT Presentation




SLIDE 1

Computing for engineering simulation Data analysis I, II and Experimental Thinking

Jin Yoon Statistical Consulting Unit The Australian National University May 2020

SLIDE 2

What is “Statistics”?

Definitions

◮ In everyday usage, the term statistics refers to numerical facts or data, e.g., an unemployment rate of 9.2% or an average smartphone price of $1000.
◮ The field or study of statistics is more complex: collecting, summarising, analysing and interpreting data.
◮ Both senses rely on the word ’data’.

SLIDE 3

Data

Types of statistical data

In Statistics, data can be considered to be one of two types:
◮ categorical data: generally non-numeric or qualitative, e.g., color, gender, religion, etc.
  ◮ subdivided into two types: nominal and ordinal
◮ numerical data: quantitative and generally measurements, e.g., age, income, height, etc.
  ◮ subdivided into two types: discrete and continuous
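As a rough illustration, the four sub-types map onto R objects roughly as follows. The variable names and values are made up for this sketch and do not come from the slides.

```r
# Hypothetical values, for illustration only
colour <- factor(c("red", "blue", "red"))            # nominal: unordered categories
grade  <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                     # ordinal: ordered categories
children <- c(0L, 2L, 1L)                            # discrete: integer counts
height   <- c(1.62, 1.80, 1.75)                      # continuous: real-valued measurements

is.ordered(colour)   # FALSE
is.ordered(grade)    # TRUE
grade < "high"       # order comparisons are only defined for ordered factors
```

Storing categorical data as a factor, rather than as plain text, lets R choose suitable summaries (counts instead of means) automatically.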

SLIDE 4

Data

Types of statistical data

“Methods for viewing and summarizing data depend on which type of data it is.”

SLIDE 5

Data

Types of statistical data

Data can also be classified by the number of variables:
◮ univariate data: the dataset consists of a single variable; graphical and numeric summaries describe that one variable.
◮ bivariate data: there are two variables in the dataset.
◮ multivariate data: two or more variables in the dataset.

SLIDE 6

Data

Population vs. sample

◮ ‘Data’ can refer either to the population or just to a sample selected from that population.
◮ It is very important to distinguish between numerical measures of a population and numerical measures of a sample.
◮ A parameter is a numerical measure of a population; a statistic is a numerical measure of a sample.

SLIDE 7

Data

Summarising data

◮ Summarise the (sample) data in order to present them in a more meaningful or more easily interpreted form.
◮ Descriptive statistics and graphical summaries: methods for summarising, describing and displaying data.
◮ Statistical inference: using descriptive or summary measures of the sample to learn about characteristics of the population.

SLIDE 8

Summarising Data

Example: machine breakdowns

The engineer in charge of the maintenance of the machine keeps records on the breakdown causes over a period of a year. Altogether there are 46 breakdowns, of which 9 are “electrical causes”, 24 are “mechanical causes”, and 13 are “operator misuse”. How should we summarise this data?

SLIDE 9

Summarising Data

Example: machine breakdowns

Is this categorical data or numerical data? How many variables? The data are categorical and consist of a single variable: 46 observations, x1, ..., x46, each taking one of the values {electrical, mechanical, misuse}.

SLIDE 11

Summarising Data

categorical data - table

◮ A table is the most common way to summarise categorical data:

  Breakdown cause   Frequency
  Electrical                9
  Mechanical               24
  Misuse                   13
  Total                    46

◮ Categorical data can also be summarised graphically with barplots, dot charts and pie charts.
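A minimal sketch of how this table and the matching barplot could be produced in R. The vector name `cause` is an assumption; the counts are those given in the example.

```r
# Rebuild the 46 observations from the counts given in the example
cause <- rep(c("Electrical", "Mechanical", "Misuse"), times = c(9, 24, 13))

freq <- table(cause)    # frequency table of breakdown causes
freq
prop.table(freq)        # relative frequencies (proportions of the 46 breakdowns)

barplot(freq, xlab = "Breakdown cause", ylab = "Frequency")
```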

SLIDE 12

Summarising Data

categorical data - barplot

[Figure: barplot of Frequency (axis 5 to 25) against Breakdown cause: Electrical, Mechanical, Misuse]

SLIDE 13

Summarising Data

categorical data - misleading plots

When we look at a barplot, we are trying to visualize differences between the data presented, assuming the scale starts at 0. The reader can be deliberately misled by a graphic whose axis is not zero-based. Such misleading barplots appear frequently in the media.
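The effect can be reproduced with a short sketch. The counts below are invented for illustration (the slides do not list the underlying values), and the object name `pref` is an assumption.

```r
# Hypothetical counts in thousands, invented for this sketch
pref <- c(IE = 43.5, Firefox = 44.2, Chrome = 45.6)

par(mfrow = c(1, 2))    # two plots side by side

# Honest version: the axis starts at 0, so bar heights are comparable
barplot(pref, ylim = c(0, 50), main = "Zero-based axis", ylab = "Thousands")

# Misleading version: the axis is truncated to 42-46, exaggerating the differences
# (xpd = FALSE clips the bars to the truncated plotting region)
barplot(pref, ylim = c(42, 46), xpd = FALSE, main = "Truncated axis",
        ylab = "Thousands")
```

With these made-up numbers the largest bar is under 5% taller than the smallest, yet on the truncated axis it looks more than twice as tall.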

SLIDE 14

Summarising Data

categorical data - misleading plots

[Figure: two barplots of Web Preference (IE, Firefox, Chrome) in thousands; one with the axis running from 0 to 40, the other with the axis truncated to 42-46]

SLIDE 15

Summarising Data

categorical data - others, Pie chart

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A barchart or dotchart is a preferable way of displaying this type of data.

SLIDE 16

Summarising Data

numerical data

Now consider numerical data. What we want to understand is the distribution of the data:
◮ what is the range of the data?
◮ what is the central tendency?
◮ how spread out are the values?
These questions can be answered graphically or numerically.

SLIDE 17

Summarising Data

measure of central tendency

A measure of central tendency: a numerical measure that locates the centre of a distribution of measurements or describes a ’typical value’.
◮ Most common measures of centre: 3M (Mode, Median, Mean)
◮ They not only simplify the description of the data but also allow different datasets to be compared quantitatively.

SLIDE 18

Summarising Data

measure of central tendency - 3M

The most common measures of centre, 3M:
◮ Mode: the observation in the dataset that occurs most often (i.e., has the highest frequency of occurrence).
◮ Median: the middle number in an ordered dataset.
◮ Mean: the arithmetic average of all the measurements in the dataset.

SLIDE 19

Summarising Data

measure of central tendency - example

Find the 3M of the following sample: X = {18, 19, 18, 20, 18, 18, 20, 21, 37, 18}
◮ Mode: 18
◮ Median: 18.5
◮ Mean: 20.7
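These three values can be checked in R. Base R has no built-in function for the statistical mode (`mode()` reports the storage type instead), so the sketch below derives it from a frequency table. The helper name `mode_x` is an assumption.

```r
x <- c(18, 19, 18, 20, 18, 18, 20, 21, 37, 18)

counts <- table(x)                                       # frequency of each distinct value
mode_x <- as.numeric(names(counts)[which.max(counts)])   # most frequent value

mode_x      # 18
median(x)   # 18.5: average of the 5th and 6th ordered values (18 and 19)
mean(x)     # 20.7: pulled upwards by the single large value 37
```

The single large observation (37) moves the mean but leaves the mode and median almost unaffected.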

SLIDE 20

Summarising Data

measures of variability - how spread out

A measure of variability: a single value that measures the internal variation of the data, i.e., how much the data items vary from one another or from a central point.
◮ Three of the more commonly used measures of variability: Range, Variance, Standard deviation

SLIDE 21

Summarising Data

measure of variability

◮ Range: the difference between the largest and the smallest values in the data (the simplest measure)
◮ Variance: a single value obtained by summing the squares of the deviations from the mean and dividing this sum by (n − 1), where n is the sample size
◮ Standard deviation: the square root of the variance

SLIDE 22

Summarising Data

measure of variability - variance and standard deviation

How to calculate the variance of the sample data, x, with sample size n:
◮ First, calculate the sample mean, x̄.
◮ Then calculate each deviation from the mean (the residual), x − x̄, and square it: (x − x̄)².
◮ Finally, sum the squared residuals and divide by (n − 1).
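Written as formulas, the steps above amount to the following. The index i and the symbols s² and s for the sample variance and standard deviation are conventional notation assumed here; the slides do not name them.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}
```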

SLIDE 23

Summarising Data

measures of variability - variance and standard deviation

Consider the following sample of data: x = {10, 12, 15, 17, 21}
◮ The sample mean is x̄ = (10 + 12 + ··· + 21)/5 = 15
◮ Deviations and squared deviations:

                                        Sum
  x           10   12   15   17   21     75
  x − x̄       −5   −3    0    2    6      0
  (x − x̄)²    25    9    0    4   36     74

◮ The variance of x is 74/(5 − 1) = 18.5 and the standard deviation is √18.5 = 4.30

SLIDE 24

Summarising Data

how to get the mean and variance (standard deviation) in R

> x = c(10,12,15,17,21)   # input data as a vector
> xbar = mean(x)          # calculate the mean directly
> x - xbar                # deviations from the mean
[1] -5 -3  0  2  6
> (x-xbar)^2              # squared deviations
[1] 25  9  0  4 36
> sum((x-xbar)^2)
[1] 74
> sum((x-xbar)^2) / (length(x) - 1)
[1] 18.5
> var(x)                  # calculate the variance directly
[1] 18.5
> sd(x)                   # standard deviation = square root of the variance
[1] 4.301163

SLIDE 25

Summarising Data

Five number summary

The five number summary gives you a rough idea about what the dataset looks like and includes 5 values (items):
◮ The minimum (min) and the maximum (max)
◮ The first quartile (25%), the median (50%), and the third quartile (75%)

SLIDE 26

Summarising Data

Five number summary

> x = c(10,12,15,17,21)
> length(x)
[1] 5
> sort(x)[3]   # the median the hard way: middle of the 5 ordered values
[1] 15
> median(x)    # easy way
[1] 15
> quantile(x, c(0.25,0.5,0.75))
25% 50% 75% 
 12  15  17 
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10      12      15      15      17      21 

SLIDE 27

Graphical summaries of the data

categorical data and numerical data

In general, a table of numbers is not very informative, whereas a picture or graphical representation can be quite informative:
◮ Categorical (or Qualitative) data: pie charts, bar charts and dotplots let the reader grasp the distribution of the data quickly
◮ Numerical (or Quantitative) data: boxplot, histogram and density curve

SLIDE 28

Graphical summaries of the data

numerical (quantitative) data

Numerical data are summarized graphically with a histogram, boxplot, or density curve.
◮ Histogram: constructed by binning the data and counting the number of observations in each bin
◮ Density plot: can be thought of as a smoothed histogram
◮ Boxplot: visualization of the five number summary (shown above)

SLIDE 29

Graphical summaries of the data

numerical data - example: histogram and density plot in R

A random sample of 50 milk containers is selected and their milk contents are weighed. The results are shown below:

1.958 1.951 2.107 2.092 1.955 2.162 2.168 2.134 1.971 2.072 2.049 2.017 2.117 1.977 2.034 2.062 2.110 1.974 1.992 2.018 2.135 2.107 2.084 2.169 2.085 2.018 1.977 2.116 1.988 2.066 2.126 2.167 1.969 2.198 2.078 2.119 2.088 2.172 2.133 2.112 2.066 2.128 2.142 2.042 2.050 2.102 2.000 2.188 1.960 2.128

SLIDE 30

Graphical summaries of the data

numerical data - example: histogram and density plot in R

[Figure: two panels titled "Histogram of milkdata", horizontal axis milkdata from 1.95 to 2.20; one panel with Frequency on the vertical axis, the other with Density]

SLIDE 31

Graphical summaries of the data

numerical data - example: histogram and density plot in R

A histogram and density plot show
◮ Overall pattern: where the centre of the data is, what its spread is, and what the shape of the spread is
◮ Unusual differences: a point lying away from the main part of the pattern is an outlier

SLIDE 32

Graphical summaries of the data

numerical data - example: histogram and density plot, R code

> milkdata = scan()
1: 1.958 1.951 2.107 2.092 1.955 2.162 2.168 2.134 1.971
10: 2.072 2.049 2.017 2.117 1.977 2.034 2.062 2.110 1.974
19: 1.992 2.018 2.135 2.107 2.084 2.169 2.085 2.018 1.977
28: 2.116 1.988 2.066 2.126 2.167 1.969 2.198 2.078 2.119
37: 2.088 2.172 2.133 2.112 2.066 2.128 2.142 2.042 2.050
46: 2.102 2.000 2.188 1.960 2.128
51: 
Read 50 items
>
> hist(milkdata)                                    # frequency histogram
> hist(milkdata, breaks = 10, probability = TRUE)   # density scale
> lines(density(milkdata), col="red")               # overlay the density curve

SLIDE 33

Graphical summaries of the data

numerical data - example: boxplot in R

A boxplot is a schematic presentation of the sample median, the upper and lower quartiles, and the largest and smallest data observations.

That is, it is a graphical representation of the five number summary!

SLIDE 34

Graphical summaries of the data

numerical data - example: boxplot in R

The inter-quartile range (IQR): the central box spans the quartiles (Q1 to Q3), so its length is IQR = Q3 − Q1. If the median is near the centre of the box, the distribution is reasonably symmetric. Upper limit, Q3 + 1.5 × IQR, and lower limit, Q1 − 1.5 × IQR: the whiskers extend to the most extreme observations inside these limits, and values outside them are potential outliers.
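A minimal sketch of these limits, applied to the milk container data from the histogram example. The names `lower_limit`, `upper_limit` and `outliers` are assumptions. Note that `boxplot()` computes its box from hinges, which can differ slightly from the `quantile()` values used here.

```r
milkdata <- c(1.958, 1.951, 2.107, 2.092, 1.955, 2.162, 2.168, 2.134, 1.971,
              2.072, 2.049, 2.017, 2.117, 1.977, 2.034, 2.062, 2.110, 1.974,
              1.992, 2.018, 2.135, 2.107, 2.084, 2.169, 2.085, 2.018, 1.977,
              2.116, 1.988, 2.066, 2.126, 2.167, 1.969, 2.198, 2.078, 2.119,
              2.088, 2.172, 2.133, 2.112, 2.066, 2.128, 2.142, 2.042, 2.050,
              2.102, 2.000, 2.188, 1.960, 2.128)

q   <- quantile(milkdata, c(0.25, 0.75))   # Q1 and Q3
iqr <- unname(q[2] - q[1])                 # same value as IQR(milkdata)

lower_limit <- unname(q[1]) - 1.5 * iqr
upper_limit <- unname(q[2]) + 1.5 * iqr

# Observations beyond the limits are flagged as potential outliers
outliers <- milkdata[milkdata < lower_limit | milkdata > upper_limit]
outliers

boxplot(milkdata, horizontal = TRUE,
        main = "MILK WEIGHT IN LITRE", xlab = "Weight in litre")
```

For this sample no observation falls outside the limits, so the whiskers simply reach the minimum and maximum.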

SLIDE 35

Graphical summaries of the data

numerical data - example: boxplot in R

[Figure: two boxplots titled "MILK WEIGHT IN LITRE"; axis "Weight in litre" from 1.95 to 2.20]

SLIDE 36

Graphical summaries of the data

Pros and cons of histograms and boxplots

              Advantage               Disadvantage
  Histogram   Common                  Information lost due to grouping
              Can handle large data   Effect of bin width
  Boxplots    Easy                    Can’t see multimodality
              Shape and outliers      Don’t know the sample size

SLIDE 37

More than one dataset

SLIDE 38

Example 1

Nerve conductivity speeds

A neurologist is investigating how diseases of the peripheral nerves in humans influence the conductivity speed of the nervous system. The conductivity speed of nerves is determined by administering an electric shock to a patient’s leg and measuring the time it takes to flex a muscle in the patient’s foot. Nerve conductivity speed measurements are made on n = 32 healthy patients and on m = 27 patients who are known to have a peripheral nerve disorder.

SLIDE 39

Example 2

Drug allergies

Three drugs (A, B, C) are compared with respect to the types of allergic reaction that they cause in patients.

A group of n = 300 patients is randomly split into three groups of 100 patients, each of which is given one of the three drugs. The patients are then categorized as being hyperallergic, allergic, mildly allergic, or as having no allergy.

SLIDE 40

Example 3

Heart rate reduction

A new drug for inducing a temporary reduction in a patient’s heart rate is to be compared with a standard drug. The drugs are to be administered to a patient at rest, and the percentage reduction in the heart rate is to be measured after five minutes. The comparison between the two drugs is based on the differences for each patient in the percentage heart rate reductions achieved by the two drugs.

SLIDE 41

Bivariate data

two categorical data

Bivariate, categorical data is often presented in the form of a contingency table.
◮ Count the occurrences of each possible pair of levels and place the frequencies in the corresponding cells.
◮ Relationships can be examined by comparing the rows or the columns.
⇒ Later, statistical tests will be developed to determine whether there is any association between variables.

SLIDE 42

Bivariate data

two categorical data - example 2: drug allergies

            Hyperallergic   Allergic   Mildly allergic   No allergy   Sum
  Drug A               11         30                36           23   100
  Drug B                8         31                25           36   100
  Drug C               13         28                28           31   100
  Sum                  32         89                89           90   300
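A sketch of how this contingency table could be entered and summarised in R. The object name `allergy` and the dimension labels are assumptions; the counts are taken from the table.

```r
# Counts from the table, entered row by row (one row per drug)
allergy <- as.table(matrix(c(11, 30, 36, 23,
                              8, 31, 25, 36,
                             13, 28, 28, 31),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(Drug = c("A", "B", "C"),
                                           Reaction = c("Hyperallergic", "Allergic",
                                                        "Mildly allergic", "No allergy"))))

addmargins(allergy)                # append the row and column sums
prop.table(allergy, margin = 1)    # reaction proportions within each drug (rows sum to 1)

# Grouped barplot: one group per drug, one bar per reaction type
barplot(t(allergy), beside = TRUE, legend.text = TRUE,
        xlab = "Drug", ylab = "Frequency")
```

Comparing the rows of `prop.table(allergy, margin = 1)` is the numerical counterpart of comparing the drugs in the table.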

SLIDE 43

Bivariate data

two categorical data - example 2: drug allergies

Does the table show that the three drugs are any different with respect to the types of allergic reaction that they cause? Comparing the rows, the distribution of reaction types looks similar for the three drugs.

SLIDE 44

Bivariate data

graphical summaries of two-way contingency table

Barplots can be used effectively to show the data in a two-way table.

[Figure: barplot of the two-way table; categories Hyperallergic, Allergic, Mildly allergic, No allergy; frequency axis from 20 to 80]

SLIDE 45

Bivariate data

two numerical data - independent samples

In many situations we have two samples that may or may not come from the same population. When two samples are drawn from populations in such a manner that knowing the outcomes of one sample tells us nothing about the distribution of the other sample, they are independent samples.

SLIDE 46

Bivariate data

two numerical data - independent samples

The stem-and-leaf plot and boxplot are very effective at summarizing a distribution.
◮ when the dataset is small: the stem-and-leaf plot
◮ for larger datasets: the boxplot
◮ to compare the two samples: put the plots side by side (or back to back)

SLIDE 47

Bivariate data

two numerical data - independent samples

We want to compare the means of two populations by selecting a random sample from each of the two populations.

[Diagram: Sample 1 drawn from Population 1; Sample 2 drawn from Population 2]

SLIDE 48

Bivariate data

two numerical data - matched or dependent samples

[Diagram: the same sample measured Before (or Pre) and After (or Post)]

SLIDE 49

Bivariate data

two numerical data - independent samples: t-test

If the populations are normally distributed, or nearly so, and we want to compare the mean of one population with the mean of another population, then a t-test can be used (cf. the nonparametric Wilcoxon test).
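For reference, the test statistic in the unequal-variance (Welch) form, which is what R's `t.test()` computes by default, can be written as below. The notation is assumed here rather than taken from the slides: x̄ and ȳ are the two sample means, s_x² and s_y² the sample variances, and n and m the sample sizes.

```latex
t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}}
```

A large |t| means the observed difference in means is large relative to its estimated standard error.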

SLIDE 50

Bivariate data

Understanding of t-test

[Figure: illustrative plot; horizontal axis from about −2 to 6, vertical axis up to 0.6]

SLIDE 51

Bivariate data

Understanding of t-test

[Figure: illustrative plot; horizontal axis from about −2 to 3, vertical axis up to 0.6]

SLIDE 52

Bivariate data

two numerical data - independent samples: nerve conductivity speeds

Healthy patients data:

52.20 53.81 53.68 54.47 54.65 52.43 54.43 54.06 52.85 54.12 54.17 55.09 53.91 52.95 54.41 54.14 55.12 53.35 54.40 53.49 52.52 54.39 55.14 54.64 53.05 54.31 55.90 52.23 54.90 55.64 54.48 52.89

Nerve disorder patients data:

50.68 47.49 51.47 48.47 52.50 48.55 45.96 50.40 45.07 48.21 50.06 50.63 44.99 47.22 48.71 49.64 47.09 48.73 45.08 45.73 44.86 50.18 52.65 48.50 47.93 47.25 53.98

SLIDE 53

Bivariate data

two numerical data - independent samples: nerve conductivity speeds

[Figure: "Density plots"; horizontal axis from 40 to 60, vertical axis Density from 0.0 to 0.4; N = 32, Bandwidth = 0.4181]

SLIDE 54

Bivariate data

Result

> plot(density(healthpatient), xlim=c(40,60), main="Density plots")
> lines(density(nervedis),lty=2,col="red")
> t.test(healthpatient,nervedis)

	Welch Two Sample t-test

data:  healthpatient and nervedis
t = 10.608, df = 32.684, p-value = 4.064e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.364492 6.436850
sample estimates:
mean of x mean of y 
 53.99437  48.59370 
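The session above assumes that the vectors `healthpatient` and `nervedis` already exist. A self-contained sketch that enters the two samples listed earlier and repeats the analysis might look like this (the object name `res` is an assumption):

```r
# Healthy patients (n = 32)
healthpatient <- c(52.20, 53.81, 53.68, 54.47, 54.65, 52.43, 54.43, 54.06,
                   52.85, 54.12, 54.17, 55.09, 53.91, 52.95, 54.41, 54.14,
                   55.12, 53.35, 54.40, 53.49, 52.52, 54.39, 55.14, 54.64,
                   53.05, 54.31, 55.90, 52.23, 54.90, 55.64, 54.48, 52.89)

# Nerve disorder patients (27 values are listed)
nervedis <- c(50.68, 47.49, 51.47, 48.47, 52.50, 48.55, 45.96, 50.40, 45.07,
              48.21, 50.06, 50.63, 44.99, 47.22, 48.71, 49.64, 47.09, 48.73,
              45.08, 45.73, 44.86, 50.18, 52.65, 48.50, 47.93, 47.25, 53.98)

# Overlaid density estimates of the two samples
plot(density(healthpatient), xlim = c(40, 60), main = "Density plots")
lines(density(nervedis), lty = 2, col = "red")

# Welch two-sample t-test (var.equal = FALSE is the default)
res <- t.test(healthpatient, nervedis)
res$estimate    # the two sample means
res$conf.int    # 95% confidence interval for the difference in means
res$p.value
```

Storing the result in an object gives access to the individual components instead of only the printed summary.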

SLIDE 55

Practice

Hands-on practice
