Statistical Methods: Lecture 1, Dennis Dobler, Vrije Universiteit Amsterdam


SLIDE 1

Course parameters Course Introduction Introduction to Statistics Summarising and graphing data Describing data

Statistical Methods: Lecture 1

Dennis Dobler

Vrije Universiteit Amsterdam

October 30, 2017

Dennis Dobler Vrije Universiteit Amsterdam Statistical Methods: Lecture 1

SLIDE 2

Lecture Overview

◮ Course parameters
◮ Course Introduction
◮ Introduction to Statistics
◮ Summarising and graphing data
◮ Describing data

SLIDE 3

Organisation

◮ Lecturer: Dennis Dobler
◮ Assistants: Nurzhan, Paul, Francisco, Birgit
◮ Lectures: 10, Mon and Wed, see course manual
◮ Computer sessions: Tue (1st week: also Thu), bring a fully charged laptop!
  Division: check Canvas this evening / tomorrow!
◮ Exercise classes: Thu

SLIDE 4

Assessment

◮ Assignments: 4 weekly assignments (graded) + Assignment 0 (pass/fail).
  All have to be handed in, otherwise you fail the course.
◮ Midterm Exam: Monday November 20, 16:00–17:45 (instead of lecture), TenT
◮ Final Exam: Tuesday December 19, 15:15–18:00, Emergohal (Amstelveen)
◮ Exam Grade: Exam = 0.4 × Midterm + 0.6 × Final, if both are at least 5.
  Otherwise Exam = min{Midterm, Final}.
◮ Grade: Grade = 3/4 × Exam + 1/4 × Assignments
◮ Pass: if (Exam >= 5.5 & Grade >= 5.5) print('Pass')
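The grading rule above can be written out as a small R sketch. The function names `exam_grade` and `course_result` and the example marks are made up for illustration; the sketch also ignores the requirement that all assignments must be handed in.

```r
# Exam grade: weighted average if both parts are at least 5, otherwise the minimum.
exam_grade <- function(midterm, final) {
  if (midterm >= 5 && final >= 5) {
    0.4 * midterm + 0.6 * final
  } else {
    min(midterm, final)
  }
}

# Course result: overall grade and pass/fail decision as stated on the slide.
course_result <- function(midterm, final, assignments) {
  exam  <- exam_grade(midterm, final)
  grade <- 0.75 * exam + 0.25 * assignments
  list(exam = exam, grade = grade, pass = exam >= 5.5 && grade >= 5.5)
}

res <- course_result(midterm = 6, final = 7, assignments = 8)  # hypothetical marks
print(res)
```

With these hypothetical marks the exam grade is 6.6 and the overall grade is 6.95, so the result is a pass. A midterm of 4 would set the exam grade to 4 regardless of the final, which fails the course.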

SLIDE 5

Resources

◮ Course manual: available on Canvas
◮ Book: Elementary Statistics, by Mario F. Triola, twelfth edition (Pearson New International Edition), ISBN: 9781292039411
  ◮ Many sections are divided into Part 1 (basics) and Part 2 (more advanced): unless stated otherwise during lectures, only Part 1 has to be studied.
  ◮ A copy of the book is permanently available on the course literature shelves in the library on the first floor of the VU main building (1C-02).
◮ Lecture slides: available on Canvas (NB: sometimes we treat topics that are not in the book).
◮ Software: R (software package) and RStudio (IDE for R). Downloadable from r-project.org and rstudio.com
◮ R manual: pdf available on Canvas
◮ Setting up R: pdf available on Canvas

SLIDE 6

Assignments

◮ Assignments have to be made in groups of 2 students.
◮ Published on Canvas on Wednesday (after the lecture), due Wednesday 23:59 the week after.
◮ Enrol yourself on Canvas asap (if you have not done so yet).
◮ Assignments can only be submitted if you are enrolled in a group.
◮ Groups of size 1 are not allowed! Otherwise: two groups of size 1 are randomly merged.
◮ Hand in online via Canvas.
◮ Deadlines are strict: too late → course failed.
◮ Theoretical questions: solve these without R.
◮ Other questions: use R. Ask questions about these exercises during computer sessions.
◮ On Canvas: how to write your assignment.
◮ On average, one day of work.

SLIDE 7

Exercise classes

◮ The exercise classes are a good opportunity to prepare for the exams.
◮ Exercises from the Triola book are discussed.
  Warning: former book editions have a different numbering!
◮ Prepare the exercises before class; this will maximise your learning effect.
◮ Division into exercise classes: basically the same as for computer sessions.

SLIDE 8

Motivation: Cupid in your network

Source: "Cupid in your network", http://research.facebook.com/

SLIDE 9

Motivation

Why do I have to learn statistics? Can't I just crunch the numbers?

◮ Answer research questions: test claims/hypotheses
◮ Make decisions and/or predictions
◮ Statistical literacy
◮ Statistics is used (almost) everywhere: business (data/business analyst), medical sciences, politics, sports, economics, ...

SLIDE 10

Motivation

Statistical methods are used to:

◮ Compare search engines
◮ Analyse experiments in human-computer interaction
◮ Analyse and interpret survey results
◮ Analyse and interpret user data of social media
◮ Error analysis of social web
◮ Design and analyse data of experiments for social networks
◮ Google Analytics

SLIDE 11

Motivation

Statistical Methods (the course) can be followed up by:

◮ Information Retrieval
◮ Human Computer Interaction
◮ Machine Learning
◮ Data Mining Techniques
◮ The Social Web
◮ Collective Intelligence

More advanced statistics courses in the Master:

◮ Experimental Design and Data Analysis (CS, AI)
◮ Research Methods (IS, AI)

SLIDE 12

Goals and topics

After this course you should be:

◮ familiar with basic principles and techniques of statistics;
◮ able to apply them to data using the statistical package R;
◮ able to present results from statistical analyses in a clear, concise way;
◮ able to interpret and critically evaluate these results.

The topics you will learn about are:

◮ summarising data;
◮ basics of probability theory;
◮ estimating means and proportions;
◮ hypothesis testing for one- and two-sample problems;
◮ correlation and linear regression;
◮ contingency tables.

SLIDE 13

What is statistics?

Statistics is the science of data: the study of collecting, organising, analysing, interpreting and presenting data. We use statistics to gain information about a group of objects (i.e. a population) and/or to make decisions and predictions when randomness is involved.

A census is the collection of data from every member of the population. Usually the population is too large for this. Therefore a sample, a selected subcollection of the population, is studied:

Sample → Data → Analysis → Conclusion about population

SLIDE 14

1.2 Statistical and critical thinking

A statistical study consists of the following steps:

1. Prepare
   ◮ Context
   ◮ Source
   ◮ Sampling method (how to obtain samples?)
2. Analyse
   ◮ Graph data
   ◮ Explore data
   ◮ Apply statistical methods
3. Conclude

SLIDE 15

1.2 Statistical and critical thinking

Recall: a sample is a subcollection of the population. So a different sample → different data. Hence, possibly different conclusions about the population!

A sample should be representative (same characteristics as the population) and unbiased (no systematic difference from the population).

SLIDE 16

1.4 Collecting sample data

There are different methods to collect sample data:

◮ Voluntary response sample: subjects decide themselves to be included in the sample.
◮ Random sample: each member of the population has equal probability of being selected.
◮ Simple random sample: each sample of size n has equal probability of being chosen.
◮ Systematic sampling: after a starting point, select every k-th member.
◮ Convenience sampling: use easily available results.
◮ Stratified sampling: divide the population into subgroups (strata) such that subjects within groups have the same characteristics, then draw a (simple) random sample from each group.
◮ Cluster sampling: divide the population into sections (clusters), then randomly select some of these clusters.
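Several of the sampling methods listed above can be sketched in a few lines of R. The population of 100 numbered members, the four groups A to D, the sample sizes and the seed are all hypothetical choices made only for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical population: 100 members in four groups of 25.
population <- data.frame(
  id    = 1:100,
  group = rep(c("A", "B", "C", "D"), each = 25),
  stringsAsFactors = FALSE
)

# Simple random sample: every subset of size 8 is equally likely.
srs <- population[sample(nrow(population), 8), ]

# Systematic sample: random starting point, then every k-th member.
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

# Stratified sample: a simple random sample of 2 members from each group.
stratified <- do.call(rbind, lapply(split(population, population$group),
                                    function(g) g[sample(nrow(g), 2), ]))

# Cluster sample: randomly select 2 whole groups and keep all their members.
chosen  <- sample(unique(population$group), 2)
cluster <- population[population$group %in% chosen, ]
```

In the stratified sample every group is represented, while the cluster sample contains only the selected groups but all of their members.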

SLIDE 17

1.4 Collecting sample data

Part 2

Important concepts:

◮ Variable: quantity that may vary

In cause and effect studies:

◮ Explanatory (independent) variable: variable which might cause the effect being studied
◮ Response (dependent) variable: variable that represents the effect being studied
◮ Confounding: occurs when the influences of different explanatory variables on the response variable mix and cannot be distinguished anymore.

SLIDE 18

1.4 Collecting sample data

Part 2

Different types of study:

◮ Observational study: characteristics of subjects are observed, but subjects are not modified.
  ◮ Retrospective (case-control): data from the past
  ◮ Cross-sectional: data from one point in time
  ◮ Prospective (longitudinal): data to be collected
◮ Experiment: some treatment is applied to subjects.
  ◮ Sometimes control and treatment group: single-blind and double-blind.
  ◮ Placebo effect, experimenter effect.

SLIDE 19

1.3 Types of data

We have the data, and now what?

Parameter: numerical measurement describing some characteristic of a population. Notation: typically Greek symbols, e.g. µ, σ, ...

Statistic: numerical measurement describing some characteristic of a sample. Notation: small letters, e.g. x̄, s.
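The distinction between a parameter and a statistic can be illustrated with a simulated population. Everything in this sketch is hypothetical: the population of 10000 values, its distribution, the sample size of 50 and the seed are chosen only for the example.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical population of 10000 values (e.g. body lengths in cm).
population <- rnorm(10000, mean = 170, sd = 10)

mu <- mean(population)            # parameter: population mean

sample1 <- sample(population, 50) # two different samples of size 50
sample2 <- sample(population, 50)

xbar1 <- mean(sample1)            # statistic: mean of the first sample
xbar2 <- mean(sample2)            # statistic: mean of the second sample

c(mu = mu, xbar1 = xbar1, xbar2 = xbar2)
```

The parameter is a single fixed number, whereas the two sample means differ from each other and from the population mean: a different sample gives different data and possibly a different conclusion.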

SLIDE 20

1.3 Types of data

Data is not only numbers.

Quantitative (numerical) data: numbers representing counts or measurements.
Qualitative (categorical) data: names or labels ('1', not 1) that do not represent counts or measurements.

Quantitative data:

◮ Discrete data: number of possible values is "countable". E.g. word counts, number of coin tosses.
◮ Continuous data: collection of values is not countable. E.g. length, weight, distance.

SLIDE 21

1.3 Types of data

The level of measurement of data is used to determine which statistical methods might apply to the data.

◮ Qualitative data:
  ◮ Nominal: names, labels, categories (no ordering). E.g. gender, eye color. Cannot be used for computations.
  ◮ Ordinal: categories with ordering, but no (meaningful) differences. E.g. U.S. grades (A-F), opinions (totally disagree / disagree / ... / totally agree).
◮ Quantitative data:
  ◮ Interval: ordering possible and differences between numbers are meaningful, but there is no natural zero starting point. E.g. year of birth, temperatures (Celsius/Fahrenheit).
  ◮ Ratio: ordering possible, differences are meaningful and there is a natural zero starting point. E.g. body length, marathon times.
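One possible way to mirror the four levels of measurement in R is sketched below, using unordered factors for nominal data, ordered factors for ordinal data and plain numbers for interval and ratio data. The variable names and values are invented for the example.

```r
# Nominal: unordered categories.
eye_colour <- factor(c("blue", "brown", "green", "brown"))

# Ordinal: ordered categories; comparisons are allowed, differences are not.
opinion <- factor(c("agree", "disagree", "totally agree"),
                  levels = c("totally disagree", "disagree", "agree", "totally agree"),
                  ordered = TRUE)

# Interval: differences are meaningful, ratios are not (no natural zero).
year_of_birth <- c(1990, 1995, 2001)

# Ratio: natural zero, so ratios are meaningful as well.
marathon_minutes <- c(130, 260)

table(eye_colour)                           # counting categories works for nominal data
opinion[1] > opinion[2]                     # TRUE: "agree" ranks above "disagree"
diff(year_of_birth)                         # 5 6
marathon_minutes[2] / marathon_minutes[1]   # 2: the second time is twice as long
```

R does not enforce the interval/ratio distinction: it would happily divide two years of birth, so deciding which operations are meaningful remains up to the analyst.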

SLIDE 22

1.3 Types of data

Determine the level of measurement for the following data:

◮ M&M colours
◮ Inauguration years of U.S. presidents
◮ Brain volumes (in cm³)
◮ Level of lead in blood (low/medium/high)

SLIDE 23

Recap data

◮ Population vs. sample
◮ Different sample → possibly different conclusion about population
◮ Sample has to be representative and unbiased
◮ Different types of data and levels of measurement

SLIDE 24

Summarising and graphing data

Until the slides about numerical summaries, the coming topics are not in the book.

From now on, we assume that data are from a representative and unbiased sample. Next step: summarise data. How do we summarise data? Numerically/graphically?

Consider the following data set of the amount of cotinine in blood. (Only top rows are displayed.)

##      Smoker Passive smoker Non-smoker
## [1,] 1 384
## [2,]
## [3,] 131 69
## [4,] 173 19
## [5,] 265 1
## [6,] 210

SLIDE 25

Summarising and graphing data

Example of numerical summary:

##                  Smoker Passive smoker Non-smoker
## Mean           172.4750        60.5750    16.3500
## Std. deviation 119.4983       138.0839    62.5335

SLIDE 26

Summarising and graphing data

Or we could summarise the data graphically:

[Figure: three histograms of cotinine (x-axis: Cotinine, y-axis: Frequency); panels: Smokers, Passive smokers, Non-smokers]

What gives most insight into the data?

SLIDE 27

Summarising and graphing data

Every data set comes with a research question. Use your summary to answer your research question. Typically we are interested in the data distribution: where does the data lie?

A good summary shows:

◮ what the data distribution looks like: location, spread/dispersion, range, extremes, accumulations, gaps/holes, symmetry, ...

Depending on context and goal, also whether:

◮ data could be sampled from a certain distribution
◮ data is rounded
◮ different groups are needed for further analysis
◮ there are influences of other variables, e.g. time
◮ there is dependence between variables.

SLIDE 28

Summarising and graphing data

Summarise to describe or find structure in the data distribution:

◮ Graphical: tables, graphs, other figures of data distribution
◮ Descriptive
  ◮ Qualitative: describe shape, location and dispersion/variation of data distribution
  ◮ Quantitative: numerical summaries of location and variation

NB: the first step in every data analysis is to make some figures of the data (if possible) for your own use. This could prevent a wrong choice of statistical methods.

SLIDE 29

Graphical summaries

◮ Frequency distribution (table)
◮ Bar chart
◮ Pareto bar chart
◮ Pie chart
◮ Histogram
◮ Time series

Some of these summaries can only be used for some types of data.

SLIDE 30

Graphical summaries

Data: exam grades

grades=c(10,7,6,10,8,5,8,7,5,9,7)
grades2=cbind(1:11,grades)
colnames(grades2)=c("Student", "Grade")
head(grades2)
##      Student Grade
## [1,]       1    10
## [2,]       2     7
## [3,]       3     6
## [4,]       4    10
## [5,]       5     8
## [6,]       6     5

Frequency distribution: count the occurrences of each category, or the number of values in each interval:

freq=cbind(table(grades2[,2]))
colnames(freq)=c("Frequency")
print(freq)
##    Frequency
## 5          2
## 6          1
## 7          3
## 8          2
## 9          1
## 10         2

SLIDE 31

Graphical summaries

Frequency distribution in R together with cumulative and relative frequency distribution:

freq=cbind(table(grades2[,2]))
freq=cbind(freq[,1],cumsum(freq[,1]),freq[,1]/length(grades),cumsum(freq[,1])/length(grades))
colnames(freq)=c("Frequency","Cumulative","Rel. frequency","Cum. rel. frequency")
options(digits=2)
print(freq)
##    Frequency Cumulative Rel. frequency Cum. rel. frequency
## 5          2          2          0.182                0.18
## 6          1          3          0.091                0.27
## 7          3          6          0.273                0.55
## 8          2          8          0.182                0.73
## 9          1          9          0.091                0.82
## 10         2         11          0.182                1.00

Nicely formatted frequency distribution table:

   Grades  Frequency  Cumulative  Rel. frequency  Cum. rel. frequency
1    5.00       2.00        2.00            0.18                 0.18
2    6.00       1.00        3.00            0.09                 0.27
3    7.00       3.00        6.00            0.27                 0.55
4    8.00       2.00        8.00            0.18                 0.73
5    9.00       1.00        9.00            0.09                 0.82
6   10.00       2.00       11.00            0.18                 1.00

SLIDE 32

Graphical summaries

Bar charts:

[Figure: two bar charts of the exam grades (x-axis: Grades, 5 to 10); panel "Bar chart" (y-axis: Frequency) and panel "Cumulative bar chart" (y-axis: Cum. frequency)]
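The slide shows only the resulting charts. A minimal sketch of how such bar charts could be produced from the exam grades is given below; the exact calls used for the original figure are not known, so `barplot()` on a frequency table is an assumption.

```r
grades  <- c(10, 7, 6, 10, 8, 5, 8, 7, 5, 9, 7)  # exam grades from the earlier slide
freq    <- table(grades)                          # frequency of each grade
cumfreq <- cumsum(freq)                           # cumulative frequencies

par(mfrow = c(1, 2))
barplot(freq, main = "Bar chart", xlab = "Grades", ylab = "Frequency")
barplot(cumfreq, main = "Cumulative bar chart", xlab = "Grades", ylab = "Cum. frequency")
```

The last bar of the cumulative chart always has height equal to the number of observations, here 11.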

SLIDE 33

Graphical summaries

Data: population size in 2015.

population=c(322,1372,147,127,65,81,1278,36,407,1111)
names(population)=c("US", "Chi", "Rus", "Jap", "GB", "Ger", "Ind", "Can", "SAm","Afr")
par(mfrow=c(1,1))
barplot(population,main="Bar chart", ylab="Pop. size (mln)",col="red")

[Figure: bar chart titled "Bar chart" (y-axis: Pop. size (mln)) with bars for US, Chi, Rus, Jap, GB, Ger, Ind, Can, SAm, Afr]

SLIDE 34

Graphical summaries

Data: population size in 2015.

Pareto chart:

par(mfrow=c(1,1))
barplot(sort(population,decreasing = TRUE), main="Pareto bar chart", ylab="Pop. size (mln)", col="blue")

[Figure: bar chart titled "Pareto bar chart" (y-axis: Pop. size (mln)) with bars in decreasing order: Chi, Ind, Afr, SAm, US, Rus, Jap, Ger, GB, Can]

The Pareto chart orders the categories with respect to frequency. It only applies to data of nominal level of measurement.

SLIDE 35

Graphical summaries

Data: population size in 2015.

Pie chart:

pie(population/sum(population), col=c("green", "yellow" , "brown", "blue","red", "grey","purple", "orange", "pink", "black"))

[Figure: pie chart with slices for US, Chi, Rus, Jap, GB, Ger, Ind, Can, SAm, Afr]

The size of each piece of the pie is determined by the relative frequency of the category. Mainly used for qualitative data.

SLIDE 36

Graphical summaries

Histogram: area of bar is proportional to frequency in interval below the bar.

Data: cotinine.

par(mfrow=c(1,3))
hist(cotinine[,1],main="Smokers",xlab="Cotinine",ylab="Frequency")
hist(cotinine[,2],main="Passive smokers",xlab="Cotinine",ylab="Frequency")
hist(cotinine[,3],main="Non-smokers",xlab="Cotinine",ylab="Frequency")

[Figure: three histograms of cotinine (x-axis: Cotinine, y-axis: Frequency); panels: Smokers, Passive smokers, Non-smokers]

SLIDE 37

Graphical summaries

Differences in histograms are caused by the number of cells (intervals) and the location of the bins.

par(mfrow=c(1,2))
hist(cotinine[,1],main="Smokers",xlab="Cotinine",ylab="Frequency",breaks=8)
hist(cotinine[,3],main="Non-smokers",xlab="Cotinine",ylab="Frequency",xlim=c(0,max(cotinine)))

[Figure: two histograms of cotinine (x-axis: Cotinine, y-axis: Frequency); panels: Smokers, Non-smokers]

Difference with bar chart: bars are in natural order and always adjacent. A histogram is for quantitative data.

SLIDE 38

Graphical summaries

Pay attention to the presentation of graphical summaries: use reasonable dimensions (preferably square) and scale.

par(mfrow=c(1,3))
hist(cotinine[,1],main="Smokers",xlab="Cotinine",ylab="Frequency",xlim=c(100,400),ylim=c(0,100))
hist(cotinine[,1],main="Smokers",xlab="Cotinine",ylab="Frequency",ylim=c(0,12))
hist(cotinine[,1],main="Smokers",xlab="Cotinine",ylab="Frequency")

[Figure: three histograms of cotinine for smokers (x-axis: Cotinine, y-axis: Frequency), one for each of the three axis settings in the code]

Also use an appropriate title and label the axes.

SLIDE 39

Graphical summaries

[Figure: density histograms of cotinine (x-axis: Cotinine, y-axis: Density) for Smokers and Non-smokers, shown once with separate density scales per panel and once with a common density scale]

SLIDE 40

Graphical summaries

Time series plot: graphical presentation of a quantity that varies over time.

Data: yearly number of sunspots.

par(mfrow=c(1,2))
plot(1700:1988,sunspot.year,xlab="Year",ylab="Number sunspots",type="l")
plot(1700:1988,log(sunspot.year),xlab="Year",ylab="Log number sunspots",type="l")

[Figure: two time series plots (x-axis: Year, 1700 to 1988); y-axes: Number sunspots and Log number sunspots]

Same data? Pay attention to scale!

SLIDE 41

Graphical summaries

Important for graphical summaries:

◮ Choice of summary depends on type of data (level of measurement) and context
◮ Choose appropriate dimensions and scale of your figures
◮ Try to use the same scale to compare data sets (if applicable)

SLIDE 42

Recall, there are two ways of describing data:

◮ Qualitatively: describe shape, location and dispersion/variation of data distribution
◮ Quantitatively (numerically): numerical summaries of location and variation

SLIDE 43

Qualitative description

Shape: make a smooth approximation of the histogram. The shape of the smooth curve gives an idea of the data distribution.

[Figure: three density curves (y-axis: Density); panels: Symmetrical, Right-skewed, Uniform]

SLIDE 44

Qualitative description

Location: data distribution can also be described by location (position on x axis).

Same shape, but different location:

[Figure: two density curves of the same shape (y-axis: Density); panels: Around 0, Around 10]

SLIDE 45

Qualitative description

Dispersion (spread/variation): the amount by which data values vary among themselves.

Same shape, same location, but different dispersion:

[Figure: two density curves on the same x-range (y-axis: Density); panels: Smaller dispersion, Larger dispersion]

SLIDE 46

Qualitative description

Here, the difference in shape is easy to see, but the difference in location and/or dispersion is not:

[Figure: two density plots (y-axis: Density) with clearly different shapes]

Luckily we can also describe location and spread/dispersion numerically.

SLIDE 47

Numerical summaries

Numerical summaries aim to describe the data distribution with numerical values for

◮ location
◮ spread
◮ skewness
◮ ...

From now on we follow the book again in this lecture: Chapter 2.

SLIDE 48

2.2 Measures of center

Measure of center: value at the center or middle of a data set.

There are different measures of center:

◮ Mean
◮ Median
◮ Mode

SLIDE 49

2.2 Measures of center

Let $(x_1, \dots, x_n)$ be a data set of size $n$. The mean is the "average" (in colloquial language) and is denoted and computed by

$$\text{mean} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}.$$

SLIDE 50

2.2 Measures of center

The mean uses every data value. It is not "robust": it is strongly affected by extreme values.

In R: mean()

SLIDE 51

2.2 Measures of center

The sample mean is denoted by $\bar{x}$. So $\bar{x} = \left(\sum_{i=1}^{n} x_i\right)/n$.

The population mean is denoted by $\mu$. So $\mu = \left(\sum_{i=1}^{N} x_i\right)/N$.

SLIDE 52

2.2 Measures of center

The median is the "middle" value of the data set (after sorting). The median is robust: it is not much affected by extreme values.

In R: median()
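The difference in robustness between mean and median can be seen in a tiny example. The two data vectors below are made up; they differ only in one extreme value.

```r
x <- c(5, 6, 7, 8, 9)    # hypothetical data
y <- c(5, 6, 7, 8, 90)   # the same data with one extreme value

mean(x);  median(x)      # 7 and 7
mean(y);  median(y)      # 23.2 and 7

# For an even number of values, R averages the two middle values.
median(c(1, 3, 5, 7))    # 4
```

The single extreme value moves the mean from 7 to 23.2, while the median stays at 7.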

SLIDE 53

2.2 Measures of center

The mode is the value that occurs with highest frequency. The mode is hardly used for numerical data, but it can be applied to nominal data.

Also, R does not have a built-in function to compute the mode... (Try to write your own!)

SLIDE 54

2.2 Measures of center

If a data set has a unique mode, it is unimodal. If it has two different modes, it is bimodal. Or multimodal if there are more than two different modes.

Loosely speaking, graphs with two different peaks are called bimodal as well:

[Figure: density curve with two peaks (y-axis: Density), titled "Bimodal"]
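Since base R has no function for the statistical mode, one possible helper is sketched below. The name `stat_mode` is hypothetical, and this is only one of several reasonable designs: it returns every value with the highest frequency, as character strings, so that bimodal and multimodal data are handled too.

```r
# Returns all values that occur with the highest frequency (as character strings).
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

grades <- c(10, 7, 6, 10, 8, 5, 8, 7, 5, 9, 7)        # exam grades from the earlier slide
stat_mode(grades)                                      # "7"
stat_mode(c("red", "blue", "red", "blue", "green"))    # "blue" "red" (bimodal)
```

Because the result comes from the names of a frequency table, numeric input has to be converted back with `as.numeric()` if a number is needed.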

SLIDE 55

Course parameters Course Introduction Introduction to Statistics Summarising and graphing data Describing data

2.2 Measures of center

Data: exam grades.

##       Student Grade
##  [1,]       1    10
##  [2,]       2     7
##  [3,]       3     6
##  [4,]       4    10
##  [5,]       5     8
##  [6,]       6     5
##  [7,]       7     8
##  [8,]       8     7
##  [9,]       9     5
## [10,]      10     9
## [11,]      11     7

Sorted grades:

##  [1]  5  5  6  7  7  7  8  8  9 10 10

$\text{Mean} = \frac{\sum_{i=1}^{11} x_i}{11} = \frac{82}{11} = 7.4545\ldots \approx 7.5$, Median = 7 (7 is the middle value in the sorted vector), Mode = 7 (7 occurs most frequently).
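The three measures for the grades can be checked with a few lines of R. This is a minimal sketch: the vector name grades is an assumption, and the values are typed in from the table above.

```r
# Exam grades from the slide, in student order (the name `grades` is an assumption).
grades <- c(10, 7, 6, 10, 8, 5, 8, 7, 5, 9, 7)

sort(grades)     # sorted vector, as shown on the slide
sum(grades)      # numerator of the mean
mean(grades)     # sum divided by the number of grades
median(grades)   # middle (6th) value of the 11 sorted grades
table(grades)    # frequency of each grade; the most frequent grade is the mode
```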


SLIDE 56


2.2 Measures of center

Data: cotinine. Recall:

[Figure: histograms of the cotinine levels of Smokers and Non-smokers; horizontal axis: Cotinine, vertical axis: Frequency.]

Mean and median computed in R for these two data sets:

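# columns 1 and 3 of cotinine hold the smokers and the non-smokers
means = apply(cotinine[, c(1, 3)], 2, mean)
medians = apply(cotinine[, c(1, 3)], 2, median)
# collect the results in a 2 x 2 matrix with one row per measure
table = matrix(c(means, medians), 2, 2, byrow = TRUE)
colnames(table) = c("Smoker", "Non-smoker")
rownames(table) = c("Mean", "Median")
# print with 4 significant digits
options(digits = 4)
print(table)
##        Smoker Non-smoker
## Mean    172.5      16.35
## Median  170.0       0.00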

Note the larger difference between mean and median for non-smokers. This is caused by the skewed distribution, in particular the outliers to the right.
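The effect of a few large values on the mean can be illustrated with a small made-up sample. The numbers below are not the cotinine data; they are hypothetical values chosen only to mimic a right-skewed distribution.

```r
# Made-up, right-skewed sample (NOT the cotinine data): mostly small values, one large value.
x <- c(0, 0, 0, 1, 2, 3, 250)

mean(x)     # pulled upwards by the single large value
median(x)   # stays at the middle value

# Dropping the extreme value changes the mean a lot, the median only a little.
mean(x[x < 250])
median(x[x < 250])
```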


SLIDE 57


2.3 Measures of variation

Consider the following data set, which consists of waiting times (min) at two banks:

##    Bank1 Bank2
## 1    4.1   6.6
## 2    5.2   6.7
## 3    5.6   6.7
## 4    6.2   6.9
## 5    6.7   7.1
## 6    7.2   7.2
## 7    7.7   7.3
## 8    7.7   7.4
## 9    8.5   7.7
## 10   9.3   7.8
## 11  11.0   7.8

For both banks, the mean and the median waiting time are equal to 7.2 minutes. But which bank's customers would be more satisfied?

[Figure: "Histogram of bank1" and "Histogram of bank2", each drawn on its own scale; horizontal axis: Waiting time, vertical axis: Density.]
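The claim about the equal centers can be reproduced with the sketch below. The vector names bank1 and bank2 follow the names used in the R code on the later slides; the values are typed in from the table above.

```r
# Waiting times (min) copied from the table on the slide.
bank1 <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
bank2 <- c(6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8)

# Same center for both banks ...
c(mean(bank1), mean(bank2))
c(median(bank1), median(bank2))

# ... but the smallest and largest waiting times differ a lot.
range(bank1)
range(bank2)
```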

SLIDE 58


2.3 Measures of variation

Adjust the scale to allow a better comparison of the histograms:

par(mfrow=c(1,2))
hist(bank1, xlab="Waiting time", prob=TRUE, xlim=c(4,11), ylim=c(0,0.7))
hist(bank2, xlab="Waiting time", prob=TRUE, xlim=c(4,11), ylim=c(0,0.7), breaks=c(6,7,8))

[Figure: "Histogram of bank1" and "Histogram of bank2" drawn on a common scale; horizontal axis: Waiting time (4 to 11), vertical axis: Density (0 to 0.7).]

We see that the variation (dispersion/spread) is much smaller at Bank 2. How to quantify this?


SLIDE 62


2.3 Measures of variation

The (sample) standard deviation is a very common measure of variation: it measures how much the values deviate from the sample mean. It is denoted and computed by

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}}. $$

The sample standard deviation is the square root of the sample variance ("mean quadratic deviation from the mean"):

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}. $$

The sample standard deviation is computed in R by sd() and the variance by var().

The population standard deviation is denoted by σ and the population variance by σ².

Advantage of the standard deviation: it is measured in the same units as the data values. (As opposed to the variance: square of units.)

Another measure of variation is the range: maximum − minimum. The range uses only two values, so it is very sensitive to extreme values.
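As a sanity check, both forms of the formula for s can be evaluated by hand in R and compared with sd(). This is a minimal sketch using the Bank 1 waiting times from the earlier slide; the names s_def and s_short are made up for illustration.

```r
# Bank 1 waiting times (min), typed in from the earlier slide.
bank1 <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
n <- length(bank1)

# Definition: squared deviations from the mean, divided by n - 1, then square root.
s_def <- sqrt(sum((bank1 - mean(bank1))^2) / (n - 1))

# Shortcut form: uses only the sum of the values and the sum of their squares.
s_short <- sqrt((n * sum(bank1^2) - sum(bank1)^2) / (n * (n - 1)))

c(s_def, s_short, sd(bank1))   # all three should agree
c(s_def^2, var(bank1))         # the variance is the squared standard deviation

# Range as a single number: maximum - minimum.
diff(range(bank1))
```

Note that range() itself returns the minimum and the maximum, so diff() is needed to get the range in the sense used here.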


SLIDE 63


2.3 Measures of variation

Let’s compute the standard deviation and variance for the two banks:

s1 = sd(bank1); s2 = sd(bank2)
v1 = var(bank1); v2 = var(bank2)
Table = matrix(c(s1, s2, v1, v2), 2, 2, byrow = TRUE)
rownames(Table) = c("Standard deviation", "Variance")
colnames(Table) = c("Bank 1", "Bank 2")
print(Table)
##                    Bank 1 Bank 2
## Standard deviation  1.961  0.445
## Variance            3.846  0.198

Which customers would be more satisfied?

[Figure: "Histogram of bank1" and "Histogram of bank2" on a common scale; horizontal axis: Waiting time, vertical axis: Density.]


SLIDE 66


2.4 Measures of relative standing and boxplots

Percentiles can be used as measures of location and dispersion as well. Percentile $P_i$: i% of the data values are smaller than $P_i$ and (100 − i)% are larger than $P_i$.

Special percentiles: the so-called quartiles Q1, Q2 and Q3. The quartiles divide the data set into four groups, each containing (approximately) 25% of the data values. Furthermore:

◮ Q1 = P25: first quartile
◮ Q2 = P50 = median: second quartile
◮ Q3 = P75: third quartile.

In R the quartiles (and extrema) are easily obtained with summary() or quantile():

summary(bank1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.1     5.9     7.2     7.2     8.1    11.0
quantile(bank2)
##    0%   25%   50%   75%  100%
##  6.60  6.80  7.20  7.55  7.80
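Percentiles other than the quartiles can be requested through the probs argument of quantile(). The sketch below uses the Bank 1 waiting times from the earlier slide; the choice of the 10th and 90th percentile is only an example.

```r
# Bank 1 waiting times (min), typed in from the earlier slide.
bank1 <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)

# The three quartiles Q1, Q2 (median) and Q3.
quantile(bank1, probs = c(0.25, 0.50, 0.75))

# Arbitrary percentiles, here P10 and P90 as an example.
quantile(bank1, probs = c(0.10, 0.90))
```

Several definitions of sample quantiles are in use. quantile() selects one through its type argument (the default is type = 7), so values computed by hand with another rule may differ slightly.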


SLIDE 69


2.4 Measures of relative standing and boxplots

5-number summary:

1. Minimum
2. First quartile, Q1
3. Median, Q2
4. Third quartile, Q3
5. Maximum

The interquartile range (IQR) is the difference between the third and the first quartile: IQR = Q3 − Q1.

The graphical representation of the 5-number summary is the boxplot (R: boxplot()):

[Figure: boxplot of the waiting times at Bank 1 (axis: Waiting time, 4 to 11), annotated from top to bottom as follows.]

◮ Highest value: maximum
◮ Top of box: Q3
◮ Band in box: median
◮ Bottom of box: Q1
◮ Lowest value: minimum
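The interquartile range and the boxplot for the two banks can be produced as sketched below. The data are typed in from the earlier slide; the axis label and group names passed to boxplot() are illustrative choices.

```r
# Waiting times (min), typed in from the earlier slide.
bank1 <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
bank2 <- c(6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8)

# Interquartile range: Q3 - Q1, based on the quartiles returned by quantile().
IQR(bank1)
IQR(bank2)

# Side-by-side boxplots make the difference in spread visible at a glance.
boxplot(bank1, bank2, names = c("Bank 1", "Bank 2"), ylab = "Waiting time")
```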


SLIDE 70


2.4 Measures of relative standing and boxplots

The boxplot shows the 5-number summary, but it also provides information about the distribution: if the median is not in the center of the box and/or there are outliers, then the distribution is asymmetrical.

[Figure: boxplots of cotinine for Smokers, Passive smokers and Non-smokers, together with the histograms of the three groups; histogram axes: Cotinine (horizontal), Frequency (vertical).]

The whiskers are the lines extending from the box. In R, by default, each whisker ends at the most extreme data value that lies within 1.5 × IQR of the box. Outliers are points that lie beyond the whiskers.
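The 1.5 × IQR rule can be applied by hand, as in the sketch below. The value 20 appended to the Bank 1 waiting times is made up purely to create an outlier, and the names lower_fence and upper_fence are illustrative.

```r
# Bank 1 waiting times plus one made-up extreme value (20) to create an outlier.
bank1 <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
x <- c(bank1, 20)

# Quartiles and IQR.
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Points further than 1.5 x IQR away from the box are flagged as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
x[x < lower_fence | x > upper_fence]

# The outliers that boxplot() itself would draw as separate points.
boxplot.stats(x)$out
```

boxplot.stats() computes the box edges with fivenum() (hinges) rather than quantile(), so its fences can differ slightly from the hand-computed ones, although for this example both approaches flag the same point.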


SLIDE 72


Summary: important aspects

When presenting a graphical and/or numerical summary, provide brief comments on the figure/numbers as well, i.e. comment on the most relevant aspects of the figure/numbers. Recall:

◮ If possible, always start by making figures for your own use to get an impression of the data
◮ Summaries can be graphical and/or numerical
◮ The choice of summary depends on the data type and context
◮ Graphical summaries: choose the right size and briefly comment on the most relevant aspects
◮ Numerical summaries: choose the right measure of location and/or variation and briefly comment on what the numbers are showing.
