Data presentation and descriptive statistics
Paola Grosso SNE research group Today with Jeroen van der Ham as “special guest”
Sep.06 2010 - Slide 2
Instructions for use
I do talk fast:
– Ask me to repeat if something is not clear;
– I made an effort to keep it ‘interesting’, but you are the ‘guinea pigs’… feedback is welcome!
– But you will have to do some ‘work’;
– We will start slow and accelerate;
– We will (ambitiously?) cover lots of material;
– We will also use more than the standard two hours.
(… but useful also in commercial/industry/business settings);
(incidentally, it also means higher grades during RPs). We want to avoid hearing this from you.
This is our main focus!
Estimate the height?
Non-probability sampling: some elements of the population have no chance of selection, or the probability of selection can't be accurately determined.
– Accidental (or convenience) sampling;
– Quota sampling;
– Purposive sampling.
Probability sampling: every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined.
– Simple random sample
– Systematic random sample
– Stratified random sample
– Cluster sample
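A minimal R sketch of three of the probability sampling schemes listed above. The population of 100 units, the two strata "A" and "B" and the sample size are hypothetical and only serve as an illustration; they are not from the slides.

```r
# Hypothetical population: 100 units split over two strata (assumption).
set.seed(42)                          # make the random draws reproducible
population <- data.frame(
  id      = 1:100,
  stratum = rep(c("A", "B"), times = c(70, 30))
)
n <- 10                               # desired sample size

# Simple random sample: every unit has the same chance of selection.
simple_ids <- sample(population$id, size = n)

# Systematic random sample: random start, then every k-th unit.
k <- nrow(population) / n             # sampling interval (here 10)
start <- sample(1:k, size = 1)
systematic_ids <- seq(from = start, by = k, length.out = n)

# Stratified random sample: draw 10% from each stratum separately.
stratified_ids <- unlist(lapply(
  split(population$id, population$stratum),
  function(ids) sample(ids, size = length(ids) / 10)
))
```

In each scheme the selection probability of a unit is known in advance, which is what distinguishes these from convenience or quota sampling.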
values are distinct and separate, i.e. they can be counted
values can be sorted according to category.
values can be assigned a code in the form of a number, where the numbers are simply labels
values can be ranked or have a rating scale attached
Values may take on any value within a finite or infinite interval
The attribute that varies in each experiment.
– The number of suitcases lost by an airline.
– The height of apple trees.
– The number of apples produced.
– The number of green M&M's in a bag.
– The time it takes for a hard disk to fail.
– The production of cauliflower by weight.
Friends   Frequency   Relative Frequency   Percentage (%)   Cumulative (less than)   Cumulative (greater than)
0-50      6           6/20                 30%              6                        20
51-100    4           4/20                 20%              10                       14
101-150   2           2/20                 10%              12                       10
151-200   4           4/20                 20%              16                       8
201-250   1           1/20                 5%               17                       4
251-300   3           3/20                 15%              20                       3
How many friends do you have on Facebook? …. 23,44,156,246,37,79,156,123,267,12, 145,88,95,156,32,287,167,55,256,47,
– Identify lower and upper limits
– Choose the number of classes and their width
– Segment data in classes
– Each value should fit in one (and no more than one) class: classes are mutually exclusive
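The steps above can be sketched in R with cut() and table(), using the 20 Facebook friend counts from this slide. The variable names are illustrative.

```r
# The 20 answers to "How many friends do you have on Facebook?"
friends <- c(23, 44, 156, 246, 37, 79, 156, 123, 267, 12,
             145, 88, 95, 156, 32, 287, 167, 55, 256, 47)

# Lower/upper limits and class width: 0-50, 51-100, ..., 251-300.
# cut() uses right-closed intervals (0,50], (50,100], ... so each
# value falls in exactly one class.
breaks <- seq(0, 300, by = 50)
labels <- c("0-50", "51-100", "101-150", "151-200", "201-250", "251-300")
classes <- cut(friends, breaks = breaks, labels = labels, include.lowest = TRUE)

freq <- as.vector(table(classes))
freq_table <- data.frame(
  Friends        = labels,
  Frequency      = freq,
  Percentage     = 100 * freq / length(friends),
  CumLessThan    = cumsum(freq),
  CumGreaterThan = rev(cumsum(rev(freq)))
)
print(freq_table)
```

The resulting frequencies (6, 4, 2, 4, 1, 3) match the table on the slide.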
Of course not everybody is a believer: “As the Chinese say, 1001 words is worth more than a picture” John McCarthy
Useful when dealing with large data sets;
Show outliers and gaps in the data set;
Add values
Add title (or caption in document)
Add axis legends
Caution:
show historical data over time;
frequency distribution.
Year        RP2 thesis   Students
2004/2005   9            17
2005/2006   7            14
2006/2007   8            15
2007/2008   11           13
2008/2009   10           17
– Positive (bottom left -> top right)
– Negative (top left -> bottom right)
– Null
Arrhenius plot, Bland-Altman plot, Bode plot, Lineweaver–Burk plot, Forest plot, Funnel plot, Nyquist plot, Nichols plot, Galbraith plot, Recurrence plot, Q-Q plot, Star plot, Shmoo plot, Stemplot, Violin plot, Ternary plot
$> apt-get install r-base-core
> salaries <- read.csv(file="Path-to-file/Salary.csv")
> salaries
> salaries$Salary
> barplot(salaries$Salary)
> dev.copy(png,'MyBarPlot.png')
> dev.off()
Can you improve this barplot?
help(barplot)
??plot
Student,Salary
1,1250
2,2200
3,2345
4,6700
5,15000
6,3300
7,2230
8,1750
9,1900
10,1750
11,2100
12,2050
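One possible answer to "Can you improve this barplot?", as a hedged sketch: it adds values, a title and axis legends. The data frame is typed in directly instead of read from Salary.csv so that the example is self-contained; the title and axis texts are assumptions chosen for illustration.

```r
# Same content as Salary.csv, entered inline.
salaries <- data.frame(
  Student = 1:12,
  Salary  = c(1250, 2200, 2345, 6700, 15000, 3300,
              2230, 1750, 1900, 1750, 2100, 2050)
)

png("MyBarPlot.png")                 # open the PNG device before plotting
mids <- barplot(salaries$Salary,
                names.arg = salaries$Student,          # label each bar
                main = "Monthly salary per student",   # title (assumed wording)
                xlab = "Student",                      # axis legends
                ylab = "Salary",
                ylim = c(0, 1.1 * max(salaries$Salary)))
# barplot() returns the bar midpoints; use them to print the values
# just above each bar.
text(x = mids, y = salaries$Salary, labels = salaries$Salary,
     pos = 3, cex = 0.7)
dev.off()                            # close the device to write the file
```

Opening the device with png() before plotting, rather than copying the screen device afterwards, also works in a non-interactive session.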
OS3 graduates   Monthly salary (gross in €)
Grad 1          1250
Grad 2          2200
Grad 3          2345
Grad 4          6700
Grad 5          15000
Grad 6          3300
Grad 7          2230
Grad 8          1750
Grad 9          1900
Grad 10         1750
Grad 11         2100
Grad 12         2050
What are the median, mean and mode of this data set? Can you figure this out in R? What did you learn?
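A possible way to work this out in R, as a sketch. R has mean() and median(), but no built-in statistical mode (the function mode() returns the storage type of an object), so the mode is derived here from a frequency table.

```r
# Monthly salaries of the 12 OS3 graduates from the table above.
salary <- c(1250, 2200, 2345, 6700, 15000, 3300,
            2230, 1750, 1900, 1750, 2100, 2050)

salary_mean   <- mean(salary)     # about 3548
salary_median <- median(salary)   # 2150

# Mode: the value(s) with the highest count in the frequency table.
counts <- table(salary)
salary_mode <- as.numeric(names(counts)[counts == max(counts)])   # 1750

c(mean = salary_mean, median = salary_median, mode = salary_mode)
```

The two large salaries (6700 and 15000) pull the mean well above the median and the mode, which is why the median is often the better summary for data with outliers.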
Causes:
heavy-tailed distribution
same number of values above and below the mean which is represented by the peak of the curve.
symmetrical distribution are equal. Outliers create skewed distributions:
above the mean: the mean is greater than the median and the mode;
below the mean: the mean is smaller than the median and the mode.
Beware notational confusion!
x̄ – mean of our sample
µ – mean of our parent distribution
s – S.D. of our sample
σ – S.D. of our parent distribution
Data sample: x̄, s_data. Parent distribution (from which the data sample was drawn): µ, σ_parent.
… but be clear which one you mean!
… data sample using σ_parent = …
$s_{\text{data}} = \sqrt{\frac{1}{N}\sum_{i}(x_i - \bar{x})^2}$
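A short R sketch of the sample standard deviation with the 1/N normalisation shown above. Note that R's built-in sd() divides by N − 1 instead, so the two numbers differ slightly. The salary vector from the earlier exercise is reused as example data.

```r
# Example data: the 12 monthly salaries from the earlier exercise.
x <- c(1250, 2200, 2345, 6700, 15000, 3300,
       2230, 1750, 1900, 1750, 2100, 2050)
n <- length(x)

# Standard deviation of the data sample, dividing by N.
s_data <- sqrt(sum((x - mean(x))^2) / n)

# R's sd() uses the N - 1 denominator.
s_r <- sd(x)

# The two differ by a factor sqrt((N - 1) / N).
all.equal(s_data, s_r * sqrt((n - 1) / n))
```

For large N the difference becomes negligible, but it is worth stating which definition a reported S.D. uses.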
25% of data points ≤ Q1;
50% of data points ≤ Q2 (Q2 is the median);
75% of data points ≤ Q3.
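The quartiles can be computed in R with quantile(), sketched here on the salary data from the earlier exercise. R implements several quantile definitions (the type argument); the default is used below, so other tools may report slightly different Q1 and Q3 for small samples.

```r
# Example data: the 12 monthly salaries from the earlier exercise.
salary <- c(1250, 2200, 2345, 6700, 15000, 3300,
            2230, 1750, 1900, 1750, 2100, 2050)

# Q1, Q2 (the median) and Q3 with R's default quantile definition.
q <- quantile(salary, probs = c(0.25, 0.50, 0.75))
q

# Interquartile range: Q3 - Q1.
iqr <- IQR(salary)
```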
It uses the 5-number summary.
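As an illustration, the 5-number summary and the corresponding box plot can be produced in R with fivenum() and boxplot(). The salary data from the earlier exercise is reused; the file name, title and axis label are hypothetical.

```r
# Example data: the 12 monthly salaries from the earlier exercise.
salary <- c(1250, 2200, 2345, 6700, 15000, 3300,
            2230, 1750, 1900, 1750, 2100, 2050)

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(salary)
five

png("MyBoxPlot.png")                 # hypothetical output file name
bp <- boxplot(salary, horizontal = TRUE,
              main = "Monthly salary of graduates",   # assumed title
              xlab = "Salary")
dev.off()

# Points beyond 1.5 times the box length are drawn individually as outliers.
bp$out
```

Here the two highest salaries (6700 and 15000) are flagged as outliers, so the upper whisker ends at 3300 rather than at the maximum.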
(has dimension D(x)D(y))
(is dimensionless)
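The difference between the two quantities can be checked numerically: the covariance carries the units of x times the units of y, so rescaling one variable rescales it, while the correlation coefficient is dimensionless and stays the same. A small sketch using the weight and food consumption values from the obesity data set in these slides (the factor 1000 is an arbitrary stand-in for a change of unit):

```r
# Values from the obesity data set used in these slides.
weight <- c(84, 93, 81, 61, 95, 86, 90, 78, 85, 72, 65, 75)
food   <- c(32, 33, 33, 24, 39, 32, 34, 28, 33, 27, 26, 29)

cov_xy <- cov(weight, food)          # has dimension D(x)D(y)
cor_xy <- cor(weight, food)          # dimensionless, between -1 and 1

# Rescaling x by 1000 scales the covariance by 1000 ...
cov_scaled <- cov(weight * 1000, food)
# ... but leaves the correlation unchanged.
cor_scaled <- cor(weight * 1000, food)

# The correlation is the covariance divided by both standard deviations.
all.equal(cor_xy, cov_xy / (sd(weight) * sd(food)))
```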
attach(obesity)
plot(Weight,Food_consumption)
cor(Weight,Food_consumption)
cor(obesity)
cor.test(Weight,Food_consumption)
Weight,Food_consumption
84,32
93,33
81,33
61,24
95,39
86,32
90,34
78,28
85,33
72,27
65,26
75,29
Note, the correlation coefficient here
> pairs(obesity)
> fit <- lm(Food_consumption~Weight)
> fit
> summary(fit)
> plot(Weight,Food_consumption,pch=16)
> abline(lm(Food_consumption~Weight),col='red')
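A self-contained variant of the commands above, as a sketch: the obesity data is entered inline and passed to lm() through the data argument, which avoids attach(). It also checks two standard relations for a simple linear regression: R² equals the squared correlation coefficient, and the slope equals r times sd(y)/sd(x).

```r
# The obesity data set from these slides, entered inline.
obesity <- data.frame(
  Weight           = c(84, 93, 81, 61, 95, 86, 90, 78, 85, 72, 65, 75),
  Food_consumption = c(32, 33, 33, 24, 39, 32, 34, 28, 33, 27, 26, 29)
)

# Least-squares fit of food consumption as a linear function of weight.
fit <- lm(Food_consumption ~ Weight, data = obesity)
coef(fit)                                  # intercept and slope

r         <- cor(obesity$Weight, obesity$Food_consumption)
r_squared <- summary(fit)$r.squared
slope     <- unname(coef(fit)["Weight"])

all.equal(r^2, r_squared)                  # R^2 is the squared correlation
all.equal(slope, r * sd(obesity$Food_consumption) / sd(obesity$Weight))
```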