

slide-1
SLIDE 1

Swiss Institute of Bioinformatics

Introduction to statistics

Frédéric Schütz

Frederic.Schutz@isb-sib.ch

19 January 2009 EMBnet course

http://bcf.isb-sib.ch/Services.html

slide-2
SLIDE 2

Other courses

Microarrays, Lausanne, 30 March – 1 April (3 days)

– Lower-level analysis
– Normalization
– Finding differentially expressed genes
– Introduction to GSEA and other group-level analysis methods
– Classification

Advanced statistics course: planned!
Other courses: depending on your needs!

http://www.lematin.ch/fr/actu/suisse/va-t-on-bientot-tuer-le-steak_9-220826

Elly Tzogalis et Michel Jeanneret (15 août 2008), Le Matin

Consumption of meat per person per year in Switzerland (in kg)

slide-3
SLIDE 3

What is statistics good for?

  • Descriptive statistics: summarizing datasets by a few numbers
  • Exploratory data analysis and visualization: finding patterns and constructing hypotheses
  • Significance testing: do the data support the existence of a significant trend, or is it just noise?
  • Clustering: finding patterns in the noise
  • Regression: can you explain the behaviour of a variable as a function of the others?
  • Classification: putting objects into the right drawers

Not a complete list!

slide-4
SLIDE 4

Exploratory data analysis

Also called descriptive statistics: the process of looking at the data prior to formal analysis. The data is examined in two ways:

– Numerical summaries of data (mean, standard deviation, five-number summary, etc.)
– Graphical summaries: viewing your data in graphs to detect errors, unusual values, trends and patterns.

Particularly relevant to large datasets.
Remember: summarising means losing some information!

– See “The Median Isn't the Message” by Stephen Jay Gould http://www.edwardtufte.com/tufte/gould

Measures of location: mean

  • “Arithmetic mean”
  • Sum of the values divided by the number of values
  • All observations treated equally
  • Suitable for symmetrical distributions
  • Sensitive to presence of outliers (“unusual values”)
  • Trimmed mean:

– “Olympic scoring”
– Remove extreme values (e.g. 10%) on each side before calculating the mean

  • In R:

> mean(data)
> mean(data, trim=0.1)

slide-5
SLIDE 5

Mean: (lack of) robustness

Mean

Trimmed mean

Trimmed mean (0.3)
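The effect of an outlier on the mean and on the trimmed mean can be checked directly in R. The following is a minimal sketch; the vector `x` is made up for illustration and is not part of the course data.

```r
# Hypothetical measurements; the last value is an outlier
x <- c(2.1, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1, 3.3, 3.4, 25.0)

plain_mean   <- mean(x)              # every value counts equally
trimmed_mean <- mean(x, trim = 0.1)  # drops the lowest and highest 10% first

print(c(mean = plain_mean, trimmed = trimmed_mean))
```

With 10 values and `trim = 0.1`, one value is removed on each side, so the outlier no longer influences the result.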

slide-6
SLIDE 6

Side note: removing data

In the past, data was removed if it “looked” incorrect

– Gregor Mendel’s peas (results too good to be true)
– Albert Michelson’s data on the speed of light
– Johannes Kepler on planet orbits

Outliers (unusual observations, far away from the rest of the data) do occur naturally.
Data points can be removed (e.g. trimmed mean)

– if the decision is made before looking at the data; or
– if the discrepancies can be explained.

Otherwise, this is akin to data snooping.
There are statistical methods (called robust methods) which can handle outliers.

slide-7
SLIDE 7

In R:
> library(MASS)
> data(phones)
> ?phones
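As an illustration of what a robust method does with the `phones` data, the sketch below compares an ordinary least-squares fit with a robust fit. The choice of `rlm()` (M-estimation, from the same MASS package) is an assumption made for this example; the slides do not say which robust method was shown.

```r
library(MASS)   # ships with standard R installations
data(phones)    # a list with components 'year' and 'calls'

# Ordinary least squares: strongly influenced by the unusual years
fit_ls <- lm(calls ~ year, data = phones)

# Robust regression; maxit is raised so the iterations have room to converge
fit_robust <- rlm(calls ~ year, data = phones, maxit = 50)

print(rbind(least_squares = coef(fit_ls), robust = coef(fit_robust)))
```

The robust slope is much smaller than the least-squares slope, because the outlying years are down-weighted instead of being removed by hand.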

Measures of location: Median

[Figure: the median splits the data, with 50% of the data on each side]

In R:
> median(data)

  • More appropriate for skewed distributions
  • Mean = Median if the distribution is symmetrical
  • Not sensitive to the presence of outliers, since it “ignores” almost all the values

slide-8
SLIDE 8

Quartiles and percentiles

[Figure: the 1st and 3rd quartiles split the data into parts of 25%, 50% and 25%; x% of the data lies below the xth percentile]

In R:
> quantile(data, 0.25)
> quantile(data, 0.5)     # Same as median(data)
> quantile(data, x/100)   # xth percentile (probabilities are given between 0 and 1)
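A short sketch of these calls on made-up data, which also shows how the median resists an outlier while the mean does not. The vectors `x` and `x_out` are hypothetical.

```r
x     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)     # hypothetical data
x_out <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)   # largest value replaced by an outlier

# The median is the same in both cases; the mean is not
medians <- c(median(x), median(x_out))
means   <- c(mean(x), mean(x_out))

# 1st quartile, median and 3rd quartile in one call
q <- quantile(x, c(0.25, 0.5, 0.75))

print(medians)
print(means)
print(q)
```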

Median: resistance to outliers

Median Mean

slide-9
SLIDE 9

Mode

For discrete data, the mode is the most common value in the data. For continuous-valued data, the mode is an infinitesimal concept: it is defined as the maximum of the density. There is no simple finite-sample estimator of the mode: all estimators depend on some sort of smoothing.
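Both cases can be sketched in R. The data below is made up, and using the maximum of a kernel density estimate is only one possible smoothing-based estimator; its result depends on the bandwidth.

```r
# Discrete data: the mode is the most frequent value
v <- c(1, 2, 2, 3, 3, 3, 4)
counts <- table(v)
discrete_mode <- names(counts)[which.max(counts)]

# Continuous data: location of the maximum of a kernel density estimate
set.seed(1)
y <- rnorm(200, mean = 10, sd = 1)   # simulated values
d <- density(y)                      # default bandwidth
continuous_mode <- d$x[which.max(d$y)]

print(discrete_mode)
print(continuous_mode)
```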

Mean=Median=Mode

Bimodal and multimodal data

Most often, we are not interested in “the” mode of the data. Of interest is whether the distribution has several prominent “peaks” (local maxima of the density), in which case it is bimodal or multimodal. Bimodality often indicates that the data is not homogeneous and is in fact made up of two sub-populations.

Most (if not all) of the numerical summaries that we discuss here will break down if the data is bimodal!

slide-10
SLIDE 10

Spread

Same mean Narrower spread Wider spread

Standard Deviation

Mean

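The standard deviation (SD, σ) of a variable is the square root of the average of the squared deviations from the mean. For a sample of n values, the sum of squares is divided by n − 1 rather than n; this is what sd() computes.
Used in conjunction with the mean.
Same unit as the data.

In R:
> sd(data)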

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
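The formula can be checked against `sd()` with a few lines of R. This is a sketch on a made-up vector `x`.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data
n <- length(x)

# Square root of the sum of squared deviations divided by n - 1
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

print(c(by_hand = sd_by_hand, builtin = sd(x)))
```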

slide-11
SLIDE 11

Interquartile range (IQR)

[Figure: the 1st and 3rd quartiles split the data into parts of 25%, 50% and 25%]

IQR = 3rd quartile − 1st quartile

Used in conjunction with the median.

In R:
> IQR(data)
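Just as the median is more resistant to outliers than the mean, the IQR is more resistant than the standard deviation. A minimal sketch on hypothetical data:

```r
x     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)     # hypothetical data
x_out <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)   # same data with one outlier

iqr_values <- c(IQR(x), IQR(x_out))   # unchanged by the outlier
sd_values  <- c(sd(x), sd(x_out))     # inflated by the outlier

print(iqr_values)
print(sd_values)
```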

Histograms

Histograms are an intuitive way to represent a large number of data points:

– The range of the data is divided into a number of intervals (“bins”), usually with the same width
– The number of observations which fall into each bin is counted and plotted as a bar
– Alternatively, a density scale can be used (the area of each bar represents the proportion of observations in each interval)

Helps visualize the distribution of values for a numerical variable.
Main complication: choice of bin width/number of bins.
Most statistical programs do a good job at choosing a reasonable bin width, but manual override is sometimes necessary.
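The steps above can be sketched with `hist()`. The data is simulated and the bin width of 0.5 is an arbitrary choice for illustration; `plot = FALSE` only returns the counts, and removing it draws the histogram.

```r
set.seed(42)
x <- rnorm(500, mean = 16, sd = 2)   # simulated values

# Default: R chooses the bins
h_default <- hist(x, plot = FALSE)

# Manual override: one bin per 0.5 units
my_breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_narrow  <- hist(x, breaks = my_breaks, plot = FALSE)

# On the density scale, the areas of the bars add up to 1
total_area <- sum(h_narrow$density * diff(h_narrow$breaks))
print(total_area)
```

To draw the histogram on the density scale rather than as counts, `hist(x, breaks = my_breaks, freq = FALSE)` can be used.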

slide-12
SLIDE 12

Area of this bar represents the proportion of observations between 16 and 17.

[Figure panels: R default parameters (here: 1 bin for 5 units); user choice (1 bin for 0.5 units); user choice (1 bin for 0.006 units)]

slide-13
SLIDE 13

Density

  • The density describes the theoretical probability distribution of a variable
  • Conceptually, it is obtained in the limit of infinitely many data points
  • When we estimate it from a finite set of data, we usually assume that the density is a smooth function
  • You can think of it as a “smoothed histogram” (but to actually compute it, there are much better methods!)
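In R, one common way to estimate a density from data is the kernel density estimator `density()`. The sketch below uses simulated data; the `adjust` argument is shown only to illustrate that the amount of smoothing is a choice.

```r
set.seed(42)
x <- rnorm(500, mean = 16, sd = 2)   # simulated values

d        <- density(x)               # Gaussian kernel, automatic bandwidth
d_smooth <- density(x, adjust = 2)   # twice the bandwidth: smoother curve

# A density integrates to 1; check with a simple Riemann sum on the grid
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)
```

Calling `hist(x, freq = FALSE)` followed by `lines(d)` superimposes the estimate on a histogram.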

Density for normal distribution and SD

[Figure: normal density, with the mean and the points 1 SD from the mean marked]

Area indicates the probability that a random observation will fall into this range.
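For a normal distribution, this area can be computed with the cumulative distribution function `pnorm()`. A small worked example:

```r
# Probability of falling within 1 SD and within 2 SD of the mean
p_within_1sd <- pnorm(1) - pnorm(-1)
p_within_2sd <- pnorm(2) - pnorm(-2)

print(round(c(one_sd = p_within_1sd, two_sd = p_within_2sd), 4))
```

About 68% of the observations are expected within 1 SD of the mean, and about 95% within 2 SD.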

slide-14
SLIDE 14

Fourth Annual BSA and IDC Global Software Piracy Study, May 2007

More information: http://en.wikipedia.org/wiki/Business_Software_Alliance, version as of 16:15, 18 February 2008

Estimating an illegal phenomenon (unauthorized copying of computer programs) is hard, and the methodology is strongly contested. The estimates probably carry a large uncertainty, which is not indicated, making comparisons between percentages very difficult. Calculating actual losses is even more contentious!

Representing data: some bad practices

Scientists seem to do better: a «random» sample

slide-15
SLIDE 15

Representing data: « bar+error » plot

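[Figure: bar plot of groups A and B with error bars; the difference is marked ** (p < 0.01); annotations: Mean, SD, Mean + SD; values shown: 1.5, 2.0, 2.4, 2.7]

Legend: mean of measurement for groups A (25 subjects) and B (18 subjects); error bars indicate the standard deviation in each group; two-sided two-sample t-test.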
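The numbers behind such a plot can be sketched in R as follows. The data is simulated (the means and SDs are invented; only the group sizes come from the legend), and `t.test()` uses the Welch variant by default, which may or may not be the variant used for the figure.

```r
set.seed(7)
group_a <- rnorm(25, mean = 2.0, sd = 0.5)   # simulated, 25 subjects
group_b <- rnorm(18, mean = 2.4, sd = 0.5)   # simulated, 18 subjects

# Bar heights and error bars: mean and SD of each group
group_summary <- rbind(A = c(mean = mean(group_a), sd = sd(group_a)),
                       B = c(mean = mean(group_b), sd = sd(group_b)))

# Two-sided two-sample t-test (var.equal = TRUE gives the classical version)
res <- t.test(group_a, group_b, alternative = "two.sided")

print(group_summary)
print(res$p.value)
```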

slide-16
SLIDE 16

Boxplot

[Figure: annotated boxplot. The box (“interquartile range”) contains 50% of the data; 25% of the data is below the box and 25% is above it. The line inside the box is the median (50% of the data above, 50% below). Whiskers extend from the box; outliers are drawn as individual points.]

  • Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range.
  • The whiskers extend to the last point which is not an outlier.
  • A boxplot is a graphical representation of the five-number summary: minimum, first quartile, median, third quartile, maximum.
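The quantities drawn in a boxplot can be inspected without plotting, using `boxplot.stats()`. The sketch below uses made-up data; note that R draws the box at the “hinges”, which are very close to the quartiles.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical data, 30 is unusual

b <- boxplot.stats(x)   # outlier rule: 1.5 times the box length (coef = 1.5)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(b$stats)

# Points beyond the whiskers, drawn individually
print(b$out)
```

Here the box runs from 3 to 8, so any point above 8 + 1.5 × 5 = 15.5 is an outlier; the upper whisker stops at 9, the last point which is not an outlier.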

Boxplots: example

From Moritz et al., Anal. Chem. 2004 Aug 15; 76(16):4811-24

If there are only a few data points in the boxplot, it can be “degenerate” (i.e. not all features are present).

slide-17
SLIDE 17

Boxplot: a different example

With this definition, almost all datasets will produce outliers (20% of all points are “outliers”). In this case, the plots are made of several thousands of data points; a boxplot with outliers would not be very relevant because there would be too many of them.

Comparisons of some graphs

In the next 4 slides, we are going to compare different methods for graphing univariate data.
Four methods are shown in each case:

– Individual data points on the x-axis; some random displacement (jitter) is added on the y-axis to avoid superimposition of too many points
– Histogram with density superimposed
– Mean +/- standard deviation
– Boxplot
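The four displays listed above can be sketched with base R graphics. The helper name `compare_plots` and its layout are assumptions made for this example; it is not the code used to produce the slides.

```r
# Draw the four displays for a numeric vector x in a 2 x 2 layout
compare_plots <- function(x) {
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))

  # 1. Individual points on the x-axis, jittered on the y-axis
  stripchart(x, method = "jitter", pch = 1, main = "Points with jitter")

  # 2. Histogram on the density scale with a density estimate on top
  hist(x, freq = FALSE, main = "Histogram and density")
  lines(density(x))

  # 3. Mean +/- standard deviation
  m <- mean(x)
  s <- sd(x)
  plot(m, 1, xlim = range(c(x, m - s, m + s)), yaxt = "n",
       xlab = "x", ylab = "", pch = 19, main = "Mean +/- SD")
  arrows(m - s, 1, m + s, 1, angle = 90, code = 3, length = 0.1)

  # 4. Boxplot
  boxplot(x, horizontal = TRUE, main = "Boxplot")

  invisible(c(mean = m, sd = s))
}
```

For example, `compare_plots(rnorm(500))` draws the four panels for 500 simulated points.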

Other examples are given in the exercises.

slide-18
SLIDE 18

Dataset 1 (500 points)

[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

Dataset 2 (37 points)

[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

slide-19
SLIDE 19

Dataset 3 (100 points)

[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

(courtesy Nadine Zangger)

Dataset 4 (4 points)

[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

slide-20
SLIDE 20

Among the many other possible graphs: pie chart

Relative protein abundances in human plasma

Albumin                           54.31%
Immunoglobulin G                  16.61%
a1-Antitrypsin                     3.83%
a2-Macroglobulin                   3.64%
Immunoglobulin A                   3.45%
Transferrin                        3.32%
Haptoglobin                        2.94%
Immunoglobulin M                   1.98%
Other                              9.91%

Breakdown of “Other” (10%):

a1-Acid glycoprotein               1.25%
Complement C3                      1.12%
Hemopexin                          1.05%
a2HS-Glycoprotein                  0.80%
a1-Antichymotrypsin                0.58%
a-trypsin inhibitor                0.58%
Gc-Globulin                        0.48%
Ceruloplasmin                      0.48%
Complement C4                      0.45%
Fibronectin                        0.42%
Prealbumin (thyroxine-binding)     0.32%
C1 Esterase inhibitor              0.32%
a1B-Glycoprotein                   0.29%
b2-Glycoprotein I                  0.29%
b2-Glycoprotein II                 0.27%
Complement C1                      0.22%
Remaining                          1%

Pie charts should be avoided !

Pie charts are used a lot in business and in the mass media, but (fortunately) rarely in science.
It is difficult to compare different sections of a pie chart, and even more difficult to compare the sections of two pie charts.
Bar charts or dot charts are almost always better.
Main exception: when your data is split at 25%, 50% or 75% (humans are good at judging right angles).
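As a sketch of the alternative, the main slices of the plasma protein pie chart can be shown as a dot chart. The values are copied from the chart; sorting them before plotting is a choice made for this example.

```r
# Relative abundances (%) of the main slices of the pie chart
abundance <- c("Albumin"          = 54.31,
               "Immunoglobulin G" = 16.61,
               "a1-Antitrypsin"   = 3.83,
               "a2-Macroglobulin" = 3.64,
               "Immunoglobulin A" = 3.45,
               "Transferrin"      = 3.32,
               "Haptoglobin"      = 2.94,
               "Immunoglobulin M" = 1.98,
               "Other"            = 9.91)

# Sorting makes neighbouring values easy to compare
abundance_sorted <- sort(abundance)

dotchart(abundance_sorted, xlab = "Relative abundance (%)")
```

Small differences, such as Immunoglobulin A against Transferrin, are read off a common scale instead of being judged from angles.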

William Playfair, “Statistical Breviary”, 1801.

slide-21
SLIDE 21

See http://en.wikipedia.org/wiki/Pie_chart

(revision as of 17 February 2008, 20:34)

Worse than a pie chart: a “3D pie chart”…

PME Magazine, April 2006

slide-22
SLIDE 22

Bivariate data

Summarising univariate data is important. Most of the time, however, interesting information comes from looking at two variables simultaneously and comparing them, or finding a relation between them. Scatterplots are the easiest way to display bivariate data.

Scatterplots

slide-23
SLIDE 23

Reproducibility of duplicate measurements. Dotted line represents identity line (x = y). Dashed lines represent the cut-off between unmethylated and methylated samples. Pearson correlation 0.996, Spearman correlation 0.93, N=94.
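The two correlation coefficients quoted in the caption measure different things: Pearson measures linear association, Spearman measures agreement of the ranks. They can differ in either direction. A minimal sketch on made-up data, where the relation is perfectly monotone but not linear:

```r
x <- 1:10
y <- x^3   # hypothetical measurements: monotone in x, but not linear

pearson  <- cor(x, y, method = "pearson")    # below 1: not a straight line
spearman <- cor(x, y, method = "spearman")   # equal to 1: ranks agree exactly

print(c(pearson = pearson, spearman = spearman))
```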

(courtesy Eugenia Migliavacca)

Example: univariate data

slide-24
SLIDE 24

Same data on a scatterplot

PME Magazine

slide-25
SLIDE 25

PME Magazine

                          1999   2000   2001   2002   2003   2004   2005
Price (CHF)               1528   1742   1949   1981   1988   2002   2148
  Increase since 1999       0%    14%    28%    30%    30%    31%    41%
Dimension h                 29     34     38     39     41     41     46
  Increase since 1999       0%    17%    31%    34%    41%    41%    59%
Dimension v                 32     36   40.5     41     41     42     45
  Increase since 1999       0%    13%    26%    28%    28%    31%    41%
Ratio h/v                 0.91   0.94   0.94   0.95   1.00   0.98   1.02
Area (based on v)          804   1017   1288   1320   1320   1385   1590
  Increase since 1999       0%   267%    60%    64%    64%    72%    97%

Electronic watches: 150, 206 (increase since 1999: 0%, 37%)

slide-26
SLIDE 26

Three dimensional plots

Should only be used when the data to show actually requires 3 dimensions; otherwise the 3rd dimension is only chartjunk.

Alternative: image plot

slide-27
SLIDE 27

Chartjunk

“The interior decoration of graphics generates a lot of ink that does not
tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.”

Edward Tufte, “The Visual Display of Quantitative Information”, p. 107

Ink in plots should be used parsimoniously, to display the data, not to distract from it!

Edward Tufte, “Beautiful Evidence”, p. 175.

http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000Jr

slide-28
SLIDE 28

Use of Excel for statistics?

  • Extensively tested

– Known bugs
– Limited functions
– Not particularly easy for slightly advanced functions

  • http://gcrc.ucsd.edu/biostatistics/Excel.pdf
  • http://www.practicalstats.com/Pages/excelstats.html
  • http://www.agresearch.co.nz/Science/Statistics/exceluse1.htm

Errors

– Non-robust procedures (this particular example works in newer versions)
– Linear regressions can fail if the data is close to being collinear
– Excel’s random numbers are not as random as they should be
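One classic way in which a numerically non-robust procedure fails is the one-pass “calculator” formula for the variance, which subtracts two huge, nearly equal numbers. The sketch below reproduces this effect in R on made-up data; whether this is the example shown on the slide is an assumption.

```r
x <- 1e9 + c(1, 2, 3)   # large values with a tiny spread; the true variance is 1
n <- length(x)

# One-pass formula: sum of squares minus n times the squared mean
var_one_pass <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Two-pass formula: subtract the mean first, then square
var_two_pass <- sum((x - mean(x))^2) / (n - 1)

print(c(one_pass = var_one_pass, two_pass = var_two_pass, builtin = var(x)))
```

The two-pass result and `var()` give 1, while the one-pass result is wrong because the squares exceed the precision of double-precision numbers.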

slide-29
SLIDE 29

Annoyances

Many standard methods are not available in Excel

– Boxplots, non-parametric statistical tests, etc.

Other methods only offer basic options (linear regression).
Several Excel procedures are misleading (confidence intervals).
Different analyses require the user to reorganize the data.
Many types of graphics violate standards of good graphics.

– perspective, glitz

[Figure: Excel charts of series A, B and C; labels 1, 2, 3, 8, 12, 14, 15, 16, 17 on one axis and ticks from 5 to 40 on the other]

slide-30
SLIDE 30

[Figure: two Excel charts of series A, B and C; labels 1, 2, 3, 8, 12, 14, 15, 16, 17 on one axis and ticks from 5 to 40 on the other]

slide-31
SLIDE 31

[Figure: two Excel charts of series A, B and C; ticks from 5 to 40 on one axis and labels 1, 2, 3, 8, 12, 14, 15, 16, 17 on the other]

slide-32
SLIDE 32

[Figure: chart of series A, B and C; axis ticks from 5 to 40 and from 2 to 18]

Solutions?

  • Probably ok for simple calculations (“accounting”, basic summary statistics, simple regression)
  • Little chance of running into trouble with more complicated problems (but you may not notice if it happens…)
  • Missing functions: add-ons (e.g. StatPlus)
  • Excel-like environments

– Minitab (http://www.minitab.com/)
– KaleidaGraph (http://www.synergy.com/)

  • S-Plus, R, SAS, other statistical packages

slide-33
SLIDE 33

R

“S is a programming language and environment for all kinds of computing involving data. It has a simple goal: to turn ideas into software, quickly and faithfully.”

John M. Chambers, Bell Laboratories (major contributor and developer of the S language)

  • S-Plus: commercial implementation of the S language
  • R: free software implementation of the S language (http://www.r-project.org)
  • Developed by R. Gentleman and R. Ihaka (U of Auckland, NZ) during the 1990s
  • Advanced statistical computing system, freely available for most computing platforms.
  • Updated versions available every 3-4 months

6 January 2009 “R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use.”

slide-34
SLIDE 34

Pros and cons

Pros:

  • Powerful, state-of-the-art
  • Used by professional statisticians
  • Lots of documentation
  • Learn by example
  • Easy to extend

– Modify and improve
– Create add-on packages
– Many already available

  • Freely available
  • Unix, Windows & Mac

Cons:

  • Not very easy to learn (many details)
  • Easy to forget
  • Learn by example
  • Documentation sometimes cryptic
  • Not very (easily) interactive
  • Command-based
  • Memory intensive
  • Still evolving: backward-compatibility has been an issue
  • Slow at times

If you “just want to do statistical analysis”: easy to find alternatives.
If you intend to do microarray data analysis: probably one of the best options.

Bioconductor

  • http://www.bioconductor.org
  • Started in 2001 by Robert Gentleman (Dana Farber Cancer Institute)
  • Open source and open development project for the analysis of biomedical and genomic data.
  • About 25 core developers, at various institutions
  • Provides many additional R packages for statistical data analysis in biosciences:

– SNP
– SAGE
– cDNA microarrays
– Affymetrix chips
– etc.

  • Packages are often created by the inventor of the method of analysis.
slide-35
SLIDE 35

Practical Works

We will use OpenOffice Calc/Excel and R.

OpenOffice Calc/Excel

– Widely available
– Very useful for storing and managing data
– Makes it easy to perform many simple tasks
– Not recommended for statistical or graphical tasks.

R

– State-of-the-art for many types of statistical analysis
– Currently the reference for many bioinformatics analyses
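A typical way to combine the two tools is to keep the data in a spreadsheet, export it as a CSV file and analyse it in R. The sketch below assumes a hypothetical table with columns `group` and `value`; the text is read from a string here so that the example is self-contained.

```r
# Hypothetical content of a CSV file exported from a spreadsheet
csv_text <- "group,value
A,1.5
A,2.0
B,2.4
B,2.7"

measurements <- read.csv(text = csv_text)

# Mean of 'value' within each group
group_means <- tapply(measurements$value, measurements$group, mean)

print(measurements)
print(group_means)
```

With a real file, `read.csv("measurements.csv")` would be used instead, where the file name is again only a placeholder.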