Introduction to statistics
Frédéric Schütz
Frederic.Schutz@isb-sib.ch
19 January 2009
EMBnet course
Swiss Institute of Bioinformatics
http://bcf.isb-sib.ch/Services.html

Other courses
• Microarrays, Lausanne, 30 March – 1 April (3 days)
  – Lower-level analysis
  – Normalization
  – Finding differentially-expressed genes
  – Introduction to GSEA and other group-level analysis methods
  – Classification
• Advanced statistics course: planned!
• Other courses: depending on your needs!

[Figure: Consumption of meat per person per year in Switzerland (in kg). Source: Elly Tzogalis and Michel Jeanneret (15 August 2008), Le Matin, http://www.lematin.ch/fr/actu/suisse/va-t-on-bientot-tuer-le-steak_9-220826]

What is statistics good for?
• Descriptive statistics: summarizing datasets by a few numbers
• Exploratory data analysis and visualization: finding patterns and constructing hypotheses
• Significance testing: do the data support the existence of a significant trend or is it just noise?
• Clustering: finding patterns in the noise
• Regression: can you explain the behaviour of a variable as a function of the others?
• Classification: putting objects into the right drawers
• Not a complete list!

Exploratory data analysis
• Also called descriptive statistics.
• Process of looking at the data prior to formal analysis.
• Data examined in two ways:
  – Numerical summaries of data (mean, standard deviation, five-number summary, etc.)
  – Graphical summaries: viewing your data in graphs to detect errors, unusual values, trends and patterns.
• Particularly relevant to large datasets
• Remember: summarising means losing some information!
  – See “The Median Isn't the Message” by Stephen Jay Gould, http://www.edwardtufte.com/tufte/gould

Measures of location: mean
• “Arithmetic mean”
• Sum of the values divided by the number of values
• All observations treated equally
• Suitable for symmetrical distributions
• Sensitive to the presence of outliers (“unusual values”)
• Trimmed mean:
  – “Olympic scoring”
  – Remove extreme values (e.g. 10%) on each side before calculating the mean
• In R:
  > mean(data)
  > mean(data, trim=0.1)
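To see what trimming does in practice, here is a minimal sketch on a small made-up sample (the vector `x` is hypothetical and not from the course material). It compares the ordinary mean, the 10% trimmed mean, and the same trimmed mean computed by hand.

```r
# Hypothetical sample: nine ordinary values and one outlier (100).
x <- c(2, 3, 4, 5, 5, 6, 6, 7, 8, 100)

mean_all  <- mean(x)              # every value counts equally
mean_trim <- mean(x, trim = 0.1)  # drops 10% of the values at each end

# The same trimmed mean "by hand": sort, drop k values per side, average.
k <- floor(0.1 * length(x))
by_hand <- mean(sort(x)[(k + 1):(length(x) - k)])

print(c(mean = mean_all, trimmed = mean_trim, by_hand = by_hand))
```

The single outlier pulls the ordinary mean far above most of the values, while the trimmed mean stays close to the bulk of the data.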

Mean: (lack of) robustness
[Figure: positions of the mean, the trimmed mean and the trimmed mean (0.3)]

Side note: removing data
• In the past, data was removed if it “looked” incorrect
  – Gregor Mendel’s peas (results too good to be true)
  – Albert Michelson’s data on the speed of light
  – Johannes Kepler on planet orbits
• Outliers (unusual observations, far away from the rest of the data) do occur naturally.
• Data points can be removed (e.g. trimmed mean)
  – if the decision is made before looking at the data; or
  – if the discrepancies can be explained.
• Otherwise, this is akin to data snooping.
• There are statistical methods (called robust methods) which can handle outliers.

In R:
  > library(MASS)
  > data(phones)
  > ?phones

Measures of location: Median
[Figure: the median splits the distribution, with 50% of the data on each side]
• More appropriate for skewed distributions
• Mean = Median if the distribution is symmetrical
• Not sensitive to the presence of outliers since it “ignores” almost all the values
• In R:
  > median(data)
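The two claims about the median can be checked on small made-up vectors (all values below are hypothetical). This sketch shows that mean and median agree on a symmetrical sample, and that making the largest value far more extreme moves the mean but leaves the median unchanged.

```r
# Symmetrical sample: mean and median coincide.
sym <- c(1, 2, 3, 4, 5)
c(mean = mean(sym), median = median(sym))

# Skewed sample with one large value.
x <- c(1, 2, 2, 3, 3, 4, 5, 6, 60)
# Same sample, but the largest value is made 100 times more extreme.
y <- replace(x, which.max(x), 6000)

# The mean follows the extreme value; the median does not move.
print(c(mean_x = mean(x), mean_y = mean(y)))
print(c(median_x = median(x), median_y = median(y)))
```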

Quartiles and percentiles
[Figure: the 1st and 3rd quartiles split the data into 25% / 50% / 25%; x% of the data lies below the x-th percentile]
In R:
  > quantile(data, 0.25)
  > quantile(data, 0.5)    # Same as median(data)
  > quantile(data, x/100)  # x-th percentile; quantile() expects a proportion between 0 and 1

Median: resistance to outliers
[Figure: positions of the median and the mean]
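A short sketch of quartiles and percentiles on a hypothetical vector (the integers 1 to 101, chosen so that the results are easy to verify by eye). Note that `quantile()` takes proportions between 0 and 1, so the 90th percentile is requested as 0.9.

```r
x <- 1:101  # hypothetical data: 1, 2, ..., 101

q1  <- quantile(x, 0.25)  # 1st quartile
q2  <- quantile(x, 0.5)   # 2nd quartile, same as median(x)
q3  <- quantile(x, 0.75)  # 3rd quartile
p90 <- quantile(x, 0.9)   # 90th percentile

print(c(q1, q2, q3, p90))

# Several quantiles can also be requested in one call.
quantile(x, c(0.25, 0.5, 0.75))
```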

Mode
• For discrete data, the mode is the most common value in the data.
• For continuous-valued data, the mode is an infinitesimal concept: it is defined as the maximum of the density.
• There is no simple finite-sample estimator of the mode; all depend on some sort of smoothing.
[Figure: symmetrical distribution where Mean = Median = Mode]

Bimodal and multimodal data
• Most often, we are not interested in “the” mode of the data
• Of interest is whether the distribution has several prominent “peaks” (local maxima of the density), in which case it is bimodal or multimodal.
• Bimodality often indicates that the data is not homogeneous and is in fact made of two sub-populations.
Most (if not all) of the numerical summaries that we discuss here will break down if the data is bimodal!
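Base R has no function that returns the statistical mode (`mode()` reports the storage type of an object instead). The sketch below shows one possible approach under stated assumptions: counting values with `table()` for discrete data, and taking the location of the maximum of a kernel density estimate for continuous data. The data are made up, and the continuous estimate depends on the smoothing bandwidth chosen by `density()`.

```r
# Discrete data: the mode is the most frequent value.
votes <- c(3, 1, 2, 3, 3, 2, 5)
counts <- table(votes)
discrete_mode <- as.numeric(names(counts)[which.max(counts)])
print(discrete_mode)

# Continuous data: locate the maximum of a smoothed density estimate.
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 2)  # simulated sample, true mode at 10
d <- density(x)                      # kernel density estimate (default bandwidth)
continuous_mode <- d$x[which.max(d$y)]
print(continuous_mode)
```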

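Spread
[Figure: two distributions with the same mean, one with a narrower spread and one with a wider spread]

Standard Deviation
[Figure: distribution with the mean marked]
• The standard deviation (SD, σ) of a variable is the square root of the average of squared deviations from the mean (the sum is divided by n − 1 rather than n).
• Used in conjunction with the mean.
• Same unit as the data
• In R:
  > sd(data)

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$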
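As a quick check of the definition, the sketch below computes the standard deviation step by step on a small hypothetical vector and compares it with `sd()`, which divides by n − 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
n <- length(x)

deviations <- x - mean(x)                        # distance of each value from the mean
sd_by_hand <- sqrt(sum(deviations^2) / (n - 1))  # note the n - 1 denominator

print(c(by_hand = sd_by_hand, builtin = sd(x)))
```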

Interquartile range (IQR)
IQR = 3rd quartile – 1st quartile
[Figure: the 1st and 3rd quartiles split the data into 25% / 50% / 25%]
• Used in conjunction with the median
• In R:
  > IQR(data)

Histograms
• Histograms are an intuitive way to represent a large number of data points:
  – The range of the data is converted into a number of intervals (“bins”), usually with the same width
  – The number of observations which fall into each bin is counted and plotted as a bar
  – Alternatively, a density scale can be used (the area of each bar represents the proportion of observations in each interval)
• Helps to visualize the distribution of values for a numerical variable
• Main complication: choice of bin width/number of bins
• Most statistical programs do a good job of choosing a reasonable bin width, but manual override is sometimes necessary.
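A minimal sketch of both ideas on simulated data (the sample and the bin width of 0.5 are assumptions chosen for illustration). It checks that `IQR()` is the difference between the 3rd and 1st quartiles, and builds histograms with the default bins and with a manual bin width, without drawing them, so that the counts can be inspected.

```r
set.seed(42)
x <- rnorm(200, mean = 16, sd = 2)  # simulated measurements

# IQR = 3rd quartile - 1st quartile
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))
print(c(by_hand = iqr_by_hand, builtin = IQR(x)))

# Histogram with R's default choice of bins (plot = FALSE: compute only).
h_default <- hist(x, plot = FALSE)

# Manual override: one bin per 0.5 units.
breaks_fine <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_fine <- hist(x, breaks = breaks_fine, plot = FALSE)

# On the density scale, the bar areas add up to 1.
total_area <- sum(h_fine$density * diff(h_fine$breaks))
print(total_area)
```

Calling `hist(x, breaks = breaks_fine, freq = FALSE)` without `plot = FALSE` draws the histogram on the density scale.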

[Figure: histograms with different bin widths. The area of a bar represents the proportion of observations in its interval (in the example, between 16 and 17). Panels: R default parameters (here: 1 bin for 5 units); user choice (1 bin for 0.5 units); user choice (1 bin for 0.006 units).]

Density
• The density describes the theoretical probability distribution of a variable
• Conceptually, it is obtained in the limit of infinitely many data points
• When we estimate it from a finite set of data, we usually assume that the density is a smooth function
• You can think of it as a “smoothed histogram” (but to actually compute it, there are much better methods!)

Density for normal distribution and SD
[Figure: normal density with the mean and the range within 1 SD of the mean marked. The area indicates the probability that a random observation will fall into this range.]
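The sketch below illustrates both slides under simple assumptions: a simulated standard normal sample, R's default kernel density estimator `density()` as one possible smoothing method, and `pnorm()` for the theoretical area within 1 SD of the mean.

```r
set.seed(7)
x <- rnorm(500, mean = 0, sd = 1)  # simulated sample

# Kernel density estimate: a smooth function estimated from finite data.
d <- density(x)

# The total area under the estimated density is (approximately) 1.
area <- sum(d$y) * diff(d$x[1:2])
print(area)

# Theoretical probability that a normal observation falls within
# 1 SD of the mean: area under the normal density between -1 and +1.
p_within_1sd <- pnorm(1) - pnorm(-1)
print(p_within_1sd)  # about 0.683

# Observed proportion in the simulated sample, for comparison.
print(mean(abs(x) <= 1))
```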

Representing data: some bad practices
Estimating an illegal phenomenon (unauthorized copying of computer programs) is hard, and the methodology is very contested. Estimates probably carry a large uncertainty, which is not indicated, making comparisons between percentages very difficult. Calculating actual losses is even more contentious!
Source: Fourth Annual BSA and IDC Global Software Piracy Study, May 2007
More information: http://en.wikipedia.org/wiki/Business_Software_Alliance, version as of 16:15, 18 February 2008

Scientists seem to do better: a «random» sample

Representing data: « bar+error » plot
[Figure: bars showing mean + SD for groups A and B; ** marks p<0.01]
Legend: mean of measurement for groups A (25 subjects) and B (18 subjects); error bars indicate the standard deviation in each group; two-sided two-sample t-test.
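The quantities behind such a plot can be sketched as follows. The data are simulated (the group means and SDs used below are assumptions, not the values behind the figure); only the group sizes, 25 and 18, come from the legend. The legend does not say whether equal variances were assumed; `var.equal = TRUE` gives the classical Student test, while R's default (`var.equal = FALSE`) is the Welch variant.

```r
set.seed(2009)
# Simulated measurements (assumed values, not the data behind the slide).
group_a <- rnorm(25, mean = 2.0, sd = 0.5)
group_b <- rnorm(18, mean = 2.4, sd = 0.5)

# Bar heights (means) and error bars (standard deviations).
print(c(mean_A = mean(group_a), sd_A = sd(group_a)))
print(c(mean_B = mean(group_b), sd_B = sd(group_b)))

# Two-sided two-sample t-test (classical version, equal variances assumed).
res <- t.test(group_a, group_b, alternative = "two.sided", var.equal = TRUE)
print(res$p.value)
```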

Boxplot
[Figure: annotated boxplot. 50% of the data is in the box (“interquartile range”); 25% of the data is below the box and 25% is above it; 50% of the data is above the median and 50% is below; the whiskers and the outliers are marked.]
• Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range.
• The whiskers extend to the last point which is not an outlier.
• A boxplot is a graphical representation of the five-number summary: Minimum, First quartile, Median, Third quartile, Maximum

Boxplots: example
If there are only a few datapoints in the boxplot, it can be “degenerate” (i.e. not all features are present).
From Moritz et al., Anal. Chem. 2004 Aug 15; 76(16):4811-24
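The outlier rule can be reproduced by hand. The sketch below uses a small hypothetical vector and R's `fivenum()` and `boxplot.stats()`. One caveat: R's boxplot uses "hinges" for the box edges, which can differ slightly from the quartiles returned by `quantile()`.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # hypothetical data with one unusual value

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
print(five)

# Outlier rule: more than 1.5 box lengths away from the box.
box_low  <- five[2]
box_high <- five[4]
box_len  <- box_high - box_low
is_outlier <- x < box_low - 1.5 * box_len | x > box_high + 1.5 * box_len
print(x[is_outlier])

# Whiskers: the most extreme points that are not outliers.
whiskers <- range(x[!is_outlier])
print(whiskers)

# boxplot.stats() computes the same quantities that boxplot() draws.
bs <- boxplot.stats(x)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points drawn individually as outliers
```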

Boxplot: a different example
With this definition, almost all datasets will produce outliers (20% of all points are “outliers”). In this case, the plots are made of several thousands of data points; a boxplot with outliers would not be very relevant because there would be too many of them.

Comparisons of some graphs
• In the next 4 slides, we are going to compare different methods for graphing univariate data
• Four methods are shown in each case:
  – Individual data points on the x-axis; some random displacement (jitter) is added on the y-axis to avoid superimposition of too many points
  – Histogram with density superimposed
  – Mean +/- standard deviation
  – Boxplot
• Other examples are given in the exercises.
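The four displays listed above can be sketched with base R graphics as follows. The function name `compare_plots` and the layout are assumptions made for illustration; this is not the code used to produce the slides.

```r
# Draw the four displays for a numeric vector, stacked in one column.
# Returns the mean and SD invisibly so they can be reused.
compare_plots <- function(x) {
  op <- par(mfrow = c(4, 1), mar = c(2, 4, 2, 1))
  on.exit(par(op))  # restore the previous graphical settings

  # 1. Individual points on the x-axis, with random jitter on the y-axis.
  stripchart(x, method = "jitter", pch = 1,
             main = "Individual points with jitter on y-axis")

  # 2. Histogram on the density scale, with a kernel density estimate on top.
  hist(x, freq = FALSE, main = "Histogram and density")
  lines(density(x))

  # 3. Mean +/- standard deviation.
  m <- mean(x)
  s <- sd(x)
  plot(m, 1, xlim = range(x), yaxt = "n", ylab = "", pch = 19,
       main = "Mean +/- SD")
  arrows(m - s, 1, m + s, 1, angle = 90, code = 3, length = 0.05)

  # 4. Horizontal boxplot, so that all four panels share the same direction.
  boxplot(x, horizontal = TRUE, main = "Boxplot")

  invisible(c(mean = m, sd = s))
}
```

For example, `set.seed(123); compare_plots(rnorm(500, mean = 10, sd = 2))` draws the four panels for a simulated sample of 500 points.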

Dataset 1 (500 points)
[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

Dataset 2 (37 points)
[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]

Dataset 3 (100 points)
[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot (courtesy Nadine Zangger)]

Dataset 4 (4 points)
[Figure panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot]
