introduction to statistics

  1. Introduction to statistics Frédéric Schütz Frederic.Schutz@isb-sib.ch 19 January 2009 EMBnet course Swiss Institute of Bioinformatics http://bcf.isb-sib.ch/Services.html

  2. Other courses
     • Microarrays, Lausanne, 30 March – 1 April (3 days)
       – Lower-level analysis
       – Normalization
       – Finding differentially-expressed genes
       – Introduction to GSEA and other group-level analysis methods
       – Classification
     • Advanced statistics course: planned!
     • Other courses: depending on your needs!

     Consumption of meat per person per year in Switzerland (in kg)
     [Figure] Source: Elly Tzogalis and Michel Jeanneret (15 August 2008), Le Matin, http://www.lematin.ch/fr/actu/suisse/va-t-on-bientot-tuer-le-steak_9-220826

  3. What is statistics good for?
     • Descriptive statistics: summarizing datasets by a few numbers
     • Exploratory data analysis and visualization: finding patterns and constructing hypotheses
     • Significance testing: do the data support the existence of a significant trend, or is it just noise?
     • Clustering: finding patterns in the noise
     • Regression: can you explain the behaviour of a variable as a function of the others?
     • Classification: putting objects into the right drawers
     • Not a complete list!

  4. Exploratory data analysis
     • Also called descriptive statistics.
     • The process of looking at the data prior to formal analysis.
     • Data are examined in two ways:
       – Numerical summaries of the data (mean, standard deviation, five-number summary, etc.)
       – Graphical summaries: viewing your data in graphs to detect errors, unusual values, trends and patterns.
     • Particularly relevant to large datasets.
     • Remember: summarising means losing some information!
       – See “The Median Isn't the Message” by Stephen Jay Gould, http://www.edwardtufte.com/tufte/gould

     Measures of location: mean
     • Also called the “arithmetic mean”
     • Sum of the values divided by the number of values
     • All observations are treated equally
     • Suitable for symmetrical distributions
     • Sensitive to the presence of outliers (“unusual values”)
     • Trimmed mean:
       – “Olympic scoring”
       – Remove a fraction of the extreme values (e.g. 10%) on each side before calculating the mean
     • In R:
       > mean(data)
       > mean(data, trim=0.1)
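As a small illustration of the two calls above, the sketch below compares the mean and a trimmed mean on a made-up vector containing one outlier. The data and variable names are hypothetical and not from the course material.

```r
# Hypothetical data: four typical values and one outlier.
x <- c(1, 2, 3, 4, 100)

# Ordinary mean: (1 + 2 + 3 + 4 + 100) / 5
plain_mean <- mean(x)

# Trimmed mean: with trim = 0.2 and 5 values, one value is dropped
# at each end, so the mean is taken over 2, 3, 4.
trimmed_mean <- mean(x, trim = 0.2)

print(plain_mean)
print(trimmed_mean)
```

The single outlier pulls the ordinary mean far from the bulk of the data, while the trimmed mean stays close to it.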

  5. Mean: (lack of) robustness
     [Figure: the mean compared with the trimmed mean and the trimmed mean (0.3).]

  6. Side note: removing data
     • In the past, data were removed if they “looked” incorrect:
       – Gregor Mendel’s peas (results too good to be true)
       – Albert Michelson’s data on the speed of light
       – Johannes Kepler on planet orbits
     • Outliers (unusual observations, far away from the rest of the data) do occur naturally.
     • Data points can be removed (e.g. trimmed mean)
       – if the decision is made before looking at the data; or
       – if the discrepancies can be explained.
     • Otherwise, this is akin to data snooping.
     • There are statistical methods (called robust methods) which can handle outliers.

  7. In R:
       > library(MASS)
       > data(phones)
       > ?phones

     Measures of location: median
     [Figure: the median splits the data, with 50% of the data on each side.]
     • More appropriate for skewed distributions
     • Mean = median if the distribution is symmetrical
     • Not sensitive to the presence of outliers, since it “ignores” almost all the values
     • In R:
       > median(data)
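To make the robustness of the median concrete, here is a minimal sketch using two made-up vectors that differ only in one mistyped value. The data are hypothetical.

```r
# Hypothetical measurements, and the same values with a typing error
# in the last entry (16 entered as 160).
clean <- c(10, 12, 13, 15, 16)
dirty <- c(10, 12, 13, 15, 160)

mean_clean   <- mean(clean)    # 66 / 5
mean_dirty   <- mean(dirty)    # 210 / 5
median_clean <- median(clean)  # middle value of the sorted data
median_dirty <- median(dirty)  # unchanged by the error

print(c(mean_clean = mean_clean, mean_dirty = mean_dirty))
print(c(median_clean = median_clean, median_dirty = median_dirty))
```

The mean moves a long way because of one bad value, while the median is identical in both cases.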

  8. Quartiles and percentiles
     [Figure: the 1st and 3rd quartiles split the data into 25%, 50% and 25%; x% of the data lies below the x-th percentile.]
     • In R:
       > quantile(data, 0.25)
       > quantile(data, 0.5)  # Same as median(data)
       > quantile(data, x)

     Median: resistance to outliers
     [Figure: median and mean compared.]
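The quantile() calls above can be tried on a small made-up vector. This is only a sketch: the data are hypothetical, and quantile() supports several interpolation rules (its default is type = 7), so other software may give slightly different quartiles on small samples.

```r
# Hypothetical data: the integers 1 to 9.
x <- 1:9

# Probabilities are given on the 0-1 scale, not as percentages.
q <- quantile(x, c(0.25, 0.5, 0.75))
print(q)

# The 50% quantile is the median.
same_as_median <- unname(quantile(x, 0.5)) == median(x)
print(same_as_median)
```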

  9. Mode
     • For discrete data, the mode is the most common value in the data.
     • For continuous-valued data, the mode is an infinitesimal concept: it is defined as the maximum of the density.
     • There is no simple finite-sample estimator of the mode; all of them depend on some sort of smoothing.
     [Figure: a distribution for which mean = median = mode.]

     Bimodal and multimodal data
     • Most often, we are not interested in “the” mode of the data.
     • Of interest is whether the distribution has several prominent “peaks” (local maxima of the density), in which case it is bimodal or multimodal.
     • Bimodality often indicates that the data are not homogeneous and are in fact made of two sub-populations.
     Most (if not all) of the numerical summaries that we discuss here will break down if the data are bimodal!
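The two notions of mode described above can be sketched in R as follows. The data are made up (the continuous sample is simulated), and taking the maximum of a kernel density estimate is just one possible smoothing-based estimator, not the method used in the course.

```r
# Discrete data: the mode is the most frequent value.
counts <- table(c(1, 2, 2, 3, 3, 3, 4))
discrete_mode <- as.numeric(names(counts)[which.max(counts)])
print(discrete_mode)

# Continuous data: estimate the density by smoothing, then take the
# location of its maximum. The sample is simulated from a standard
# normal distribution, whose true mode is 0.
set.seed(1)
y <- rnorm(5000)
d <- density(y)
continuous_mode <- d$x[which.max(d$y)]
print(continuous_mode)
```

The continuous estimate depends on the bandwidth chosen by density(), which is why no single "simple" estimator exists.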

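 10. Spread
     [Figure: two distributions with the same mean, one with a narrower spread and one with a wider spread.]

     Standard deviation
     [Figure: a distribution with its mean marked.]
     • The standard deviation (SD, σ) of a variable is the square root of the average of the squared deviations from the mean.
     • Used in conjunction with the mean.
     • Same unit as the data
     • In R:
       > sd(data)

     \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

     where \bar{x} is the mean of the n values x_1, ..., x_n.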
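As a check on the definition, the sketch below computes the standard deviation by hand and compares it with sd(), which divides the sum of squared deviations by n − 1. The data vector is made up for illustration.

```r
# Hypothetical data with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Square root of the sum of squared deviations from the mean,
# divided by n - 1 (the convention used by sd()).
manual_sd  <- sqrt(sum((x - mean(x))^2) / (n - 1))
builtin_sd <- sd(x)

print(c(manual = manual_sd, builtin = builtin_sd))
```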

 11. Interquartile range (IQR)
     IQR = 3rd quartile – 1st quartile
     [Figure: the 1st and 3rd quartiles split the data into 25%, 50% and 25%.]
     • Used in conjunction with the median
     • In R:
       > IQR(data)

     Histograms
     • Histograms are an intuitive way to represent a large number of data points:
       – The range of the data is divided into a number of intervals (“bins”), usually of the same width.
       – The number of observations which fall into each bin is counted and plotted as a bar.
       – Alternatively, a density scale can be used (the area of each bar represents the proportion of observations in each interval).
     • Helps to visualize the distribution of values for a numerical variable
     • Main complication: choice of bin width/number of bins
     • Most statistical programs do a good job of choosing a reasonable bin width, but manual override is sometimes necessary.

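 12. [Figure: histogram on a density scale. The area of the highlighted bar represents the proportion of observations between 16 and 17.]
     [Figure: histograms with different bin widths: R default parameters (here: 1 bin for 5 units); user choice (1 bin for 0.5 units); user choice (1 bin for 0.006 units).]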
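The effect of the bin width can be explored without drawing anything by asking hist() for the binning only. This is a sketch on simulated data; the sample, the seed and the 0.5-unit bin width are assumptions chosen for illustration.

```r
# Simulated measurements (hypothetical data).
set.seed(42)
x <- rnorm(200, mean = 16, sd = 2)

# Bins chosen by R's default rule; plot = FALSE returns the binning only.
h_default <- hist(x, plot = FALSE)

# Manual override: bins of width 0.5 covering the whole range of the data.
breaks_fine <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_fine <- hist(x, breaks = breaks_fine, plot = FALSE)

# Every observation falls into exactly one bin.
print(sum(h_fine$counts))

# On the density scale, the bar areas (height times width) sum to 1.
total_area <- sum(h_fine$density * diff(h_fine$breaks))
print(total_area)
```

Dropping plot = FALSE (and adding freq = FALSE for the density scale) draws the corresponding histogram.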

 13. Density
     • The density describes the theoretical probability distribution of a variable.
     • Conceptually, it is obtained in the limit of infinitely many data points.
     • When we estimate it from a finite set of data, we usually assume that the density is a smooth function.
     • You can think of it as a “smoothed histogram” (but to actually compute it, there are much better methods!)

     Density for normal distribution and SD
     [Figure: normal density with the mean and the points 1 SD from the mean marked. The shaded area indicates the probability that a random observation will fall into this range.]
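Both ideas on this slide can be sketched in R: the area under the normal density within 1 SD of the mean, and a smooth density estimate from a finite sample. The sample below is simulated, and density() (a kernel estimator) is used here as one common choice, not necessarily the method the course had in mind.

```r
# Probability that a normal observation falls within 1 SD of the mean:
# the area under the density between -1 and +1 (about 0.68).
p_within_1sd <- pnorm(1) - pnorm(-1)
print(p_within_1sd)

# Smooth density estimate from a simulated (hypothetical) sample.
set.seed(7)
x <- rnorm(1000, mean = 10, sd = 2)
d <- density(x)

# The estimate is returned on an equally spaced grid; a simple Riemann
# sum shows that the total area under the curve is close to 1.
step <- d$x[2] - d$x[1]
area <- sum(d$y) * step
print(area)
```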

 14. Representing data: some bad practices
     Estimating an illegal phenomenon (unauthorized copying of computer programs) is hard, and the methodology is strongly contested. The estimates probably carry a large uncertainty, which is not indicated, making comparisons between percentages very difficult. Calculations of actual losses are even more contentious!
     Source: Fourth Annual BSA and IDC Global Software Piracy Study, May 2007
     More information: http://en.wikipedia.org/wiki/Business_Software_Alliance, version as of 16:15, 18 February 2008

     Scientists seem to do better: a “random” sample

 15. Representing data: “bar + error” plot
     [Figure: bar plot of the mean for groups A and B, with error bars showing mean + SD; ** p<0.01.]
     Legend: mean of measurement for groups A (25 subjects) and B (18 subjects); error bars indicate the standard deviation in each group; two-sided two-sample t-test.
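The quantities named in the legend can be computed as in the sketch below. The two groups are simulated with the group sizes from the legend; the means, SD and seed are assumptions, so the numbers do not reproduce the figure. Note that t.test() performs Welch's version of the two-sample test by default; var.equal = TRUE gives the classical Student test, and the slide does not say which was used.

```r
# Simulated (hypothetical) measurements for the two groups.
set.seed(123)
a <- rnorm(25, mean = 2.0, sd = 0.5)  # group A, 25 subjects
b <- rnorm(18, mean = 2.4, sd = 0.5)  # group B, 18 subjects

# Height of each bar (mean) and length of each error bar (SD).
group_summary <- rbind(A = c(mean = mean(a), sd = sd(a)),
                       B = c(mean = mean(b), sd = sd(b)))
print(group_summary)

# Two-sided two-sample t-test comparing the group means.
res <- t.test(a, b, alternative = "two.sided")
print(res$p.value)
```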

 16. Boxplot
     [Figure: annotated boxplot. 50% of the data is in the box (the “interquartile range”); 25% of the data is below the box and 25% is above it; 50% of the data is above the median and 50% is below it; the whiskers and the outliers are marked.]
     • Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range.
     • The whiskers extend to the last point which is not an outlier.
     • A boxplot is a graphical representation of the five-number summary: minimum, first quartile, median, third quartile, maximum.

     Boxplots: example
     If there are only a few data points in the boxplot, it can be “degenerate” (i.e. not all features are present).
     [Figure from Moritz et al., Anal. Chem. 2004 Aug 15; 76(16):4811-24]
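The 1.5 × IQR rule for outliers and whiskers can be worked through by hand on a small made-up vector, as sketched below. The data are hypothetical. R's own boxplot() relies on boxplot.stats(), which uses hinges from fivenum() rather than quantile(), so its box edges can differ slightly from the hand computation on small samples.

```r
# Hypothetical data with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- IQR(x)                      # same as q3 - q1

# Points further than 1.5 * IQR from the box are flagged as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# The upper whisker ends at the last point that is not an outlier.
whisker_top <- max(x[x <= upper_fence])

print(outliers)
print(whisker_top)

# Five-number summary, and the values R would actually draw.
print(fivenum(x))
bs <- boxplot.stats(x)
print(bs$stats)
print(bs$out)
```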

 17. Boxplot: a different example
     [Figure: boxplots drawn with a different definition of the whiskers.]
     With this definition, almost all datasets will produce outliers (20% of all points are “outliers”). In this case, the plots are made of several thousand data points; a boxplot with outliers would not be very relevant because there would be too many of them.

     Comparisons of some graphs
     • In the next 4 slides, we are going to compare different methods for graphing univariate data.
     • Four methods are shown in each case:
       – Individual data points on the x-axis; some random displacement (jitter) is added on the y-axis to avoid superimposition of too many points
       – Histogram with density superimposed
       – Mean +/- standard deviation
       – Boxplot
     • Other examples are given in the exercises.
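One possible way to draw the four views with base R graphics is sketched below. The data are simulated, and the layout, titles and output file (a temporary PDF) are assumptions made for illustration; the course figures may have been produced differently.

```r
# Simulated (hypothetical) dataset.
set.seed(2009)
x <- rnorm(100, mean = 10, sd = 2)

# Draw the four panels into a temporary PDF file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
par(mfrow = c(2, 2))

# 1. Individual points on the x-axis, with random jitter on the y-axis.
stripchart(x, method = "jitter", pch = 1, main = "Points with jitter")

# 2. Histogram on the density scale, with a density estimate on top.
hist(x, freq = FALSE, main = "Histogram and density")
lines(density(x))

# 3. Mean +/- standard deviation, drawn as a point with error bars.
m <- mean(x)
s <- sd(x)
plot(1, m, ylim = range(x), xaxt = "n", xlab = "", ylab = "x",
     main = "Mean +/- SD")
arrows(1, m - s, 1, m + s, angle = 90, code = 3, length = 0.1)

# 4. Boxplot.
boxplot(x, main = "Boxplot")

invisible(dev.off())
```

In an interactive session, the pdf() and dev.off() calls can be left out to draw on the screen instead.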

 18. Dataset 1 (500 points)
     [Figure, four panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot.]

     Dataset 2 (37 points)
     [Figure, four panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot.]

 19. Dataset 3 (100 points)
     [Figure, four panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot. Courtesy Nadine Zangger.]

     Dataset 4 (4 points)
     [Figure, four panels: individual points with jitter on y-axis; histogram and density; mean +/- SD; boxplot.]
