
Variability in Data



  1. Variability in Data
     CS 239: Experimental Methodologies for System Software
     Peter Reiher
     April 10, 2007
     Lecture 3, CS 239, Spring 2007

     Introduction
     • Summarizing variability in a data set
     • Estimating variability in sample data

     Summarizing Variability
     • A single number rarely tells the entire story of a data set
     • Usually, you also need to know how much the rest of the data set varies from that index of central tendency

     Why Is Variability Important?
     • Consider two Web servers:
       – Server A services all requests in 1 second
       – Server B services 90% of all requests in 0.5 seconds, but the other 10% in 5.5 seconds
     • Both have mean service times of 1 second
     • But which would you prefer to use?

     Indices of Dispersion
     • Measures of how much a data set varies
       – Range
       – Variance and standard deviation
       – Percentiles
       – Semi-interquartile range
       – Mean absolute deviation

     Range
     • The minimum and maximum values in the data set
     • Can be kept track of as data values arrive
     • Variability is characterized by the difference between minimum and maximum
     • Often not useful, due to outliers
     • Minimum tends to go to zero
     • Maximum tends to increase over time
     • Not useful for unbounded variables
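The two-server comparison can be checked numerically. The following is a minimal Python sketch, not part of the original slides. It assumes a hypothetical sample of ten requests per server, with Server B's slow request taking 5.5 seconds (the value that makes its mean come out to 1 second). The `RunningRange` helper is invented for illustration and shows one way the range could be tracked as data values arrive.

```python
from statistics import mean


class RunningRange:
    """Tracks the minimum and maximum of a stream of values (hypothetical helper)."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def update(self, value):
        # Only the current extremes are stored, so memory use stays constant.
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    @property
    def range(self):
        return self.maximum - self.minimum


# Assumed samples: ten requests per server, service times in seconds.
server_a = [1.0] * 10
server_b = [0.5] * 9 + [5.5]

range_a, range_b = RunningRange(), RunningRange()
for t in server_a:
    range_a.update(t)
for t in server_b:
    range_b.update(t)

print("mean A:", mean(server_a), "range A:", range_a.range)
print("mean B:", mean(server_b), "range B:", range_b.range)
```

Both samples have the same mean, yet only Server B has a nonzero range, which is the point of the example: the mean alone hides the slow tail.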

  2. Example of Range
     • For data set:
       2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10
     • Maximum is 2056
     • Minimum is -17
     • Range is 2073
     • While arithmetic mean is 268

     Variance (and Its Cousins)
     • Sample variance is $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
     • Variance is expressed in units of the measured quantity squared
       – Which isn't always easy to understand
     • Standard deviation and the coefficient of variation are derived from variance

     Variance Example
     • For data set:
       2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10
     • Variance is 413746.6
     • Given a mean of 268, what does that variance indicate?

     Standard Deviation
     • The square root of the variance
     • In the same units as the metric
     • So easier to compare to the metric

     Standard Deviation Example
     • For data set:
       2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10
     • Standard deviation is 643
     • Given a mean of 268, clearly the standard deviation shows a lot of variability from the mean

     Coefficient of Variation
     • The ratio of the standard deviation to the mean
     • Normalizes the units of these quantities into a ratio or percentage
     • Often abbreviated C.O.V.
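The figures quoted for this data set can be reproduced with Python's standard library. This is a small sketch rather than anything from the original lecture; the variable names are chosen here for illustration. `statistics.variance` computes the sample variance, dividing by n - 1, which matches the formula on the slide.

```python
import math
from statistics import mean, variance

data = [2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10]

data_range = max(data) - min(data)   # difference between the extremes
x_bar = mean(data)                   # arithmetic mean
s2 = variance(data)                  # sample variance (divides by n - 1)
s = math.sqrt(s2)                    # standard deviation, same units as the data
cov = s / x_bar                      # coefficient of variation (unitless)

print("range:", data_range)
print("mean:", round(x_bar, 2))
print("variance:", round(s2, 1))
print("standard deviation:", round(s, 1))
print("C.O.V.:", round(cov, 2))
```

The single value 2056 contributes most of the squared deviation, which is why the standard deviation comes out more than twice the mean.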

  3. Coefficient of Variation Example
     • For data set:
       2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10
     • Standard deviation is 643
     • The mean is 268
     • So the C.O.V. is 643/268 = 2.4

     Percentiles
     • Specification of how observations fall into buckets
     • E.g., the 5-percentile is the observation that is at the lower 5% of the set
     • The 95-percentile is the observation at the 95% boundary of the set
     • Useful even for unbounded variables

     Relatives of Percentiles
     • Quantiles: fraction between 0 and 1
       – Instead of percentage
       – Also called fractiles
     • Deciles: percentiles at the 10% boundaries
       – First is 10-percentile, second is 20-percentile, etc.
     • Quartiles: divide data set into four parts
       – 25% of sample below first quartile, etc.
       – Second quartile is also the median

     Calculating Quantiles
     • The $\alpha$-quantile is estimated by sorting the set
     • Then take the $[(n-1)\alpha + 1]$th element
       – Rounding to the nearest integer index

     Quartile Example
     • For data set (10 observations):
       2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10
     • Sort it:
       -17, -10, -4.8, 2, 5.4, 27, 84.3, 92, 445, 2056
     • The first quartile $Q_1$ is -4.8
     • The third quartile $Q_3$ is 92

     Interquartile Range
     • Yet another measure of dispersion
     • The difference between $Q_3$ and $Q_1$
     • Semi-interquartile range: $\mathrm{SIQR} = \frac{Q_3 - Q_1}{2}$
     • Often an interesting measure of what's going on in the middle of the range
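The quantile rule above, sort and then take the [(n-1)α + 1]th element, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's own code: the function name `quantile` is invented here, and ties at exactly .5 are rounded up, a choice the slides do not specify.

```python
import math


def quantile(data, alpha):
    """Estimate the alpha-quantile (0 <= alpha <= 1) by the [(n-1)*alpha + 1]th-element rule."""
    ordered = sorted(data)
    n = len(ordered)
    position = (n - 1) * alpha + 1        # 1-based position in the sorted list
    index = math.floor(position + 0.5)    # nearest integer; .5 rounds up (assumption)
    return ordered[index - 1]             # convert to Python's 0-based indexing


data = [2, 5.4, -17, 2056, 445, -4.8, 84.3, 92, 27, -10]

q1 = quantile(data, 0.25)   # position 3.25 -> 3rd element
q3 = quantile(data, 0.75)   # position 7.75 -> 8th element
iqr = q3 - q1               # interquartile range
siqr = iqr / 2              # semi-interquartile range

print("Q1:", q1, "Q3:", q3, "IQR:", iqr, "SIQR:", siqr)
```

Library routines often interpolate between neighboring observations instead of rounding to one of them, so their quartiles for small samples may differ slightly from this rule.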

  4. Semi-Interquartile Range Example
     • For data set:
       -17, -10, -4.8, 2, 5.4, 27, 84.3, 92, 445, 2056
     • $Q_3$ is 92
     • $Q_1$ is -4.8
     • $\mathrm{SIQR} = \frac{Q_3 - Q_1}{2} = \frac{92 - (-4.8)}{2} = 48.4$, or about 48
     • So outliers cause much of the variability

     Mean Absolute Deviation
     • Another measure of variability
     • Mean absolute deviation $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
     • Doesn't require multiplication or square roots

     Mean Absolute Deviation Example
     • For data set:
       -17, -10, -4.8, 2, 5.4, 27, 84.3, 92, 445, 2056
     • Mean absolute deviation $= \frac{1}{10} \sum_{i=1}^{10} |x_i - \bar{x}|$
     • Or 393

     Sensitivity To Outliers
     • From most to least:
       – Range
       – Variance
       – Mean absolute deviation
       – Semi-interquartile range

     So, Which Index of Dispersion Should I Use?
     • Is the variable bounded?
       – Yes: use the range
       – No: is the distribution unimodal and symmetrical?
         – Yes: use the C.O.V.
         – No: use percentiles or the SIQR
     • But always remember what you're looking for

     Determining Distributions for Datasets
     • If a data set has a common distribution, that's the best way to summarize it
     • Saying a data set is uniformly distributed is more informative than just giving its mean and standard deviation
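As a rough illustration of the mean absolute deviation and of the decision chart for choosing an index of dispersion, here is a short Python sketch. Both function names are hypothetical, and `choose_index` simply encodes the two yes/no questions from the chart; it is not a substitute for looking at the data.

```python
def mean_absolute_deviation(data):
    """Average absolute distance of each observation from the arithmetic mean."""
    x_bar = sum(data) / len(data)
    return sum(abs(x - x_bar) for x in data) / len(data)


def choose_index(bounded, unimodal_symmetric):
    """Encode the decision chart: bounded? then unimodal and symmetrical?"""
    if bounded:
        return "range"
    if unimodal_symmetric:
        return "C.O.V."
    return "percentiles or SIQR"


data = [-17, -10, -4.8, 2, 5.4, 27, 84.3, 92, 445, 2056]
mad = mean_absolute_deviation(data)

print("mean absolute deviation:", round(mad, 1))
print("suggested index:", choose_index(bounded=False, unimodal_symmetric=False))
```

For this data set, which is neither bounded nor symmetric, the chart points to percentiles or the SIQR, consistent with the observation that the outlier 2056 dominates the other indices.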

  5. Some Commonly Used Distributions
     • Uniform distribution
     • Normal distribution
     • Exponential distribution
     • There are many others

     Uniform Distribution
     • All values in a given range are equally likely
     • Often normalized to a range from zero to one
     • Suggests randomness in the phenomenon being tested
       – PDF: $f(x) = \frac{1}{B - A}$ for a range from $A$ to $B$
       – CDF: $F(x) = x$, assuming $0 \le x \le 1$

     CDF for Uniform Distribution
     [Figure: plot of the uniform CDF; not reproduced]

     Normal Distribution
     • Some value of the random variable is most likely
       – Declining probabilities of values as one moves away from this value
       – Equally on either side of the most probable value
     • Extremely widely used
     • Generally sort of a "default distribution"
       – Which isn't always right . . .

     PDF and CDF for Normal Distribution
     • PDF expressed in terms of
       – Location parameter $\mu$ (the popular value)
       – Scale parameter $\sigma$ (how much spread)
       – PDF is $f(x) = \frac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma \sqrt{2\pi}}$
       – CDF doesn't exist in closed form

     PDF for Normal Distribution
     [Figure: plot of the normal PDF; not reproduced]
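The uniform and normal formulas can be written out directly in Python. This is a sketch under a few assumptions: the function names and default parameters are chosen here for illustration, the uniform functions generalize the normalized case to any range from `a` to `b`, and the normal CDF is evaluated through the error function `math.erf`, since it has no closed form in elementary functions.

```python
import math


def uniform_pdf(x, a=0.0, b=1.0):
    """Density 1 / (b - a) inside the range [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x, a=0.0, b=1.0):
    """Fraction of the range lying at or below x; equals x when a = 0 and b = 1."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with location mu and scale sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF, computed numerically via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


print(uniform_pdf(0.3), uniform_cdf(0.3))
print(normal_pdf(0.0), normal_cdf(0.0))
```

Evaluating `normal_cdf(mu + sigma) - normal_cdf(mu - sigma)` gives the familiar share of roughly two thirds of the probability within one scale parameter of the location.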

  6. Exponential Distribution
     • Describes a value that declines over time
       – E.g., failure probabilities
       – Described in terms of location parameter $\mu$
       – And scale parameter $\beta$
       – Standard exponential when $\mu = 0$ and $\beta = 1$
     • PDF: $f(x) = \frac{1}{\beta} e^{-(x-\mu)/\beta}$
       – $f(x) = e^{-x}$ for $\mu = 0$ and $\beta = 1$
     • CDF: $F(x) = 1 - e^{-x/\beta}$ for $\mu = 0$

     PDF of Exponential Distribution
     [Figure: plot of the exponential PDF; not reproduced]

     Methods of Determining a Distribution
     • So how do we determine if a data set matches a distribution?
       – Plot a histogram
       – Quantile-quantile plot
       – Statistical methods (not covered in this class)

     Plotting a Histogram
     • Suitable if you have a relatively large number of data points
       1. Determine range of observations
       2. Divide range into buckets
       3. Count number of observations in each bucket
       4. Divide by total number of observations and plot it as a column chart

     Problem With Histogram Approach
     • Determining cell size
       – If too small, too few observations per cell
       – If too large, no useful details in plot
     • If fewer than five observations in a cell, cell size is too small

     Quantile-Quantile Plots
     • More suitable for small data sets
     • Basically, guess a distribution
     • Plot where quantiles of the data theoretically should fall in that distribution
       – Against where they actually fall
     • If the plot is close to linear, the data closely matches that distribution
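The histogram steps and the quantile-quantile idea can both be sketched without a plotting library. The Python below is a minimal illustration, not the lecture's method: all function names are hypothetical, the buckets are equal-width, the observations are assumed not to be all identical, and the plotting positions (i - 0.5)/n used for the Q-Q points are one common convention that the slides do not specify. The exponential functions use the general form with location `mu` and scale `beta`.

```python
import math


def exponential_pdf(x, mu=0.0, beta=1.0):
    """Exponential density; zero below the location parameter."""
    return math.exp(-(x - mu) / beta) / beta if x >= mu else 0.0


def exponential_cdf(x, mu=0.0, beta=1.0):
    """Exponential CDF: 1 - exp(-(x - mu) / beta) for x >= mu."""
    return 1.0 - math.exp(-(x - mu) / beta) if x >= mu else 0.0


def exponential_quantile(q, mu=0.0, beta=1.0):
    """Inverse of the CDF for 0 <= q < 1."""
    return mu - beta * math.log(1.0 - q)


def histogram(data, num_buckets):
    """Return (fractions, counts) for equal-width buckets over the observed range."""
    lo, hi = min(data), max(data)            # step 1: range of observations
    width = (hi - lo) / num_buckets          # step 2: divide range into buckets
    counts = [0] * num_buckets
    for x in data:                           # step 3: count per bucket
        idx = min(int((x - lo) / width), num_buckets - 1)  # maximum goes in last bucket
        counts[idx] += 1
    fractions = [c / len(data) for c in counts]  # step 4: normalize by total
    return fractions, counts


def qq_points(data, quantile_fn):
    """Pair each sorted observation with the theoretical quantile at (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(quantile_fn((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]


observations = list(range(20))
fractions4, counts4 = histogram(observations, 4)
fractions10, counts10 = histogram(observations, 10)
print("4 buckets:", counts4, "cell too small:", any(c < 5 for c in counts4))
print("10 buckets:", counts10, "cell too small:", any(c < 5 for c in counts10))

# Data that sit exactly on exponential quantiles give a perfectly linear Q-Q plot.
ideal = [exponential_quantile((i - 0.5) / 50) for i in range(1, 51)]
points = qq_points(ideal, exponential_quantile)
```

With real measurements the Q-Q points will scatter around a line rather than sit on it; how close is close enough to linear is a judgment call, which is why statistical tests exist as a complement.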
