Outline Boxplots: Definition, Strengths & Weaknesses Letter - PowerPoint PPT Presentation




SLIDE 1

Letter Value Boxplot

Heike Hofmann, Karen Kafadar, Hadley Wickham IOWA STATE UNIVERSITY

Outline

  • Boxplots: Definition, Strengths & Weaknesses
  • Letter Value Statistics
  • Letter Value Boxplots

  • Examples
  • Conclusion

Boxplots

  • Early version: Tukey 1972 (Snedecor Festschrift, at Iowa State University)
  • Most common version in EDA (1977):
  • Median (center line), fourths (box edges), adjacent values (ends of whiskers) and extreme values

  • All marks correspond to actual data values
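The marks listed above can be computed directly from the sorted data. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, the fences at 1.5 times the fourth-spread follow the usual Tukey convention (the slide does not state the factor), and averaging the two neighbouring order statistics at half-integer depths is an assumption.

```python
import math

def depth_value(sorted_x, depth):
    """Value at a (possibly half-integer) depth, counted from the low end, 1-indexed."""
    lo = sorted_x[math.floor(depth) - 1]
    hi = sorted_x[math.ceil(depth) - 1]
    return (lo + hi) / 2

def boxplot_summary(data, fence_factor=1.5):
    """Median, fourths, adjacent values and outside values of a sample.

    fence_factor=1.5 is the conventional choice and an assumption here.
    """
    x = sorted(data)
    n = len(x)
    d_median = (1 + n) / 2
    d_fourth = (1 + math.floor(d_median)) / 2

    median = depth_value(x, d_median)
    lower_fourth = depth_value(x, d_fourth)
    # A depth d counted from the top is position n + 1 - d from the bottom.
    upper_fourth = depth_value(x, n + 1 - d_fourth)

    spread = upper_fourth - lower_fourth
    lower_fence = lower_fourth - fence_factor * spread
    upper_fence = upper_fourth + fence_factor * spread

    # Adjacent values: the most extreme data values still inside the fences.
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    outside = [v for v in x if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "fourths": (lower_fourth, upper_fourth),
        "adjacent": (inside[0], inside[-1]),
        "outside": outside,
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 20])
print(summary)
```

In this small example the whiskers end at the data values 1 and 7, and 20 is labeled as an outside value.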

[Figure: example boxplot, axis from 1 to 7]

Boxplot: Strengths

  • Quick summary without overwhelming amount of detail
  • Approximate location, spread, shape of distribution

  • Outlier identification
  • Associations among variables

[Figure: example boxplot, axis from 1 to 7]

SLIDE 2

Boxplots: Weaknesses

  • Expected number of labeled outliers: approx. 0.4 + 0.007n
  • For n = 100000 expect approx. 700 outliers!
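The arithmetic behind the figure of 700 can be checked with a few lines of Python. This is only a sketch of the approximation quoted above; the function name is hypothetical, and the sample sizes are the ones used in the slide's exponential example.

```python
def expected_labeled_outliers(n):
    """Approximate expected number of labeled outliers, 0.4 + 0.007 n."""
    return 0.4 + 0.007 * n

# Sample sizes taken from the slide's example.
for n in (100, 1000, 10000, 100000):
    print(f"n = {n:>6}: about {expected_labeled_outliers(n):.1f} labeled outliers")
```

The count grows linearly in n, which is why a conventional boxplot of a very large sample ends up showing hundreds of individually labeled points.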

[Figure: boxplots of samples from an exponential distribution, n = 100, 1000, 10000, 100000]

Modifications

  • Notched box-and-whisker plots (McGill, Tukey, Larsen 1978)
  • Nonparametric density estimates
  • Vase plots (Benjamini, 1988)
  • Violin plots (Hintze, Nelson 1998)
  • Box-percentile plots (Esty, Banfield 2003)

Implementations: S routines (David James), package vioplot (Adler, Romain), package Hmisc bpplot (Harrell, Banfield), examples at R Graph Gallery

Letter Value Statistics

  • Estimate quantiles corresponding to tail areas 2^(-j)
  • Median (1/2): depth dM = (1 + n)/2
  • Fourths (1/4): depth dF = (1 + ⌊dM⌋)/2
  • Eighths (1/8): depth dE = (1 + ⌊dF⌋)/2
  • Boxplots show median, fourths
  • Large data sets: tail quantiles become more reliable, so include LVs beyond the fourths
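The depth recursion dM = (1 + n)/2, dF = (1 + ⌊dM⌋)/2, dE = (1 + ⌊dF⌋)/2 continues in the same way for further letter values. A minimal Python sketch follows; the function names are hypothetical, and averaging the two neighbouring order statistics at a half-integer depth is an assumed convention rather than something the slides specify.

```python
import math

def letter_value_depths(n, k):
    """Depths of the first k letter values (median, fourths, eighths, ...)."""
    depths = []
    d = (1 + n) / 2
    for _ in range(k):
        depths.append(d)
        d = (1 + math.floor(d)) / 2
    return depths

def letter_values(data, k):
    """Return (depth, lower LV, upper LV) for the first k letter values."""
    x = sorted(data)
    n = len(x)
    result = []
    for d in letter_value_depths(n, k):
        # Half-integer depths average two neighbouring order statistics (assumption).
        lower = (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) / 2
        up = n + 1 - d  # same depth, counted from the top
        upper = (x[math.floor(up) - 1] + x[math.ceil(up) - 1]) / 2
        result.append((d, lower, upper))
    return result

print(letter_value_depths(1000, 3))
print(letter_values(range(1, 1001), 3))
```

For n = 1000 the first three depths are 500.5, 250.5 and 125.5, so each further letter value roughly halves the number of observations beyond it.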

Letter Value Boxplot

  • How many boxes to show?
  • Outlier identification?
  • All marks are based on actual data values

[Figure: letter value boxplot of x, axis from -3 to 3]

LVboxplot(rnorm(1000))

SLIDE 3

Stopping Rules & Outliers

  • EDA: 5-8 outliers
  • Percentage of data, e.g. 0.5-1%
  • Uncertainty in LV_i extends beyond or into LV_(i-1) (i.e. upper limit for LV_i crosses LV_(i-1))

k = ⌊log2 n⌋ − 4

k = ⌊log2 n − log2(4 z²_{1−α/2})⌋ + 1
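The two formulas for the number of letter values k can be compared numerically. The sketch below is hedged: it reads the second formula as k = ⌊log2 n − log2(4 z²_{1−α/2})⌋ + 1, where z_{1−α/2} is the standard normal quantile, and it assumes α = 0.05, which the slide does not state. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def k_simple(n):
    """k = floor(log2 n) - 4."""
    return math.floor(math.log2(n)) - 4

def k_interval(n, alpha=0.05):
    """k = floor(log2 n - log2(4 z^2)) + 1, z the (1 - alpha/2) normal quantile.

    alpha=0.05 is an assumed default, not a value given in the slides.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.floor(math.log2(n) - math.log2(4 * z * z)) + 1

for n in (1000, 10000, 100000):
    print(n, k_simple(n), k_interval(n))
```

Under these assumptions the two rules differ by only a box or so for the sample sizes shown, which matches the remark that the rules lead to similar answers.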

Rules lead to similar answers ...

Examples

Gaussian, Exponential & Uniform

[Figure: Gaussian, n=10000; x axis from -3 to 3]

[Figure: Exponential, n=10000; x axis from 2 to 8]

[Figure: Uniform, n=10000; x axis from 0.0 to 1.0]

Gene Expression Values

[Figure: nine panels T1-1, T1-2, T1-3, T2-1, T2-2, T2-3, WT-1, WT-2, WT-3; x axis from 6 to 14 in each]

Conclusion

  • appropriate for large number of values
  • based on actual data values
  • simple to compute
  • reduce number of labeled outliers shown in conventional boxplots

  • do not depend on a smoothing parameter

Letter Value Boxplots are available for download (for now) at http://www.public.iastate.edu/~hofmann

SLIDE 4

Graphical Displays of Large Data Sets

  • Quick summary without overwhelming amount of detail
  • Approximate location, spread, shape of distribution

  • Outlier identification
  • Associations among variables

“The greatest value of a picture is when it forces us to notice what we never expected to see” (Tukey 1977)