Univariate Graphics
STAT 133 Gaston Sanchez
Department of Statistics, UC–Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133
Looking at one single variable
Univariate Statistical Graphics
Qualitative Variable
◮ Bar chart
◮ Dot chart
◮ Pie chart
◮ Pareto chart
Quantitative Variable
◮ All of the qualitative ones
◮ Histogram
◮ Density curve
◮ Boxplot
◮ Ogive
            C1  C2  C3  C4  C5
frequency   f1  f2  f3  f4  f5
◮ categories on horizontal axis
◮ frequencies on vertical axis
◮ length of bar equal to frequency
(Note that you can also make a horizontal bar chart, in which case the axes swap roles.)
##   color count percent
## 1 black     5    2.58
## 2  blue    40   20.62
## 3 brown     2    1.03
## 4  gold    19    9.79
## 5 green    31   15.98
## 6           4    2.06
## 7   red    71   36.60
## 8 white    22   11.34
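The percent column in a frequency table like this one is each count divided by the total count. A minimal Python sketch, assuming the counts printed above; the label of row 6 is missing in the source, so the key `"(row 6)"` is a placeholder, and the names `counts` and `percent` are illustrative only:

```python
# Counts copied from the frequency table; the label of row 6 is missing
# in the source, so a placeholder key is used for it.
counts = {
    "black": 5, "blue": 40, "brown": 2, "gold": 19,
    "green": 31, "(row 6)": 4, "red": 71, "white": 22,
}

total = sum(counts.values())

# Relative frequency of each category as a percentage, two decimals.
percent = {color: round(100 * n / total, 2) for color, n in counts.items()}

for color, n in counts.items():
    print(f"{color:>8} {n:>5} {percent[color]:>7.2f}")
```

The heights of the bars in a bar chart are either the raw counts or these percentages; the shape of the chart is the same in both cases.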
[Figure: bar chart of counts by color, categories in alphabetical order]
[Figure: bar chart of percentages by color, categories in alphabetical order]
[Figure: bar chart of counts by color, sorted in increasing order]
[Figure: bar chart of counts by color, sorted in decreasing order]
◮ Dot charts are very similar to bar charts.
◮ Instead of using bars, dot charts display frequencies with dots.
◮ They are simpler and cleaner than bar charts.
◮ They are also used less often than bar charts.
[Figure: dot chart of counts by color, categories in alphabetical order]
[Figure: dot chart of counts by color, sorted in increasing order]
◮ all values roughly the same
◮ differences decrease by roughly the same amount
◮ differences from one value to the next vary significantly
◮ differences from one value to the next increase
◮ differences from one value to the next decrease
◮ shifting differences from [...]
◮ [...] different from the rest
[Figure: Pareto chart of colors; bars in decreasing order of frequency (red, blue, green, white, gold, black, [...], brown), vertical axis from 0% to 100%, and a line of cumulative percentages: 36.6%, 57.22%, 73.2%, 84.54%, 94.33%, 96.91%, 98.97%, 100%]
◮ Pareto charts contain both bars and a line graph
◮ Individual values are represented by bars in descending order
◮ Cumulative frequencies are represented by the line
◮ The left vertical axis is the frequency of occurrence
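The two ingredients of a Pareto chart, the sorted bars and the cumulative line, can be computed directly from a frequency table. A minimal Python sketch using the color counts from the earlier table; the key `"(row 6)"` stands in for the category whose label is missing in the source, and all variable names are illustrative:

```python
from itertools import accumulate

# Color counts from the frequency table; "(row 6)" is a placeholder for
# the category whose label is missing in the source.
counts = {
    "black": 5, "blue": 40, "brown": 2, "gold": 19,
    "green": 31, "(row 6)": 4, "red": 71, "white": 22,
}

# Bars: categories sorted by frequency, largest first.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running total of the sorted counts, as a percentage of the total.
total = sum(counts.values())
running = list(accumulate(n for _, n in ordered))
cum_percent = [round(100 * r / total, 2) for r in running]

for (color, n), cp in zip(ordered, cum_percent):
    print(f"{color:>8} {n:>4} {cp:>7.2f}%")
```

The last cumulative value is always 100%, and a steep start of the line indicates that a few categories account for most of the observations.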
[Figure: pie chart of colors]
◮ Pie charts force us to compare either the 2-D areas formed by each slice or the angles formed
◮ Visual perception handles neither of these comparisons easily or accurately
##       teams tickets      teams tickets
## 1   cowboys  110.20    falcons   83.71
## 2  patriots  117.84    vikings   78.69
## 3    giants  111.69       rams   74.49
## 4     bears  103.60   seahawks   71.21
## 5      jets  110.28  cardinals   79.56
## 6  redskins   94.80   dolphins   71.14
## 7    ravens  100.19    raiders   64.80
## 8    eagles   93.01     titans   65.28
## 9    texans   88.98      lions   67.60
## 10 chargers   84.55    bengals   68.96
## 11 steelers   81.13    jaguars   68.44
## 12  packers   82.61     chiefs   64.92
## 13    49ers   83.54 buccaneers   63.59
## 14   saints   74.99      bills   57.75
## 15  broncos   84.27   panthers   66.84
## 16    colts   86.32     browns   54.20
◮ We can use bar charts with quantitative variables
◮ In this case we need to first categorize the variable, and then get a frequency table
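Categorizing a quantitative variable and tabulating the result could look like the following Python sketch. It uses the ticket prices from the table above and three price categories (below $70, $70 to $99.99, $100 or above); the function name `price_category` is hypothetical:

```python
from collections import Counter

# Ticket prices copied from the teams/tickets table.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def price_category(price):
    """Map a price to one of three ordered categories."""
    if price < 70:
        return "below $70"
    if price < 100:
        return "$70 - $99.99"
    return "$100 or above"

# Frequency table: number of teams in each category.
freq = Counter(price_category(p) for p in prices)

for label in ["below $70", "$70 - $99.99", "$100 or above"]:
    print(f"{label:<14} {freq[label]}")
```

Once the variable is categorized, the frequency table can be drawn as a bar chart like any qualitative variable.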
[Figure: bar chart of absolute frequency by ticket price category: below $70, $70 − $99.99, $100 or above]
Histograms provide a way of viewing the general distribution of values in a quantitative variable
[Figure: histogram of ticket prices; horizontal axis: price, vertical axis: frequency]
◮ The range of values is partitioned into a number of non-overlapping “cells” or bins.
◮ The number of values falling into each cell is counted (either absolute or relative frequencies).
◮ Each count is represented as a “bar” drawn over the cell.
◮ The bins represent ranges of values
◮ The bins (intervals) must be adjacent, and usually of equal size
◮ The bars are adjacent (not discontinuous)
◮ The areas of the bars are meaningful
◮ Height of a bar equal to the frequency
◮ Width equal to the bin size
◮ With equal-size bins, the area of a bar is proportional to the proportion of data values which fall in the bin
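To make the bin counting concrete, here is a small Python sketch that bins the ticket prices into equal-width intervals and checks that, on a density scale, the bar areas add up to 1. The left-closed convention [a, b) and the breaks 50, 60, ..., 120 are assumptions chosen for illustration, and `histogram_counts` is a hypothetical helper, not a library function:

```python
# Ticket prices copied from the teams/tickets table.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def histogram_counts(values, breaks):
    """Count the values falling in each left-closed bin [breaks[i], breaks[i+1])."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        for i in range(len(counts)):
            if breaks[i] <= v < breaks[i + 1]:
                counts[i] += 1
                break
    return counts

breaks = list(range(50, 130, 10))      # 50, 60, ..., 120 (assumed breaks)
counts = histogram_counts(prices, breaks)

n = len(prices)
width = 10
# Density scale: height = count / (n * width), so area = proportion in the bin.
heights = [c / (n * width) for c in counts]
areas = [h * width for h in heights]

print(counts)
print(round(sum(areas), 6))
```

With equal-width bins the frequency scale and the density scale give bars of the same shape; only the labels on the vertical axis change.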
[Figure: histogram of ticket prices, with bin breaks at 54, 62, 70, 78, 86, 94, 102, 110, 118]
[Figure: histograms of ticket prices with 4, 5, 6, and 7 bins]
[Figure: histograms of ticket prices with 3 bins and with 14 bins]
◮ The shape of a histogram depends on the chosen bins
◮ This suggests that there is a fundamental instability at the heart of its construction
◮ The bars are adjacent (not discontinuous)
◮ The areas of the bars are meaningful
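The dependence on the bins can be seen by counting the same ticket prices with two different sets of breaks. The Python sketch below assumes left-closed bins and uses 3 bins of width 30 and 14 bins of width 5; `bin_counts` is a hypothetical helper built on the standard-library `bisect` module:

```python
from bisect import bisect_right

# Ticket prices copied from the teams/tickets table.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def bin_counts(values, breaks):
    """Counts per left-closed bin; values outside the breaks are ignored."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        i = bisect_right(breaks, v) - 1   # index of the bin containing v
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

counts3 = bin_counts(prices, [50, 80, 110, 140])          # 3 wide bins
counts14 = bin_counts(prices, list(range(50, 125, 5)))    # 14 narrow bins

print(counts3)
print(counts14)
```

With 3 bins the distribution looks like a simple decreasing staircase, while 14 bins reveal several local peaks and two empty bins, even though the data are identical.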
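[Figure: histogram shapes: symmetrical, skewed to the right, skewed to the left]
[Figure: histogram shapes: curved, flat or uniform, curved downward, multiple peaks (e.g. bimodal, trimodal, etc.)]
[Figure: histogram features: concentration, gap, outlier]
[Figure: histogram features: concentration, peak, concentration]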
[Figure: boxplot, with the box and the whiskers labeled]
[Figure: anatomy of a boxplot: low value (min), 25th percentile (Q1), median or 50th percentile (Q2), 75th percentile (Q3), high value (max); the IQR or midspread covers 50% of values, the range covers 100% of values]
[Figure: boxplot of ticket prices, with min, Q1, median, Q3, and max marked; horizontal axis: price]
Call an observation a suspected outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
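The rule can be checked numerically. The Python sketch below computes the quartiles of the ticket prices and applies the 1.5 x IQR rule. Quartile definitions differ between textbooks and software; the `"inclusive"` method of the standard-library `statistics.quantiles` is used here as an assumption, and other methods may give slightly different fences:

```python
import statistics

# Ticket prices copied from the teams/tickets table.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

# Quartiles (Q1, median, Q3); the "inclusive" method is an assumption.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are suspected outliers.
suspected = [p for p in prices if p < lower_fence or p > upper_fence]

print(min(prices), q1, median, q3, max(prices))
print(lower_fence, upper_fence, suspected)
```

For these prices no observation falls beyond the fences, so under this quartile definition the whiskers would extend all the way to the minimum and maximum.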
[Figure: boxplot with outliers: points above Q3 + 1.5 x IQR or below Q1 - 1.5 x IQR are drawn as outliers; the box shows the 25th percentile (Q1), the median or 50th percentile (Q2), and the 75th percentile (Q3)]
[Figure: density curve of ticket prices; horizontal axis: price, vertical axis: density]
◮ Describes the distribution of values by a smooth curve
◮ Is always on or above the horizontal axis
◮ Has area equal to 1 underneath it
◮ Is an idealized distribution
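One common way to obtain such a smooth curve from data is a Gaussian kernel density estimate. The slides do not say how their density curve was produced, so the Python sketch below is only an illustration: the kernel, the bandwidth `h = 8.0`, and the function name `gaussian_kde` are all assumptions. It checks numerically that the curve is never negative and that the area under it is about 1:

```python
import math

# Ticket prices copied from the teams/tickets table.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def gaussian_kde(data, h):
    """Return a density function: the average of Gaussian bumps of width h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)
    return density

density = gaussian_kde(prices, h=8.0)   # bandwidth chosen arbitrarily

# Evaluate the curve on a grid from 0 to 200 in steps of 0.5.
step = 0.5
grid = [step * i for i in range(401)]
heights = [density(x) for x in grid]

# Trapezoidal rule for the area under the curve.
area = sum((heights[i] + heights[i + 1]) / 2 * step
           for i in range(len(heights) - 1))

print(min(heights) >= 0, round(area, 4))
```

A different bandwidth gives a smoother or bumpier curve, in the same way that a different choice of bins changes a histogram.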
◮ The mode is the peak point of the curve (could be more than one or none)
◮ The median is the equal-areas point
◮ The mean is the balance point
◮ The median and the mean are always equal on a symmetric density curve
◮ Ogives help us examine the cumulative distribution of values in a quantitative variable
◮ An ogive tells us how many data values are less than the indicated value on the horizontal axis
◮ An ogive shows how slowly or rapidly the data values accumulate over the range of the data
Bin  Interval   Mid-point  Frequency  Cum Freq
1    [50-60)    55         2          2
2    [60-70)    65         8          10
3    [70-80)    75         6          16
4    [80-90)    85         8          24
5    [90-100)   95         2          26
6    [100-110)  105        2          28
7    [110-120)  115        4          32
[Figure: ogive of ticket prices; horizontal axis: price (50 to 120), vertical axis: cumulative frequency, with points at 2, 10, 16, 24, 26, 28, 32]
◮ Make a frequency table showing bin intervals and cumulative frequencies.
◮ An ogive begins on the horizontal axis at the lower boundary of the first bin.
◮ For each bin, make a dot over the upper interval limit at the height of the cumulative frequency.
◮ Connect the dots with line segments.
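These steps can be sketched in Python from the bin frequencies of the ticket-price table (bins of width 10 from 50 to 120). The function `ogive`, which reads the curve at an arbitrary price by linear interpolation along the line segments, is a hypothetical helper added for illustration:

```python
from itertools import accumulate

# Bin boundaries and frequencies from the ticket-price frequency table.
breaks = [50, 60, 70, 80, 90, 100, 110, 120]
freqs = [2, 8, 6, 8, 2, 2, 4]

# Step 1: cumulative frequencies.
cum_freqs = list(accumulate(freqs))

# Steps 2 and 3: start at (lower boundary of first bin, 0), then one dot
# over each upper bin limit at the height of the cumulative frequency.
points = [(breaks[0], 0)] + list(zip(breaks[1:], cum_freqs))

# Step 4: connect the dots with line segments (linear interpolation).
def ogive(x):
    """Approximate number of values below x, read off the line segments."""
    if x <= points[0][0]:
        return 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return float(points[-1][1])

print(points)
print(ogive(75))
```

Reading the curve at 75 gives 13, halfway between the cumulative frequencies 10 (at 70) and 16 (at 80); this is an approximation that treats the values as evenly spread within each bin.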
[Figure: three histograms of frequencies: Histogram 1, Histogram 2, Histogram 3]
[Figure: Histogram 1 and Ogive 1, with cumulative percentages 0%, 2%, 9%, 17%, 31.5%, 46%, 67.5%, 83.5%, 93.5%, 99.5%, 100%]
[Figure: Histogram 2 and Ogive 2, with cumulative percentages 0%, 1%, 1%, 4.5%, 16%, 34%, 58%, 100%]
[Figure: Histogram 3 and Ogive 3, with cumulative percentages 0%, 22%, 51.5%, 71%, 83%, 91%, 95.5%, 98%, 99.5%, 100%]