Univariate Graphics STAT 133 Gaston Sanchez Department of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Univariate Graphics

STAT 133 Gaston Sanchez

Department of Statistics, UC–Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133

slide-2
SLIDE 2

Looking at one single variable

2

slide-3
SLIDE 3

Univariate Statistical Graphics

Getting started with graphics for exploration requires understanding charts and plots for single variables

3

slide-4
SLIDE 4

Univariate Statistical Graphics

Getting started with graphics for exploration requires understanding charts and plots for single variables

Many multivariate graphics are extensions or combinations of univariate charts

3

slide-5
SLIDE 5

Univariate graphics by type of variable

Qualitative Variable

◮ Bar chart
◮ Dot chart
◮ Pie chart
◮ Pareto chart

Quantitative Variable

◮ All of the qualitative charts (after categorizing the variable)
◮ Histogram
◮ Density curve
◮ Boxplot
◮ Ogive

slide-6
SLIDE 6

Bar Charts

5

slide-7
SLIDE 7

From Frequency Tables ...

Category   Absolute Frequency   Relative Frequency
C1         f1                   f1/n
C2         f2                   f2/n
C3         f3                   f3/n
...        ...                  ...
Ck         fk                   fk/n
total      n                    1

6

slide-8
SLIDE 8

to Bar-charts

[Figure: vertical bar chart; categories C1 to C5 on the horizontal axis, frequency on the vertical axis, bar heights f1 to f5]

slide-9
SLIDE 9

Bar-charts

Elements of vertical bar-charts

◮ categories on horizontal axis
◮ frequencies on vertical axis
◮ length of bar equal to frequency

(Note that you can also make a horizontal bar-chart, in which case the axes play inverse roles)

8

slide-10
SLIDE 10

Bar-chart: predominant color in flags

9

slide-11
SLIDE 11

Predominant Color in Flags

##    color count percent
## 1  black     5    2.58
## 2   blue    40   20.62
## 3  brown     2    1.03
## 4   gold    19    9.79
## 5  green    31   15.98
## 6 orange     4    2.06
## 7    red    71   36.60
## 8  white    22   11.34
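The percent column is the relative frequency of each color, f/n, written as a percentage. A minimal Python sketch of that arithmetic, assuming the eight counts listed above (the variable names are illustrative, and the course itself works in R):

```python
# Counts of predominant flag colors, taken from the table on this slide.
counts = {
    "black": 5, "blue": 40, "brown": 2, "gold": 19,
    "green": 31, "orange": 4, "red": 71, "white": 22,
}

n = sum(counts.values())  # total number of flags

# Relative frequency f / n, expressed as a percentage with two decimals.
percent = {color: round(100 * f / n, 2) for color, f in counts.items()}

for color, f in counts.items():
    print(f"{color:<7}{f:>4}{percent[color]:>7.2f}")
```

The printed rows should match the count and percent columns of the table.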

10

slide-12
SLIDE 12

Bar-chart example

[Figure: bar chart of counts by predominant color; black 5, blue 40, brown 2, gold 19, green 31, orange 4, red 71, white 22; count axis marked from 10 to 70]

11

slide-13
SLIDE 13

Bar-chart: predominant color in flags

[Figure: bar chart of percentages by predominant color; black 2.6%, blue 20.6%, brown 1%, gold 9.8%, green 16%, orange 2.1%, red 36.6%, white 11.3%; percent axis from 0% to 40%]

12

slide-14
SLIDE 14

Bar-chart: predominant color in flags

[Figure: bar chart ranked in ascending order; brown 2, orange 4, black 5, gold 19, white 22, green 31, blue 40, red 71; count axis marked from 10 to 70]

13

slide-15
SLIDE 15

Bar-chart: predominant color in flags

[Figure: bar chart ranked in descending order; red 71, blue 40, green 31, white 22, gold 19, black 5, orange 4, brown 2; count axis marked from 10 to 70]

14

slide-16
SLIDE 16

Bar-chart: predominant color in flags

[Figure: bar chart ranked in descending order; red 71, blue 40, green 31, white 22, gold 19, black 5, orange 4, brown 2; count axis marked from 10 to 70]

15

slide-17
SLIDE 17

Dot charts

16

slide-18
SLIDE 18

Dot charts

◮ Dot-charts are very similar to bar charts.
◮ Instead of using bars, dot-charts display frequencies with dots.
◮ They are simpler and cleaner than bar charts.
◮ They are also less commonly used than bar charts.

slide-19
SLIDE 19

Dot-chart: predominant color in flags

[Figure: dot chart of counts by predominant color (black, blue, brown, gold, green, orange, red, white); count axis marked from 10 to 80]

18

slide-20
SLIDE 20

Ranked Dot-charts

[Figure: ranked dot chart in ascending order (brown, orange, black, gold, white, green, blue, red); count axis marked from 10 to 80]

19

slide-21
SLIDE 21

Ranked dot-chart patterns

◮ all values roughly the same
◮ differences decrease by roughly the same amount
◮ differences from one value to the next vary significantly
◮ differences from one value to the next increase

20

slide-22
SLIDE 22

Ranked dot-chart patterns

◮ differences from one value to the next decrease
◮ shifting differences from one value to the next
◮ one or more values are extraordinarily different from the rest

21

slide-23
SLIDE 23

Pareto charts

22

slide-24
SLIDE 24

Bar-chart with Pareto Line

[Figure: bar chart with Pareto line; bars in descending order (red, blue, green, white, gold, black, orange, brown); percent axis from 0% to 100%]

23

slide-25
SLIDE 25

Bar-chart with Pareto Line

[Figure: bar chart with Pareto line; bars in descending order with cumulative percentages: red 36.6%, blue 57.22%, green 73.2%, white 84.54%, gold 94.33%, black 96.91%, orange 98.97%, brown 100%; percent axis from 0% to 100%]

24

slide-26
SLIDE 26

Bar-chart with Pareto Line

[Figure: bar chart with Pareto line; bars in descending order with cumulative percentages: red 36.6%, blue 57.22%, green 73.2%, white 84.54%, gold 94.33%, black 96.91%, orange 98.97%, brown 100%; percent axis from 0% to 100%]

25

slide-27
SLIDE 27

Pareto charts

◮ Pareto charts contain both bars and a line graph
◮ Individual values are represented by bars in descending order
◮ Cumulative frequencies are represented by the line
◮ The left vertical axis is the frequency of occurrence
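The numbers behind a Pareto chart come from two steps: rank the categories by descending frequency, then accumulate. A minimal Python sketch using the flag color counts from earlier in the deck (variable names are illustrative, not from the slides):

```python
from itertools import accumulate

# Counts of predominant flag colors, from the earlier frequency table.
counts = {
    "black": 5, "blue": 40, "brown": 2, "gold": 19,
    "green": 31, "orange": 4, "red": 71, "white": 22,
}

# Sort categories by descending frequency: this is the order of the bars.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(f for _, f in ranked)

# Running totals give the heights of the Pareto line, as percentages.
cumulative = [round(100 * c / total, 2)
              for c in accumulate(f for _, f in ranked)]

for (color, f), cum in zip(ranked, cumulative):
    print(f"{color:<7}{f:>4}{cum:>8.2f}%")
```

The last cumulative value is always 100%, which is why the Pareto line ends at the top of the percent axis.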

slide-28
SLIDE 28

Pie charts

27

slide-29
SLIDE 29

Pie Chart

[Figure: pie chart of predominant flag colors (black, blue, brown, gold, green, orange, red, white)]

slide-30
SLIDE 30

Donut Chart

[Figure: donut chart of predominant flag colors (black, blue, brown, gold, green, orange, red, white)]

slide-31
SLIDE 31

Pie charts disadvantages

◮ Pie charts force us to compare either the 2-D areas formed by each slice or the angles formed
◮ Visual perception handles neither of these comparisons easily or accurately

30

slide-32
SLIDE 32

Univariate Quantitative Charts

31

slide-33
SLIDE 33

NFL Ticket prices (2013)

##        teams tickets      teams tickets
## 1    cowboys  110.20    falcons   83.71
## 2   patriots  117.84    vikings   78.69
## 3     giants  111.69       rams   74.49
## 4      bears  103.60   seahawks   71.21
## 5       jets  110.28  cardinals   79.56
## 6   redskins   94.80   dolphins   71.14
## 7     ravens  100.19    raiders   64.80
## 8     eagles   93.01     titans   65.28
## 9     texans   88.98      lions   67.60
## 10  chargers   84.55    bengals   68.96
## 11  steelers   81.13    jaguars   68.44
## 12   packers   82.61     chiefs   64.92
## 13     49ers   83.54 buccaneers   63.59
## 14    saints   74.99      bills   57.75
## 15   broncos   84.27   panthers   66.84
## 16     colts   86.32     browns   54.20

32

slide-34
SLIDE 34

Bar charts for quantitative variables

◮ We can use bar charts with quantitative variables
◮ In this case we need to first categorize the variable, and then get a frequency table

33

slide-35
SLIDE 35

Frequency Table of Ticket Prices

Category Name    Absolute Frequency   Relative Frequency
Below $70        10                   0.3125
$70 - $99.99     16                   0.5000
$100 or above    6                    0.1875
Total            32                   1.0000
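Categorizing the variable first and then tabulating can be sketched in Python as follows. The prices are the 32 values from the NFL ticket price table; the `categorize` helper is a hypothetical name introduced only for this illustration:

```python
from collections import Counter

# 2013 NFL ticket prices in dollars, from the table shown earlier.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def categorize(price):
    """Map a price to one of the three categories of the frequency table."""
    if price < 70:
        return "Below $70"
    if price < 100:
        return "$70 - $99.99"
    return "$100 or above"

# Absolute frequency: how many prices fall in each category.
absolute = Counter(categorize(p) for p in prices)

# Relative frequency: absolute frequency divided by the number of teams.
relative = {name: count / len(prices) for name, count in absolute.items()}

for name in ("Below $70", "$70 - $99.99", "$100 or above"):
    print(f"{name:<15}{absolute[name]:>4}{relative[name]:>9.4f}")
```

The cut points (70 and 100) are a choice; different cut points would give a different table and a different bar chart.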

34

slide-36
SLIDE 36

NFL Ticket prices (2013)

[Figure: bar chart of ticket price categories (below $70, $70 - $99.99, $100 or above); vertical axis: absolute frequency, marked from 2 to 16]

slide-37
SLIDE 37

Histograms

36

slide-38
SLIDE 38

Histograms

Histograms provide a way of viewing the general distribution of values in a quantitative variable

37

slide-39
SLIDE 39

NFL Ticket prices (2013)

[Figure: histogram of NFL ticket prices; horizontal axis: price (50 to 120), vertical axis: frequency (2 to 8)]

38

slide-40
SLIDE 40

Building a Histogram

1. Partition of values: The range of the data values is partitioned into a number of non-overlapping “cells” or bins.
2. Counting frequencies: The number of data values falling into each cell is counted (either absolute or relative frequencies).
3. Drawing bars: The observations falling into a cell are represented as a “bar” drawn over the cell.
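The three steps can be sketched in plain Python. This is only an illustration, assuming seven left-closed bins of width 10 starting at 50; `histogram_counts` is a hypothetical helper, not a function from the course materials:

```python
# 2013 NFL ticket prices in dollars, from the table shown earlier.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins adjacent bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)  # index of the bin containing v
        if 0 <= i < n_bins:            # values outside the bins are ignored
            counts[i] += 1
    return counts

# Steps 1 and 2: partition [50, 120) into 7 bins and count.
counts = histogram_counts(prices, start=50, width=10, n_bins=7)

# Step 3, drawn as text: one row of '#' per bin.
for i, c in enumerate(counts):
    low = 50 + 10 * i
    print(f"[{low:>3}, {low + 10:>3}) {'#' * c}")
```

Changing `start`, `width` or `n_bins` changes the counts, and with them the shape of the histogram.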

39

slide-41
SLIDE 41

About Histograms

◮ The bins represent ranges of values
◮ The bins (intervals) must be adjacent, and are usually of equal size
◮ The bars are adjacent (not discontinuous)
◮ The areas of the bars are meaningful
◮ Height of bars equal to the frequency
◮ Width equal to the bin size
◮ With equal-width bins, the area of a bar is proportional to the proportion of data values which fall in the bin
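The statements about height and area can be reconciled with a density scale: if each height is the relative frequency divided by the bin width, then each area equals the proportion of values in the bin. A small sketch, assuming the NFL ticket prices counted in seven bins of width 10 starting at 50 (the counts are those of the frequency table later in the deck):

```python
# Counts for seven bins of width 10 starting at 50 (NFL ticket prices).
counts = [2, 8, 6, 8, 2, 2, 4]
width = 10
n = sum(counts)  # number of observations

# Density scale: height = relative frequency / bin width,
# so that area = height * width = relative frequency.
heights = [c / (n * width) for c in counts]
areas = [h * width for h in heights]

for c, h, a in zip(counts, heights, areas):
    print(f"count={c:>2}  height={h:.5f}  area={a:.4f}")

print("total area:", sum(areas))
```

When the heights are raw frequencies instead, the areas are still proportional to these proportions as long as all bins have the same width.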

40

slide-42
SLIDE 42

Histogram with 4 bins

[Figure: histogram of ticket prices; horizontal axis: price (54 to 118), vertical axis: frequency (2 to 12)]

41

slide-43
SLIDE 43

Histograms with different bins

[Figure: four histograms of ticket prices, using 4, 5, 6 and 7 bins; horizontal axis: price, vertical axis: frequency]

42

slide-44
SLIDE 44

Avoid too few and too many bins

[Figure: two histograms of ticket prices, one with 3 bins and one with 14 bins; horizontal axis: price, vertical axis: frequency]

43

slide-45
SLIDE 45

About Histograms

◮ The shape of a histogram depends on the chosen bins
◮ This suggests that there is a fundamental instability at the heart of its construction
◮ The bars are adjacent (not discontinuous)
◮ The areas of the bars are meaningful

slide-46
SLIDE 46

Histogram patterns

◮ Symmetrical
◮ Skewed to the right
◮ Skewed to the left

45

slide-47
SLIDE 47

Histogram patterns

◮ Curved
◮ Flat or Uniform
◮ Curved Downward
◮ Multiple peaks (e.g. bimodal, trimodal, etc.)

slide-48
SLIDE 48

Histogram patterns

◮ Concentration
◮ Gap
◮ Outlier

47

slide-49
SLIDE 49

Histogram patterns

Concentration Peak Concentration

48

slide-50
SLIDE 50

Box plots

49

slide-51
SLIDE 51

Box plots

1. Box-and-whisker plots, most commonly known as “box plots”
2. created by John Tukey
3. simple and effective way to display the distribution of values
4. relies on the so-called five-number summary

50

slide-52
SLIDE 52

Box plots based on 5-number summary

5 summary indicators

51

slide-53
SLIDE 53

Box plots based on 5-number summary

5 summary indicators

1. minimum
2. 25th percentile (1st quartile)
3. 50th percentile (2nd quartile, or median)
4. 75th percentile (3rd quartile)
5. maximum
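A sketch of the five-number summary for the NFL ticket prices, in plain Python. Quartile definitions vary between textbooks and software; this example assumes Tukey's hinges (the medians of the lower and upper halves), so other conventions, including default quantile functions in statistical software, may give slightly different Q1 and Q3. The function name is illustrative:

```python
from statistics import median

# 2013 NFL ticket prices in dollars, from the table shown earlier.
prices = [
    110.20, 117.84, 111.69, 103.60, 110.28, 94.80, 100.19, 93.01,
    88.98, 84.55, 81.13, 82.61, 83.54, 74.99, 84.27, 86.32,
    83.71, 78.69, 74.49, 71.21, 79.56, 71.14, 64.80, 65.28,
    67.60, 68.96, 68.44, 64.92, 63.59, 57.75, 66.84, 54.20,
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as Tukey's hinges."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half + len(data) % 2]  # lower half (includes the median if n is odd)
    upper = data[half:]                  # upper half (includes the median if n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary(prices)
for label, value in zip(("min", "Q1", "median", "Q3", "max"), summary):
    print(f"{label:<7}{value:>8.3f}")
```

Under this convention the box spans from about 68.02 to about 91.00, with the median line near 80.35.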

51

slide-54
SLIDE 54

Box plot basics

Box

whiskers

52

slide-55
SLIDE 55

Box plot basics

◮ High Value (max)
◮ 75th percentile (Q3)
◮ Median, 50th percentile (Q2)
◮ 25th percentile (Q1)
◮ Low Value (min)
◮ IQR or midspread, from Q1 to Q3 (50% of values)
◮ Range, from min to max (100% of values)

53

slide-56
SLIDE 56

NFL Ticket Prices

[Figure: NFL ticket prices plotted along a price axis from 55 to 120]

54

slide-57
SLIDE 57

5 number summary

[Figure: NFL ticket prices along a price axis from 55 to 120, with marks at the min, Q1, median, Q3 and max]

55

slide-58
SLIDE 58

Box plot

[Figure: box plot of NFL ticket prices; price axis from 55 to 120]

56

slide-59
SLIDE 59

Box plot

[Figure: box plot of NFL ticket prices; price axis from 55 to 120]

57

slide-60
SLIDE 60

Box plot and outliers

The 1.5 x IQR rule for outliers

Call an observation a suspected outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile
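The rule translates into two fences, Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. A minimal Python sketch, again assuming quartiles computed as Tukey's hinges (other quartile conventions move the fences slightly); the function name and the sample data are hypothetical and not from the slides:

```python
from statistics import median

def suspected_outliers(values):
    """Return the values lying more than 1.5 x IQR outside the quartiles."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half + len(data) % 2])  # hinge convention (an assumption)
    q3 = median(data[half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in data if v < low_fence or v > high_fence]

# Hypothetical illustrative data, not from the slides.
sample = [54, 58, 64, 65, 68, 71, 75, 80, 84, 89, 200]
print(suspected_outliers(sample))
```

For the NFL ticket prices the same convention gives fences of about 33.56 and 125.46, so none of the 32 prices is flagged as a suspected outlier.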

58

slide-61
SLIDE 61

Modified Box plot

◮ Outliers (above Q3 + 1.5 x IQR)
◮ 75th percentile (Q3)
◮ Median, 50th percentile (Q2)
◮ 25th percentile (Q1)
◮ Outliers (below Q1 - 1.5 x IQR)

59

slide-62
SLIDE 62

Density Curves

60

slide-63
SLIDE 63

Density Curve

[Figure: density curve of ticket prices; horizontal axis: price (40 to 140), vertical axis: density (0.000 to 0.025)]

61

slide-64
SLIDE 64

Density Curve

A Density Curve

◮ Describes the distribution of values by a smooth curve
◮ Is always on or above the horizontal axis
◮ Has area equal to 1 underneath it
◮ Is an idealized distribution

slide-65
SLIDE 65

Density Curve

About Density Curve

◮ The mode is the peak point of the curve (could be more than one or none)
◮ The median is the equal-areas point
◮ The mean is the balance point
◮ The median and the mean are always equal on a symmetric density curve

63

slide-66
SLIDE 66

Ogives

64

slide-67
SLIDE 67

Ogives

About Ogives

◮ Ogives help us examine the cumulative distribution of values in a quantitative variable
◮ An ogive tells us how many data values are less than the indicated value on the horizontal axis
◮ An ogive shows how slowly or rapidly the data values accumulate over the range of the data

65

slide-68
SLIDE 68

Frequency Table NFL Price Tickets

Bin   Interval    Mid-point   Frequency   Cum Freq
1     [50-60)     55          2           2
2     [60-70)     65          8           10
3     [70-80)     75          6           16
4     [80-90)     85          8           24
5     [90-100)    95          2           26
6     [100-110)   105         2           28
7     [110-120)   115         4           32

66

slide-69
SLIDE 69

Ogive

[Figure: ogive of NFL ticket prices; horizontal axis: price (50 to 120), vertical axis: cumulative frequency (2 to 32); plotted cumulative frequencies: 2, 10, 16, 24, 26, 28, 32]

67

slide-70
SLIDE 70

Ogives

Building an Ogive

◮ Make a frequency table showing bin intervals and cumulative frequencies.
◮ An ogive begins on the horizontal axis at the lower boundary of the first bin.
◮ For each bin, make a dot over the upper interval limit at the height of the cumulative frequency.
◮ Connect the dots with line segments.
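These steps reduce to computing one (x, y) point per bin boundary. A short Python sketch using the bins and frequencies from the NFL ticket price frequency table (variable names are illustrative); the resulting points are what a plotting function would then connect with line segments:

```python
from itertools import accumulate

# Bin boundaries and frequencies from the NFL ticket price frequency table.
edges = [50, 60, 70, 80, 90, 100, 110, 120]
freqs = [2, 8, 6, 8, 2, 2, 4]

# The ogive starts at height 0 over the lower boundary of the first bin,
# then has one dot over each upper bin limit at the cumulative frequency.
cum_freqs = [0] + list(accumulate(freqs))
ogive_points = list(zip(edges, cum_freqs))

for x, y in ogive_points:
    print(f"price < {x:>3}: {y:>2} teams")
```

Dividing each cumulative frequency by the total (32) gives the percentage version of the ogive.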

slide-71
SLIDE 71

Distributions and Ogives

69

slide-72
SLIDE 72

Three histograms

[Figure: three histograms (Histogram 1, Histogram 2, Histogram 3); vertical axis: frequency]

70

slide-73
SLIDE 73

Symmetric Distribution

[Figure: Histogram 1 next to Ogive 1; ogive vertical axis from 0% to 100%; cumulative percentages along the ogive: 0%, 2%, 9%, 17%, 31.5%, 46%, 67.5%, 83.5%, 93.5%, 99.5%, 100%]

71

slide-74
SLIDE 74

Skewed to the left Distribution

[Figure: Histogram 2 next to Ogive 2; ogive vertical axis from 0% to 100%; cumulative percentages along the ogive: 0%, 1%, 1%, 4.5%, 16%, 34%, 58%, 100%]

72

slide-75
SLIDE 75

Skewed to the right Distribution

[Figure: Histogram 3 next to Ogive 3; ogive vertical axis from 0% to 100%; cumulative percentages along the ogive: 0%, 22%, 51.5%, 71%, 83%, 91%, 95.5%, 98%, 99.5%, 100%]

73