Introduction to Descriptive Statistics

SLIDE 1

Introduction to Descriptive Statistics

17.871 Spring 2012

SLIDE 2

Key measures

Describing data   Moment                          Non-mean based measure
Center            Mean                            Mode, median
Spread            Variance (standard deviation)   Range, Interquartile range
Skew              Skewness                        -
Peaked            Kurtosis                        -

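The measures of center and spread in the table above can be tried out on a small data set. The following is a minimal Python sketch (the course itself uses Stata); the values in `data` are invented purely for illustration, and `statistics.quantiles` uses its default interpolation rule, which may differ slightly from the rule a given statistics package applies.

```python
import statistics

# Hypothetical data, invented for illustration only.
data = [2, 3, 3, 5, 7, 8, 9, 12, 13]

# Measures of center.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of spread.
variance = statistics.variance(data)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)       # square root of the sample variance
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                          # interquartile range

print(mean, median, mode)
print(variance, std_dev, data_range, iqr)
```
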
SLIDE 3

Key distinction

Population vs. Sample Notation

Population   Sample
Greeks       Romans
μ, σ, β      x̄, s, b

SLIDE 4

Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

SLIDE 5

Variance, Standard Deviation of a Population

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

SLIDE 6

Variance, S.D. of a Sample

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The denominator n - 1 is the degrees of freedom.

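The only difference between the population and sample formulas is the denominator (n versus n - 1). A minimal Python sketch of that difference follows; the `scores` list is hypothetical, and the hand-computed values are cross-checked against the standard library's `pvariance` and `variance`.

```python
import math
import statistics

# Hypothetical observations, invented for illustration only.
scores = [4, 8, 6, 5, 3, 7]

n = len(scores)
mean = sum(scores) / n
sum_sq_dev = sum((x - mean) ** 2 for x in scores)

pop_var = sum_sq_dev / n          # population variance: divide by n
samp_var = sum_sq_dev / (n - 1)   # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library.
lib_pop_var = statistics.pvariance(scores)
lib_samp_var = statistics.variance(scores)

print(pop_var, samp_var)
```

With only six observations the two variances differ noticeably; as n grows, the gap between dividing by n and by n - 1 shrinks.
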
SLIDE 7

Binary data

$$\bar{X} = \text{prob}(X = 1) = \text{proportion of time } x = 1$$

$$s_x^2 = \bar{x}(1 - \bar{x}) \;\Rightarrow\; s_x = \sqrt{\bar{x}(1 - \bar{x})}$$

SLIDE 8

Example of this, using today's NBC News/Marist Poll in Michigan

Candidate           Pct.
Santorum            35
Romney              37
Paul                13
Gingrich            8
[Unaccounted for]   [7]

  • gen santorum = 1 if candidate=="Santorum"
  • replace santorum = 0 if candidate~="Santorum"
  • the command summ santorum produces
      • Mean = .35
      • Var = .35(1-.35) = .2275
      • S.d. = .4769696

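The arithmetic for a 0/1 variable can be checked directly. This is a minimal Python sketch (not the Stata session from the slide) that applies the binary-data formula to a proportion of .35; the list of 100 hypothetical respondents is an illustration device, not the poll's actual sample.

```python
import math
import statistics

p = 0.35                      # share of respondents coded santorum = 1

variance = p * (1 - p)        # binary-data formula: x-bar * (1 - x-bar)
sd = math.sqrt(variance)

# Same calculation from an explicit 0/1 variable: 35 ones and 65 zeros
# (a hypothetical sample of 100, chosen only to reproduce the proportion).
santorum = [1] * 35 + [0] * 65
list_mean = statistics.mean(santorum)
list_var = statistics.pvariance(santorum)   # divides by n, matching p(1 - p)

print(round(variance, 4), round(sd, 7))
```
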
SLIDE 9

Normal distribution example

  • IQ
  • SAT
  • Height
  • "No skew"
  • "Zero skew"
  • Symmetrical
  • Mean = median = mode

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$$

[Figure: histogram of heights. x-axis: HEIGHT (inches), 46 to 94; y-axis: Frequency (# PEOPLE), ticks at 200, 400, 600. Image by MIT OpenCourseWare.]

SLIDE 10

Skewness

Asymmetrical distribution

  • Income
  • Contribution to candidates
  • Populations of countries
  • "Residual vote" rates
  • "Positive skew"
  • "Right skew"

[Figure: right-skewed distribution. x-axis: Value; y-axis: Frequency.]

SLIDE 11

Distribution of the average $$ of dividends/tax return (in K's)

[Figure: density histogram. x-axis: dividends_pc, ticks at 5, 10, 15; y-axis: Density, ticks at .5, 1, 1.5. Labeled outlier: Hyde County, SD.]

Fuel economy of cars for sale in the US

[Figure: density histogram. x-axis: var1, ticks at 50, 100, 150; y-axis: Density, ticks from .02 to .1. Labeled outlier: Mitsubishi i-MiEV (which is supposed to be all electric).]

SLIDE 12

Skewness

Asymmetrical distribution

  • GPA of MIT students
  • "Negative skew"
  • "Left skew"

[Figure: left-skewed distribution. x-axis: Value; y-axis: Frequency.]

SLIDE 13

Placement of Republican Party on 100-point scale

[Figure: density histogram. x-axis: place on ideological scale - republican party, 20 to 100; y-axis: Density, ticks from .02 to .08.]

SLIDE 14

Skewness

[Figure: distribution curve. x-axis: Value; y-axis: Frequency.]

SLIDE 15

Placement of Republican Party on 100-point scale

[Figure: density histogram. x-axis: place on ideological scale - democratic party, 20 to 100; y-axis: Density, ticks from .02 to .1.]

Mean = 26.8; median = 25; mode = 25

SLIDE 16

Kurtosis

  • k > 3: leptokurtic
  • k = 3: mesokurtic
  • k < 3: platykurtic

[Figure: three curves with different peakedness, labeled leptokurtic, mesokurtic and platykurtic. x-axis: Value; y-axis: Frequency. Image by MIT OpenCourseWare.]

SLIDE 17

                 Mean   s.d.   Skew.   Kurt.
Self-placement   55.1   26.4   -0.14   2.21
Rep. pty.        26.8   21.2    0.87   3.59
Dem. pty         74.7   21.8   -1.18   4.29

[Figure: three density histograms on a 20 to 100 x-axis: place on ideological scale - yourself; place on ideological scale - democratic party; place on ideological scale - republican party. y-axis: Density.]

Source: Cooperative Congressional Election Study, 2008

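Skewness and kurtosis are built from the third and fourth moments about the mean. The sketch below, in Python, uses one common moment-based definition (under which a normal distribution has skewness 0 and kurtosis 3, matching the k = 3 benchmark in these slides); whether a given statistics package applies exactly this definition or a bias-adjusted variant is an assumption to check. The two data lists are invented for illustration.

```python
def central_moment(values, k):
    """k-th moment about the mean, dividing by n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment scaled by the 1.5 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the square of the second."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical data, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

A positive result from `skewness` corresponds to "right skew" and a negative result to "left skew".
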
SLIDE 18

Normal distribution

  • Skewness = 0
  • Kurtosis = 3

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$$

[Figure: histogram of heights. x-axis: HEIGHT (inches), 46 to 94; y-axis: Frequency (# PEOPLE), ticks at 200, 400, 600. Image by MIT OpenCourseWare.]

SLIDE 19

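More words about the normal curve

[Figure: normal curve centered on µ, with the area of each band marked: 34.1% between µ and 1σ on each side, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, and 0.1% beyond 3σ. y-axis ticks from 0.0 to 0.4. Image by MIT OpenCourseWare.]
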
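The band percentages marked on the normal curve (34.1%, 13.6%, 2.1%, 0.1%) can be reproduced from the normal cumulative distribution function. A minimal Python sketch using the standard library's `statistics.NormalDist` (Python 3.8 or later) follows; it works with the standard normal, which is enough because the band areas do not depend on µ or σ.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Area of each band on one side of the mean.
band_0_to_1 = std_normal.cdf(1) - std_normal.cdf(0)
band_1_to_2 = std_normal.cdf(2) - std_normal.cdf(1)
band_2_to_3 = std_normal.cdf(3) - std_normal.cdf(2)
beyond_3 = 1 - std_normal.cdf(3)

# Share of the distribution within one s.d. of the mean, both sides.
within_1_sd = std_normal.cdf(1) - std_normal.cdf(-1)

for area in (band_0_to_1, band_1_to_2, band_2_to_3, beyond_3):
    print(round(area * 100, 1))
```
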
SLIDE 20

The z-score or the "standardized score"

$$z = \frac{x - \bar{x}}{\sigma_x}$$

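A z-score re-expresses each observation as its distance from the mean in standard-deviation units. A minimal Python sketch follows; the `values` list is hypothetical, and the choice of the population standard deviation (`pstdev`, dividing by n) rather than the sample one is an assumption made for the example.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / s.d. for every x in values."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # ASSUMPTION: population s.d. (divide by n)
    return [(x - mean) / sd for x in values]


# Hypothetical data with mean 5 and population s.d. 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(values)

print(standardized)
```

After standardizing, the scores have mean 0 and standard deviation 1, which is what makes values from differently scaled variables comparable.
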
SLIDE 21

Commands in STATA for univariate statistics

  • summarize varname
  • summarize varname, detail
  • histogram varname, bin() start() width() density/fraction/frequency normal
  • graph box varnames
  • tabulate

SLIDE 22

Example of Florida voters

  • Question: does the age of voters vary by race?
  • Combine Florida voter extract files, 2008
  • gen new_birth_date=date(birth_date,"MDY")
  • gen birth_year=year(new_b)
  • gen age= 2010-birth_year

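The three gen commands parse a birth-date string, pull out the year, and subtract it from 2010. The same steps can be sketched in Python; the function name `age_in_2010` is hypothetical, and the "MM/DD/YYYY" layout of the date string is an assumption, since the slides do not show what birth_date looks like in the voter file.

```python
from datetime import datetime


def age_in_2010(birth_date):
    """Approximate age as 2010 minus the birth year, as on the slide."""
    # ASSUMPTION: dates are stored as month/day/four-digit year.
    born = datetime.strptime(birth_date, "%m/%d/%Y")
    return 2010 - born.year


# Hypothetical records, invented for illustration only.
birth_dates = ["07/04/1962", "12/31/1990", "01/15/1925"]
ages = [age_in_2010(d) for d in birth_dates]

print(ages)
```

Because only the year is used, the result can be off by one for people whose birthday falls later in the year than the reference date.
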
SLIDE 23

Look at distribution of birth year

[Figure: density histogram. x-axis: birth_year, 1850 to 2000; y-axis: Density, ticks from .005 to .025.]

SLIDE 24

Explore age by voting mode

. table race if birth_year>1900,c(mean age)

    race | mean(age)
---------+-----------
       1 |  45.61229
       2 |  42.89916
       3 |   42.6952
       4 |  45.09718
       5 |  52.08628
       6 |  44.77392
       9 |  40.86704

3 = Black
4 = Hispanic
5 = White

SLIDE 25

Graph birth year

. hist age if birth_year>1900
(bin=71, start=9, width=1.3802817)

[Figure: density histogram. x-axis: age, 20 to 100; y-axis: Density, ticks from .01 to .03.]

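The note (bin=71, start=9, width=1.3802817) can be reproduced by hand: the width is the range of the data divided by the number of bins. The Python sketch below uses the smallest age (9), largest age (107) and observation count (12612114) reported by summ age if birth_year>1900, detail elsewhere in this deck. The bin-count rule shown, min(sqrt(N), 10*log10(N)) rounded to a whole number, is stated here as an assumption about Stata's default rather than something the slides confirm.

```python
import math

n_obs = 12612114             # Obs from the summarize, detail output
smallest, largest = 9, 107   # smallest and largest ages in that output

# ASSUMPTION: default number of bins is min(sqrt(N), 10 * log10(N)),
# rounded to a whole number.
default_bins = round(min(math.sqrt(n_obs), 10 * math.log10(n_obs)))

bins = 71                    # bin count reported on the slide
width = (largest - smallest) / bins

print(default_bins, round(width, 7))
```

A width of about 1.38 years means some bars cover one integer age and others cover two, which is why the next step forces the width to 1.
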
SLIDE 26

Divide into "bins" so that each bar represents 1 year

. hist age if birth_year>1900,width(1)

[Figure: density histogram. x-axis: age, 20 to 100; y-axis: Density, ticks from .005 to .02.]

SLIDE 27

Add ticks at 10-year intervals

histogram totalscore, width(1) xlabel(-.2 (.1) 1)

[Figure: density histogram. x-axis: age, ticks every 10 years from 20 to 100; y-axis: Density, ticks from .005 to .02.]

SLIDE 28

Superimpose the normal curve

(with the same mean and s.d. as the empirical distribution)

hist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal

[Figure: density histogram with normal curve overlaid. x-axis: age, ticks every 10 years from 20 to 100; y-axis: Density, ticks from .005 to .02.]

SLIDE 29
. summ age if birth_year>1900,det

                            age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           18              9
 5%           21             16
10%           24             16       Obs            12612114
25%           34             16       Sum of Wgt.    12612114

50%           48                      Mean           49.47549
                        Largest       Std. Dev.      19.01049
75%           63            107
90%           77            107       Variance       361.3986
95%           83            107       Skewness       .2629496
99%           91            107       Kurtosis       2.222442


Histograms by race

hist age if birth_year>1900&race>=3&race<=5,wid(1) xlabel(20 (10) 100) normal by(race)

3 = Black
4 = Hispanic
5 = White

[Figure: density histograms with normal curves, one panel each for race 3, 4 and 5. x-axis: age, ticks every 10 years from 20 to 100; y-axis: Density, ticks from .01 to .03. Legend: Density, normal age. Note: Graphs by race.]

SLIDE 31

Main issues with histograms

  • Proper level of aggregation
  • Non-regular data categories

SLIDE 32

Draw the previous graph with a box plot

graph box age if birth_year>1900

[Figure: box plot of age (y-axis ticks at 50 and 100), annotated with: Upper quartile, Median, Lower quartile, Inter-quartile range (the height of the box), and 1.5 x IQR (the reach of each whisker).]

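The pieces of the box plot can be worked out from the quartiles. This Python sketch plugs in the 25%, 50% and 75% percentiles of age reported by summ age if birth_year>1900, detail (34, 48 and 63) along with the smallest and largest ages (9 and 107). It computes only the 1.5 x IQR limits; how a particular package draws whiskers within those limits is not covered here.

```python
# Quartiles of age taken from the summarize, detail output.
lower_quartile, median, upper_quartile = 34, 48, 63
smallest, largest = 9, 107

iqr = upper_quartile - lower_quartile    # inter-quartile range

# Points farther than 1.5 x IQR from the box are treated as outside values.
upper_limit = upper_quartile + 1.5 * iqr
lower_limit = lower_quartile - 1.5 * iqr

has_high_outside_values = largest > upper_limit
has_low_outside_values = smallest < lower_limit

print(iqr, lower_limit, upper_limit)
print(has_high_outside_values, has_low_outside_values)
```
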
SLIDE 33

Draw the box plots for the different races

graph box age if birth_year>1900&race>=3&race<=5,by(race)

3 = Black
4 = Hispanic
5 = White

[Figure: box plots of age in separate panels for race 3, 4 and 5. y-axis: age, ticks at 50 and 100. Note: Graphs by race.]

SLIDE 34

Draw the box plots for the different races using "over" option

graph box age if birth_year>1900&race>=3&race<=5,over(race)

3 = Black
4 = Hispanic
5 = White

[Figure: box plots of age for race 3, 4 and 5 side by side in one panel. y-axis: age, ticks at 50 and 100.]

SLIDE 35

A note about histograms with unnatural categories

From the Current Population Survey (2000), Voter and Registration Survey

How long (have you/has name) lived at this address?

-9   No Response
-3   Refused
-2   Don't know
-1   Not in universe
 1   Less than 1 month
 2   1-6 months
 3   7-11 months
 4   1-2 years
 5   3-4 years
 6   5 years or longer

SLIDE 36

Solution, Step 1

Map artificial category onto "natural" midpoint

-9   No Response          ->  missing
-3   Refused              ->  missing
-2   Don't know           ->  missing
-1   Not in universe      ->  missing
 1   Less than 1 month    ->  1/24 = 0.042
 2   1-6 months           ->  3.5/12 = 0.29
 3   7-11 months          ->  9/12 = 0.75
 4   1-2 years            ->  1.5
 5   3-4 years            ->  3.5
 6   5 years or longer    ->  10 (arbitrary)

recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)

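The recode step is a lookup from survey code to a midpoint measured in years. A minimal Python sketch of the same mapping follows; the names `MIDPOINT_YEARS` and `recode_live_length` are hypothetical, `None` stands in for a missing value, and codes not listed are passed through unchanged (an assumption about how unmatched values should be handled).

```python
# Midpoint, in years, for each substantive category of the CPS question.
MIDPOINT_YEARS = {
    1: 0.042,   # less than 1 month: 1/24 of a year
    2: 0.29,    # 1-6 months: 3.5/12
    3: 0.75,    # 7-11 months: 9/12
    4: 1.5,     # 1-2 years
    5: 3.5,     # 3-4 years
    6: 10,      # 5 years or longer (arbitrary)
}


def recode_live_length(code):
    """Map a survey code to years at address; None marks missing."""
    if code <= -1:              # -9, -3, -2, -1 are all non-answers
        return None
    return MIDPOINT_YEARS.get(code, code)


# Hypothetical responses, invented for illustration only.
responses = [-9, 1, 2, 6, -1, 4]
longevity = [recode_live_length(c) for c in responses]

print(longevity)
```
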
SLIDE 37

Graph of recoded data

histogram longevity, fraction

[Figure: histogram. x-axis: longevity, 1 to 10; y-axis: Fraction, with the tallest bar at .557134.]

SLIDE 38

Density plot of data

Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for: a = w h (or) .557 = 11h => h = .051

[Figure: density plot. x-axis: longevity, 1 to 15.]

SLIDE 39

Density plot template

Category   Fraction   X-min   X-max   X-length   Height (density)
< 1 mo.    .0156      0       1/12    .082       .19*
1-6 mo.    .0909      1/12    ½       .417       .22
7-11 mo.   .0430      ½       1       .500       .09
1-2 yr.    .1529      1       2       1          .15
3-4 yr.    .1404      2       4       2          .07
5+ yr.     .5571      4       15      11         .05

* = .0156/.082

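Each height in the template is the category's fraction divided by the width of its interval, so that the area of every bar equals its fraction. The Python sketch below recomputes the heights from the Fraction, X-min and X-max columns; the X-min of 0 for the first category is inferred from its X-length of .082 (about 1/12) and should be read as an assumption.

```python
# (category, fraction, x_min, x_max), with x in years.
rows = [
    ("< 1 mo.",  0.0156, 0,      1 / 12),   # x_min = 0 is an inferred value
    ("1-6 mo.",  0.0909, 1 / 12, 1 / 2),
    ("7-11 mo.", 0.0430, 1 / 2,  1),
    ("1-2 yr.",  0.1529, 1,      2),
    ("3-4 yr.",  0.1404, 2,      4),
    ("5+ yr.",   0.5571, 4,      15),
]

heights = []
total_area = 0.0
for category, fraction, x_min, x_max in rows:
    width = x_max - x_min
    height = fraction / width        # density = area / width
    heights.append(round(height, 2))
    total_area += height * width     # each bar's area is its fraction

print(heights)
print(round(total_area, 4))
```
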
SLIDE 40

Three words about pie charts: don't use them


So, what's wrong with them

  • For non-time series data, hard to get a comparison among groups; the eye is very bad in judging relative size of circle slices
  • For time series data, hard to grasp cross-time comparisons

SLIDE 42

Some words about graphical presentation

  • Aspects of graphical integrity (following Edward Tufte, Visual Display of Quantitative Information)
      • Main point should be readily apparent
      • Show as much data as possible
      • Write clear labels on the graph
      • Show data variation, not design variation

SLIDE 43

There is a difference between graphs in research and publication

[Figure, labeled "Publish OK": age histograms with panels titled Black, Hispanic, White and Total. x-axis: Age, ticks every 10 years from 20 to 100; y-axis: Density, ticks from .01 to .03.]

[Figure, labeled "Do not publish": age histograms with panels titled 3, 4 and 5. x-axis: age, ticks every 10 years from 20 to 100; y-axis: Density, ticks from .01 to .03. Legend: Density, normal age. Note: Graphs by race.]


MIT OpenCourseWare http://ocw.mit.edu

17.871 Political Science Laboratory

Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.