Introduction to Descriptive Statistics
17.871 Spring 2012

Key measures

Describing data   Moment                          Non-mean-based measure
Center            Mean                            Mode, median
Spread            Variance (standard deviation)   Range, interquartile range
Skew              Skewness                        -
Peaked            Kurtosis                        -

Population vs. Sample Notation

Population: Greeks ($\mu$, $\sigma$, $\beta$)
Sample: Romans ($\bar{x}$, $s$, $b$)
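
Variance and standard deviation of a population:

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$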
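
Variance and standard deviation of a sample:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Degrees of freedom: $n - 1$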
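
The only difference between the population and sample formulas is the divisor. A minimal Python sketch, using made-up values and the hypothetical function names `population_variance` and `sample_variance`, shows the two side by side:

```python
import math

def population_variance(xs):
    """sigma^2: average squared deviation from the mean, dividing by n."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def sample_variance(xs):
    """s^2: same sum of squares, dividing by n - 1 degrees of freedom."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, not from the slides

pop_var = population_variance(data)
samp_var = sample_variance(data)
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print("population:", pop_var, pop_sd)
print("sample:    ", samp_var, samp_sd)
```

The sample version is always a little larger, and the gap shrinks as n grows.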

Binary data

$$\bar{x} = \text{prob}(X = 1) = \text{proportion of time } x = 1$$

$$s_x^2 = \bar{x}(1 - \bar{x}), \qquad s_x = \sqrt{\bar{x}(1 - \bar{x})}$$

Candidate           Pct.
Santorum            35
Romney              37
Paul                13
Gingrich            8
[Unaccounted for]   [7]

gen santorum = 1 if candidate=="Santorum"
replace santorum = 0 if candidate~="Santorum"

The command summ santorum produces
Mean = .35
Var = .35(1 - .35) = .2275
S.d. = .4769696
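
The shortcut for a 0/1 variable can be checked against the direct variance formula. The sketch below assumes a hypothetical sample of 100 respondents, 35 of them coded 1, and divides by n. Stata's summarize divides by n - 1, so its output would differ slightly in a sample this small.

```python
import math

# Hypothetical sample: 35 Santorum supporters coded 1, the other 65 coded 0.
santorum = [1] * 35 + [0] * 65

n = len(santorum)
mean = sum(santorum) / n

# Shortcut for binary data: variance = mean * (1 - mean).
var_shortcut = mean * (1 - mean)

# Direct calculation from the definition, dividing by n.
var_direct = sum((x - mean) ** 2 for x in santorum) / n

sd = math.sqrt(var_shortcut)
print(mean, var_shortcut, var_direct, sd)
```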

Examples: IQ, SAT, Height

"No skew"
"Zero skew"
Symmetrical
Mean = median = mode

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$$

[Figure: histogram of HEIGHT (inches), from 46 to 94, against frequency (# PEOPLE). Image by MIT OpenCourseWare.]

Asymmetrical distribution

Income
Contribution to candidates
Populations of countries
"Residual vote" rates

"Positive skew"
"Right skew"

[Figure: right-skewed distribution, Frequency against Value]

Distribution of the average dollars of dividends per tax return (in thousands)
[Figure: density of dividends_pc, with the outlier labeled Hyde County, SD]

Fuel economy of cars for sale in the US
[Figure: density of var1, with the outlier labeled Mitsubishi iMiEV (which is supposed to be all electric)]

Asymmetrical distribution

GPA of MIT students

"Negative skew"
"Left skew"

[Figure: left-skewed distribution, Frequency against Value]

[Figure: density of "place on ideological scale: Republican Party", scale from 20 to 100]

[Figure: Frequency against Value]

[Figure: density of "place on ideological scale: Democratic Party", scale from 20 to 100]
Mean = 26.8; median = 25; mode = 25

Kurtosis

k > 3: leptokurtic
k = 3: mesokurtic
k < 3: platykurtic

[Figure: leptokurtic, mesokurtic, and platykurtic curves, Frequency against Value. Image by MIT OpenCourseWare.]

                             Mean   s.d.   Skew.   Kurt.
Self placement               55.1   26.4   0.14    2.21
Democratic Party placement   26.8   21.2   0.87    3.59
Republican Party placement   74.7   21.8   1.18    4.29

[Figures: densities of "place on ideological scale" for the Democratic Party and the Republican Party]

Source: Cooperative Congressional Election Study, 2008
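
The four columns of the table can be computed from the central moments of a variable. The sketch below is illustrative only: it uses made-up data and the moment-based definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2, each moment divided by n), under which a normal distribution has kurtosis 3. Statistical packages differ in the exact convention they use, so treat the function `describe` as a hypothetical helper rather than a replica of any package's output.

```python
def describe(xs):
    """Return (mean, sd, skewness, kurtosis) using moments divided by n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    sd = m2 ** 0.5
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # not "excess" kurtosis: normal = 3
    return mean, sd, skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]       # made-up, symmetric around 3
right_skewed = [1, 1, 1, 2, 10]   # made-up, long right tail

mean_sym, sd_sym, skew_sym, kurt_sym = describe(symmetric)
mean_rt, sd_rt, skew_rt, kurt_rt = describe(right_skewed)

print(skew_sym, kurt_sym)
print(skew_rt, kurt_rt)
```

The symmetric data have zero skewness, and the long right tail produces a positive value, matching the "positive skew" label used for right-skewed distributions.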

Skewness = 0
Kurtosis = 3

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$$

[Figure: histogram of HEIGHT (inches), from 46 to 94, against frequency (# PEOPLE). Image by MIT OpenCourseWare.]

[Figure: normal density centered at µ. Areas on each side of the mean: 34.1% between µ and 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, 0.1% beyond 3σ. Image by MIT OpenCourseWare.]
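
The percentages on the normal curve follow from the normal cumulative distribution function. As a rough check, this sketch computes them with the standard library's `math.erf`; the helper name `normal_cdf` and the dictionary `bands` are assumptions made for the example.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Area under the curve on one side of the mean, by distance in s.d. units.
bands = {
    "0 to 1 s.d.": normal_cdf(1) - normal_cdf(0),
    "1 to 2 s.d.": normal_cdf(2) - normal_cdf(1),
    "2 to 3 s.d.": normal_cdf(3) - normal_cdf(2),
    "beyond 3 s.d.": 1.0 - normal_cdf(3),
}

for label, area in bands.items():
    print(f"{label}: {100 * area:.1f}%")
```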

summarize varname
summarize varname, detail
histogram varname, bin() start() width() density/fraction/frequency normal
graph box varnames
tabulate

Question: does the age of voters vary by race?

Combine Florida voter extract files, 2008

gen new_birth_date=date(birth_date,"MDY")
gen birth_year=year(new_birth_date)
gen age= 2010-birth_year
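
The same two steps, parsing a month/day/year string and subtracting the birth year from 2010, can be sketched in Python. The exact text format of birth_date in the voter file is not shown in the slides; the sketch assumes "MM/DD/YYYY", and `age_in_2010` is a hypothetical helper name.

```python
from datetime import datetime

def age_in_2010(birth_date):
    """Age as 2010 minus birth year, ignoring month and day."""
    # Assumed input format: "MM/DD/YYYY" (the slides only say "MDY").
    parsed = datetime.strptime(birth_date, "%m/%d/%Y")
    return 2010 - parsed.year

# Made-up birth dates for illustration.
example_ages = [age_in_2010(d) for d in ["07/04/1962", "12/31/1990", "01/15/1850"]]
print(example_ages)
```

A birth year such as 1850 yields an implausible age, which is why a filter like birth_year>1900 is useful before summarizing.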

[Figure: density of birth_year, from 1850 to 2000]

. table race if birth_year>1900,c(mean age)

race | mean(age)
   1 | 45.61229
   2 | 42.89916
   3 | 42.6952
   4 | 45.09718
   5 | 52.08628
   6 | 44.77392
   9 | 40.86704

3 = Black
4 = Hispanic
5 = White
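
A table of group means like this one amounts to filtering the records and averaging within each group. Here is a small Python sketch with made-up records; the function name `mean_age_by_race` and the (race, birth_year) tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def mean_age_by_race(records, min_birth_year=1900):
    """Mean of (2010 - birth_year) by race, keeping birth_year > min_birth_year."""
    ages = defaultdict(list)
    for race, birth_year in records:
        if birth_year > min_birth_year:  # drop implausible birth years
            ages[race].append(2010 - birth_year)
    return {race: sum(vals) / len(vals) for race, vals in sorted(ages.items())}

# Made-up (race code, birth year) pairs; the 1850 record is filtered out.
records = [(3, 1970), (3, 1960), (5, 1950), (5, 1940), (5, 1850)]
group_means = mean_age_by_race(records)
print(group_means)
```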

[Figure: histogram of age, Density against age from 20 to 100]
. hist age if birth_year>1900
(bin=71, start=9, width=1.3802817)

[Figure: histogram of age with bin width 1, Density against age from 20 to 100]
. hist age if birth_year>1900,width(1)

histogram totalscore, width(1) xlabel(-.2 (.1) 1)

[Figure: histogram of age, Density against age, x-axis labeled from 20 to 100 in steps of 10]

Superimposing a normal curve
(with the same mean and s.d. as the empirical distribution)

hist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal

[Figure: histogram of age with a normal curve overlaid, Density against age from 20 to 100]

age

      Percentiles   Smallest
 1%       18            9
 5%       21           16
10%       24           16       Obs          12612114
25%       34           16       Sum of Wgt.  12612114

50%       48                    Mean         49.47549
                    Largest     Std. Dev.    19.01049
75%       63          107
90%       77          107       Variance     361.3986
95%       83          107       Skewness     .2629496
99%       91          107       Kurtosis     2.222442

hist age if birth_year>1900&race>=3&race<=5,wid(1) xlabel(20 (10) 100) normal by(race)

3 = Black
4 = Hispanic
5 = White

[Figure: histograms of age with normal curves, one panel each for race 3, 4, and 5; Density against age from 20 to 100; Graphs by race]

Proper level of aggregation
Non-regular data categories

graph box age if birth_year>1900

[Figure: box plot of age, with labels for the upper quartile, median, lower quartile, the inter-quartile range, and 1.5 x IQR]
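
The quantities labeled on a box plot can be computed directly. The sketch below uses Python's `statistics.quantiles` with its "inclusive" interpolation method, which is only one of several quartile conventions and may not match the rule Stata applies; the data and the helper name `box_plot_summary` are made up for illustration.

```python
import statistics

def box_plot_summary(xs):
    """Quartiles, IQR, and the 1.5 x IQR fences used to flag outside values."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outside = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outside": outside,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up values with one extreme point
summary = box_plot_summary(data)
print(summary)
```

Points beyond the fences are the ones a box plot typically draws as individual markers.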

graph box age if birth_year>1900&race>=3&race<=5,by(race)

3 = Black
4 = Hispanic
5 = White

[Figure: box plots of age in separate panels for race 3, 4, and 5; Graphs by race]

Draw the box plots for the different races using "over" option

graph box age if birth_year>1900&race>=3&race<=5,over(race)

3 = Black
4 = Hispanic
5 = White

[Figure: box plots of age for race 3, 4, and 5 on a single set of axes]

From the Current Population Survey (2000), Voter and Registration Survey

How long (have you/has name) lived at this address?

-9   No Response
-3   Refused
-2   Don't know
-1   Not in universe
 1   Less than 1 month
 2   1-6 months
 3   7-11 months
 4   1-2 years
 5   3-4 years
 6   5 years or longer

-9   No Response          missing
-3   Refused              missing
-2   Don't know           missing
-1   Not in universe      missing
 1   Less than 1 month    1/24 = 0.042
 2   1-6 months           3.5/12 = 0.29
 3   7-11 months          9/12 = 0.75
 4   1-2 years            1.5
 5   3-4 years            3.5
 6   5 years or longer    10 (arbitrary)

recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)
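
The recode maps each category code to an approximate midpoint in years and turns the negative codes into missing values. A Python sketch of the same mapping follows; `RECODE` and `recode_live_length` are hypothetical names, and `None` stands in for a missing value.

```python
# Category code -> approximate length of residence in years (from the table).
RECODE = {1: 0.042, 2: 0.29, 3: 0.75, 4: 1.5, 5: 3.5, 6: 10.0}

def recode_live_length(code):
    """Map a survey code to years; codes of -1 or below become missing (None)."""
    if code <= -1:
        return None
    # Codes without a rule are passed through unchanged.
    return RECODE.get(code, code)

responses = [-9, -1, 1, 2, 6]  # made-up responses
longevity = [recode_live_length(c) for c in responses]
print(longevity)
```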

histogram longevity, fraction

[Figure: histogram of longevity, Fraction against longevity from 1 to 10; the last bar reaches .557134]

Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for: a = w h (or) .557 = 11h => h = .051

[Figure: histogram of longevity with the last bar redrawn from 4 to 15]

Category   Fraction   Xmin   Xmax   Xlength   Height (density)
< 1 mo.    .0156      0      1/12   .082      .19*
1-6 mo.    .0909      1/12   1/2    .417      .22
7-11 mo.   .0430      1/2    1      .500      .09
1-2 yr.    .1529      1      2      1         .15
3-4 yr.    .1404      2      4      2         .07
5+ yr.     .5571      4      15     11        .05

* = .0156/.082
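
Each height in the table is the category's fraction divided by its width, so that bar area rather than bar height represents the share of cases. The sketch below recomputes the heights from the table's fractions and bounds. It uses the exact width 1/12 instead of the rounded .082, and the upper bound of 15 years is the arbitrary choice made on the slide.

```python
# (label, fraction, xmin, xmax) taken from the table; bounds are in years.
categories = [
    ("< 1 mo.", 0.0156, 0.0, 1 / 12),
    ("1-6 mo.", 0.0909, 1 / 12, 0.5),
    ("7-11 mo.", 0.0430, 0.5, 1.0),
    ("1-2 yr.", 0.1529, 1.0, 2.0),
    ("3-4 yr.", 0.1404, 2.0, 4.0),
    ("5+ yr.", 0.5571, 4.0, 15.0),  # 15 is an arbitrary upper bound
]

# Density height = fraction / bar width, so that area = fraction.
heights = {
    label: fraction / (xmax - xmin)
    for label, fraction, xmin, xmax in categories
}

# Summing height * width recovers the total fraction of cases.
total_area = sum(
    heights[label] * (xmax - xmin) for label, _, xmin, xmax in categories
)

for label, height in heights.items():
    print(f"{label}: {height:.2f}")
```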

For non-time-series data, it is hard to get a comparison among groups; the eye is very bad at judging the relative size of circle slices
For time series data, it is hard to grasp cross-time comparisons

Aspects of graphical integrity (following Edward Tufte, Visual Display of Quantitative Information)

Main point should be readily apparent
Show as much data as possible
Write clear labels on the graph
Show data variation, not design variation

There is a difference between graphs in research and publication

Publish OK
[Figure: histograms of Age with normal curves, panels labeled Black, Hispanic, White, and Total; Density against Age from 20 to 100]

Do not publish
[Figure: histograms of age with normal curves, panels labeled 3, 4, and 5; Density against age from 20 to 100; Graphs by race]

MIT OpenCourseWare
http://ocw.mit.edu

17.871 Political Science Laboratory
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.