MATH 185 – University of California San Diego – Ery Arias-Castro 1 / 10
Univariate Categorical Data
MATH 185 Introduction to Computational Statistics
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/ eariasca/math185.html
The first 2000 digits of π
We use the pi2000 data in the package UsingR – call ?pi2000.
> library(UsingR)
> str(pi2000)
 num [1:2000] 3 1 4 1 5 9 2 6 5 3 ...
Q: Though this is not the role of a statistician per se, what kind of questions would we ask of such data?
Counts/Frequencies
Say we are interested in the number of times each digit appears. We therefore summarize the data as counts in the different categories:
> table(pi2000)
pi2000
  0   1   2   3   4   5   6   7   8   9
181 213 207 189 195 205 200 197 202 211
Alternatively, we can compute frequencies:
> table(pi2000)/length(pi2000)
pi2000
     0      1      2      3      4      5      6      7      8      9
0.0905 0.1065 0.1035 0.0945 0.0975 0.1025 0.1000 0.0985 0.1010 0.1055
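As a minimal sketch, the same frequencies can be obtained with the base R function `prop.table`. To keep the example runnable without the UsingR package, the ten counts are typed in by hand from the table above, assuming they correspond to the digits 0 through 9; the names `counts` and `freqs` are illustrative only.

```r
# Counts of the digits 0, ..., 9 copied from the table(pi2000) output above
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)
names(counts) <- 0:9

# prop.table divides each entry by the total, giving relative frequencies
freqs <- prop.table(counts)
print(freqs)
```

On the original data, `prop.table(table(pi2000))` should give the same result as `table(pi2000)/length(pi2000)`.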
Barplot
For categorical data with a few categories, a barplot is often useful.
> barplot(table(pi2000), col = "#ffffcc")
[Figure: barplot of the counts for each digit]
Pie Chart
We can also use a pie chart.
> pie(table(pi2000))
[Figure: pie chart of the counts for each digit]
Testing for equal proportions
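The Pearson χ² goodness-of-fit test: We observe an i.i.d. sample $\xi_1, \dots, \xi_n$ with $P(\xi_i = r_s) = p_s$ for $s = 1, \dots, t$. We want to test

$$H_0 : p_s = p_s^0 \ \text{ for all } s = 1, \dots, t$$

$$H_1 : \text{there is } s \in \{1, \dots, t\} \text{ such that } p_s \neq p_s^0$$

Let $X_s = \#\{i : \xi_i = r_s\}$ denote the number of observations in category $s$. The Pearson χ² goodness-of-fit test rejects when $D$ below is large:

$$D = \sum_{s=1}^{t} \frac{(X_s - n p_s^0)^2}{n p_s^0}$$

How large? Under the null, $D$ has approximately the χ² distribution with $t - 1$ degrees of freedom.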
Testing for equal proportions
Here, $n = 2000$, $t = 10$ (with $r_s = s - 1$, the digits 0 through 9) and $p_s^0 = 1/10$.
> chisq.test(table(pi2000))

        Chi-squared test for given probabilities

data:  table(pi2000)
X-squared = 4.42, df = 9, p-value = 0.8817
The p-value is fairly large and so there is not enough evidence to reject the null.
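The statistic and p-value reported by `chisq.test` can be reproduced directly from the definition of D. The sketch below is illustrative rather than part of the original slides: the counts are typed in by hand from the `table(pi2000)` output so that it runs without UsingR, and the names `counts`, `n_cat`, `expected`, `D` and `p_value` are assumptions made for the example.

```r
# Observed counts X_s for the digits 0, ..., 9 (copied from table(pi2000))
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

n     <- sum(counts)        # sample size, here 2000
n_cat <- length(counts)     # number of categories t, here 10
p0    <- rep(1 / n_cat, n_cat)  # null probabilities: uniform on the digits

# Pearson statistic: sum over categories of (observed - expected)^2 / expected
expected <- n * p0
D <- sum((counts - expected)^2 / expected)

# Upper tail of the chi-squared distribution with t - 1 degrees of freedom
p_value <- pchisq(D, df = n_cat - 1, lower.tail = FALSE)

print(c(D = D, p_value = p_value))
```

The values should agree with the `X-squared` and `p-value` entries printed by `chisq.test` above.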
Testing for Dependencies
Many dependency structures are possible. Here is an example: compute the differences of successive digits, which take values in {−9, …, 9}, and tabulate them.
> table(diff(pi2000))

 -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9
 18  33  66  93 103 119 145 170 156 190 181 162 131 114 116  83  46  45  28
If the sequence behaved like an i.i.d. sample from the uniform on {0, . . . , 9}, the differences would have the following distribution on {−9, . . . , 9}
> p0 = c(1:9, 10, 9:1)/100
 [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.09 0.08 0.07 0.06
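To see where this triangular distribution comes from, one can enumerate all 100 equally likely pairs of digits and tabulate their differences. This is a small sketch, not taken from the slides; `diffs` and `p_enum` are illustrative names.

```r
# All 100 differences a - b for a, b in 0, ..., 9; under the null
# (independent uniform digits) each pair has probability 1/100
diffs <- as.vector(outer(0:9, 0:9, "-"))

# Proportion of pairs giving each difference, ordered from -9 to 9
p_enum <- as.numeric(table(diffs)) / length(diffs)

# Closed form used in the slides
p0 <- c(1:9, 10, 9:1) / 100

print(isTRUE(all.equal(p_enum, p0)))
```

A difference of d arises from 10 − |d| of the 100 pairs, which gives the probabilities (10 − |d|)/100.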
Testing for Dependencies
We therefore perform a χ² goodness-of-fit test to check whether the observed differences are consistent with that distribution.
> chisq.test(table(diff(pi2000)), p = p0)

        Chi-squared test for given probabilities

data:  table(diff(pi2000))
X-squared = 19.4219, df = 18, p-value = 0.3663
Again, there is not enough evidence to reject the null.
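For readers without the UsingR package, the test can be rerun from the tabulated counts alone. The sketch below assumes the 19 counts listed earlier correspond to the differences −9 through 9 in order; `diff_counts` and `res` are illustrative names.

```r
# Counts of the successive differences -9, ..., 9, copied from the
# table(diff(pi2000)) output (1999 differences in total)
diff_counts <- c(18, 33, 66, 93, 103, 119, 145, 170, 156, 190,
                 181, 162, 131, 114, 116, 83, 46, 45, 28)
names(diff_counts) <- -9:9

# Null distribution of the difference of two independent uniform digits
p0 <- c(1:9, 10, 9:1) / 100

# Goodness-of-fit test of the observed counts against p0
res <- chisq.test(diff_counts, p = p0)
print(res)
```

The statistic, degrees of freedom and p-value should match the output shown above.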