SLIDE 1

MATH 185 – University of California San Diego – Ery Arias-Castro 1 / 10

Univariate Categorical Data

MATH 185 – Introduction to Computational Statistics
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/math185.html

SLIDE 2

The first 2000 digits of π

We use the pi2000 data in the package UsingR – call ?pi2000.

> library(UsingR)
> str(pi2000)
 num [1:2000] 3 1 4 1 5 9 2 6 5 3 ...

Q: Though this is not the role of a statistician per se, what kind of questions would we ask of such data?

SLIDE 3

Counts/Frequencies

Say we are interested in how often each digit appears. We therefore summarize the data as counts in the different categories.

> table(pi2000)
pi2000
  0   1   2   3   4   5   6   7   8   9
181 213 207 189 195 205 200 197 202 211

Alternatively, we can compute frequencies

> table(pi2000)/length(pi2000)
pi2000
     0      1      2      3      4      5      6      7      8      9
0.0905 0.1065 0.1035 0.0945 0.0975 0.1025 0.1000 0.0985 0.1010 0.1055
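
As a minimal sketch that does not require the UsingR package, the counts printed above can be entered by hand and converted to relative frequencies with base R's prop.table. The names `counts` and `freqs` are chosen here for illustration, and the labels assume the ten counts correspond to the digits 0 to 9 in order.

```r
# Digit counts copied from the table(pi2000) output above;
# the labels 0-9 are assumed to match the order of the counts.
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)
names(counts) <- 0:9

# prop.table divides each count by the total, giving relative frequencies.
freqs <- prop.table(counts)
print(freqs)
```

This should give the same numbers as dividing the table by length(pi2000), since the counts sum to the sample size.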

SLIDE 4

Barplot

For categorical data with a few categories, a barplot is often useful.

> barplot(table(pi2000), col = "#ffffcc")

[Figure: barplot of the digit counts, one bar per digit, vertical axis running up to 200.]

SLIDE 5

Pie Chart

We can also use a pie chart.

> pie(table(pi2000))

[Figure: pie chart of the digit counts, one slice per digit.]

SLIDE 6

Testing for equal proportions

The Pearson χ²-goodness-of-fit test: We observe an i.i.d. sample $\xi_1, \dots, \xi_n$ with $P(\xi_i = r_s) = p_s$ for $s = 1, \dots, t$. We want to test

$$H_0 : p_s = p_s^0 \text{ for all } s = 1, \dots, t$$

$$H_1 : \text{there is } s \in \{1, \dots, t\} \text{ such that } p_s \neq p_s^0$$

The Pearson χ²-goodness-of-fit test rejects when $D$ below is large:

$$D = \sum_{s=1}^{t} \frac{(X_s - n p_s^0)^2}{n p_s^0}$$

where $X_s$ is the number of observations equal to $r_s$.

How large? Under the null, $D$ has approximately the $\chi^2$ distribution with $t - 1$ degrees of freedom.

SLIDE 7

Testing for equal proportions

Here, $n = 2000$, $t = 10$ (with $r_s = s - 1$, i.e. the digits $0, \dots, 9$) and $p_s^0 = 1/10$ for all $s$.

> chisq.test(table(pi2000))

        Chi-squared test for given probabilities

data:  table(pi2000)
X-squared = 4.42, df = 9, p-value = 0.8817

The p-value is fairly large and so there is not enough evidence to reject the null.
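
The statistic and p-value reported by chisq.test can be reproduced by hand from the digit counts. The following is a minimal sketch in base R, assuming the counts are typed in from the table(pi2000) output rather than read from the UsingR package; the variable names (`observed`, `p_null`, `expected`, `D`, `p_value`) are chosen here for illustration.

```r
# Observed digit counts, copied from the table(pi2000) output.
observed <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

n <- sum(observed)            # sample size, 2000
p_null <- rep(1 / 10, 10)     # equal proportions under the null
expected <- n * p_null        # expected count per digit, 200

# Pearson statistic: sum of (observed - expected)^2 / expected.
D <- sum((observed - expected)^2 / expected)

# Upper tail of the chi-squared distribution with t - 1 = 9 degrees of freedom.
p_value <- pchisq(D, df = length(observed) - 1, lower.tail = FALSE)

print(c(D = D, p_value = p_value))
```

Computing the statistic directly makes it clear that the p-value is just the upper tail probability of the approximating chi-squared distribution at the observed value of D.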

SLIDE 8

Testing for Dependencies

There are many possible dependency structures. Here is one example: compute the differences of successive digits, which take values in {−9, . . . , 9}, and tabulate them.

> table(diff(pi2000))

 -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9
 18  33  66  93 103 119 145 170 156 190 181 162 131 114 116  83  46  45  28

If the sequence behaved like an i.i.d. sample from the uniform on {0, . . . , 9}, the differences would have the following distribution on {−9, . . . , 9}

> p0 = c(1:9, 10, 9:1)/100
> p0
 [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.09 0.08 0.07 0.06
[15] 0.05 0.04 0.03 0.02 0.01
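
The triangular shape of this distribution can be checked by enumerating all 100 ordered pairs of digits, each of which has probability 1/100 under independence and uniformity. The sketch below is an illustration in base R; the names `diffs` and `p0_check` are assumptions introduced here, not part of the original slides.

```r
# All 100 ordered pairs (a, b) of digits, each with probability 1/100
# under independence and uniformity; entry [a, b] holds b - a.
diffs <- outer(0:9, 0:9, function(a, b) b - a)

# Fraction of pairs giving each difference from -9 to 9.
p0_check <- table(diffs) / 100

# Compare with the closed form used on the slide.
p0 <- c(1:9, 10, 9:1) / 100
print(all.equal(as.numeric(p0_check), p0))
```

For example, a difference of 9 arises only from the pair (0, 9), while a difference of 0 arises from the ten pairs (a, a).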

SLIDE 9

Testing for Dependencies

We therefore perform a χ²-goodness-of-fit test to check whether the observed differences are consistent with this distribution.

> chisq.test(table(diff(pi2000)), p = p0)

        Chi-squared test for given probabilities

data:  table(diff(pi2000))
X-squared = 19.4219, df = 18, p-value = 0.3663

Again, there is not enough evidence to reject the null.
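
For readers without the UsingR package, the same test can be sketched from the tabulated difference counts alone. This assumes the 19 counts shown in the table(diff(pi2000)) output correspond to the differences −9, . . . , 9 in order; `diff_counts` and `res` are names introduced here for illustration.

```r
# Counts of the differences -9, ..., 9, copied from table(diff(pi2000)).
diff_counts <- c(18, 33, 66, 93, 103, 119, 145, 170, 156, 190,
                 181, 162, 131, 114, 116, 83, 46, 45, 28)
names(diff_counts) <- -9:9

# Null distribution of the difference of two independent uniform digits.
p0 <- c(1:9, 10, 9:1) / 100

# chisq.test accepts a vector of counts together with null probabilities p.
res <- chisq.test(diff_counts, p = p0)
print(res$statistic)   # Pearson statistic
print(res$parameter)   # degrees of freedom
print(res$p.value)
```

Note that there are 1999 differences for 2000 digits, so the expected counts are 1999 times the entries of p0.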

SLIDE 10

Testing for Dependencies

A more detail-oriented method counts the number of transitions from digit s to digit t.

If the sequence behaved like an i.i.d. sample from the uniform on {0, . . . , 9}, all transitions would be equally likely.

However, there are many (100) such transitions.

There are many other approaches, known as Tests of Randomness, for example tests based on runs.
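
The transition counts can be sketched in base R as follows. To keep the example self-contained, it uses simulated digits as a stand-in for pi2000 (an assumption; with UsingR loaded one could substitute x <- pi2000). The names `x`, `transitions`, `pairs` and `res` are hypothetical. The choice to test non-overlapping pairs is made here, not taken from the slides: successive transitions share a digit and so are not independent, whereas non-overlapping pairs are independent under the null.

```r
set.seed(1)                                # arbitrary seed, for reproducibility
x <- sample(0:9, 2000, replace = TRUE)     # simulated stand-in for the digits

# 10 x 10 table of transitions: rows = current digit, columns = next digit.
from <- factor(head(x, -1), levels = 0:9)
to   <- factor(tail(x, -1), levels = 0:9)
transitions <- table(from, to)

# Non-overlapping pairs (x1, x2), (x3, x4), ... are independent under the null,
# so their 100 cell counts can be tested against equal probabilities.
idx <- seq(1, length(x) - 1, by = 2)
pairs <- table(factor(x[idx], levels = 0:9),
               factor(x[idx + 1], levels = 0:9))
res <- chisq.test(as.vector(pairs))        # default p: all cells equally likely

print(res$parameter)                       # degrees of freedom
```

With 100 cells and only 1000 non-overlapping pairs, the expected count per cell is 10, which illustrates why having many transitions makes this approach demanding in terms of sample size.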