SLIDE 1

MATH 185 – University of California San Diego – Ery Arias-Castro 1 / 10

Univariate Categorical Data

MATH 185 – Introduction to Computational Statistics
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/math185.html

SLIDE 2

The first 2000 digits of π

We use the pi2000 data in the package UsingR – call ?pi2000.

> library(UsingR)
> str(pi2000)
 num [1:2000] 3 1 4 1 5 9 2 6 5 3 ...

Q: Though this is not the role of a statistician per se, what kind of questions would we ask of such data?

SLIDE 3

Counts/Frequencies

Say we are interested in the number of times each digit appears. We therefore summarize the data as counts in the different categories:

> table(pi2000)
pi2000
  0   1   2   3   4   5   6   7   8   9
181 213 207 189 195 205 200 197 202 211

Alternatively, we can compute frequencies:

> table(pi2000)/length(pi2000)
pi2000
     0      1      2      3      4      5      6      7      8      9
0.0905 0.1065 0.1035 0.0945 0.0975 0.1025 0.1000 0.0985 0.1010 0.1055
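As a hedged aside, the same frequencies can be obtained with base R's prop.table. The sketch below does not load UsingR; it rebuilds a vector with the digit counts shown above. The name `digits` is a hypothetical stand-in for pi2000 (same counts, but the original order of the digits is lost).

```r
# Digit counts for 0..9, copied from the table(pi2000) output above.
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

# Stand-in vector with the same counts (not the actual digit sequence).
digits <- rep(0:9, times = counts)

tab  <- table(digits)      # counts per category
freq <- prop.table(tab)    # same values as tab / length(digits)

print(freq)
```

prop.table divides each count by the total, so it saves passing the sample size separately.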

SLIDE 4

Barplot

For categorical data with a few categories, a barplot is often useful.

> barplot(table(pi2000), col = "#ffffcc")

[Figure: barplot of the digit counts in pi2000]

SLIDE 5

Pie Chart

We can also use a pie chart.

> pie(table(pi2000))

[Figure: pie chart of the digit counts in pi2000]

SLIDE 6

Testing for equal proportions

The Pearson $\chi^2$ goodness-of-fit test: we observe an i.i.d. sample $\xi_1, \dots, \xi_n$ with $P(\xi_i = r_s) = p_s$. We want to test

$$H_0 : p_s = p_s^0 \quad \text{for all } s = 1, \dots, t$$

$$H_1 : \text{there is } s \in \{1, \dots, t\} \text{ such that } p_s \neq p_s^0$$

The Pearson $\chi^2$ goodness-of-fit test rejects when $D$ below is large:

$$D = \sum_{s=1}^{t} \frac{(X_s - n p_s^0)^2}{n p_s^0}$$

where $X_s$ is the number of observations equal to $r_s$.

How large? Under the null, $D$ has approximately the $\chi^2$ distribution with $t - 1$ degrees of freedom.

SLIDE 7

Testing for equal proportions

Here, $n = 2000$, $t = 10$ (with $r_s = s - 1$, i.e. the digits $0, \dots, 9$) and $p_s^0 = 1/10$.

> chisq.test(table(pi2000))

        Chi-squared test for given probabilities

data:  table(pi2000)
X-squared = 4.42, df = 9, p-value = 0.8817

The p-value is fairly large and so there is not enough evidence to reject the null.
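The reported statistic and p-value can be reproduced by hand from the Pearson formula. This is a minimal sketch that uses the digit counts printed earlier instead of loading UsingR; the variable names (`counts`, `n_cat`, `expected`) are illustrative choices, not part of the slides.

```r
# Digit counts for 0..9, copied from the table(pi2000) output.
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

n     <- sum(counts)          # sample size, 2000
n_cat <- length(counts)       # number of categories (t in the slides)
p0    <- rep(1 / n_cat, n_cat)

expected <- n * p0            # expected count under H0, 200 per digit

# Pearson statistic: sum over categories of (observed - expected)^2 / expected.
D <- sum((counts - expected)^2 / expected)

# Upper-tail probability of the chi-squared distribution with t - 1 df.
p_value <- pchisq(D, df = n_cat - 1, lower.tail = FALSE)

cat("D =", D, " p-value =", round(p_value, 4), "\n")
```

The deviations from 200 are small relative to the expected count, which is why D falls well below its null mean of 9.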

SLIDE 8

Testing for Dependencies

There are many possible dependency structures. Here is one example: compute the differences of successive digits, which take values in {−9, ..., 9}, and tabulate them.

> table(diff(pi2000))

 -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9
 18  33  66  93 103 119 145 170 156 190 181 162 131 114 116  83  46  45  28

If the sequence behaved like an i.i.d. sample from the uniform distribution on {0, ..., 9}, the differences would have the following distribution on {−9, ..., 9}:

> p0 = c(1:9, 10, 9:1)/100
> p0
[1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.09 0.08 0.07 0.06
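To see where this triangular distribution comes from, one can enumerate all 100 equally likely pairs of digits and count how often each difference occurs: a difference k arises from 10 − |k| pairs. The sketch below is only a sanity check; `pair_diffs` and `p0_enum` are illustrative names.

```r
# All 100 pairs (a, b) of digits and their differences a - b.
pair_diffs <- outer(0:9, 0:9, "-")

# Under independence and uniformity each pair has probability 1/100,
# so P(difference = k) = (number of pairs with difference k) / 100.
p0_enum <- as.vector(table(pair_diffs)) / 100

# The closed form used in the slides.
p0 <- c(1:9, 10, 9:1) / 100

print(isTRUE(all.equal(p0_enum, p0)))
```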

SLIDE 9

Testing for Dependencies

We therefore perform a $\chi^2$ goodness-of-fit test to check whether the observed differences are compatible with this distribution:

> chisq.test(table(diff(pi2000)), p = p0)

        Chi-squared test for given probabilities

data:  table(diff(pi2000))
X-squared = 19.4219, df = 18, p-value = 0.3663

Again, there is not enough evidence to reject the null.
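As a possible cross-check on the chi-squared approximation, chisq.test can also estimate the p-value by Monte Carlo simulation from the null multinomial distribution. The sketch below uses the difference counts printed earlier instead of loading UsingR; the seed and the number of replicates B are arbitrary choices made here, not values from the slides.

```r
# Counts of the differences -9..9, copied from the table(diff(pi2000)) output.
diff_counts <- c(18, 33, 66, 93, 103, 119, 145, 170, 156, 190,
                 181, 162, 131, 114, 116, 83, 46, 45, 28)

# Null distribution of the differences.
p0 <- c(1:9, 10, 9:1) / 100

set.seed(185)  # arbitrary seed, only for reproducibility
mc <- chisq.test(diff_counts, p = p0, simulate.p.value = TRUE, B = 9999)

D_mc <- unname(mc$statistic)  # the Pearson statistic itself is unchanged
p_mc <- mc$p.value            # p-value estimated from simulated tables

cat("D =", round(D_mc, 4), " Monte Carlo p-value =", round(p_mc, 4), "\n")
```

With expected counts of about 20 or more in every cell, the simulated p-value should land close to the one from the chi-squared approximation.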

SLIDE 10

Testing for Dependencies

A more detail-oriented method counts the number of transitions from digit s to digit t.

If the sequence behaved like an i.i.d. sample from the uniform distribution on {0, ..., 9}, all transitions would be equally likely.

However, there are many (100) such transitions.

Many other approaches exist under the name of tests of randomness, for example tests based on runs.
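The transition-count idea could be sketched as follows. To stay self-contained, the example uses a simulated digit sequence `x` as a stand-in for pi2000 (an assumption made here; with UsingR loaded one would use pi2000 instead). Because consecutive transitions share a digit, the 100 counts are not exactly multinomial, so the chi-squared reference distribution should be treated as a rough guide only.

```r
set.seed(185)                               # arbitrary seed
x <- sample(0:9, 2000, replace = TRUE)      # simulated stand-in for pi2000

# Pair each digit with its successor; fixing the levels keeps all
# 10 x 10 cells in the table even if some transition never occurs.
from <- factor(head(x, -1), levels = 0:9)
to   <- factor(tail(x, -1), levels = 0:9)
trans <- table(from, to)

# Goodness of fit against equal probabilities 1/100 for all transitions.
# as.vector() flattens the table so that chisq.test runs a goodness-of-fit
# test rather than a test of independence on the 2-way table.
res <- chisq.test(as.vector(trans), p = rep(1 / 100, 100))

print(res)
```

With 1999 transitions spread over 100 cells, the expected count per cell is about 20, which illustrates the remark above that the number of categories is large relative to the sample.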