Random projections, reweighting and half-sampling for high-dimensional statistical inference (PowerPoint PPT Presentation)



SLIDE 1

MC for high-dimensional statistics 1

Random projections, reweighting and half-sampling for high-dimensional statistical inference

Art B. Owen, Stanford University
Based on joint works with Dean Eckles (Facebook Inc.) and Sarah Emerson (Oregon State University)

MCQMC 2012, February 2012

SLIDE 2

About these slides

These are the slides I presented on February 15 at MCQMC 2012 in Sydney, Australia. I have corrected some typos and extended the presentation of the challenging integral over the Stiefel manifold. A few of these slides were skipped over in order to allow time for questions. This talk covers two projects. The bootstrap work with Dean Eckles has now been accepted by the Annals of Applied Statistics. The projection work with Sarah Emerson is still in progress.

SLIDE 3

Monte Carlo methods for statistics

Mainstays:
1) Markov chain Monte Carlo
2) Bootstrap resampling

We also use:
1) Random permutations
2) Random projections
3) Sample splitting

Probability/Statistics and Monte Carlo are closely intertwined.

SLIDE 4

Statistics and Monte Carlo

[Image: M. C. Escher (1948)]

This talk will show some uses of MC in statistics.

SLIDE 5

Some statistical notions

X ∼ F: the random vector X has distribution F.

Xi iid∼ F: the Xi are statistically Independent and Identically Distributed (IID) from F.

Nd(µ, Σ): the Gaussian distribution with mean µ ∈ Rd and variance-covariance matrix Σ ∈ Rd×d. Here X ∼ N(µ, Σ) means

Pr(X ∈ A) = ∫_A f(x) dx, where f(x) = exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ)) / ((2π)^{d/2} det(Σ)^{1/2}).

p-values: observe T = t and compute p = Pr(T ≥ t). If p < 0.01 then the observed value t happens 1% or less of the time. This is evidence against the hypothesized distribution of T.

SLIDE 6

Problem one

We have

X1, …, Xnx iid∼ F in Rd and Y1, …, Yny iid∼ G in Rd.

Is F = G? We might assume F = N(µ1, Σ) and G = N(µ2, Σ). Then we test µ1 = µ2. This is an old problem.

Revived interest when d ≫ nx + ny:

DNA microarrays: expression levels of d ≈ 30,000 genes on nx healthy and ny diseased individuals, with nx, ny in the tens or hundreds.

Genome-wide association studies: d ≈ 2,000,000 markers with nx, ny in the thousands or more.

Also: fMRI, finance.

SLIDE 7

Illustration

[Scatterplot: 50 red and 50 black points in R2, axes X1 and X2.]

The black points are normally distributed; the red points are shifted Northwest relative to the black ones.

X1: not significantly different, p = 0.47
X2: not significantly different, p = 0.09
X1 + X2: not significantly different, p = 0.60
X1 − X2: very significantly different, p = 1.7 × 10⁻⁴

So: how do we find the interesting projection?

SLIDE 8

Hotelling’s T²

Find θ ∈ Rd with θᵀθ = 1 to maximize the apparent separation between X̃i = θᵀXi ∈ R and Ỹi = θᵀYi ∈ R. The answer depends on

X̄ = (1/nx) ∑_{i=1}^{nx} Xi,  Sx = ∑_{i=1}^{nx} (Xi − X̄)(Xi − X̄)ᵀ,
Ȳ = (1/ny) ∑_{i=1}^{ny} Yi,  Sy = ∑_{i=1}^{ny} (Yi − Ȳ)(Yi − Ȳ)ᵀ.

Algebraically we get

T² = (nxny/(nx + ny)) (X̄ − Ȳ)ᵀS⁻¹(X̄ − Ȳ), where S = (Sx + Sy)/(nx + ny − 2).

Hotelling (1931). For the illustration we get T² = 18.58, with Pr(T² ≥ 18.58) = 2.6 × 10⁻⁴ (p-value).
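As a concrete sketch (our own code, not part of the talk; the function name and the illustrative shift are assumptions), the pooled-covariance T² above takes a few lines of NumPy:

```python
import numpy as np

def hotelling_T2(X, Y):
    # Pooled covariance S = (Sx + Sy)/(nx + ny - 2), then
    # T^2 = (nx*ny/(nx+ny)) (Xbar - Ybar)' S^{-1} (Xbar - Ybar)
    nx, ny = X.shape[0], Y.shape[0]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Sx = (X - xbar).T @ (X - xbar)   # sums of squares, not divided by n
    Sy = (Y - ybar).T @ (Y - ybar)
    S = (Sx + Sy) / (nx + ny - 2)
    diff = xbar - ybar
    return (nx * ny / (nx + ny)) * diff @ np.linalg.solve(S, diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = rng.normal(size=(50, 2)) + np.array([-0.3, 0.3])  # a Northwest-style shift
t2 = hotelling_T2(X, Y)
```

A useful sanity check: unlike any single projection θᵀX, T² is invariant under invertible linear maps of the data.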

SLIDE 9

In high dimensions

When d ≫ nx + ny the covariance matrix S is not invertible, so we can’t use

T² = (nxny/(nx + ny)) (X̄ − Ȳ)ᵀS⁻¹(X̄ − Ȳ).

Geometrically: some projection θ ∈ Rd has θᵀXi = constant for i = 1, …, nx and θᵀYi = a different constant. That is, we will get perfect separation, even if F = G.

A classic remedy by Dempster (1958) takes

T²_Dempster = (nxny/(nx + ny)) ‖X̄ − Ȳ‖² / tr(S) = (nxny/(nx + ny)) ∑_{j=1}^d (X̄j − Ȳj)² / ∑_{j=1}^d Sjj,

but this makes no use of correlations. Recent improvements by Bai, Saradanasa, Hall, Fan, Chen, and Srivastava also don’t use correlations.

SLIDE 10

Random projections

Lopes, Jacob, Wainwright (2011): choose a random Θ ∈ Rd×k with ΘᵀΘ = Ik. Put

X̃i = ΘᵀXi and Ỹi = ΘᵀYi.

Then use

T²_Θ = (nxny/(nx + ny)) (X̄ − Ȳ)ᵀΘ(ΘᵀSΘ)⁻¹Θᵀ(X̄ − Ȳ),

which exists if k < nx + ny − 2. That is: project the data into a random k-dimensional subspace and test the means of the projected data. This retains some of the correlations.

SLIDE 11

Uniform random projections

To project from Rd to R: normalize a Gaussian vector,

θ = Z/‖Z‖, Z ∼ N(0, Id).

To project from Rd to Rk: take Z ∈ Rd×k with entries Zij iid∼ N(0, 1). Gram-Schmidt yields Z = QR; deliver Θ = Q ∈ Rd×k and project X̃i = ΘᵀXi.

Any QR decomposition with positive Rii will do.
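A minimal NumPy sketch of this recipe (our own code, not the authors’): QR-factor a Gaussian matrix and flip column signs so that the Rii are positive, giving a Haar-distributed Θ.

```python
import numpy as np

def uniform_projection(d, k, rng):
    # Z has iid N(0,1) entries; Q from Z = QR is uniform on the Stiefel
    # manifold V_{d,k} once we force the diagonal of R to be positive.
    Z = rng.standard_normal((d, k))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))   # flip any column where R_ii < 0

rng = np.random.default_rng(1)
Theta = uniform_projection(200, 49, rng)   # project R^200 -> R^49 via Theta.T @ x
```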

SLIDE 12

Lopes et al. ctd.

They make just one random projection of the data, and find that k ≈ (nx + ny − 2)/2 performs well.

Why just one? If your one projection is ‘unlucky’ then you might miss the pattern. But with just one projection the distribution of T²_Θ is known.

Multiple projections: take T̄² = (1/M) ∑_{i=1}^M T²_i with T²_i based on Θi ∈ Rd×k. We get some kind of ‘average’ luck. But the distribution of T̄² is not known.

SLIDE 13

Multiple projections

Work with S. Emerson: average over M independent random Θi ∈ Rd×k,

T̄² = (1/M) ∑_{i=1}^M T²_i, where T²_i = (nxny/(nx + ny)) (X̄ − Ȳ)ᵀΘi(Θᵢᵀ S Θi)⁻¹Θᵢᵀ(X̄ − Ȳ).

Easily:
1) E(T̄²) = E(T²_i)
2) Var(T̄²) < Var(T²_i), unless both are infinite! (averaging reduces variance)

Less easily:
1) Finite variance requires k ≤ nx + ny − 6
2) Finite mean requires k ≤ nx + ny − 4

Unfortunately, the distribution of T̄² is not known.

SLIDE 14

Separation

Simulate 2000 data sets,

Xi iid∼ N(0, Σ), Yi iid∼ N(δ, Σ), δ ∈ Rd.

Of these: 1000 null cases with δ = 0 and 1000 non-null cases with δ ≠ 0.

Rank the 2000 T̄² scores and see if the nulls get smaller T̄² values. The ROC* curve, shown later, shows how well the test separates the two cases.

*Receiver Operating Characteristic (don’t ask)

SLIDE 15

Simulated case

Xi, Yi ∈ R200, nx = ny = 50.

Pick δ uniform on the 200-dimensional sphere with ‖δ‖ = 3. Pick Σ = Id × 50/√d.

Why these? A uniform δ means that the group separation is unrelated to the covariance structure. Debatable; we follow Lopes et al. in making this assumption.

WLOG, under uniformity, Σ = diag(λ1, …, λd) with λ1 ≥ λ2 ≥ · · · ≥ λd. The interesting cases are equal λj and rapidly decreasing λj.

SLIDE 16

Multiple projections

[Histograms of simulated T̄² under the Null and the Alternative, for M = 1 and M = 32.]

nx = ny = 50, d = 200, k = 49. Null: δ = 0. Alt: ‖δ‖ = 3.

SLIDE 17

The ROCs

[ROC curves for M = 1, 2, 4, 8, 16, 32: true positives vs. false positives.]

Larger M has greater area under the curve:

M   AUC
1   71.9
2   80.6
4   87.1
8   91.4
16  94.3
32  95.7

SLIDE 18

Varying k

Lopes et al. prefer k ≈ (nx + ny − 2)/2. That is not always optimal, but it may be a good default. For the previous scenario, small k does relatively poorly; 32 ≤ k ≤ 56 all gave AUC ≈ 0.95 with M = 32.

Other scenarios: S. Emerson finds that the advantage of averaging persists under other decay rates for the eigenvalues of Σ.

SLIDE 19

Using T̄²

The usual p-value is Pr(T̄² ≥ t²), where t² is the observed value on our data. We have no good approximation for this. Even the moments of T̄² involve difficult integrals over Θ ∈ Vd,k, the Stiefel manifold, e.g.

∫_{Θ∈Vd,k} Θ(ΘᵀSΘ)⁻¹Θᵀ dU(Θ) = (2π)^{−dk/2} ∫_{Z∈Rd×k} Z(ZᵀSZ)⁻¹Zᵀ e^{−tr(ZᵀZ)/2} dZ,

for non-negative diagonal S ∈ Rd×d with nx + ny − 2 positive entries. Here U(Θ) is the uniform (Haar) measure.

The above is the first moment. Closed forms for the first and second moments could lead to useful test statistics.

SLIDE 20

Permutation tests

There are C(nx + ny, nx) ways to allocate nx of the pooled observations (X1, …, Xnx, Y1, …, Yny) to the first sample (the X’s). The re-allocated data (X*1, …, X*nx, Y*1, …, Y*ny) have statistic T̄*². Then

p = #{T̄*² : T̄*² ≥ T̄²} / C(nx + ny, nx).

For nx = ny = 50 there are ≈ 10²⁹ allocations (permutations), so we use a Monte Carlo sample of random permutations. Justified in the text by Lehmann & Romano (2005), “Testing Statistical Hypotheses”. ‘Combination test’ might be a better name.
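A sketch of the Monte Carlo permutation test (our code, with the customary +1 so that the observed allocation counts itself; any statistic, such as T̄², can be plugged in):

```python
import numpy as np

def perm_pvalue(X, Y, statistic, n_perm, rng):
    # Randomly re-allocate the pooled sample into groups of sizes nx, ny
    # and count how often the permuted statistic reaches the observed one.
    pooled = np.concatenate([X, Y])
    nx = len(X)
    observed = statistic(X, Y)
    count = 1                          # the observed allocation itself
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if statistic(pooled[perm[:nx]], pooled[perm[nx:]]) >= observed:
            count += 1
    return count / (n_perm + 1)

rng = np.random.default_rng(5)
X = rng.normal(size=100)
Y = rng.normal(size=100) + 5.0         # a very large shift
p = perm_pvalue(X, Y, lambda a, b: abs(a.mean() - b.mean()), 99, rng)
```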

SLIDE 21

Summary of problem one

• If d > nx + ny − 2, then T² is not defined
• Dempster (almost) used coordinate projections
• Lopes et al. use one random projection from d to k dimensions
• We find benefits from multiple projections
• But we have to use permutation tests for significance
• Monte Carlo enters to compute the test statistic and then to judge its significance

SLIDE 22

Problem two

• How can we judge uncertainty in two-way and higher-way data?
• We would like to use the bootstrap.
• But McCullagh (2000) proved this is impossible.

Solution: we will get a Monte Carlo method that mildly overestimates the sampling uncertainty.

But first, it is necessary to describe the bootstrap, as well as multi-way data.

SLIDE 23

Bootstrap sampling

Data are X1, …, Xn iid∼ F. We compute T̂ = T(X1, …, Xn). What is the sampling uncertainty in T̂, e.g. Var(T̂) = Var(T̂ | F)?

Combine two ideas:

Monte Carlo: sample from F to estimate Var(T̂ | F) (but we don’t know F).

Plug in: use Var(T̂ | F̂), as if F = F̂, the empirical distribution*.

From Efron (1979). There exist extensive variations on the idea.

*The empirical distribution F̂ = (1/n) ∑_{i=1}^n δ_{Xi} puts probability 1/n on each sample observation; δx is a ‘point mass’ at x. To sample X ∼ F̂, pick one of the original data points at random.

SLIDE 24

Bootstrap pseudocode

Given X1, …, Xn iid∼ F, resample the data B times:

For b = 1, …, B
    For i = 1, …, n
        i* ∼ U{1, …, n}
        X*i = Xi*
    T*b = T(X*1, …, X*n)

Compute the summaries

T̄* = (1/B) ∑_{b=1}^B T*b,  V̂ar(T̂) = (1/(B − 1)) ∑_{b=1}^B (T*b − T̄*)².
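The pseudocode above translates directly into NumPy. This is our sketch, not the talk’s code; for T = the sample mean, the answer should come out near the classical s²/n.

```python
import numpy as np

def bootstrap_var(x, statistic, B, rng):
    # Resample n observations with replacement, B times, and take the
    # sample variance of the replicated statistics.
    n = len(x)
    Tstar = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # i* ~ U{1, ..., n}
        Tstar[b] = statistic(x[idx])
    return Tstar.var(ddof=1)

rng = np.random.default_rng(2)
x = rng.normal(size=200)
v = bootstrap_var(x, np.mean, B=2000, rng=rng)   # compare with s^2 / n
```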

SLIDE 25

First reactions

1) How could that ever work? 2) How could that ever fail?

The bootstrap works by a continuity argument.

Plug-in: under conditions, Var(T̂ | F) is continuous in F, and F̂ → F, so Var(T̂ | F̂) → Var(T̂ | F).

Monte Carlo: as B → ∞, the Monte Carlo estimate converges to Var(T̂ | F̂).

It fails when the continuity conditions fail, or when F̂ does not mimic F well enough.

SLIDE 26

Bootstrapping the mean

The base case is

T(X1, …, Xn) = X̄ ≡ (1/n) ∑_{i=1}^n Xi.

We have non-bootstrap methods for Var(X̄) (!). Correctness for the mean extends to more complicated statistics via Taylor approximations.

Why we like it: there is no need to assume the Gaussian or any other distributional form, and it is explainable to scientific colleagues.

SLIDE 27

Some variants

The Bayesian bootstrap, Rubin (1981):

F̂ = ∑_{i=1}^n Wi δ_{Xi} / ∑_{i=1}^n Wi,  Wi iid∼ Exp(1).

That is, place an independent random weight on each observation. If T(X1, …, Xn) is the sample mean, then under this bootstrap

T* = ∑_{i=1}^n Wi Xi / ∑_{i=1}^n Wi

is a random ratio.

SLIDE 28

Weight condition

We need E(Wi) = 1 and Var(Wi) = 1; then the Bayesian bootstrap is comparable to the ordinary bootstrap. One can also use Poi(1) weights (Poisson distribution), used in machine learning by Oza (2001). Half sampling:

Wi = 0 with probability 1/2, Wi = 2 with probability 1/2.
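The three weight choices fit one sketch (our code; the function name is made up). Each has E(Wi) = Var(Wi) = 1, so the resulting variance estimates should all be comparable to the ordinary bootstrap.

```python
import numpy as np

def weighted_boot_var(x, B, rng, weights="exp"):
    # Reweighted bootstrap of the mean with E(W)=Var(W)=1 weights:
    # Exp(1) (Bayesian bootstrap), Poi(1), or half sampling (0 or 2).
    n = len(x)
    draw = {
        "exp":     lambda: rng.exponential(1.0, size=n),
        "poisson": lambda: rng.poisson(1.0, size=n).astype(float),
        "half":    lambda: 2.0 * rng.integers(0, 2, size=n),
    }[weights]
    reps = np.empty(B)
    for b in range(B):
        w = draw()
        while w.sum() == 0:      # guard against 0/0; vanishingly rare
            w = draw()
        reps[b] = (w * x).sum() / w.sum()
    return reps.var(ddof=1)

rng = np.random.default_rng(6)
x = rng.normal(size=200)
```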

SLIDE 29

Next

That was the bootstrap. Next we’ll bootstrap two-way data.

SLIDE 30

One-way data

Patient | Height | Weight | Age | · · · | Blood pressure
42      | 1.7    | 72     | 32  | · · · | 141
43      | 2.1    | 97     | 42  | · · · | 109
⋮

A usual data matrix has Xi in row i. Data within a row are dependent; e.g. the height and weight of the same person are correlated, as are blood pressure and age. Data in two different rows are independent, e.g. patient 42 and patient 43.

SLIDE 31

Two-way data

patients × medical interns: blood pressure measurements
genes × environments: crop yields
students × exam questions: points

Features: measurements on the same patient are correlated, and measurements by the same intern are also correlated (hopefully that effect is smaller). Neither the rows nor the columns are IID.

SLIDE 32

Netflix data

Rating  | Viewer 1 | Viewer 2 | Viewer 3 | · · · | Viewer C
Movie 1 | 4        | 4        | 1        | · · · | 4
Movie 2 | 5        | 5        | NA       | · · · | NA
Movie 3 | 3        | 3        | NA       | · · · | 2
⋮
Movie R | NA       | 5        | 3        | · · · | 4

Ratings of the same movie are correlated, as are ratings by the same person. Danny Deckchair (2003): the most common rating was 4 stars.

SLIDE 33

Netflix continued

17,770 movies, 480,189 customers, 100,000,000+ ratings.

Sampling uncertainty: ratings made on Tuesday came out a little lower than Sunday ratings. Is that real, or a sampling artifact?

Further problems: 1) missing data make the matrix very unbalanced, and 2) variances are unequal, e.g. some customers just give 1s or 5s.

SLIDE 34

Facebook data

Alice (shares a URL): “Hey, check out http://www.mcqmc2012.unsw.edu.au/”
Bob (comments on it): “Thanks for sharing that, I learned a lot.”

Data: url = http://www.mcqmc2012.unsw.edu.au/, sharer = Alice, commenter = Bob, log length X = log(41) ≐ 3.71.

Data size: 18,134,419 comments by 8,078,531 commenters on 2,085,639 URLs. This is 3-way data: url × sharer × commenter.

Of interest: users’ sharing and commenting behaviour, e.g. who makes longer comments, U.S. or U.K. users? There is probably greater interest in ad clicks, linking, and liking activity.

SLIDE 35

Random effects model

Xij = µ + ai + bj + εij,  i = 1, …, R,  j = 1, …, C,
ai ∼ N(0, σ²A)  (e.g. patients),
bj ∼ N(0, σ²B)  (e.g. interns),
εij ∼ N(0, σ²E).

This is the simplest model for two-way data. It is used in agriculture and has been studied for decades. µ̂ is X̄••.

No bootstrap exists for Var(µ̂). None can exist · · · McCullagh (2000). We can’t even bootstrap a balanced X̄!

He rules out resampling, permuting, or any “monoid” operations on rows and columns.

SLIDE 36

What about classical approaches?

Prime reference: “Variance Components” by Searle, Casella, McCulloch (1992)

  • Excellent for balanced Gaussian data
  • Unbalance =⇒ invert large matrices
  • Emphasis on homogeneous variances

SLIDE 37

McCullagh (2000)

For

µ̂ = X̄•• = (1/(RC)) ∑_{i=1}^R ∑_{j=1}^C Xij:

Naive: resample from the N = RC values.
Product: resample R rows and resample C columns (independently).

VRE(µ̂) = σ²A/R + σ²B/C + σ²E/(RC)  (the true variance)

ERE(V̂Naive(µ̂)) ≐ (σ²A + σ²B + σ²E)/(RC)  (way too small)

ERE(V̂Prod(µ̂)) ≐ σ²A/R + σ²B/C + 3σ²E/(RC)  (not so bad)

Naive resampling is seriously flawed; product resampling is close.

SLIDE 38

Notation

Index j takes values ij = 1, 2, 3, …. Observation i has multi-index i = (i1, …, ir) ∈ {1, 2, …}ʳ. Random Xi ∈ Rd with

Zi = 1 if Xi is known, and Zi = 0 if Xi is missing.

Sample size: 0 < N ≡ ∑_{i∈Nʳ} Zi < ∞. Sample mean: X̄ = ∑i ZiXi / ∑i Zi.

We want to estimate Var(X̄), treating the Zi as fixed.

SLIDE 39

r-fold product bootstrap

X̄* = ∑i ZiWiXi / ∑i ZiWi, where Wi = ∏_{j=1}^r Wj,ij with E(Wj,ij) = 1 and Var(Wj,ij) = 1.

From replications X̄*¹, …, X̄*ᴮ:

V̂ar(X̄) = (1/(B − 1)) ∑_{b=1}^B (X̄*ᵇ − X̄*·)², where X̄*· is the average of the X̄*ᵇ.

It is operationally much easier to have independent Wj,ij (as in the Bayesian bootstrap). Data for URLs might be scattered over several continents; then keeping ∑i Wj,ij = N is awkward.
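For r = 2 the recipe is short. The sketch below is our own code (Exp(1) weights, made-up function name): each observation is reweighted by the product of its row weight and column weight, and the simulated example follows the two-way random effects model with σ²A = σ²B = σ²E = 1.

```python
import numpy as np

def product_boot_var(rows, cols, x, B, rng):
    # Two-way product-weight bootstrap of the mean: one Exp(1) weight per
    # row level and per column level; an observation's weight is the product.
    R, C = rows.max() + 1, cols.max() + 1
    reps = np.empty(B)
    for b in range(B):
        wr = rng.exponential(1.0, size=R)
        wc = rng.exponential(1.0, size=C)
        w = wr[rows] * wc[cols]
        reps[b] = (w * x).sum() / w.sum()
    return reps.var(ddof=1)

rng = np.random.default_rng(7)
R, C = 30, 30
rows = np.repeat(np.arange(R), C)
cols = np.tile(np.arange(C), R)
x = (rng.normal(size=R)[rows] + rng.normal(size=C)[cols]
     + rng.normal(size=R * C))          # X_ij = a_i + b_j + eps_ij
vhat = product_boot_var(rows, cols, x, B=500, rng=rng)
```

Here the true variance is σ²A/R + σ²B/C + σ²E/(RC), while a naive iid-weight bootstrap would come out near (σ²A + σ²B + σ²E)/(RC), far too small.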

SLIDE 40

r-fold random effects

We study Var(X̄) under this model:

Xi = µ + ∑_{u⊆{1,…,r}, u≠∅} εiu,u.

If u = (j1, …, jk) then iu = (ij1, …, ijk).

Moments: E(εiu,u) = 0, Var(εiu,u) = σ²u, and Cov(εiu,u, εi′u′,u′) = 0 unless u = u′ and iu = i′u′.

The σ²u can be allowed to depend on iu.

SLIDE 41

Variance

VarRE(µ̂) = (1/N²) ∑_{u≠∅} ∑_{u′≠∅} ∑i ∑i′ ZiZi′ Cov(εi,u, εi′,u′)
         = (1/N²) ∑_{u≠∅} ∑i ∑i′ ZiZi′ 1{iu = i′u} σ²u
         = (1/N) ∑_{u≠∅} νu σ²u,

for gain coefficients

νu = (1/N) ∑i Zi Ni,u, where Ni,u = ∑i′ Zi′ 1{iu = i′u}.
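The gain coefficients are cheap to compute from the observed multi-indices. A sketch (our code; `gain_coefficients` is a made-up name):

```python
import numpy as np
from itertools import combinations

def gain_coefficients(indices):
    # indices: one row per observed i (those with Z_i = 1), r columns.
    # nu_u = (1/N) sum_i N_{i,u}, with N_{i,u} the number of observations
    # that agree with observation i on every coordinate in u.
    N, r = indices.shape
    nu = {}
    for size in range(1, r + 1):
        for u in combinations(range(r), size):
            _, inv, counts = np.unique(indices[:, list(u)], axis=0,
                                       return_inverse=True,
                                       return_counts=True)
            # counts[inv[i]] is N_{i,u}; average over the N observations
            nu[u] = counts[inv.reshape(-1)].mean()
    return nu

# A balanced 4 x 6 grid: every (i, j) cell observed once.
grid = np.array([(i, j) for i in range(4) for j in range(6)])
nu = gain_coefficients(grid)
```

On a balanced R × C grid this recovers ν_rows = C, ν_cols = R, and ν_interaction = 1, matching the V_RE(µ̂) formula for two-way data.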

SLIDE 42

Running examples

VRE(µ̂) ≡ (1/N) ∑_{u≠∅} νu σ²u ≐ (1/N)(56,200 σ²movies + 646 σ²viewers + σ²interaction)  (for Netflix)

For Facebook: νsh ≐ 17.71, νcom ≐ 7.71, νurl ≐ 26,854.92 (!), νsh,com ≐ 5.92, νsh,url ≐ 12.91, νcom,url ≐ 5.19, and νsh,com,url ≐ 4.88.

Note that νurl ≈ 26,000.

SLIDE 43

Variances

For gain coefficients νu:

VarRE(X̄) = (1/N) ∑_{u⊆{1,…,r}, u≠∅} νu σ²u.

Similarly, for the γu (defined later):

ERE(V̂arProd(X̄*)) ≐ (1/N) ∑_{u≠∅} γu σ²u.

And:

ERE(V̂arNaive(X̄*)) ≐ (1/N) ∑_{u≠∅} σ²u.

Typically 1 ≪ νu ≪ N for u ≠ {1, …, r}, so the naive bootstrap is unreliable. We want γu = νu. We’ll get γu ≥ νu.

SLIDE 44

The case r = 2

Owen (2007): independent bootstrap of rows and columns. We still get

ERE(V̂Prod(µ̂)) ≐ VRE(µ̂),

i.e. still ≈ 1 × the main effect contributions and ≈ 3 × the interaction contribution.

The Sunday vs. Tuesday edge of 0.02 stars is real (about 8 standard errors).

New here: 1) arbitrary order r ≥ 2, and 2) independent product weights (vs. resampling).

SLIDE 45

Product bootstrap

µ̂* = ∑i ZiWiXi / ∑i ZiWi ≡ T*/N*  (a ratio estimator)

VProd(µ̂*) ≈ ṼProd(µ̂*) ≡ (1/N²) EProd((T* − µ̂N*)²).

Main result:

ERE(ṼProd(µ̂*)) = (1/N) ∑_{u≠∅} γu σ²u,

where γu ≈ νu if |u| = 1 (i.e. cardinality 1); otherwise γu/νu > 1, but not by much.

SLIDE 46

The exact formula depends on:

• the number of index pairs i, i′ that match in the set u: ∑_{i∈Nʳ} ∑_{i′∈Nʳ} ZiZi′ 1{iu = i′u}
• the number of index pairs i, i′ that match in precisely k places
• the number of index triples i, i′, i′′ where i matches i′ in the set u and matches i′′ in precisely k places

SLIDE 47

Duplication indices

(level dup) ǫ = (greatest item popularity)/N
(variable dup) η = max_{∅ ≠ u ⊊ v} νv/νu

Examples:

Netflix: ǫ = 232,944/100,480,507 ≐ 0.00232 (Miss Congeniality); η = 1/646 ≐ 0.00155 (νinteraction/νmovies).
Facebook: ǫ = 686,990/18,134,419 ≐ 0.0379 (a popular URL); η = 4.88/5.19 ≐ 0.94 (νsh,com,url/νcom,url).

η is not small for the Facebook data, so the bootstrap variances will be somewhat more conservative.

SLIDE 48

Approximations

Theorem 1. In the homogeneous random effects model, the product weight bootstrap with Var(Wj,ij) = τ² = 1 satisfies

γu = νu[2^|u| − 1 + Θuǫ] + ∑_{v⊋u} 2^|v| νv,

where |Θu| ≤ 2^{r+1} − 2.

Proof: Owen & Eckles (2011), who consider general τ².

For small ǫ and r (i.e. 2ʳǫ ≪ 1):

γu ≈ (2^|u| − 1)νu + ∑_{v⊋u} 2^|v| νv.

If also η ≪ 1: γu ≈ (2^|u| − 1)νu.

SLIDE 49

Some specific approximations

For r = 2:

γ{j} = ν{j}(1 + Θjǫ) + 2, j = 1, 2,
γ{1,2} = ν{1,2}(3 + Θ{1,2}ǫ), where |Θu| ≤ 6.

For r = 3:

γ{1} ≈ ν{1} + 4ν{1,2} + 4ν{1,3} + 8,
γ{1,2} ≈ 3ν{1,2} + 8,
γ{1,2,3} ≈ 7.

If 0 < m ≤ min_u σ²u ≤ max_u σ²u ≤ M < ∞ then

ERE(ṼProd(µ̂*)) / VRE(µ̂) = 1 + O(η + ǫ).

SLIDE 50

Facebook loquacity

For each commenter, url and sharer, we obtain X = log(#char in comment), as well as the country c ∈ {US, UK} of the commenter and the mode m ∈ {web, mobile} of the commenter. Now let

µ̂cm = ∑i ZiXi 1{country = c} 1{mode = m} / ∑i Zi 1{country = c} 1{mode = m}.

We see small differences,

        US    UK
web    3.62  3.55
mobile 3.50  3.57

but they’re larger than the sample fluctuations.

SLIDE 51

Loquacity ECDFs

[Figure: empirical CDFs over 50 bootstraps of µ̂US,m − µ̂UK,m (mean log characters for US minus mean log characters for UK), with panels for Mobile and Web, reweighting one way (commenter), two ways (commenter, sharer), or three ways (commenter, sharer, URL).]

SLIDE 52

Loquacity confidence intervals

[Figure: central 95% confidence intervals from 50 bootstraps of µ̂US,m − µ̂UK,m, with panels for Mobile and Web, reweighting one, two, or three ways.]

SLIDE 53

Summary of problem two

• Crossed random effects data require special care
• No correct bootstrap exists
• Product sampling is at least conservative, mildly over-estimating the variance
• It also works with unequal variances

What remains to do: better seeding, and extensions to more complicated analyses.

SLIDE 54

Conclusion

Statistics and machine learning are still finding new ways to consume Monte Carlo ideas. In addition to MCMC there are:

  • permutations
  • rotations
  • projections
  • subsampling
  • reweighting

SLIDE 55

Thanks

  • Collaborators: Dean Eckles and Sarah Emerson
  • NSF DMS-0906056
  • Data: Netflix and Facebook
  • UNSW
  • Organizers: Ian Sloan, Frances Kuo, Josef Dick and Gareth Peters

SLIDE 56

Missingness

The ≈ 100,000,000 movie ratings we see are only about 1% of the full R × C table. Those we see are probably biased towards higher values. How do we adjust? Answer: we can’t, because nobody knows the bias function. It would not be reasonable to expect a bootstrap method to fix missing data bias; for instance, how could one sampling method fix both positive and negative bias? Given an adjustment algorithm (based on assumptions or knowledge from outside the given data) we might be able to bootstrap its predictions.
