

SLIDE 1

A survey on mixing coefficients: computation and estimation. Vitaly Kuznetsov

Courant Institute of Mathematical Sciences, New York University

October 29, 2013

1 / 24

SLIDE 2

Introduction: Binary classification. Receive a sample X1, . . . , Xm with labels in {0, 1}. Choose a hypothesis h that has good expected performance on unseen data.

X1, . . . , Xm are typically assumed i.i.d.


SLIDE 3

Introduction (continued). Much of learning theory operates under the assumption that the data comes from an i.i.d. source. In certain scenarios this assumption is not appropriate, e.g. time series analysis. To extend learning theory to these scenarios we need to find a suitable relaxation of the i.i.d. requirement. One common approach found in the literature is to impose various "mixing conditions". Under these mixing conditions, the strength of dependence between random variables is measured using "mixing coefficients".

SLIDE 4

Outline

- Mixing conditions and coefficients: definitions and basic properties.
- Computational aspects.
- Estimating mixing coefficients.
- Discussion.


SLIDE 5

How can we measure dependence between random variables? Common measures of dependence are the so-called "mixing" coefficients, originally introduced to prove laws of large numbers for sequences of dependent random variables.


SLIDE 6

α-mixing coefficient between two σ-algebras

Given a probability space (Ω, F, P) and two sub-σ-algebras σ1 and σ2, define the α-mixing coefficient

α(σ1, σ2) = sup_{A,B} |P(A)P(B) − P(A ∩ B)|

where the supremum is taken over all A ∈ σ1 and B ∈ σ2.

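With finite supports the defining supremum ranges over finitely many events, so for tiny supports α can be computed by exhaustive search (the general discrete case is NP-hard, as a later slide notes). A minimal Python sketch; the two joint pmfs at the bottom are hypothetical examples:

```python
from itertools import chain, combinations

def events(support):
    # every subset (event) of a finite support
    return chain.from_iterable(combinations(support, r)
                               for r in range(len(support) + 1))

def alpha_mixing(theta):
    """Brute-force alpha(sigma(X), sigma(Y)) from a joint pmf theta[(x, y)].

    alpha = sup_{A,B} |P(A)P(B) - P(A & B)| over events A in sigma(X)
    and B in sigma(Y); cost is exponential in the support sizes.
    """
    xs = sorted({x for x, _ in theta})
    ys = sorted({y for _, y in theta})
    best = 0.0
    for A in events(xs):
        for B in events(ys):
            pA = sum(theta.get((x, y), 0.0) for x in A for y in ys)
            pB = sum(theta.get((x, y), 0.0) for x in xs for y in B)
            pAB = sum(theta.get((x, y), 0.0) for x in A for y in B)
            best = max(best, abs(pA * pB - pAB))
    return best

# Independent fair bits: alpha = 0.  Identical fair bits (X = Y): alpha = 1/4.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
ident = {(0, 0): 0.5, (1, 1): 0.5}
print(alpha_mixing(indep), alpha_mixing(ident))  # prints 0.0 0.25
```

For the identical-bits pair the supremum is attained at A = {X = 0}, B = {Y = 0}: |0.5 · 0.5 − 0.5| = 0.25, which is the largest possible value of α.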

SLIDE 7

ϕ-mixing coefficient

Define the ϕ-mixing coefficient

ϕ(σ1 | σ2) = sup_{A,B} |P(A) − P(A | B)|

where the supremum is taken over all A ∈ σ1 and B ∈ σ2 with P(B) > 0. Note that the ϕ coefficient is not symmetric.


SLIDE 8

β-mixing coefficient

Define the β-mixing coefficient between two σ-algebras σ1 and σ2:

β(σ1, σ2) = E[ sup_{A ∈ σ1} |P(A) − P(A | σ2)| ]

We can rewrite the β-mixing coefficient as follows:

β(σ1, σ2) = (1/2) sup Σ_{i=1}^{I} Σ_{j=1}^{J} |P(Ai)P(Bj) − P(Ai ∩ Bj)|

where the supremum is taken over all finite partitions {A1, . . . , AI} and {B1, . . . , BJ} of Ω such that Ai ∈ σ1 and Bj ∈ σ2.


SLIDE 9

Alternative definitions of the β-mixing coefficient

This leads to yet another characterization of the β-mixing coefficient:

β(σ1, σ2) = ‖P_{σ1} ⊗ P_{σ2} − P_{σ1 ⊗ σ2}‖

where ‖·‖ denotes the total variation distance, i.e.

‖P − Q‖ = sup_A |P(A) − Q(A)|.

Assuming distributions P and Q have densities f and g respectively,

‖P − Q‖ = (1/2) ∫ |f − g|.


SLIDE 10

Relations between mixing coefficients

We have the following: 2α(σ1, σ2) ≤ β(σ1, σ2) ≤ ϕ(σ1, σ2). The second inequality is immediate from the definitions. Proof of the first inequality: for A ∈ σ1 and B ∈ σ2, {A, Aᶜ} and {B, Bᶜ} are finite partitions, so

|P(A)P(B) − P(A ∩ B)| + |P(A)P(Bᶜ) − P(A ∩ Bᶜ)| + |P(Aᶜ)P(B) − P(Aᶜ ∩ B)| + |P(Aᶜ)P(Bᶜ) − P(Aᶜ ∩ Bᶜ)| ≤ 2β(σ1, σ2).

All four terms on the left are equal, so 4|P(A)P(B) − P(A ∩ B)| ≤ 2β(σ1, σ2); taking the supremum over A and B gives 2α(σ1, σ2) ≤ β(σ1, σ2).

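The chain 2α ≤ β ≤ ϕ can be sanity-checked numerically for discrete pairs, computing α by brute force over events and β, ϕ from the closed-form expressions given later in the deck. A sketch with a randomly generated (hypothetical) joint pmf:

```python
import itertools

import numpy as np

def subsets(n):
    # all subsets of {0, ..., n-1}
    return itertools.chain.from_iterable(
        itertools.combinations(range(n), r) for r in range(n + 1))

def alpha_bf(theta):
    # brute-force alpha: sup over events A, B of |P(A)P(B) - P(A & B)|
    n, k = theta.shape
    mu, nu = theta.sum(axis=1), theta.sum(axis=0)
    best = 0.0
    for A in subsets(n):
        for B in subsets(k):
            pA, pB = mu[list(A)].sum(), nu[list(B)].sum()
            pAB = theta[np.ix_(list(A), list(B))].sum()
            best = max(best, abs(pA * pB - pAB))
    return best

def beta_phi(theta):
    # closed forms from the Ahsen-Vidyasagar result later in the deck
    mu, nu = theta.sum(axis=1), theta.sum(axis=0)
    gamma = theta - np.outer(mu, nu)
    beta = 0.5 * np.abs(gamma).sum()
    phi = (np.clip(gamma, 0, None).sum(axis=0) / nu).max()
    return beta, phi

rng = np.random.default_rng(0)
theta = rng.random((3, 3))
theta /= theta.sum()            # a random 3x3 joint pmf
a = alpha_bf(theta)
b, p = beta_phi(theta)
assert 2 * a <= b + 1e-12 and b <= p + 1e-12
```

Trying many random joints this way never violates the chain, which is a useful smoke test when implementing these formulas.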

SLIDE 11

From two variables to stochastic processes (i)

Let {Xt}_{t=−∞}^{∞} be a doubly infinite sequence of random variables. Notation:

X_i^j = (Xi, Xi+1, . . . , Xj)
P_i^j is the joint probability distribution of X_i^j
σ_i^j is the σ-algebra generated by X_i^j


SLIDE 12

From two variables to stochastic processes (ii)

Define the following mixing coefficients:

α(a) = sup_t α(σ_{−∞}^t, σ_{t+a}^∞)
β(a) = sup_t β(σ_{−∞}^t, σ_{t+a}^∞)
ϕ(a) = sup_t ϕ(σ_{−∞}^t, σ_{t+a}^∞)

We say that a sequence of random variables X_{−∞}^∞ is α-, β- or ϕ-mixing if the corresponding mixing coefficient → 0 as a → ∞.

These coefficients measure dependence between the future and the past separated by a time units.


SLIDE 13

Stationary stochastic processes

A stochastic process X_{−∞}^∞ is (strictly) stationary if for any t ∈ Z and k, n ∈ N the distribution of X_t^{t+n} is the same as the distribution of X_{t+k}^{t+k+n}. For stationary processes the mixing coefficients can be simplified to

α(a) = α(σ_{−∞}^0, σ_a^∞)
β(a) = β(σ_{−∞}^0, σ_a^∞)
ϕ(a) = ϕ(σ_{−∞}^0, σ_a^∞)


SLIDE 14

Connections to machine learning

Theorem (M. Mohri, A. Rostamizadeh, 2009): Let H = {h : X → Y} be a set of hypotheses and let L be an M-bounded loss function. Let S be a sample of size m = 2µa from a stationary β-mixing process on X × Y. Then for any δ > 4(µ − 1)β(a), with probability at least 1 − δ, the following holds for all h ∈ H:

E[L(h(X), Y)] ≤ (1/m) Σ_{i=1}^{m} L(h(Xi), Yi) + R̂_{Sµ}(L ∘ H) + 3M √( log(4/δ′) / (2µ) )

where R̂_{Sµ} denotes the empirical Rademacher complexity and δ′ = δ − 4(µ − 1)β(a).

Other results of a similar nature were obtained by R. Meir, M. Mohri and A. Rostamizadeh, and I. Steinwart et al., to name a few.


SLIDE 15

Can we compute mixing coefficients?

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint and marginal probability distributions. Then computing the α-mixing coefficient is NP-hard (the problem is equivalent to the "partition problem"). Ahsen and Vidyasagar also give efficiently computable upper and lower bounds.


SLIDE 16

Can we compute mixing coefficients? (continued)

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint distribution θij and marginal probability distributions µi and νj. Then one has that

β(σ(X), σ(Y)) = (1/2) Σ_{i,j} |γij|

ϕ(σ(X), σ(Y)) = max_j (1/νj) Σ_i max(γij, 0)

where γij = θij − µiνj. Thus β(σ(X), σ(Y)) and ϕ(σ(X), σ(Y)) are both computable in polynomial time.

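These closed forms translate directly into code. A minimal sketch; the 2×2 joint pmfs at the bottom are hypothetical examples:

```python
import numpy as np

def beta_phi_discrete(theta):
    """beta and phi mixing coefficients of a discrete pair from its joint pmf.

    theta[i, j] = P(X = i, Y = j); gamma_ij = theta_ij - mu_i * nu_j.
    beta = (1/2) * sum_ij |gamma_ij|
    phi  = max_j (1/nu_j) * sum_i max(gamma_ij, 0)
    Both run in time polynomial in the support sizes.
    """
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)   # marginal of X
    nu = theta.sum(axis=0)   # marginal of Y
    gamma = theta - np.outer(mu, nu)
    beta = 0.5 * np.abs(gamma).sum()
    phi = (np.clip(gamma, 0.0, None).sum(axis=0) / nu).max()
    return beta, phi

# Independent fair bits: beta = phi = 0.
print(beta_phi_discrete([[0.25, 0.25], [0.25, 0.25]]))
# Identical fair bits (X = Y): gamma = [[.25, -.25], [-.25, .25]],
# so beta = 0.5 and phi = 0.5.
print(beta_phi_discrete([[0.5, 0.0], [0.0, 0.5]]))
```

Contrast this with the α coefficient of the same pair, which has no known polynomial-time algorithm in general.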

SLIDE 17

Estimation of mixing coefficients: naive approach (i)

Question: Given i.i.d. samples (X1, Y1), . . . , (Xm, Ym) from the joint distribution of real-valued (X, Y), can we estimate any of the mixing coefficients?

Define the following estimators of the marginal and joint distributions (empirical c.d.f.'s):

F̂(x) = (1/m) Σ_{i=1}^{m} I_{Xi ≤ x}

Ĝ(y) = (1/m) Σ_{i=1}^{m} I_{Yi ≤ y}

Ĥ(x, y) = (1/m) Σ_{i=1}^{m} I_{Xi ≤ x, Yi ≤ y}

Let β̂ and ϕ̂ be the estimators of β and ϕ based on these empirical c.d.f.'s.


SLIDE 18

Estimation of mixing coefficients: naive approach (ii)

Theorem (M. Ahsen, M. Vidyasagar, 2013):

ϕ̂ ≥ β̂ = (m − 1)/m → 1 as m → ∞

Justification: under the empirical probability distributions each sample has mass 1/m. The marginals are also uniform, and hence the product distribution assigns mass 1/m² to each point of the grid (xi, yj). The conclusion now follows from the above formula for discrete β. In other words, the naive plug-in estimators converge to 1 regardless of the true coefficients, so this approach is not consistent.

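The collapse to (m − 1)/m is easy to reproduce numerically: with distinct sample values the empirical joint puts mass 1/m on each observed pair, while the product of the empirical marginals puts mass 1/m² on every grid point. A sketch (the sample data is hypothetical, with x and y independent):

```python
import numpy as np

def naive_beta_hat(x, y):
    """Plug-in beta estimator from the empirical distributions of (x, y).

    Assumes all x values (and all y values) are distinct, so the
    empirical joint puts mass 1/m on each observed pair.
    """
    m = len(x)
    xs, ys = np.unique(x), np.unique(y)
    theta = np.zeros((m, m))
    for xi, yi in zip(x, y):
        theta[np.searchsorted(xs, xi), np.searchsorted(ys, yi)] += 1.0 / m
    gamma = theta - np.outer(theta.sum(axis=1), theta.sum(axis=0))
    return 0.5 * np.abs(gamma).sum()

rng = np.random.default_rng(0)
m = 50
x = rng.permutation(m).astype(float)   # distinct values
y = rng.permutation(m).astype(float)   # independent of x
print(naive_beta_hat(x, y))            # (m - 1)/m despite independence
```

Even though x and y are independent (true β = 0), the estimate is (m − 1)/m = 0.98, illustrating why a smoothing step (histograms) is needed.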

SLIDE 19

Estimation of mixing coefficients: histograms (i)

A histogram estimator f̂ of a density f based on a sample X1, . . . , Xm is

f̂(x) = Σ_{j=1}^{J} (p̂j / (m wj)) I_{Bj}(x)

where

- the Bj are bins partitioning the region containing the observations,
- p̂j = Σ_{i=1}^{m} I_{Bj}(Xi) counts the number of samples in bin Bj,
- wj is the width of the j-th bin.

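A minimal sketch of such an estimator in Python, using equal-count bins as in the scheme on the next slide (placing bin edges at empirical quantiles is an assumption of this sketch, not a detail from the talk):

```python
import numpy as np

def equal_count_histogram(sample, J):
    """Histogram density estimate with J bins, each holding ~m/J samples.

    Returns (edges, heights): the estimate is f_hat(x) = heights[j]
    on [edges[j], edges[j+1]), i.e. p_hat_j / (m * w_j).
    """
    m = len(sample)
    # bin edges at empirical quantiles => roughly equal counts per bin
    edges = np.quantile(sample, np.linspace(0.0, 1.0, J + 1))
    counts, _ = np.histogram(sample, bins=edges)
    widths = np.diff(edges)
    return edges, counts / (m * widths)

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)
edges, heights = equal_count_histogram(sample, J=10)
# the estimate integrates to ~1 by construction
print(float(np.sum(heights * np.diff(edges))))
```

Since Σ_j p̂j = m, the step function integrates to 1, so it is a genuine density estimate.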

SLIDE 20

Estimation of mixing coefficients: histograms (ii)

Given m samples, choose Jm intervals on R so that each bin contains ⌊m/Jm⌋ or ⌊m/Jm⌋ + 1 samples from both X and Y.

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose (X, Y) ∼ θ, X ∼ µ and Y ∼ ν, with θ absolutely continuous with respect to µ ⊗ ν. Then β̂ converges to β provided that Jm/m → 0. If, in addition, the density f ∈ L∞, then α̂ and ϕ̂ also converge to α and ϕ respectively.

The measure-theoretic arguments used in the proof establish consistency of the estimators but do not yield error rates.


SLIDE 21

Estimation of mixing coefficients: stochastic processes (i)

Two-step approximation:

|β̂d(a) − β(a)| ≤ |β̂d(a) − βd(a)| + |βd(a) − β(a)|

where βd(a) = sup_t β(σ_{t−d}^t, σ_{t+a}^{t+a+d}) and β̂d(a) is the estimator

β̂d(a) = (1/2) ∫ |f̂d ⊗ f̂d − f̂2d|

with f̂d and f̂2d being the d- and 2d-dimensional histogram estimators.

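For the simplest case d = 1, the estimator compares a two-dimensional histogram of the pairs (Xt, Xt+a) against the product of the one-dimensional marginal histograms. A sketch under that simplification (the binning choices here are assumptions, not details from the talk):

```python
import numpy as np

def beta_hat_1(x, a, J=5):
    """Histogram estimate of beta(a) with d = 1 for a stationary sequence x.

    Bins x into J equal-count bins, forms the empirical joint of the
    pairs (x_t, x_{t+a}) over those bins, and returns the discrete TV
    (1/2) * sum_ij |theta_ij - p_i * q_j|.
    """
    edges = np.quantile(x, np.linspace(0.0, 1.0, J + 1))
    edges[-1] += 1e-9                 # include the maximum in the last bin
    b = np.digitize(x, edges) - 1     # bin index of each observation
    pairs = np.stack([b[:-a], b[a:]])
    theta = np.zeros((J, J))
    for i, j in pairs.T:
        theta[i, j] += 1.0 / pairs.shape[1]
    p, q = theta.sum(axis=1), theta.sum(axis=0)
    return 0.5 * np.abs(theta - np.outer(p, q)).sum()

rng = np.random.default_rng(2)
x = rng.normal(size=5000)             # i.i.d., so the true beta(a) = 0 for a >= 1
print(beta_hat_1(x, a=1))             # small, and shrinking as m grows
```

For an i.i.d. sequence the estimate is close to 0 rather than collapsing to (m − 1)/m, which is exactly what the histogram smoothing buys over the naive plug-in.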

SLIDE 22

Estimation of mixing coefficients: stochastic processes (ii)

Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let X_1^m be a sample from a stationary β-mixing process. For m = 2µm bm and d ≤ µm we have that

P( |β̂d(a) − βd(a)| ≥ ε ) ≤ 2 exp(−µm ε1²/2) + 2 exp(−µm ε2²/2) + 4(µm − 1)β(bm)

where ε1 = ε/2 − E[∫ |f̂d − fd|] and ε2 = ε − E[∫ |f̂2d − f2d|].

The proof is based on the blocking technique.


SLIDE 23

Estimation of mixing coefficients: stochastic processes (iii)

For the second term, |βd(a) − β(a)|, a measure-theoretic argument can be used to show that it → 0 as d → ∞. Under the assumption that the densities fd and f2d lie in the Sobolev space H2, McDonald, Shalizi and Schervish argue that f̂d and f̂2d are consistent. Choosing dm = O(exp(W(log m))) and wm = O(m^(−km)), where

km = W(log m) + ((1/2) log m) / ( log m ((1/2) exp(W(log m)) + 1) )

and W is the inverse of w ↦ w exp(w) (the Lambert W function), they show that the histogram-based estimator of β is consistent.


SLIDE 24

Estimation of mixing coefficients: discussion

- The results do not provide convergence rates.
- High-dimensional histogram estimation may not be accurate.
- Instead of estimating β directly, an intermediate density-estimation step is used.
- Could estimators based on kernels instead of histograms do better?
