

SLIDE 1

On Estimation of Modal Decompositions

Anuran Makur, Gregory W. Wornell, and Lizhong Zheng

Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology

IEEE International Symposium on Information Theory 2020

  • A. Makur, G. W. Wornell, L. Zheng (MIT)

On Estimation of Modal Decompositions ISIT 21-26 June 2020 1 / 21

SLIDE 2

Outline

1. Introduction: A Brief History of Modal Decompositions; Formal Definitions; Motivation: Embedding of Categorical Data into Euclidean Space
2. Characterization of Operators
3. Sample Complexity Analysis
4. Conclusion


SLIDE 10

A Brief History of Modal Decompositions

• Dimensionality reduction: principal component analysis (PCA) [Pea01], [Hot33]; canonical correlation analysis (CCA) [Hot36]. Can we extend these techniques to categorical data?
• Modal decompositions: [Hir35]
• Maximal correlation: [Geb41], [Rén59], [Wit75]
• Strong data processing inequalities and related directions: χ²-divergence [Sar58], KL divergence [AG76], and recent work on hypercontractivity [AGKN13], contraction coefficients [MZ15], [PW17], [MZ20], functional inequalities [Rag16], estimation theory, security, and privacy [CMM+17], …
• Lancaster distributions: Mehler's decomposition [Meh66], Lancaster decompositions [Lan58], [Lan69], orthogonal polynomials [Eag64], [Gri69], [Kou96], [Kou98], and recent work [AZ12], [MZ17], …
• Correspondence analysis: data visualization [Ben73], [Gre84], [GH87], and recent work on neural networks [HMWZ19], [HSC19], …
• Non-parametric regression: alternating conditional expectations (ACE) algorithm [BF85], [Buj85], feature extraction [MKHZ15], [HMZW17], [HMWZ19]


SLIDE 16

Formal Definitions

• Finite alphabets X and Y, and random variables X ∈ X and Y ∈ Y
• Bivariate distribution P_{X,Y} with marginals P_X, P_Y > 0
• Hilbert spaces:
  Input space: L²(X, P_X) ≜ {f : X → ℝ : E[f(X)²] < +∞}, with inner product
    ∀f₁, f₂ ∈ L²(X, P_X), ⟨f₁, f₂⟩_{P_X} ≜ E[f₁(X) f₂(X)] = Σ_{x∈X} P_X(x) f₁(x) f₂(x),
  and induced L²-norm
    ∀f ∈ L²(X, P_X), ‖f‖²_{P_X} = E[f(X)²].
  Output space: L²(Y, P_Y) ≜ {g : Y → ℝ : E[g(Y)²] < +∞}
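The inner product and induced norm above are plain P_X-weighted sums, so they are easy to sanity-check numerically. A minimal sketch with a hypothetical 3-point marginal and two arbitrary functions (numpy assumed):

```python
import numpy as np

# Hypothetical marginal P_X on a 3-point alphabet, and two functions f1, f2 : X -> R
Px = np.array([0.2, 0.3, 0.5])
f1 = np.array([1.0, -2.0, 0.5])
f2 = np.array([0.0, 1.0, -1.0])

inner = np.sum(Px * f1 * f2)   # <f1, f2>_{P_X} = E[f1(X) f2(X)]
norm2 = np.sum(Px * f1**2)     # ||f1||^2_{P_X} = E[f1(X)^2]
```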


SLIDE 19

Formal Definitions: Two Equivalent Representations of P_{X,Y}

Definition (Conditional Expectation Operator)
P_{X|Y} : L²(X, P_X) → L²(Y, P_Y) maps any f ∈ L²(X, P_X) to P_{X|Y} f ∈ L²(Y, P_Y):
  ∀y ∈ Y, (P_{X|Y} f)(y) ≜ E[f(X) | Y = y].

Definition (Divergence Transition Matrix)
The divergence transition matrix (DTM), denoted B ∈ ℝ^{|Y|×|X|}, has (y, x)th entry given by:
  ∀x ∈ X, ∀y ∈ Y, B(y, x) ≜ P_{X,Y}(x, y) / √(P_X(x) P_Y(y)).

Remark: DTMs parallel symmetric normalized Laplacian matrices.
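The DTM definition translates directly into a few lines of numpy. A sketch with a hypothetical 3×2 joint distribution; the two checks reflect properties stated later in the talk (entrywise positivity and unit spectral norm):

```python
import numpy as np

# Hypothetical 3x2 joint distribution: P[x, y] = P_{X,Y}(x, y)
P = np.array([[0.10, 0.20],
              [0.25, 0.05],
              [0.15, 0.25]])
Px = P.sum(axis=1)                       # marginal P_X
Py = P.sum(axis=0)                       # marginal P_Y

# DTM: B(y, x) = P_{X,Y}(x, y) / sqrt(P_X(x) P_Y(y)), shape |Y| x |X|
B = (P / np.sqrt(np.outer(Px, Py))).T

print(np.all(B > 0))                            # True (strictly positive joint)
print(np.isclose(np.linalg.norm(B, 2), 1.0))    # True (spectral norm is 1)
```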


SLIDE 22

Formal Definitions: SVDs and Modal Decompositions

K ≜ min{|X|, |Y|}

SVD of conditional expectation operator: ∀i ∈ {0, …, K−1}, P_{X|Y} f*_i = σ_i g*_i, where:
• σ₀ ≥ σ₁ ≥ ··· ≥ σ_{K−1} ≥ 0 are singular values
• f*₀, …, f*_{K−1} ∈ L²(X, P_X) are orthonormal right singular vectors
• g*₀, …, g*_{K−1} ∈ L²(Y, P_Y) are orthonormal left singular vectors

SVD of DTM: B = Σ_{i=0}^{K−1} σ_i ψ^Y_i (ψ^X_i)^T, where:
• ψ^X₀, …, ψ^X_{K−1} ∈ ℝ^{|X|} are orthonormal right singular vectors
• ψ^Y₀, …, ψ^Y_{K−1} ∈ ℝ^{|Y|} are orthonormal left singular vectors

Relation: ψ^X_i(x) = f*_i(x) √(P_X(x)) for x ∈ X, and ψ^Y_i(y) = g*_i(y) √(P_Y(y)) for y ∈ Y
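One way to read the relation is that the DTM's singular vectors are the feature functions reweighted by square-root marginals. A self-contained numerical sketch with a hypothetical 3×2 joint (numpy assumed):

```python
import numpy as np

# Hypothetical 3x2 joint distribution P_{X,Y}
P = np.array([[0.10, 0.20],
              [0.25, 0.05],
              [0.15, 0.25]])
Px, Py = P.sum(axis=1), P.sum(axis=0)
B = (P / np.sqrt(np.outer(Px, Py))).T         # DTM, shape |Y| x |X|

# SVD of the DTM: B = sum_i sigma_i psiY_i psiX_i^T
psiY, sigma, psiX_T = np.linalg.svd(B, full_matrices=False)
psiX = psiX_T.T                                # columns are psiX_i

# sigma_0 = 1 with psiX_0 = sqrt(P_X), psiY_0 = sqrt(P_Y) (up to a common sign)
s0 = np.sign(psiX[0, 0])
assert np.isclose(sigma[0], 1.0)
assert np.allclose(s0 * psiX[:, 0], np.sqrt(Px))
assert np.allclose(s0 * psiY[:, 0], np.sqrt(Py))

# Relation on the slide: f*_i(x) = psiX_i(x) / sqrt(P_X(x)); these f*_i are
# orthonormal under the P_X-weighted inner product of L2(X, P_X).
f_star = psiX / np.sqrt(Px)[:, None]
assert np.allclose(f_star.T @ np.diag(Px) @ f_star, np.eye(2))
```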


SLIDE 25

Formal Definitions: SVDs and Modal Decompositions

Proposition (SVD Structure)
• Operator norm: σ₀ = 1, f*₀(x) = 1 for all x ∈ X, and g*₀(y) = 1 for all y ∈ Y.
• Maximal correlations [Hir35, Geb41, Sar58, Rén59]: using Courant–Fischer–Weyl, ∀i ∈ {1, …, K−1},
    σ_i = max_{f,g} E[f(X) g(Y)] = E[f*_i(X) g*_i(Y)],
  where the maximization is over all f ∈ L²(X, P_X) and g ∈ L²(Y, P_Y) such that E[f(X)²] = E[g(Y)²] = 1 and E[f(X) f*_j(X)] = E[g(Y) g*_j(Y)] = 0 for all j < i.

Proposition (Modal Decomposition of Bivariate Distribution [Hir35, Lan58, Ben73])
∀x ∈ X, ∀y ∈ Y, P_{X,Y}(x, y) = P_X(x) P_Y(y) (1 + Σ_{i=1}^{K−1} σ_i f*_i(x) g*_i(y))
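The modal decomposition can be verified numerically by reconstructing P_{X,Y} from the SVD of its DTM. A self-contained sketch with a hypothetical 3×2 joint (the i = 0 term contributes exactly the leading 1, since f*₀ = g*₀ = 1 and σ₀ = 1):

```python
import numpy as np

# Hypothetical 3x2 joint distribution P_{X,Y}
P = np.array([[0.10, 0.20],
              [0.25, 0.05],
              [0.15, 0.25]])
Px, Py = P.sum(axis=1), P.sum(axis=0)
B = (P / np.sqrt(np.outer(Px, Py))).T
psiY, sigma, psiX_T = np.linalg.svd(B, full_matrices=False)

f_star = psiX_T.T / np.sqrt(Px)[:, None]   # columns f*_i
g_star = psiY / np.sqrt(Py)[:, None]       # columns g*_i

# P_{X,Y}(x, y) = P_X(x) P_Y(y) (1 + sum_{i>=1} sigma_i f*_i(x) g*_i(y))
K = len(sigma)
recon = np.outer(Px, Py) * (1.0 + sum(sigma[i] * np.outer(f_star[:, i], g_star[:, i])
                                      for i in range(1, K)))
assert np.allclose(recon, P)
```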


SLIDE 29

Motivation: Embedding of Categorical Data into Euclidean Space

Suppose we have: X = {…} (shown as images on the slide), Y = {ISIT, Allerton, ICASSP, ICML, …}

Goal: Embed X into ℝ^k using knowledge of P_{X,Y} for further processing, e.g., clustering.

"Natural" embedding: represent each x ∈ X using the conditional distribution P_{Y|X=x} ∈ ℝ^{|Y|}.

(Figure: the points P_{Y|X=x} embedded in the probability simplex.)

Dimensionality Reduction: |Y| is large! Reduce the dimension of the embedding.


SLIDE 31

Motivation: Embedding of Categorical Data into Euclidean Space

Modal Decomposition Embedding: P_{Y|X=x} = P_Y + Σ_{i=1}^{K−1} σ_i f*_i(x) (g*_i · P_Y)


SLIDE 34

Motivation: Embedding of Categorical Data into Euclidean Space

Modal Decomposition Embedding: when σ_{k+1} is small, embed via
  ζ_k : X → ℝ^k, ζ_k(x) = [σ₁ f*₁(x) ··· σ_k f*_k(x)]^T

Diffusion Distance Preservation (cf. Laplacian eigenmaps [BN01], diffusion maps [CL06]):
  D²_diff(P_{Y|X=x}, P_{Y|X=x′}) ≜ Σ_{y∈Y} (P_{Y|X}(y|x) − P_{Y|X}(y|x′))² / P_Y(y)
    = ‖ζ_{K−1}(x) − ζ_{K−1}(x′)‖²₂ ≈ ‖ζ_k(x) − ζ_k(x′)‖²₂
(dimensionality reduction when k ≪ K)
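The diffusion distance identity is easy to confirm numerically: the χ²-weighted distance between conditional distributions equals the Euclidean distance between the full embeddings. A sketch with a hypothetical 3×3 joint (numpy assumed):

```python
import numpy as np

# Hypothetical 3x3 joint distribution P_{X,Y}
P = np.array([[0.05, 0.10, 0.05],
              [0.10, 0.05, 0.15],
              [0.15, 0.20, 0.15]])
Px, Py = P.sum(axis=1), P.sum(axis=0)
B = (P / np.sqrt(np.outer(Px, Py))).T
psiY, sigma, psiX_T = np.linalg.svd(B, full_matrices=False)
f_star = psiX_T.T / np.sqrt(Px)[:, None]

# Full embedding zeta_{K-1}(x) = [sigma_1 f*_1(x), ..., sigma_{K-1} f*_{K-1}(x)]
zeta = f_star[:, 1:] * sigma[1:]          # row x is zeta_{K-1}(x)

# Diffusion distance between P_{Y|X=x} and P_{Y|X=x'} equals the
# Euclidean distance between the embeddings.
P_Y_given_X = P / Px[:, None]
x, xp = 0, 1
d2_diff = np.sum((P_Y_given_X[x] - P_Y_given_X[xp])**2 / Py)
d2_emb = np.sum((zeta[x] - zeta[xp])**2)
assert np.isclose(d2_diff, d2_emb)
```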


SLIDE 38

Main Questions

• How do we characterize or identify DTMs?
• Why do we use DTMs or conditional expectation operators to represent P_{X,Y} (instead of, e.g., information density [HV93])? Known relation to mutual χ²-information, …
• If the true distribution P_{X,Y} is unknown but we have training data, how well can we learn σ₁, …, σ_k and (f*₁, g*₁), …, (f*_k, g*_k)?

SLIDE 39

Outline

1. Introduction
2. Characterization of Operators: Characterization of DTMs; Representation of Conditional Expectation Operators
3. Sample Complexity Analysis
4. Conclusion


SLIDE 46

Characterization of DTMs

• P_{X×Y} ≜ {bivariate distributions over X × Y with strictly positive marginals}
• P°_{X×Y} ≜ relative interior of P_{X×Y}
• DTM function: B : P_{X×Y} → ℝ^{|Y|×|X|} so that B = B(P_{X,Y})

Theorem (Characterization of DTMs)
M ∈ ℝ^{|Y|×|X|} is a DTM corresponding to a distribution in P°_{X×Y} if and only if M > 0 (entrywise) and the spectral norm ‖M‖_s = 1:
  B(P°_{X×Y}) = {M ∈ ℝ^{|Y|×|X|} : M > 0 and ‖M‖_s = 1}.
More generally, we have:
  B(P_{X×Y}) = {M ∈ ℝ^{|Y|×|X|} : M ≥ 0, ‖M‖_s = 1, ∃ψ^X > 0, M^T M ψ^X = ψ^X, and ∃ψ^Y > 0, M M^T ψ^Y = ψ^Y}.

• The DTM function is bijective and continuous. (So B is an equivalent representation of P_{X,Y}.)
• Proofs utilize the Perron–Frobenius theorem.
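The forward direction of the theorem can be spot-checked numerically: any strictly positive joint yields a DTM satisfying all the listed conditions, with the fixed vectors ψ^X = √P_X and ψ^Y = √P_Y. A sketch with a random hypothetical joint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random strictly positive joint distribution (hypothetical example)
P = rng.uniform(0.01, 1.0, size=(4, 3))
P /= P.sum()
Px, Py = P.sum(axis=1), P.sum(axis=0)
M = (P / np.sqrt(np.outer(Px, Py))).T          # its DTM, shape |Y| x |X|

# Relative-interior conditions: M > 0 entrywise and spectral norm exactly 1
assert np.all(M > 0)
assert np.isclose(np.linalg.norm(M, 2), 1.0)

# General characterization: strictly positive vectors fixed by M^T M and M M^T
# (here psiX = sqrt(P_X) and psiY = sqrt(P_Y) do the job)
psiX, psiY = np.sqrt(Px), np.sqrt(Py)
assert np.allclose(M.T @ M @ psiX, psiX)
assert np.allclose(M @ M.T @ psiY, psiY)
```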


SLIDE 55

Representation of Conditional Expectation Operators

Question: Why use conditional expectation operators with specific choices of Hilbert spaces?
• The operator P_{X|Y} is characterized by the conditional distribution P_{X|Y}
• To get an SVD of P_{X|Y}, choose the output Hilbert space L²(Y, P_Y) (this defines P_{X,Y}!)
• Instead of P_X, choose input Hilbert space L²(X, Q_X) for any distribution Q_X > 0
• Operator norm of P_{X|Y} : L²(X, Q_X) → L²(Y, P_Y):
  ‖P_{X|Y}‖_{Q_X→P_Y} ≜ max_{f ∈ L²(X,Q_X)\{0}} ‖P_{X|Y} f‖_{P_Y} / ‖f‖_{Q_X}

Theorem (Weak Contraction)
min_{Q_X>0} ‖P_{X|Y}‖_{Q_X→P_Y} = ‖P_{X|Y}‖_{P_X→P_Y} = 1. (data processing inequality for χ²-divergence)
For any Q_X > 0, ‖P_{X|Y}‖²_{Q_X→P_Y} ≥ 1 + χ²(P_X‖Q_X) = Σ_{x∈X} P_X(x)² / Q_X(x).

Proof uses explicit calculation of the adjoint operator P*_{X|Y}.

Answer: Q*_X = P_X is the unique input Hilbert space that makes P_{X|Y} a weak contraction.
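The weak contraction theorem can be probed numerically. One convenient matrix form of the operator norm (an assumption of this sketch, not spelled out on the slide) is ‖P_{X|Y}‖_{Q_X→P_Y} = σ_max(diag(√P_Y) A diag(1/√Q_X)) with A[y, x] = P_{X|Y}(x|y); for Q_X = P_X this matrix is exactly the DTM, so the norm is 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strictly positive joint P_{X,Y}, |X| = 4, |Y| = 3
P = rng.uniform(0.01, 1.0, size=(4, 3))
P /= P.sum()
Px, Py = P.sum(axis=1), P.sum(axis=0)

A = (P / Py).T                         # A[y, x] = P_{X|Y}(x | y)

def op_norm(Qx):
    # ||P_{X|Y}||_{Q_X -> P_Y} as a weighted spectral norm
    return np.linalg.norm(np.sqrt(Py)[:, None] * A / np.sqrt(Qx), 2)

# With Q_X = P_X the operator has norm exactly 1 (weak contraction) ...
assert np.isclose(op_norm(Px), 1.0)

# ... and for any other Q_X > 0 the squared norm is at least
# 1 + chi^2(P_X || Q_X) = sum_x P_X(x)^2 / Q_X(x)
Qx = rng.uniform(0.1, 1.0, size=4)
Qx /= Qx.sum()
assert op_norm(Qx)**2 >= np.sum(Px**2 / Qx) - 1e-12
```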

SLIDE 56

Outline

1. Introduction
2. Characterization of Operators
3. Sample Complexity Analysis: Estimation of Dominant Maximal Correlations; Estimation of Dominant Feature Functions
4. Conclusion


slide-63
SLIDE 63

Preliminaries

Suppose the true $P_{X,Y}$ is unknown. Observe $n$ training samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ i.i.d. $\sim P_{X,Y}$ with empirical distribution:

$$\forall x \in \mathcal{X}, \ \forall y \in \mathcal{Y}, \quad \hat{P}^n_{X,Y}(x, y) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X_i = x\} \, \mathbb{1}\{Y_i = y\} .$$

Assume $P_X$ and $P_Y$ are known, and satisfy $P_X, P_Y \geq p_0$ for some constant $p_0 > 0$.

Sample Modal Decomposition:

$$\forall x \in \mathcal{X}, \ \forall y \in \mathcal{Y}, \quad \hat{P}^n_{X,Y}(x, y) = P_X(x) \, P_Y(y) \left( 1 + \sum_{i=1}^{K} \hat{\sigma}_i \, \hat{f}^*_i(x) \, \hat{g}^*_i(y) \right)$$

  • Singular value estimates: $\hat{\sigma}_1 \geq \cdots \geq \hat{\sigma}_K \geq 0$
  • $\{\hat{f}^*_1, \ldots, \hat{f}^*_K\} \subset \mathcal{L}^2(\mathcal{X}, P_X)$ and $\{\hat{g}^*_1, \ldots, \hat{g}^*_K\} \subset \mathcal{L}^2(\mathcal{Y}, P_Y)$ are orthonormal sets
  • Singular vector estimates, for all $i$: $\check{f}^*_i(x) \triangleq \hat{f}^*_i(x) - \mathbb{E}\big[\hat{f}^*_i(X)\big]$ and $\check{g}^*_i(y) \triangleq \hat{g}^*_i(y) - \mathbb{E}\big[\hat{g}^*_i(Y)\big]$.

Algorithm: Compute the SVD of the empirical quasi-DTM using, e.g., the orthogonal iteration method, the QR iteration algorithm (or the ACE algorithm), Krylov subspace methods, etc.

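The construction above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes the empirical quasi-DTM is the matrix $\hat{P}^n_{X,Y}(x,y)/\sqrt{P_X(x) P_Y(y)}$ with its rank-one component $\sqrt{P_X}\sqrt{P_Y}^{\,T}$ removed, so that a dense SVD recovers the sample modal decomposition; at scale, the iterative solvers named on the slide would replace `np.linalg.svd`.

```python
import numpy as np

def sample_modal_decomposition(xs, ys, PX, PY):
    """SVD-based sketch of the sample modal decomposition.

    xs, ys: n i.i.d. samples encoded as integer indices.
    PX, PY: the known marginals, assumed bounded below by p0 > 0.
    Returns (sigma, F, G): singular value estimates and feature-function
    estimates, with the rows of F orthonormal in L2(X, PX) and the rows
    of G orthonormal in L2(Y, PY).
    """
    n = len(xs)
    P_hat = np.zeros((len(PX), len(PY)))
    np.add.at(P_hat, (xs, ys), 1.0 / n)          # empirical distribution
    sx, sy = np.sqrt(PX), np.sqrt(PY)
    # Empirical quasi-DTM, with the sqrt(PX) sqrt(PY)^T component removed.
    B = P_hat / np.outer(sx, sy) - np.outer(sx, sy)
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    F = U.T / sx   # rows are f-hat*_i(x) = u_i(x) / sqrt(PX(x))
    G = Vt / sy    # rows are g-hat*_i(y) = v_i(y) / sqrt(PY(y))
    return sigma, F, G
```

Keeping every singular triple makes the factorization exact, i.e. $P_X(x) P_Y(y)\,(1 + \sum_i \hat{\sigma}_i \hat{f}^*_i(x)\,\hat{g}^*_i(y))$ reproduces $\hat{P}^n_{X,Y}(x,y)$; the centered estimates $\check{f}^*_i$, $\check{g}^*_i$ follow by subtracting the empirical means.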


slide-68
SLIDE 68

Estimation of $k \in \{1, \ldots, K-1\}$ Dominant Maximal Correlations

Estimate $\sigma_1, \ldots, \sigma_k$ using $\hat{\sigma}_1, \ldots, \hat{\sigma}_k$ under (squared) $\ell^1$-norm loss.

Theorem (Sample Complexity Tail Bound I)

$$\forall \, 0 \leq \delta \leq \frac{1}{p_0} \sqrt{\frac{k}{2}}, \quad \mathbb{P}\left( \sum_{i=1}^{k} \big| \hat{\sigma}_i - \sigma_i \big| \geq \delta \right) \leq \exp\left( \frac{1}{4} - \frac{n \, p_0^2 \, \delta^2}{8k} \right)$$

  • To estimate $\sigma_1, \ldots, \sigma_k$ within fixed error and confidence, $n$ must grow linearly with $k$.

Proof exploits: 1) a vector generalization of Bernstein's inequality, and 2) a weak majorization result for the perturbation of singular values known as Lidskii's inequality.

Corollary (Squared $\ell^1$-Risk Bound)

$$\forall \, n \geq 16 \log(4kn), \quad \mathbb{E}\left[ \left( \sum_{i=1}^{k} \big| \hat{\sigma}_i - \sigma_i \big| \right)^{2} \right] \leq \frac{6k + 8k \log(nk)}{p_0^2 \, n}$$

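The Lidskii ingredient can be illustrated directly: for any two matrices of the same shape, the decreasingly sorted singular value perturbations are weakly majorized by the singular values of the difference. Below is a quick numerical check with arbitrary illustrative matrices (not from the paper); in the proof, $A$ plays the role of the true quasi-DTM and $E$ the empirical perturbation, whose size is controlled by the vector Bernstein inequality.

```python
import numpy as np

# A stands in for the true matrix being perturbed, E for the empirical
# perturbation; both are arbitrary illustrative choices.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])
E = np.array([[0.2, -0.1, 0.3],
              [0.0, 0.1, -0.2]])

sigma_A = np.linalg.svd(A, compute_uv=False)       # sigma_i(A)
sigma_AE = np.linalg.svd(A + E, compute_uv=False)  # sigma_i(A + E)
sigma_E = np.linalg.svd(E, compute_uv=False)       # sigma_i(E)

# Sorted singular value perturbations, largest first.
diffs = np.sort(np.abs(sigma_A - sigma_AE))[::-1]

# Weak majorization (Lidskii/Mirsky): every partial sum of diffs is at
# most the corresponding partial sum of sigma(E).
partial_ok = [diffs[:k].sum() <= sigma_E[:k].sum() + 1e-12
              for k in range(1, len(diffs) + 1)]
```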


slide-74
SLIDE 74

Estimation of $k \in \{1, \ldots, K-1\}$ Dominant Feature Functions

Estimate $f^*_1, \ldots, f^*_k$ using $\check{f}^*_1, \ldots, \check{f}^*_k$ under the loss function:

$$\sum_{i=1}^{k} \big\| P_{X|Y} f^*_i \big\|^{2}_{P_Y} - \sum_{i=1}^{k} \big\| P_{X|Y} \check{f}^*_i \big\|^{2}_{P_Y} \ \geq \ 0 .$$

The first term equals $\sigma_1^2 + \cdots + \sigma_k^2$ (a "rank $k$ approximation" of the mutual $\chi^2$-information).

Theorem (Sample Complexity Tail Bound II)

$$\forall \, 0 \leq \delta \leq 4k, \quad \mathbb{P}\left( \sum_{i=1}^{k} \big\| P_{X|Y} f^*_i \big\|^{2}_{P_Y} - \big\| P_{X|Y} \check{f}^*_i \big\|^{2}_{P_Y} \geq \delta \right) \leq \big( |\mathcal{X}| + |\mathcal{Y}| \big) \exp\left( -\frac{n \, p_0 \, \delta^2}{64 \, k^2} \right)$$

  • To estimate $f^*_1, \ldots, f^*_k$ within fixed error and confidence, $n$ must grow quadratically with $k$.

Proof exploits: 1) a matrix generalization of Bernstein's inequality, and 2) a singular value stability result known as Weyl's inequality. Analogous results hold for estimation of $g^*_1, \ldots, g^*_k$ using $\check{g}^*_1, \ldots, \check{g}^*_k$.

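The identity behind the first term of the loss, $\sum_{i=1}^{k} \| P_{X|Y} f^*_i \|^2_{P_Y} = \sigma_1^2 + \cdots + \sigma_k^2$, can be verified numerically from the true decomposition: the conditional expectation operator maps $f^*_i$ to $\sigma_i g^*_i$, so each summand contributes exactly $\sigma_i^2$. A sketch assuming the quasi-DTM construction sketched earlier; the joint distribution is illustrative, not from the paper.

```python
import numpy as np

# Illustrative strictly positive joint distribution on a 3 x 3 alphabet.
P = np.array([[0.18, 0.02, 0.10],
              [0.06, 0.24, 0.05],
              [0.06, 0.04, 0.25]])
PX, PY = P.sum(axis=1), P.sum(axis=0)
sx, sy = np.sqrt(PX), np.sqrt(PY)

# True modal decomposition: SVD of the DTM with its top mode removed.
B = P / np.outer(sx, sy) - np.outer(sx, sy)
U, sigma, Vt = np.linalg.svd(B)
F = U.T / sx                      # rows are f*_i, orthonormal in L2(X, PX)

# First term of the loss for k = 2: sum_i || P_{X|Y} f*_i ||^2_{PY},
# where (P_{X|Y} f)(y) = E[f(X) | Y = y] and ||h||^2_{PY} = sum_y PY(y) h(y)^2.
k = 2
first_term = 0.0
for i in range(k):
    h = (P * F[i][:, None]).sum(axis=0) / PY   # conditional expectation
    first_term += (PY * h ** 2).sum()
```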

slide-75
SLIDE 75

Estimation of k ∈ {1, . . . , K − 1} Dominant Feature Functions


Corollary (Mean Squared Error Risk Bound)

For every sufficiently large $n$ such that $\frac{p_0 n}{64} \geq \frac{1}{|\mathcal{X}| + |\mathcal{Y}|}$ and $\frac{p_0 n}{4} \geq \log\big( \frac{p_0 n}{64} \big( |\mathcal{X}| + |\mathcal{Y}| \big) \big)$,

$$\mathbb{E}\left[ \left( \sum_{i=1}^{k} \big\| P_{X|Y} f^*_i \big\|^{2}_{P_Y} - \big\| P_{X|Y} \check{f}^*_i \big\|^{2}_{P_Y} \right)^{2} \right] \leq \frac{64 k^2 \left( \log\big( p_0 n \, \big( |\mathcal{X}| + |\mathcal{Y}| \big) \big) - 3 \right)}{p_0 n}$$
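The Weyl ingredient in the proof of Tail Bound II says each singular value moves by at most the spectral norm of the perturbation, $|\sigma_i(A+E) - \sigma_i(A)| \leq \|E\|$. A quick check with arbitrary illustrative matrices (again not from the paper):

```python
import numpy as np

A = np.array([[2.0, 0.5, 1.0],
              [0.5, 1.5, 0.0],
              [1.0, 0.0, 1.0]])
E = 0.1 * np.array([[1.0, -2.0, 0.0],
                    [0.0, 1.0, 1.0],
                    [-1.0, 0.0, 2.0]])

sigma_A = np.linalg.svd(A, compute_uv=False)
sigma_AE = np.linalg.svd(A + E, compute_uv=False)
spec_norm_E = np.linalg.svd(E, compute_uv=False)[0]   # ||E|| = sigma_1(E)

# Weyl's inequality for singular values, applied index by index.
weyl_ok = bool(np.all(np.abs(sigma_AE - sigma_A) <= spec_norm_E + 1e-12))
```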

slide-76
SLIDE 76

Outline

1. Introduction
2. Characterization of Operators
3. Sample Complexity Analysis
4. Conclusion



slide-78
SLIDE 78

Conclusion

Main Contributions:
  • DTMs are entry-wise strictly positive matrices with spectral norm 1.
  • Unique Hilbert spaces yield conditional expectation operators that are weak contractions.
  • Sample complexity bounds for learning modal decompositions from training data.

Main Future Direction: Sharpen and generalize the sample complexity results using matrix estimation ideas.


slide-79
SLIDE 79

Thank You!
