

SLIDE 1

Large Scale Matrix Analysis and Inference

Wouter M. Koolen, Manfred Warmuth, Reza Bosagh Zadeh, Gunnar Carlsson, Michael Mahoney. NIPS, Dec 9, 2013.


SLIDE 2

Introductory musing — What is a matrix?

A matrix A = (a_{i,j}) can be viewed as:

1. A vector of n² parameters
2. A covariance
3. A generalized probability distribution
4. ...

SLIDE 3
1. A vector of n² parameters

When you regularize with the squared Frobenius norm:

min_W ||W||²_F + Σn loss(tr(W Xn))

SLIDE 4
1. A vector of n² parameters

When you regularize with the squared Frobenius norm:

min_W ||W||²_F + Σn loss(tr(W Xn))

this is equivalent to:

min_vec(W) ||vec(W)||²₂ + Σn loss(vec(W) · vec(Xn))

No structure: n² independent variables.
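The deck states this equivalence without code; here is a minimal numpy sketch of the two identities it rests on, ||W||²_F = ||vec(W)||²₂ and tr(W Xn) = vec(W) · vec(Xn). The sizes and the symmetric instance X = xx⊤ are assumptions for illustration (for a general non-symmetric X, tr(W X) equals vec(W) · vec(X⊤)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
x = rng.standard_normal(n)
X = np.outer(x, x)  # a symmetric instance X = x x^T

# Squared Frobenius norm of W equals the squared 2-norm of vec(W)
print(np.isclose(np.linalg.norm(W, "fro") ** 2, W.ravel() @ W.ravel()))

# tr(W X) equals vec(W) . vec(X) because X is symmetric
print(np.isclose(np.trace(W @ X), W.ravel() @ X.ravel()))
```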

SLIDE 5
2. A covariance

View the symmetric positive definite matrix C as the covariance matrix of some random feature vector c ∈ Rn, i.e.

C = E[(c − E(c))(c − E(c))⊤]

n features plus their pairwise interactions.
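A quick numerical illustration of the definition; the data-generating matrix A and the sample sizes are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples = 3, 100_000
A = rng.standard_normal((n, n))
c = rng.standard_normal((samples, n)) @ A.T   # rows are draws of the feature vector c

centered = c - c.mean(axis=0)
C = centered.T @ centered / samples           # C = E[(c - Ec)(c - Ec)^T], empirically

print(np.allclose(C, C.T))                    # symmetric
print(np.all(np.linalg.eigvalsh(C) >= -1e-10))  # positive semidefinite (up to rounding)
```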

SLIDE 6

Symmetric matrices as ellipses

Ellipse = {Cu : ||u||₂ = 1}

Dotted lines connect a point u on the unit ball with the point Cu on the ellipse.

SLIDE 7

Symmetric matrices as ellipses

Eigenvectors form the axes; eigenvalues are the lengths.
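A small numpy sketch of this picture, for an assumed 2×2 example: each eigenvector v is mapped by C to λv, so the ellipse {Cu : ||u||₂ = 1} has the eigenvectors as axes with the eigenvalues as their lengths:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # an example symmetric matrix

evals, evecs = np.linalg.eigh(C)  # eigenvalues 1 and 3, orthonormal eigenvectors

# C maps each eigenvector v to lambda * v: axis direction and length of the ellipse
for lam, v in zip(evals, evecs.T):
    print(lam, np.allclose(C @ v, lam * v))
```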

SLIDE 8

Dyads

uu⊤, where u is a unit vector. One eigenvalue is one, all others are zero: a rank-one projection matrix.

SLIDE 9

Directional variance along direction u

V(c⊤u) = u⊤Cu = tr(C uu⊤) ≥ 0

The outer figure eight is the direction u scaled by the variance u⊤Cu.
PCA: find the direction of largest variance.
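A minimal numpy check of both claims: the two expressions for the directional variance agree, and the top eigenvector attains the largest variance. The covariance C below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
C = M @ M.T                           # an example covariance matrix
u = rng.standard_normal(3)
u /= np.linalg.norm(u)                # an arbitrary unit direction

# The two expressions for the directional variance agree
print(np.isclose(u @ C @ u, np.trace(C @ np.outer(u, u))))

# PCA: the top eigenvector of C attains the largest variance
evals, evecs = np.linalg.eigh(C)
top = evecs[:, -1]
print(np.isclose(top @ C @ top, evals[-1]))
print(u @ C @ u <= evals[-1] + 1e-12)  # no other direction does better
```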

SLIDE 10

3-dimensional variance plots

tr(C uu⊤) is a generalized probability when tr(C) = 1.

SLIDE 11
3. Generalized probability distributions

Probability vector: ω = (.2, .1, .6, .1)⊤ = Σi ωi ei, with mixture coefficients ωi and pure events ei.

Density matrix: W = Σi ωi wiwi⊤, with mixture coefficients ωi and pure density matrices wiwi⊤.

SLIDE 12
3. Generalized probability distributions

Probability vector: ω = (.2, .1, .6, .1)⊤ = Σi ωi ei, with mixture coefficients ωi and pure events ei.

Density matrix: W = Σi ωi wiwi⊤, with mixture coefficients ωi and pure density matrices wiwi⊤.

Matrices as generalized distributions.

SLIDE 13
3. Generalized probability distributions

Probability vector: ω = (.2, .1, .6, .1)⊤ = Σi ωi ei, with mixture coefficients ωi and pure events ei.

Density matrix: W = Σi ωi wiwi⊤, with mixture coefficients ωi and pure density matrices wiwi⊤.

Matrices as generalized distributions.

Many mixtures lead to the same density matrix. There always exists a decomposition into n eigendyads. A density matrix is a symmetric positive matrix of trace one.
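A numpy sketch of these facts, with made-up mixture sizes: a mixture of k pure dyads gives a trace-one symmetric positive matrix, and its eigendecomposition recovers a (generally different) mixture of n orthogonal eigendyads:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 6
omega = rng.random(k)
omega /= omega.sum()                               # mixture coefficients
w = rng.standard_normal((k, n))
w /= np.linalg.norm(w, axis=1, keepdims=True)      # unit vectors -> pure dyads

W = sum(o * np.outer(v, v) for o, v in zip(omega, w))
print(np.isclose(np.trace(W), 1.0))                # trace one
print(np.all(np.linalg.eigvalsh(W) >= -1e-12))     # symmetric positive (semi)definite

# The eigendecomposition gives another mixture, into n orthogonal eigendyads
lam, V = np.linalg.eigh(W)
W2 = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))
print(np.allclose(W, W2))
```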

SLIDE 14

It’s like a probability!

Total variance along an orthogonal set of directions is 1:

u1⊤ W u1 + u2⊤ W u2 = 1

a + b + c = 1

SLIDE 15

Uniform density?

The uniform density is (1/n)I. All dyads have generalized probability 1/n:

tr((1/n)I uu⊤) = (1/n) tr(uu⊤) = 1/n

Generalized probabilities of n orthogonal dyads sum to 1.
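The one-line check in numpy, with an arbitrary unit direction:

```python
import numpy as np

n = 5
u = np.random.default_rng(4).standard_normal(n)
u /= np.linalg.norm(u)                 # any unit direction

U = np.eye(n) / n                      # the uniform density matrix
print(np.isclose(np.trace(U @ np.outer(u, u)), 1 / n))  # tr((1/n)I uu^T) = 1/n
```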

SLIDE 16

Conventional Bayes Rule

P(Mi|y) = P(Mi) P(y|Mi) / P(y)

4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.


SLIDE 20

Bayes Rule for density matrices

D(M|y) = exp(log D(M) + log D(y|M)) / tr(exp(log D(M) + log D(y|M)))

1 update with the data likelihood matrix D(y|M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

SLIDES 21-25

Bayes Rule for density matrices: the same update repeated 2, 3, 4, 10, and 20 times with the same data likelihood matrix D(y|M). Each update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.
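The deck shows this as an animation; a minimal numpy sketch of the update makes the soft-max behaviour concrete. The likelihood matrix L below is a made-up positive definite example; for symmetric positive definite matrices the matrix log and exp can be taken through the eigendecomposition. Repeated updates with the same L drive the top eigenvalue of the posterior toward 1, concentrating on the maximum-eigenvalue direction:

```python
import numpy as np

def mlog(S):
    """Matrix log of a symmetric positive definite matrix via eigendecomposition."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(np.log(lam)) @ V.T

def mexp(S):
    """Matrix exp of a symmetric matrix via eigendecomposition."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(lam)) @ V.T

def bayes_update(D_prior, D_lik):
    """D(M|y) = exp(log D(M) + log D(y|M)) / tr(exp(log D(M) + log D(y|M)))."""
    A = mexp(mlog(D_prior) + mlog(D_lik))
    return A / np.trace(A)

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
L = M @ M.T + 0.1 * np.eye(n)   # a made-up positive definite likelihood matrix

D = np.eye(n) / n               # uniform prior density matrix
for t in range(1, 21):
    D = bayes_update(D, L)
    if t in (1, 4, 10, 20):
        # the top eigenvalue of the posterior tends to 1: a soft max
        print(t, np.linalg.eigvalsh(D)[-1])
```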

SLIDE 26

Bayes’ rules

Bayes rule, vector case: P(Mi|y) = P(Mi)·P(y|Mi) / Σj P(Mj)·P(y|Mj)

Bayes rule, matrix case: D(M|y) = D(M)⊙D(y|M) / tr(D(M)⊙D(y|M))

where A⊙B := exp(log A + log B)

SLIDE 27

Bayes’ rules

Bayes rule, vector case: P(Mi|y) = P(Mi)·P(y|Mi) / Σj P(Mj)·P(y|Mj)

Bayes rule, matrix case: D(M|y) = D(M)⊙D(y|M) / tr(D(M)⊙D(y|M))

where A⊙B := exp(log A + log B)

Regularizer: entropy (vector case), quantum entropy (matrix case).

SLIDE 28

Vector case as special case of matrix case

Vectors as diagonal matrices. All matrices share the same eigensystem. The fancy ⊙ becomes ·. Often the vector case is the hardest problem, i.e. bounds for the vector case "lift" to the matrix case.

SLIDE 29

Vector case as special case of matrix case

Vectors as diagonal matrices. All matrices share the same eigensystem. The fancy ⊙ becomes ·. Often the vector case is the hardest problem, i.e. bounds for the vector case "lift" to the matrix case. This phenomenon has been dubbed the "free matrix lunch".

Size of matrix = size of vector = n.

SLIDE 30

PCA setup

Data vectors: C = Σn xnxn⊤

max over unit u of u⊤Cu (not convex in u)
= max over dyads uu⊤ of tr(C uu⊤) (linear in uu⊤)

Corresponding vector problem: max over ei of c⊤ei (linear in ei)

The vector problem is the matrix problem when everything happens in the same eigensystem (see the sketch after this list).

Uncertainty over units: probability vector
Uncertainty over dyads: density matrix
Uncertainty over k-sets of units: capped probability vector
Uncertainty over rank-k projection matrices: capped density matrix
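A minimal numpy sketch of the "same eigensystem" claim, using an assumed diagonal example: when C is diagonal, the matrix problem max over dyads of tr(C uu⊤) and the vector problem max over pure events of c⊤ei have the same value:

```python
import numpy as np

c = np.array([0.2, 0.7, 0.1])
C = np.diag(c)                      # diagonal C: everything in one eigensystem

# Vector problem: max over pure events e_i of c . e_i = largest coordinate
print(c.max())                      # 0.7

# Matrix problem: max over dyads uu^T of tr(C uu^T) = largest eigenvalue
print(np.linalg.eigvalsh(C)[-1])    # 0.7, attained at the corresponding dyad
```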

SLIDE 31

For PCA

Solve the vector problem first. Do all bounds. Lift to the matrix case: essentially replace · by ⊙. Regret bounds stay the same: the Free Matrix Lunch.

SLIDE 32

Questions

When can you "lift" the vector case to the matrix case? When is there a free matrix lunch? Lifting matrices to tensors? Efficient algorithms for large matrices?

Approximations of ⊙. Avoiding eigenvalue decomposition by sampling. ...