CS 498ABD: Algorithms for Big Data, Spring 2019
Approximate Matrix Multiplication
Lecture 21
April 11, 2019
Matrix data

A lot of data can be viewed as defining a matrix. We have already seen vectors modeling data/signals; more generally, one can use tensors too. Given $n$ data items where each item $a_i$ is a vector over some $m$ features, $A$ is the matrix defined by the $n$ data items: taking $a_1, \ldots, a_n$ as columns, $A$ is an $m \times n$ matrix. Combinatorial objects such as graphs can also be modeled via matrices (e.g., the adjacency matrix of a graph).
Basic problems in linear algebra:
- Matrix-vector product: compute $Ax$
- Matrix multiplication: compute $AB$
- Linear equations: solve $Ax = b$
- Matrix inversion: compute $A^{-1}$
- Least squares: solve $\min_x \|Ax - b\|_2$
- Singular value decomposition, eigenvalues, principal component analysis, low-rank approximations, ...

Fundamental in all areas of applied mathematics and engineering. Many applications to statistics and data analysis.
Numerical linear algebra (NLA) has a vast literature. In practice, iterative methods that converge to an optimum are often used and can be better than exact methods. Some TCS contributions in the recent past:
- randomized NLA for faster algorithms with provable approximation guarantees, via sampling and JL-based techniques and others
- revisiting preconditioning methods for Laplacians and beyond

Many powerful applications in theory and practice.
A norm on a real vector space $V$ is a real-valued function $\|\cdot\|$ with three properties: (i) $\|x\| \ge 0$ for all $x \in V$, and $\|x\| = 0$ implies $x = 0$; (ii) $\|ax\| = |a| \cdot \|x\|$ for all scalars $a$; (iii) $\|x + y\| \le \|x\| + \|y\|$.

Familiar vector norms: $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$.

If $A$ is an injective linear transformation then $\|Ax\|$ is also a norm.

Norms and metrics: $d(x, y) = \|x - y\|$ is a metric.
Consider the vector space of all matrices $A \in \mathbb{R}^{m \times n}$. What are useful norms over matrices?
- Treat the matrix like a vector of dimension $mn$ and apply vector norms. Frobenius norm: $\|A\|_F = (\sum_{i,j} |A_{i,j}|^2)^{1/2}$.
- Treat the matrix as a linear operator and see what it does to norms of the vectors it operates on. Spectral norm: $\|A\|_2 = \sup_{\|x\|_2 = 1} \|Ax\|_2$.
- Schatten $p$-norms, based on the singular values of $A$.
- Trace norm, nuclear norm, ...

Norms are related in some cases (different perspectives on the same norm).
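As a concrete illustration (my addition, not from the lecture), here is a minimal numpy sketch of these norms; `np.linalg.norm` computes the Frobenius norm of a matrix by default and the spectral norm with `ord=2`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

fro = np.linalg.norm(A)            # Frobenius norm: sqrt of sum of squared entries
spec = np.linalg.norm(A, ord=2)    # spectral norm: largest singular value

# The spectral norm equals the top singular value, and the Frobenius norm
# is the l2 norm of the vector of singular values (a Schatten 2-norm).
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(spec, s[0])
assert np.isclose(fro, np.linalg.norm(s))
```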
Submultiplicative property: $\|AB\|_F \le \|A\|_F \|B\|_F$ and $\|AB\|_2 \le \|A\|_2 \|B\|_2$.
Problem: Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, compute the matrix $AB$.
- Standard algorithm based on the definition: $O(mnp)$ time.
- Faster algorithms via non-trivial Strassen-like divide and conquer.

Approximation: Compute $D \in \mathbb{R}^{m \times p}$ such that $D - AB$ is small in some appropriate matrix norm. Two methods:
- random sampling
- random projections (fast JL)
Notation: $M^{(j)}$ for the $j$'th column of $M$ and $M_{(i)}$ for the $i$'th row of $M$, both interpreted as vectors.

From the textbook definition: $D_{i,h} = \langle A_{(i)}, B^{(h)} \rangle = \sum_{k=1}^n A_{i,k} B_{k,h}$.

Equivalently, consider $A^T$ as $m$ column vectors from $\mathbb{R}^n$ and $B$ as $p$ column vectors from $\mathbb{R}^n$; we want to compute all $mp$ inner products of these vectors.
We want to approximate $AB$ in the Frobenius norm. Ideally we want $D$ such that $\|D - AB\|_F \le \epsilon \|AB\|_F$, but $\|AB\|_F$ can be $0$. Instead we will settle for $\|D - AB\|_F \le \epsilon \|A\|_F \|B\|_F$.
Alternate definition of matrix multiplication based on outer products:
$$AB = \sum_{j=1}^{n} A^{(j)} B_{(j)}$$
where each $A^{(j)} B_{(j)}$ is an $m \times p$ matrix of rank 1.
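A quick numpy check of this identity (my addition): summing the $n$ rank-1 outer products, column $j$ of $A$ times row $j$ of $B$, recovers $AB$ exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 6, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# AB as a sum of n rank-1 outer products: column j of A times row j of B.
C = sum(np.outer(A[:, j], B[j, :]) for j in range(n))
assert np.allclose(C, A @ B)
```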
Sampling algorithm:
$$AB = \sum_{j=1}^{n} A^{(j)} B_{(j)}$$
- Pick a probability distribution over $[n]$: $p_1 + p_2 + \ldots + p_n = 1$.
- For $\ell = 1$ to $t$: pick an index $j_\ell \in [n]$ according to the distribution $p$ (independently, with replacement).
- Output $C = \frac{1}{t} \sum_{\ell=1}^{t} \frac{1}{p_{j_\ell}} A^{(j_\ell)} B_{(j_\ell)}$.
Each sample gives an unbiased estimate: $E\big[\frac{1}{p_{j_\ell}} A^{(j_\ell)} B_{(j_\ell)}\big] = \sum_{j=1}^n p_j \cdot \frac{1}{p_j} A^{(j)} B_{(j)} = AB$. Hence, by linearity of expectation, $E[C] = AB$.
Question: How should we choose $p_1, p_2, \ldots, p_n$?
- $p_j$ should correspond to the contribution of $A^{(j)} B_{(j)}$ to $\|AB\|_F$.
- Use the spectral norm of $A^{(j)} B_{(j)}$, which is $\|A^{(j)} B_{(j)}\|_2$.
- Claim: $\|A^{(j)} B_{(j)}\|_2 = \|A^{(j)}\|_2 \|B_{(j)}\|_2$.
- Choose $p_j = \frac{\|A^{(j)}\|_2 \|B_{(j)}\|_2}{\sum_{k=1}^n \|A^{(k)}\|_2 \|B_{(k)}\|_2}$.

Due to [Drineas-Kannan-Mahoney].
Implementation:
- For all $j$, compute $\|A^{(j)}\|_2$ and $\|B_{(j)}\|_2$. Takes one pass over $A$ and $B$, and allows one to compute $p_1, p_2, \ldots, p_n$.
- Output $C = \frac{1}{t} \sum_{\ell=1}^{t} \frac{1}{p_{j_\ell}} A^{(j_\ell)} B_{(j_\ell)}$.

At most $O(tmp + N_A + N_B)$ time, where $N_A$ and $N_B$ are the numbers of non-zeroes in $A$ and $B$. The full computation takes $O(mnp)$ time. A code sketch of this scheme follows.
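The scheme is short enough to implement in full. The following numpy sketch is mine, not the lecture's code (the function name is made up); it uses the [Drineas-Kannan-Mahoney] probabilities above.

```python
import numpy as np

def approx_matmul_sampling(A, B, t, rng=None):
    """Approximate AB by sampling t rank-1 outer products A^{(j)} B_{(j)}
    with probabilities p_j proportional to ||A^{(j)}||_2 * ||B_{(j)}||_2."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    # One pass over A and B to get column/row norms, hence the p_j's.
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum()   # assumes w.sum() > 0, i.e., AB is not trivially 0
    # Sample t indices i.i.d. with replacement; average the scaled estimates.
    J = rng.choice(n, size=t, p=p)
    return sum(np.outer(A[:, j], B[j, :]) / p[j] for j in J) / t

# Example: the relative Frobenius error shrinks roughly like 1/sqrt(t).
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 200))
B = rng.standard_normal((200, 40))
C = approx_matmul_sampling(A, B, t=1000, rng=3)
err = np.linalg.norm(C - A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"relative error: {err:.4f}")
```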
Analysis: we want to analyze $\Pr[\|C - AB\|_F \ge \epsilon \|A\|_F \|B\|_F]$. Using Markov's inequality (applied to $\|C - AB\|_F^2$):
$$\Pr[\|C - AB\|_F \ge \epsilon \|A\|_F \|B\|_F] \le \frac{E[\|C - AB\|_F^2]}{\epsilon^2 \|A\|_F^2 \|B\|_F^2}.$$

Lemma: with the preceding choice of the $p_j$,
$$E[\|C - AB\|_F^2] = \frac{1}{t} \Big( \sum_{j=1}^n \|A^{(j)}\|_2 \|B_{(j)}\|_2 \Big)^2 - \frac{1}{t} \|AB\|_F^2 \le \frac{1}{t} \|A\|_F^2 \|B\|_F^2,$$
where the last inequality is Cauchy-Schwarz: $\sum_j \|A^{(j)}\|_2 \|B_{(j)}\|_2 \le (\sum_j \|A^{(j)}\|_2^2)^{1/2} (\sum_j \|B_{(j)}\|_2^2)^{1/2} = \|A\|_F \|B\|_F$.

Thus, if $t = \frac{1}{\epsilon^2 \delta}$ then $\Pr[\|C - AB\|_F \ge \epsilon \|A\|_F \|B\|_F] \le \delta$.
Improving the dependence on $\delta$: recall that we used the median trick to improve the dependence on $\delta$ from $1/\delta$ to $\log(1/\delta)$. If $t = \frac{3}{\epsilon^2}$ then $\Pr[\|C - AB\|_F \ge \epsilon \|A\|_F \|B\|_F] \le 1/3$. Repeat independently to obtain $C_1, C_2, \ldots, C_r$ where $r = \Theta(\log(1/\delta))$. By Chernoff bounds, a majority of the estimators are good. But how do we pick the "median" matrix?

[Clarkson-Woodruff] For each $1 \le i \le r$ compute $\rho_i = |\{ j : j \ne i, \|C_i - C_j\|_F \le 2\epsilon \|A\|_F \|B\|_F \}|$, and output a $C_s$ such that $\rho_s \ge r/2$.

Correctness follows from the triangle inequality: $\|C_i - C_j\|_F \le \|C_i - AB\|_F + \|C_j - AB\|_F$ and $\|C_i - C_j\|_F \ge \|C_i - AB\|_F - \|C_j - AB\|_F$. More than half of the $C_i$'s are good, i.e., have $\|C_i - AB\|_F < \epsilon \|A\|_F \|B\|_F$. If $C_s$ is good then $\|C_s - C_j\|_F \le 2\epsilon \|A\|_F \|B\|_F$ for every good $C_j$, so $\rho_s \ge r/2$. If $C_s$ is bad ($\|C_s - AB\|_F > 3\epsilon \|A\|_F \|B\|_F$) then $\|C_s - C_j\|_F > 2\epsilon \|A\|_F \|B\|_F$ for every good $C_j$, which means $\rho_s < r/2$. A code sketch of this selection rule follows.
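Here is a minimal sketch (mine) of the [Clarkson-Woodruff] selection rule; `pick_median_matrix` is a hypothetical name, and the threshold $2\epsilon\|A\|_F\|B\|_F$ is assumed to be precomputed.

```python
import numpy as np

def pick_median_matrix(Cs, threshold):
    """Given independent estimates C_1..C_r and threshold 2*eps*||A||_F*||B||_F,
    return a C_s whose Frobenius distance to at least r/2 of the others is
    within the threshold; such a C_s exists w.h.p. by Chernoff bounds."""
    r = len(Cs)
    for i, Ci in enumerate(Cs):
        rho = sum(1 for j, Cj in enumerate(Cs)
                  if j != i and np.linalg.norm(Ci - Cj) <= threshold)
        if rho >= r / 2:
            return Ci
    raise RuntimeError("no estimate is close to half the others")
```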
Recap: one pass over $A$ and $B$ gives the norms $\|A^{(j)}\|_2$ and $\|B_{(j)}\|_2$, hence $p_1, \ldots, p_n$; then $C = \frac{1}{t} \sum_{\ell=1}^{t} \frac{1}{p_{j_\ell}} A^{(j_\ell)} B_{(j_\ell)}$ takes at most $O(tmp + N_A + N_B)$ time, versus $O(mnp)$ for the full computation. Either we choose $t = \frac{1}{\epsilon^2 \delta}$, or we choose $t = \frac{1}{\epsilon^2} \ln(1/\delta)$ but then we need to do the pairwise matrix comparisons above, so effectively $t = \frac{1}{\epsilon^4} \log^2(1/\delta)$.
Proof of the Lemma:
$$E[\|C - AB\|_F^2] = \frac{1}{t} \Big( \sum_{j=1}^n \|A^{(j)}\|_2 \|B_{(j)}\|_2 \Big)^2 - \frac{1}{t} \|AB\|_F^2.$$

Recall $C$ is the average of $t$ independent estimators, so the lemma is basically about $t = 1$; assume $t = 1$ below.

$E[\|C - AB\|_F^2] = \sum_{x,y} E[(C_{x,y} - (AB)_{x,y})^2]$. Fix $x, y$ and let $Z = C_{x,y}$. We have $E[Z] = (AB)_{x,y}$. Hence $E[(Z - (AB)_{x,y})^2] = \mathrm{Var}[Z] = E[Z^2] - (AB)_{x,y}^2$, and
$$E[Z^2] = \sum_{j=1}^n p_j (A_{x,j} B_{j,y})^2 / p_j^2 = \sum_{j=1}^n (A_{x,j} B_{j,y})^2 / p_j.$$

Thus $E[\|C - AB\|_F^2] = \sum_{x,y} \sum_{j=1}^n (A_{x,j} B_{j,y})^2 / p_j - \|AB\|_F^2$.

Simplifying the first term:
$$\sum_{x,y} \sum_{j=1}^n (A_{x,j})^2 (B_{j,y})^2 / p_j = \sum_{j=1}^n \frac{1}{p_j} \|A^{(j)}\|_2^2 \|B_{(j)}\|_2^2.$$

Recall $p_j = \frac{\|A^{(j)}\|_2 \|B_{(j)}\|_2}{\sum_k \|A^{(k)}\|_2 \|B_{(k)}\|_2}$. Thus
$$E[\|C - AB\|_F^2] = \Big( \sum_{j=1}^n \|A^{(j)}\|_2 \|B_{(j)}\|_2 \Big)^2 - \|AB\|_F^2$$
for $t = 1$.

One can show that this choice of the $p_j$ values is optimum for reducing the variance in the simple importance sampling scheme.
Sampling as sketching: the output $C = \frac{1}{t} \sum_{\ell=1}^{t} \frac{1}{p_{j_\ell}} A^{(j_\ell)} B_{(j_\ell)}$ of the sampling algorithm can be written as $C = (A S^T)(S B)$, where $S \in \mathbb{R}^{t \times n}$ is a sampling matrix: $S_{i,j} = \frac{1}{\sqrt{t p_j}}$ if column $j$ is picked in the $i$'th sample, and $S_{i,j} = 0$ otherwise. The short sketch below materializes $S$ and checks this identity.
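To make the identity concrete, the following sketch (my addition; one would never form $S$ densely in practice) verifies that $(AS^T)(SB)$ equals the averaged sum of scaled outer products.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p_dim, t = 4, 30, 5, 8
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p_dim))

w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = w / w.sum()
J = rng.choice(n, size=t, p=p)

# S is t x n: row i has a single nonzero 1/sqrt(t * p_j) in sampled column j.
S = np.zeros((t, n))
for i, j in enumerate(J):
    S[i, j] = 1.0 / np.sqrt(t * p[j])

C_sketch = (A @ S.T) @ (S @ B)
# Identical to the averaged sum of scaled rank-1 outer products.
C_sum = sum(np.outer(A[:, j], B[j, :]) / p[j] for j in J) / t
assert np.allclose(C_sketch, C_sum)
```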
Matrix multiplication via JL: [Sarlos] Output $C = (A S^T)(S B)$ where $S$ is a (fast) JL matrix. Works! Advantage? The sketch is oblivious to $A$ and $B$: one can update them, etc.
JL Lemma: Fix a vector $x \in \mathbb{R}^d$ and let $\Pi \in \mathbb{R}^{k \times d}$ be a matrix where each entry $\Pi_{ij}$ is chosen independently according to the standard normal distribution $N(0,1)$. If $k = \Omega(\frac{1}{\epsilon^2} \log(1/\delta))$, then with probability $(1 - \delta)$ we have $\frac{1}{\sqrt{k}} \|\Pi x\|_2 = (1 \pm \epsilon) \|x\|_2$. Can choose entries from $\{-1, +1\}$ as well.
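A quick empirical check of the lemma (my addition; the constant 4 in the choice of $k$ is illustrative, not the lemma's constant):

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps, delta = 10_000, 0.25, 0.01
k = int(4 / eps**2 * np.log(1 / delta))  # k = O(eps^-2 log(1/delta))

x = rng.standard_normal(d)
Pi = rng.standard_normal((k, d))         # entries i.i.d. N(0,1)
est = np.linalg.norm(Pi @ x) / np.sqrt(k)
print(est / np.linalg.norm(x))           # within (1 +/- eps) of 1, w.h.p.
```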
Definition: Let $\mathcal{D}$ be a distribution over $k \times n$ matrices. $\mathcal{D}$ is said to have the $(\epsilon, \delta)$-JL moment property if for any unit vector $x \in \mathbb{R}^n$,
$$E_{\Pi \sim \mathcal{D}} \big[ \big| \|\Pi x\|_2^2 - 1 \big|^2 \big] \le \epsilon^2 \delta.$$
Lemma: If $\Pi$ comes from an $(\epsilon, \delta)$-JL moment distribution, then for all unit vectors $x, y \in \mathbb{R}^n$,
$$E\big[ \big| \langle \Pi x, \Pi y \rangle - \langle x, y \rangle \big|^2 \big] \le c \epsilon^2 \delta$$
for some constant $c$.
Theorem: Suppose $\Pi$ is chosen from a distribution $\mathcal{D}$ that satisfies the $(\epsilon, \delta)$-JL moment property. Then
$$\Pr_{\Pi \sim \mathcal{D}} \big[ \|AB - (A \Pi^T)(\Pi B)\|_F > 3\epsilon \|A\|_F \|B\|_F \big] \le \delta.$$

Let $C = (A \Pi^T)(\Pi B)$. By Markov's inequality,
$$\Pr[\|AB - C\|_F^2 > 9\epsilon^2 \|A\|_F^2 \|B\|_F^2] \le \frac{E[\|AB - C\|_F^2]}{9\epsilon^2 \|A\|_F^2 \|B\|_F^2}.$$
$C_{i,j} = \langle \Pi a_i, \Pi b_j \rangle$ while $(AB)_{i,j} = \langle a_i, b_j \rangle$, where we write $a_i$ for the row $A_{(i)}$ and $b_j$ for the column $B^{(j)}$.
$$\|AB - C\|_F^2 = \sum_{i,j} \big| \langle \Pi a_i, \Pi b_j \rangle - \langle a_i, b_j \rangle \big|^2$$

Term by term:
$$\big| \langle \Pi a_i, \Pi b_j \rangle - \langle a_i, b_j \rangle \big|^2 = \alpha \Big| \Big\langle \Pi \tfrac{a_i}{\|a_i\|_2}, \Pi \tfrac{b_j}{\|b_j\|_2} \Big\rangle - \Big\langle \tfrac{a_i}{\|a_i\|_2}, \tfrac{b_j}{\|b_j\|_2} \Big\rangle \Big|^2$$
where $\alpha = \|a_i\|_2^2 \|b_j\|_2^2$.

Applying the JL moment property (to these unit vectors) and linearity of expectation:
$$E[\|AB - C\|_F^2] \le c \epsilon^2 \delta \sum_{i,j} \|a_i\|_2^2 \|b_j\|_2^2 = (c \epsilon^2) \delta \|A\|_F^2 \|B\|_F^2.$$
Combining the two bounds:
$$\Pr[\|AB - C\|_F^2 > 9\epsilon^2 \|A\|_F^2 \|B\|_F^2] \le \frac{E[\|AB - C\|_F^2]}{9\epsilon^2 \|A\|_F^2 \|B\|_F^2} \le \frac{(c \epsilon^2) \delta \|A\|_F^2 \|B\|_F^2}{9\epsilon^2 \|A\|_F^2 \|B\|_F^2} \le \delta$$
(for $c \le 9$), hence $\Pr[\|AB - C\|_F > 3\epsilon \|A\|_F \|B\|_F] \le \delta$.
Roughly speaking: $\Pi$ converts vectors of dimension $n$ into vectors of dimension $d = O(\frac{1}{\epsilon^2} \log(1/\delta))$. We need to compute $\Pi A^T$ and $\Pi B$ and then compute the dot products: $mp$ inner products of vectors of dimension $d$, which is $O(mpd)$ time in the worst case. Using fast JL with a very sparse $\Pi$, one can improve the running time. A minimal end-to-end sketch follows.
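Putting the pieces together, here is a minimal end-to-end sketch (mine; the scaling $1/\sqrt{k}$ is folded into $\Pi$, and a dense Gaussian $\Pi$ is used instead of a fast JL matrix):

```python
import numpy as np

def approx_matmul_jl(A, B, k, rng=None):
    """Approximate AB as (A Pi^T)(Pi B) with a scaled Gaussian JL matrix
    Pi in R^{k x n}. The sketch is oblivious to A and B, so the sketches
    A Pi^T and Pi B can be maintained under updates."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    Pi = rng.standard_normal((k, n)) / np.sqrt(k)
    return (A @ Pi.T) @ (Pi @ B)

rng = np.random.default_rng(6)
A = rng.standard_normal((40, 500))
B = rng.standard_normal((500, 30))
C = approx_matmul_jl(A, B, k=200, rng=7)
err = np.linalg.norm(C - A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"relative error: {err:.4f}")
```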