Approximate Matrix Multiplication, Lecture 21, April 11, 2019 (PowerPoint PPT Presentation)



SLIDE 1

CS 498ABD: Algorithms for Big Data, Spring 2019

Approximate Matrix Multiplication

Lecture 21

April 11, 2019

Chandra (UIUC) CS498ABD 1 Spring 2019 1 / 34

SLIDE 2

Matrix data

A lot of data can be viewed as defining a matrix. We have already seen vectors modeling data/signals; more generally one can use tensors too. With n data items, where each data item a_i is a vector over some m features, A is the matrix defined by the n data items: assuming a_1, ..., a_n are columns, A is an m × n matrix. Combinatorial objects such as graphs can also be modeled via matrices.

SLIDE 3

Numerical Linear Algebra

Basic problems in linear algebra:
Matrix-vector product: compute Ax.
Matrix multiplication: compute AB.
Linear equations: solve Ax = b.
Matrix inversion: compute A^{−1}.
Least squares: solve min_x ‖Ax − b‖_2.
Singular value decomposition, eigenvalues, principal component analysis, low-rank approximations, ...
Fundamental in all areas of applied mathematics and engineering. Many applications to statistics and data analysis.

SLIDE 4

Numerical Linear Algebra

NLA has a vast literature. In practice, iterative methods are used that converge to an optimum solution; they can take advantage of sparsity in the input data better than exact methods.
Some TCS contributions in the recent past: randomized NLA for faster algorithms with provable approximation guarantees (sampling- and JL-based techniques, among others), and revisiting preconditioning methods for Laplacians and beyond.
Many powerful applications in theory and practice.

SLIDE 5

Norms and matrix norms

Definition

A norm on a real vector space V is a real-valued function ‖·‖ that has three properties: (i) ‖x‖ ≥ 0 for all x ∈ V, and ‖x‖ = 0 implies x = 0; (ii) ‖ax‖ = |a| ‖x‖ for all scalars a; (iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Familiar vector norms: ‖x‖_p = (Σ_i |x_i|^p)^{1/p}

If A is an injective linear transformation, then ‖Ax‖ is also a norm on the original space.

Norms and metrics: d(x, y) = ‖x − y‖ is a metric

SLIDE 6

Matrix norms

Consider the vector space of all matrices A ∈ R^{m×n}. What are useful norms over matrices?

Treat the matrix like a vector of dimension mn and apply a vector norm. For instance, the Frobenius norm is ‖A‖_F = (Σ_{i,j} |A_{i,j}|²)^{1/2}.

Treat the matrix as a linear operator and see what it does to the norms of the vectors it acts on. The spectral norm is ‖A‖_2 = sup_{‖x‖_2 = 1} ‖Ax‖_2.

Schatten p-norms, based on the singular values of A: trace norm, nuclear norm, ...

Norms are related in some cases (different perspectives on the same norm)
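As a quick sanity check (not part of the slides; the example matrix is arbitrary), all three kinds of norms are available through NumPy's `numpy.linalg.norm` and `numpy.linalg.svd`:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

fro = np.linalg.norm(A, 'fro')    # (sum_{i,j} |A_ij|^2)^{1/2}
spec = np.linalg.norm(A, 2)       # largest singular value = sup_{||x||_2 = 1} ||Ax||_2
nuc = np.linalg.norm(A, 'nuc')    # nuclear (trace) norm = sum of singular values

# The Frobenius norm is the Schatten 2-norm and the spectral norm is the
# Schatten infinity-norm, which connects the two perspectives above.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(fro, np.sqrt(np.sum(s**2)))
assert np.isclose(spec, s.max())
assert np.isclose(nuc, s.sum())
```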

SLIDE 7

Frobenius and Spectral norms

Submultiplicative property: ‖AB‖_F ≤ ‖A‖_F ‖B‖_F and ‖AB‖_2 ≤ ‖A‖_2 ‖B‖_2

SLIDE 8

Matrix Multiplication

Problem: Given matrices A ∈ R^{m×n} and B ∈ R^{n×p}, compute the matrix AB.

Standard algorithm based on the definition: O(mnp) time.

Faster algorithms via non-trivial Strassen-like divide and conquer.

SLIDE 9

Matrix Multiplication

Approximation: Compute D ∈ R^{m×p} such that D − AB is small in some appropriate matrix norm. Two methods: random sampling, and random projections (fast JL).

SLIDE 10

Matrix Multiplication

Problem: Given matrices A ∈ R^{m×n} and B ∈ R^{n×p}, compute the matrix AB.

Notation: M^(j) for the j-th column of M and M_(i) for the i-th row of M, both interpreted as vectors.

From the textbook definition: D_{i,h} = ⟨A_(i), B^(h)⟩ = Σ_{k=1}^n A_{i,k} B_{k,h}

Consider Aᵀ as m column vectors from R^n and B as p column vectors from R^n. We want to compute all mp inner products of these vectors.

SLIDE 11

Approximate Matrix Multiplication

Want to approximate AB in the Frobenius norm: ideally D with ‖D − AB‖_F ≤ ε‖AB‖_F, but ‖AB‖_F can be 0. Instead we settle for ‖D − AB‖_F ≤ ε‖A‖_F ‖B‖_F.

SLIDE 12

Part I Random Sampling for Approx Matrix Mult

SLIDE 13

Matrix Multiplication and Outer Products

Alternate definition of matrix multiplication based on outer products:

AB = Σ_{j=1}^n A^(j) B_(j)

where A^(j) B_(j) is an m × p matrix of rank 1.
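The outer-product identity is easy to verify numerically; a small check with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 5, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# AB as the sum of n rank-1 terms: column j of A times row j of B.
outer_sum = sum(np.outer(A[:, j], B[j, :]) for j in range(n))
assert np.allclose(outer_sum, A @ B)
```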

SLIDE 14

Importance Sampling

AB = Σ_{j=1}^n A^(j) B_(j)

Pick a probability distribution over [n]: p_1 + p_2 + ... + p_n = 1.

For ℓ = 1 to t: pick an index j_ℓ ∈ [n] according to the distribution p (independently, with replacement).

Output C = (1/t) Σ_{ℓ=1}^t (1/p_{j_ℓ}) A^(j_ℓ) B_(j_ℓ)

SLIDE 15

Importance Sampling

Write C = (1/t) Σ_ℓ C_ℓ, where C_ℓ = (1/p_{j_ℓ}) A^(j_ℓ) B_(j_ℓ) satisfies E[C_ℓ] = AB.

By linearity of expectation: E[C] = AB
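A quick simulation of the unbiasedness claim (all choices here, including the uniform distribution p and the seed, are arbitrary; any full-support p works):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
B = rng.standard_normal((6, 2))
p = np.full(6, 1 / 6)   # any distribution with p_j > 0 gives an unbiased estimator

# Average many single-sample estimators C_l = (1/p_j) A^(j) B_(j).
reps = 200_000
js = rng.choice(6, size=reps, p=p)
est = (A[:, js] / p[js]) @ B[js, :] / reps
assert np.allclose(est, A @ B, atol=0.1)   # concentrates around AB
```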

SLIDE 16

Importance Sampling

Question: How should we choose p1, p2, . . . , pn?

SLIDE 17

Importance Sampling

p_j should correspond to the contribution of A^(j) B_(j) to ‖AB‖_F

SLIDE 18

Importance Sampling

Use the spectral norm of A^(j) B_(j), which is ‖A^(j) B_(j)‖_2

SLIDE 19

Importance Sampling

Claim: ‖A^(j) B_(j)‖_2 = ‖A^(j)‖_2 ‖B_(j)‖_2. Choose

p_j = ‖A^(j)‖_2 ‖B_(j)‖_2 / Σ_ℓ ‖A^(ℓ)‖_2 ‖B_(ℓ)‖_2

Due to [Drineas-Kannan-Mahoney]
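Putting the pieces together, here is a minimal NumPy sketch of the sampling estimator with these probabilities (the function name and the test dimensions are my own; this is an illustration, not the authors' code):

```python
import numpy as np

def approx_matmul(A, B, t, rng):
    """Importance-sampling estimator: average t scaled rank-1 outer products."""
    n = A.shape[1]
    # Optimal probabilities: p_j proportional to ||A^(j)||_2 * ||B_(j)||_2.
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    js = rng.choice(n, size=t, replace=True, p=p)
    C = np.zeros((A.shape[0], B.shape[1]))
    for j in js:
        C += np.outer(A[:, j], B[j, :]) / p[j]
    return C / t

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 100))
B = rng.standard_normal((100, 15))
C = approx_matmul(A, B, t=4000, rng=rng)
err = np.linalg.norm(C - A @ B, 'fro')
bound = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
# the analysis predicts err on the order of bound / sqrt(t)
assert err < 0.3 * bound
```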

SLIDE 20

Running time

For all j compute ‖A^(j)‖_2 and ‖B_(j)‖_2: this takes one pass over A and B, and allows one to compute p_1, ..., p_n.

C = (1/t) Σ_{ℓ=1}^t (1/p_{j_ℓ}) A^(j_ℓ) B_(j_ℓ)

At most O(tmp + N_A + N_B) time, where N_A and N_B are the numbers of non-zeroes in A and B. The full computation takes O(nmp) time.

SLIDE 21

Analysis of approximation

Want to analyse Pr[‖C − AB‖_F ≥ ε‖A‖_F ‖B‖_F].

SLIDE 22

Analysis of approximation

Using Markov's inequality:

Pr[‖C − AB‖_F ≥ ε‖A‖_F ‖B‖_F] ≤ E[‖C − AB‖²_F] / (ε² ‖A‖²_F ‖B‖²_F)

Lemma

E[‖C − AB‖²_F] ≤ (1/t) (Σ_{j=1}^n ‖A^(j)‖_2 ‖B_(j)‖_2)² − (1/t) ‖AB‖²_F

SLIDE 23

Analysis continued

By Markov's inequality,

Pr[‖C − AB‖_F ≥ ε‖A‖_F ‖B‖_F] ≤ E[‖C − AB‖²_F] / (ε² ‖A‖²_F ‖B‖²_F)

and by the Lemma, dropping the negative term and applying Cauchy-Schwarz (Σ_j ‖A^(j)‖_2 ‖B_(j)‖_2 ≤ ‖A‖_F ‖B‖_F),

E[‖C − AB‖²_F] ≤ (1/t) (Σ_{j=1}^n ‖A^(j)‖_2 ‖B_(j)‖_2)² − (1/t) ‖AB‖²_F ≤ (1/t) ‖A‖²_F ‖B‖²_F.

Thus, if t = 1/(ε²δ) then

Pr[‖C − AB‖_F ≥ ε‖A‖_F ‖B‖_F] ≤ δ.

SLIDE 24

Median trick

Recall that we used the median trick to improve the dependence on δ from 1/δ to log(1/δ). If t = 3/ε² then

Pr[‖C − AB‖_F ≥ ε‖A‖_F ‖B‖_F] ≤ 1/3.

Repeat independently to obtain C_1, C_2, ..., C_r where r = Θ(log(1/δ)). By Chernoff bounds, a majority of the estimators are good. How do we pick the "median" matrix?

SLIDE 25

Median trick

For each 1 ≤ i ≤ r compute

ρ_i = |{j : j ≠ i, ‖C_i − C_j‖_F ≤ 2ε‖A‖_F ‖B‖_F}|

Output any C_s such that ρ_s ≥ r/2. [Clarkson-Woodruff]
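The selection rule transcribes directly (hypothetical helper; the caller supplies thresh = 2ε‖A‖_F‖B‖_F):

```python
import numpy as np

def select_median_matrix(Cs, thresh):
    """Return some C_s within thresh of at least half of the other estimates."""
    r = len(Cs)
    for s in range(r):
        rho = sum(1 for j in range(r)
                  if j != s and np.linalg.norm(Cs[s] - Cs[j], 'fro') <= thresh)
        if rho >= r / 2:
            return Cs[s]
    return Cs[0]   # with high probability this fallback is never reached
```

With r = Θ(log(1/δ)) independent estimates, a majority are within ε‖A‖_F‖B‖_F of AB, so the returned matrix is close to AB with probability 1 − δ.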

SLIDE 26

Median trick

Correctness follows from the triangle inequality:

‖C_i − C_j‖_F ≤ ‖C_i − AB‖_F + ‖C_j − AB‖_F and ‖C_i − C_j‖_F ≥ ‖C_i − AB‖_F − ‖C_j − AB‖_F.

SLIDE 27

Median trick

If C_s is good: more than half of the C_i's have ‖C_i − AB‖_F < ε‖A‖_F ‖B‖_F, so by the first inequality ‖C_s − C_j‖_F ≤ 2ε‖A‖_F ‖B‖_F for every good C_j, and hence ρ_s ≥ r/2.

SLIDE 28

Median trick

If C_s is bad (‖C_s − AB‖_F > 3ε‖A‖_F ‖B‖_F): for every good C_j, the second inequality gives ‖C_s − C_j‖_F > 2ε‖A‖_F ‖B‖_F, and since more than half of the C_i's are good, ρ_s < r/2.

SLIDE 29

Running time again

As before, the sampling estimator takes O(tmp + N_A + N_B) time. Either we choose t = 1/(ε²δ), or we choose t = (1/ε²) ln(1/δ) with the median trick, but then we need to do the pairwise matrix comparisons, so effectively t = (1/ε⁴) log²(1/δ).

SLIDE 30

Proof of Lemma

Lemma

E[‖C − AB‖²_F] ≤ (1/t) (Σ_{j=1}^n ‖A^(j)‖_2 ‖B_(j)‖_2)² − (1/t) ‖AB‖²_F

Recall that C is the average of t independent estimators, so the lemma is basically about t = 1.

E[‖C − AB‖²_F] = Σ_{x,y} E[(C_{x,y} − (AB)_{x,y})²]

Fix x, y and let Z = C_{x,y}. We have E[Z] = (AB)_{x,y}. Hence

Var[Z] = E[Z²] − (AB)²_{x,y} = E[(C_{x,y} − (AB)_{x,y})²]
SLIDE 31

Proof of Lemma

With t = 1:

E[Z²] = Σ_{j=1}^n p_j (A_{x,j} B_{j,y})² / p_j² = Σ_{j=1}^n (A_{x,j} B_{j,y})² / p_j

Thus

E[‖C − AB‖²_F] = Σ_{x,y} Σ_{j=1}^n (A_{x,j} B_{j,y})² / p_j − ‖AB‖²_F.

Simplifying the first term:

Σ_{x,y} Σ_{j=1}^n (A_{x,j})² (B_{j,y})² / p_j = Σ_{j=1}^n (1/p_j) ‖A^(j)‖² ‖B_(j)‖²

SLIDE 32

Proof of Lemma

Recall p_j = ‖A^(j)‖_2 ‖B_(j)‖_2 / Σ_ℓ ‖A^(ℓ)‖_2 ‖B_(ℓ)‖_2. Substituting into the first term,

E[‖C − AB‖²_F] = (Σ_{j=1}^n ‖A^(j)‖_2 ‖B_(j)‖_2)² − ‖AB‖²_F.

SLIDE 33

Proof of Lemma

One can show that this choice of the p_j values is optimal for reducing the variance in the simple importance sampling scheme.

SLIDE 34

Sampling matrix view

In matrix form, the sampling estimator is C = (A Sᵀ)(S B), where S ∈ R^{t×n} is a sampling matrix with

S_{ℓ,j} = 1/√(t p_j) if column j is picked in the ℓ-th sample, and S_{ℓ,j} = 0 otherwise.
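The equivalence between the sampled sum and the matrix form C = (A Sᵀ)(S B) can be checked directly (sizes and seed are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p_dim, t = 4, 10, 3, 6
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p_dim))
w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
prob = w / w.sum()
js = rng.choice(n, size=t, p=prob)

# S in R^{t x n}: row l has its single nonzero 1/sqrt(t * p_j) in column j_l.
S = np.zeros((t, n))
for l, j in enumerate(js):
    S[l, j] = 1.0 / np.sqrt(t * prob[j])

C_matrix_view = (A @ S.T) @ (S @ B)
C_sum_view = sum(np.outer(A[:, j], B[j, :]) / prob[j] for j in js) / t
assert np.allclose(C_matrix_view, C_sum_view)
```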

SLIDE 35

Part II Random Projection for Approx Matrix Mult

SLIDE 36

JL Approach for Approx Matrix Multiplication

[Sarlos] Output C = (A Sᵀ)(S B) where S is a (fast) JL matrix. Works! Advantage?

SLIDE 37

JL Approach for Approx Matrix Multiplication

Advantage: the method is oblivious to A and B. One can update them, etc.

SLIDE 38

Recalling JL

Lemma (Distributional JL Lemma)

Fix a vector x ∈ R^d and let Π ∈ R^{k×d} be a matrix where each entry Π_{ij} is chosen independently according to the standard normal distribution N(0, 1). If k = Ω((1/ε²) log(1/δ)), then with probability (1 − δ) we have

(1/√k) ‖Πx‖_2 = (1 ± ε) ‖x‖_2.

Can choose entries from {−1, +1} as well.
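An empirical check of the lemma (dimensions and seed are arbitrary demo choices; k is taken large so a single trial lands well inside 1 ± ε):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 1000, 2000    # k = Omega(eps^-2 log(1/delta)); generous for the demo
x = rng.standard_normal(d)
Pi = rng.standard_normal((k, d))          # i.i.d. N(0, 1) entries

ratio = np.linalg.norm(Pi @ x) / (np.sqrt(k) * np.linalg.norm(x))
assert abs(ratio - 1) < 0.1               # (1/sqrt(k))||Pi x||_2 ~ ||x||_2

# Sign entries work as well, as noted above.
Pi_sign = rng.choice([-1.0, 1.0], size=(k, d))
ratio_sign = np.linalg.norm(Pi_sign @ x) / (np.sqrt(k) * np.linalg.norm(x))
assert abs(ratio_sign - 1) < 0.1
```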

Definition

Let D be a distribution over m × n matrices. D is said to have the (ε, δ)-JL moment property if for every unit vector x ∈ R^n, E_{Π∼D}[ |‖Πx‖²_2 − 1| ] ≤ εδ.

SLIDE 39

JL Property

Lemma

If Π comes from a distribution with the (ε, δ)-JL moment property, then for all unit vectors x, y ∈ R^n,

E[ |⟨Πx, Πy⟩ − ⟨x, y⟩|² ] ≤ c ε² δ.

SLIDE 40

Using JL

Theorem

Suppose Π is chosen from a distribution D that satisfies the (ε, δ)-JL moment property. Then

Pr_{Π∼D}[ ‖AB − (A Πᵀ)(Π B)‖_F > 3ε ‖A‖_F ‖B‖_F ] ≤ δ.
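A sketch of the resulting oblivious algorithm, using a scaled random sign matrix as Π (such matrices are covered by the JL discussion above; the sizes and the sketch dimension k are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 30, 5000, 20
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

k = 500                                           # sketch dimension, k << n
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

# Oblivious sketch: compress the shared inner dimension n down to k,
# then multiply the two smaller matrices.
C = (A @ Pi.T) @ (Pi @ B)

err = np.linalg.norm(A @ B - C, 'fro')
bound = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
assert err < 0.5 * bound
```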

SLIDE 41

Using JL

Let C = (A Πᵀ)(Π B). By Markov's inequality,

Pr[‖AB − C‖_F > 3ε ‖A‖_F ‖B‖_F] ≤ E[‖AB − C‖²_F] / (3ε ‖A‖_F ‖B‖_F)².

SLIDE 42

Analysis

C_{i,j} = ⟨Π A_(i), Π B^(j)⟩ while (AB)_{i,j} = ⟨A_(i), B^(j)⟩. Notation: a_i for A_(i) and b_j for B^(j).

SLIDE 43

Analysis

‖AB − C‖²_F = Σ_{i,j} |⟨Π a_i, Π b_j⟩ − ⟨a_i, b_j⟩|²

SLIDE 44

Analysis

Term by term:

|⟨Π a_i, Π b_j⟩ − ⟨a_i, b_j⟩|² = α |⟨Π a_i/‖a_i‖_2, Π b_j/‖b_j‖_2⟩ − ⟨a_i/‖a_i‖_2, b_j/‖b_j‖_2⟩|²

where α = ‖a_i‖²_2 ‖b_j‖²_2

SLIDE 45

Analysis

Applying the JL-moment property to the unit vectors a_i/‖a_i‖_2 and b_j/‖b_j‖_2 and using linearity of expectation,

E[‖AB − C‖²_F] ≤ c ε² δ Σ_{i,j} ‖a_i‖²_2 ‖b_j‖²_2 = c ε² δ ‖A‖²_F ‖B‖²_F

SLIDE 46

Analysis

Combining

Pr[‖AB − C‖_F > 3ε ‖A‖_F ‖B‖_F] ≤ E[‖AB − C‖²_F] / (3ε ‖A‖_F ‖B‖_F)²

with

E[‖AB − C‖²_F] ≤ c ε² δ ‖A‖²_F ‖B‖²_F

gives Pr[‖AB − C‖_F > 3ε ‖A‖_F ‖B‖_F] ≤ (c/9) δ ≤ δ (the factor 3 in the threshold absorbs the constant c).

SLIDE 47

Running time

Roughly speaking, Π converts vectors of dimension n into vectors of dimension d = O((1/ε²) log(1/δ)).

Need to compute Π Aᵀ and Π B and then compute the dot products: mp inner products of vectors of dimension d, which is O(mpd) time in the worst case. Using fast JL with a very sparse Π one can improve the running time.
