
compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 16.


summary

Last Class:

  • Spectral clustering and embeddings
  • Started application to stochastic block model.

This Class:

  • Finish up stochastic block model.
  • Efficient algorithms for SVD/eigendecomposition.
  • Iterative methods: power method, Krylov subspace methods.

1


stochastic block model

Goal: Argue the effectiveness of spectral clustering in a natural, if oversimplified, generative model.

Stochastic Block Model (Planted Partition Model): Let Gn(p, q) be a distribution over graphs on n nodes, split equally into two groups B and C, each with n/2 nodes.

  • Any two nodes in the same group are connected with probability p (including self-loops).

  • Any two nodes in different groups are connected with prob. q < p.
  • Connections are independent.

2
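As a concrete illustration (a minimal NumPy sketch of my own, not from the lecture), the sample_sbm helper below draws a graph from Gn(p, q) with the first n/2 nodes as group B and the rest as group C; the symmetric 0/1 adjacency matrix A is the object the following slides analyze.

```python
import numpy as np

def sample_sbm(n, p, q, seed=0):
    """Sample an adjacency matrix from G_n(p, q): nodes 0..n/2-1 form group B,
    nodes n/2..n-1 form group C. Same-group pairs are connected with probability p
    (self-loops included), cross-group pairs with probability q, all independently."""
    rng = np.random.default_rng(seed)
    labels = np.array([1] * (n // 2) + [-1] * (n - n // 2))   # chi_{B,C} as +/-1 labels
    same_group = np.equal.outer(labels, labels)               # True where i, j are in the same group
    probs = np.where(same_group, p, q)                        # entrywise edge probabilities
    coins = rng.random((n, n)) < probs                        # independent coin flips
    A = np.triu(coins).astype(float)                          # keep upper triangle incl. diagonal
    A = A + np.triu(A, 1).T                                   # symmetrize
    return A, labels

A, labels = sample_sbm(n=1000, p=0.1, q=0.02)
print(A.shape, int(A.sum() // 2))  # size and a rough edge count
```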


expected adjacency spectrum

Let G be a stochastic block model graph drawn from Gn(p, q) and let A ∈ Rn×n be its adjacency matrix, so (E[A])i,j = p for i, j in the same group and (E[A])i,j = q otherwise. What is the rank of E[A] and how can you see this quickly? How many nonzero eigenvalues does E[A] have?

Gn(p, q): stochastic block model distribution. B, C: groups with n/2 nodes each. Connections are independent with probability p between nodes in the same group, and probability q between nodes not in the same group.

3


expected adjacency spectrum

v1 = 1⃗ (the all-ones vector) with eigenvalue λ1 = (p+q)n/2.

v2 = χB,C with eigenvalue λ2 = (p−q)n/2.

  • χB,C(i) = 1 if i ∈ B and χB,C(i) = −1 for i ∈ C.

If we compute v2 then we recover the communities B and C!

4
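A quick numerical sanity check of these claims (my own sketch, not part of the slides): build E[A] directly for a small n and confirm that its only nonzero eigenvalues are (p+q)n/2 and (p−q)n/2, with eigenvectors in the directions of the all-ones vector and χB,C (up to sign).

```python
import numpy as np

n, p, q = 8, 0.8, 0.2
labels = np.array([1] * (n // 2) + [-1] * (n // 2))      # chi_{B,C}
EA = np.where(np.equal.outer(labels, labels), p, q)      # (E[A])_ij = p or q

eigvals, eigvecs = np.linalg.eigh(EA)                    # ascending eigenvalues
print(np.round(eigvals, 6))                              # n-2 zeros, then (p-q)n/2 = 2.4 and (p+q)n/2 = 4.0

v1 = eigvecs[:, -1]   # top eigenvector: constant, i.e. the all-ones direction (up to sign)
v2 = eigvecs[:, -2]   # second eigenvector: proportional to chi_{B,C} (up to sign)
print(np.round(v1, 3))
print(np.round(v2, 3), " vs ", labels / np.sqrt(n))
```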


expected laplacian spectrum

Letting G be a stochastic block model graph drawn from Gn(p, q), A ∈ Rn×n be its adjacency matrix and L be its Laplacian, what are the eigenvectors and eigenvalues of E[L]?

5


expected laplacian spectrum

Upshot: The second smallest eigenvector of E[L] is χB,C – the indicator vector for the cut between the communities.

  • If the random graph G (equivalently A and L) were exactly equal to its expectation, partitioning using this eigenvector would exactly recover the two communities B and C.

How do we show that a matrix (e.g., A) is close to its expectation? Matrix concentration inequalities.

  • Analogous to scalar concentration inequalities like Markov's, Chebyshev's, Bernstein's.
  • Random matrix theory is a very recent and cutting-edge subfield of mathematics that is being actively applied in computer science, statistics, and ML.

7
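To see the upshot concretely, here is a small sketch of my own under the same Gn(p, q) setup: build E[L] = E[D] − E[A] and check that the eigenvector of its second smallest eigenvalue is proportional to χB,C (only up to a sign flip, which does not affect the partition).

```python
import numpy as np

n, p, q = 8, 0.8, 0.2
labels = np.array([1] * (n // 2) + [-1] * (n // 2))      # chi_{B,C}
EA = np.where(np.equal.outer(labels, labels), p, q)      # expected adjacency matrix
ED = np.diag(EA.sum(axis=1))                             # expected degree matrix
EL = ED - EA                                             # expected Laplacian

eigvals, eigvecs = np.linalg.eigh(EL)                    # eigenvalues in ascending order
print(np.round(eigvals, 6))                              # 0, then qn = 1.6, then (p+q)n/2 = 4.0 repeated
v2 = eigvecs[:, 1]                                       # second smallest eigenvector
print(np.round(v2, 3), " vs ", labels / np.sqrt(n))      # equal up to sign
```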


matrix concentration

Matrix Concentration Inequality: If p ≥ O(log⁴ n / n), then with high probability ∥A − E[A]∥2 ≤ O(√pn), where ∥ · ∥2 is the matrix spectral norm (operator norm). For any X ∈ Rn×d, ∥X∥2 = max_{z ∈ Rd: ∥z∥2 = 1} ∥Xz∥2.

Exercise: Show that ∥X∥2 is equal to the largest singular value of X. For symmetric X (like A − E[A]), show that it is equal to the magnitude of the largest magnitude eigenvalue.

For the stochastic block model application, we want to show that the second eigenvectors of A and E[A] are close. How does this relate to their difference in spectral norm?

8
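The inequality needs a proof, but it is easy to eyeball numerically. The self-contained sketch below (mine, not from the course) samples SBM graphs of increasing size and compares ∥A − E[A]∥2, computed as the largest-magnitude eigenvalue of the symmetric deviation (exactly the exercise above), against √pn.

```python
import numpy as np

rng = np.random.default_rng(0)

def deviation_vs_bound(n, p, q):
    labels = np.array([1] * (n // 2) + [-1] * (n // 2))
    EA = np.where(np.equal.outer(labels, labels), p, q)          # expected adjacency
    A = np.triu((rng.random((n, n)) < EA).astype(float))
    A = A + np.triu(A, 1).T                                      # sampled symmetric adjacency
    dev = np.abs(np.linalg.eigvalsh(A - EA)).max()               # spectral norm of a symmetric matrix
    return dev, np.sqrt(p * n)

for n in [200, 800, 3200]:
    dev, bound = deviation_vs_bound(n, p=0.1, q=0.05)
    print(f"n = {n:5d}   ||A - E[A]||_2 = {dev:7.2f}   sqrt(pn) = {bound:7.2f}")
```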


eigenvector perturbation

Davis-Kahan Eigenvector Perturbation Theorem: Suppose A, Ā ∈ Rd×d are symmetric with ∥A − Ā∥2 ≤ ϵ and eigenvectors v1, v2, . . . , vd and v̄1, v̄2, . . . , v̄d. Letting θ(vi, v̄i) denote the angle between vi and v̄i, for all i:

sin[θ(vi, v̄i)] ≤ ϵ / min_{j≠i} |λi − λj|

where λ1, . . . , λd are the eigenvalues of Ā. The errors get large if there are eigenvalues with similar magnitudes.

9
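A hedged numerical illustration of the theorem (my own sketch with made-up matrices): perturb a symmetric matrix with well-separated eigenvalues and compare the measured sin θ(vi, v̄i) against ϵ divided by the eigenvalue gap.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
Abar = np.diag([10.0, 6.0, 3.0, 1.0, 0.5, 0.1])      # symmetric, well-separated eigenvalues
E = rng.standard_normal((d, d)) * 0.05
E = (E + E.T) / 2                                     # symmetric perturbation
A = Abar + E
eps = np.abs(np.linalg.eigvalsh(E)).max()             # ||A - Abar||_2

wbar, Vbar = np.linalg.eigh(Abar)                     # eigenvalues/vectors, ascending order
w, V = np.linalg.eigh(A)
for i in range(d):
    cos = abs(Vbar[:, i] @ V[:, i])                   # |cos theta|, insensitive to sign flips
    sin = np.sqrt(max(0.0, 1.0 - cos**2))
    gap = np.abs(np.delete(wbar, i) - wbar[i]).min()  # min_{j != i} |lambda_i - lambda_j|
    print(f"i = {i}   sin(theta) = {sin:.4f}   eps / gap = {eps / gap:.4f}")
```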


application to stochastic block model

Claim 1 (Matrix Concentration): For p ≥ O(log⁴ n / n), ∥A − E[A]∥2 ≤ O(√pn).

Claim 2 (Davis-Kahan): For p ≥ O(log⁴ n / n),

sin θ(v2, v̄2) ≤ O(√pn) / min_{j≠2} |λ2 − λj| ≤ O(√pn) / ((p − q)n/2) = O(√p / ((p − q)√n)).

Recall: E[A] has eigenvalues λ1 = (p+q)n/2, λ2 = (p−q)n/2, and λi = 0 for i ≥ 3. So min_{j≠2} |λ2 − λj| = min(qn, (p − q)n/2). Typically, (p − q)n/2 will be the minimum of these two gaps.

A: adjacency matrix of a random stochastic block model graph. p: connection probability within clusters. q < p: connection probability between clusters. n: number of nodes. v2, v̄2: second eigenvectors of A and E[A] respectively.

11


application to stochastic block model

So Far: sin θ(v2, v̄2) ≤ O(√p / ((p − q)√n)). What does this give us?

  • Can show that this implies ∥v2 − v̄2∥2² ≤ O(p / ((p − q)²n)) (exercise).
  • v̄2 is (1/√n)·χB,C: the community indicator vector.
  • Every i where v2(i), v̄2(i) differ in sign contributes ≥ 1/n to ∥v2 − v̄2∥2².
  • So they differ in sign in at most O(p / (p − q)²) positions.

A: adjacency matrix of a random stochastic block model graph. p: connection probability within clusters. q < p: connection probability between clusters. n: number of nodes. v2, v̄2: second eigenvectors of A and E[A] respectively.

12


application to stochastic block model

Upshot: If G is a stochastic block model graph with adjacency matrix A, and we compute its second largest eigenvector v2 and assign nodes to communities according to the sign pattern of this vector, we will correctly assign all but O(p / (p − q)²) nodes.

  • Why does the error increase as q gets close to p?
  • Even when p − q = O(1/√n), assign all but an O(n) fraction of nodes correctly, e.g., assign 99% of nodes correctly.

13
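The whole pipeline fits in a few lines. The sketch below is my own illustration (not the course's reference code): sample from Gn(p, q), take the second largest eigenvector of A, assign communities by sign, and count mistakes up to the global sign flip that the analysis ignores.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2000, 0.10, 0.04

labels = np.array([1] * (n // 2) + [-1] * (n // 2))              # ground-truth communities
probs = np.where(np.equal.outer(labels, labels), p, q)
A = np.triu((rng.random((n, n)) < probs).astype(float))
A = A + np.triu(A, 1).T                                          # sampled symmetric adjacency

eigvals, eigvecs = np.linalg.eigh(A)                             # ascending eigenvalues
v2 = eigvecs[:, -2]                                              # second largest eigenvector
guess = np.where(v2 >= 0, 1, -1)                                 # assign by sign pattern
errors = min(np.sum(guess != labels), np.sum(-guess != labels))  # best over the two sign flips
print(f"misclassified {errors} of {n} nodes ({100 * errors / n:.2f}%)")
```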


Questions on spectral partitioning?

14


efficient eigendecomposition and svd

We have talked about the eigendecomposition and SVD as ways to compress data, to embed entities like words and documents, to compress/cluster non-linearly separable data. How efficient are these techniques? Can they be run on massive datasets?

15


computing the svd

To compute the SVD of A ∈ Rn×d, A = UΣVᵀ, first compute V. Then compute UΣ = AV.

  • Compute AᵀA – O(nd²) runtime.
  • Find eigendecomposition AᵀA = VΛVᵀ – O(d³) runtime.
  • Compute L = AV. Set σi = ∥Li∥2 and Ui = Li/∥Li∥2 – O(nd²) runtime.

Total runtime: O(nd² + d³) = O(nd²) (assume w.l.o.g. n ≥ d).

  • If we have n = 10 million images with 200 × 200 × 3 = 120,000 pixel values each, the runtime is 1.5 × 10¹⁷ operations!
  • The world's fastest supercomputers compute at ≈ 100 petaFLOPS = 10¹⁷ FLOPS (floating point operations per second).
  • This is an easy task for them – but no one else.

16
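These three steps translate directly into NumPy. The sketch below is my own and assumes A has full column rank so that no σi is zero; it is meant to exhibit the O(nd² + d³) recipe, not to be a numerically robust SVD.

```python
import numpy as np

def svd_via_gram(A):
    """SVD of A (n x d, n >= d) via the eigendecomposition of A^T A."""
    lam, V = np.linalg.eigh(A.T @ A)        # form A^T A in O(nd^2), decompose in O(d^3); lam = sigma^2
    order = np.argsort(lam)[::-1]           # put eigenvalues (hence singular values) in descending order
    V = V[:, order]
    L = A @ V                               # O(nd^2); column i equals sigma_i * u_i
    sigma = np.linalg.norm(L, axis=0)       # sigma_i = ||L_i||_2
    U = L / sigma                           # u_i = L_i / ||L_i||_2 (assumes sigma_i > 0)
    return U, sigma, V

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
U, sigma, V = svd_via_gram(A)
print(np.allclose((U * sigma) @ V.T, A))                        # A = U Sigma V^T
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))   # matches NumPy's singular values
```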


faster algorithms

To speed up SVD computation we will take advantage of the fact that we typically only care about computing the top (or bottom) k singular vectors for k ≪ d.

  • Suffices to compute Vk ∈ Rd×k and then compute UkΣk = AVk.
  • Use an iterative algorithm to compute an approximation to the top k singular vectors Vk.
  • Runtime will be roughly O(ndk) instead of O(nd²).

Won’t cover: randomized methods, which can be much faster in some cases.

17
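For instance, once some iterative routine has produced the (approximate) top-k right singular vectors Vk, the remaining factors cost only a single O(ndk) multiplication. In this sketch of mine, Vk is simply taken from a full SVD so the example stays self-contained; an iterative method would supply it instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 60, 5
A = rng.standard_normal((n, d))

# Stand-in for the iterative step: the top-k right singular vectors (d x k).
Vk = np.linalg.svd(A, full_matrices=False)[2][:k].T

L = A @ Vk                                  # O(ndk); column i equals sigma_i * u_i
sigma_k = np.linalg.norm(L, axis=0)         # top-k singular values
Uk = L / sigma_k                            # top-k left singular vectors

A_k = (Uk * sigma_k) @ Vk.T                 # rank-k approximation U_k Sigma_k V_k^T
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # relative error of the rank-k approximation
```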


sparse vs. direct

In numerical linear algebra, two main types of methods:

Direct Methods: Gaussian elimination, QR decomposition, Cholesky decomposition, etc.

  • Directly manipulate the entries of the input matrix A. Typically run in O(n³) time for an n × n matrix.

Sparse (Iterative) Methods: Conjugate gradient, Gauss-Seidel, Krylov subspace methods, Lanczos, gradient descent.

  • Generally only access A via a sequence of matrix-vector multiplications Ax1, Ax2, . . . , Axt.
  • Runtime is (# iterations t) × (matrix-vector multiplication time) = O(nnz(A) · t) = O(ndt), where nnz(A) is the number of nonzero entries in A.
  • Not just for sparse matrices!

18


sparse vs. direct

Matlab: svd and eig vs. svds and eigs.

SciPy (Python): scipy.linalg.svd vs. scipy.sparse.linalg.svds.

19
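A small usage sketch of the SciPy routines named here (assuming NumPy and SciPy are installed): the dense routine factors the whole matrix, while svds only touches A through matrix-vector products and therefore also accepts sparse matrices. The two should agree on the top singular values up to solver tolerance.

```python
import numpy as np
from scipy.linalg import svd
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

A = sparse_random(2000, 500, density=0.01, random_state=0, format="csr")

# Direct method: dense SVD of the full matrix, roughly O(nd^2).
U, s, Vt = svd(A.toarray(), full_matrices=False)

# Iterative method: top-k singular triplets using only matrix-vector products.
Uk, sk, Vtk = svds(A, k=5)      # note: svds returns singular values in ascending order

print(np.round(np.sort(sk)[::-1], 4))
print(np.round(s[:5], 4))
```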


power method

Power Method: The most fundamental iterative method for approximate SVD. Applies to computing k = 1 singular vectors.

Goal: Given A ∈ Rn×d with SVD A = UΣVᵀ, find z ≈ v1.

  • Choose z(0) randomly. E.g. z(0)(i) ∼ N(0, 1).
  • For i = 1, . . . , t:
    • z(i) = Aᵀ · (Az(i−1))  (Runtime: 2 · nd)
    • ni = ∥z(i)∥2  (Runtime: d)
    • z(i) = z(i)/ni  (Runtime: d)
  • Return z(t).

Total Runtime: O(ndt)

20
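A line-by-line NumPy rendering of this pseudocode (my sketch, not the course's reference implementation). The returned vector is only determined up to sign, and a small gap between σ1 and σ2 slows convergence, as the theorem at the end of the lecture quantifies.

```python
import numpy as np

def power_method(A, t=100, seed=0):
    """Approximate the top right singular vector v1 of A with t iterations."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    z = rng.standard_normal(d)             # z(0) with i.i.d. N(0, 1) entries
    for _ in range(t):
        z = A.T @ (A @ z)                  # z(i) = A^T (A z(i-1)), about 2*nd operations
        z = z / np.linalg.norm(z)          # normalize, O(d)
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 40))
z = power_method(A, t=200)
v1 = np.linalg.svd(A, full_matrices=False)[2][0]             # true top right singular vector
print(min(np.linalg.norm(z - v1), np.linalg.norm(z + v1)))   # small error, up to sign
```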


power method intuition

Write z(0) in the right singular vector basis: z(0) = c1·v1 + c2·v2 + . . . + cd·vd.

Update step: z(i) = Aᵀ · (Az(i−1)) = VΣ²Vᵀz(i−1) (then normalize).

Claim: z(1) = (1/n1)·[c1·σ1²·v1 + c2·σ2²·v2 + . . . + cd·σd²·vd]

21


power method intuition

Claim: z(t) = (1/(n1 · n2 · · · nt))·[c1·σ1^(2t)·v1 + c2·σ2^(2t)·v2 + . . . + cd·σd^(2t)·vd]

After t iterations, you have ‘powered up’ the singular values, making the component in the direction of v1 much larger relative to the other components.

22
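The claim is easy to verify numerically (my own sketch): run the update without normalization (normalization only rescales the vector), expand z(t) in the right singular vector basis, and compare the coefficients with ci·σi^(2t).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

z0 = rng.standard_normal(5)
c = V.T @ z0                           # coefficients c_i of z(0) in the v_i basis

t = 3
z = z0.copy()
for _ in range(t):
    z = A.T @ (A @ z)                  # unnormalized power method update

coeffs = V.T @ z                       # coefficients of z(t) in the v_i basis
predicted = c * sigma ** (2 * t)       # claim: c_i * sigma_i^(2t)
print(np.allclose(coeffs, predicted))  # should print True
```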


power method convergence

Theorem (Basic Power Method Convergence): Let γ = (σ1 − σ2)/σ1 be a parameter capturing the “gap” between the first and second largest singular values. If the power method is initialized with a random Gaussian vector, then with high probability, after t = O(log(d/ϵ)/γ) steps, ∥v1 − z(t)∥2 ≤ ϵ.

Total runtime: O(nnz(A) · log(d/ϵ)/γ) = O(nd · log(d/ϵ)/γ).

Next Time: Will analyze this method formally.

23