compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 10.



SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 10.

SLIDE 2

logistics

  • Problem Set 2 is due next Friday 10/11, although we will allow submissions until Sunday 10/13 at midnight with no penalty.
  • Midterm on Thursday 10/17. Will cover material through today.

SLIDE 3

summary

Last Class: Dimensionality Reduction

  • Applications and examples of dimensionality reduction in data science.
  • Low-distortion embeddings (MinHash as an example).
  • Low-distortion embeddings for Euclidean space and the Johnson-Lindenstrauss Lemma.

This Class: Finish the JL Lemma.

  • Prove the Johnson-Lindenstrauss Lemma.
  • Discuss algorithmic considerations, connections to other methods, etc.

SLIDE 4

embeddings for euclidean space

Low-Distortion Embedding for Euclidean Space: Given x1, . . . , xn ∈ Rd and error parameter ϵ ≥ 0, find x̃1, . . . , x̃n ∈ Rd′ (where d′ ≪ d) such that for all i, j ∈ [n]:

(1 − ϵ)∥xi − xj∥₂ ≤ ∥x̃i − x̃j∥₂ ≤ (1 + ϵ)∥xi − xj∥₂

If x1, . . . , xn lie in a k-dimensional subspace of Rd, we can project to d′ = k dimensions with no distortion. If they lie close to a k-dimensional subspace, we can project to k dimensions without much distortion (the idea behind PCA).
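When the points do lie exactly in a k-dimensional subspace, the distortion-free projection can be computed explicitly. A minimal numpy sketch (the sizes and the SVD-based basis construction are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 50, 500, 5

# Points lying exactly in a k-dimensional subspace of R^d
B = rng.normal(size=(d, k))
X = rng.normal(size=(n, k)) @ B.T          # n points, each a combination of k basis vectors

# Orthonormal basis for the subspace (top-k right singular vectors) gives
# a projection to k coordinates that preserves all pairwise distances exactly.
V = np.linalg.svd(X, full_matrices=False)[2][:k].T   # d x k, orthonormal columns
X_low = X @ V                                        # n x k coordinates

i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(X_low[i] - X_low[j]))  # equal up to float error
```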

SLIDE 5

the johnson-lindenstrauss lemma

Johnson-Lindenstrauss Lemma: Let Π ∈ Rd′×d have each entry chosen i.i.d. as (1/√d′) · N(0, 1). For any set of points x1, . . . , xn ∈ Rd, ϵ, δ > 0, and d′ = O(log(n/δ)/ϵ²), letting x̃i = Πxi, with probability ≥ 1 − δ we have:

For all i, j: (1 − ϵ)∥xi − xj∥₂ ≤ ∥x̃i − x̃j∥₂ ≤ (1 + ϵ)∥xi − xj∥₂.

Surprising and powerful result.

  • Construction of Π is simple, random, and data oblivious.

x1, . . . , xn: original data points (d dimensions), x̃1, . . . , x̃n: compressed data points (d′ < d dimensions), Π ∈ Rd′×d: random projection matrix (embedding function), ϵ: error of embedding, δ: failure probability.
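A minimal numpy sketch of the construction (the constant 8 in d′ and all sizes are illustrative assumptions; the lemma only fixes the asymptotics):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 100, 10_000, 0.2
d_prime = int(np.ceil(8 * np.log(n) / eps**2))  # d' = O(log(n)/eps^2); constant is illustrative

X = rng.normal(size=(n, d))                            # original points, one per row
Pi = rng.normal(size=(d_prime, d)) / np.sqrt(d_prime)  # entries i.i.d. (1/sqrt(d')) * N(0,1)
X_tilde = X @ Pi.T                                     # compressed points x~_i = Pi x_i

# Check the distortion of one pairwise distance
orig = np.linalg.norm(X[0] - X[1])
comp = np.linalg.norm(X_tilde[0] - X_tilde[1])
print(comp / orig)  # typically within 1 +/- eps
```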

SLIDE 6

random projection

Π ∈ Rd′×d is a random matrix, i.e., a random function mapping length-d vectors to length-d′ vectors.

x1, . . . , xn: original points (d dims.), x̃1, . . . , x̃n: compressed points (d′ < d dims.), Π: random projection (embedding function), ϵ: error of embedding.

SLIDES 7–13

connection to simhash

Compression operation is x̃i = Πxi, so for any j,

x̃i(j) = ⟨Π(j), xi⟩ = ∑_{k=1}^{d} Π(j, k) · xi(k).

Π(j) is a vector with independent random Gaussian entries. Points with high cosine similarity have similar random projections. Computing a length-d′ SimHash signature SH1(xi), . . . , SHd′(xi) is identical to computing x̃i = Πxi and then taking sign(x̃i).

x1, . . . , xn: original points (d dims.), x̃1, . . . , x̃n: compressed points (d′ < d dims.), Π ∈ Rd′×d: random projection (embedding function)
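A small sketch of this sign(Πx) view of SimHash (names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_prime = 1000, 64

Pi = rng.normal(size=(d_prime, d))   # Gaussian rows; scaling does not affect signs

def simhash_signature(x: np.ndarray) -> np.ndarray:
    """Length-d' SimHash signature: the sign of each Gaussian projection <Pi(j), x>."""
    return np.sign(Pi @ x)

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)     # a nearby point with high cosine similarity
print(np.mean(simhash_signature(x) == simhash_signature(y)))  # fraction of matching bits, close to 1
```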

SLIDE 14

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let Π ∈ Rm×d have each entry chosen i.i.d. as (1/√m) · N(0, 1). If we set m = O(log(1/δ)/ϵ²), then for any y ∈ Rd, with probability ≥ 1 − δ:

(1 − ϵ)∥y∥₂ ≤ ∥Πy∥₂ ≤ (1 + ϵ)∥y∥₂

Applying a random matrix Π to any vector y preserves y's norm with high probability.

  • Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.
  • Can be proven from first principles. Will see next.

Π ∈ Rm×d: random projection matrix. d: original dimension. m: compressed dimension (analogous to d′), ϵ: embedding error, δ: embedding failure prob.
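A quick empirical check of this norm-preservation statement (all constants and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, eps, trials = 1000, 500, 0.2, 200

y = rng.normal(size=d)
ratios = []
for _ in range(trials):
    Pi = rng.normal(size=(m, d)) / np.sqrt(m)   # entries i.i.d. (1/sqrt(m)) * N(0,1)
    ratios.append(np.linalg.norm(Pi @ y) / np.linalg.norm(y))

print(min(ratios), max(ratios))  # concentrated within [1 - eps, 1 + eps]
```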

SLIDES 15–18

distributional jl ⇒ jl

Distributional JL Lemma ⇒ JL Lemma: The Distributional JL Lemma shows that a random projection Π preserves the norm of any y. The main JL Lemma says that Π preserves distances between vectors. Since Π is linear, these are the same thing!

Proof: Given x1, . . . , xn, define (n choose 2) vectors yij where yij = xi − xj.

  • If we choose Π with m = O(log(1/δ)/ϵ²), for each yij with probability ≥ 1 − δ we have: (1 − ϵ)∥yij∥₂ ≤ ∥Πyij∥₂ ≤ (1 + ϵ)∥yij∥₂.
  • Substituting yij = xi − xj: (1 − ϵ)∥xi − xj∥₂ ≤ ∥Π(xi − xj)∥₂ ≤ (1 + ϵ)∥xi − xj∥₂.
  • Since Π(xi − xj) = Πxi − Πxj = x̃i − x̃j by linearity: (1 − ϵ)∥xi − xj∥₂ ≤ ∥x̃i − x̃j∥₂ ≤ (1 + ϵ)∥xi − xj∥₂.

x1, . . . , xn: original points, x̃1, . . . , x̃n: compressed points, Π ∈ Rm×d: random projection matrix. d: original dimension. m: compressed dimension (analogous to d′), ϵ: embedding error, δ: embedding failure prob.
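A sketch checking all (n choose 2) difference vectors at once (sizes are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
n, d, m, eps = 30, 2000, 700, 0.2

X = rng.normal(size=(n, d))
Pi = rng.normal(size=(m, d)) / np.sqrt(m)
X_t = X @ Pi.T

# Distortion ratio for every pair y_ij = x_i - x_j
distortions = [np.linalg.norm(X_t[i] - X_t[j]) / np.linalg.norm(X[i] - X[j])
               for i, j in combinations(range(n), 2)]
print(min(distortions), max(distortions))  # all (n choose 2) ratios roughly within 1 +/- eps
```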

SLIDES 19–20

distributional jl ⇒ jl

Claim: If we choose Π with i.i.d. (1/√m) · N(0, 1) entries and m = O(log(1/δ′)/ϵ²), letting x̃i = Πxi, for each pair xi, xj with probability ≥ 1 − δ′ we have: (1 − ϵ)∥xi − xj∥₂ ≤ ∥x̃i − x̃j∥₂ ≤ (1 + ϵ)∥xi − xj∥₂.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − (n choose 2) · δ′, all pairwise distances are preserved. Apply the claim with δ′ = δ/(n choose 2) ⇒ for m = O(log(1/δ′)/ϵ²), all pairwise distances are preserved with probability ≥ 1 − δ.

m = O(log(1/δ′)/ϵ²) = O(log((n choose 2)/δ)/ϵ²) = O(log(n²/δ)/ϵ²) = O(log(n/δ)/ϵ²)

Yields the JL lemma.

x1, . . . , xn: original points, x̃1, . . . , x̃n: compressed points, Π ∈ Rm×d: random projection matrix. d: original dimension. m: compressed dimension (analogous to d′), ϵ: embedding error, δ: embedding failure prob.
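A tiny worked instance of this dimension calculation (the constant 8 is an illustrative assumption; the lemma only specifies the asymptotics):

```python
import math

def jl_dim(n: int, eps: float, delta: float, c: float = 8.0) -> int:
    """Compressed dimension m = c * log(n_pairs / delta) / eps^2, via the union bound."""
    n_pairs = n * (n - 1) // 2          # (n choose 2) difference vectors
    return math.ceil(c * math.log(n_pairs / delta) / eps**2)

print(jl_dim(n=10_000, eps=0.1, delta=0.01))  # independent of the original dimension d
```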

SLIDE 21

distributional jl proof

Distributional JL Lemma: Let Π ∈ Rm×d have each entry chosen i.i.d. as (1/√m) · N(0, 1). If we set m = O(log(1/δ)/ϵ²), then for any y ∈ Rd, with probability ≥ 1 − δ:

(1 − ϵ)∥y∥₂ ≤ ∥Πy∥₂ ≤ (1 + ϵ)∥y∥₂

  • Let ỹ denote Πy and let Π(j) denote the jth row of Π.
  • For any j, ỹ(j) = ⟨Π(j), y⟩ = (1/√m) ∑_{i=1}^{d} gi · yi where gi ∼ N(0, 1).

y ∈ Rd: arbitrary vector, ỹ ∈ Rm: compressed vector, Π ∈ Rm×d: random projection. d: original dim. m: compressed dim, ϵ: error, δ: failure prob.

SLIDES 22–23

distributional jl proof

  • Let ỹ denote Πy and let Π(j) denote the jth row of Π.
  • For any j, ỹ(j) = ⟨Π(j), y⟩ = (1/√m) ∑_{i=1}^{d} gi · yi where gi ∼ N(0, 1).
  • gi · yi ∼ N(0, yi²): a normal distribution with variance yi².

What is the distribution of ỹ(j)? Also Gaussian!

y ∈ Rd: arbitrary vector, ỹ ∈ Rm: compressed vector, Π ∈ Rm×d: random projection mapping y → ỹ. Π(j): jth row of Π, d: original dimension. m: compressed dimension, gi: normally distributed random variable.

SLIDES 24–25

distributional jl proof

Letting ỹ = Πy, we have ỹ(j) = ⟨Π(j), y⟩ and:

ỹ(j) = (1/√m) ∑_{i=1}^{d} gi · yi where gi · yi ∼ N(0, yi²).

Stability of Gaussian Random Variables: For independent a ∼ N(µ1, σ1²) and b ∼ N(µ2, σ2²) we have:

a + b ∼ N(µ1 + µ2, σ1² + σ2²)

Thus, ỹ(j) ∼ N(0, ∥y∥₂²/m). I.e., ỹ itself is a random Gaussian vector. This reflects the rotational invariance of the Gaussian distribution. Stability is another explanation for the central limit theorem.

y ∈ Rd: arbitrary vector, ỹ ∈ Rm: compressed vector, Π ∈ Rm×d: random projection mapping y → ỹ. Π(j): jth row of Π, d: original dimension. m: compressed dimension, gi: normally distributed random variable.
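A quick simulation of this stability fact for a single coordinate ỹ(j) (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, trials = 100, 25, 100_000

y = rng.normal(size=d)
g = rng.normal(size=(trials, d))        # fresh Gaussians g_i for each trial
y_tilde_j = (g @ y) / np.sqrt(m)        # one coordinate: (1/sqrt(m)) * sum_i g_i * y_i

# By Gaussian stability, y_tilde_j ~ N(0, ||y||_2^2 / m)
print(np.var(y_tilde_j), np.dot(y, y) / m)  # empirical vs. predicted variance
```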

SLIDE 26

distributional jl proof

So far: Letting Π ∈ Rm×d have each entry chosen i.i.d. as (1/√m) · N(0, 1), for any y ∈ Rd, letting ỹ = Πy:

ỹ(j) ∼ N(0, ∥y∥₂²/m).

What is E[∥ỹ∥₂²]?

E[∥ỹ∥₂²] = E[∑_{j=1}^{m} ỹ(j)²] = ∑_{j=1}^{m} E[ỹ(j)²] = ∑_{j=1}^{m} ∥y∥₂²/m = ∥y∥₂²

So ỹ has the right norm in expectation. How is ∥ỹ∥₂² distributed? Does it concentrate?

y ∈ Rd: arbitrary vector, ỹ ∈ Rm: compressed vector, Π ∈ Rm×d: random projection mapping y → ỹ. Π(j): jth row of Π, d: original dimension. m: compressed dimension, gi: normally distributed random variable.
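A matching check that the full compressed vector has the right squared norm in expectation (a sketch; sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
d, m, trials = 200, 50, 2000

y = rng.normal(size=d)
sq_norms = [np.sum(((rng.normal(size=(m, d)) / np.sqrt(m)) @ y) ** 2)
            for _ in range(trials)]
print(np.mean(sq_norms), np.dot(y, y))   # E[||y~||_2^2] = ||y||_2^2
```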

SLIDES 27–29

distributional jl proof

So far: Letting Π ∈ Rm×d have each entry chosen i.i.d. as (1/√m) · N(0, 1), for any y ∈ Rd, letting ỹ = Πy:

ỹ(j) ∼ N(0, ∥y∥₂²/m) and E[∥ỹ∥₂²] = ∥y∥₂²

∥ỹ∥₂² = ∑_{j=1}^{m} ỹ(j)² is a Chi-Squared random variable with m degrees of freedom (a sum of m squared independent Gaussians).

Lemma (Chi-Squared Concentration): Letting Z be a Chi-Squared random variable with m degrees of freedom,

Pr[|Z − E[Z]| ≥ ϵ E[Z]] ≤ 2e^(−mϵ²/8).

If we set m = O(log(1/δ)/ϵ²), with probability 1 − O(e^(−log(1/δ))) ≥ 1 − δ:

(1 − ϵ)∥y∥₂² ≤ ∥ỹ∥₂² ≤ (1 + ϵ)∥y∥₂².

Gives the distributional JL Lemma and thus the classic JL Lemma!

y ∈ Rd: arbitrary vector, ỹ ∈ Rm: compressed vector, Π ∈ Rm×d: random projection mapping y → ỹ. Π(j): jth row of Π, d: original dimension. m: compressed dimension, ϵ: embedding error, δ: embedding failure prob.
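An empirical look at this concentration (a sketch; the constant in the tail bound is not tight, and the parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
m, eps, trials = 800, 0.1, 10_000

# Z ~ chi-squared with m degrees of freedom; E[Z] = m
Z = rng.chisquare(df=m, size=trials)
emp = np.mean(np.abs(Z - m) >= eps * m)      # empirical deviation probability
bound = 2 * np.exp(-m * eps**2 / 8)          # the stated tail bound
print(emp, bound)                            # empirical value sits below the bound
```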

SLIDE 30

example application: svm

Support Vector Machines: A classic ML algorithm, where data is classified with a hyperplane.

  • For any point a in A: ⟨a, w⟩ ≥ c + m.
  • For any point b in B: ⟨b, w⟩ ≤ c − m.
  • Assume all vectors have unit norm.

The JL Lemma implies that after projection into O(log n/m²) dimensions (here m is the margin), we still have ⟨ã, w̃⟩ ≥ c + m/4 and ⟨b̃, w̃⟩ ≤ c − m/4.

Upshot: Can randomly project and run SVM (much more efficiently) in the lower-dimensional space to find separator w̃.
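A sketch of this pipeline on synthetic data (scikit-learn's LinearSVC stands in for a generic SVM solver; all sizes and data are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
n, d, d_proj = 2000, 5000, 200

# Synthetic linearly separable data: labels given by a random unit hyperplane
w = rng.normal(size=d); w /= np.linalg.norm(w)
X = rng.normal(size=(n, d))
y = np.sign(X @ w)

Pi = rng.normal(size=(d_proj, d)) / np.sqrt(d_proj)
X_proj = X @ Pi.T                      # randomly project, then train in d_proj dimensions

clf = LinearSVC(C=1.0).fit(X_proj, y)
print(clf.score(X_proj, y))            # separator quality is largely preserved
```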

SLIDE 31

Questions?