compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst
Spring 2020. Lecture 11
logistics
- Problem Set 2 is due this upcoming Sunday, 3/8.
- Midterm is next Thursday, 3/12. See webpage for study
guide/practice questions.
- Let me know ASAP if you need accommodations (e.g.,
extended time).
- My office hours next Tuesday will focus on exam review. I will
hold them at the usual time, and before class at 10:15am.
- I am rearranging the next two lectures to spend more time on the JL Lemma and randomized methods, before moving on to spectral methods (PCA, spectral clustering, etc.)
1
midterm assessment process
Thanks for your feedback! Some specifics:
- More details in proofs and slower pace. Will try to find a
balance with this.
- Recap at the end of class.
- I will post ‘compressed’ versions of the slides. Not perfect,
but looking into ways to improve.
- After the midterm, I might split the homework into more, smaller assignments to spread out the work.
2
summary
Last Class: The Johnson-Lindenstrauss Lemma
- Low-distortion embeddings for any set of points via random
projection.
- Started on the proof of the JL Lemma via the Distributional JL Lemma.
This Class:
- Finish up the proof of the JL Lemma.
- Example applications to classification and clustering.
- Discuss connections to high dimensional geometry.
3
the johnson-lindenstrauss lemma
Johnson-Lindenstrauss Lemma: For any set of points $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$ and $\epsilon > 0$, there exists a linear map $\Pi : \mathbb{R}^d \to \mathbb{R}^m$ with $m = O\left(\frac{\log n}{\epsilon^2}\right)$ such that, letting $\tilde{x}_i = \Pi \vec{x}_i$, for all $i, j$:
$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\tilde{x}_i - \tilde{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$
Further, if $\Pi \in \mathbb{R}^{m \times d}$ has each entry chosen i.i.d. from $\mathcal{N}(0, 1/m)$ and $m = O\left(\frac{\log(n/\delta)}{\epsilon^2}\right)$, then $\Pi$ satisfies the guarantee with probability $\ge 1 - \delta$.
4
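To make the lemma concrete, here is a minimal numpy sketch (an editor's illustration, not from the slides): it builds a Gaussian projection matrix with i.i.d. N(0, 1/m) entries and checks the distortion of every pairwise distance. The sizes n, d, eps and the constant 8 in the choice of m are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

n, d, eps = 100, 10_000, 0.2                 # illustrative sizes, not from the slides
m = int(np.ceil(8 * np.log(n) / eps**2))     # m = O(log n / eps^2); the constant 8 is an assumption

X = rng.standard_normal((n, d))              # n arbitrary points in R^d
Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))   # entries i.i.d. N(0, 1/m)
X_tilde = X @ Pi.T                           # compressed points in R^m

# Distortion of every pairwise distance should lie in roughly [1 - eps, 1 + eps].
ratios = [
    np.linalg.norm(X_tilde[i] - X_tilde[j]) / np.linalg.norm(X[i] - X[j])
    for i, j in combinations(range(n), 2)
]
print(min(ratios), max(ratios))
```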
random projection
- Can store $\tilde{x}_1, \ldots, \tilde{x}_n$ in $n \cdot m$ rather than $n \cdot d$ space. What about $\Pi$?
- Often don't need to store $\Pi$ explicitly – compute it on the fly.
- For $i = 1, \ldots, d$: $\tilde{x}_j := \tilde{x}_j + h(i) \cdot \vec{x}_j(i)$, where $h : [d] \to \mathbb{R}^m$ is a random hash function outputting vectors (the columns of $\Pi$).
5
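A sketch of the "compute Π on the fly" idea in numpy (not from the slides): the hash function h is simulated with a per-index seeded random generator, which is only a stand-in since the slides don't specify a particular h; the vector sizes are arbitrary.

```python
import numpy as np

def h(i: int, m: int) -> np.ndarray:
    """Stand-in for the random hash function h: [d] -> R^m: the i-th column of Pi,
    regenerated deterministically from the seed i, so it never needs to be stored."""
    rng = np.random.default_rng(i)
    return rng.normal(0.0, np.sqrt(1.0 / m), size=m)

def project_on_the_fly(x: np.ndarray, m: int) -> np.ndarray:
    """Compute x_tilde = Pi x without ever materializing the m x d matrix Pi."""
    x_tilde = np.zeros(m)
    for i in range(x.shape[0]):          # stream over the d coordinates of x
        x_tilde += h(i, m) * x[i]        # x_tilde := x_tilde + h(i) * x(i)
    return x_tilde

# Example: project a random 10,000-dimensional vector down to 500 dimensions.
x = np.random.default_rng(7).standard_normal(10_000)
print(np.linalg.norm(project_on_the_fly(x, 500)) / np.linalg.norm(x))  # ~1
```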
distributional jl
We showed that the Johnson-Lindenstrauss Lemma follows from:
Distributional JL Lemma: Let $\Pi \in \mathbb{R}^{m \times d}$ have each entry chosen i.i.d. as $\mathcal{N}(0, 1/m)$. If we set $m = O\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, then for any $\vec{y} \in \mathbb{R}^d$, with probability $\ge 1 - \delta$:
$(1 - \epsilon)\|\vec{y}\|_2 \le \|\Pi\vec{y}\|_2 \le (1 + \epsilon)\|\vec{y}\|_2.$
Main Idea: Union bound over the $\binom{n}{2}$ difference vectors $\vec{y}_{ij} = \vec{x}_i - \vec{x}_j$.
Π ∈ R^{m×d}: random projection matrix. d: original dimension. m: compressed dimension. ϵ: embedding error. δ: embedding failure probability.
6
distributional jl proof
- Let $\tilde{y}$ denote $\Pi\vec{y}$ and let $\Pi(j)$ denote the jth row of $\Pi$.
- For any $j$: $\tilde{y}(j) = \langle \Pi(j), \vec{y} \rangle = \frac{1}{\sqrt{m}} \sum_{i=1}^{d} g_i \cdot \vec{y}(i)$, where $g_i \sim \mathcal{N}(0, 1)$.
- $g_i \cdot \vec{y}(i) \sim \mathcal{N}(0, \vec{y}(i)^2)$: a normal distribution with variance $\vec{y}(i)^2$. So $\tilde{y}(j)$ is also Gaussian, with $\tilde{y}(j) \sim \mathcal{N}(0, \|\vec{y}\|_2^2 / m)$.
⃗y ∈ R^d: arbitrary vector. ỹ ∈ R^m: compressed vector. Π ∈ R^{m×d}: random projection mapping ⃗y → ỹ. Π(j): jth row of Π. d: original dimension. m: compressed dimension. g_i: normally distributed random variable.
7
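A quick sanity check of this claim (an added illustration, not from the slides): simulate many independent draws of a single row of Π and confirm that the corresponding coordinate of ỹ = Πy has mean ≈ 0 and variance ≈ ∥y∥²/m. The sizes d, m, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 500, 50, 20_000

y = rng.standard_normal(d)               # a fixed arbitrary vector in R^d
samples = np.empty(trials)
for t in range(trials):
    pi_row = rng.normal(0.0, np.sqrt(1.0 / m), size=d)   # one row Pi(j) of Pi
    samples[t] = pi_row @ y                              # y_tilde(j) = <Pi(j), y>

print(samples.mean())                       # ~ 0
print(samples.var(), np.dot(y, y) / m)      # empirical variance ~ ||y||_2^2 / m
```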
distributional jl proof
Upshot: Each entry of our compressed vector $\tilde{y}$ is Gaussian: $\tilde{y}(j) \sim \mathcal{N}(0, \|\vec{y}\|_2^2 / m)$.
$\mathbb{E}\big[\|\tilde{y}\|_2^2\big] = \mathbb{E}\Big[\sum_{j=1}^{m} \tilde{y}(j)^2\Big] = \sum_{j=1}^{m} \mathbb{E}\big[\tilde{y}(j)^2\big] = \sum_{j=1}^{m} \frac{\|\vec{y}\|_2^2}{m} = \|\vec{y}\|_2^2.$
So $\tilde{y}$ has the right norm in expectation. How is $\|\tilde{y}\|_2^2$ distributed? Does it concentrate?
⃗y ∈ R^d: arbitrary vector. ỹ ∈ R^m: compressed vector. Π ∈ R^{m×d}: random projection mapping ⃗y → ỹ. d: original dimension. m: compressed dimension. g_i: normally distributed random variable.
8
distributional jl proof
So Far: Each entry of our compressed vector $\tilde{y}$ is Gaussian, with $\tilde{y}(j) \sim \mathcal{N}(0, \|\vec{y}\|_2^2 / m)$ and $\mathbb{E}[\|\tilde{y}\|_2^2] = \|\vec{y}\|_2^2$.
$\|\tilde{y}\|_2^2 = \sum_{j=1}^{m} \tilde{y}(j)^2$ is a Chi-Squared random variable with m degrees of freedom (a sum of m squared independent Gaussians).
Lemma (Chi-Squared Concentration): Letting Z be a Chi-Squared random variable with m degrees of freedom,
$\Pr\big[|Z - \mathbb{E}Z| \ge \epsilon\,\mathbb{E}Z\big] \le 2e^{-m\epsilon^2/8}.$
If we set $m = O\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, then with probability $\ge 1 - O\big(e^{-\log(1/\delta)}\big) \ge 1 - \delta$:
$(1 - \epsilon)\|\vec{y}\|_2^2 \le \|\tilde{y}\|_2^2 \le (1 + \epsilon)\|\vec{y}\|_2^2.$
Gives the distributional JL Lemma and thus the classic JL Lemma!
⃗y ∈ R^d: arbitrary vector. ỹ ∈ R^m: compressed vector. Π ∈ R^{m×d}: random projection mapping ⃗y → ỹ. d: original dimension. m: compressed dimension. ϵ: embedding error. δ: embedding failure probability.
9
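Here is a small simulation of the concentration claim (an added illustration, not from the slides): over many random draws of Π, ∥Πy∥² stays within (1 ± ε)∥y∥² once m is on the order of log(1/δ)/ε². The constant 8 in the choice of m, and the sizes d and trials, are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps, delta = 500, 0.2, 0.01
m = int(np.ceil(8 * np.log(1 / delta) / eps**2))   # m = O(log(1/delta)/eps^2); constant 8 assumed

y = rng.standard_normal(d)
true_sq = np.dot(y, y)

trials, failures = 500, 0
for _ in range(trials):
    Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
    sq = np.sum((Pi @ y) ** 2)                     # ||Pi y||_2^2, a scaled Chi-Squared variable
    if abs(sq - true_sq) > eps * true_sq:
        failures += 1

print(failures / trials)    # empirical failure probability; should be well below delta
```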
example application: svm
Support Vector Machines: A classic ML algorithm, where data is classified with a hyperplane.
- For any point $\vec{a}$ in A: $\langle \vec{a}, \vec{w} \rangle \ge c + m$.
- For any point $\vec{b}$ in B: $\langle \vec{b}, \vec{w} \rangle \le c - m$.
- Assume all vectors have unit norm.
The JL Lemma implies that after projection into $O\left(\frac{\log n}{m^2}\right)$ dimensions, we still have $\langle \tilde{a}, \tilde{w} \rangle \ge c + m/2$ and $\langle \tilde{b}, \tilde{w} \rangle \le c - m/2$.
Upshot: Can randomly project and run SVM (much more efficiently) in the lower dimensional space to find the separator $\tilde{w}$.
10
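A rough numpy check of the margin claim (an editor's illustration, not from the slides): `margin` plays the role of the slide's m (to avoid clashing with the projected dimension), c is taken to be 0, the helper `unit_point` and the constant 4 in the projected dimension are assumptions, and eps = margin/4 as in the claim on the next slide.

```python
import numpy as np

rng = np.random.default_rng(3)
d, margin = 10_000, 0.5            # 'margin' plays the role of the slide's m; c = 0
eps = margin / 4
n_pts = 100
m_dim = int(np.ceil(4 * np.log(n_pts + 1) / eps**2))   # projected dimension; constant 4 assumed

w = rng.standard_normal(d)
w /= np.linalg.norm(w)             # unit-norm separator

def unit_point(side: int) -> np.ndarray:
    """A unit vector whose inner product with w is side * (margin + small slack)."""
    t = side * (margin + rng.uniform(0, 0.1))
    u = rng.standard_normal(d)
    u -= (u @ w) * w               # component orthogonal to w
    u /= np.linalg.norm(u)
    return t * w + np.sqrt(1 - t**2) * u

A = np.array([unit_point(+1) for _ in range(n_pts // 2)])    # <a, w> >= margin
B = np.array([unit_point(-1) for _ in range(n_pts // 2)])    # <b, w> <= -margin

Pi = rng.normal(0.0, np.sqrt(1.0 / m_dim), size=(m_dim, d))
A_t, B_t, w_t = A @ Pi.T, B @ Pi.T, Pi @ w

print((A_t @ w_t).min())   # typically still >= margin / 2
print((B_t @ w_t).max())   # typically still <= -margin / 2
```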
example application: svm
Claim: After random projection into $O\left(\frac{\log n}{m^2}\right)$ dimensions, if $\langle \vec{a}, \vec{w} \rangle \ge c + m \ge 0$ then $\langle \tilde{a}, \tilde{w} \rangle \ge c + m/2$.
By the JL Lemma applied with $\epsilon = m/4$:
$\|\tilde{a} - \tilde{w}\|_2^2 \le \left(1 + \tfrac{m}{4}\right)\|\vec{a} - \vec{w}\|_2^2$
$\|\tilde{a}\|_2^2 + \|\tilde{w}\|_2^2 - 2\langle \tilde{a}, \tilde{w} \rangle \le \left(1 + \tfrac{m}{4}\right)\left(\|\vec{a}\|_2^2 + \|\vec{w}\|_2^2 - 2\langle \vec{a}, \vec{w} \rangle\right)$
Using that $\vec{a}, \vec{w}$ are unit vectors and $\|\tilde{a}\|_2^2, \|\tilde{w}\|_2^2 \ge 1 - \tfrac{m}{4}$:
$\left(1 + \tfrac{m}{4}\right) 2\langle \vec{a}, \vec{w} \rangle - 4 \cdot \tfrac{m}{4} \le 2\langle \tilde{a}, \tilde{w} \rangle$
$\langle \vec{a}, \vec{w} \rangle - \tfrac{m}{2} \le \langle \tilde{a}, \tilde{w} \rangle$
$c + m - \tfrac{m}{2} \le \langle \tilde{a}, \tilde{w} \rangle.$
11
example application: k-means clustering
Goal: Separate n points in d-dimensional space into k groups.
k-means Objective: $Cost(C_1, \ldots, C_k) = \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{\vec{x} \in C_j} \|\vec{x} - \mu_j\|_2^2.$
Write in terms of pairwise distances (up to a per-cluster scaling factor):
$Cost(C_1, \ldots, C_k) = \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{\vec{x}_1, \vec{x}_2 \in C_j} \|\vec{x}_1 - \vec{x}_2\|_2^2$
12
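To see why the two forms agree, here is a quick numpy check (an added illustration, not from the slides) of the identity $\sum_{\vec{x} \in C} \|\vec{x} - \mu\|_2^2 = \frac{1}{2|C|} \sum_{\vec{x}_1, \vec{x}_2 \in C} \|\vec{x}_1 - \vec{x}_2\|_2^2$, where the second sum runs over ordered pairs; this is the per-cluster scaling mentioned above, and the cluster size and dimension below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.standard_normal((30, 8))             # one cluster of 30 points in R^8
mu = C.mean(axis=0)

centroid_form = np.sum((C - mu) ** 2)        # sum of squared distances to the centroid

# Sum of squared distances over all ordered pairs (x1, x2) in the cluster.
diffs = C[:, None, :] - C[None, :, :]
pairwise_form = np.sum(diffs ** 2) / (2 * len(C))

print(centroid_form, pairwise_form)          # equal up to floating point error
```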
example application: k-means clustering
k-means Objective: $Cost(C_1, \ldots, C_k) = \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{\vec{x}_1, \vec{x}_2 \in C_j} \|\vec{x}_1 - \vec{x}_2\|_2^2$
If we randomly project to $m = O\left(\frac{\log n}{\epsilon^2}\right)$ dimensions, then for all pairs $\vec{x}_1, \vec{x}_2$:
$(1 - \epsilon)\|\tilde{x}_1 - \tilde{x}_2\|_2^2 \le \|\vec{x}_1 - \vec{x}_2\|_2^2 \le (1 + \epsilon)\|\tilde{x}_1 - \tilde{x}_2\|_2^2$
$\Longrightarrow$ Letting $\widetilde{Cost}(C_1, \ldots, C_k) = \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{\tilde{x}_1, \tilde{x}_2 \in C_j} \|\tilde{x}_1 - \tilde{x}_2\|_2^2$:
$(1 - \epsilon)\,\widetilde{Cost}(C_1, \ldots, C_k) \le Cost(C_1, \ldots, C_k) \le (1 + \epsilon)\,\widetilde{Cost}(C_1, \ldots, C_k)$
Upshot: Can cluster in the m-dimensional space (much more efficiently) by minimizing $\widetilde{Cost}(C_1, \ldots, C_k)$. The optimal clustering found there will have true cost within a $(1 + c\epsilon)$ factor of the true optimum.
13
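A rough numpy illustration of this upshot (not from the slides; the data, the fixed clustering, and the constant 8 in m are arbitrary choices): fix any clustering, compute its k-means cost in the original and projected spaces using the centroid form, and observe that the two costs agree up to a small relative error.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, eps = 300, 5_000, 3, 0.2
m = int(np.ceil(8 * np.log(n) / eps**2))    # m = O(log n / eps^2); constant 8 assumed

# n points drawn around k well-separated centers, with a fixed (not necessarily optimal) clustering.
X = rng.standard_normal((n, d)) + np.repeat(rng.standard_normal((k, d)) * 2, n // k, axis=0)
labels = np.repeat(np.arange(k), n // k)

Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
X_t = X @ Pi.T

def kmeans_cost(points: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Sum over clusters of squared distances to the cluster centroid."""
    return sum(
        np.sum((points[labels == j] - points[labels == j].mean(axis=0)) ** 2)
        for j in range(k)
    )

print(kmeans_cost(X, labels, k), kmeans_cost(X_t, labels, k))  # close, up to ~(1 +/- eps)
```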
The Johnson-Lindenstrauss Lemma and High Dimensional Geometry
- High-dimensional Euclidean space looks very different from
low-dimensional space. So how can JL work?
- Are distances in high-dimensional space meaningless, making JL useless?
14
orthogonal vectors
What is the largest set of mutually orthogonal unit vectors in d-dimensional space? Answer: d.
15
nearly orthogonal vectors
What is the largest set of unit vectors in d-dimensional space that have all pairwise dot products |⟨⃗ x,⃗ y⟩| ≤ ϵ? (think ϵ = .01)
- 1. d
- 2. Θ(d)
- 3. Θ(d2)
- 4. 2Θ(d)
In fact, an exponentially large set of random vectors will be nearly pairwise orthogonal with high probability!
Proof: Let $\vec{x}_1, \ldots, \vec{x}_t$ each have independent random entries set to $\pm 1/\sqrt{d}$.
- $\vec{x}_i$ is always a unit vector.
- $\mathbb{E}[\langle \vec{x}_i, \vec{x}_j \rangle] = 0$.
- By a Chernoff bound, $\Pr[|\langle \vec{x}_i, \vec{x}_j \rangle| \ge \epsilon] \le 2e^{-\epsilon^2 d / 3}$.
- If we choose $t = \frac{1}{2} e^{\epsilon^2 d / 6}$, then by a union bound over all $\le t^2 = \frac{1}{4} e^{\epsilon^2 d / 3}$ possible pairs, with probability $\ge 1/2$ all pairs will be nearly orthogonal.
16
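A small numpy check of this slide's construction (an added illustration, not from the slides; the dimension d and the number of vectors t are tiny compared to the exponential bound, just to keep the demo fast):

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps, t = 4_000, 0.1, 200     # t is far below exp(eps^2 d / 6); illustrative only

# t random sign vectors with entries +/- 1/sqrt(d); each is exactly a unit vector.
X = rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(d)

G = np.abs(X @ X.T)             # |<x_i, x_j>| for all pairs
np.fill_diagonal(G, 0.0)
print(G.max())                  # typically well below eps = 0.1
```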
curse of dimensionality
Upshot: In d-dimensional space, a set of $2^{\Theta(\epsilon^2 d)}$ random unit vectors will have all pairwise dot products at most $\epsilon$ (think $\epsilon = .01$).
$\|\vec{x}_i - \vec{x}_j\|_2^2 = \|\vec{x}_i\|_2^2 + \|\vec{x}_j\|_2^2 - 2\vec{x}_i^{\,T}\vec{x}_j \ge 1.98.$
Even with an exponential number of random vector samples, we don't see any nearby vectors.
- Can make methods like nearest neighbor classification or clustering useless.
- Curse of dimensionality for sampling/learning functions in high-dimensional space – samples are very ‘sparse’ unless we have a huge amount of data.
- Only hope is if the data has lots of structure (which it typically does...)
17
connection to dimensionality reduction
Recall: The Johnson-Lindenstrauss Lemma states that if $\Pi \in \mathbb{R}^{m \times d}$ is a random matrix (linear map) with $m = O\left(\frac{\log n}{\epsilon^2}\right)$, then for $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$, with high probability, for all $i, j$:
$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\Pi\vec{x}_i - \Pi\vec{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$
If $\vec{x}_1, \ldots, \vec{x}_n$ are random unit vectors in d dimensions, one can show that $\Pi\vec{x}_1, \ldots, \Pi\vec{x}_n$ are essentially random unit vectors in m dimensions: $\vec{x}_1, \ldots, \vec{x}_n$ are sampled from the surface of $\mathcal{B}_d$ and $\Pi\vec{x}_1, \ldots, \Pi\vec{x}_n$ are (approximately) sampled from the surface of $\mathcal{B}_m$.
18
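A quick numpy illustration of this point (not from the slides; the sizes and the constant 8 in m are assumptions): take random unit vectors in R^d, which are nearly orthogonal with high probability, project them with a Gaussian Π, and check that the images still have roughly unit norm and small pairwise dot products.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, eps = 200, 10_000, 0.2
m = int(np.ceil(8 * np.log(n) / eps**2))    # m = O(log n / eps^2); constant 8 assumed

# Random unit vectors in R^d: nearly orthogonal with high probability.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

Pi = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
X_t = X @ Pi.T                              # images in R^m

norms = np.linalg.norm(X_t, axis=1)
G = np.abs(X_t @ X_t.T)
np.fill_diagonal(G, 0.0)
print(norms.min(), norms.max())             # still ~1: roughly unit vectors in R^m
print(G.max())                              # pairwise dot products still O(eps): nearly orthogonal
```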
connection to dimensionality reduction
- In d dimensions, $2^{\Theta(\epsilon^2 d)}$ random unit vectors will have all pairwise dot products at most $\epsilon$ with high probability.
- For any set of n nearly orthogonal vectors $\vec{x}_1, \ldots, \vec{x}_n$, after JL projection, $\Pi\vec{x}_1, \ldots, \Pi\vec{x}_n$ will still have pairwise dot products at most $c\epsilon$ with high probability.
- In $m = O\left(\frac{\log n}{\epsilon^2}\right)$ dimensions, $2^{\Theta((c\epsilon)^2 m)} = 2^{\Theta(\log n)} > n$ random unit vectors will have all pairwise dot products at most $c\epsilon$ with high probability (i.e., still be nearly orthogonal).
- m is chosen just large enough so that the odd geometry of high-dimensional space (room for more than n nearly orthogonal directions) still holds after projection.