SLIDE 1

𝒍-means and 𝒍-medians under dimension reduction


Yury Makarychev, TTIC Konstantin Makarychev, Northwestern Ilya Razenshteyn, Microsoft Research

Simons Institute, November 2, 2018

SLIDE 2

Euclidean 𝑙-means and 𝑙-medians

Given a set of points π‘Œ in ℝ𝑛, partition π‘Œ into 𝑙 clusters 𝐷1, …, 𝐷𝑙 and find a β€œcenter” 𝑑𝑗 for each 𝐷𝑗 so as to minimize the cost

Ξ£_{j=1}^{l} Ξ£_{π‘£βˆˆπ·π‘—} β€–π‘£ βˆ’ π‘‘π‘—β€–    (𝑙-medians)

Ξ£_{j=1}^{l} Ξ£_{π‘£βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑑𝑗‖²   (𝑙-means)
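As a concrete illustration (not part of the talk), the two objectives can be evaluated in a few lines of Python; the points, labels, and centers below are made-up example data.

```python
# Sketch: evaluating the l-means (power=2) and l-medians (power=1)
# objectives for a fixed clustering and fixed centers.
import numpy as np

def clustering_cost(points, labels, centers, power):
    """Sum over clusters of ||v - d_j||^power for every v in cluster j."""
    cost = 0.0
    for j, c in enumerate(centers):
        cluster = points[labels == j]
        cost += np.sum(np.linalg.norm(cluster - c, axis=1) ** power)
    return cost

# Two well-separated clusters on a line (illustrative data).
pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
print(clustering_cost(pts, labels, centers, power=2))  # l-means cost
print(clustering_cost(pts, labels, centers, power=1))  # l-medians cost
```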

SLIDE 3

Dimension Reduction

Dimension reduction πœ’: ℝ𝑛 β†’ ℝ𝑒 is a random map that preserves distances within a factor of 1 + 𝜁 with probability at least 1 βˆ’ πœ€: 1 1 + 𝜁 𝑣 βˆ’ 𝑀 ≀ πœ’ 𝑣 βˆ’ πœ’ 𝑀 ≀ (1 + 𝜁) 𝑣 βˆ’ 𝑀 [Johnson-Lindenstrauss β€˜84] There exists a random linear dimension reduction with 𝑒 = 𝑃

log 1/πœ€ 𝜁2

. [Larsen, Nelson β€˜17] The dependence of 𝑒 on 𝜁 and πœ€ is optimal.
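A minimal sketch of the Gaussian-matrix construction mentioned on the next slide (the dimensions here are illustrative, not the theorem's bound):

```python
# Sketch: chi(v) = Gv with G an e x n matrix of i.i.d. N(0, 1/e) entries,
# so that ||chi(v) - chi(w)|| concentrates around ||v - w||.
import numpy as np

rng = np.random.default_rng(0)
n, e = 1000, 200                       # ambient / target dimension (illustrative)
G = rng.standard_normal((e, n)) / np.sqrt(e)

v = rng.standard_normal(n)
w = rng.standard_normal(n)
ratio = np.linalg.norm(G @ v - G @ w) / np.linalg.norm(v - w)
print(ratio)  # concentrates near 1; fluctuation shrinks like 1/sqrt(e)
```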

SLIDE 4

Dimension Reduction

JL preserves all distances between points in π‘Œ whp when 𝑒 = Ξ©(log|π‘Œ|/𝜁²). Numerous applications in computer science. Dimension reduction constructions:

  • [JL β€˜84] Project on a random 𝑒-dimensional subspace
  • [Indyk, Motwani β€˜98] Apply a random Gaussian matrix
  • [Achlioptas β€˜03] Apply a random matrix with Β±1 entries
  • [Ailon, Chazelle β€˜06] Fast JL-transform
SLIDE 5

𝑙-means under dimension reduction

[Boutsidis, Zouzias, Drineas ’10] Apply a dimension reduction πœ’ to our dataset π‘Œ. Cluster πœ’(π‘Œ) in dimension 𝑒.
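The project-then-cluster pipeline can be sketched as follows (illustrative Python, not the authors' code; the tiny Lloyd loop stands in for any 𝑙-means solver, and the data is synthetic):

```python
# Sketch: apply a random Gaussian dimension reduction to Y, then run a
# stand-in l-means solver (a few Lloyd iterations) in the low dimension.
import numpy as np

def lloyd(points, l, iters=10):
    # Deterministic seeding (illustrative): evenly spaced points as centers.
    centers = points[np.linspace(0, len(points) - 1, l).astype(int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(l):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
# Two well-separated synthetic clusters in R^100.
Y = np.vstack([rng.normal(m, 0.1, size=(50, 100)) for m in (0.0, 5.0)])
e = 20
chi = rng.standard_normal((e, 100)) / np.sqrt(e)   # random Gaussian map
labels = lloyd(Y @ chi.T, l=2)                     # cluster chi(Y) in dim e
```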

SLIDE 6

𝑙-means under dimension reduction

Want: Optimal clusterings of π‘Œ and πœ’(π‘Œ) have approximately the same cost.

Even better: The cost of every clustering is approximately preserved.

For what dimension 𝑒 can we get this?

SLIDE 7

𝑙-means under dimension reduction

                                          𝒆                 distortion
Folklore                                  ~log 𝑛 / 𝜁²       1 + 𝜁
Boutsidis, Zouzias, Drineas ’10           ~𝑙/𝜁²             2 + 𝜁
Cohen, Elder, Musco, Musco, Persu ’15     ~𝑙/𝜁²             1 + 𝜁
                                          ~log 𝑙 / 𝜁²       9 + 𝜁
MMR ’18                                   ~log(𝑙/𝜁) / 𝜁²    1 + 𝜁
Lower bound                               ~log 𝑙 / 𝜁²       1 + 𝜁

SLIDE 8

𝑙-medians under dimension reduction

                      𝒆                 distortion
Prior work            —                 —
Kirszbraun Thm ⇒      ~log 𝑛 / 𝜁²       1 + 𝜁
MMR ’18               ~log(𝑙/𝜁) / 𝜁²    1 + 𝜁
Lower bound           ~log 𝑙 / 𝜁²       1 + 𝜁

SLIDE 9

Plan

𝑙-means

  • Challenges
  • Warm-up: 𝑒 ~ log 𝑛/𝜁²
  • Special case: β€œdistortions” are everywhere sparse
  • Remove outliers: the general case β†’ the special case
  • Outliers

𝑙-medians

  • Overview of our approach
SLIDE 10

Our result for 𝑙-means

Let π‘Œ βŠ‚ ℝ𝑛 πœ’: ℝ𝑛 β†’ ℝ𝑒 be a random dimension reduction. 𝑒 β‰₯ 𝑑 log 𝑙 πœπœ€ /𝜁2 With probability at least 1 βˆ’ πœ€: 1 βˆ’ 𝜁 cost π’Ÿ ≀ cost πœ’ π’Ÿ ≀ 1 + 𝜁 cost π’Ÿ for every clustering π’Ÿ = 𝐷1, … , 𝐷𝑙 of π‘Œ

SLIDE 11

Challenges

Let π’Ÿβˆ— be the optimal 𝑙-means clustering. Easy: cost π’Ÿβˆ— β‰ˆ cost πœ’(π’Ÿβˆ—) with probability 1 βˆ’ πœ€ Hard: Prove that there is no other clustering π’Ÿβ€² s.t. cost πœ’ π’Ÿβ€² < 1 βˆ’ 𝜁 cost π’Ÿβˆ— since there are exponentially many clusterings π’Ÿβ€² (can’t use the union bound)

SLIDE 12

Warm-up

Consider a clustering π’Ÿ = (𝐷1, …, 𝐷𝑙). Write the cost in terms of pairwise distances:

cost(π’Ÿ) = Ξ£_{j=1}^{l} (1/(2|𝐷𝑗|)) Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖²

All distances β€–π‘£ βˆ’ π‘€β€– are preserved within 1 + 𝜁

⇓

cost(π’Ÿ) is preserved within 1 + 𝜁. Sufficient to have 𝑒 ~ log 𝑛/𝜁².

SLIDE 13

Problem & Notation

Assume that π’Ÿ = (𝐷1, … , 𝐷𝑙) is a random clustering that depends on πœ’. Want to prove: cost π’Ÿ β‰ˆ cost πœ’ π’Ÿ whp. The distance between 𝑣 and 𝑀 is (1 + 𝜁)-preserved

  • r distorted depending on whether

πœ’(𝑣) βˆ’ πœ’(𝑀) β‰ˆ1+𝜁 𝑣 βˆ’ 𝑀 Think πœ€ = poly(1/𝑙, 𝜁) is sufficiently small.

SLIDE 14

Distortion graph

Connect 𝑣 and 𝑀 with an edge if the distance between them is distorted.
+ Every edge is present with probability at most πœ€.
βˆ’ Edges are not independent.
βˆ’ π’Ÿ depends on the set of edges.
βˆ’ May have high-degree vertices.
βˆ’ All distances in a cluster may be distorted.
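As a concrete illustration (not from the talk), the distortion graph can be built directly; `distortion_edges` below is a hypothetical helper that marks a pair as distorted when the projected distance leaves the [1/(1+𝜁), 1+𝜁] band.

```python
# Sketch: the "distortion graph" on Y under a map chi, with an edge (i, j)
# whenever the distance between points i and j is not (1+zeta)-preserved.
import numpy as np
from itertools import combinations

def distortion_edges(Y, chi, zeta):
    edges = []
    for i, j in combinations(range(len(Y)), 2):
        d = np.linalg.norm(Y[i] - Y[j])
        dp = np.linalg.norm(chi @ Y[i] - chi @ Y[j])
        if not (d / (1 + zeta) <= dp <= (1 + zeta) * d):
            edges.append((i, j))
    return edges

Y = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(distortion_edges(Y, np.eye(3), zeta=0.1))      # identity distorts nothing
print(distortion_edges(Y, 2 * np.eye(3), zeta=0.1))  # doubling distorts all pairs
```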

SLIDE 15

Cost of a cluster

The cost of 𝐷𝑗 is

(1/(2|𝐷𝑗|)) Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖²

+ Terms for non-edges (𝑣, 𝑀) are (1 + 𝜁)-preserved: β€–π‘£ βˆ’ π‘€β€– β‰ˆ β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–.
βˆ’ Need to prove that

Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)∈𝐹} β€–π‘£ βˆ’ 𝑀‖² = Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)∈𝐹} β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–Β² Β± πœβ€² cost(π’Ÿ)

SLIDE 16

Everywhere-sparse edges

Assume every 𝑣 ∈ 𝐷𝑗 is connected to at most a πœ„ fraction of all 𝑀 in 𝐷𝑗 (where πœ„ β‰ͺ 𝜁).

SLIDE 17

Everywhere-sparse edges

+ Terms for non-edges (𝑣, 𝑀) are (1 + 𝜁)-preserved.
+ The contribution of terms for edges is small: for an edge (𝑣, 𝑀) and any π‘₯ ∈ 𝐷𝑗,

β€–π‘£ βˆ’ π‘€β€– ≀ β€–π‘£ βˆ’ π‘₯β€– + β€–π‘₯ βˆ’ π‘€β€–
β€–π‘£ βˆ’ 𝑀‖² ≀ 2(β€–π‘£ βˆ’ π‘₯β€–Β² + β€–π‘₯ βˆ’ 𝑀‖²)

SLIDE 18

Everywhere-sparse edges

β€–π‘£ βˆ’ 𝑀‖² ≀ 2(β€–π‘£ βˆ’ π‘₯β€–Β² + β€–π‘₯ βˆ’ 𝑀‖²)

  • Replace the term for every edge with two terms β€–π‘£ βˆ’ π‘₯β€–Β², β€–π‘₯ βˆ’ 𝑀‖² for random π‘₯ ∈ 𝐷𝑗.
  • Each term is used at most 2πœ„ times, in expectation.

Ξ£_{(𝑣,𝑀)∈𝐹, 𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖² ≀ 4πœ„ Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖²
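The relaxed triangle inequality that drives this charging argument can be sanity-checked on random points (illustrative sketch, not from the slides):

```python
# Numeric check of ||v - w||^2 <= 2(||v - x||^2 + ||x - w||^2),
# which follows from (a + b)^2 <= 2a^2 + 2b^2.
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    v, w, x = rng.standard_normal((3, 4))
    lhs = np.linalg.norm(v - w) ** 2
    rhs = 2 * (np.linalg.norm(v - x) ** 2 + np.linalg.norm(x - w) ** 2)
    ok = ok and (lhs <= rhs + 1e-9)
print(ok)
```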

SLIDE 19

Everywhere-sparse edges

Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)βˆ‰πΉ} β€–π‘£ βˆ’ 𝑀‖² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)βˆ‰πΉ} β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–Β² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–Β²

SLIDE 20

Everywhere-sparse edges

Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–π‘£ βˆ’ 𝑀‖² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)βˆ‰πΉ} β€–π‘£ βˆ’ 𝑀‖² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—, (𝑣,𝑀)βˆ‰πΉ} β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–Β² β‰ˆ Ξ£_{𝑣,π‘€βˆˆπ·π‘—} β€–πœ’(𝑣) βˆ’ πœ’(𝑀)β€–Β²

Edges are not necessarily everywhere sparse!

SLIDE 21

Outliers

Want: remove β€œoutliers” so that in the remaining set π‘Œβ€² edges are everywhere sparse in every cluster.

SLIDE 22

(1 βˆ’ πœ„) non-distorted core

Want: remove β€œoutliers” so that in the remaining set π‘Œβ€² edges are everywhere sparse in every cluster.

SLIDE 23

(1 βˆ’ πœ„) non-distorted core

Want: remove β€œoutliers” so that in the remaining set π‘Œβ€² edges are everywhere sparse in every cluster. Find a subset π‘Œβ€² βŠ‚ π‘Œ (which depends on π’Ÿ) s.t.

  • Edges are sparse in the obtained clusters:

Every 𝑣 ∈ 𝐷𝑗 ∩ π‘Œβ€² is connected to at most a πœ„ fraction of all 𝑀 in 𝐷𝑗 ∩ π‘Œβ€².

  • Outliers are rare:

For every 𝑣, Pr 𝑣 βˆ‰ π‘Œβ€² ≀ πœ„

SLIDE 24

All clusters are large

Assume all clusters are of size ~𝑛/𝑙. Let πœ„ = πœ€^(1/4).

  • Outliers = all vertices of degree at least ~πœ„π‘›/𝑙

Every vertex has degree at most πœ€π‘› in expectation. By Markov,

Pr(𝑣 is an outlier) ≀ πœ€π‘™/πœ„ ≀ πœ„

Remove πœ„π‘› β‰ͺ 𝑛/𝑙 vertices in total, so all clusters still have size ~𝑛/𝑙. Crucially uses that all clusters are large!
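The degree-threshold removal step can be sketched as follows (illustrative Python; the edge list, sizes, and threshold below are made-up inputs, not from the talk):

```python
# Sketch: drop every vertex whose degree in the distortion graph is at
# least the threshold (~ iota * n / l in the slides' setting).
import numpy as np

def remove_outliers(n_vertices, edges, threshold):
    deg = np.zeros(n_vertices, dtype=int)
    for v, w in edges:
        deg[v] += 1
        deg[w] += 1
    return [v for v in range(n_vertices) if deg[v] < threshold]

edges = [(0, 1), (0, 2), (0, 3), (4, 5)]       # vertex 0 is high-degree
print(remove_outliers(6, edges, threshold=3))  # keeps the low-degree vertices
```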

SLIDE 25

Main Combinatorial Lemma

Idea: assign β€œweights” to vertices so that all clusters have a large weight.

  • There is a measure 𝜈 on π‘Œ and a random set 𝑆 s.t.

𝜈(𝑦) β‰₯ 1/|𝐷𝑗 βˆ– 𝑆| for 𝑦 ∈ 𝐷𝑗 βˆ– 𝑆 (always)

  • 𝜈(π‘Œ) ≀ 4𝑙³/πœ„Β²
  • Pr(𝑦 ∈ 𝑆) ≀ πœ„

All clusters 𝐷𝑗 βˆ– 𝑆 are β€œlarge” w.r.t. measure 𝜈. Can apply a variant of the previous argument.

SLIDE 26

Edges Incident on Outliers

Need to take care of edges incident on outliers. Say, 𝑣 is an outlier and 𝑀 is not. Consider a fixed optimal clustering 𝐷1*, …, 𝐷𝑙* for π‘Œ. Let 𝑑* be the optimal center for 𝑣.

SLIDE 27

Edges Incident on Outliers

𝑣 βˆ’ 𝑀 = 𝑀 βˆ’ π‘‘βˆ— Β± π‘‘βˆ— βˆ’ 𝑣 πœ’(𝑣) βˆ’ πœ’(𝑀) = πœ’(𝑀) βˆ’ πœ’(π‘‘βˆ—) Β± πœ’(π‘‘βˆ—) βˆ’ πœ’(𝑣)

May assume that the distances between non-outliers and the optimal centers are 1 + 𝜁 -preserved. 𝑀 𝑣 π‘‘βˆ—

β‰ˆ

SLIDE 28

Edges Incident on Outliers

𝑣 βˆ’ 𝑀 = 𝑀 βˆ’ π‘‘βˆ— Β± π‘‘βˆ— βˆ’ 𝑣 πœ’(𝑣) βˆ’ πœ’(𝑀) = πœ’(𝑀) βˆ’ πœ’(π‘‘βˆ—) Β± πœ’(π‘‘βˆ—) βˆ’ πœ’(𝑣)

𝔽[ Οƒπ‘£βˆ‰π‘Œβ€² 𝑑𝑣

βˆ— βˆ’ 𝑣 2] ≀ πœ„ Οƒπ‘£βˆˆπ‘Œ 𝑑𝑣 βˆ— βˆ’ 𝑣 2 = πœ„ OPT

𝑀 𝑣 π‘‘βˆ—

β‰ˆ

SLIDE 29

Edges Incident on Outliers

𝑣 βˆ’ 𝑀 = 𝑀 βˆ’ π‘‘βˆ— Β± π‘‘βˆ— βˆ’ 𝑣 πœ’(𝑣) βˆ’ πœ’(𝑀) = πœ’(𝑀) βˆ’ πœ’(π‘‘βˆ—) Β± πœ’(π‘‘βˆ—) βˆ’ πœ’(𝑣)

Taking care of πœ’(π‘‘βˆ—) βˆ’ πœ’(𝑣) is a bit more difficult. 𝑀 𝑣 π‘‘βˆ—

β‰ˆ

QED

SLIDE 30

𝑙-medians under dimension reduction

SLIDE 31

𝑙-medians

βˆ’ No formula for the cost of the clustering in terms of pairwise distances.
βˆ’ Not obvious even when 𝑒 ~ log 𝑛 (then all pairwise distances are approximately preserved). [was asked by Ravi Kannan in a tutorial @ Simons]

+ Kirszbraun Theorem β‡’ the 𝑒 ~ log 𝑛 case
+ Prove a Robust Kirszbraun Theorem

Our methods for 𝑙-means + Robust Kirszbraun β‡’ 𝑒 ~ log 𝑙 for 𝑙-medians

SLIDE 32

Summary

  • Prove that the cost of every 𝑙-means and 𝑙-medians clustering is preserved up to (1 + 𝜁) under dimension reduction, when 𝑒 β‰₯ 𝐢 log(𝑙/(πœπœ€))/𝜁².
  • The bound on 𝑒 almost matches the lower bound.
  • 𝑙-means: improves the bound 𝑒 β‰₯ 𝐢𝑙/𝜁² by Cohen et al.
  • 𝑙-medians: no results were known.
  • Applies to 𝑙-clustering with the β„“π‘ž objective when 𝑒 β‰₯ 𝐢 π‘žβ΄ log(𝑙/(πœπœ€))/𝜁².