SLIDE 1

Random Projections for Dimensionality Reduction: Some Theory and Applications

Robert J. Durrant

University of Waikato bobd@waikato.ac.nz www.stats.waikato.ac.nz/~bobd

Télécom ParisTech, Tuesday 12th September 2017

SLIDE 2

Outline

1. Background and Preliminaries
2. Short tutorial on Random Projection
3. Johnson-Lindenstrauss for Random Subspace
4. Empirical Corroboration
5. Conclusions and Future Work

SLIDE 3

Motivation - Dimensionality Curse

The ‘curse of dimensionality’: A collection of pervasive, and often counterintuitive, issues associated with working with high-dimensional data. Two typical problems:

- Very high dimensional data (dimensionality d ∈ O(1000)) and very many observations (sample size N ∈ O(1000)): computational (time and space complexity) issues.
- Very high dimensional data (dimensionality d ∈ O(1000)) and hardly any observations (sample size N ∈ O(10)): inference is a hard problem; bogus interactions between features.

SLIDE 4

Curse of Dimensionality

Comment: What constitutes high-dimensional depends on the problem setting, but data vectors with dimensionality in the thousands are very common in practice (e.g. medical images, gene activation arrays, text, time series, ...). Issues can start to show up when data dimensionality is in the tens! We will simply say that the observations, T, are d-dimensional and there are N of them: T = {x_i ∈ R^d}_{i=1}^N, and we will assume that, for whatever reason, d is too large.

SLIDE 5

Mitigating the Curse of Dimensionality

An obvious solution: Dimensionality d is too large, so reduce d to k ≪ d. How? Dozens of methods: PCA, Factor Analysis, Projection Pursuit, ICA, Random Projection ... We will be focusing on Random Projection, motivated (at first) by the following important result:

SLIDE 6

Johnson-Lindenstrauss Lemma

The JLL is the following rather surprising fact [DG02, Ach03]:

Theorem (W.B. Johnson and J. Lindenstrauss, 1984)
Let ε ∈ (0, 1). Let N, k ∈ ℕ such that k ≥ Cε⁻² log N, for a large enough absolute constant C. Let V ⊆ R^d be a set of N points. Then there exists a linear mapping R : R^d → R^k such that for all u, v ∈ V:

(1 − ε)‖u − v‖₂² ≤ ‖Ru − Rv‖₂² ≤ (1 + ε)‖u − v‖₂²

Dot products are also approximately preserved by R, since if the JLL holds then:

uᵀv − ε‖u‖‖v‖ ≤ (Ru)ᵀRv ≤ uᵀv + ε‖u‖‖v‖.

(Proof: parallelogram law.) The scale of k is sharp even for adaptive linear R (e.g. ‘thin’ PCA): ∀N, ∃V s.t. k ∈ Ω(ε⁻² log N) is required [LN14, LN16]. We shall prove shortly that, with high probability, random projection (that is, left-multiplying the data with a wide, shallow random matrix) implements a suitable linear R.
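As a sanity check, here is a minimal numerical illustration of the theorem. The parameter choices (d = 1000, ε = 0.25, and the constant C = 8 taken from the proof later in these slides) are illustrative only:

```python
# Project N points from d = 1000 down to k = C * eps^-2 * log(N) dimensions
# with a Gaussian random matrix, then inspect the worst-case distortion.
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 100, 1000, 0.25
k = int(np.ceil(8 * eps**-2 * np.log(N)))    # C = 8, as in the proof below

X = rng.normal(size=(N, d))                  # an arbitrary point set V
R = rng.normal(size=(k, d)) / np.sqrt(k)     # scaled so E||Rx||^2 = ||x||^2

def pairwise_sq_dists(A):
    sq = np.sum(A**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

iu = np.triu_indices(N, k=1)                 # distinct pairs only
ratios = pairwise_sq_dists(X @ R.T)[iu] / pairwise_sq_dists(X)[iu]
print(f"k={k}, distortion range: [{ratios.min():.3f}, {ratios.max():.3f}]")
# W.h.p. every ratio lands inside (1 - eps, 1 + eps).
```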

SLIDE 7

Jargon

- ‘With high probability’ (w.h.p.) means with a probability as close to 1 as we choose to make it.
- ‘Almost surely’ (a.s.) or ‘with probability 1’ (w.p. 1) means so likely we can pretend it always happens.
- ‘With probability 0’ (w.p. 0) means so unlikely we can pretend it never happens.

SLIDE 8

Intuition

Geometry of data gets perturbed by random projection, but not too much:

Figure: Original data

Figure: RP data (schematic)

SLIDE 9

Intuition

Geometry of data gets perturbed by random projection, but not too much:

Figure: Original data

Figure: RP data & Original data

SLIDE 10

Applications

Random projections have been used for:

- Classification, e.g. [BM01, FM03, GBN05, SR09, CJS09, RR08, DK15, CS15, HWB07, BD09].
- Clustering and density estimation, e.g. [IM98, AC06, FB03, Das99, KMV12, AV09].
- Other related applications: structure-adaptive kd-trees [DF08], low-rank matrix approximation [Rec11, Sar06], sparse signal reconstruction (compressed sensing) [Don06, CT06], matrix completion [CT10], data stream computations [AMS96], heuristic optimization [KBD16].

SLIDE 11

What is Random Projection? (1)

Canonical RP:

- Construct a (wide, flat) matrix R ∈ M_{k×d} by picking the entries i.i.d. from a univariate Gaussian N(0, σ²).
- Orthonormalize the rows of R, e.g. set R′ = (RRᵀ)^{−1/2}R.
- To project a point v ∈ R^d, pre-multiply the vector v with the RP matrix R′.

Then v ↦ R′v ∈ R′(R^d) ≡ R^k is the projection of the d-dimensional data into a random k-dimensional projection space. A sketch of this construction follows.
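A minimal sketch of the construction above, assuming SciPy is available and taking σ = 1 (any σ > 0 works, since the orthonormalization rescales the rows):

```python
import numpy as np
from scipy.linalg import inv, sqrtm

def canonical_rp(d, k, rng):
    R = rng.normal(size=(k, d))            # Gaussian entries, N(0, 1)
    return inv(sqrtm(R @ R.T)).real @ R    # R' = (R R^T)^{-1/2} R

rng = np.random.default_rng(1)
Rp = canonical_rp(d=500, k=20, rng=rng)
print(np.allclose(Rp @ Rp.T, np.eye(20)))  # True: rows are orthonormal
v = rng.normal(size=500)                   # a point in R^d
print((Rp @ v).shape)                      # (20,): the projected point
```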

SLIDE 12

Comment (1)

If d is very large we can drop the orthonormalization in practice: the rows of R will be nearly orthogonal to each other and all nearly the same length. For example, for Gaussian (N(0, σ²)) R we have [DK12]:

Pr{(1 − ε)dσ² ≤ ‖R_i‖₂² ≤ (1 + ε)dσ²} ≥ 1 − δ, ∀ε ∈ (0, 1]

where R_i denotes the i-th row of R and δ = exp(−(√(1 + ε) − 1)²d/2) + exp(−(√(1 − ε) − 1)²d/2). Similarly [Led01]:

Pr{|R_iᵀR_j|/dσ² ≤ ε} ≥ 1 − 2 exp(−ε²d/2), ∀i ≠ j.
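A quick numerical check of both claims, with σ = 1 and illustrative sizes:

```python
# Row norms of a Gaussian matrix concentrate around sqrt(d), and distinct
# rows become nearly orthogonal, increasingly so as d grows.
import numpy as np

rng = np.random.default_rng(2)
for d in (100, 500, 1000):
    R = rng.normal(size=(1000, d))               # 1000 rows in R^d
    norms = np.linalg.norm(R, axis=1) / np.sqrt(d)
    dots = (R[:500] * R[500:]).sum(axis=1) / d   # 500 disjoint row pairs
    print(f"d={d}: norm spread {norms.std():.4f}, "
          f"max |R_i.R_j|/d {np.abs(dots).max():.4f}")
# Both quantities shrink as d grows, matching the next two slides.
```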

SLIDE 13

Concentration in norms of rows of R

Figure: d = 100 norm concentration (histogram of ℓ₂ norms, 10K samples)

Figure: d = 500 norm concentration (histogram of ℓ₂ norms, 10K samples)

Figure: d = 1000 norm concentration (histogram of ℓ₂ norms, 10K samples)

SLIDE 14

Near-orthogonality of rows of R

Figure: Normalized dot product is concentrated about zero, d ∈ {100, 200, . . . , 2500} (10K samples)

SLIDE 15

Why Random Projection?

- Linear.
- Cheap.
- Universal: JLL holds w.h.p. for any fixed finite point set.
- Oblivious to data distribution.
- Target dimension doesn’t depend on data dimensionality (for JLL).
- Interpretable: approximates an isometry (when d is large).
- Tractable to analyse.

SLIDE 16

Proof of JLL (1)

We will prove the following randomized version of the JLL, and then show that this implies the original theorem:

Theorem
Let ε ∈ (0, 1). Let k ∈ ℕ such that k ≥ Cε⁻² log δ⁻¹, for a large enough absolute constant C. Then there is a random linear mapping P : R^d → R^k such that for any unit vector x ∈ R^d:

Pr{(1 − ε) ≤ ‖Px‖² ≤ (1 + ε)} ≥ 1 − δ

No loss to take ‖x‖ = 1, since P is linear. Note that this mapping is universal and the projected dimension k depends only on ε and δ. Lower bound [LN14, LN16]: k ∈ Ω(ε⁻² log δ⁻¹).

SLIDE 17

Proof of JLL (2)

Consider the following simple mapping: Px := (1/√k)Rx, where R ∈ M_{k×d} with entries R_ij i.i.d. ∼ N(0, 1). Let x ∈ R^d be an arbitrary unit vector. We are interested in the quantity:

‖Px‖² = ‖(1/√k)Rx‖² := ‖(1/√k)(Y₁, Y₂, . . . , Y_k)‖² = (1/k) Σ_{i=1}^k Y_i² =: Z

where Y_i = Σ_{j=1}^d R_ij x_j.
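A short simulation of this quantity. If the analysis on the next slide is right then kZ is χ²_k distributed, so Z should have mean 1 and variance 2/k:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, trials = 1000, 20, 5000
x = rng.normal(size=d)
x /= np.linalg.norm(x)                  # unit vector, WLOG

Z = np.empty(trials)
for t in range(trials):
    R = rng.normal(size=(k, d))         # R_ij ~ N(0, 1) i.i.d.
    Z[t] = np.sum((R @ x) ** 2) / k     # Z = ||Px||^2 = (1/k)||Rx||^2

print(f"mean(Z) = {Z.mean():.4f} (expect 1), "
      f"var(Z) = {Z.var():.4f} (expect {2/k:.4f})")
```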

SLIDE 18

Proof of JLL (3)

Recall that if W_i ∼ N(μ_i, σ_i²) and the W_i are independent, then Σ_i W_i ∼ N(Σ_i μ_i, Σ_i σ_i²). Hence, in our setting, we have:

Y_i = Σ_{j=1}^d R_ij x_j ∼ N(Σ_{j=1}^d E[R_ij x_j], Σ_{j=1}^d Var(R_ij x_j)) ≡ N(0, Σ_{j=1}^d x_j²)

and since ‖x‖² = Σ_{j=1}^d x_j² = 1 we therefore have:

Y_i ∼ N(0, 1), ∀i ∈ {1, 2, . . . , k}

It follows that each of the Y_i is a standard normal RV and therefore kZ = Σ_{i=1}^k Y_i² is χ²_k distributed.

Now we complete the proof using a standard Chernoff-bounding approach.

SLIDE 19

Proof of JLL (4)

Pr{Z > 1 + ε} = Pr{exp(tkZ) > exp(tk(1 + ε))}, ∀t > 0
≤ E[exp(tkZ)] / exp(tk(1 + ε))   (Markov ineq.)
= Π_{i=1}^k E[exp(tY_i²)] / exp(tk(1 + ε))   (Y_i indep.)
= (exp(t)√(1 − 2t))^{−k} exp(−ktε), ∀t < 1/2   (mgf of χ²_k)
≤ exp(kt²/(1 − 2t) − ktε)   (next slide)
≤ e^{−ε²k/8}, taking t = ε/4 < 1/2.

Pr{Z < 1 − ε} = Pr{−Z > ε − 1} is tackled in a similar way and we obtain the same bound. Taking the RHS as δ/2 and applying the union bound completes the proof (for a single x).

SLIDE 20

Estimating (e^t √(1 − 2t))^{−1}

(e^t √(1 − 2t))^{−1} = exp(−t − (1/2) log(1 − 2t))
= exp(−t − (1/2)(−2t − (2t)²/2 − (2t)³/3 − . . .))   (Maclaurin series for log(1 − x))
= exp((2t)²/4 + (2t)³/6 + . . .)
≤ exp(t²(1 + 2t + (2t)² + . . .))
= exp(t²/(1 − 2t))   since 0 < 2t < 1

SLIDE 21

Randomized JLL implies Deterministic JLL

Solving δ = 2 exp(−ε²k/8) for k we obtain k = 8ε⁻² log 2δ⁻¹, i.e. k ∈ O(ε⁻² log δ⁻¹). Let V = {x₁, x₂, . . . , x_N} be an arbitrary set of N points in R^d and set δ = 1/2N²; then k ∈ O(ε⁻² log N).

Applying the union bound to the randomized JLL proof for all (N choose 2) possible interpoint distances, we see that a random JLL embedding of V into k dimensions succeeds with probability at least 1 − (N choose 2) · (1/N²) > 1/2.

We succeed with positive probability for arbitrary V. Hence we conclude that, for any set of N points, there exists a linear P : R^d → R^k such that:

(1 − ε)‖x_i − x_j‖² ≤ ‖Px_i − Px_j‖² ≤ (1 + ε)‖x_i − x_j‖²

which is the (deterministic) JLL. The required k for a given N and ε is easy to compute; a sketch follows.
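The constant-chasing above as a tiny helper. This hard-codes the C = 8 constant from this proof; tighter constants exist in the literature:

```python
import math

def jll_dim(N: int, eps: float) -> int:
    """Projected dimension sufficient for N points at distortion eps,
    with delta = 1/(2 N^2) so the union bound succeeds w.p. > 1/2."""
    delta = 1.0 / (2 * N**2)
    return math.ceil(8 * eps**-2 * math.log(2 / delta))

for N in (1_000, 1_000_000):
    print(N, jll_dim(N, eps=0.1))   # note: k is independent of d
```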

SLIDE 22

From Point Sets to Manifolds

From the JLL we obtain high-probability guarantees that, for a suitably large k, independently of the data dimension, random projection approximately preserves the Euclidean geometry of a finite point set. In particular, Euclidean norms and dot products are approximately preserved w.h.p.

The JLL approach can be extended to (compact) Riemannian manifolds: ‘Manifold JLL’ [BW09]. Key idea: preserve an ε/2-covering of the smooth manifold instead of the geometry of the data points. Replace N in the JLL with the corresponding covering number M and take k ∈ O(ε⁻² log M).

Wrinkle: Absent additional low-dimensional structure in the data, M is typically O(2^d), implying the trivial guarantee k = d. In practice RP works better than this theory predicts.

SLIDE 23

Applications of Random Projection

The JLL implies that if d is large, with a suitable choice of k, we can construct an ‘ε-approximate’ version of any algorithm which depends only on Euclidean norms and dot products of the data, but in a much lower-dimensional space. This includes:

- Nearest-neighbour algorithms.
- Clustering algorithms.
- Margin-based classifiers.
- Least-squares regressors.

That is, we trade off some accuracy (perhaps) for reduced algorithmic time and space complexity; a sketch of this compress-then-learn pattern follows. However the matrix-matrix multiplication is still costly when d or N is very large, e.g. consider a dataset comprising many high-resolution images. Thus there is much interest in speeding up this part of the process.
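A minimal sketch of this compress-then-learn pattern on hypothetical data; scikit-learn's GaussianRandomProjection is one off-the-shelf implementation of the RP step:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5000))       # N = 2000 points, d = 5000
y = (X[:, 0] > 0).astype(int)           # toy labels

rp = GaussianRandomProjection(n_components=200, random_state=0)
X_low = rp.fit_transform(X)             # N x 200 compressed data

# Nearest-neighbour classification in the projection space: distances
# there approximate distances in R^d, so accuracy degrades gracefully.
clf = KNeighborsClassifier().fit(X_low[:1500], y[:1500])
print("held-out accuracy:", clf.score(X_low[1500:], y[1500:]))
```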

SLIDE 24

Comment (2)

In the proof of the randomized JLL the only properties we used which are specific to the Gaussian distribution were:

1. Closure under additivity.
2. Bounding a squared Gaussian RV using the mgf of χ².

In particular, bounding via the mgf of χ² gave us exponential concentration about the mean norm. We can do similarly for matrices with zero-mean sub-Gaussian entries, i.e. those distributions whose tails decay no slower than a Gaussian ⟹ similar theory for sub-Gaussian RP matrices too! This is one method for getting around the issue of dense matrix multiplication in the dimensionality-reduction step (same time complexity, better constant).

SLIDE 25

What is Random Projection? (2)

Different types of RP matrix are easy to construct: take entries i.i.d. from nearly any zero-mean subgaussian distribution. All behave in much the same way. Popular variations [Ach03, AC06, Mat08]; the entries R_ij can be:

- R_ij = +1 w.p. 1/2; −1 w.p. 1/2.
- R_ij = +1 w.p. 1/6; −1 w.p. 1/6; 0 w.p. 2/3.
- R_ij = N(0, 1/q) w.p. q; 0 w.p. 1 − q.
- R_ij = +1 w.p. q; −1 w.p. q; 0 w.p. 1 − 2q.

For the RH examples, taking q too small gives high distortion of sparse vectors [Mat08]. [AC06] get around this by using a random orthogonal matrix to ensure that w.h.p. all data vectors are dense. However even sparse×dense matrix-matrix multiplication may be too slow. Can we do better? (Sketches of these sparse variants follow.)
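Sketches of the sparse variants listed above; q is a free sparsity parameter (these are the distributions as listed, not a tuned implementation):

```python
import numpy as np

def sign_rp(k, d, rng):
    """+1/-1 each w.p. 1/2."""
    return rng.choice([-1.0, 1.0], size=(k, d))

def achlioptas_rp(k, d, rng):
    """+1 w.p. 1/6, -1 w.p. 1/6, 0 w.p. 2/3."""
    return rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[1/6, 2/3, 1/6])

def q_sparse_rp(k, d, q, rng):
    """+1 w.p. q, -1 w.p. q, 0 w.p. 1 - 2q."""
    return rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[q, 1 - 2 * q, q])

rng = np.random.default_rng(5)
R = achlioptas_rp(20, 1000, rng)
print(f"fraction of zero entries: {(R == 0).mean():.3f}")  # about 2/3
```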

SLIDE 26

Faster Projections for Smooth Data

The proof technique for the JLL is essentially to show that (squared) norms of projected vectors are close to their expected value w.h.p., then recover the correct scale using an appropriate constant. Turning the observation of [Mat08] around, it is plausible that for ‘smooth enough’ data even a very sparse projection could still imply JLL-type guarantees. In particular, can we obtain a JLL for random subspace (‘RS’) [Ho98], i.e. choosing k features from d uniformly at random without replacement?

Comment: Clearly hopeless to attempt this for very sparse vectors, e.g. consider the canonical basis vectors. On the other hand k = 1 will do if all features have identical absolute values.

Q: Where is the breakdown point, i.e. given a dataset V of size N, at which value of k? How to characterise ‘smoothness’? Can suitably ‘smooth’ data be found in the wild?

SLIDE 27

Why is RS particularly interesting?

- A very widely-used randomized feature-selection scheme, e.g. the basis for random forests, but theory for it is sparse.
- No matrix multiplication involved; time complexity linear in dimension d ⟹ faster approximation algorithms.
- Link to ‘dropout’ in deep neural networks: dropout is essentially RS applied to the internal nodes of the network ⟹ potential speedup of training these huge models (e.g. conjecture: back-prop on only a very small random sample of nodes may work well).
- Potential for new theory:
  - Explaining the effect of dropout.
  - For RS ensembles, e.g. explaining the experimental findings in [DK15].
  - On learning from streaming data (streaming time series are frequently subsampled in practice).
  - Compressive sensing, e.g. subsampling audio files in the time domain.
  - Geometric interpretations for sampling theory.
- For many problems it is desirable (or essential) to work with the original features.

SLIDE 28

JLL for Random Subspace (1)

WLOG work in R^d and instantiate RS as a projection P onto the subspace spanned by k coordinate directions.

Theorem (Basic Hoeffding Bound [LD17])
Let T_N := {X_i ∈ R^d}_{i=1}^N be a set of N points in R^d satisfying, ∀i ∈ {1, 2, . . . , N}, ‖X_i‖²_∞ ≤ (c/d)‖X_i‖₂², where c ∈ R⁺ is a constant, 1 ≤ c ≤ d. Let ε, δ ∈ (0, 1], and let k ≥ (c²/2ε²) ln(N²/δ) be an integer. Let P be a random subspace projection from R^d → R^k. Then with probability at least 1 − δ over the random draws of P we have, for every i, j ∈ {1, 2, . . . , N}:

(1 − ε)‖X_i − X_j‖₂² ≤ (d/k)‖P(X_i − X_j)‖₂² ≤ (1 + ε)‖X_i − X_j‖₂²

A sketch of this projection follows.
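A sketch of RS projection as used in the theorem: sample k of the d coordinates uniformly at random without replacement, and rescale squared norms by d/k. The i.i.d. Gaussian data here is an illustrative stand-in for 'smooth' data:

```python
import numpy as np

def random_subspace(X, k, rng):
    """Keep k coordinates of each row of X, chosen uniformly at random
    without replacement (the same coordinates for every point)."""
    idx = rng.choice(X.shape[1], size=k, replace=False)
    return X[:, idx]

rng = np.random.default_rng(6)
d, N, k = 2500, 100, 200
X = rng.normal(size=(N, d))
Xp = random_subspace(X, k, rng)

i, j = 0, 1
orig = np.sum((X[i] - X[j]) ** 2)
proj = (d / k) * np.sum((Xp[i] - Xp[j]) ** 2)
print(f"(d/k)||P(Xi - Xj)||^2 / ||Xi - Xj||^2 = {proj / orig:.3f}")
```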

SLIDE 29

JLL for Random Subspace (2)

Theorem (Serfling Bound [LD17])
Let T_N, c, ε, δ be as before. Define f_k := (k − 1)/d and let k such that k/(1 − f_k) ≥ (c²/2ε²) ln(N²/δ) be an integer. Let P be a random subspace projection from R^d → R^k. Then with probability at least 1 − δ over the random draws of P we have, for every i, j ∈ {1, 2, . . . , N}:

(1 − ε)‖X_i − X_j‖₂² ≤ (d/k)‖P(X_i − X_j)‖₂² ≤ (1 + ε)‖X_i − X_j‖₂²

Comment: Always sharper than Theorem 3, but brings a (typically unwanted, though benign) dependence on d into the choice of k.

SLIDE 30

Proof Sketch

- View each vector as a finite population of size d. RS is then a simple random sample of size k drawn without replacement from it.
- The sampling distribution of the mean from a finite population sampled without replacement has smaller variance than sampling with replacement... thus the Hoeffding bound for independent sampling with replacement is also a bound for sampling without replacement (illustrated below).
- Standard Hoeffding bound argument, except the data-dependent constant c is additionally chosen to kill the dependency on d (and implicitly enforces ‘smoothness’).
- The finer-grained approach uses the Serfling bound, which exploits martingale structure in the sampling scheme. Similar proof structure.
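A numerical illustration of the middle step: estimating the mean of a fixed population of d squared entries from k draws has lower variance without replacement than with:

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, trials = 2500, 100, 10000
x2 = rng.normal(size=d) ** 2      # squared entries: a population of size d

wo = np.array([x2[rng.choice(d, k, replace=False)].mean()
               for _ in range(trials)])
w = np.array([x2[rng.choice(d, k, replace=True)].mean()
              for _ in range(trials)])
print(f"var without replacement: {wo.var():.6f}")
print(f"var with replacement:    {w.var():.6f}")    # strictly larger
```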

SLIDE 31

JLL for Random Subspace (3)

Corollary (to either bound)
Under the conditions of Theorem 3 or 4 respectively, for any ε, δ ∈ (0, 1), with probability at least 1 − 2δ over the random draws of P we have:

X_iᵀX_j − ε‖X_i‖‖X_j‖ ≤ (d/k)(PX_i)ᵀ(PX_j) ≤ X_iᵀX_j + ε‖X_i‖‖X_j‖
SLIDE 32

Empirical Corroboration:

We corroborate the theory and compare RS projection with two RP variants, as well as with principal components analysis (PCA), to see that in practice, given a suitable choice of k, RS works as well as these alternatives.

Data are 23 grayscale images from the USC-SIPI natural image dataset. From each image we sampled one hundred 50 × 50 squares by choosing their top left corner at random, and reshaped each to give a vector in R^2500. (A sketch of this preprocessing follows.)

Name    Description     Image Size  c
5.1.09  Moon Surface    256x256     3.50
5.1.10  Aerial          256x256     2.44
5.1.11  Airplane        256x256     7.92
5.1.12  Clock           256x256     5.03
5.1.14  Chemical plant  256x256     2.92
...     ...             ...         ...
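A sketch of the patch-sampling preprocessing, plus the data-dependent constant c from the theorems. The random 'image' below is a stand-in; the real experiments used the USC-SIPI images:

```python
import numpy as np

def sample_patches(img, n_patches=100, size=50, rng=None):
    """Sample square patches with uniformly random top-left corners and
    reshape each into a vector (R^2500 for 50x50 patches)."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = img.shape
    rows = rng.integers(0, H - size + 1, n_patches)
    cols = rng.integers(0, W - size + 1, n_patches)
    return np.stack([img[r:r + size, c:c + size].ravel()
                     for r, c in zip(rows, cols)])

def smoothness_c(X):
    """Smallest c with ||X_i||_inf^2 <= (c/d) ||X_i||_2^2 for all i."""
    d = X.shape[1]
    return (d * np.max(np.abs(X), axis=1) ** 2 / np.sum(X**2, axis=1)).max()

img = np.random.default_rng(8).random((256, 256))   # stand-in image
X = sample_patches(img, rng=np.random.default_rng(9))
print(X.shape, f"c = {smoothness_c(X):.2f}")
```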

SLIDE 33

Representative Outcomes:

Figure: Fixed k, small c: histograms of ‖P(X_i − X_j)‖/‖X_i − X_j‖ for k = 50 dimensions on three representative images, with overlaid normal density plots, n = 4950.

SLIDE 34

Quantiles vs. k

Figure: Mean, 5th, and 95th percentiles of ‖P(X_i − X_j)‖/‖X_i − X_j‖ for the image data vs. k. We see that for k ≥ 80, Gaussian RP and RS are indistinguishable on these data. Note also the 5th percentile for Sparse RP cf. Figure 9: Sparse RP frequently seems to underestimate norms.

SLIDE 35

Average Running Times

Figure: Comparison of the runtime on dense image data with dimensionality d = 2500.

SLIDE 36

Preliminary Experiments with NNs

- Classification performance evaluation only (so far...).
- Used the (challenge-winning) GoogLeNet with pretrained weights from the ImageNet challenge. Original images replaced with versions compressed using RS.
- Evaluation on 100,000 full-colour images of varying sizes and resolutions from the ILSVRC 2012 ImageNet challenge, 1000 classes.
- Classification error using a single RS example is marginally worse than the state of the art; an RS ‘voting’ ensemble approach (sum of scores) is better than the state of the art.

SLIDE 37

Example Image Inputs and Outcomes (1)

Figure: Original vs. RS-compressed (‘subspaced’) inputs. hare (332): score 0.568 original, 0.701 subspaced. honeycomb (600): score 0.679 original, 0.747 subspaced.

SLIDE 38

Example Image Inputs and Outcomes (2)

Figure: Original vs. RS-compressed (‘subspaced’) inputs. yawl (915), score 0.338 original; subspaced version classified as schooner (781), score 0.576. bakery, bakeshop, bakehouse (416), score 0.170 original; subspaced version classified as shoe shop, shoe-shop, shoe store (789), score 0.175.

SLIDE 39

Experiments: Effect of Ensemble Size, k, Top 1 Error

Figure: Top 1 test error rate vs. ensemble size, estimated from 12 runs over 100,000 images (ensemble accuracy; baseline 0.6630; k ∈ {200, 250, 300, 350, 430}). Error bars omitted: 1 s.e. is approximately the width of the plotted line.

SLIDE 40

Experiments: Effect of Ensemble Size, k, Top 3 Error

Figure: Top 3 test error rate vs. ensemble size, estimated from 12 runs over 100,000 images (ensemble accuracy; baseline 0.8275; k ∈ {200, 250, 300, 350, 430}). Error bars omitted: 1 s.e. is approximately the width of the plotted line.

SLIDE 41

Experiments: Effect of Ensemble Size, k, Top 5 Error

Figure: Top 5 test error rate vs. ensemble size, estimated from 12 runs over 100,000 images (ensemble accuracy; baseline 0.8743; k ∈ {200, 250, 300, 350, 430}). Error bars omitted: 1 s.e. is approximately the width of the plotted line.

SLIDE 42

Preliminary Experiments with Stratification

Statistical theory suggests that if data can be split into approximately homoskedastic (uniform variance) strata with well-separated means, then the variance of the sampling distribution of the population mean can be reduced by stratified sampling (here population mean ≡ Euclidean norm). We transpose the data matrix and apply k-means clustering to the features (i.e. rather than the observations) to search for such strata; a sketch follows.

- No obvious ‘best’ number of clusters for all images: highly data-dependent. The sweet spot seems to be between 3 and 7 clusters for the image data we worked with.
- Two stratification schemes tried: Proportional Allocation (gives an unbiased estimate of norms) and Neyman Allocation (gives a biased estimate of norms, but with reduced standard error).
- Obtains improved stability in norm estimates, as theory would suggest, but the improvement is only marginal.
- Conclusion: k-means is not a great way to find strata.
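A sketch of stratified RS with proportional allocation, following the recipe above (k-means on the transposed data matrix). Details such as the rounding of per-stratum sample sizes are our own simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_subspace(X, k, n_strata, rng):
    """Cluster the d features into strata, then sample coordinates from
    each stratum in proportion to its size (proportional allocation)."""
    labels = KMeans(n_clusters=n_strata, n_init=10,
                    random_state=0).fit(X.T).labels_
    idx = []
    for s in range(n_strata):
        members = np.flatnonzero(labels == s)
        take = min(len(members), max(1, round(k * len(members) / X.shape[1])))
        idx.extend(rng.choice(members, size=take, replace=False))
    return X[:, idx]

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 2500))
print(stratified_subspace(X, k=50, n_strata=3, rng=rng).shape)
# roughly (100, 50); rounding can shift the column count slightly
```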

SLIDE 43

Stratification Experiments:

Stratified sampling with 3 strata and proportional allocation. Histograms of ‖P(X_i − X_j)‖/‖X_i − X_j‖ for k = 50 dimensions on three representative images, n = 4950.

Figure: Panels (left to right): Gaussian Random Projection, Sparse Binary Random Projection, Random Subspace, Stratified Random Subspace.

SLIDE 44

Conclusions and Future Work

- Random projections have a wide range of theoretically well-motivated and effective applications in machine learning and data mining.
- The overhead of matrix-matrix multiplication can be removed for ‘smooth’ datasets using RS, with no obvious disadvantages.
- Variance in projected norms can be further reduced by using RS with stratified sampling. How to better identify strata automatically and cheaply is an interesting (and probably hard) problem.
- RS provides one potential route to meaningful theory, with typical-case guarantees, for dropout regularization of NNs; this would be interesting in its own right.
- The potential of RS to both speed up back-propagation and reduce the model size of deep NNs is intriguing. We have just started work in this direction, watch this space!
- Further experiments and extension of the RS ensemble idea; some potential applications in sight, e.g. edge computing.

SLIDE 45

References I

[AC06] N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform, Proceedings of the 38th Annual ACM Symposium on Theory of Computing, ACM, 2006, pp. 557–563.

[Ach03] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences 66 (2003), no. 4, 671–687.

[AMS96] N. Alon, Y. Matias, and M. Szegedy, The space complexity of approximating the frequency moments, Proceedings of the 28th Annual ACM Symposium on Theory of Computing, ACM, 1996, pp. 20–29.

[AV09] R. Avogadri and G. Valentini, Fuzzy ensemble clustering based on random projections for DNA microarray data analysis, Artificial Intelligence in Medicine 45 (2009), no. 2, 173–183.

[BD09] C. Boutsidis and P. Drineas, Random projections for the nonnegative least-squares problem, Linear Algebra and its Applications 431 (2009), no. 5-7, 760–771.

[BM01] E. Bingham and H. Mannila, Random projection in dimensionality reduction: applications to image and text data, Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001) (F. Provost and R. Srikant, eds.), 2001, pp. 245–250.

[BW09] R.G. Baraniuk and M.B. Wakin, Random projections of smooth manifolds, Foundations of Computational Mathematics 9 (2009), no. 1, 51–77.

SLIDE 46

References II

[CJS09] R. Calderbank, S. Jafarpour, and R. Schapire, Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain, Tech. report, Rice University, 2009.

[CS15] T.I. Cannings and R.J. Samworth, Random projection ensemble classification, arXiv preprint arXiv:1504.04595 (2015).

[CT06] E.J. Candès and T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Transactions on Information Theory 52 (2006), no. 12, 5406–5425.

[CT10] E.J. Candès and T. Tao, The power of convex relaxation: Near-optimal matrix completion, IEEE Transactions on Information Theory 56 (2010), no. 5, 2053–2080.

[Das99] S. Dasgupta, Learning mixtures of Gaussians, Annual Symposium on Foundations of Computer Science, vol. 40, 1999, pp. 634–644.

[DF08] S. Dasgupta and Y. Freund, Random projection trees and low dimensional manifolds, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, ACM, 2008, pp. 537–546.

[DG02] S. Dasgupta and A. Gupta, An elementary proof of the Johnson-Lindenstrauss lemma, Random Structures & Algorithms 22 (2002), 60–65.

SLIDE 47

References III

[DK12] R.J. Durrant and A. Kabán, Error bounds for kernel Fisher linear discriminant in Gaussian Hilbert space, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AIStats 2012), 2012.

[DK15] R.J. Durrant and A. Kabán, Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions, Machine Learning 99 (2015), no. 2, 257–286.

[Don06] D.L. Donoho, Compressed sensing, IEEE Transactions on Information Theory 52 (2006), no. 4, 1289–1306.

[FB03] X.Z. Fern and C.E. Brodley, Random projection for high dimensional data clustering: A cluster ensemble approach, International Conference on Machine Learning, vol. 20, 2003, p. 186.

[FM03] D. Fradkin and D. Madigan, Experiments with random projections for machine learning, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp. 522–529.

[GBN05] N. Goel, G. Bebis, and A. Nefian, Face recognition experiments with random projection, Proceedings of SPIE, vol. 5779, 2005, p. 426.

[Ho98] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 8, 832–844.

SLIDE 48

References IV

[HWB07] C. Hegde, M.B. Wakin, and R.G. Baraniuk, Random projections for manifold learning: proofs and analysis, Neural Information Processing Systems, 2007.

[IM98] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the 30th Annual ACM Symposium on Theory of Computing, ACM, 1998, pp. 604–613.

[KBD16] A. Kabán, J. Bootkrajang, and R.J. Durrant, Toward large-scale continuous EDA: A random matrix theory perspective, Evolutionary Computation 24 (2016), no. 2, 255–291.

[KMV12] A.T. Kalai, A. Moitra, and G. Valiant, Disentangling Gaussians, Communications of the ACM 55 (2012), no. 2, 113–120.

[LD17] N. Lim and R.J. Durrant, Linear dimensionality reduction in linear time: Johnson-Lindenstrauss-type guarantees for random subspace, arXiv preprint arXiv:1705.06408 (2017).

[Led01] M. Ledoux, The concentration of measure phenomenon, vol. 89, American Mathematical Society, 2001.

[LN14] K.G. Larsen and J. Nelson, The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction, arXiv preprint arXiv:1411.2404 (2014).

[LN16] K.G. Larsen and J. Nelson, Optimality of the Johnson-Lindenstrauss lemma, arXiv preprint arXiv:1609.02094 (2016).

SLIDE 49

References V

[Mat08] J. Matoušek, On variants of the Johnson-Lindenstrauss lemma, Random Structures & Algorithms 33 (2008), no. 2, 142–156.

[Rec11] B. Recht, A simpler approach to matrix completion, Journal of Machine Learning Research 12 (2011), 3413–3430.

[RR08] A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems 20 (2008), 1177–1184.

[Sar06] T. Sarlos, Improved approximation algorithms for large matrices via random projections, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), IEEE, 2006, pp. 143–152.

[SR09] A. Schclar and L. Rokach, Random projection ensemble classifiers, Enterprise Information Systems (J. Filipe et al., eds.), Lecture Notes in Business Information Processing, vol. 24, Springer, 2009, pp. 309–316.

SLIDE 50

Appendix

Proposition (JLL for dot products)
Let x_n, n ∈ {1, . . . , N}, and u be vectors in R^d s.t. ‖x_n‖, ‖u‖ ≤ 1. Let R be a k × d RP matrix with i.i.d. entries R_ij ∼ N(0, 1/√k) (or with zero-mean sub-Gaussian entries). Then for any ε, δ > 0, if k ∈ O(8ε⁻² log(4N/δ)), w.p. at least 1 − δ we have:

|x_nᵀu − (Rx_n)ᵀRu| < ε    (1)

simultaneously for all n ∈ {1, . . . , N}.
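A quick empirical check of the proposition with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(11)
d, N, eps, delta = 1000, 50, 0.2, 0.1
k = int(np.ceil(8 * eps**-2 * np.log(4 * N / delta)))

xs = rng.normal(size=(N, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)   # ||x_n|| <= 1
u = rng.normal(size=d)
u /= np.linalg.norm(u)                            # ||u|| <= 1

R = rng.normal(size=(k, d)) / np.sqrt(k)          # N(0,1) entries / sqrt(k)
errs = np.abs(xs @ u - (xs @ R.T) @ (R @ u))
print(f"k={k}, max dot-product error {errs.max():.4f} (target eps={eps})")
```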

SLIDE 51

Proof of JLL for dot products

Outline: Fix one n, use the parallelogram law and JLL twice, then use the union bound.

4(Rx_n)ᵀ(Ru) = ‖Rx_n + Ru‖² − ‖Rx_n − Ru‖²   (2)
≥ (1 − ε)‖x_n + u‖² − (1 + ε)‖x_n − u‖²   (3)
= 4x_nᵀu − 2ε(‖x_n‖² + ‖u‖²)   (4)
≥ 4x_nᵀu − 4ε   (5)

Hence (Rx_n)ᵀ(Ru) ≥ x_nᵀu − ε, and because we used two sides of the JLL, this holds except w.p. no more than 2 exp(−kε²/8). The other side is similar and gives (Rx_n)ᵀ(Ru) ≤ x_nᵀu + ε except w.p. 2 exp(−kε²/8). Put together, |(Rx_n)ᵀ(Ru) − x_nᵀu| ≤ ε · (‖x_n‖² + ‖u‖²)/2 ≤ ε holds except w.p. 4 exp(−kε²/8). This holds for a fixed x_n. To ensure that it holds for all x_n together, we take the union bound and obtain that eq. (1) must hold except w.p. 4N exp(−kε²/8). Finally, solving for δ we obtain that k ≥ 8ε⁻² log(4N/δ).
