

SLIDE 1

Improved Bounds on the Dot Product under Random Projection and Random Sign Projection

Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/~axk

KDD 2015, Sydney, 10-13 August 2015.

SLIDE 2

Outline

  • Introduction & motivation
  • A Johnson-Lindenstrauss lemma (JLL) for the dot product without union bound
  • Corollaries & connections with previous results
  • Numerical validation
  • Application to bounding the generalisation error of compressive linear classifiers
  • Conclusions and future work
SLIDE 3

Introduction

  • Dot product – a key building block in data mining
    – classification, regression, retrieval, correlation-clustering, etc.
  • Random projection (RP) – a universal dimensionality reduction method
    – independent of the data, computationally cheap, has low-distortion guarantees
    – The Johnson-Lindenstrauss lemma (JLL) for Euclidean distances is optimal, but for the dot product the guarantees have been looser; some suggested that obtuse angles may not be preserved.

SLIDE 4

Background: JLL for Euclidean distance

Theorem [Johnson-Lindenstrauss lemma] Let x, y ∈ R^d. Let R ∈ M_{k×d}, k < d, be a random projection matrix with entries drawn i.i.d. from a 0-mean subgaussian distribution with parameter σ², and let Rx, Ry ∈ R^k be the images of x, y under R. Then, ∀ε ∈ (0, 1):

    Pr{‖Rx − Ry‖² < (1 − ε) ‖x − y‖² kσ²} < exp(−kε²/8)    (1)

    Pr{‖Rx − Ry‖² > (1 + ε) ‖x − y‖² kσ²} < exp(−kε²/8)    (2)

An elementary constructive proof is in [Dasgupta & Gupta, 2002]. These bounds are known to be optimal [Larsen & Nelson, 2014].
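As a sanity check, the distance JLL can be probed numerically. The sketch below is not from the paper: it assumes a Gaussian R (one admissible subgaussian choice), with illustrative dimensions and tolerance, and compares the empirical failure rate with the sum of the two tails (1)-(2).

```python
import numpy as np

# Empirical check of the distance JLL, assuming a Gaussian RP matrix with
# i.i.d. N(0, sigma^2) entries; all parameter choices are illustrative.
rng = np.random.default_rng(0)
d, k, sigma2, eps = 300, 100, 1.0, 0.3
x, y = rng.normal(size=d), rng.normal(size=d)
true_sq = np.sum((x - y) ** 2)

trials, fails = 2000, 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    proj_sq = np.sum((R @ x - R @ y) ** 2)
    # distortion relative to the expected value k * sigma^2 * ||x - y||^2
    if not (1 - eps) * true_sq * k * sigma2 < proj_sq < (1 + eps) * true_sq * k * sigma2:
        fails += 1

bound = 2 * np.exp(-k * eps**2 / 8)  # union of the two tails (1) and (2)
print(fails / trials, "<=", round(bound, 3))
```

The empirical failure rate sits well below the analytic tail, as expected for a bound that is tight only in the worst case.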

SLIDE 5

The quick & loose JLL for dot product

    (Rx)ᵀRy = (1/4) (‖R(x + y)‖² − ‖R(x − y)‖²)

Now, applying the JLL to both terms separately and taking the union bound yields:

    Pr{(Rx)ᵀRy < xᵀy kσ² − ε kσ² ‖x‖ ‖y‖} < 2 exp(−kε²/8)

    Pr{(Rx)ᵀRy > xᵀy kσ² + ε kσ² ‖x‖ ‖y‖} < 2 exp(−kε²/8)

Or, starting from (Rx)ᵀRy = (1/2) (‖Rx‖² + ‖Ry‖² − ‖R(x − y)‖²), the union bound runs over three events...

...then we get factors of 3 in front of exp.
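The polarisation identity used in the first step, (Rx)ᵀRy = ¼(‖R(x+y)‖² − ‖R(x−y)‖²), holds deterministically for any matrix R; a few lines of NumPy confirm it (the vectors and dimensions below are arbitrary illustrative choices):

```python
import numpy as np

# Sanity check of the polarisation identity for an arbitrary matrix R:
# (Rx)^T (Ry) = (||R(x+y)||^2 - ||R(x-y)||^2) / 4
rng = np.random.default_rng(1)
d, k = 300, 50
x, y = rng.normal(size=d), rng.normal(size=d)
R = rng.normal(size=(k, d))

lhs = (R @ x) @ (R @ y)
rhs = (np.sum((R @ (x + y)) ** 2) - np.sum((R @ (x - y)) ** 2)) / 4
assert np.isclose(lhs, rhs)
```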

SLIDE 6

Can we improve the JLL for dot products?

The problems:

  • Technical issue: the union bound.
  • More fundamental issue: the ratio of the standard deviation of the projected dot product to the original dot product (the 'coefficient of variation') is unbounded [Li et al. 2006].
  • Other issue: some previous proofs were only applicable to acute angles [Shi et al, 2012]; obtuse angles were investigated only empirically, which is inevitably based on limited numerical tests.

SLIDE 7

Results: Improved bounds for dot product

Theorem [Dot Product under Random Projection] Let x, y ∈ R^d. Let R ∈ M_{k×d}, k < d, be a random projection matrix having i.i.d. 0-mean subgaussian entries with parameter σ², and let Rx, Ry ∈ R^k be the images of x, y under R. Then, ∀ε ∈ (0, 1):

    Pr{(Rx)ᵀRy < xᵀy kσ² − ε kσ² ‖x‖ ‖y‖} < exp(−kε²/8)    (3)

    Pr{(Rx)ᵀRy > xᵀy kσ² + ε kσ² ‖x‖ ‖y‖} < exp(−kε²/8)    (4)

The proof uses elementary techniques: a standard Chernoff bounding argument that exploits the convexity of the exponential function. The union bound is eliminated. (Details in the paper.)
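A quick Monte Carlo sketch of the theorem (Gaussian R assumed, illustrative parameters; this is not code from the paper): both one-sided failure rates should fall below exp(−kε²/8), with no constant factor in front.

```python
import numpy as np

# Monte Carlo check of the improved dot-product bounds (3)-(4), assuming a
# Gaussian RP matrix; dimensions and tolerance are illustrative choices.
rng = np.random.default_rng(2)
d, k, sigma2, eps = 300, 100, 1.0, 0.3
x, y = rng.normal(size=d), rng.normal(size=d)
dot, scale = x @ y, np.linalg.norm(x) * np.linalg.norm(y)

trials, low, high = 2000, 0, 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    p = (R @ x) @ (R @ y)
    if p < dot * k * sigma2 - eps * k * sigma2 * scale:   # tail (3)
        low += 1
    if p > dot * k * sigma2 + eps * k * sigma2 * scale:   # tail (4)
        high += 1

bound = np.exp(-k * eps**2 / 8)  # one-sided tail from the theorem
print(low / trials, high / trials, "each <=", round(bound, 3))
```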
SLIDE 8

Corollaries (1): Clarifying the role of angle

Corollary [Relative distortion bounds] Denote by θ the angle between the vectors x, y ∈ R^d. Then we have the following:

1. Relative distortion bound: Assume xᵀy ≠ 0. Then,

    Pr{ |xᵀRᵀRy / (xᵀy) − kσ²| > ε } < 2 exp(−k ε² cos²(θ) / (8 (kσ²)²))    (5)

2. Multiplicative form of relative distortion bound:

    Pr{xᵀRᵀRy < xᵀy (1 − ε) kσ²} < exp(−k ε² cos²(θ) / 8)    (6)

    Pr{xᵀRᵀRy > xᵀy (1 + ε) kσ²} < exp(−k ε² cos²(θ) / 8)    (7)
SLIDE 9

Observations from Corollary

  • Guarantees are the same for both obtuse and acute angles!
  • Symmetric around orthogonal angles.
  • Relation to the coefficient of variation [Li et al.]:

        Var(xᵀRᵀRy) / (kσ² xᵀy)² ≥ 2/k   (unbounded)    (8)

    Computing this exactly (case of Gaussian R),

        Var(xᵀRᵀRy) / (kσ² xᵀy)² = (1/k) (1 + 1/cos²(θ))    (9)

    we see that an unbounded coefficient of variation occurs only when x and y are perpendicular. Again, symmetric around orthogonal angles.
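Equation (9) can be checked by simulation. The sketch below assumes Gaussian R with unit-variance entries (σ² = 1) and, by rotational invariance of the Gaussian, works directly in the plane spanned by x and y; the sample size and angle are arbitrary choices.

```python
import numpy as np

# Empirical check of (9): for Gaussian R, the squared coefficient of
# variation of x^T R^T R y equals (1/k)(1 + 1/cos^2(theta)).
rng = np.random.default_rng(3)
k, theta, n = 20, np.deg2rad(60), 100_000

# unit vectors at angle theta; the formula depends only on the angle
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])

A = rng.normal(size=(n, k, 2))            # n independent draws of R
proj_x = A @ x                            # shape (n, k): each row is Rx
proj_y = A @ y
samples = (proj_x * proj_y).sum(axis=1)   # x^T R^T R y per draw

cv2_emp = samples.var() / samples.mean() ** 2
cv2_theory = (1 + 1 / np.cos(theta) ** 2) / k   # = (1 + 4)/20 = 0.25 here
print(round(cv2_emp, 3), "vs", cv2_theory)
```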
SLIDE 10

Corollaries (2)

Corollary [Margin type bounds and random sign projection] Denote by θ the angle between the vectors x, y ∈ R^d. Then,

1. Margin bound: Assume xᵀy ≠ 0. Then,

   for all ρ s.t. ρ < xᵀy kσ² and ρ > (cos(θ) − 1) ‖x‖ ‖y‖ kσ²,

       Pr{xᵀRᵀRy < ρ} < exp( −(k/8) (cos(θ) − ρ / (‖x‖ ‖y‖ kσ²))² )    (10)

   for all ρ s.t. ρ > xᵀy kσ² and ρ < (cos(θ) + 1) ‖x‖ ‖y‖ kσ²,

       Pr{xᵀRᵀRy > ρ} < exp( −(k/8) (ρ / (‖x‖ ‖y‖ kσ²) − cos(θ))² )    (11)

SLIDE 11
2. Dot product under random sign projection: Assume xᵀy ≠ 0. Then,

       Pr{ xᵀRᵀRy / (xᵀy) < 0 } < exp( −k cos²(θ) / 8 )    (12)

These forms of the bound, with ρ > 0, are useful for instance to bound the margin loss of compressive classifiers. Details follow shortly. The random sign projection bound was used before to bound the error of compressive classifiers under the 0-1 loss [Durrant & Kabán, ICML'13] in the case of Gaussian RP; here subgaussian RP is allowed.
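The sign-flip bound (12) can be probed with an actual random sign projection, i.e. R with i.i.d. ±1 (Rademacher) entries, which are subgaussian with σ² = 1. The parameters below are illustrative choices.

```python
import numpy as np

# Empirical check of the sign-flip bound (12) under random sign projection:
# R has i.i.d. Rademacher (+1/-1) entries; parameters are illustrative.
rng = np.random.default_rng(4)
d, k = 100, 50
theta = np.deg2rad(75)                    # wide, but not orthogonal
x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)

trials, flips = 5000, 0
for _ in range(trials):
    R = rng.choice([-1.0, 1.0], size=(k, d))
    if (R @ x) @ (R @ y) < 0:             # sign of the dot product flipped
        flips += 1

bound = np.exp(-k * np.cos(theta) ** 2 / 8)
print(flips / trials, "<=", round(bound, 3))
```

Even at this wide angle the observed flip rate stays below the analytic tail.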

SLIDE 12

Numerical validation

We will compute empirical estimates of the following probabilities, from 2000 independently drawn instances of the RP. The target dimension varies from 1 to the original dimension d = 300.

  • Rejection probability for dot product preservation = probability that the relative distortion of the dot product after RP falls outside the allowed error tolerance ε:

        1 − Pr{ (1 − ε) < (Rx)ᵀRy / (xᵀy) < (1 + ε) }    (13)

  • The sign flipping probability:

        Pr{ (Rx)ᵀRy / (xᵀy) < 0 }    (14)
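A reduced version of this protocol can be sketched in NumPy (a few values of k rather than the full range 1..300, and a single illustrative angle; the analytic comparison uses the bound from (6)-(7)):

```python
import numpy as np

# Estimate the rejection probability (13) from 2000 draws of a Gaussian RP
# and compare with 2 exp(-k eps^2 cos^2(theta) / 8); the angle and the grid
# of target dimensions are illustrative choices.
rng = np.random.default_rng(5)
d, eps, theta = 300, 0.3, np.deg2rad(30)
x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)
dot = x @ y

results = []
for k in (50, 150, 300):
    rej = 0
    for _ in range(2000):
        # with sigma^2 = 1/k the projected dot product is unbiased for x^T y
        R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
        ratio = ((R @ x) @ (R @ y)) / dot
        rej += not (1 - eps < ratio < 1 + eps)
    bound = 2 * np.exp(-k * eps**2 * np.cos(theta) ** 2 / 8)
    results.append((k, rej / 2000, min(bound, 1.0)))
    print(k, rej / 2000, "<=", round(min(bound, 1.0), 3))
```

As on the slides, the rejection probability decays with k, and the analytic curve upper-bounds it throughout.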
SLIDE 13

Replicating the results in [Shi et al, ICML'12]. Left: Two acute angles; Right: Two obtuse angles. Preservation of these obtuse angles indeed looks worse... but not because they are obtuse (see next slide!).

SLIDE 14

Now take angles symmetric around π/2 and observe the opposite behaviour. This is why the previous result in [Shi et al, ICML'12] has been misleading. Left: Two acute angles; Right: Two obtuse angles.

SLIDE 15

Numerical validation – full picture

Left: Empirical estimates of the rejection probability for dot product preservation; Right: Our analytic upper bound. The error tolerance was set to ε = 0.3. Darker means higher probability.

SLIDE 16

The same with ε = 0.1.

The bound matches the true behaviour: all of these probabilities are symmetric around the angles π/2 and 3π/2 (i.e. orthogonal vectors before RP). Thus, the preservation of the dot product is symmetrically identical for acute and obtuse angles.

SLIDE 17

Empirical estimates of the sign flipping probability vs. our analytic upper bound. Darker means higher probability.

SLIDE 18

An application in machine learning: Margin bound on compressive linear classification

Consider the hypothesis class of linear classifiers defined by a unit length parameter vector:

    H = {x ↦ h(x) = wᵀx : w ∈ R^d, ‖w‖₂ = 1}    (15)

The parameters w are estimated from a training set of size N: T_N = {(x_n, y_n)}_{n=1}^N, where (x_n, y_n) ~ i.i.d. D over X × {−1, 1}, X ⊆ R^d. We will work with the margin loss:

    ℓ_ρ(u) = 0 if ρ ≤ u;   1 − u/ρ if u ∈ [0, ρ];   1 if u ≤ 0    (16)
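For concreteness, the margin loss (16) can be transcribed directly as a function (inputs below are purely illustrative):

```python
# Direct implementation of the margin loss ell_rho(u) from (16):
# 0 above the margin, a linear ramp on [0, rho], and 1 for mistakes.
def margin_loss(u: float, rho: float) -> float:
    if u >= rho:
        return 0.0
    if u <= 0:
        return 1.0
    return 1.0 - u / rho

print(margin_loss(1.0, 0.5), margin_loss(0.25, 0.5), margin_loss(-1.0, 0.5))
# → 0.0 0.5 1.0
```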

SLIDE 19

We are interested in the case when d is large and N not proportionately so. Use an RP matrix R ∈ M_{k×d}, k < d, with entries R_ij drawn i.i.d. from a subgaussian distribution with parameter 1/k. Analogous definitions hold in the reduced k-dimensional space. The hypothesis class:

    H_R = {x ↦ h_R(Rx) = w_Rᵀ Rx : w_R ∈ R^k, ‖w_R‖₂ = 1}    (17)

where the parameters w_R ∈ R^k are estimated from T_N^R = {(Rx_n, y_n)}_{n=1}^N by minimising the empirical margin error:

    ĥ_R = arg min_{h_R ∈ H_R} (1/N) Σ_{n=1}^N ℓ_ρ(h_R(Rx_n), y_n)    (18)

The quantity of interest is the generalisation error of ĥ_R as a random function of both T_N and R:

    E_{(x,y)~D} 1[ĥ_R(Rx) ≠ y]    (19)
SLIDE 20

Theorem. Let R be a k × d (k < d) matrix having i.i.d. 0-mean subgaussian entries with parameter 1/k, and let T_N^R = {(Rx_n, y_n)}_{n=1}^N be the compressed training set, where (x_n, y_n) are drawn i.i.d. from some distribution D. For any δ ∈ (0, 1), the following holds with probability at least 1 − 3δ for the empirical minimiser of the margin loss in the RP space, ĥ_R, uniformly for any margin parameter ρ ∈ (0, 1):

    E_{(x,y)~D} 1[ĥ_R(Rx) ≠ y] ≤ min_{h∈H} { (1/N) Σ_{n=1}^N 1(h(x_n)y_n < ρ) + S_k + √(3 log(1/δ) S_k) }
        + (4/ρ) · (1/√N) · (1 + √(8 log(1/δ)/k)) · √(Tr(XXᵀ)/N)
        + √(log log₂(2/ρ) / N) + 3 √(log(4/δ) / (2N))

where θ_n is the angle between the parameter vector of h and the vector x_n y_n, the function 1(·) takes value 1 if its argument is true and 0 otherwise, X is the N × d matrix that holds the input points, and

    S_k = (1/N) Σ_{n=1}^N 1(h(x_n)y_n ≥ ρ) exp( −(k/8) ( cos(θ_n) − ρ / ((1 + √(8 log(1/δ)/k)) ‖x_n‖) )² ) + δ.

SLIDE 21

Illustration of the bound

Illustration of the predictive behaviour of the bound (δ = 0.1 and ρ = 0.05) on the Advert classification data set from the UCI repository (d = 1554 features and N = 3279 points). The empirical error was estimated on holdout sets using SVM with default settings and 30 random splits (in proportion 2/3 training & 1/3 testing) of the data. We standardised the data first, and scaled it so that max_{n∈{1,...,N}} ‖x_n‖ = 1.

SLIDE 22

Conclusions & Future work

  • We proved new bounds on the dot product under random projection that take the same form as the optimal bounds on the Euclidean distance in the Johnson-Lindenstrauss lemma. The dot product is ubiquitous in data mining, and the use of RP on this operation is now better justified.
  • We cleared up the controversy about the preservation of obtuse angles and clarified the precise role of angles in the relative distortion of the dot product under random projection.
  • We further discussed connections with the notion of margin in generalisation theory; our connections with sign random projections generalise earlier results.
  • Our proof technique applies to any subgaussian RP matrix with i.i.d. entries. In future work it would be of interest to see whether it could be adapted to Fast JL transforms, whose entries are not i.i.d.

SLIDE 23

Selected References

[Achlioptas] D. Achlioptas. Database-friendly Random Projections: Johnson-Lindenstrauss with Binary Coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

[Balcan & Blum] M.F. Balcan, A. Blum, S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.

[Bingham & Mannila] E. Bingham, H. Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining (KDD), pp. 245–250, ACM Press, 2001.

[Buldygin & Kozachenko] V.V. Buldygin, Y.V. Kozachenko. Metric characterization of random variables and random processes. American Mathematical Society, 2000.

[Dasgupta & Gupta] S. Dasgupta, A. Gupta. An elementary proof of the Johnson-Lindenstrauss Lemma. Random Structures & Algorithms, 22:60–65, 2002.

[Durrant & Kabán] R.J. Durrant, A. Kabán. Sharp generalization error bounds for randomly-projected classifiers. ICML'13, Journal of Machine Learning Research - Proceedings Track, 28(3):693–701, 2013.

[Larsen & Nelson] K.G. Larsen, J. Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. arXiv preprint arXiv:1411.2404, 2014.

[Li et al.] P. Li, T. Hastie, K. Church. Improving random projections using marginal information. In Proc. Conference on Learning Theory (COLT), pp. 635–649, 2006.

[Shi et al.] Q. Shi, C. Shen, R. Hill, A. van den Hengel. Is margin preserved after random projection? In Proc. 29th International Conference on Machine Learning (ICML), pp. 591–598, 2012.