Randomized methods for machine learning
David Lopez-Paz, FAIR. May 17, 2016. http://tinyurl.com/randomized-practical

Some random examples: building atomic bombs, truncated SVD, dimensionality reduction, kernel methods for big data, nonlinear component analysis, dependence measurement, low-dimensional kernel mean embeddings.

Building atomic bombs
The Monte Carlo method [Eck87]: approximate an expectation E_{x∼p}[f(x)] by the sample average (1/m) Σ_{i=1}^m f(x_i), with x_i ∼ p.
O(m^{-1/2}) convergence regardless of the dimensionality of x! Why?
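A quick numerical illustration of this dimension-free rate (the test function and dimension below are arbitrary choices, not from the slides):

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] for f(x) = ||x||^2 with x ~ N(0, I_d);
# the true value is d, and the error should shrink roughly like m^(-1/2).
rng = np.random.default_rng(0)
d, n_trials = 100, 200
for m in [10, 100, 1_000, 10_000]:
    errors = []
    for _ in range(n_trials):
        x = rng.standard_normal((m, d))
        estimate = np.mean(np.sum(x ** 2, axis=1))  # (1/m) sum_i f(x_i)
        errors.append(abs(estimate - d))
    print(f"m = {m:6d}   mean |error| = {np.mean(errors):.3f}")
```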
Still cross-validating over grids? From [BB12], a.k.a. "the rule of 59" [NLB16]: P(F_μ(min(x_1, ..., x_T)) ≤ α) = 1 − (1 − α)^T, where F_μ is the CDF of the score of a random configuration. With α = 0.05 and T = 59 random trials, this probability exceeds 0.95.
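A minimal random-search loop in the spirit of [BB12]; the objective `evaluate` and its search ranges are hypothetical stand-ins for a real cross-validation score:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(log_lr, log_reg):
    # Hypothetical validation error; in practice, train a model with these
    # hyperparameters and return its cross-validated loss.
    return (log_lr + 3.0) ** 2 + 0.1 * (log_reg + 1.0) ** 2 + rng.normal(0.0, 0.01)

T = 59  # with alpha = 0.05, 1 - (1 - 0.05)**59 > 0.95: "the rule of 59"
trials = [(rng.uniform(-6, 0), rng.uniform(-4, 2)) for _ in range(T)]
scores = [evaluate(lr, reg) for lr, reg in trials]
best_lr, best_reg = trials[int(np.argmin(scores))]
print(f"best configuration: log_lr = {best_lr:.2f}, log_reg = {best_reg:.2f}")
```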
Truncated SVD
From research.facebook.com/blog/fast-randomized-svd/. Complexity of O(mn²)! [GVL12]
Computation of a rank-r SVD of A ∈ R^{m×n}: find Q ∈ R^{m×r} with orthonormal columns such that A ≈ QQ^⊤A, take the SVD of the small matrix Q^⊤A, and multiply its left singular vectors by Q.
Hey, but how do I compute Q? At random! :)
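A short numpy sketch of this recipe, along the lines of the randomized range finder of Halko et al. [HMT11]; the oversampling and power-iteration parameters below are common defaults, not values from the slides:

```python
import numpy as np

def randomized_svd(A, r, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate rank-r SVD of A via a randomly sampled range basis Q."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(n, r + n_oversample)
    Y = A @ rng.standard_normal((n, k))        # sample the range of A
    for _ in range(n_power_iter):              # optional power iterations
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis, A ~ Q Q^T A
    U_b, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_b)[:, :r], s[:r], Vt[:r]

# Example: a 2000 x 500 matrix with rapidly decaying spectrum.
A = np.random.default_rng(1).standard_normal((2000, 500)) @ np.diag(1.0 / np.arange(1, 501))
U, s, Vt = randomized_svd(A, r=20)
print(s[:5])
```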
Dimensionality reduction
Random projections offer fast and efficient dimensionality reduction.
[Figure: two points x_1, x_2 ∈ R^40500 at distance δ are mapped by a random projection W ∈ R^{40500×100} to points y_1, y_2 ∈ R^100 at distance δ(1 ± ε).] In symbols, (1 − ε)‖x_1 − x_2‖² ≤ ‖y_1 − y_2‖² ≤ (1 + ε)‖x_1 − x_2‖². This result is formalized in the Johnson-Lindenstrauss Lemma.
The proof is one example of Erdős's probabilistic method. [Photos: Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William Johnson (1944-).] See §12.5 of Foundations of Machine Learning (Mohri et al., 2012).
Lemma: Let Q be a random variable following a χ² distribution with k degrees of freedom. Then, for any 0 < ε < 1/2:
Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Proof: start with Markov's inequality,
Pr[Q ≥ (1 + ε)k] = Pr[e^{λQ} ≥ e^{λ(1+ε)k}] ≤ E[e^{λQ}] / e^{λ(1+ε)k} = (1 − 2λ)^{−k/2} / e^{λ(1+ε)k},
where E[e^{λQ}] = (1 − 2λ)^{−k/2} is the mgf of a χ² distribution, valid for λ < 1/2.
To tighten the bound, we minimize the right-hand side with λ = ε / (2(1 + ε)):
Pr[Q ≥ (1 + ε)k] ≤ (1 − ε/(1 + ε))^{−k/2} e^{−εk/2} = (1 + ε)^{k/2} / (e^ε)^{k/2} = ((1 + ε)/e^ε)^{k/2}.
Using 1 + ε ≤ e^{ε − (ε² − ε³)/2} yields
Pr[Q ≥ (1 + ε)k] ≤ ((1 + ε)/e^ε)^{k/2} ≤ (e^{ε − (ε² − ε³)/2} / e^ε)^{k/2} = e^{−k(ε² − ε³)/4}.
Pr[Q ≤ (1 − ε)k] is bounded similarly, and the lemma follows by a union bound.
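As a sanity check, one can compare the empirical probability against this bound by simulation (the values of k and ε below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, n_samples = 100, 0.25, 200_000
Q = rng.chisquare(df=k, size=n_samples)
empirical = np.mean(((1 - eps) * k <= Q) & (Q <= (1 + eps) * k))
bound = 1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(f"empirical P = {empirical:.4f}  >=  lower bound = {bound:.4f}")
```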
Lemma: Let x ∈ R^N, k < N and A ∈ R^{k×N} with A_ij ∼ N(0, 1). Then, for any 0 ≤ ε ≤ 1/2:
Pr[(1 − ε)‖x‖² ≤ ‖(1/√k) Ax‖² ≤ (1 + ε)‖x‖²] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Proof: let x̂ = Ax. Then E[x̂_j] = 0, and
E[x̂_j²] = E[(Σ_{i=1}^N A_ji x_i)²] = E[Σ_{i=1}^N A_ji² x_i²] = Σ_{i=1}^N x_i² = ‖x‖².
Note that T_j = x̂_j / ‖x‖ ∼ N(0, 1). Then Q = Σ_{j=1}^k T_j² ∼ χ²_k. Remember the previous lemma?
Pr[(1 − ε)‖x‖² ≤ ‖(1/√k) Ax‖² ≤ (1 + ε)‖x‖²] = Pr[(1 − ε)‖x‖² ≤ ‖x̂‖²/k ≤ (1 + ε)‖x‖²]
= Pr[(1 − ε)k ≤ ‖x̂‖²/‖x‖² ≤ (1 + ε)k] = Pr[(1 − ε)k ≤ Σ_{j=1}^k T_j² ≤ (1 + ε)k]
= Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Lemma (Johnson-Lindenstrauss): For any 0 < ε < 1/2 and any integer m > 4, let k = 20 log m / ε². Then, for any set V of m points in R^N there exists f : R^N → R^k such that for all u, v ∈ V:
(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
Proof: Let f = (1/√k) A, with A ∈ R^{k×N}, k < N and A_ij ∼ N(0, 1).
◮ Apply the previous lemma with x = u − v to lower bound the per-pair success probability by 1 − 2e^{−(ε² − ε³)k/4}.
◮ Union bound over the m² pairs in V with k = 20 log m / ε² and ε < 1/2 to obtain:
Pr[success] ≥ 1 − 2m² e^{−(ε² − ε³)k/4} = 1 − 2m^{5ε − 3} > 1 − 2m^{−1/2} > 0.
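Here is a small numpy experiment in this spirit: project a set of points with f(x) = (1/√k) Ax and inspect the distortion of pairwise squared distances (the point set and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, eps = 500, 10_000, 0.25
k = int(np.ceil(20 * np.log(m) / eps**2))   # k from the statement above
X = rng.standard_normal((m, N))             # m arbitrary points in R^N
A = rng.standard_normal((k, N))
Y = (X @ A.T) / np.sqrt(k)                  # f(x) = (1/sqrt(k)) A x

# distortion of a few thousand random pairs
i, j = rng.integers(0, m, size=(2, 2000))
keep = i != j
d_x = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1) ** 2
d_y = np.linalg.norm(Y[i[keep]] - Y[j[keep]], axis=1) ** 2
ratios = d_y / d_x
print(f"squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}] "
      f"(target: [{1 - eps}, {1 + eps}])")
```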
Kernel methods for big data
Kernel methods: k(x, x') = ⟨φ(x), φ(x')⟩_H, and predictors take the form f(x) ≈ Σ_{i=1}^n α_i k(x, x_i).
To compute {α_i}_{i=1}^n, construct the n × n monster: K_ij = k(x_i, x_j).
Theorem (Mercer's condition [Mer09]): Under mild technical assumptions, k admits a representation
k(x, x') = Σ_{j=1}^∞ λ_j φ_{λ_j}(x) φ_{λ_j}(x').
If ‖λ‖_1 := Σ_j |λ_j| < ∞, we can cast the previous as
k(x, x') = ‖λ‖_1 E_{λ∼p(λ)}[φ_λ(x) φ_λ(x')].
Any ideas? :) Monte Carlo again! But what are those {φ_λ}_λ?
For shift-invariant k, the basis {φ_{λ_i}}_i is the Fourier basis:
k(x − x') = c_k ∫ p(w) cos(⟨w, x − x'⟩) dw
= 2c_k ∫∫ (p(w)/2π) cos(⟨w, x⟩ + b) cos(⟨w, x'⟩ + b) dw db
≈ (2c_k/m) Σ_{i=1}^m cos(⟨w_i, x⟩ + b_i) cos(⟨w_i, x'⟩ + b_i) = ⟨z(x), z(x')⟩,
where w_i ∼ p(w), b_i ∼ U[0, 2π] for all 1 ≤ i ≤ m, and
z(x) = √(2c_k/m) (cos(⟨w_1, x⟩ + b_1), ..., cos(⟨w_m, x⟩ + b_m))^⊤ ∈ R^m.
Gaussian kernel exp(−γ‖x − x'‖²) ⇒ p(w) = N(0, 2γI).
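A compact numpy sketch of these random Fourier features for the Gaussian kernel (where c_k = 1), checking the kernel approximation on a few points:

```python
import numpy as np

def fourier_features(X, m, gamma, seed=0):
    """z(x) for the Gaussian kernel exp(-gamma * ||x - x'||^2), with c_k = 1."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))  # w_i ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2 * np.pi, size=m)                        # b_i ~ U[0, 2*pi]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
gamma = 0.5
Z = fourier_features(X, m=20_000, gamma=gamma)
K_hat = Z @ Z.T                                                    # <z(x), z(x')>
K = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print("max |K_hat - K| =", np.abs(K_hat - K).max())                # shrinks as m grows
```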
[Figures: GP posterior mean and uncertainty (x versus f(x)) on toy data, fit with 3, 4, 5, 10, 20, 200, and 400 random features.]
Nonlinear component analysis
Kernel PCA is used for exploratory analysis, manifold learning, and feature learning, but computing KPCA takes O(n³) time.
Randomize! Perform linear PCA on top of random features. For the analysis, we will study the spectral properties of
◮ the true kernel matrix K_ij = k(x_i, x_j), and
◮ its randomized approximation K̂_ij = ⟨z(x_i), z(x_j)⟩_{R^m}.
That is, we are interested in upper-bounding ‖K̂ − K‖.
Theorem (RPCA error bound [LSS+14]): Let |z(x)_j| ≤ 1 for all x and j. Then,
E‖K̂ − K‖ ≤ √(3n² log n / m) + 2n log n / m.
Theorem (Matrix Bernstein [Tro15]): Let Z_1, ..., Z_m be independent symmetric d × d random matrices. Assume that E[Z_i] = 0 and that ‖Z_i‖ ≤ R. Define the variance parameter σ² := ‖Σ_i E[Z_i²]‖. Then,
E‖Σ_i Z_i‖ ≤ √(2σ² log(2d)) + (1/3) R log(2d).
Proof of the RPCA error bound:
◮ Let z_i = (z(x_1)_i, ..., z(x_n)_i) ∈ R^n, the i-th random feature evaluated at the n data points.
◮ Define residuals E_i = (1/m)(z_i z_i^⊤ − K), so that E := Σ_{i=1}^m E_i = K̂ − K.
◮ Bound the residual norm:
‖E_i‖ = (1/m)‖z_i z_i^⊤ − E[z z^⊤]‖ ≤ (1/m)(‖z_i‖² + E[‖z‖²]) ≤ 2n/m =: R.
◮ Bound the marginal variances:
E[E_i²] = (1/m²) E[(z_i z_i^⊤ − K)²] = (1/m²) E[‖z_i‖² z_i z_i^⊤ − z_i z_i^⊤ K − K z_i z_i^⊤ + K²] ≼ nK/m².
◮ Combine the marginal variances with Jensen's inequality:
σ² := ‖Σ_{i=1}^m E[E_i²]‖ ≤ n‖K‖/m.
◮ Invoke Matrix Bernstein on E − E[E] = E = K̂ − K.
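A minimal sketch of the resulting procedure (random Fourier features followed by ordinary linear PCA), assuming a Gaussian kernel and made-up data; it never forms the n × n kernel matrix:

```python
import numpy as np

def fourier_features(X, m, gamma, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def randomized_kpca(X, n_components, m=2000, gamma=1.0):
    Z = fourier_features(X, m, gamma)
    Z -= Z.mean(axis=0)                        # center in random-feature space
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T             # nonlinear principal components

X = np.random.default_rng(2).standard_normal((500, 10))
components = randomized_kpca(X, n_components=5)
print(components.shape)                        # (500, 5)
```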
[Figures: MNIST reconstructions from the top 20 nonlinear PCs; CIFAR-10 reconstructions from the top 40 and top 100 nonlinear PCs.]
Dependence measurement
[Figure: scatter plot of two variables whose Pearson correlation is ≈ 0.0054.]
From [LPHS13, LP16]
Low-dimensional kernel mean embeddings
Given {x_i}_{i=1}^{n_x} ∼ P^{n_x} and {y_i}_{i=1}^{n_y} ∼ Q^{n_y}, is P = Q?
Remember Arthur's lecture? Use the MMD test! [GBR+12]
MMD²({x_i}, {y_i}, k) = ‖μ_k({x_i}) − μ_k({y_i})‖²_{H_k}
= (1/n_x²) Σ_{i,j=1}^{n_x} k(x_i, x_j) − (2/(n_x n_y)) Σ_{i,j=1}^{n_x, n_y} k(x_i, y_j) + (1/n_y²) Σ_{i,j=1}^{n_y} k(y_i, y_j).
Computing MMD takes O(n_x² + n_x n_y + n_y²) time...
Given {x_i}_{i=1}^{n_x} ∼ P^{n_x} and {y_i}_{i=1}^{n_y} ∼ Q^{n_y}, is P = Q?
Remember Arthur's lecture? Use the randomized MMD test!
RMMD²({x_i}, {y_i}, k) = ‖(1/n_x) Σ_{i=1}^{n_x} z(x_i) − (1/n_y) Σ_{i=1}^{n_y} z(y_i)‖²_{R^m}.
Computing RMMD takes O(n_x + n_y) time! From [LP16].
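A short numpy sketch of the RMMD statistic with Gaussian-kernel random features (the bandwidth, feature count, and toy data below are arbitrary choices):

```python
import numpy as np

def rmmd2(X, Y, m=500, gamma=1.0, seed=0):
    """Squared distance between the mean random-feature embeddings of X and Y."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    z = lambda D: np.sqrt(2.0 / m) * np.cos(D @ W + b)
    diff = z(X).mean(axis=0) - z(Y).mean(axis=0)   # O(n_x + n_y) time
    return float(diff @ diff)

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 2))
Y = rng.standard_normal((2000, 2)) + np.array([0.5, 0.0])   # shifted distribution
print("RMMD^2, P != Q:", rmmd2(X, Y))
print("RMMD^2, P == Q:", rmmd2(X, rng.standard_normal((2000, 2))))
```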
[Figure: per-pair causal direction decisions (x → y or x ← y) on a collection of cause-effect pairs.]
[Figure: classification accuracy versus decision rate for RCC, ANM, and IGCI.]
From [LP16]
References
[BB12] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 13(1):281–305, 2012.
[Eck87] Roger Eckhardt. Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, pages 131–143, 1987.
[GBR+12] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13(1):723–773, 2012.
[GVL12] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[LP16] David Lopez-Paz. From dependence to causation. PhD thesis, University of Cambridge, 2016.
[LPHS13] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In NIPS, pages 1–9, 2013.
[LSS+14] David Lopez-Paz, Suvrit Sra, Alexander J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. In ICML, pages 1359–1367, 2014.
[Mer09] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, pages 415–446, 1909.
[NLB16] Robert Nishihara, David Lopez-Paz, and Léon Bottou. No regret bound for extreme bandits. In AISTATS, 2016.
[RR08] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2008.
[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.