Randomized methods for machine learning
David Lopez-Paz, FAIR. May 17, 2016. http://tinyurl.com/randomized-practical

Some random examples: building atomic bombs, truncated SVD, dimensionality reduction, kernel methods for big data, nonlinear component analysis, dependence measurement, low-dimensional kernel mean embeddings.

Building atomic bombs
The Monte Carlo method [Eck87]: approximate an expectation E_{x∼p}[f(x)] by the sample average (1/m) Σ_{i=1}^m f(x_i), with x_i ∼ p.
O(m^{-1/2}) convergence regardless of the dimensionality of x! Why?
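A quick numerical illustration of this dimension-free rate (the test function and dimension below are arbitrary choices, not from the slides):

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] for f(x) = ||x||^2 with x ~ N(0, I_d);
# the true value is d, and the error should shrink roughly like m^(-1/2).
rng = np.random.default_rng(0)
d, n_trials = 100, 200
for m in [10, 100, 1_000, 10_000]:
    errors = []
    for _ in range(n_trials):
        x = rng.standard_normal((m, d))
        estimate = np.mean(np.sum(x ** 2, axis=1))  # (1/m) sum_i f(x_i)
        errors.append(abs(estimate - d))
    print(f"m = {m:6d}   mean |error| = {np.mean(errors):.3f}")
```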
Still cross-validating over grids? From [BB12], a.k.a. "the rule of 59" [NLB16]: P(F_μ(min(x_1, ..., x_T)) ≤ α) = 1 − (1 − α)^T, where F_μ is the CDF of the score of a random configuration. With α = 0.05 and T = 59 random trials, this probability exceeds 0.95.
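A minimal random-search loop in the spirit of [BB12]; the objective `evaluate` and its search ranges are hypothetical stand-ins for a real cross-validation score:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(log_lr, log_reg):
    # Hypothetical validation error; in practice, train a model with these
    # hyperparameters and return its cross-validated loss.
    return (log_lr + 3.0) ** 2 + 0.1 * (log_reg + 1.0) ** 2 + rng.normal(0.0, 0.01)

T = 59  # with alpha = 0.05, 1 - (1 - 0.05)**59 > 0.95: "the rule of 59"
trials = [(rng.uniform(-6, 0), rng.uniform(-4, 2)) for _ in range(T)]
scores = [evaluate(lr, reg) for lr, reg in trials]
best_lr, best_reg = trials[int(np.argmin(scores))]
print(f"best configuration: log_lr = {best_lr:.2f}, log_reg = {best_reg:.2f}")
```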
Truncated SVD
From research.facebook.com/blog/fast-randomized-svd/. Complexity of O(mn²)! [GVL12]
Computation of a rank-r SVD of A ∈ R^{m×n}: find Q ∈ R^{m×r} with orthonormal columns such that A ≈ QQ^⊤A, take the SVD of the small matrix Q^⊤A, and multiply its left singular vectors by Q.
Hey, but how do I compute Q? At random! :)
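A short numpy sketch of this recipe, along the lines of the randomized range finder of Halko et al. [HMT11]; the oversampling and power-iteration parameters below are common defaults, not values from the slides:

```python
import numpy as np

def randomized_svd(A, r, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate rank-r SVD of A via a randomly sampled range basis Q."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(n, r + n_oversample)
    Y = A @ rng.standard_normal((n, k))        # sample the range of A
    for _ in range(n_power_iter):              # optional power iterations
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis, A ~ Q Q^T A
    U_b, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_b)[:, :r], s[:r], Vt[:r]

# Example: a 2000 x 500 matrix with rapidly decaying spectrum.
A = np.random.default_rng(1).standard_normal((2000, 500)) @ np.diag(1.0 / np.arange(1, 501))
U, s, Vt = randomized_svd(A, r=20)
print(s[:5])
```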
Dimensionality reduction
Random projections offer fast and efficient dimensionality reduction.
[Figure: two points x_1, x_2 ∈ R^40500 at distance δ are mapped by a random projection W ∈ R^{40500×100} to points y_1, y_2 ∈ R^100 at distance δ(1 ± ε).] In symbols, (1 − ε)‖x_1 − x_2‖² ≤ ‖y_1 − y_2‖² ≤ (1 + ε)‖x_1 − x_2‖². This result is formalized in the Johnson-Lindenstrauss Lemma.
The proof is one example of Erdős's probabilistic method. [Photos: Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William Johnson (1944-).] See §12.5 of Foundations of Machine Learning (Mohri et al., 2012).
Lemma: Let Q be a random variable following a χ² distribution with k degrees of freedom. Then, for any 0 < ε < 1/2:
Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Proof: start with Markov's inequality,
Pr[Q ≥ (1 + ε)k] = Pr[e^{λQ} ≥ e^{λ(1+ε)k}] ≤ E[e^{λQ}] / e^{λ(1+ε)k} = (1 − 2λ)^{−k/2} / e^{λ(1+ε)k},
where E[e^{λQ}] = (1 − 2λ)^{−k/2} is the mgf of a χ² distribution, valid for λ < 1/2.
To tighten the bound, we minimize the right-hand side with λ = ε / (2(1 + ε)):
Pr[Q ≥ (1 + ε)k] ≤ (1 − ε/(1 + ε))^{−k/2} e^{−εk/2} = (1 + ε)^{k/2} / (e^ε)^{k/2} = ((1 + ε)/e^ε)^{k/2}.
Using 1 + ε ≤ e^{ε − (ε² − ε³)/2} yields
Pr[Q ≥ (1 + ε)k] ≤ ((1 + ε)/e^ε)^{k/2} ≤ (e^{ε − (ε² − ε³)/2} / e^ε)^{k/2} = e^{−k(ε² − ε³)/4}.
Pr[Q ≤ (1 − ε)k] is bounded similarly, and the lemma follows by a union bound.
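As a sanity check, one can compare the empirical probability against this bound by simulation (the values of k and ε below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, n_samples = 100, 0.25, 200_000
Q = rng.chisquare(df=k, size=n_samples)
empirical = np.mean(((1 - eps) * k <= Q) & (Q <= (1 + eps) * k))
bound = 1 - 2 * np.exp(-(eps**2 - eps**3) * k / 4)
print(f"empirical P = {empirical:.4f}  >=  lower bound = {bound:.4f}")
```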
Lemma: Let x ∈ R^N, k < N and A ∈ R^{k×N} with A_ij ∼ N(0, 1). Then, for any 0 ≤ ε ≤ 1/2:
Pr[(1 − ε)‖x‖² ≤ ‖(1/√k) Ax‖² ≤ (1 + ε)‖x‖²] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Proof: let x̂ = Ax. Then E[x̂_j] = 0, and
E[x̂_j²] = E[(Σ_{i=1}^N A_ji x_i)²] = E[Σ_{i=1}^N A_ji² x_i²] = Σ_{i=1}^N x_i² = ‖x‖².
Note that T_j = x̂_j / ‖x‖ ∼ N(0, 1). Then Q = Σ_{j=1}^k T_j² ∼ χ²_k. Remember the previous lemma?
Pr[(1 − ε)‖x‖² ≤ ‖(1/√k) Ax‖² ≤ (1 + ε)‖x‖²] = Pr[(1 − ε)‖x‖² ≤ ‖x̂‖²/k ≤ (1 + ε)‖x‖²]
= Pr[(1 − ε)k ≤ ‖x̂‖²/‖x‖² ≤ (1 + ε)k] = Pr[(1 − ε)k ≤ Σ_{j=1}^k T_j² ≤ (1 + ε)k]
= Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4}.
Lemma (Johnson-Lindenstrauss): For any 0 < ε < 1/2 and any integer m > 4, let k = 20 log m / ε². Then, for any set V of m points in R^N there exists f : R^N → R^k such that for all u, v ∈ V:
(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
Proof: Let f = (1/√k) A, with A ∈ R^{k×N}, k < N and A_ij ∼ N(0, 1).
◮ Apply the previous lemma with x = u − v to lower bound the per-pair success probability by 1 − 2e^{−(ε² − ε³)k/4}.
◮ Union bound over the m² pairs in V with k = 20 log m / ε² and ε < 1/2 to obtain:
Pr[success] ≥ 1 − 2m² e^{−(ε² − ε³)k/4} = 1 − 2m^{5ε − 3} > 1 − 2m^{−1/2} > 0.
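Here is a small numpy experiment in this spirit: project a set of points with f(x) = (1/√k) Ax and inspect the distortion of pairwise squared distances (the point set and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, eps = 500, 10_000, 0.25
k = int(np.ceil(20 * np.log(m) / eps**2))   # k from the statement above
X = rng.standard_normal((m, N))             # m arbitrary points in R^N
A = rng.standard_normal((k, N))
Y = (X @ A.T) / np.sqrt(k)                  # f(x) = (1/sqrt(k)) A x

# distortion of a few thousand random pairs
i, j = rng.integers(0, m, size=(2, 2000))
keep = i != j
d_x = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1) ** 2
d_y = np.linalg.norm(Y[i[keep]] - Y[j[keep]], axis=1) ** 2
ratios = d_y / d_x
print(f"squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}] "
      f"(target: [{1 - eps}, {1 + eps}])")
```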
Kernel methods for big data
Kernel methods: k(x, x') = ⟨φ(x), φ(x')⟩_H, and predictors take the form f(x) ≈ Σ_{i=1}^n α_i k(x, x_i).
To compute {α_i}_{i=1}^n, construct the n × n monster: K_ij = k(x_i, x_j).
Theorem (Mercer's condition [Mer09]): Under mild technical assumptions, k admits a representation
k(x, x') = Σ_{j=1}^∞ λ_j φ_{λ_j}(x) φ_{λ_j}(x').
If ‖λ‖_1 := Σ_j |λ_j| < ∞, we can cast the previous as
k(x, x') = ‖λ‖_1 E_{λ∼p(λ)}[φ_λ(x) φ_λ(x')].
Any ideas? :) Monte Carlo again! But what are those {φ_λ}_λ?
For shift-invariant k, the basis {φ_{λ_i}}_i is the Fourier basis:
k(x − x') = c_k ∫ p(w) cos(⟨w, x − x'⟩) dw
= 2c_k ∫∫ (p(w)/2π) cos(⟨w, x⟩ + b) cos(⟨w, x'⟩ + b) dw db
≈ (2c_k/m) Σ_{i=1}^m cos(⟨w_i, x⟩ + b_i) cos(⟨w_i, x'⟩ + b_i) = ⟨z(x), z(x')⟩,
where w_i ∼ p(w), b_i ∼ U[0, 2π] for all 1 ≤ i ≤ m, and
z(x) = √(2c_k/m) (cos(⟨w_1, x⟩ + b_1), ..., cos(⟨w_m, x⟩ + b_m))^⊤ ∈ R^m.
Gaussian kernel exp(−γ‖x − x'‖²) ⇒ p(w) = N(0, 2γI).
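A compact numpy sketch of these random Fourier features for the Gaussian kernel (where c_k = 1), checking the kernel approximation on a few points:

```python
import numpy as np

def fourier_features(X, m, gamma, seed=0):
    """z(x) for the Gaussian kernel exp(-gamma * ||x - x'||^2), with c_k = 1."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))  # w_i ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2 * np.pi, size=m)                        # b_i ~ U[0, 2*pi]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
gamma = 0.5
Z = fourier_features(X, m=20_000, gamma=gamma)
K_hat = Z @ Z.T                                                    # <z(x), z(x')>
K = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print("max |K_hat - K| =", np.abs(K_hat - K).max())                # shrinks as m grows
```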
[Figures: GP posterior mean and uncertainty (x versus f(x)) on toy data, fit with 3, 4, 5, 10, 20, 200, and 400 random features.]
Nonlinear component analysis
Kernel PCA is used for exploratory analysis, manifold learning, and feature learning, but computing KPCA takes O(n³) time.
Randomize! Perform linear PCA on top of random features. For the analysis, we will study the spectral properties of
◮ the true kernel matrix K_ij = k(x_i, x_j), and
◮ its randomized approximation K̂_ij = ⟨z(x_i), z(x_j)⟩_{R^m}.
That is, we are interested in upper-bounding ‖K̂ − K‖.
Theorem (RPCA error bound [LSS+14]): Let |z(x)_j| ≤ 1 for all x and j. Then,
E‖K̂ − K‖ ≤ √(3n² log n / m) + 2n log n / m.
Theorem (Matrix Bernstein [Tro15]): Let Z_1, ..., Z_m be independent symmetric d × d random matrices. Assume that E[Z_i] = 0 and that ‖Z_i‖ ≤ R. Define the variance parameter σ² := ‖Σ_i E[Z_i²]‖. Then,
E‖Σ_i Z_i‖ ≤ √(2σ² log(2d)) + (1/3) R log(2d).
Proof of the RPCA error bound:
◮ Let z_i = (z(x_1)_i, ..., z(x_n)_i) ∈ R^n, the i-th random feature evaluated at the n data points.
◮ Define residuals E_i = (1/m)(z_i z_i^⊤ − K), so that E := Σ_{i=1}^m E_i = K̂ − K.
◮ Bound the residual norm:
‖E_i‖ = (1/m)‖z_i z_i^⊤ − E[z z^⊤]‖ ≤ (1/m)(‖z_i‖² + E[‖z‖²]) ≤ 2n/m =: R.
◮ Bound the marginal variances:
E[E_i²] = (1/m²) E[(z_i z_i^⊤ − K)²] = (1/m²) E[‖z_i‖² z_i z_i^⊤ − z_i z_i^⊤ K − K z_i z_i^⊤ + K²] ≼ nK/m².
◮ Combine the marginal variances with Jensen's inequality:
σ² := ‖Σ_{i=1}^m E[E_i²]‖ ≤ n‖K‖/m.
◮ Invoke Matrix Bernstein on E − E[E] = E = K̂ − K.
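A minimal sketch of the resulting procedure (random Fourier features followed by ordinary linear PCA), assuming a Gaussian kernel and made-up data; it never forms the n × n kernel matrix:

```python
import numpy as np

def fourier_features(X, m, gamma, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def randomized_kpca(X, n_components, m=2000, gamma=1.0):
    Z = fourier_features(X, m, gamma)
    Z -= Z.mean(axis=0)                        # center in random-feature space
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T             # nonlinear principal components

X = np.random.default_rng(2).standard_normal((500, 10))
components = randomized_kpca(X, n_components=5)
print(components.shape)                        # (500, 5)
```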
[Figures: MNIST reconstructions from the top 20 nonlinear PCs; CIFAR-10 reconstructions from the top 40 and top 100 nonlinear PCs.]
Dependence measurement
[Figure: scatter plot of two variables whose Pearson correlation is ≈ 0.0054.]
From [LPHS13, LP16]
Low-dimensional kernel mean embeddings
Given {x_i}_{i=1}^{n_x} ∼ P^{n_x} and {y_i}_{i=1}^{n_y} ∼ Q^{n_y}, is P = Q?
Remember Arthur's lecture? Use the MMD test! [GBR+12]
MMD²({x_i}, {y_i}, k) = ‖μ_k({x_i}) − μ_k({y_i})‖²_{H_k}
= (1/n_x²) Σ_{i,j=1}^{n_x} k(x_i, x_j) − (2/(n_x n_y)) Σ_{i,j=1}^{n_x, n_y} k(x_i, y_j) + (1/n_y²) Σ_{i,j=1}^{n_y} k(y_i, y_j).
Computing MMD takes O(n_x² + n_x n_y + n_y²) time...
Given {x_i}_{i=1}^{n_x} ∼ P^{n_x} and {y_i}_{i=1}^{n_y} ∼ Q^{n_y}, is P = Q?
Remember Arthur's lecture? Use the randomized MMD test!
RMMD²({x_i}, {y_i}, k) = ‖(1/n_x) Σ_{i=1}^{n_x} z(x_i) − (1/n_y) Σ_{i=1}^{n_y} z(y_i)‖²_{R^m}.
Computing RMMD takes O(n_x + n_y) time! From [LP16].
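A short numpy sketch of the RMMD statistic with Gaussian-kernel random features (the bandwidth, feature count, and toy data below are arbitrary choices):

```python
import numpy as np

def rmmd2(X, Y, m=500, gamma=1.0, seed=0):
    """Squared distance between the mean random-feature embeddings of X and Y."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], m))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    z = lambda D: np.sqrt(2.0 / m) * np.cos(D @ W + b)
    diff = z(X).mean(axis=0) - z(Y).mean(axis=0)   # O(n_x + n_y) time
    return float(diff @ diff)

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 2))
Y = rng.standard_normal((2000, 2)) + np.array([0.5, 0.0])   # shifted distribution
print("RMMD^2, P != Q:", rmmd2(X, Y))
print("RMMD^2, P == Q:", rmmd2(X, rng.standard_normal((2000, 2))))
```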
[Figure: per-pair causal direction decisions (x → y or x ← y) on a collection of cause-effect pairs.]
[Figure: classification accuracy versus decision rate for RCC, ANM, and IGCI.]
From [LP16]
References
[BB12] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 13(1):281–305, 2012.
[Eck87] Roger Eckhardt. Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, pages 131–143, 1987.
[GBR+12] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13(1):723–773, 2012.
[GVL12] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[LP16] David Lopez-Paz. From dependence to causation. PhD thesis, University of Cambridge, 2016.
[LPHS13] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In NIPS, pages 1–9, 2013.
[LSS+14] David Lopez-Paz, Suvrit Sra, Alexander J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. In ICML, pages 1359–1367, 2014.
[Mer09] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, pages 415–446, 1909.
[NLB16] Robert Nishihara, David Lopez-Paz, and Léon Bottou. No regret bound for extreme bandits. In AISTATS, 2016.
[RR08] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2008.
[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.