

slide-1
SLIDE 1

Final exam review

CS 446

slide-2
SLIDE 2

Selected lecture slides

1 / 61

slide-3
SLIDE 3

Hoeffding's inequality

Theorem (Hoeffding's inequality). Given IID Z_i ∈ [a, b],

Pr[ (1/n) ∑_{i=1}^n Z_i − E Z_1 ≥ ε ] ≤ exp( −2nε² / (b − a)² ).

Alternatively, with probability at least 1 − δ,

(1/n) ∑_{i=1}^n Z_i ≤ E Z_1 + (b − a) √( ln(1/δ) / (2n) ).

Remarks.
◮ Can flip the inequality by replacing Z_i with −Z_i.
◮ Using the second ("inverted") form: with probability at least 1 − δ, R(h) ≤ R̂(h) + √( ln(1/δ) / (2n) ).
◮ Alternatively: setting δ = 10^(−k), with probability at least 99.99···9% (k 9s), we have R(h) ≤ R̂(h) + √( k ln(10) / (2n) ): to add more bits of confidence, we must increase the sample size n linearly.
◮ Hoeffding captures a "concentration of measure" phenomenon: probability mass concentrates within [−1/√n, +1/√n].

2 / 61
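To make the 1/√n rate concrete, here is a small simulation sketch (illustration only, assuming NumPy; the choices n = 1000, 2000 trials, and Bernoulli(0.3) variables are arbitrary): it checks that the fraction of trials whose sample mean exceeds E Z_1 by the Hoeffding deviation is at most δ.

import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 1000, 2000, 0.05
a, b = 0.0, 1.0
# Z_i ~ Bernoulli(0.3), so E Z_1 = 0.3 and Z_i ∈ [0, 1].
Z = rng.binomial(1, 0.3, size=(trials, n)).astype(float)
deviations = Z.mean(axis=1) - 0.3
# Hoeffding: with probability ≥ 1 − δ, the mean is ≤ E Z_1 + (b − a) sqrt(ln(1/δ) / (2n)).
bound = (b - a) * np.sqrt(np.log(1 / delta) / (2 * n))
print("empirical Pr[deviation ≥ bound]:", np.mean(deviations >= bound), "≤ δ =", delta)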

slide-5
SLIDE 5

Finite class bound — summary

Theorem. Let predictors (h_1, . . . , h_k) be given. With probability ≥ 1 − δ over an IID draw ((X_i, Y_i))_{i=1}^n,

R(h_j) ≤ R̂(h_j) + √( (ln k + ln(1/δ)) / (2n) )   for all j.

Remarks.
◮ If we choose (h_1, . . . , h_k) before seeing ((X_i, Y_i))_{i=1}^n, we can use this bound.
◮ Example: train k classifiers, then pick the best one on a validation set!
◮ This approach of "producing a bound for all possible algorithm outputs" may seem sloppy, but it's the best we have!
◮ Letting F = (h_1, . . . , h_k) denote our set of predictors, the bound reads: with probability ≥ 1 − δ, every f ∈ F satisfies

R(f) ≤ R̂(f) + √( (ln |F| + ln(1/δ)) / (2n) ).

In the next sections, we'll handle |F| = ∞ by replacing ln |F| with complexity(F), whose meaning will vary.

3 / 61

slide-6
SLIDE 6

VC dimension overview

Theorem. With probability at least 1 − δ, every f ∈ F satisfies

R(f) ≤ R̂(f) + O( √( (VC(F) + ln(1/δ)) / n ) ),

where VC(F), the Vapnik-Chervonenkis dimension of F, is the largest number of points on which F can realize all labelings:

VC(F) := sup{ n ∈ Z : ∃(x_1, . . . , x_n), ∀(y_1, . . . , y_n), ∃f ∈ F, R̂_{0/1}(f) = 0 }.

Remarks.
◮ |F| can be infinite!
◮ The definition only requires some set of points we can label in every way; this set is unrelated to the IID sample in the bound.
◮ Say that F shatters (x_1, . . . , x_n) when it can realize all labelings.

4 / 61

slide-10
SLIDE 10

Rademacher complexity

Definition. Given examples (x_1, . . . , x_n) and functions F,

Rad(F) = E_ε max_{f∈F} (1/n) ∑_{i=1}^n ε_i f(x_i),

where (ε_1, . . . , ε_n) are IID Rademacher random variables (Pr[ε_i = +1] = Pr[ε_i = −1] = 1/2).

Remarks.
◮ Interpretation: the ability of F to fit random signs.
◮ We should make F as tight as possible; e.g., for SVM, we'll incorporate C.
◮ Compared to VC: depends on (x_1, . . . , x_n), doesn't require classification, and is sensitive to the scale of f.
◮ The general form (not presented here) can handle including labels, multiclass, amongst other things.

5 / 61
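A Monte Carlo sketch of this definition for the bounded linear class used on the examples slide (F = {x ↦ wᵀx : ‖w‖ ≤ W}); the closed-form inner maximum W‖∑_i ε_i x_i‖/n is a standard fact, and the data, R, W, and the number of sign draws here are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, W, R = 200, 5, 1.0, 1.0
X = rng.normal(size=(n, d))
X *= R / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # enforce ||x_i|| <= R
draws = []
for _ in range(500):
    eps = rng.choice([-1.0, 1.0], size=n)
    # max_{||w|| <= W} (1/n) sum_i eps_i w.x_i  =  W ||sum_i eps_i x_i|| / n
    draws.append(W * np.linalg.norm(eps @ X) / n)
print("Monte Carlo Rad estimate:", np.mean(draws), " bound R*W/sqrt(n):", R * W / np.sqrt(n))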

slide-15
SLIDE 15

Generalization via Rademacher complexity

Theorem (simplified Rademacher generalization). Let predictors F and a distribution on (X, Y) be given. Suppose for (almost) any (x, y), the loss ℓ satisfies:
◮ There exists ρ ≥ 0 so that for any f, g ∈ F, |ℓ(f(x), y) − ℓ(g(x), y)| ≤ ρ|f(x) − g(x)| ("ρ-Lipschitz").
◮ There exists [a, b] so that ℓ(f(x), y) ∈ [a, b] for any f ∈ F.

With probability ≥ 1 − δ, every f ∈ F satisfies

R_ℓ(f) ≤ R̂_ℓ(f) + 2ρ Rad(F) + 3(b − a) √( ln(2/δ) / (2n) ).

Remarks.
◮ Can get a bound in terms of 1/(λn) for linear SVM and ridge regression. (Homework problems?)
◮ Kernel SVM okay as well.

6 / 61

slide-22
SLIDE 22

Rademacher complexity examples

Definition. Given examples (x_1, . . . , x_n) and functions F,

Rad(F) = E_ε max_{f∈F} (1/n) ∑_{i=1}^n ε_i f(x_i),

where (ε_1, . . . , ε_n) are IID Rademacher random variables (Pr[ε_i = +1] = Pr[ε_i = −1] = 1/2).

Examples.
◮ If ‖x‖ ≤ R, then Rad({x ↦ xᵀw : ‖w‖ ≤ W}) ≤ RW/√n. For SVM, we can set W = √(2/λ).
◮ For deep networks, we have Rad(F) ≤ Lipschitz · √(Junk/n); still very loose.

7 / 61

slide-25
SLIDE 25

Unsupervised learning

Now we only receive (x_i)_{i=1}^n, and the goal is. . . ?

◮ Encoding data in some compact representation (and decoding this).
◮ Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ . . . ?

The task is less clear-cut. In 2019 we still have people trying to formalize it!

8 / 61

slide-28
SLIDE 28

SVD reminder

1. SV triples: (s, u, v) satisfies Mv = su and Mᵀu = sv.
2. Thin decomposition SVD: M = ∑_{i=1}^r s_i u_i v_iᵀ.
3. Full factorization SVD: M = USVᵀ.
4. "Operational" view of SVD: for M ∈ R^{n×d},

M = [u_1 · · · u_r  u_{r+1} · · · u_n] · diag(s_1, . . . , s_r) · [v_1 · · · v_r  v_{r+1} · · · v_d]ᵀ,

with the middle factor padded with zeros to be n × d. The first parts of U, V span the column / row space (respectively), the second parts the left / right nullspaces (respectively).

New: let (U_k, S_k, V_k) denote the truncated SVD, with U_k ∈ R^{n×k} the first k columns of U, and similarly for the others.

9 / 61
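A small NumPy sketch of these facts (SV triples, thin SVD, truncated SVD); the matrix here is random and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD: M = U @ diag(s) @ Vt
k = 2
Mk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k truncated SVD
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))       # SV triple: M v_1 = s_1 u_1
print(np.linalg.norm(M - Mk, "fro"), np.sqrt(np.sum(s[k:] ** 2)))  # residual = sqrt of discarded s_i^2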

slide-35
SLIDE 35

PCA properties

Theorem. Let X ∈ R^{n×d} with SVD X = USVᵀ and an integer k ≤ r be given. Then

min_{D ∈ R^{k×d}, E ∈ R^{d×k}} ‖X − XED‖²_F = min_{D ∈ R^{d×k}, DᵀD = I} ‖X − XDDᵀ‖²_F = ‖X − XV_kV_kᵀ‖²_F = ∑_{i=k+1}^r s_i².

Additionally,

min_{D ∈ R^{d×k}, DᵀD = I} ‖X − XDDᵀ‖²_F = ‖X‖²_F − max_{D ∈ R^{d×k}, DᵀD = I} ‖XD‖²_F = ‖X‖²_F − ‖XV_k‖²_F = ‖X‖²_F − ∑_{i=1}^k s_i².

Remark 1. The SVD is not unique, but ∑_{i=1}^r s_i² is unique.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting. . .

10 / 61
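A quick numerical check of the theorem; the random X is an arbitrary example. The projection onto the top-k right singular vectors attains residual ∑_{i>k} s_i² and captures ∑_{i≤k} s_i².

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
Vk = Vt[:k].T                                           # top-k right singular vectors (d x k)
residual = np.linalg.norm(X - X @ Vk @ Vk.T, "fro") ** 2
print(np.isclose(residual, np.sum(s[k:] ** 2)))                            # discarded spectrum
print(np.isclose(np.linalg.norm(X @ Vk, "fro") ** 2, np.sum(s[:k] ** 2)))  # captured spectrum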

slide-41
SLIDE 41

Centered PCA

Some treatments replace X with X − 1μᵀ, where μ = (1/n) ∑_{i=1}^n x_i is the mean.

◮ (1/n) XᵀX ∈ R^{d×d} is the data covariance;
◮ (1/n) (XD)ᵀ(XD) is the data covariance after projection;
◮ lastly, (1/n) ‖XD‖²_F = (1/n) tr( (XD)ᵀ(XD) ) = (1/n) ∑_{i=1}^k (XDe_i)ᵀ(XDe_i),

therefore PCA is maximizing the resulting per-coordinate variances!
11 / 61

slide-45
SLIDE 45

Lloyd’s method revisited

  • 1. Choose initial clusters (S1, . . . , Sk).
  • 2. Repeat until convergence:

2.1 (Recenter.) Set µj := mean(Sj) for j ∈ (1, . . . , k). 2.2 (Reassign). Update Sj := {xi : µ(xi) = µj} for j ∈ (1, . . . , k). (“µ(xi)” means “center closest to xi”; break ties arbitrarily).

12 / 61

slide-46
SLIDE 46

Lloyd’s method revisited

  • 1. Choose initial clusters (S1, . . . , Sk).
  • 2. Repeat until convergence:

2.1 (Recenter.) Set µj := mean(Sj) for j ∈ (1, . . . , k). 2.2 (Reassign). Update Sj := {xi : µ(xi) = µj} for j ∈ (1, . . . , k). (“µ(xi)” means “center closest to xi”; break ties arbitrarily).

Geometric perspective: ◮ Centers define a Voronoi diagram/partition: for each µj, define cell Vj := {x ∈ Rd : µ(x) = µj} (break ties arbitrarily). ◮ Reassignment leaves assignment consistent with Voronoi cells. ◮ Recentering might shift data outside Voronoi cells, except if we’ve converged! ◮ See http://mjt.cs.illinois.edu/htv/ for an interactive demo.

12 / 61
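A minimal NumPy sketch of the two steps, assuming Euclidean distance and a random-subset initialization (kmeans++ initialization appears a few slides later).

import numpy as np

def lloyd(X, k, iters=100, seed=0):
    """Minimal sketch of Lloyd's method (k-means): alternate reassignment and recentering."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # simple random initialization
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances (n x k)
        assign = d2.argmin(axis=1)                                      # reassign step
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                        for j in range(k)])                             # recenter step
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign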

slide-47
SLIDE 47

Does Lloyd's method solve the original problem?

Theorem.
◮ For all t, φ(C_t; A_{t−1}) ≥ φ(C_t; A_t) ≥ φ(C_{t+1}; A_t).
◮ The method terminates.

Proof.
◮ The first property is from the earlier theorem and the definition of the algorithm:

φ(C_t; A_t) = φ(C_t; A(C_t)) = min_{A∈A} φ(C_t; A) ≤ φ(C_t; A_{t−1}),
φ(C_{t+1}; A_t) = φ(C(A_t); A_t) = min_{C∈C} φ(C; A_t) ≤ φ(C_t; A_t).

◮ The previous property implies the cost is nonincreasing. Combined with the termination condition: every partition except the final one is visited at most once, and there are finitely many partitions of (x_i)_{i=1}^n.

(That didn't answer the question. . . )

13 / 61

slide-50
SLIDE 50

Seriously: does Lloyd's method solve the original problem?

◮ In practice, Lloyd's method seems to optimize well; in theory, the output can have unboundedly poor cost. (Suppose the width is c > 1 and the height is 1.)
◮ In practice, the method takes few iterations; in theory, it can take 2^{Ω(√n)} iterations! (Examples of this are painful; but note, the problem is NP-hard, and the convergence proof used the number of partitions. . . )

So: in practice, yes; in theory, we don't know. . .

14 / 61

slide-53
SLIDE 53

Application: vector quantization.

Vector quantization with k-means.
◮ Let (x_i)_{i=1}^n be given.
◮ Run k-means to obtain (μ_1, . . . , μ_k).
◮ Replace each x_i with μ(x_i), its closest center.

Encoding size reduces from O(nd) to O(kd + n ln(k)).

Examples.
◮ Audio compression.
◮ Image compression.

15 / 61
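A sketch of the quantization step, reusing the lloyd() sketch above; the random data, the choice k = 32, and the 64-bits-per-float accounting are arbitrary illustrative choices.

import numpy as np

X = np.random.default_rng(1).normal(size=(1000, 16))
k = 32
centers, assign = lloyd(X, k)                       # from the earlier sketch
X_quantized = centers[assign]                       # decoded data: each row replaced by its center
bits = k * X.shape[1] * 64 + len(X) * np.log2(k)    # ~ O(kd) floats plus n*log2(k) index bits
print("reconstruction MSE:", np.mean((X - X_quantized) ** 2), "approx bits:", int(bits))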

slide-54
SLIDE 54 (through SLIDE 66)

[Figure slides: an image and its patch-quantized reconstructions, for patch widths 10, 25, and 50 and codebooks ranging from 8 up to 2048 exemplars (e.g., "patch quantization, width 10, 8 exemplars", . . . , "patch quantization, width 50, 64 exemplars").]

16 / 61

slide-67
SLIDE 67

Initialization matters!

◮ Easy choices:
  ◮ k random points from the dataset.
  ◮ Random partition.
◮ Standard choice (theory and practice): "D²-sampling" / kmeans++ (sketched below):
  1. Choose μ_1 uniformly at random from the data.
  2. For j ∈ (2, . . . , k):
     2.1 Choose x_i with probability ∝ min_{l<j} ‖x_i − μ_l‖², and set μ_j := x_i.
◮ kmeans++ is a randomized furthest-first traversal; regular furthest-first fails with outliers.
◮ scikit-learn and Matlab both default to kmeans++.

17 / 61
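A minimal sketch of D²-sampling as described above (NumPy assumed; this is not the scikit-learn implementation).

import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Minimal D^2-sampling (kmeans++) initialization sketch."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # first center: uniform at random
    for _ in range(1, k):
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                        # sample proportional to squared distance
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)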

slide-68
SLIDE 68

Maximum likelihood: abstract formulation

We've had one main "meta-algorithm" this semester:
◮ (Regularized) ERM principle: pick the model that minimizes an average loss over training data.

We've also discussed another: the "maximum likelihood estimation (MLE)" principle:
◮ Pick a set of probability models for your data: P := {p_θ : θ ∈ Θ}.
◮ p_θ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples (z_i)_{i=1}^n, pick the model that maximizes the likelihood:

max_{θ∈Θ} L(θ) = max_{θ∈Θ} ln ∏_{i=1}^n p_θ(z_i) = max_{θ∈Θ} ∑_{i=1}^n ln p_θ(z_i),

where the ln(·) is for mathematical convenience, and z_i can be a labeled pair (x_i, y_i) or just x_i.

18 / 61

slide-70
SLIDE 70

Example 1: coin flips.

◮ We flip a coin of bias θ ∈ [0, 1].
◮ Write down x_i = 0 for tails, x_i = 1 for heads; then

p_θ(x_i) = x_iθ + (1 − x_i)(1 − θ),   or alternatively   p_θ(x_i) = θ^{x_i}(1 − θ)^{1−x_i}.

The second form will be more convenient.
◮ Writing H := ∑_i x_i and T := ∑_i (1 − x_i) = n − H for convenience,

L(θ) = ∑_{i=1}^n [ x_i ln θ + (1 − x_i) ln(1 − θ) ] = H ln θ + T ln(1 − θ).

Differentiating and setting to 0,

0 = H/θ − T/(1 − θ),   which gives   θ = H/(T + H) = H/n.

◮ In this way, we've justified a natural algorithm.

19 / 61
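A quick numeric check of this derivation (the true bias 0.3 and n = 500 are arbitrary): the grid maximizer of L(θ) matches the closed form H/n.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=500)                      # flips from a coin with true bias 0.3
H, n = x.sum(), len(x)
theta_mle = H / n                                       # closed-form MLE from the slide
grid = np.linspace(0.01, 0.99, 999)
loglik = H * np.log(grid) + (n - H) * np.log(1 - grid)  # L(θ) = H ln θ + T ln(1 − θ)
print(theta_mle, grid[np.argmax(loglik)])               # the grid maximizer matches H/n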

slide-73
SLIDE 73

Example 2: mean of a Gaussian

◮ Suppose x_i ∼ N(μ, σ²), so θ = (μ, σ²), and

ln p_θ(x_i) = ln [ exp( −(x_i − μ)²/(2σ²) ) / √(2πσ²) ] = −(x_i − μ)²/(2σ²) − ln(2πσ²)/2.

◮ Therefore L(θ) = −(1/(2σ²)) ∑_{i=1}^n (x_i − μ)² + (stuff without μ); applying ∇_μ and setting to zero gives μ = (1/n) ∑_i x_i.
◮ A similar derivation gives σ² = (1/n) ∑_i (x_i − μ)².

20 / 61

slide-75
SLIDE 75

Example 4: Naive Bayes

◮ Let's try a simple prediction setup, with (Bayes) optimal classifier

arg max_{y∈Y} p(Y = y | X = x).

(We haven't discussed this concept a lot, but it's widespread in ML.)
◮ One way to proceed is to learn p(Y | X) exactly; that's a pain.
◮ Let's assume the coordinates of X = (X_1, . . . , X_d) are independent given Y:

p(Y = y | X = x) = p(Y = y, X = x) / p(X = x) = p(X = x | Y = y) p(Y = y) / p(X = x) = p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y) / p(X = x),

and

arg max_{y∈Y} p(Y = y | X = x) = arg max_{y∈Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).

21 / 61

slide-78
SLIDE 78

Example 4: Naive Bayes (part 2)

arg max_{y∈Y} p(Y = y | X = x) = arg max_{y∈Y} p(Y = y) ∏_{j=1}^d p(X_j = x_j | Y = y).

Examples where this helps:
◮ Suppose X ∈ {0, 1}^d has an arbitrary distribution; it's specified with 2^d − 1 numbers, whereas the factored form above needs only d numbers per class. Instead of having to learn a probability model over 2^d possibilities, we now have to learn d + 1 models each with 2 possibilities (binary labels).
◮ HW5 will use the standard "Iris dataset". This data is continuous; Naive Bayes would approximate univariate distributions.

22 / 61
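A hedged sketch of Naive Bayes with binary features, as in the first example above; the Laplace smoothing constant alpha and the helper names are my own additions, not from the slides.

import numpy as np

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Fit class priors p(Y) and per-coordinate p(X_j = 1 | Y) with Laplace smoothing."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha) for c in classes])
    return classes, priors, cond

def bernoulli_nb_predict(x, classes, priors, cond):
    # argmax_y  log p(Y = y) + sum_j log p(X_j = x_j | Y = y)
    logp = np.log(priors) + (np.log(cond) @ x + np.log(1 - cond) @ (1 - x))
    return classes[np.argmax(logp)]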

slide-80
SLIDE 80

Gaussian Mixture Model

◮ Suppose the data is drawn from k Gaussians, meaning

Y = j ∼ Discrete(π_1, . . . , π_k),   X | Y = j ∼ N(μ_j, Σ_j),

and the parameters are θ = ((π_1, μ_1, Σ_1), . . . , (π_k, μ_k, Σ_k)). (Note: this is a generative model, and we have a way to sample.)
◮ The probability density (with parameters θ = ((π_j, μ_j, Σ_j))_{j=1}^k) at a given x is

p_θ(x) = ∑_{j=1}^k p_θ(x | y = j) p_θ(y = j) = ∑_{j=1}^k p_{μ_j,Σ_j}(x) π_j,

and the likelihood problem is

L(θ) = ∑_{i=1}^n ln ∑_{j=1}^k ( π_j / √((2π)^d |Σ_j|) ) exp( −(1/2)(x_i − μ_j)ᵀ Σ_j^{−1} (x_i − μ_j) ).

The ln and the exp are no longer next to each other; we can't just take the derivative and set the answer to 0.

23 / 61

slide-82
SLIDE 82

Pearson's crabs.

Statistician Karl Pearson wanted to understand the distribution of "forehead breadth to body length" for 1000 crabs.

[Figure: histogram of the ratio, roughly over the range 0.58–0.68.]

Doesn't look Gaussian!

24 / 61

slide-84
SLIDE 84

Pearson's crabs.

Statistician Karl Pearson wanted to understand the distribution of "forehead breadth to body length" for 1000 crabs.

[Figure: the same histogram.]

Pearson fit a mixture of two Gaussians.

Remark. Pearson did not use E-M. For this he invented the "method of moments" and obtained a solution by hand.

25 / 61

slide-86
SLIDE 86

Gaussian mixture likelihood with responsibility matrix R

Let's replace ∑_{i=1}^n ln ∑_{j=1}^k π_j p_{μ_j,Σ_j}(x_i) with

∑_{i=1}^n ∑_{j=1}^k R_{ij} ln( π_j p_{μ_j,Σ_j}(x_i) ),

where R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n} is a responsibility matrix.

Holding R fixed and optimizing θ gives

π_j := ∑_{i=1}^n R_{ij} / ∑_{i=1}^n ∑_{l=1}^k R_{il} = (∑_{i=1}^n R_{ij}) / n,
μ_j := ∑_{i=1}^n R_{ij} x_i / ∑_{i=1}^n R_{ij} = (∑_{i=1}^n R_{ij} x_i) / (nπ_j),
Σ_j := ∑_{i=1}^n R_{ij} (x_i − μ_j)(x_i − μ_j)ᵀ / (nπ_j).

(Should use the new mean in Σ_j so that all derivatives are 0.)

26 / 61

slide-88
SLIDE 88

Generalizing the assignment matrix to GMMs

We introduced an assignment matrix A ∈ {0, 1}^{n×k}:
◮ For each x_i, define μ(x_i) to be a closest center: ‖x_i − μ(x_i)‖ = min_j ‖x_i − μ_j‖.
◮ For each i, set A_{ij} = 1[μ(x_i) = μ_j].
◮ Key property: by this choice,

φ(C; A) = ∑_{i=1}^n ∑_{j=1}^k A_{ij} ‖x_i − μ_j‖² = ∑_{i=1}^n min_j ‖x_i − μ_j‖² = φ(C);

therefore we can decrease φ(C) = φ(C; A) first by optimizing C to get φ(C′; A) ≤ φ(C; A), then setting A′ as above to get φ(C′) = φ(C′; A′) ≤ φ(C′; A) ≤ φ(C; A) = φ(C). In other words: we minimize φ(C) via φ(C; A).

What fulfills the same role for L?

27 / 61

slide-91
SLIDE 91

E-M method for latent variable models

Define the augmented likelihood

L(θ; R) := ∑_{i=1}^n ∑_{j=1}^k R_{ij} ln( p_θ(x_i, y_i = j) / R_{ij} ),

with responsibility matrix R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n}.

Alternate two steps:
◮ E-step: set (R_t)_{ij} := p_{θ_{t−1}}(y_i = j | x_i).
◮ M-step: set θ_t = arg max_{θ∈Θ} L(θ; R_t).

Soon: we'll see this gives nondecreasing likelihood!

28 / 61

slide-93
SLIDE 93

E-M for Gaussian mixtures

Initialization: a standard choice is π_j = 1/k, Σ_j = I, and (μ_j)_{j=1}^k given by k-means.

◮ E-step: Set R_{ij} = p_θ(y_i = j | x_i), meaning

R_{ij} = p_θ(y_i = j, x_i) / p_θ(x_i) = π_j p_{μ_j,Σ_j}(x_i) / ∑_{l=1}^k π_l p_{μ_l,Σ_l}(x_i).

◮ M-step: solve arg max_{θ∈Θ} L(θ; R), meaning

π_j := (∑_{i=1}^n R_{ij}) / n,
μ_j := (∑_{i=1}^n R_{ij} x_i) / (nπ_j),
Σ_j := ∑_{i=1}^n R_{ij} (x_i − μ_j)(x_i − μ_j)ᵀ / (nπ_j).

(These are as before.)

29 / 61
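A compact E-M sketch following the E and M steps above (SciPy assumed for the Gaussian density). The purely random initialization and the small ridge added to each covariance for numerical stability are my simplifications; the slide suggests initializing the means with k-means.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    """Minimal E-M sketch for a Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E-step: responsibilities R_ij ∝ pi_j N(x_i; mu_j, Sigma_j)
        R = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted counts, means, covariances
        Nj = R.sum(axis=0)
        pi = Nj / n
        mu = (R.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, Sigma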

slide-94
SLIDE 94 (through SLIDE 121)

Demo: elliptical clusters

[Animation slides: E-M run on a 2-d dataset with elliptical clusters; each frame alternates an E-step and an M-step ("E. . . M. . .").]

30 / 61

slide-122
SLIDE 122

Theorem. Suppose (R_0, θ_0) ∈ R_{n,k} × Θ is arbitrary, and thereafter (R_t, θ_t) is given by E-M:

(R_t)_{ij} := p_{θ_{t−1}}(y = j | x_i)   and   θ_t := arg max_{θ∈Θ} L(θ; R_t).

Then

L(θ_t; R_t) ≤ max_{R∈R_{n,k}} L(θ_t; R) = L(θ_t; R_{t+1}) = L(θ_t) ≤ L(θ_{t+1}; R_{t+1}).

In particular, L(θ_t) ≤ L(θ_{t+1}).

Remarks.
◮ We proved a similar guarantee for k-means, which is also an alternating minimization scheme.
◮ Similarly, MLE for Gaussian mixtures is NP-hard; it is also known to need exponentially many samples in k to information-theoretically recover the parameters.

31 / 61

slide-124
SLIDE 124

Parameter constraints.

E-M for GMMs still works if we freeze or constrain some parameters. Examples:
◮ No weights: initialize π = (1/k, . . . , 1/k) and never update it.
◮ Diagonal covariance matrices: update everything as before, except

Σ_j := diag((σ_j)²_1, . . . , (σ_j)²_d)   where   (σ_j)²_l := ∑_{i=1}^n R_{ij} (x_i − μ_j)²_l / (nπ_j);

that is: we use coordinate-wise sample variances weighted by R.

Why is this a good idea? Computation (of the inverse), sample complexity, . . .

32 / 61

slide-127
SLIDE 127

Singularities

E-M with GMMs suffers from singularities: trivial situations where the likelihood goes to ∞ but the solution is bad.
◮ Suppose: d = 1, k = 2, π_j = 1/2, n = 3 with x_1 = −1, x_2 = +1, and x_3 = +3. Initialize with μ_1 = 0 and σ_1 = 1, but μ_2 = +3 = x_3 and σ_2 = 1/100. Then σ_2 → 0 and L ↑ ∞.

33 / 61

slide-128
SLIDE 128

Graphical models

Recall the generative story for GMMs:

Y ∼ Discrete(π_1, . . . , π_k)   (pick a Gaussian);   X | Y = j ∼ N(μ_j, Σ_j)   (pick a point).

Y is latent/hidden/unobserved, X is observed. This model consists of random variables (X, Y) with a specific conditional dependence structure.

A graphical model is a compact way of representing a family of random variables, most notably their conditional dependencies. Typically, the graphical model gives us
◮ a way to write down the (joint) probability distribution,
◮ guidance on how to sample.

34 / 61

slide-131
SLIDE 131

Graphical model for GMMs

[Diagram: node Y with an arrow to node X; X is shaded.]

Basic rules (there are many more):
◮ Nodes denote random variables. (Here we have (X, Y).)
◮ Edges denote conditional dependence. (Here X depends on Y.)
◮ Shaded nodes (e.g., X) are observed; unshaded (Y) are unobserved.

Likelihood of observations (X_1, . . . , X_n) drawn from a GMM:

p(X_1, . . . , X_n) = ∑_{j_1,...,j_n ∈ {1,...,k}} p(X_1, . . . , X_n, Y_1 = j_1, . . . , Y_n = j_n)
                   = ∑_{j_1,...,j_n ∈ {1,...,k}} ∏_{i=1}^n p(Y_i = j_i) p(X_i | Y_i = j_i)
                   = ∏_{i=1}^n ∑_{j_i=1}^k p(Y_i = j_i) p(X_i | Y_i = j_i).

35 / 61

slide-133
SLIDE 133

Graphical model for Naive Bayes

[Diagram: node Y with arrows to nodes X_1, X_2, . . . , X_d; all nodes are shaded.]

Recall the Naive Bayes model: both inputs and outputs (X, Y ) are observed (both are shaded!); coordinates (X1, . . . , Xd) are conditionally independent given Y (as indicated by the arrows!).

36 / 61

slide-134
SLIDE 134

Why do people use graphical models?

◮ Easy to interpret how data inter-depends and flows.
◮ Easy to add nodes and edges based on observations and beliefs.
◮ MLE, E-M, and others provide a well-weathered toolbox to fit them to data, sample, etc.
◮ Were very popular in the natural sciences (easy to encode domain knowledge); it's not yet clear how deep networks are displacing them (how do we encode prior knowledge with deep networks?).

37 / 61

slide-135
SLIDE 135

Basic generative story

Generative networks provide a way to sample from any distribution.
1. Sample z ∼ μ, where μ denotes an efficiently sampleable distribution (e.g., uniform or Gaussian).
2. Output g(z), where g : R^d → R^m is a deep network.

Notation: let g#μ (the pushforward of μ through g) denote this distribution.

Brief remarks:
◮ Can this model any target distribution ν? Yes, (roughly) for the same reason that g can approximate any f : R^d → R^m.
◮ Graphical models let us sample and estimate probabilities; what about here? We can sample, but not estimate probabilities.

38 / 61

slide-137
SLIDE 137

Linear encoding: PCA

Encoding/decoding was one motivation for unsupervised learning. PCA gives the optimal linear encoding:

PCA Theorem (part of it). Given X = USVᵀ and k ≤ r,

min_{E ∈ R^{d×k}, D ∈ R^{k×d}} ‖X − XED‖²_F = ‖X − XV_kV_kᵀ‖²_F.

39 / 61

slide-139
SLIDE 139

Encoding with deep networks

Let encoders E and decoders D denote families of deep networks from R^d to R^k and back. Consider

min_{f∈E, g∈D} (1/n) ∑_{i=1}^n ‖x_i − g(f(x_i))‖²_2.

Remarks.
◮ This is called an autoencoder.
◮ R^k is the latent space, and f(x) ∈ R^k is the latent representation of x.
◮ As with PCA, we can relate the weights of f and g; e.g., g uses the transposes of f's matrices. (Not necessary, though.)
◮ Does small k enforce positive error? No; powerful E and D can do f(x_i) = 1/i and g(1/i) = x_i.
◮ Therefore we must regularize in some other way!
◮ Does squared error matter? Not really; it's a choice like everything else in ML.

40 / 61

slide-141
SLIDE 141

Another way to visualize generative networks

Given a sample from a distribution (even g#μ), here's the "kernel density" / "Parzen window" estimate of its density:
1. Start with a random draw (x_i)_{i=1}^n.
2. "Place bumps at every x_i": define

p̂(x) := (1/n) ∑_{i=1}^n k( (x − x_i) / h ),

where k is a kernel function (not the SVM one!) and h is the "bandwidth"; for example:
◮ Gaussian: k(z) ∝ exp(−z²/2);
◮ Epanechnikov: k(z) ∝ max{0, 1 − z²}.

41 / 61
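A one-dimensional sketch of the Parzen-window estimate with the two kernels above; it includes the usual 1/(nh) normalization, which the slide's "place bumps" form leaves implicit.

import numpy as np

def kde(query, sample, h=0.3, kernel="gaussian"):
    """Sketch of a 1-d kernel density estimate."""
    z = (query[:, None] - sample[None, :]) / h
    if kernel == "gaussian":
        k = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    else:  # Epanechnikov
        k = 0.75 * np.maximum(0.0, 1 - z ** 2)
    return k.mean(axis=1) / h

xs = np.random.default_rng(0).normal(size=500)   # a sample (could just as well be g(z_i) draws)
grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid, xs), 3))                # rough density estimate along the grid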

slide-143
SLIDE 143

(Variational) Autoencoders

◮ Autoencoder: x_i —(map f)→ latent z_i = f(x_i) —(map g)→ x̂_i = g(z_i).
  Objective: (1/n) ∑_{i=1}^n ℓ(x_i, x̂_i).

◮ Variational Autoencoder: x_i —(map f)→ latent distribution μ_i = f(x_i) —(pushforward through g)→ x̂_i ∼ g#μ_i.
  Objective: (1/n) ∑_{i=1}^n [ ℓ(x_i, x̂_i) + λ KL(μ, μ_i) ].

42 / 61

slide-145
SLIDE 145

Generative network setup and training.

◮ We are given (x_i)_{i=1}^n ∼ ν.
◮ We want to find g so that (g(z_i))_{i=1}^n ≈ (x_i)_{i=1}^n, where (z_i)_{i=1}^n ∼ μ.

Problem: this isn't as simple as fitting g(z_i) ≈ x_i.

Solutions:
◮ VAE: For each x_i, construct a distribution μ_i, so that x̂_i ∼ g#μ_i and x_i are close, as are μ_i and μ. To generate fresh samples, draw z ∼ μ and output g(z).
◮ GAN: Pick a distance notion between distributions (or between the samples (g(z_i))_{i=1}^n and (x_i)_{i=1}^n) and pick g to minimize that!

43 / 61

slide-148
SLIDE 148

GAN overview

GAN approach: we minimize D(ν, g#μ) directly, where "D" is some notion of distance/divergence:
◮ Jensen-Shannon divergence (original GAN paper).
◮ Wasserstein distance (influential follow-up).

Each distance is computed with an alternating/adversarial scheme:
1. We have some current choice g_t, and use it to produce a sample (x̂_i)_{i=1}^n with x̂_i = g_t(z_i).
2. We train a discriminator/critic f_t to find differences between (x̂_i)_{i=1}^n and (x_i)_{i=1}^n.
3. We then pick a new generator g_{t+1}, trained to fool f_t!

44 / 61

slide-150
SLIDE 150

Original GAN formulation

Let p, p_g denote the densities of the data and the generator, and let p̃ = p/2 + p_g/2.

The original GAN minimizes the Jensen-Shannon divergence:

2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃) = ∫ p(x) ln( p(x)/p̃(x) ) dx + ∫ p_g(x) ln( p_g(x)/p̃(x) ) dx = E_p ln( p(x)/p̃(x) ) + E_{p_g} ln( p_g(x)/p̃(x) ).

But we've been saying we can't write down p_g?

The original GAN approach applies alternating minimization to

inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

45 / 61

slide-153
SLIDE 153

Original GAN formulation and algorithm.

Original GAN objective:

inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

The algorithm alternates these two steps:
1. Hold g fixed and optimize f. Specifically, generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize

   sup_{f∈F, f:X→(0,1)} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(x̂_j)) ].

2. Hold f fixed and optimize g. Specifically, generate (z_j)_{j=1}^m and approximately optimize

   inf_{g∈G} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

46 / 61

slide-154
SLIDE 154

Some implementation issues

The algorithm alternates these two steps:
1. Hold g fixed and optimize f: generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize

   sup_{f∈F, f:X→(0,1)} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(x̂_j)) ].

2. Hold f fixed and optimize g: generate (z_j)_{j=1}^m and approximately optimize

   inf_{g∈G} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

Remarks.
◮ Common practice: do many f ascent steps for each g descent step.
◮ Training has all sorts of instabilities and heuristic fixes; e.g., mode collapse (g outputs a small subset of training elements).
◮ The original intuition was game-theoretic: generator and critic compete.

47 / 61

slide-155
SLIDE 155

Wasserstein GAN (WGAN)

Original GAN objective:

inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

Wasserstein GAN objective:

inf_{g∈G} sup_{f∈F, ‖f‖_Lip ≤ 1} [ (1/n) ∑_{i=1}^n f(x_i) − (1/m) ∑_{j=1}^m f(g(z_j)) ],

where "‖f‖_Lip ≤ 1" means f is 1-Lipschitz (|f(x) − f(y)| ≤ ‖x − y‖).

48 / 61

slide-157
SLIDE 157

WGAN remarks

Wasserstein GAN objective:

inf_{g∈G} sup_{f∈F, ‖f‖_Lip ≤ 1} [ (1/n) ∑_{i=1}^n f(x_i) − (1/m) ∑_{j=1}^m f(g(z_j)) ],

where "‖f‖_Lip ≤ 1" means f is 1-Lipschitz (|f(x) − f(y)| ≤ ‖x − y‖).

Remarks.
◮ In practice, G and F are deep network architectures, and the Lipschitz constraint is only approximately enforced.
◮ This objective is a "Wasserstein distance" or "earth mover distance"; it can be interpreted as how much mass we have to shift to convert one distribution into another (in this case, g#μ and the original).
◮ The above formulation of the Wasserstein distance is the "dual form" given via Kantorovich-Rubinstein duality.
49 / 61

slide-159
SLIDE 159 (through SLIDE 164)

Majority vote

[Figure slides: the Binomial(n, p) distribution for n = 10, 20, 30, 40, 50, 60 independent classifiers; the green region (the mass at or above n/2) is the error of the majority vote, with fraction green = 0.3669, 0.2447, 0.1754, 0.1298, 0.0978, 0.0746 respectively.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}.

MAJ(y_1, . . . , y_n) := +1 when ∑_i y_i ≥ 0, and −1 when ∑_i y_i < 0.

Error rate of the majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = ∑_{i=n/2}^n (n choose i) p^i (1 − p)^{n−i} ≤ exp( −n(1/2 − p)² ).

50 / 61
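A quick check of the exact binomial tail against the exponential bound (SciPy assumed). The per-classifier error p = 0.4 is an assumption, chosen because it reproduces the "fraction green" values quoted in the figure captions.

import numpy as np
from scipy.stats import binom

p = 0.4   # assumed per-classifier error rate; matches the figures' "fraction green" values
for n in (10, 20, 30, 40, 50, 60):
    exact = binom.sf(np.ceil(n / 2) - 1, n, p)      # Pr[Binom(n, p) >= n/2]
    bound = np.exp(-n * (0.5 - p) ** 2)             # the bound from the slide
    print(n, round(float(exact), 4), round(float(bound), 4))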

slide-165
SLIDE 165

Bottom line

[Figure: the n = 60 majority-vote plot again; the green region is the error of the majority vote.]

The error of the majority vote classifier goes down exponentially in n:

Pr[Binom(n, p) ≥ n/2] = ∑_{i=n/2}^n (n choose i) p^i (1 − p)^{n−i} ≤ exp( −n(1/2 − p)² ).

51 / 61

slide-167
SLIDE 167

From independent errors to an algorithm

How to use independent errors in an algorithm?

  • 1. For t = 1, 2, . . . , T:

1.1 Obtain IID data St := ((x(t)

i , y(t) i ))n i=1,

1.2 Train classifier ft on St.

  • 2. Output x → MAJ
  • f1(x), . . . , fT (x)
  • .

52 / 61

slide-168
SLIDE 168

From independent errors to an algorithm

How to use independent errors in an algorithm?

  • 1. For t = 1, 2, . . . , T:

1.1 Obtain IID data St := ((x(t)

i , y(t) i ))n i=1,

1.2 Train classifier ft on St.

  • 2. Output x → MAJ
  • f1(x), . . . , fT (x)
  • .

◮ Good news: errors are independent! (Our exponential error estimate from before is valid.) ◮ Bad news: classifiers trained on 1/T fraction of data (why not just train ResNet on all of it. . . ).

52 / 61

slide-169
SLIDE 169

From independent errors to an algorithm

How to use independent errors in an algorithm?

  • 1. For t = 1, 2, . . . , T:

1.1 Obtain IID data St := ((x(t)

i , y(t) i ))n i=1,

1.2 Train classifier ft on St.

  • 2. Output x → MAJ
  • f1(x), . . . , fT (x)
  • .

53 / 61

slide-170
SLIDE 170

From independent errors to an algorithm

How to use independent errors in an algorithm?

  • 1. For t = 1, 2, . . . , T:

1.1 Obtain IID data St := ((x(t)

i , y(t) i ))n i=1,

1.2 Train classifier ft on St.

  • 2. Output x → MAJ
  • f1(x), . . . , fT (x)
  • .

◮ Good news: errors are independent! (Our exponential error estimate from before is valid.) ◮ Bad news: classifiers trained on 1/T fraction of data (why not just train ResNet on all of it. . . ).

53 / 61

slide-171
SLIDE 171

Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).
1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, . . . , T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining "bootstrap sample" S_t.
   2.2 Train classifier f_t on S_t.
3. Output x ↦ MAJ(f_1(x), . . . , f_T(x)).

◮ Good news: using most of the data for each f_t!
◮ Bad news: errors are no longer independent. . . ?

54 / 61
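A short sketch of bagging; here fit is a placeholder for any training routine that returns an object with a .predict method (e.g., a scikit-learn classifier's fit), and labels are assumed to be ±1.

import numpy as np

def bag_predict(X_train, y_train, X_test, fit, T=25, seed=0):
    """Bagging sketch: majority vote over classifiers fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = np.zeros(len(X_test))
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # bootstrap sample: n draws with replacement
        clf = fit(X_train[idx], y_train[idx])
        votes += clf.predict(X_test)
    return np.where(votes >= 0, 1, -1)            # MAJ(f_1(x), ..., f_T(x))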

slide-173
SLIDE 173

Sampling with replacement?

Question: Take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked?

Answer: (1 − 1/n)^n; for large n,

lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.

Implications for bagging:
◮ Each bootstrap sample contains about 63% of the data set.
◮ The remaining 37% can be used to estimate the error rate of the classifier trained on the bootstrap sample.
◮ If we have three classifiers, some of their error estimates must share examples! Independence is violated!

55 / 61
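A quick check of the limit and of the ≈63% coverage of a bootstrap sample (NumPy assumed; n = 10,000 is arbitrary).

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
print((1 - 1 / n) ** n, 1 / np.e)          # both ≈ 0.3679
idx = rng.integers(0, n, size=n)           # one bootstrap sample
print(len(np.unique(idx)) / n)             # ≈ 0.632 distinct points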

slide-176
SLIDE 176

Boosting overview

◮ We no longer assume classifiers have independent errors.
◮ We no longer output a simple majority: we reweight the classifiers via optimization.
◮ There is a rich theory with many interpretations.

56 / 61

slide-177
SLIDE 177

Simplified boosting scheme

1. Start with data ((x_i, y_i))_{i=1}^n and classifiers (h_1, . . . , h_T).
2. Find weights w ∈ R^T which approximately minimize

   (1/n) ∑_{i=1}^n ℓ( y_i ∑_{j=1}^T w_j h_j(x_i) ) = (1/n) ∑_{i=1}^n ℓ( y_i wᵀz_i ),

   where z_i = (h_1(x_i), . . . , h_T(x_i)) ∈ R^T. (We use the classifiers to give us features.)
3. Predict with x ↦ ∑_{j=1}^T w_j h_j(x).

Remarks.
◮ If ℓ is convex, this is standard linear prediction: convex in w.
◮ In the classical setting: ℓ(r) = exp(−r), optimizer = coordinate descent, T = ∞.
◮ Most commonly, (h_1, . . . , h_T) are decision stumps.
◮ Popular software implementation: xgboost.

57 / 61
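A hedged sketch of the simplified scheme: stump features, exponential loss, and plain gradient descent on w. The quantile thresholds, learning rate, and use of gradient descent instead of the classical coordinate descent are my own choices.

import numpy as np

def boost_stumps(X, y, T=200, lr=0.1):
    """Sketch: build stump features z_i, then minimize (1/n) sum_i exp(-y_i w.z_i) over w."""
    n, d = X.shape
    stumps = [(j, t) for j in range(d) for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9))]
    Z = np.array([np.where(X[:, j] > t, 1.0, -1.0) for j, t in stumps]).T   # n x (#stumps) features
    Z = Z[:, :T]
    w = np.zeros(Z.shape[1])
    for _ in range(500):
        margins = np.clip(y * (Z @ w), -50, 50)                  # clip to keep exp(.) finite
        grad = -(Z * (y * np.exp(-margins))[:, None]).mean(axis=0)  # ∇_w (1/n) Σ exp(-y_i w·z_i)
        w -= lr * grad
    return stumps[:Z.shape[1]], w   # predict with sign(Σ_j w_j h_j(x))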

slide-179
SLIDE 179

Decision stumps?

[Figure: scatter plot of the iris data, sepal length/width ratio vs. petal length/width ratio, annotated first with the single prediction ŷ = 2, then with the split x_1 > 1.7 and leaf predictions ŷ = 1 and ŷ = 3.]

Classifying irises by sepal and petal measurements:
◮ X = R², Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

A decision stump: split once on x_1 > 1.7, predict ŷ = 1 on one side and ŷ = 3 on the other. . . and stop there!

58 / 61

slide-184
SLIDE 184 (through SLIDE 188)

Boosting decision stumps

Minimizing (1/n) ∑_{i=1}^n ℓ( y_i ∑_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure slides (a sequence of snapshots): decision-surface contour plots on [−1, 1]² for three models of comparable size: boosted stumps (O(n) parameters), a 2-layer ReLU network (O(n) parameters), and a 3-layer ReLU network (O(n) parameters).]

59 / 61

slide-189
SLIDE 189

Selected homework concepts

60 / 61

slide-190
SLIDE 190

◮ Please go through all mathy homework problems (except for the extra credit in hw6).

61 / 61