Generalization Bounds for Distance-Based Learning with High-Dimensional Domains and Codomains

Cyrus Cousins, with Eli Upfal
Brown University BigData Group, Spring 2019
Web: bigdata.cs.brown.edu
Mail: cyrus_cousins@brown.edu

Outline: Distance-Based Learning, k-Nearest Representatives, Uniform Convergence, The Curse of Dimensionality

Supervised Learning in Metric Spaces

Domain X, with metric ∆(x₁, x₂) : X × X → [0, ∞)
  - R^d with Euclidean, L_p, or Mahalanobis distance
  - Graph with shortest-path distance
  - Strings with edit distance

Codomain Y
  - Probabilistic classification: $Y = \mathbb{S}_n = \{\, y : \|y\|_1 = 1,\ 0 \le y \,\}$ (the probability simplex)
  - Regression: $Y = \mathbb{R}^c$

Training set z drawn from X, with labels from Y
  - Assume z drawn i.i.d. from distribution D over Z = X × Y

Underlying assumption: nearby points usually have similar labels.

The k-Nearest Neighbors Classifier

1. Model defined by k and training set z ∼ D
2. Identify the k nearest neighbors to the query
3. Winner-takes-all vote over neighbor labels

Lazy learner
  - No training procedure
  - Nearest-neighbor queries on the training set

Control for overfitting by adjusting k
  - Low k ⇒ high variance
  - High k ⇒ high bias

Want to bound the true error R(D)
  - Have leave-one-out cross-validation error $\hat{R}_{\mathrm{loocv}}(z)$
  - Want $\mathbb{P}\bigl[R(D) \le \hat{R}_{\mathrm{loocv}}(z) + \epsilon\bigr] \ge 1 - \delta$
  - Quantify the degree of overfitting
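As a concrete reference point for this baseline, here is a minimal NumPy sketch of the k-NN classifier described above (brute-force neighbor search, winner-takes-all vote); the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Brute-force k-NN: winner-takes-all vote over the k nearest training labels."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distances to all training points
        nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
        votes = np.bincount(y_train[nearest])         # tally neighbor labels
        preds.append(np.argmax(votes))                # winner takes all (ties -> lowest label)
    return np.array(preds)

# Toy usage: two Gaussian blobs in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_predict(X, y, np.array([[0.0, 0.0], [3.0, 3.0]]), k=5))
```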

Problems with k-Nearest Neighbors

Statistics: probability 1 − δ tail bounds on model loss

Hypothesis stability bound:
$$R(D) \le \hat{R}_{\mathrm{loocv}}(z) + \sqrt{\frac{1 + 24\sqrt{k/2\pi}}{2m\delta}}$$

An exponential stability bound:
  - Assume Euclidean distance in R^d
  - γ_d ≐ maximum kissing number, exponential in d
  - κ ≥ 1.271
$$R(D) \le \hat{R}_{\mathrm{loocv}}(z) + \gamma_d k\sqrt{\frac{512 e \kappa \ln(2/\delta)}{m}} + \frac{2\sqrt{2k}}{\sqrt{\pi m}}$$

Stability bounds scale poorly with δ, d, and k, carry large constants, and are metric-specific
  - Should improve with k, which smooths predictions
  - Both the γ_d and k terms are due to the proof technique (max number of z_j such that z_i is a k-nearest neighbor)

Computational efficiency
  - Approximate k-NN queries computationally difficult
  - High storage cost
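To make the scaling concrete, here is a small sketch that plugs sample numbers into the hypothesis-stability bound as reconstructed above; treat the formula and the example values of m, k, and δ as illustrative assumptions rather than figures from the talk.

```python
import math

def knn_stability_bound(loocv_error, m, k, delta):
    # Hypothesis-stability tail bound (as reconstructed above):
    # R(D) <= R_loocv + sqrt((1 + 24*sqrt(k/(2*pi))) / (2*m*delta))
    slack = math.sqrt((1 + 24 * math.sqrt(k / (2 * math.pi))) / (2 * m * delta))
    return loocv_error + slack

# Illustrative numbers: note the 1/sqrt(delta) dependence, much worse than the
# ln(1/delta) dependence of the Rademacher-based bounds discussed later.
for delta in (0.1, 0.01):
    print(delta, knn_stability_bound(loocv_error=0.10, m=10000, k=10, delta=delta))
```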

The k-Nearest Representatives Classifier

Training:
1. Select a parliament: draw a parliament p of unlabeled i.i.d. points from D
2. Vote: draw a training set z of m labeled i.i.d. points from D; associate each z_i with its k-nearest representatives (from p)
3. Decide the election: label each representative by winner-takes-all vote; resolve ties arbitrarily

Classification:
1. Identify the k-nearest representatives to the query point
2. Average the associated labels to produce a soft classification
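A minimal NumPy sketch of this training and classification procedure, assuming Euclidean distance and hard (winner-takes-all) representative labels; the helper names are mine, not from the talk, and here the parliament is subsampled from the data rather than drawn fresh from D.

```python
import numpy as np

def knr_fit(parliament, X_train, y_train, k, n_classes):
    """Label each representative by winner-takes-all vote over the training
    points that list it among their k nearest representatives."""
    votes = np.zeros((len(parliament), n_classes))
    for x, y in zip(X_train, y_train):
        dists = np.linalg.norm(parliament - x, axis=1)
        for j in np.argsort(dists)[:k]:            # x votes at its k nearest representatives
            votes[j, y] += 1
    labels = np.argmax(votes, axis=1)              # winner-takes-all (ties resolved arbitrarily)
    return np.eye(n_classes)[labels]               # one-hot label matrix L, shape (|p|, c)

def knr_predict(parliament, L, X_query, k):
    """Soft classification: average the labels of the k nearest representatives."""
    out = []
    for q in X_query:
        dists = np.linalg.norm(parliament - q, axis=1)
        out.append(L[np.argsort(dists)[:k]].mean(axis=0))
    return np.array(out)                           # rows lie in the probability simplex

# Toy usage on two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
p = rng.choice(len(X), size=20, replace=False)     # parliament subsampled from the data (a simplification)
L = knr_fit(X[p], X, y, k=3, n_classes=2)
print(knr_predict(X[p], L, np.array([[0.0, 0.0], [3.0, 3.0]]), k=3))
```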

The 1-NR Classifier:

(Figure sequence: 1. draw parliament, 2. draw training set, 3. label parliament, 4. classification)

The Gaussian Checkerboard Dataset

$x_i \sim \mathcal{N}(0, \mathrm{diag}(1, 1))$; $y_i$ determined by $x_i$ and the checkerboard.

Figures:
  - 1-NR, m = 10000, |p| = 1000
  - 2-NR, m = 10000, |p| = 2000
  - 5-NR, m = 10000, |p| = 5000
  - 10-NR, m = 10000, |p| = 10000
  - 10-NN versus 10-NR
  - 10-NN versus 10-NR (saturated)

The k-Nearest Representatives Model

Problem: need many samples at each representative to learn
Solution: average over the labels of the k-nearest representatives
  - Averaging mitigates the impact of outliers
  - Similar to the k parameter in k-NN

k-NR hypothesis class
  - Domain X, distance metric ∆, c classes, codomain $Y = \mathbb{S}_c$
  - Take $P_i(x) = \frac{1}{k}$ if $p_i$ is a $\{1, \dots, k\}$-nearest representative of $x$, else $0$
  - $H \doteq \bigl\{\, h(x) = \sum_{i=1}^{|p|} P_i(x)\, Y_i \;:\; Y \in \mathbb{S}_c^{|p|} \,\bigr\}$
  - $Y_{i,\cdot}$ is the label vector associated with representative $p_i$
  - Weighted k-NR: $P_i(x)$ a function of the relative distances of the k-nearest representatives

Other learning problems
  - Regression: allow Y = R^d; learn $Y \in Y^{|p| \times d}$
  - Analysis easier with Y restricted to the ball B(0, 1)

Training the k-NR

Given label y ∈ {1, . . . , c} and predicted label ŷ, require a loss function ℓ
  - Quadratic or Brier loss (estimates means/probabilities): $\ell_2(\hat{y}, y) \doteq \|\mathbb{1}_y - \hat{y}\|_2^2$
  - Absolute loss (estimates geometric medians; robust): $\ell_1(\hat{y}, y) \doteq \|\mathbb{1}_y - \hat{y}\|_1$
  - Cross-entropy loss: $\ell_H(\hat{y}, y) \doteq -\ln(\hat{y}_y)$

Risk of distribution D and empirical risk of sample z ∼ D^m:
$$\hat{R}_\ell(z) \doteq \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(h(x_i), y_i\bigr), \qquad R_\ell(D) \doteq \mathbb{E}_{(x,y)\sim D}\bigl[\ell\bigl(h(x), y\bigr)\bigr]$$

Learn ĥ ∈ H via empirical risk minimization: $\hat{h} \doteq \mathrm{argmin}_{h \in H}\, \hat{R}_\ell(z)$

Overfitting:
  - How well does R̂ approximate R?
  - Often $\hat{R}_\ell(z) < R_\ell(D)$, due to selection bias in choosing ĥ
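A small sketch of these loss functions and the empirical risk, for soft predictions ŷ in the probability simplex; this is a plain illustration of the definitions above, with names of my own choosing.

```python
import numpy as np

def one_hot(y, c):
    return np.eye(c)[y]

def brier_loss(y_hat, y, c):          # l2: squared distance to the one-hot label
    return np.sum((one_hot(y, c) - y_hat) ** 2)

def absolute_loss(y_hat, y, c):       # l1: robust, estimates geometric medians
    return np.sum(np.abs(one_hot(y, c) - y_hat))

def cross_entropy_loss(y_hat, y):     # -ln of the probability assigned to the true class
    return -np.log(y_hat[y])

def empirical_risk(loss, predictions, labels):
    return np.mean([loss(p, y) for p, y in zip(predictions, labels)])

# Toy usage: two soft predictions over c = 3 classes
preds = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
ys = np.array([0, 2])
print(empirical_risk(lambda p, y: brier_loss(p, y, 3), preds, ys))
print(empirical_risk(cross_entropy_loss, preds, ys))
```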

The Rademacher Average

Function family F ⊆ Z → R; distribution D over Z; sample z ∼ D^m; Rademacher sequence σ, i.i.d. uniform on ±1.

Empirical mean (mean estimate):
$$\frac{1}{m}\sum_{i=1}^{m} f(z_i)$$

Empirical supremum (max estimate):
$$\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} f(z_i)$$

Empirical Rademacher average (sample-σ correlation):
$$\hat{R}_m(F, z) \doteq \mathbb{E}_{\sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

Rademacher average (distribution-σ correlation):
$$R_m(F, D) \doteq \mathbb{E}_{z \sim D^m}\,\mathbb{E}_{\sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(z_i)\right]$$
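Here is a small Monte Carlo sketch of the empirical Rademacher average for a finite function family, represented simply by its value matrix F(z); approximating the inner expectation over σ by sampling is an assumption of this illustration, not a prescription from the talk.

```python
import numpy as np

def empirical_rademacher(values, n_sigma=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average R_hat_m(F, z).

    values: array of shape (|F|, m), where values[j, i] = f_j(z_i),
            i.e. each row is one function evaluated on the sample z.
    """
    n_funcs, m = values.shape
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher sequence
        total += np.max(values @ sigma) / m       # sup over f of (1/m) sum_i sigma_i f(z_i)
    return total / n_sigma

# Toy usage: 8 random [0, 1]-valued functions evaluated on m = 200 sample points
rng = np.random.default_rng(1)
vals = rng.random((8, 200))
print(empirical_rademacher(vals))
```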

Generalization Bounds

Let F ⊆ X → [0, 1], z ∼ D^m, and let δ ∈ (0, 1).

Symmetrization (a consequence of Jensen's inequality):
$$\mathbb{E}_{z}\left[\sup_{f \in F}\left(\mathbb{E}_{z \sim D}[f] - \frac{1}{m}\sum_{i=1}^{m} f(z_i)\right)\right] \le 2\,\mathbb{E}_{z, \sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(z_i)\right] = 2 R_m(F, D)$$

Exponential tail bounds: with probability ≥ 1 − δ over the choice of z,
$$\sup_{f \in F}\left(\mathbb{E}_{z \sim D}[f] - \frac{1}{m}\sum_{i=1}^{m} f(z_i)\right) \le 2 R_m(F, D) + \sqrt{\frac{\ln(1/\delta)}{2m}}$$
$$\sup_{f \in F}\left(\mathbb{E}_{z \sim D}[f] - \frac{1}{m}\sum_{i=1}^{m} f(z_i)\right) \le 2 \hat{R}_m(F, z) + 3\sqrt{\frac{\ln(1/\delta)}{2m}}$$

These bounds hold simultaneously for all f ∈ F
  - Addresses the multiple comparisons problem
  - The sample value is probably approximately equal to the true value
  - Therefore, picking the best f won't overfit
  - Bound on the Rademacher average ⇒ bound on the probability of overfitting
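The data-dependent version of this bound is easy to evaluate once an empirical Rademacher average is in hand; a minimal sketch, using the 2R̂ + 3√(ln(1/δ)/2m) form stated above with illustrative numbers.

```python
import math

def uniform_deviation_bound(rhat_m, m, delta):
    # With probability >= 1 - delta, every f in F satisfies
    #   E_D[f] <= (1/m) sum_i f(z_i) + 2*rhat_m + 3*sqrt(ln(1/delta) / (2m)).
    return 2 * rhat_m + 3 * math.sqrt(math.log(1 / delta) / (2 * m))

# Illustrative numbers: a small empirical Rademacher average yields a small
# uniform deviation, so empirical risk minimization cannot overfit by much.
print(uniform_deviation_bound(rhat_m=0.05, m=10000, delta=0.01))
```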

Massart’s Finite Family Inequality

$$\hat{R}_m(F, z) \le \sup_{f \in F}\sqrt{\frac{1}{m}\sum_{i=1}^{m} f^2(z_i)} \cdot \sqrt{\frac{2\ln|F|}{m}}$$

A bound on the Rademacher complexity in terms of:
  - The maximum L2 average over the sample z
  - The cardinality of the function family |F|
  - Empirical shattering coefficient: use |F(z)|

With centralization, for image(F) ⊆ [0, 1]:
  - Wimpy variance: $\hat{\sigma}^2 \doteq \sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m}\bigl(f(z_i) - \hat{\mathbb{E}}_z[f]\bigr)^2$
  - $\hat{R}_m(F, z) \le \frac{1}{\sqrt{m}} + \sqrt{\frac{\hat{\sigma}^2 \ln|F(z)|}{m}}$

An intuitive result
  - Subexponential F can't correlate with all possible σ
  - Proved by bounding $\exp\bigl(\lambda \hat{R}_m(F, z)\bigr)$
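A quick numerical check of Massart's inequality against a Monte Carlo estimate of the empirical Rademacher average, for a random finite family of [0, 1]-valued functions; the family is synthetic and purely illustrative.

```python
import numpy as np

def massart_bound(values):
    """Massart's finite-family bound: sup_f sqrt(mean f^2) * sqrt(2 ln|F| / m)."""
    n_funcs, m = values.shape
    max_l2 = np.sqrt(np.mean(values ** 2, axis=1)).max()
    return max_l2 * np.sqrt(2 * np.log(n_funcs) / m)

# Synthetic finite family: |F| = 8 functions evaluated on m = 200 points.
rng = np.random.default_rng(2)
vals = rng.random((8, 200))

# Monte Carlo estimate of the empirical Rademacher average, for comparison.
sigmas = rng.choice([-1.0, 1.0], size=(2000, vals.shape[1]))
mc_estimate = np.mean(np.max(sigmas @ vals.T, axis=1)) / vals.shape[1]

print("Massart bound:", massart_bound(vals))
print("MC estimate:  ", mc_estimate)
```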

Covering Numbers

How many towers are required to cover every point?
  - Each point must be covered by at least 1 tower
  - Towers are L2 balls in a subset of R^2

γ-cover of X w.r.t. distance metric ∆:
  - A set $\hat{x} \subseteq X$ such that every $x' \in X$ has $\inf_{x \in \hat{x}} \Delta(x', x) \le \gamma$

Covering number N(X, ∆, γ) ≐ the minimum cardinality of any γ-cover of X w.r.t. ∆
  - How many balls are required to cover the space?
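A greedy construction gives a simple upper bound on the covering number of a finite point set; this sketch is mine and is only one of many ways to build a γ-cover.

```python
import numpy as np

def greedy_cover(points, gamma):
    """Greedily pick centers until every point is within gamma of some center.
    The number of centers upper-bounds N(points, Euclidean, gamma)."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = points[np.argmax(uncovered)]          # first still-uncovered point becomes a center
        centers.append(c)
        dists = np.linalg.norm(points - c, axis=1)
        uncovered &= dists > gamma                # everything within gamma of c is now covered
    return np.array(centers)

rng = np.random.default_rng(3)
X = rng.random((500, 2))                          # points in the unit square
for g in (0.5, 0.25, 0.1):
    print(g, len(greedy_cover(X, g)))             # cover size grows as gamma shrinks
```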

Rademacher Averages in “Approximately Finite” Families

Suppose F̂ is such that F̂(z) is an ℓ2 γ-cover of F(z)
  - Want to bound the complexity of F in terms of F̂, with γ additive error

N₂(F, z, ε): an empirical ℓ2-covering number (w.r.t. z) of F
  - The number of ℓ2 balls of radius ε required to cover F(z)

$$\hat{R}_m(F, z) \le \sqrt{\frac{2\ln(2)}{m}} + \sqrt{\frac{\hat{\sigma}^2 \ln N_2(F, z, \epsilon)}{m}} + \epsilon$$

Massart's inequality is not sensitive to similarity between functions in F
  - N₂(F, z, ε) ≤ |F(z)| captures this information
  - N₂(F, z, ε) is often finite, even for infinite F

Bounding k-NR Rademacher Averages

k-NR hypothesis class
  - P maps x to a stochastic vector that is 0 except at the k-nearest representatives (from p)
  - Label matrix L (of size |p| × c) maps representatives to classes
  - $H_P \doteq \bigl\{\, h(x) \doteq P(x) L \;:\; L \in \mathbb{S}_c^{|p|} \,\bigr\} \subseteq X \to \mathbb{S}_c$

Contraction inequalities
  - Suppose ℓ : R → R is a λ-Lipschitz function and take F = ℓ ◦ H; then $\hat{R}_m(F, z) \le \lambda \hat{R}_m(H, z)$
  - Similar multivariate results (> 2 classes, regression in R^c)
  - ℓ2(·, y) and ℓ1(·, y) are Lipschitz for bounded inputs
  - Bounds for ℓ ◦ H (nonlinear) follow from bounds for H (linear)

Only the convex hull F⋄ = ConvHull(F) matters:
$$\hat{R}_m(F, z) = \mathbb{E}_{\sigma}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(x_i)\right] = \mathbb{E}_{\sigma}\left[\sup_{f \in F^{\diamond}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(x_i)\right] = \hat{R}_m(F^{\diamond}, z)$$
  - Convenient for linear H⋄ and the linear loss ℓ1

Bounding k-NR Rademacher Averages: Counting

Hypothesis class: $H_P \doteq \bigl\{\, h(x) \doteq P(x) L : L \in \mathbb{S}_c^{|p|} \,\bigr\} \subseteq X \to \mathbb{S}_c$
Loss family: $\ell \circ H_P \doteq \bigl\{\, f(x, y) \doteq \ell(P(x) L, y) : L \in \mathbb{S}_c^{|p|} \,\bigr\}$

Goal: bound $\bigl|\mathrm{ConvHull}\bigl(H(x)\bigr)\bigr|$; Massart's lemma + contraction inequalities ⇒ k-NR generalization bounds

Counting bound (data-independent): size of the convex hull of H_P
  - Each label $L_{i,\cdot}$ can be $\mathbb{1}_1, \mathbb{1}_2, \dots, \mathbb{1}_c$ in the convex hull of H_P; |p| total labels
$$\ln\bigl|\mathrm{ConvHull}(H_P)\bigr| = \ln \prod_{i=1}^{|p|} c = |p|\ln(c)$$

By Massart's inequality:
$$\hat{R}_m(\ell_1 \circ H, z) \le \sqrt{\frac{2\ln(2)}{m}} + \sqrt{\frac{\hat{\sigma}^2\, |p| \ln(c)}{m}}$$

Data-Dependent Counting

Running example (figure): a 1-NR Voronoi partition with c = 5 classes, cell i maps to p_i, |p| = 6 × 5 = 30; some cells are empty, others are low-frequency.

Counting bound:
$$\hat{R}_m(\ell_1 \circ H, z) \le \sqrt{\frac{c \ln|p|}{2m}} \approx 0.493$$

Projection size bound (pay for what you use):
  - If no z_i is mapped to P_j, the label $L_{j,\cdot}$ doesn't matter
  - If no z_i mapped to P_j has label a or b, then swapping $L_{j,a}$ and $L_{j,b}$ doesn't matter
$$\ln\bigl|H^{\diamond}_P(x)\bigr| = \sum_{j=1}^{|p|} \ln\Bigl(\min\bigl(c,\ 1 + |\{y_i : P_j(x_i) > 0\}|\bigr)\Bigr) \le \underbrace{|\{j : P_j(x) \ne 0\}|}_{\text{\# of nonempty reps}} \cdot \ln\min\Bigl(c,\ 1 + \underbrace{\max_j |\{y_i : P_j(x_i) > 0\}|}_{\text{\# of classes at } p_j}\Bigr)$$
Data-dependent bound: $\hat{R}_m(\ell_1 \circ H, z) \le 0.402$

Sensitive to label existence, but not frequency
  - Improve with covering number bounds: ignore low-frequency events
  - With certainty (pigeonhole principle): $\exists\, J \subseteq \{1, \dots, |p|\}$ s.t. $\sum_{j \in J}\sum_{(x,y) \in z} P_j(x) \le \frac{m|J|}{|p|}$
  - Covering bound: $\hat{R}_m(\ell_1 \circ H, z) \le 0.392$
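The data-dependent quantities in the projection-size bound are easy to read off a fitted 1-NR model; the following sketch just counts nonempty representatives and the classes observed at each one, under my reconstruction of the formula above, so treat it as illustrative rather than the talk's exact procedure.

```python
import numpy as np

def projection_log_size(assignments, labels, n_reps, c):
    """ln of the projected hypothesis-class size for a hard-assignment 1-NR model.

    assignments[i]: index of the representative that training point i maps to.
    labels[i]:      class of training point i.
    A representative with no points contributes ln(1) = 0; otherwise it
    contributes ln(min(c, 1 + number of classes observed there)).
    """
    total = 0.0
    for j in range(n_reps):
        classes_at_j = len(set(labels[assignments == j]))
        if classes_at_j > 0:
            total += np.log(min(c, 1 + classes_at_j))
    return total

# Toy usage: 6 representatives, 5 classes, 30 points, two reps left empty.
rng = np.random.default_rng(4)
assignments = rng.integers(0, 4, size=30)   # reps 4 and 5 receive no points
labels = rng.integers(0, 5, size=30)
print(projection_log_size(assignments, labels, n_reps=6, c=5))
```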

The Curse of Dimensionality

Suppose X = R^d for large d. Distance-based methods are often ineffective:
  - Distances usually dominated by noise
  - Local properties more meaningful

Assumption:
  - There exists a low-dimensional subspace X′ ≐ MX
  - Distances in X′ are more meaningful than distances in X
  - Rank-constrained Mahalanobis metric: $\Delta_M(x, x') = \sqrt{(x - x')^{\top} M M^{\top} (x - x')}$

We don't know M a priori! Can we learn it?

The appropriate distance metric is task-specific:
  - Select p in the ℓ_p distance
  - Feature selection
  - Normalize features of different scales
  - Rotate and identify predictive linear combinations of features
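A rank-constrained Mahalanobis distance is just the Euclidean distance after projecting by M⊤; a minimal sketch, with a random M standing in for a learned one.

```python
import numpy as np

def mahalanobis(x, x2, M):
    """Rank-constrained Mahalanobis distance: sqrt((x - x2)^T M M^T (x - x2)),
    equivalently the Euclidean norm of M^T (x - x2)."""
    return np.linalg.norm(M.T @ (x - x2))

d, r = 100, 5
rng = np.random.default_rng(5)
M = rng.normal(size=(d, r))                     # rank-r projection; a learned M would go here
x, x2 = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis(x, x2, M))
print(np.linalg.norm(M.T @ x - M.T @ x2))       # same value: Euclidean distance after projecting by M^T
```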

Learning Distance Metrics

Modify k-NR to optimize over distance metrics
  - A set $\mathbf{\Delta}$ of distance metrics
  - Take $P_j(x; \Delta) = \frac{1}{k}$ if $p_j$ is a $\{1, \dots, k\}$-nearest representative of $x$ (w.r.t. ∆), else $0$

Define the hypothesis class
$$H_{P,\mathbf{\Delta}} \doteq \bigl\{\, h(x) \doteq P(x; \Delta)\, Y \;:\; Y \in \mathbb{S}_c^{|p|},\ \Delta \in \mathbf{\Delta} \,\bigr\}$$

How to select $\mathbf{\Delta}$?
  - Randomly select projection matrices M (Johnson-Lindenstrauss lemma); see the sketch below
  - Choose between domain-specific expert guesses
  - Do whatever you feel like doing
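One cheap way to populate the metric set, along the lines of the random-projection option above, is to draw a few Gaussian projection matrices and take the Euclidean distance in each projected space; a sketch, with all sizes chosen arbitrarily.

```python
import numpy as np

def random_projection_metrics(d, r, n_metrics, seed=0):
    """Return candidate distance functions, each induced by a random
    d x r projection matrix (Johnson-Lindenstrauss style)."""
    rng = np.random.default_rng(seed)
    metrics = []
    for _ in range(n_metrics):
        M = rng.normal(size=(d, r)) / np.sqrt(r)            # scaled Gaussian projection
        metrics.append(lambda a, b, M=M: np.linalg.norm(M.T @ (a - b)))
    return metrics

cands = random_projection_metrics(d=100, r=5, n_metrics=3)
rng = np.random.default_rng(6)
a, b = rng.normal(size=100), rng.normal(size=100)
print([round(dist(a, b), 3) for dist in cands])              # each candidate metric gives its own distance
```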

Generalization Bounds with Learned Metrics

Does selecting ∆ cause overfitting?

Counting argument:
$$\hat{R}_m(\ell_1 \circ H_{P,\mathbf{\Delta}}, z) \le \frac{1}{\sqrt{m}} + 2\hat{\sigma}\sqrt{\frac{\ln|\mathbf{\Delta}| + |p|\ln(c)}{2m}}$$

Covering argument:
  - Compute N₁ = the γ₁ empirical covering number of { P(x; ∆) : ∆ ∈ $\mathbf{\Delta}$ }
  - Similar ∆(x, x′) ⇒ similar k-NR
  - Upper-bound N₂ = the γ₂ covering number of k-NR with any fixed ∆
  - γ terms additive, log cover sizes additive
$$\hat{R}_m(\ell_1 \circ H_{P,\mathbf{\Delta}}, z) \le \frac{1}{\sqrt{m}} + 2\hat{\sigma}\sqrt{\frac{\ln(N_1) + \ln(N_2)}{2m}} + \gamma_1 + \gamma_2$$

High-Dimensional Codomains

Unrealizable learning: conflicting noisy observations at each representative

Classification: many possible classes (NLP, image recognition, . . . )

                Unfactored (classes)                      Factored (factors)
        Housecat  Leopard  Cheetah  Dog  Wolf              Feline  Canine
  p1        8        6        3      2    10                 17      12
  p2       18       13       15      9     6                 46      12
  p3       20        5        1     16    19                 26      35

Regression: high-dimensional noisy observations
  - Meaningless at full rank (w.h.p. signal overcome by noise)
  - Signal lies on a low-dimensional subspace

Factorization learns global patterns
  - We have enough data globally to learn the factoring
  - Classification: the factoring is groups of co-occurring classes
  - Regression: the factoring is a low-dimensional subspace of the signal
  - Knowing global patterns makes learning local patterns easier

Factorization with Rank Constraints

Define the rank-constrained hypothesis class
$$H_{P,r} \doteq \bigl\{\, h(x) \doteq P(x; \Delta)\, Y \;:\; \Delta \in \mathbf{\Delta},\ \mathrm{rank}(Y) \le r \,\bigr\}$$

Why rank constraints
  - The factors in Y are globally learned (see the factorization sketch below)
  - Labels for each representative are chosen locally, in the low-dimensional factored space
  - More sophisticated techniques for full-rank prediction

Generalization bounds
  - Take N₁ ≥ the size of any γ₁-cover of the space of rank-r label matrices
  - Take N₂ ≥ the size of any γ₂-cover of the space of r-class k-NR
  - $N_2(H_{P,r}, \gamma_1 + \gamma_2) \le N_1 N_2$: multiply cover sizes, add cover granularities
$$R_m\bigl(\ell_1 \circ H_{P,r}, D\bigr) \le \frac{1}{\sqrt{m}} + 2\hat{\sigma}\sqrt{\frac{\ln(N_1) + \ln(N_2)}{2m}} + \gamma_1 + \gamma_2$$
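For intuition, here is how a rank-r label matrix can be produced from per-representative class counts via a truncated SVD; this is just one way to impose the rank constraint, not necessarily the procedure used in the talk.

```python
import numpy as np

def low_rank_labels(counts, r):
    """Project per-representative class frequencies onto a rank-r factorization.

    counts: (|p|, c) matrix of observed class counts at each representative.
    Returns a rank-r approximation of the row-normalized label matrix
    (entries need not remain an exact probability vector after truncation).
    """
    Y = counts / counts.sum(axis=1, keepdims=True)   # empirical class distribution per representative
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]               # global factors Vt[:r]; local coefficients U[:, :r] * s[:r]

# Toy counts echoing the feline/canine example: 3 representatives, 5 classes.
counts = np.array([[8, 6, 3, 2, 10],
                   [18, 13, 15, 9, 6],
                   [20, 5, 1, 16, 19]], dtype=float)
Y2 = low_rank_labels(counts, r=2)
print(np.round(Y2, 2))
print(np.linalg.matrix_rank(Y2))                     # 2: labels constrained to a 2-dimensional factor space
```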

Allowing Full-Rank Prediction

The factored classifier captures only general patterns
  - Fails to distinguish between "similar" classes, even with sufficient information
  - Some representatives represent more training points than others

Idea: relax the constraints on Y
  - Approximately low rank: $H_{P,r,\alpha}$ constrains Y s.t. $\exists\, Y'$ with $\mathrm{rank}(Y') \le r$ and $\|Y - Y'\|_1 \le \alpha$
  - Rademacher averages linearly interpolate:
$$\underbrace{\hat{R}_m(\ell_1 \circ H_{P,r,\alpha}, z)}_{\text{approximately low rank}} \le (1 - \alpha)\underbrace{\hat{R}_m(\ell_1 \circ H_{P,r}, z)}_{\text{low rank}} + \alpha\underbrace{\hat{R}_m(\ell_1 \circ H_P, z)}_{\text{full rank}}$$
  - Trace-norm constraints: constrain Y s.t. $\|Y\|_{\mathrm{Tr}} \le \alpha$; low trace norm ⇒ approximated by a low-rank Y′

Bias-variance tradeoff
  - Replace regularity constraints with a regularization penalty
  - Can learn a complicated model, but only if the data supports it
  - Analysis: structural risk minimization

A Brief Recapitulation

Construct a distance-based hypothesis class
  - Optimize for classification or regression
  - Analyze with uniform convergence theory

Problems with high-dimensional X or Y
  - Handle high-dimensional X with metric learning
  - Handle high-dimensional Y with factorization

Generalization bounds reflect the difficulty of learning in high-dimensional spaces
  - Dependence on the metric space cover
  - Dependence on the label factorization rank