Risk bounds for some classification and regression models that interpolate - PowerPoint PPT Presentation


SLIDE 1

Risk bounds for some classification and regression models that interpolate

Daniel Hsu, Columbia University. Joint work with: Misha Belkin (The Ohio State University), Partha Mitra (Cold Spring Harbor Laboratory)

SLIDE 2


(Breiman, 1995)

SLIDE 3

When is "interpolation" justified in ML?

  • Supervised learning: use training examples to find a function that predicts accurately on new examples
  • Interpolation: find a function that perfectly fits the training examples
  • Some call this "overfitting"
  • PAC learning (Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; …):
  • realizable, noise-free setting
  • bounded-capacity hypothesis class
  • Regression models:
  • Can interpolate if no noise!
  • E.g., linear models with d ≥ n (as in the sketch below)
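A minimal sketch of that last bullet (my illustration, not from the talk): with at least as many parameters as examples, a linear model can fit the training labels exactly, and np.linalg.pinv picks out the minimum-norm interpolating weights.

```python
# Sketch: a linear model with d >= n parameters interpolates any labels.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # fewer examples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # arbitrary (even pure-noise) labels

w = np.linalg.pinv(X) @ y          # minimum-norm w solving X @ w = y
print(np.max(np.abs(X @ w - y)))   # ~1e-14: perfect fit on the training set
```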

SLIDE 4

Overfitting

SLIDE 5

Some observations from the field

  • Can fit any training data, given enough time and a large enough network.
  • Can generalize even when training data has a substantial amount of label noise.

(Zhang, Bengio, Hardt, Recht, & Vinyals, 2017)

SLIDE 6

More observations from the field

  • Can fit any training data, given enough time and a rich enough feature space.
  • Can generalize even when training data has a substantial amount of label noise.

(Belkin, Ma, & Mandal, 2018)

[Figure: MNIST experiments]

SLIDE 7

Summary of some empirical observations

  • Training produces a function f̂ that perfectly fits noisy training data.
  • f̂ is likely a very complex function!
  • Yet, the test error of f̂ is non-trivial: e.g., noise rate + 5%.


Can theory explain these observations?

SLIDE 8

"Classical" learning theory

Generalization: true error rate ≤ training error rate + deviation bound

  • Deviation bound: depends on "complexity" of the learned function
  • Capacity control, regularization, smoothing, algorithmic stability, margins, …
  • None known to be non-trivial for functions interpolating noisy data.
  • E.g., the function is chosen from a class rich enough to express all possible ways to label Ω(n) training examples.
  • Bound must exploit specific properties of the chosen function.

SLIDE 9

Even more observations from the field

  • Some "local interpolation" methods are robust to label noise.
  • Can limit the influence of noisy points in other parts of data space.

(Wyner, Olson, Bleich, & Mease, 2017)

SLIDE 10

What is known in theory?

Nearest neighbor (Cover & Hart, 1967)

  • Predict with the label of the nearest training example
  • Interpolates training data
  • Not always consistent, but almost

Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998)

  • Bandwidth-free(!) Nadaraya-Watson smoothing kernel regression
  • Interpolates training data
  • Consistent(!!), but no rates

    w(x, x_i) = 1 / ‖x − x_i‖^d

SLIDE 11

Our goals

  • Counter the "conventional wisdom" re: interpolation: show interpolation methods can be consistent (or almost consistent) for classification & regression problems
  • Identify some useful properties of certain local prediction methods
  • Suggest connections to practical methods

SLIDE 12

Our new results

Analyses of two new interpolation schemes

  • 1. Simplicial interpolation
  • Natural linear interpolation based on multivariate triangulation
  • Asymptotic advantages compared to nearest neighbor rule
  • 2. Weighted k-NN interpolation
  • Consistency + non-asymptotic convergence rates

SLIDE 13
  • 1. Simplicial interpolation

SLIDE 14

Interpolation via multivariate triangulation

  • IID training examples (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × [0, 1]
  • Partition C := conv(x_1, …, x_n) into simplices with the x_i as vertices via Delaunay triangulation.
  • Define η̂(x) on each simplex by affine interpolation of the vertices' labels.
  • Result is piecewise linear on C. (Punt on what happens outside of C.)
  • For classification (y ∈ {0, 1}), let f̂ be the plug-in classifier based on η̂. (Sketched below.)
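A minimal sketch of this construction (my code, using scipy rather than anything from the paper): LinearNDInterpolator builds a Delaunay triangulation of the training points and interpolates affinely within each simplex, which matches the piecewise-linear η̂ described above.

```python
# Sketch: simplicial interpolation via scipy's Delaunay-based interpolator.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                  # training points in R^2
y = rng.integers(0, 2, size=200).astype(float)  # labels in {0, 1}

eta_hat = LinearNDInterpolator(X, y)  # piecewise linear on conv(X); NaN outside
f_hat = lambda x: (eta_hat(x) >= 0.5).astype(int)  # plug-in classifier

x0 = np.array([[0.4, 0.7]])
print(eta_hat(x0), f_hat(x0))         # eta_hat interpolates the y_i exactly
```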

SLIDE 15

!" !# !$

What happens on a single simplex

  • Simplex on x_1, …, x_{d+1} with corresponding labels y_1, …, y_{d+1}
  • Test point x in the simplex, with barycentric coordinates (w_1, …, w_{d+1}).
  • Linear interpolation at x (i.e., least squares fit, evaluated at x):

    η̂(x) = Σ_{i=1}^{d+1} w_i y_i

Key idea: aggregates information from all vertices to make prediction. (C.f. nearest neighbor rule.)
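A small sketch of this computation (my annotation; the helper name is mine): the barycentric weights solve an affine system, and the prediction is the weighted average of the vertex labels.

```python
# Sketch: barycentric coordinates on one simplex, and the interpolated value.
import numpy as np

def barycentric(vertices, x):
    """vertices: (d+1, d) simplex vertices; x: (d,) query point inside it."""
    d = vertices.shape[1]
    A = np.vstack([vertices.T, np.ones(d + 1)])  # sum_i w_i x_i = x, sum_i w_i = 1
    b = np.append(x, 1.0)
    return np.linalg.solve(A, b)

V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # triangle in R^2
y = np.array([0.0, 0.0, 1.0])    # vertex labels; the 1 plays the "noisy" vertex
w = barycentric(V, np.array([0.2, 0.2]))
print(w, w @ y)  # eta_hat(x) = sum_i w_i y_i = 0.2, so the plug-in classifier says 0
```

Note how a single noisy vertex label moves the prediction only by its barycentric weight, which is the point of the comparison on the next slide.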

SLIDE 16

Comparison to nearest neighbor rule

  • Suppose η(x) = Pr(y = 1 ∣ x) < 1/2 for all points in a simplex
  • Bayes optimal prediction is 0 for all points in the simplex.
  • Suppose y_1 = ⋯ = y_d = 0, but y_{d+1} = 1 (due to "label noise")

[Figure: triangle with vertices x1, x2, x3, shown twice. Left panel: nearest neighbor rule. Right panel: simplicial interpolation; the region where f̂(x) = 1 is marked.]

Effect even more pronounced in high dimensions!

SLIDE 17

Asymptotic risk

Theorem: Assume the distribution of X is uniform on some convex set, and η is Hölder smooth. Then the simplicial interpolation estimate satisfies

    limsup_{n→∞} E[(η̂(X) − η(X))²] ≤ 2σ̄² / (d + 2)

(where σ̄² bounds the conditional variance of y given x), and the plug-in classifier satisfies

    limsup_{n→∞} Pr(f̂(X) ≠ f_Bayes(X)) ≤ O(1/√d)

  • Near-consistency in high dimension: Bayes optimal + O(1/√d)
  • C.f. nearest neighbor classifier: ≤ twice Bayes optimal
  • "Blessing" of dimensionality (with caveat about convergence rate).
SLIDE 18
  • 2. Weighted k-NN interpolation

SLIDE 19

Weighted k-NN scheme

  • For a given test point x, let x_{(1)}, …, x_{(k)} be the k nearest neighbors in the training data, and let y_{(1)}, …, y_{(k)} be the corresponding labels.

[Figure: test point x with its neighbors x_{(1)}, x_{(2)}, …, x_{(k)}]

Define

    η̂(x) = Σ_{i=1}^{k} w(x, x_{(i)}) y_{(i)} / Σ_{i=1}^{k} w(x, x_{(i)}),   where w(x, x_{(i)}) = ‖x − x_{(i)}‖^{−δ}, δ > 0

Interpolation: η̂(x) → y_i as x → x_i (see the sketch below)
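A minimal sketch of this estimator (my code, not the authors'): inverse-distance weights on the k nearest neighbors; the weight diverges as x approaches a training point, which is what forces interpolation.

```python
# Sketch: weighted (interpolated) k-NN regression estimate.
import numpy as np
from scipy.spatial import cKDTree

def weighted_knn(X, y, x, k=10, delta=1.0, eps=1e-12):
    dist, idx = cKDTree(X).query(x, k=k)      # k nearest neighbors of x
    w = 1.0 / np.maximum(dist, eps) ** delta  # w(x, x_(i)) = ||x - x_(i)||^(-delta)
    return np.dot(w, y[idx]) / w.sum()        # eta_hat(x)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = (X[:, 0] > 0.5).astype(float)
print(weighted_knn(X, y, np.array([0.49, 0.3]), k=15, delta=0.9))  # delta < d/2
```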

SLIDE 20

Comparison to Hilbert kernel estimate

Weighted k-NN:

    η̂(x) = Σ_{i=1}^{k} w(x, x_{(i)}) y_{(i)} / Σ_{i=1}^{k} w(x, x_{(i)}),   w(x, x_{(i)}) = ‖x − x_{(i)}‖^{−δ}

    Our analysis needs 0 < δ < d/2.

Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):

    η̂(x) = Σ_{i=1}^{n} w(x, x_i) y_i / Σ_{i=1}^{n} w(x, x_i),   w(x, x_i) = ‖x − x_i‖^{−d}

    MUST have δ = d for consistency.

Localization makes it possible to prove non-asymptotic rate.
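For contrast, a sketch (my code) of the Hilbert kernel estimate: the same inverse-distance weighting, but over all n training points with δ = d and no bandwidth or neighbor count to tune.

```python
# Sketch: Hilbert kernel (global, bandwidth-free) regression estimate.
import numpy as np

def hilbert_kernel_estimate(X, y, x, eps=1e-12):
    d = X.shape[1]
    w = 1.0 / np.maximum(np.linalg.norm(X - x, axis=1), eps) ** d  # delta = d
    return np.dot(w, y) / w.sum()

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = (X[:, 0] > 0.5).astype(float)
print(hilbert_kernel_estimate(X, y, np.array([0.49, 0.3])))
```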

SLIDE 21

Convergence rates

Theorem: Assume the distribution of X is uniform on some compact set satisfying a regularity condition, and η is α-Hölder smooth. For an appropriate setting of k, the weighted k-NN estimate satisfies

    E[(η̂(X) − η(X))²] ≤ O(n^{−2α/(2α+d)})

If the Tsybakov noise condition with parameter β > 0 also holds, then the plug-in classifier, with an appropriate setting of k, satisfies

    Pr(f̂(X) ≠ f_Bayes(X)) ≤ O(n^{−αβ/(α(2+β)+d)})

SLIDE 22

Closing thoughts

SLIDE 23

Connections to models used in practice

  • Kernel ridge regression:
  • Simplicial interpolation is like the Laplace kernel in ℝ^d
  • Random forests:
  • Large ensembles with random thresholds may approximate locally-linear interpolation (Cutler & Zhao, 2001)
  • Neural nets:
  • Many recent empirical studies find similarities between neural nets and k-NN in terms of performance and noise-robustness (Drory, Avidan, & Giryes, 2018; Cohen, Sapiro, & Giryes, 2018)

SLIDE 24

"Adversarial examples"

  • Interpolation works because the mass of the region immediately around noisily-labeled training examples is small in high dimensions.
  • But such regions are also a great source of adversarial examples: easy to find using local optimization around training examples.

SLIDE 25

Open problems

  • Generalization theory to explain the behavior of interpolation methods
  • Kernel methods:

    min_{f∈ℋ} ‖f‖_ℋ subject to f(x_i) = y_i for all i = 1, …, n

    When does this work (with noisy labels)? (See the sketch after this list.)
  • Very recent work by T. Liang and A. Rakhlin (2018+) provides some analysis in some regimes.
  • Benefits of interpolation?
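A minimal sketch of this minimum-norm kernel interpolation (my code; the kernel choice and helper names are illustrative): with a strictly positive-definite kernel K, the interpolant f(x) = Σ_i a_i K(x, x_i) with a = K⁻¹ y has the minimum RKHS norm among all functions fitting the data.

```python
# Sketch: minimum-RKHS-norm ("ridgeless") kernel interpolation of noisy labels.
import numpy as np

def laplace_kernel(A, B, scale=1.0):
    # K[i, j] = exp(-||A[i] - B[j]|| / scale)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / scale)

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = rng.integers(0, 2, size=50).astype(float)  # noisy 0/1 labels

K = laplace_kernel(X, X)
a = np.linalg.solve(K, y)                      # coefficients of the interpolant

x0 = np.array([[0.3, 0.6]])
print((laplace_kernel(x0, X) @ a)[0])          # prediction at a new point
print(np.max(np.abs(K @ a - y)))               # ~0: training labels fit exactly
```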

SLIDE 26

Acknowledgements

  • National Science Foundation
  • Sloan Foundation
  • Simons Institute for the Theory of Computing


arxiv.org/abs/1806.05161