Risk bounds for some classification and regression models that interpolate



1. Risk bounds for some classification and regression models that interpolate
Daniel Hsu (Columbia University)
Joint work with: Misha Belkin (The Ohio State University) and Partha Mitra (Cold Spring Harbor Laboratory)

2. (Breiman, 1995)

3. When is "interpolation" justified in ML?
• Supervised learning: use training examples to find a function that predicts accurately on new examples.
• Interpolation: find a function that perfectly fits the training examples. Some call this "overfitting".
• PAC learning (Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; …): realizable, noise-free setting; bounded-capacity hypothesis class.
• Regression models: can interpolate if there is no noise! E.g., linear models with d ≥ n.
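To make the last bullet concrete, here is a minimal sketch (my own illustration, not from the slides; the data and variable names are made up) showing that an ordinary linear model can fit noise-free labels exactly once the number of features d is at least the number of training examples n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                               # more features than training examples
X = rng.standard_normal((n, d))             # design matrix
y = X @ rng.standard_normal(d)              # noise-free labels from a linear model

# The system X w = y is underdetermined, so an exact (interpolating) solution
# exists; lstsq returns the minimum-norm one.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(X @ w_hat - y)))        # approximately 0 (up to floating point)
```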

4. Overfitting

5. Some observations from the field (Zhang, Bengio, Hardt, Recht, & Vinyals, 2017)
• Can fit any training data, given enough time and a large enough network.
• Can generalize even when the training data has a substantial amount of label noise.

6. More observations from the field (Belkin, Ma, & Mandal, 2018) [MNIST experiments]
• Can fit any training data, given enough time and a rich enough feature space.
• Can generalize even when the training data has a substantial amount of label noise.

7. Summary of some empirical observations
• Training produces a function f̂ that perfectly fits noisy training data.
• f̂ is likely a very complex function!
• Yet, the test error of f̂ is non-trivial: e.g., noise rate + 5%.
Can theory explain these observations?

  8. "Classical" learning theory Generalization : 0 true error rate ≤ training error rate + deviation bound • Deviation bound : depends on "complexity" of learned function • Capacity control, regularization, smoothing, algorithmic stability, margins, … • None known to be non-trivial for functions interpolating noisy data. • E.g., function is chosen from class rich enough to express all possible ways to label Ω(%) training examples. • Bound must exploit specific properties of chosen function. 8

9. Even more observations from the field (Wyner, Olson, Bleich, & Mease, 2017)
• Some "local interpolation" methods are robust to label noise.
• They can limit the influence of noisy points on other parts of the data space.

10. What is known in theory?
Nearest neighbor (Cover & Hart, 1967):
• Predict with the label of the nearest training example.
• Interpolates the training data.
• Not always consistent, but almost.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
• Bandwidth-free(!) Nadaraya-Watson smoothing kernel regression with weights w(x, x_i) = 1 / ‖x − x_i‖^d.
• Interpolates the training data.
• Consistent(!!), but no rates.
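As a concrete illustration (my own sketch, not code from the talk), the Hilbert kernel estimate is a Nadaraya-Watson average whose weights diverge at the training points, which is exactly what forces interpolation:

```python
import numpy as np

def hilbert_kernel_predict(x, X_train, y_train):
    """Bandwidth-free Nadaraya-Watson estimate with weights ||x - x_i||^(-d)."""
    d = X_train.shape[1]
    dist = np.linalg.norm(X_train - x, axis=1)
    if np.any(dist == 0):                    # exactly at a training point:
        return y_train[dist == 0].mean()     # return its label (interpolation)
    w = dist ** (-d)                         # singular weights, exponent = input dimension
    return np.sum(w * y_train) / np.sum(w)
```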

11. Our goals
• Counter the "conventional wisdom" about interpolation: show that interpolation methods can be consistent (or almost consistent) for classification and regression problems.
• Identify some useful properties of certain local prediction methods.
• Suggest connections to practical methods.

12. Our new results
Analyses of two new interpolation schemes:
1. Simplicial interpolation: natural linear interpolation based on a multivariate triangulation, with asymptotic advantages compared to the nearest neighbor rule.
2. Weighted k-NN interpolation: consistency plus non-asymptotic convergence rates.

13. 1. Simplicial interpolation

14. Interpolation via multivariate triangulation
• IID training examples (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × [0, 1].
• Partition C := conv(x_1, …, x_n) into simplices with the x_i as vertices via a Delaunay triangulation.
• Define η̂(x) on each simplex by affine interpolation of the vertices' labels.
• The result is piecewise linear on C. (Punt on what happens outside of C.)
• For classification (y ∈ {0, 1}), let f̂ be the plug-in classifier based on η̂.
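As a sketch of this construction (my illustration, not the authors' code), scipy's LinearNDInterpolator builds a Delaunay triangulation of the training points and performs exactly this piecewise-affine interpolation; the plug-in classifier then thresholds the interpolated value at 1/2. The data below is made up:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                 # training points in R^2
eta = (X[:, 0] + X[:, 1]) / 2                  # an illustrative regression function
y = rng.binomial(1, eta).astype(float)         # noisy binary labels in {0, 1}

# Delaunay triangulation of X plus affine interpolation of labels on each simplex.
eta_hat = LinearNDInterpolator(X, y)           # piecewise linear on conv(X), NaN outside
x_test = np.array([[0.4, 0.7]])
f_hat = (eta_hat(x_test) > 0.5).astype(int)    # plug-in classifier
```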

15. What happens on a single simplex
• Simplex on x_1, …, x_{d+1} with corresponding labels y_1, …, y_{d+1}.
• Test point x in the simplex, with barycentric coordinates (w_1, …, w_{d+1}).
• Linear interpolation at x (i.e., the least squares fit, evaluated at x): η̂(x) = Σ_{i=1}^{d+1} w_i y_i.
Key idea: the prediction aggregates information from all vertices. (C.f. the nearest neighbor rule.)
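The barycentric weights themselves can be recovered from the triangulation's affine transforms. A short sketch (illustrative only, with made-up data, using scipy's Delaunay utilities rather than anything from the talk):

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
d = 2
X = rng.uniform(size=(50, d))
y = rng.integers(0, 2, size=50).astype(float)

tri = Delaunay(X)
x = np.array([0.5, 0.5])
s = tri.find_simplex(x)                  # simplex containing x (-1 if outside the hull)
T = tri.transform[s]                     # affine map from x to barycentric coordinates
b = T[:d] @ (x - T[d])
w = np.append(b, 1.0 - b.sum())          # barycentric coordinates (w_1, ..., w_{d+1}); they sum to 1
eta_hat_x = w @ y[tri.simplices[s]]      # weighted average of the d+1 vertex labels
```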

16. Comparison to the nearest neighbor rule
• Suppose η(x) = Pr(Y = 1 | x) < 1/2 for all points in a simplex, so the Bayes optimal prediction is 0 everywhere in the simplex.
• Suppose y_1 = ⋯ = y_d = 0, but y_{d+1} = 1 (due to "label noise").
[Figure: two panels over a triangle with vertices x_1, x_2, x_3, showing where f̂(x) = 1 under the nearest neighbor rule vs. simplicial interpolation; the region is much smaller under simplicial interpolation.]
• The effect is even more pronounced in high dimensions!

17. Asymptotic risk
Theorem: Assume the distribution of X is uniform on some convex set and η is Hölder smooth. Then the simplicial interpolation estimate satisfies a bound on limsup_{n→∞} E[(η̂(X) − η(X))²], and the plug-in classifier satisfies a bound on limsup_{n→∞} Pr(f̂(X) ≠ f_Bayes(X)), with both bounds shrinking as the dimension d grows.
• Near-consistency in high dimension: the asymptotic risk is Bayes optimal plus a term that vanishes with d.
• C.f. the nearest neighbor classifier: asymptotic risk can be as large as twice the Bayes risk.
• A "blessing" of dimensionality (with a caveat about the convergence rate).

18. 2. Weighted k-NN interpolation

19. Weighted k-NN scheme
• For a given test point x, let x_(1), …, x_(k) be its k nearest neighbors in the training data, and let y_(1), …, y_(k) be the corresponding labels. Define
η̂(x) = ( Σ_{i=1}^{k} w(x, x_(i)) y_(i) ) / ( Σ_{i=1}^{k} w(x, x_(i)) ),
where w(x, x_(i)) = ‖x − x_(i)‖^{−δ} for some δ > 0.
• Interpolation: η̂(x) → y_(i) as x → x_(i).
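A minimal sketch of this estimator (my own rendering of the formula above; the function name and default values are made up):

```python
import numpy as np

def weighted_knn_predict(x, X_train, y_train, k=10, delta=1.0):
    """Interpolating weighted k-NN estimate with weights ||x - x_(i)||^(-delta)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dist)[:k]                 # indices of the k nearest neighbors
    d_k, y_k = dist[idx], y_train[idx]
    if np.any(d_k == 0):                       # at a training point, return its label
        return y_k[d_k == 0].mean()
    w = d_k ** (-delta)                        # singular weights
    return np.sum(w * y_k) / np.sum(w)
```

Because the weights diverge as x approaches a training point, the prediction is pulled to that point's label, which is the interpolation property noted on the slide.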

20. Comparison to the Hilbert kernel estimate
Weighted k-NN:
• η̂(x) = ( Σ_{i=1}^{k} w(x, x_(i)) y_(i) ) / ( Σ_{i=1}^{k} w(x, x_(i)) ), with w(x, x_i) = ‖x − x_i‖^{−δ}.
• Our analysis needs 0 < δ < d/2.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
• η̂(x) = ( Σ_{i=1}^{n} w(x, x_i) y_i ) / ( Σ_{i=1}^{n} w(x, x_i) ), with w(x, x_i) = ‖x − x_i‖^{−d}.
• The exponent MUST equal d for consistency.
Localization makes it possible to prove a non-asymptotic rate.

21. Convergence rates
Theorem: Assume the distribution of X is uniform on some compact set satisfying a regularity condition, and η is α-Hölder smooth. For an appropriate setting of k, the weighted k-NN estimate satisfies
E[(η̂(X) − η(X))²] ≤ O(n^{−2α/(2α+d)}).
If the Tsybakov noise condition with parameter β > 0 also holds, then the plug-in classifier, with an appropriate setting of k, satisfies
Pr(f̂(X) ≠ f_Bayes(X)) ≤ O(n^{−αβ/(2α+β+d)}).

22. Closing thoughts

23. Connections to models used in practice
• Kernel ridge regression: simplicial interpolation behaves like interpolation with the Laplace kernel.
• Random forests: large ensembles with random thresholds may approximate locally-linear interpolation (Cutler & Zhao, 2001).
• Neural nets: many recent empirical studies find similarities between neural nets and k-NN in terms of performance and noise robustness (Drory, Avidan, & Giryes, 2018; Cohen, Sapiro, & Giryes, 2018).

  24. "Adversarial examples" • Interpolation works because mass of region immediately around noisily-labeled training examples is small in high-dimensions. But also a great source of adversarial examples -- easy to find using local optimization around training examples. 24

25. Open problems
• Generalization theory to explain the behavior of interpolation methods.
• Kernel methods: min_{f∈ℋ} ‖f‖_ℋ subject to f(x_i) = y_i for all i = 1, …, n. When does this work (with noisy labels)? Very recent work by T. Liang and A. Rakhlin (2018+) provides some analysis in some regimes.
• Benefits of interpolation?
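For the kernel-methods problem above, the representer theorem gives the minimum-norm interpolant in closed form: f(x) = Σ_i α_i K(x, x_i) with K α = y. A small "ridgeless" kernel regression sketch (my own illustration; the Laplace kernel, its bandwidth, and the data are arbitrary choices not specified on the slide):

```python
import numpy as np

def laplace_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-gamma * dists)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = rng.binomial(1, 0.5 + 0.4 * (X[:, 0] - 0.5)).astype(float)   # noisy labels

K = laplace_kernel(X, X)
alpha = np.linalg.solve(K, y)                    # interpolating coefficients (no ridge term)
f = lambda x: laplace_kernel(np.atleast_2d(x), X) @ alpha

print(np.max(np.abs(f(X) - y)))                  # approximately 0: the fit interpolates the noisy labels
```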

26. Acknowledgements
• National Science Foundation
• Sloan Foundation
• Simons Institute for the Theory of Computing
arxiv.org/abs/1806.05161
