Statistical Learning Theory: A Hitchhiker's Guide (John Shawe-Taylor, PowerPoint PPT Presentation)
  1. Statistical Learning Theory: A Hitchhiker's Guide. John Shawe-Taylor (UCL), Omar Rivasplata (UCL / DeepMind). December 2018, Neural Information Processing Systems.

  2. Why SLT

  3. Error distribution picture. [Figure: distributions of test error for a Parzen window classifier and a linear SVM, each marked with its mean and a 95% confidence level.]

  4. SLT is about high confidence
     - For a fixed algorithm, function class and sample size, generating random samples → a distribution of test errors.
     - Focusing on the mean of the error distribution can be misleading: the learner only has one sample.
     - Statistical Learning Theory looks at the tail of the distribution: finding bounds which hold with high probability over random samples of size m.
     - Compare to a statistical test at the 99% confidence level: the chances of the conclusion not being true are less than 1%.
     - PAC (probably approximately correct): P^m[large error] ≤ δ.
     - Use a 'confidence parameter' δ: δ is the probability of being misled by the training set.
     - Hence high confidence: P^m[approximately correct] ≥ 1 − δ.
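
     As an aside (not part of the original slides), the "distribution of test errors" picture can be reproduced with a small Monte Carlo sketch: fix an algorithm and a sample size m, draw many training sets from a synthetic distribution P, and look at the spread and upper tail of the resulting test errors. All data and names below are hypothetical.

```python
# Monte Carlo sketch of the test-error distribution for a fixed algorithm and sample size m.
import numpy as np

rng = np.random.default_rng(0)
m, n_test, n_repeats = 50, 10_000, 1000

def sample(n):
    # Hypothetical 1-D data-generating distribution P: label = sign of x, with 10% label noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x > 0.0, 1, -1)
    flip = rng.random(n) < 0.10
    return x, np.where(flip, -y, y)

def erm_threshold(x, y):
    # ERM over threshold classifiers h_t(x) = sign(x - t), with t among the training points.
    candidates = np.concatenate(([-1.0], np.sort(x)))
    errs = [np.mean(np.where(x > t, 1, -1) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

x_te, y_te = sample(n_test)
test_errors = []
for _ in range(n_repeats):
    x_tr, y_tr = sample(m)
    t = erm_threshold(x_tr, y_tr)
    test_errors.append(np.mean(np.where(x_te > t, 1, -1) != y_te))

test_errors = np.array(test_errors)
print(f"mean test error : {test_errors.mean():.3f}")
print(f"95th percentile : {np.percentile(test_errors, 95):.3f}")  # the tail that SLT bounds control
```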

  5. Error distribution picture (revisited). [Figure: the same Parzen window vs linear SVM test-error distributions, with means and 95% confidence levels marked.]

  6. Overview

  7. The Plan
     - Definitions and Notation (John): risk measures, generalization.
     - First generation SLT (Omar): worst-case uniform bounds; Vapnik-Chervonenkis characterization.
     - Second generation SLT (John): hypothesis-dependent complexity; SRM, margin, PAC-Bayes framework.
     - Next generation SLT? (Omar): stability, deep NNs, future directions.

  8. What to expect
     We will: focus on aims, methods and key ideas; outline some proofs; keep it a Hitchhiker's guide!
     We will not: give detailed proofs or a full literature review (apologies!); cover the complete history or other learning paradigms; attempt encyclopaedic coverage of SLT.

  9. Definitions and Notation

  10. Mathematical formalization
     - Learning algorithm A : Z^m → H
     - Z = X × Y, where X = set of inputs and Y = set of labels
     - H = hypothesis class = set of predictors (e.g. classifiers)
     - Training set (aka sample): S_m = ((X_1, Y_1), ..., (X_m, Y_m)), a finite sequence of input-label examples.
     SLT assumptions:
     - There is a data-generating distribution P over Z.
     - The learner doesn't know P, and only sees the training set.
     - The training set examples are i.i.d. from P: S_m ∼ P^m.
     These assumptions can be relaxed, but that is beyond the scope of this tutorial.
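
     To make the notation concrete, here is a minimal typed sketch (all names hypothetical, not from the tutorial) of the objects on this slide: examples in Z = X × Y, a training set in Z^m, and a learning algorithm A : Z^m → H.

```python
# Types mirroring the formal setup: Z = X x Y, a sample S_m in Z^m, and A : Z^m -> H.
from typing import Callable, List, Tuple

X = float                      # inputs (real numbers, for illustration)
Y = int                        # labels, e.g. {-1, +1}
Example = Tuple[X, Y]          # an element of Z = X x Y
Hypothesis = Callable[[X], Y]  # an element of H: a predictor X -> Y
Algorithm = Callable[[List[Example]], Hypothesis]  # A : Z^m -> H

def majority_label(sample: List[Example]) -> Hypothesis:
    # A deliberately trivial algorithm: always predict the majority training label.
    majority = 1 if sum(y for _, y in sample) >= 0 else -1
    return lambda x: majority

S_m: List[Example] = [(0.2, 1), (-0.5, -1), (0.9, 1)]  # a toy training set
h = majority_label(S_m)
print(h(0.0))  # -> 1
```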

  11. What to achieve from the sample?
     Use the available sample to: (1) learn a predictor, and (2) certify the predictor's performance.
     - Learning a predictor: the algorithm is driven by some learning principle and informed by prior knowledge, resulting in an inductive bias.
     - Certifying performance: what happens beyond the training set; generalization bounds.
     Actually these two goals interact with each other!

  12. Risk (aka error) measures
     A loss function ℓ(h(X), Y) is used to measure the discrepancy between a predicted label h(X) and the true label Y.
     - Empirical risk (in-sample): R_in(h) = (1/m) Σ_{i=1}^m ℓ(h(X_i), Y_i)
     - Theoretical risk (out-of-sample): R_out(h) = E[ℓ(h(X), Y)]
     Examples:
     - ℓ(h(X), Y) = 1[h(X) ≠ Y]: 0-1 loss (classification)
     - ℓ(h(X), Y) = (Y − h(X))²: square loss (regression)
     - ℓ(h(X), Y) = (1 − Y h(X))_+: hinge loss
     - ℓ(h(X), Y) = −log(h(X)): log loss (density estimation)
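
     A small numerical sketch (synthetic numbers, not from the slides) of the empirical risk under each of the listed losses:

```python
# Empirical risk R_in(h) = (1/m) sum_i l(h(X_i), Y_i) under several loss functions.
import numpy as np

y = np.array([1, -1, 1, 1, -1], dtype=float)        # true labels in {-1, +1}
scores = np.array([0.8, -0.3, -0.2, 1.5, 0.1])      # real-valued predictions h(X_i)
preds = np.sign(scores)                              # thresholded predictions for the 0-1 loss

zero_one = np.mean(preds != y)                       # 0-1 loss (classification)
square = np.mean((y - scores) ** 2)                  # square loss (regression)
hinge = np.mean(np.maximum(0.0, 1.0 - y * scores))   # hinge loss (1 - Y h(X))_+

probs = np.array([0.9, 0.7, 0.4, 0.95, 0.6])         # h(X_i) as predicted probability of the observed outcome
log_loss = np.mean(-np.log(probs))                   # log loss (density estimation)

print(zero_one, square, hinge, log_loss)
```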

  13. Generalization
     If classifier h does well on the in-sample (X, Y) pairs... will it still do well on out-of-sample pairs?
     - Generalization gap: Δ(h) = R_out(h) − R_in(h)
     - Upper bounds: w.h.p. Δ(h) ≤ ε(m, δ), i.e. R_out(h) ≤ R_in(h) + ε(m, δ)
     - Lower bounds: w.h.p. Δ(h) ≥ ε̃(m, δ)
     Flavours: distribution-free vs distribution-dependent; algorithm-free vs algorithm-dependent.
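
     A tiny sketch (synthetic data, hypothetical setup) of the generalization gap: compute R_in on the training set and approximate R_out on a large fresh sample from the same distribution.

```python
# Estimate Delta(h) = R_out(h) - R_in(h) by Monte Carlo, for a simple data-dependent classifier.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.normal(size=n)
    y = np.where(x + 0.3 * rng.normal(size=n) > 0, 1, -1)  # hypothetical noisy labels
    return x, y

x_tr, y_tr = sample(30)
threshold = np.median(x_tr)                  # data-dependent classifier h(x) = sign(x - threshold)
h = lambda x: np.where(x > threshold, 1, -1)

R_in = np.mean(h(x_tr) != y_tr)
x_out, y_out = sample(100_000)
R_out = np.mean(h(x_out) != y_out)           # Monte Carlo estimate of the true risk
print(f"R_in = {R_in:.3f}, R_out ~ {R_out:.3f}, gap ~ {R_out - R_in:.3f}")
```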

  14. First generation SLT

  15. Building block: one single function
     For one fixed (non data-dependent) h:
     - E[R_in(h)] = E[(1/m) Σ_{i=1}^m ℓ(h(X_i), Y_i)] = R_out(h)
     - P^m[Δ(h) > ε] = P^m[E[R_in(h)] − R_in(h) > ε] is a deviation inequality.
     - The ℓ(h(X_i), Y_i) are independent random variables.
     - If 0 ≤ ℓ(h(X), Y) ≤ 1, Hoeffding's inequality gives P^m[Δ(h) > ε] ≤ exp(−2mε²) = δ.
     - Given δ ∈ (0, 1), equate the RHS to δ and solve for ε: P^m[Δ(h) > √((1/2m) log(1/δ))] ≤ δ.
     - Hence with probability ≥ 1 − δ:  R_out(h) ≤ R_in(h) + √((1/2m) log(1/δ))
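
     The resulting bound is easy to evaluate; a minimal sketch, assuming a [0,1]-bounded loss:

```python
# Hoeffding-based bound for a single fixed h: with probability >= 1 - delta,
# R_out(h) <= R_in(h) + sqrt(log(1/delta) / (2m)).
import math

def hoeffding_epsilon(m: int, delta: float) -> float:
    # Solve exp(-2 m eps^2) = delta for eps.
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

for m in (100, 1_000, 10_000):
    print(m, round(hoeffding_epsilon(m, delta=0.01), 4))
# The slack shrinks like 1/sqrt(m) and grows only logarithmically as delta decreases.
```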

  16. Finite function class
     Algorithm A : Z^m → H, with a finite function class H (|H| < ∞).
     Aim for a uniform bound: P^m[∀f ∈ H, Δ(f) ≤ ε] ≥ 1 − δ.
     Basic tool: the union bound (aka countable sub-additivity): P^m(E_1 or E_2 or ...) ≤ P^m(E_1) + P^m(E_2) + ...
     - P^m[∃f ∈ H, Δ(f) > ε] ≤ Σ_{f∈H} P^m[Δ(f) > ε] ≤ |H| exp(−2mε²) = δ
     - Hence w.p. ≥ 1 − δ:  ∀h ∈ H, R_out(h) ≤ R_in(h) + √((1/2m) log(|H|/δ))
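
     A short sketch of how the finite-class slack ε = √((1/2m) log(|H|/δ)) behaves as |H| grows:

```python
# Uniform bound over a finite class: with probability >= 1 - delta, every h in H
# satisfies R_out(h) <= R_in(h) + sqrt(log(|H|/delta) / (2m)).
import math

def finite_class_epsilon(m: int, class_size: int, delta: float) -> float:
    # Union bound over |H| functions: set |H| exp(-2 m eps^2) = delta and solve for eps.
    return math.sqrt(math.log(class_size / delta) / (2.0 * m))

m, delta = 1_000, 0.01
for class_size in (10, 10**3, 10**6):
    print(class_size, round(finite_class_epsilon(m, class_size, delta), 4))
# Even very large classes only cost a factor that enters logarithmically.
```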

  17. Uncountably infinite function class?
     Algorithm A : Z^m → H, with |H| ≥ |N|.
     - Double sample trick: introduce a second 'ghost sample'; replace the true error by the empirical error on the ghost sample; this reduces the analysis to a finite number of behaviours; apply the union bound, with bad events grouped together.
     - Symmetrization: bound the probability of good performance on one sample but bad performance on the other, by swapping examples between the actual and ghost samples.
     - Growth function of class H: G_H(m) = the largest number of dichotomies (±1 labellings) generated by H on any m points.
     - VC dimension of class H: VC(H) = the largest m such that G_H(m) = 2^m.
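
     A brute-force sketch of the growth function for a toy class (1-D threshold classifiers, a hypothetical example not on the slides), showing G_H(m) = m + 1 and hence VC(H) = 1:

```python
# Growth function by enumeration for the class of 1-D thresholds h_t(x) = 1[x > t].
# G_H(1) = 2 = 2^1, but G_H(2) = 3 < 2^2, so VC(H) = 1.
import numpy as np

def growth_function_thresholds(points: np.ndarray) -> int:
    # Count the distinct label patterns (dichotomies) the class produces on these points.
    xs = np.sort(points)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    patterns = {tuple((points > t).astype(int)) for t in thresholds}
    return len(patterns)

for m in range(1, 6):
    pts = np.random.default_rng(0).uniform(size=m)
    print(m, growth_function_thresholds(pts), 2 ** m)  # m+1 dichotomies vs 2^m possible
```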

  18. VC upper bound
     Vapnik & Chervonenkis: for any m and any δ ∈ (0, 1), w.p. ≥ 1 − δ:
       ∀h ∈ H, Δ(h) ≤ √((8/m) log(4 G_H(2m) / δ))
     Bounding the growth function → Sauer's Lemma:
       If d = VC(H) is finite, then G_H(m) ≤ Σ_{k=0}^{d} (m choose k), which implies G_H(m) ≤ (em/d)^d (polynomial in m).
     For H with d = VC(H) finite, for any m and any δ ∈ (0, 1), w.p. ≥ 1 − δ:
       ∀h ∈ H, Δ(h) ≤ √((8d/m) log(2em/d) + (8/m) log(4/δ))
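
     A small sketch evaluating these quantities numerically, assuming a finite VC dimension d:

```python
# Sauer's bound on the growth function and the resulting VC generalization bound.
import math

def sauer_bound(m: int, d: int) -> int:
    # G_H(m) <= sum_{k=0}^{d} C(m, k); also <= (e m / d)^d for m >= d.
    return sum(math.comb(m, k) for k in range(d + 1))

def vc_epsilon(m: int, d: int, delta: float) -> float:
    # sqrt((8d/m) log(2em/d) + (8/m) log(4/delta))
    return math.sqrt((8.0 * d / m) * math.log(2.0 * math.e * m / d)
                     + (8.0 / m) * math.log(4.0 / delta))

m, d, delta = 100_000, 10, 0.01
print(f"Sauer bound on G_H(m): {sauer_bound(m, d):.3e}  vs  (em/d)^d = {(math.e * m / d) ** d:.3e}")
print(f"VC bound on the gap  : {vc_epsilon(m, d, delta):.3f}")
# Note how many samples are needed before this worst-case bound becomes non-vacuous.
```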

  19. PAC learnability
     VC upper bound: note that the bound is the same for all functions in the class (uniform over H) and the same for all distributions (uniform over P).
     VC lower bound: the VC dimension characterises learnability in the PAC setting: there exist distributions such that, with large probability over m random examples, the gap between the risk and the best possible risk achievable over the class is at least √(d/m).

  20. Limitations of the VC framework
     - The theory is certainly valid and tight: the lower and upper bounds match!
     - VC bounds motivate Empirical Risk Minimization (ERM), as they apply to a hypothesis space and are not hypothesis-dependent.
     - Practical algorithms often do not search a fixed hypothesis space but regularise to trade complexity against empirical error, e.g. k-NN, SVMs or DNNs.
     - Mismatch between theory and practice.
     - Let's illustrate this with SVMs...

  21. SVM with Gaussian kernel
     κ(x, z) = exp(−‖x − z‖² / (2σ²))
     [Figure: test-error distributions for a Parzen window classifier vs a kernel SVM.]
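
     A minimal sketch of such an SVM using scikit-learn (which the slides do not mention); sklearn's RBF kernel corresponds to the formula above with gamma = 1/(2σ²). The toy data are hypothetical.

```python
# SVM with the Gaussian kernel kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # a nonlinearly separable toy problem

sigma = 0.5
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
# The RBF kernel induces an infinite-dimensional feature space, so VC(H) is infinite,
# yet the learned classifier typically generalizes well -- the tension discussed next.
```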

  22. SVM with Gaussian kernel: a case study
     - The VC dimension is infinite, but the observed performance is often excellent.
     - VC bounds aren't able to explain this; the lower bounds appear to contradict the observations.
     - How to resolve this apparent contradiction? Coming up: large margin; the distribution may not be worst-case.

  23. Hitchhiker's guide
     [Cartoon slide contrasting 'Theory' ('nice and complete') with 'Practical usefulness' ('not so much'), captioned 'right but wrong'.]
