Foundations of Machine Learning


SLIDE 1: This Lecture

  • Basic definitions and concepts.
  • Introduction to the problem of learning.
  • Probability tools.


SLIDE 2: Definitions

Spaces: input space $X$, output space $Y$.
Loss function: $L \colon Y \times Y \to \mathbb{R}$.

  • $L(\hat{y}, y)$: cost of predicting $\hat{y}$ instead of $y$.
  • binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$.
  • regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.

Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects its hypothesis.

  • depends on features.
  • represents prior knowledge about the task.
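As a concrete illustration, here is a minimal Python sketch of the two loss functions named above; the function names are our own, not from the slides.

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for binary classification: 1 if the prediction is wrong."""
    return 1 if y_pred != y_true else 0

def squared_loss(y_pred, y_true):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0))     # 1 (misclassification)
print(zero_one_loss(1, 1))     # 0 (correct prediction)
print(squared_loss(2.5, 3.0))  # 0.25
```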

SLIDE 3: Supervised Learning Set-Up

Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$:
$$S = ((x_1, y_1), \ldots, (x_m, y_m)).$$
Problem: find hypothesis $h \in H$ with small generalization error.

  • deterministic case: output label deterministic function of input, $y = f(x)$.
  • stochastic case: output probabilistic function of input.
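To make the set-up concrete, here is a small sketch that draws such a sample; the uniform input distribution, the threshold target function, and the flip probability are illustrative assumptions, not part of the slide.

```python
import random

def draw_sample(m, stochastic=False, flip_prob=0.1, seed=0):
    """Draw an i.i.d. sample S = ((x1, y1), ..., (xm, ym)) of size m.

    Deterministic case: y = f(x) exactly.
    Stochastic case: the label is flipped with probability flip_prob,
    so y is only a probabilistic function of x.
    """
    rng = random.Random(seed)
    f = lambda x: 1 if x >= 0.5 else 0  # hypothetical target function
    sample = []
    for _ in range(m):
        x = rng.random()                # x drawn uniformly: our choice of D
        y = f(x)
        if stochastic and rng.random() < flip_prob:
            y = 1 - y                   # label noise
        sample.append((x, y))
    return sample

print(draw_sample(5))                   # deterministic labels
print(draw_sample(5, stochastic=True))  # noisy labels
```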

SLIDE 4: Errors

Generalization error: for $h \in H$, it is defined by
$$R(h) = \mathop{\mathbb{E}}_{(x,y) \sim D}[L(h(x), y)].$$
Empirical error: for $h \in H$ and sample $S$, it is
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$
Bayes error:
$$R^\star = \inf_{h \text{ measurable}} R(h).$$

  • in deterministic case, $R^\star = 0$.
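A minimal sketch of these two quantities under the 0-1 loss, with a Monte Carlo stand-in for the expectation defining $R(h)$; the distribution and the hypothesis used here are hypothetical illustrations.

```python
import random

def draw_pair(rng, flip_prob=0.1):
    """One draw (x, y) ~ D: uniform x, noisy threshold label (illustrative)."""
    x = rng.random()
    y = 1 if x >= 0.5 else 0
    if rng.random() < flip_prob:
        y = 1 - y
    return x, y

def empirical_error(h, sample):
    """hat{R}(h): average 0-1 loss of h over the sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def generalization_error(h, n=100_000, seed=1):
    """Monte Carlo estimate of R(h) = E_{(x,y)~D}[L(h(x), y)]."""
    rng = random.Random(seed)
    return sum(h(x) != y for x, y in (draw_pair(rng) for _ in range(n))) / n

h = lambda x: 1 if x >= 0.4 else 0  # a fixed, slightly misplaced threshold
rng = random.Random(0)
S = [draw_pair(rng) for _ in range(50)]
print(empirical_error(h, S))        # hat{R}(h): noisy, sample-dependent
print(generalization_error(h))      # ~ R(h) = 0.18 for these choices
```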

SLIDE 5: Noise

Noise:

  • in binary classification, for any $x \in X$, $\mathrm{noise}(x) = \min\{\Pr[1|x], \Pr[0|x]\}$.
  • observe that $\mathbb{E}[\mathrm{noise}(x)] = R^\star$.

SLIDE 6: Learning ≠ Fitting


Notion of simplicity/complexity. How do we define complexity?

SLIDE 7: Generalization

Observations:

  • the best hypothesis on the sample may not be the best overall.
  • generalization is not memorization.
  • complex rules (very complex separation surfaces) can be poor predictors.
  • trade-off: complexity of hypothesis set vs. sample size (underfitting/overfitting).

SLIDE 8: Model Selection

General equality: for any $h \in H$,
$$R(h) - R^\star = \underbrace{[R(h) - R(h^\star)]}_{\text{estimation}} + \underbrace{[R(h^\star) - R^\star]}_{\text{approximation}},$$
where $h^\star$ is the best-in-class hypothesis in $H$.

Approximation: not a random variable, only depends on $H$.
Estimation: only term we can hope to bound.

SLIDE 9: Empirical Risk Minimization

Select hypothesis set $H$. Find hypothesis $h \in H$ minimizing empirical error:
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$

  • but $H$ may be too complex.
  • the sample size may not be large enough.
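A minimal ERM sketch over a finite, hypothetical hypothesis set of threshold classifiers; with a finite $H$, the argmin can be computed by direct enumeration.

```python
import random

def erm(sample, hypotheses):
    """Return argmin over h in H of the empirical error hat{R}(h) (0-1 loss)."""
    emp_err = lambda h: sum(h(x) != y for x, y in sample) / len(sample)
    return min(hypotheses, key=emp_err)

# Hypothetical finite H: threshold classifiers x -> 1_{x >= theta}.
thresholds = [i / 10 for i in range(11)]
H = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

rng = random.Random(0)
S = [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(100))]
h_hat = erm(S, H)
print(sum(h_hat(x) != y for x, y in S) / len(S))  # empirical error of ERM pick
```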

SLIDE 10: Generalization Bounds

Definition: upper bound on
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \hat{R}(h)\big| > \epsilon\Big].$$
Bound on estimation error for hypothesis $h_0$ given by ERM:
$$R(h_0) - R(h^\star) = R(h_0) - \hat{R}(h_0) + \hat{R}(h_0) - R(h^\star) \leq R(h_0) - \hat{R}(h_0) + \hat{R}(h^\star) - R(h^\star) \leq 2 \sup_{h \in H} \big|R(h) - \hat{R}(h)\big|,$$
where the first inequality holds since $h_0$ minimizes the empirical error, so $\hat{R}(h_0) \leq \hat{R}(h^\star)$.

How should we choose $H$? (model selection problem)
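For a finite hypothesis set, a standard route combines the union bound with Hoeffding's inequality (both covered later in this lecture) to bound this supremum; a small sketch of the resulting sample-size calculation, as a hedged illustration:

```python
import math

def uniform_deviation(H_size, m, delta):
    """For finite H, the union bound over h in H plus Hoeffding's inequality
    give Pr[sup_h |R(h) - hat{R}(h)| > eps] <= 2 |H| exp(-2 m eps^2).
    Inverting: with probability at least 1 - delta,
        sup_h |R(h) - hat{R}(h)| <= sqrt(log(2 |H| / delta) / (2 m)).
    """
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

print(uniform_deviation(H_size=1000, m=10_000, delta=0.05))  # ~ 0.023
```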

SLIDE 11: Model Selection

$$H = \bigcup_{\gamma \in \Gamma} H_\gamma.$$

[Figure: error as a function of the complexity parameter $\gamma$, showing the estimation error, the approximation error, and their upper bound, with the optimum at $\gamma^\star$.]

SLIDE 12: Structural Risk Minimization (Vapnik, 1995)

Principle: consider an infinite sequence of hypothesis sets ordered for inclusion,
$$H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots$$
and select
$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$

  • strong theoretical guarantees.
  • typically computationally hard.

SLIDE 13: General Algorithm Families

Empirical risk minimization (ERM):
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$
Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$,
$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$
Regularization-based algorithms: $\lambda \geq 0$,
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h) + \lambda \|h\|^2.$$
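As one concrete member of the regularization family, ridge regression pairs the empirical squared loss with the $\lambda\|\cdot\|^2$ penalty over linear hypotheses; a minimal sketch (our own construction, using the well-known closed-form minimizer):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize hat{R}(w) + lam * ||w||^2 with hat{R} the mean squared error.
    Setting the gradient (2/m) X^T (X w - y) + 2 lam w to zero yields
        w = (X^T X + lam * m * I)^{-1} X^T y.
    """
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
print(ridge_fit(X, y, lam=0.01))  # close to w_true; larger lam shrinks w to 0
```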

SLIDE 14: This Lecture

  • Basic definitions and concepts.
  • Introduction to the problem of learning.
  • Probability tools.


SLIDE 15: Basic Properties

Union bound: $\Pr[A \vee B] \leq \Pr[A] + \Pr[B]$.

Inversion: if $\Pr[X \geq \epsilon] \leq f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \leq f^{-1}(\delta)$.

Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$.

Expectation: if $X \geq 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t]\, dt$.
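As a quick worked example of inversion, take the two-sided bound $\Pr[|\hat{R}(h) - R(h)| \geq \epsilon] \leq 2e^{-2m\epsilon^2}$ proved later in this lecture. Setting $\delta = 2e^{-2m\epsilon^2}$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\log(2/\delta)/(2m)}$, so with probability at least $1 - \delta$,
$$\big|\hat{R}(h) - R(h)\big| \leq \sqrt{\frac{\log(2/\delta)}{2m}}.$$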

SLIDE 16: Basic Inequalities

Markov's inequality: if $X \geq 0$ and $\epsilon > 0$, then
$$\Pr[X \geq \epsilon] \leq \frac{\mathbb{E}[X]}{\epsilon}.$$
Chebyshev's inequality: for any $\epsilon > 0$,
$$\Pr[|X - \mathbb{E}[X]| \geq \epsilon] \leq \frac{\sigma_X^2}{\epsilon^2}.$$
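A quick Monte Carlo sanity check of both inequalities; the exponential distribution (mean 1, variance 1) is an arbitrary illustrative choice.

```python
import random

rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(100_000)]  # E[X] = Var[X] = 1
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

eps = 3.0
markov_lhs = sum(x >= eps for x in xs) / len(xs)
print(markov_lhs, "<=", mean / eps)    # Markov: Pr[X >= eps] <= E[X]/eps

cheb_lhs = sum(abs(x - mean) >= eps for x in xs) / len(xs)
print(cheb_lhs, "<=", var / eps ** 2)  # Chebyshev: <= sigma^2 / eps^2
```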

SLIDE 17: Hoeffding's Inequality

Theorem: let $X_1, \ldots, X_m$ be independent random variables with the same expectation $\mu$ and $X_i \in [a, b]$ ($a < b$). Then, for any $\epsilon > 0$, the following inequalities hold:
$$\Pr\Big[\mu - \frac{1}{m}\sum_{i=1}^{m} X_i > \epsilon\Big] \leq \exp\Big(\!-\frac{2m\epsilon^2}{(b - a)^2}\Big)$$
$$\Pr\Big[\frac{1}{m}\sum_{i=1}^{m} X_i - \mu > \epsilon\Big] \leq \exp\Big(\!-\frac{2m\epsilon^2}{(b - a)^2}\Big).$$
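A hedged empirical check with Bernoulli(1/2) variables, so $[a, b] = [0, 1]$ and $\mu = 1/2$; the simulated tail probability should sit below the bound.

```python
import math
import random

m, eps, trials = 100, 0.1, 20_000
rng = random.Random(0)

exceed = sum(
    (sum(rng.random() < 0.5 for _ in range(m)) / m - 0.5) > eps
    for _ in range(trials)
)
print("empirical Pr[mean - mu > eps]:", exceed / trials)  # ~ 0.02
print("Hoeffding bound:", math.exp(-2 * m * eps ** 2))    # e^{-2} ~ 0.135
```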

SLIDE 18: McDiarmid's Inequality (McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

SLIDE 19: Appendix

SLIDE 20: Markov's Inequality

Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then, for all $t > 0$,
$$\Pr[X \geq t\,\mathbb{E}[X]] \leq \frac{1}{t}.$$
Proof:
$$\Pr[X \geq t\,\mathbb{E}[X]] = \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x] \leq \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x]\,\frac{x}{t\,\mathbb{E}[X]} \leq \sum_{x} \Pr[X = x]\,\frac{x}{t\,\mathbb{E}[X]} = \mathbb{E}\Big[\frac{X}{t\,\mathbb{E}[X]}\Big] = \frac{1}{t}.$$

SLIDE 21: Chebyshev's Inequality

Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$. Then, for all $t > 0$,
$$\Pr[|X - \mathbb{E}[X]| \geq t\sigma_X] \leq \frac{1}{t^2}.$$
Proof: observe that
$$\Pr[|X - \mathbb{E}[X]| \geq t\sigma_X] = \Pr[(X - \mathbb{E}[X])^2 \geq t^2\sigma_X^2].$$
The result follows from Markov's inequality.

SLIDE 22: Weak Law of Large Numbers

Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| \geq \epsilon] = 0.$$
Proof: since the variables are independent,
$$\mathrm{Var}[\bar{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Thus, by Chebyshev's inequality,
$$\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}.$$
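A short simulation of this convergence for fair-coin flips ($\mu = 1/2$, $\sigma^2 = 1/4$), an illustrative choice: the deviation probability falls with $n$ and stays below the Chebyshev bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.05, 100_000
for n in [100, 1_000, 10_000]:
    means = rng.binomial(n, 0.5, size=trials) / n    # sample means of n flips
    dev_prob = np.mean(np.abs(means - 0.5) >= eps)   # Pr[|mean - mu| >= eps]
    print(n, dev_prob, "<=", 0.25 / (n * eps ** 2))  # Chebyshev: sigma^2/(n eps^2)
```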

SLIDE 23: Concentration Inequalities

Some general tools for error analysis and bounds:

  • Hoeffding’s inequality (additive).
  • Chernoff bounds (multiplicative).
  • McDiarmid’s inequality (more general).


SLIDE 24: Hoeffding's Lemma

Lemma: let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $X \in [a, b]$, $b \neq a$. Then, for any $t > 0$,
$$\mathbb{E}[e^{tX}] \leq e^{\frac{t^2(b - a)^2}{8}}.$$
Proof: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,
$$e^{tx} \leq \frac{b - x}{b - a}\,e^{ta} + \frac{x - a}{b - a}\,e^{tb}.$$
Thus,
$$\mathbb{E}[e^{tX}] \leq \mathbb{E}\Big[\frac{b - X}{b - a}\,e^{ta} + \frac{X - a}{b - a}\,e^{tb}\Big] = \frac{b}{b - a}\,e^{ta} + \frac{-a}{b - a}\,e^{tb} = e^{\phi(t)},$$
with
$$\phi(t) = \log\Big(\frac{b}{b - a}\,e^{ta} + \frac{-a}{b - a}\,e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a}\,e^{t(b - a)}\Big).$$

SLIDE 25 (Hoeffding's Lemma, continued)

Taking the derivative gives:
$$\phi'(t) = a - \frac{a\,e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a}\,e^{t(b - a)}} = a - \frac{a}{\frac{b}{b - a}\,e^{-t(b - a)} - \frac{a}{b - a}}.$$
Note that:
$$\phi(0) = 0 \quad \text{and} \quad \phi'(0) = 0.$$
Furthermore, with $\alpha = \frac{-a}{b - a}$,
$$\phi''(t) = \frac{-ab\,e^{-t(b - a)}}{\big[\frac{b}{b - a}\,e^{-t(b - a)} - \frac{a}{b - a}\big]^2} = \frac{\alpha(1 - \alpha)\,e^{-t(b - a)}\,(b - a)^2}{\big[(1 - \alpha)\,e^{-t(b - a)} + \alpha\big]^2} = \frac{\alpha}{(1 - \alpha)\,e^{-t(b - a)} + \alpha} \cdot \frac{(1 - \alpha)\,e^{-t(b - a)}}{(1 - \alpha)\,e^{-t(b - a)} + \alpha}\,(b - a)^2 = u(1 - u)(b - a)^2 \leq \frac{(b - a)^2}{4}.$$
There exists $0 \leq \theta \leq t$ such that:
$$\phi(t) = \phi(0) + t\phi'(0) + \frac{t^2}{2}\phi''(\theta) \leq t^2\,\frac{(b - a)^2}{8}.$$

SLIDE 26: Hoeffding's Theorem

Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $S_m = \sum_{i=1}^{m} X_i$ and any $\epsilon > 0$, the following inequalities hold:
$$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}$$
$$\Pr[S_m - \mathbb{E}[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}.$$
Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality,
$$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}},$$
and select $t$ to minimize the resulting bound.

SLIDE 27 (Hoeffding's Theorem, continued)

Using this scheme and the independence of the random variables gives
$$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-t\epsilon}\,\mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big] = e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big]$$
$$\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2(b_i - a_i)^2/8} \qquad \text{(lemma applied to $X_i - \mathbb{E}[X_i]$)}$$
$$= e^{-t\epsilon}\,e^{t^2 \sum_{i=1}^{m}(b_i - a_i)^2/8} \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2},$$
choosing $t = 4\epsilon / \sum_{i=1}^{m}(b_i - a_i)^2$.
The second inequality is proved in a similar way.
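As a check of this choice of $t$: the exponent is $g(t) = -t\epsilon + t^2 C/8$ with $C = \sum_{i=1}^{m}(b_i - a_i)^2$; setting $g'(t) = -\epsilon + tC/4 = 0$ gives $t = 4\epsilon/C$, and then
$$g(4\epsilon/C) = -\frac{4\epsilon^2}{C} + \frac{16\epsilon^2}{C^2}\cdot\frac{C}{8} = -\frac{2\epsilon^2}{C},$$
which is exactly the exponent in the theorem.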

SLIDE 28: Hoeffding's Inequality

Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[\hat{R}(h) - R(h) \geq \epsilon] \leq e^{-2m\epsilon^2}$$
$$\Pr[\hat{R}(h) - R(h) \leq -\epsilon] \leq e^{-2m\epsilon^2}.$$
Proof: follows directly from Hoeffding's theorem. Combining these one-sided inequalities yields
$$\Pr\big[|\hat{R}(h) - R(h)| \geq \epsilon\big] \leq 2e^{-2m\epsilon^2}.$$

SLIDE 29: Chernoff's Inequality

Theorem: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[\hat{R}(h) \geq (1 + \epsilon)R(h)] \leq e^{-m\,R(h)\,\epsilon^2/3}$$
$$\Pr[\hat{R}(h) \leq (1 - \epsilon)R(h)] \leq e^{-m\,R(h)\,\epsilon^2/2}.$$
Proof: based on Chernoff's bounding technique.

SLIDE 30: McDiarmid's Inequality (McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

SLIDE 31

Comments:

  • Proof: uses Hoeffding's lemma.
  • Hoeffding's inequality is a special case of McDiarmid's with
$$f(x_1, \ldots, x_m) = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{|b_i - a_i|}{m}.$$
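Indeed, plugging this $f$ and these $c_i$ into McDiarmid's bound gives
$$2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m}(b_i - a_i)^2/m^2}\Big) = 2\exp\Big(\!-\frac{2m^2\epsilon^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\Big),$$
which, when all $[a_i, b_i] = [a, b]$, is $2\exp(-2m\epsilon^2/(b - a)^2)$: the two-sided version of Hoeffding's inequality for the mean.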

SLIDE 32: Jensen's Inequality

Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$
Proof: uses the definition of convexity, continuity of convex functions, and density of finite distributions.