SLIDE 1

Machine Learning Theory (CS 6783)

Tu-Th 1:25 to 2:40 PM, Kimball B-11
Instructor: Karthik Sridharan

SLIDE 2

ABOUT THE COURSE

No exams!
5 assignments that count towards your grade (55%)
One term project (40%)
5% for class participation

SLIDE 3

PRE-REQUISITES

Basic probability theory
Basics of algorithms and analysis
Introductory-level machine learning course
Mathematical maturity, comfortable reading/writing formal mathematical proofs

SLIDE 4

TERM PROJECT

One of the following three options:

1. Pick your research problem, get it approved by me, write a report on your work.

2. Pick two papers on learning theory, get them approved by me, write a report with your own views/opinions.

3. I will provide a list of problems; work out problems worth a total of 10 stars out of this list.

Oct 16th: submit your proposal / get your project approved by me. Finals week: projects are due.

SLIDE 5

Let's get started ...

SLIDE 6

WHAT IS MACHINE LEARNING

Use past observations to automatically learn to make better predictions/decisions in the future.

SLIDE 7

WHERE IS IT USED ?

Recommendation Systems

SLIDE 8

WHERE IS IT USED ?

Pedestrian Detection

SLIDE 9

WHERE IS IT USED ?

Market Predictions

SLIDE 10

WHERE IS IT USED ?

Spam Classification

SLIDE 11

WHERE IS IT USED ?

Online advertising (improving click-through rates)
Climate/weather prediction
Text categorization
Unsupervised clustering (of articles ...)
...

SLIDE 12

WHAT IS LEARNING THEORY

SLIDE 13

WHAT IS LEARNING THEORY

Oops . . .

SLIDE 14

WHAT IS MACHINE LEARNING THEORY

How do we formalize machine learning problems?
The right framework for the right problem (e.g. online vs. statistical)
What does it mean for a problem to be “learnable”?
How many instances do we need to see to learn to a given accuracy?
How do we build sound learning algorithms based on theory?
Computational learning theory: which problems are efficiently learnable?

SLIDE 15

OUTLINE OF TOPICS

Learning problem and frameworks, settings, minimax rates

Statistical learning theory
  Probably Approximately Correct (PAC) and Agnostic PAC frameworks
  Empirical Risk Minimization, uniform convergence, empirical process theory
  Finite model classes, MDL bounds, PAC-Bayes theorem
  Infinite model classes, Rademacher complexity
  Binary classification: growth function, VC dimension
  Real-valued function classes: covering numbers, chaining, fat-shattering dimension
  Supervised learning: necessary and sufficient conditions for learnability

Online learning theory
  Sequential minimax and the value of the online learning game
  Martingale uniform convergence, sequential empirical process theory
  Sequential Rademacher complexity
  Binary classification: Littlestone dimension
  Real-valued function classes: sequential covering numbers, chaining bounds, sequential fat-shattering dimension
  Online supervised learning: necessary & sufficient conditions for learnability
  Designing learning algorithms: relaxations, random play-outs

Computational learning theory, and more if time permits ...

SLIDE 16

LEARNING PROBLEM : BASIC NOTATION

Input space / feature space: X

(E.g. bag-of-words, n-grams, vector of grey-scale values, user-movie pair to rate)

Feature extraction is an art, ... an art we won't cover in this course

Output space / label space: Y

(E.g. {±1}, [K], R-valued output, structured output)

Loss function: ℓ : Y × Y → R

(E.g. 0−1 loss ℓ(y′, y) = 1{y′ ≠ y}, squared loss ℓ(y′, y) = (y − y′)², absolute loss ℓ(y′, y) = |y − y′|)

Measures performance/cost per instance (inaccuracy of prediction / cost of decision).

Model class / hypothesis class: F ⊂ Y^X

(E.g. F = {x ↦ f⊺x : ∥f∥₂ ≤ 1}, F = {x ↦ sign(f⊺x)})
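
To make the notation concrete, here is a small illustrative Python sketch (my own toy example, not from the slides): the three loss functions listed above, and one member of a linear model class on X = R^d with Y = {−1, +1}.

```python
import numpy as np

def zero_one_loss(y_pred, y):   # l(y', y) = 1{y' != y}
    return float(y_pred != y)

def squared_loss(y_pred, y):    # l(y', y) = (y - y')^2
    return (y - y_pred) ** 2

def absolute_loss(y_pred, y):   # l(y', y) = |y - y'|
    return abs(y - y_pred)

def linear_classifier(f):
    """A member of F = {x -> sign(f^T x)} with f projected onto the unit ball."""
    f = f / max(np.linalg.norm(f), 1.0)
    return lambda x: np.sign(f @ x)

# Example usage (hypothetical data point)
x, y = np.array([0.3, -1.2]), -1.0
h = linear_classifier(np.array([1.0, 2.0]))
print(zero_one_loss(h(x), y))
```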

SLIDE 17

FORMALIZING LEARNING PROBLEMS

How is data generated?
How do we measure performance or success?
Where do we place our prior assumptions or model assumptions?

SLIDE 18

FORMALIZING LEARNING PROBLEMS

How is data generated?
How do we measure performance or success?
Where do we place our prior assumptions or model assumptions?
What do we observe?

SLIDE 19

PROBABLY APPROXIMATELY CORRECT LEARNING

Y = {±1}, ℓ(y′, y) = 1{y′ ≠ y}, F ⊂ Y^X

Learner only observes the training sample S = {(x1, y1), ..., (xn, yn)}

x1, ..., xn ∼ D_X and ∀t ∈ [n], yt = f*(xt), where f* ∈ F

Goal: find ŷ ∈ Y^X to minimize P_{x∼D_X}(ŷ(x) ≠ f*(x)) (either in expectation or with high probability)
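
As an illustration of this data-generation process, the sketch below simulates the realizable PAC setting with a toy finite class of threshold classifiers; the class, distribution, and learner are hypothetical choices of mine, not from the slides. The learner simply returns a hypothesis consistent with the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# F = {x -> sign(x - c) : c in a grid of thresholds}, a small finite model class
F = [(c, (lambda x, c=c: np.where(x >= c, 1, -1))) for c in np.linspace(0, 1, 21)]
c_star, f_star = F[7]                    # unknown target f* in F

X = rng.uniform(0, 1, size=100)          # x_1, ..., x_n ~ D_X (here uniform on [0,1])
Y = f_star(X)                            # y_t = f*(x_t): noiseless labels

# Learner: return any hypothesis consistent with the sample (ERM under the 0-1 loss)
c_hat, y_hat = next((c, f) for c, f in F if np.all(f(X) == Y))

# Monte-Carlo estimate of P_{x ~ D_X}(yhat(x) != f*(x))
X_test = rng.uniform(0, 1, size=100_000)
print("estimated error:", np.mean(y_hat(X_test) != f_star(X_test)))
```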

SLIDE 20

PROBABLY APPROXIMATELY CORRECT LEARNING

Definition: Given δ > 0 and ε > 0, the sample complexity n(ε, δ) is the smallest n such that we can always find a forecaster ŷ s.t. with probability at least 1 − δ,

P_{x∼D_X}(ŷ(x) ≠ f*(x)) ≤ ε

(Efficiently PAC learnable if we can learn efficiently in 1/δ and 1/ε)

E.g.: learning outputs of deterministic systems
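
For intuition, here is a standard back-of-the-envelope sample-complexity bound (not stated on the slide), sketched under the assumption that F is finite and the setting is realizable as above:

```latex
% Any fixed f in F with error > \epsilon agrees with n i.i.d. labelled points with
% probability at most (1-\epsilon)^n <= e^{-\epsilon n}; a union bound over F shows that
% returning any consistent hypothesis succeeds with probability at least 1 - |F| e^{-\epsilon n}.
% Hence, for finite F in the realizable case,
\[
  n(\epsilon,\delta) \;\le\; \frac{1}{\epsilon}\Bigl(\ln|\mathcal{F}| + \ln\frac{1}{\delta}\Bigr).
\]
```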
SLIDE 21

NON-PARAMETRIC REGRESSION

Y ⊂ R, ℓ(y′, y) = (y − y′)², F ⊂ Y^X

Learner only observes the training sample S = {(x1, y1), ..., (xn, yn)}

x1, ..., xn ∼ D_X and ∀t ∈ [n], yt = f*(xt) + εt, where f* ∈ F and εt ∼ N(0, σ)

Goal: find ŷ ∈ R^X to minimize

∥ŷ − f*∥²_{L2(D_X)} = E_{x∼D_X}[(ŷ(x) − f*(x))²] = E_{x∼D_X}[(ŷ(x) − y)²] − inf_{f∈F} E_{x∼D_X}[(f(x) − y)²]

(Either in expectation or with high probability)

E.g.: clinical trials (inference problems), model class known.
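
A minimal simulation of this setup (all specifics, including the one-parameter linear class and the least-squares estimator, are my own illustrative choices, not the course's):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 0.1

X = rng.uniform(0, 1, size=n)                    # x_1, ..., x_n ~ D_X
Y = 0.7 * X + rng.normal(0, sigma, size=n)       # y_t = f*(x_t) + eps_t with f*(x) = 0.7 x

a_hat = (X @ Y) / (X @ X)                        # least squares over F = {x -> a*x}

# Monte-Carlo estimate of ||yhat - f*||^2_{L2(D_X)}
X_test = rng.uniform(0, 1, size=100_000)
excess = np.mean((a_hat * X_test - 0.7 * X_test) ** 2)
print("estimated excess risk:", excess)
```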
SLIDE 23

STATISTICAL LEARNING (AGNOSTIC PAC)

Learner only observes the training sample S = {(x1, y1), ..., (xn, yn)} drawn i.i.d. from a joint distribution D on X × Y

Goal: find ŷ ∈ R^X to minimize the expected loss over future instances:

E_{(x,y)∼D}[ℓ(ŷ(x), y)] − inf_{f∈F} E_{(x,y)∼D}[ℓ(f(x), y)] ≤ ε,  i.e.  L_D(ŷ) − inf_{f∈F} L_D(f) ≤ ε
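
A hypothetical sketch of this agnostic setting with Empirical Risk Minimization over a toy finite threshold class (the distribution, label noise, and class are my own choices, not from the slides); the last lines estimate L_D(ŷ) − inf_{f∈F} L_D(f) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m):                                   # joint distribution D on X x Y
    x = rng.uniform(0, 1, size=m)
    clean = np.where(x >= 0.4, 1, -1)            # "best" rule, plus 10% label noise
    y = np.where(rng.random(m) < 0.9, clean, -clean)
    return x, y

F = [(c, (lambda x, c=c: np.where(x >= c, 1, -1))) for c in np.linspace(0, 1, 21)]

X, Y = sample(200)                               # training sample S
emp_risk = [np.mean(f(X) != Y) for _, f in F]    # empirical 0-1 risk of each f in F
c_hat, y_hat = F[int(np.argmin(emp_risk))]       # ERM

Xt, Yt = sample(100_000)                         # Monte-Carlo estimate of L_D
L_hat = np.mean(y_hat(Xt) != Yt)
L_best = min(np.mean(f(Xt) != Yt) for _, f in F) # approx. inf_{f in F} L_D(f)
print("excess risk ~", L_hat - L_best)
```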

SLIDE 24

STATISTICAL LEARNING (AGNOSTIC PAC)

Definition: Given δ > 0, ε > 0, the sample complexity n(ε, δ) is the smallest n such that we can always find a forecaster ŷ s.t. with probability at least 1 − δ,

L_D(ŷ) − inf_{f∈F} L_D(f) ≤ ε
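
For a rough sense of scale (a sketch under the assumptions, not stated on the slide, that F is finite and the loss takes values in [0, 1]):

```latex
% Hoeffding's inequality plus a union bound over F gives, with probability at least 1-\delta,
%   \sup_{f \in F} |L_D(f) - \hat{L}_S(f)| \le \sqrt{\ln(2|F|/\delta) / (2n)},
% so ERM has excess risk at most twice this quantity, which is at most \epsilon once
\[
  n \;\ge\; \frac{2\ln(2|\mathcal{F}|/\delta)}{\epsilon^{2}} .
\]
```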

SLIDE 25

LEARNING PROBLEMS

Pedestrian Detection
Spam Classification

SLIDE 26

LEARNING PROBLEMS

Pedestrian Detection (Batch/Statistical setting)
Spam Classification (Online/adversarial setting)

SLIDE 27

ONLINE LEARNING (SEQUENTIAL PREDICTION)

For t = 1 to n:
  Learner receives xt ∈ X
  Learner predicts output ŷt ∈ Y
  True output yt ∈ Y is revealed
End for

Goal: minimize regret

Reg_n(F) := (1/n) ∑_{t=1}^{n} ℓ(ŷt, yt) − inf_{f∈F} (1/n) ∑_{t=1}^{n} ℓ(f(xt), yt)
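
An illustrative sketch of this protocol with F a finite set of threshold "experts" and the exponential-weights forecaster; the slides do not specify an algorithm or data source, so everything below (the class, the learning rate, the noisy data stream) is a hypothetical choice of mine.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 1000, 21
C = np.linspace(0, 1, K)                             # expert k predicts sign(x - c_k)
eta = np.sqrt(8 * np.log(K) / n)                     # standard exponential-weights rate
w = np.ones(K)
cum_loss = np.zeros(K)
learner_loss = 0.0

for t in range(n):
    x = rng.uniform(0, 1)                            # learner receives x_t
    preds = np.where(x >= C, 1, -1)                  # each expert's prediction
    y_hat = preds[rng.choice(K, p=w / w.sum())]      # learner predicts yhat_t (randomized)
    y = 1 if (x >= 0.4) != (rng.random() < 0.1) else -1  # true y_t revealed (noisy threshold)
    losses = (preds != y).astype(float)              # 0-1 loss of each expert
    learner_loss += float(y_hat != y)
    cum_loss += losses
    w *= np.exp(-eta * losses)                       # exponential-weights update

print("average regret:", learner_loss / n - cum_loss.min() / n)   # Reg_n(F)
```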

SLIDE 28

OTHER PROBLEMS/FRAMEWORKS

Unsupervised learning, clustering
Semi-supervised learning
Active learning and selective sampling
Online convex optimization
Bandit problems, partial monitoring, ...

SLIDE 29

SNEAK PEEK

No Free Lunch theorems

Statistical learning theory:
  Empirical risk minimization
  Uniform convergence and learning
  Finite model classes, MDL, PAC-Bayes theorem, ...

SLIDE 30

HOMEWORK 0 : WARMUP

Brush up on Markov's inequality, Chebyshev's inequality, and the central limit theorem

Read up on, or brush up on, concentration inequalities

(specifically the Hoeffding bound, Bernstein bound, Hoeffding-Azuma inequality, and McDiarmid's inequality, also referred to as the bounded-difference inequality)

Brush up on the union bound

Watch out for Homework 0: no need to submit it, it's just a warmup
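
A quick warm-up-style sanity check (my own illustration, not part of Homework 0): compare the empirical tail of a Bernoulli sample mean with Hoeffding's bound 2·exp(−2nε²) for [0, 1]-bounded variables.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, eps, trials = 100, 0.5, 0.1, 100_000

means = rng.binomial(n, p, size=trials) / n              # sample means of n Bernoulli(p)'s
empirical = np.mean(np.abs(means - p) >= eps)            # empirical P(|mean - p| >= eps)
hoeffding = 2 * np.exp(-2 * n * eps ** 2)                # Hoeffding's two-sided bound
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")
```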