Machine Learning and Data Mining VC Dimension Kalev Kask Slides - - PowerPoint PPT Presentation



SLIDE 1

Machine Learning and Data Mining VC Dimension

Kalev Kask

Slides based on Andrew Moore’s +

SLIDE 2

Learners and Complexity

  • We’ve seen many versions of underfit/overfit trade-off

– Complexity of the learner
– “Representational Power”

  • Different learners have different power

(c) Alexander Ihler

[Diagram: measured feature values x1, x2, …, xn → classifier (with parameters) → predicted class; two example plots with axes labeled 1–3 showing different decision boundaries]


SLIDE 5

Learners and Complexity

  • We’ve seen many versions of underfit/overfit trade-off

– Complexity of the learner
– “Representational Power”

  • Different learners have different power
  • Usual trade-off:

– More power = can represent more complex systems, but might overfit
– Less power = won’t overfit, but may not find the “best” learner

  • How can we quantify representational power?

– Not easily…
– One solution is VC (Vapnik-Chervonenkis) dimension

SLIDE 6

Some notation

  • Assume training data are iid from some distribution p(x,y)
  • Define “risk” and “empirical risk”

– These are just “long term” test and observed training error
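The definitions themselves were figures on the original slide and did not survive extraction; in standard notation (a reconstruction), risk and empirical risk for classification error are:

```latex
R(\theta) = \mathbb{E}_{p(x,y)}\big[\,\mathbf{1}\big(f(x;\theta) \neq y\big)\,\big]
\qquad
R_{\mathrm{emp}}(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\big(f(x^{(i)};\theta) \neq y^{(i)}\big)
```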

  • How are these related? Depends on overfitting…

– Underfitting domain: pretty similar…
– Overfitting domain: test error might be lots worse!

SLIDE 7

VC Dimension and Risk

  • Given some classifier, let H be its VC dimension

– Represents “representational power” of classifier

  • With “high probability” (1 − δ), Vapnik showed
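The bound itself was a figure and did not survive extraction; the standard form of Vapnik's result (with probability at least 1 − δ, for m training examples and VC dimension H) is:

```latex
R(\theta) \;\le\; R_{\mathrm{emp}}(\theta) \;+\; \sqrt{\frac{H\left(\ln\frac{2m}{H} + 1\right) + \ln\frac{4}{\delta}}{m}}
```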

SLIDE 8

Shattering

  • We say a classifier f(x) can shatter points x(1)…x(h) iff

For all y(1)…y(h), f(x) can achieve zero error on training data (x(1),y(1)), (x(2),y(2)), … (x(h),y(h)) (i.e., there exists some θ that gets zero error)

  • Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter these points?

SLIDE 9

Shattering

  • We say a classifier f(x) can shatter points x(1)…x(h) iff

For all y(1)…y(h), f(x) can achieve zero error on training data (x(1),y(1)), (x(2),y(2)), … (x(h),y(h)) (i.e., there exists some θ that gets zero error)

  • Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter these points?
  • Yes: there are 4 possible training sets…
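This definition can be brute-forced directly: enumerate every labeling and ask whether some θ reproduces it. A minimal sketch (the points and the coarse θ grid below are hypothetical, not the slide's figure):

```python
import itertools

def predict(theta, x):
    # linear classifier f(x; theta) = sign(theta0 + theta1*x1 + theta2*x2)
    t0, t1, t2 = theta
    return 1 if t0 + t1 * x[0] + t2 * x[1] > 0 else -1

def can_shatter(points, thetas):
    # f shatters the points iff EVERY labeling is achieved by SOME theta
    return all(
        any(all(predict(t, x) == y for x, y in zip(points, labels)) for t in thetas)
        for labels in itertools.product([-1, 1], repeat=len(points))
    )

# coarse grid search over (theta0, theta1, theta2) -- a sketch, not exhaustive
grid = list(itertools.product(range(-3, 4), repeat=3))
print(can_shatter([(0.0, 0.0), (1.0, 1.0)], grid))              # 2 points: True
print(can_shatter([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], grid))  # 3 collinear: False
```

Note the grid search can only miss a separating θ, never invent one, so a `True` answer is reliable while `False` is reliable here only because collinear points with alternating labels are provably not linearly separable.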

SLIDE 10

Shattering

  • We say a classifier f(x) can shatter points x(1)…x(h) iff

For all y(1)…y(h), f(x) can achieve zero error on training data (x(1),y(1)), (x(2),y(2)), … (x(h),y(h)) (i.e., there exists some θ that gets zero error)

  • Can f(x;θ) = sign(x1^2 + x2^2 − θ) shatter these points?

SLIDE 11

Shattering

  • We say a classifier f(x) can shatter points x(1)…x(h) iff

For all y(1)…y(h), f(x) can achieve zero error on training data (x(1),y(1)), (x(2),y(2)), … (x(h),y(h)) (i.e., there exists some θ that gets zero error)

  • Can f(x;θ) = sign(x1^2 + x2^2 − θ) shatter these points?

  • Nope!

SLIDE 12

VC Dimension

  • The VC dimension H is defined as

The maximum number of points h that can be arranged so that f(x) can shatter them

  • A game:

– Fix the definition of f(x;θ)
– Player 1: choose locations x(1)…x(h)
– Player 2: choose target labels y(1)…y(h)
– Player 1: choose value of θ
– If f(x;θ) can reproduce the target labels, P1 wins

SLIDE 13

VC Dimension

  • The VC dimension H is defined as

The maximum number of points h that can be arranged so that f(x) can shatter them

  • Example: what’s the VC dimension of the (zero-centered) circle, f(x;θ) = sign(x1^2 + x2^2 − θ)?

SLIDE 14

VC Dimension

  • The VC dimension H is defined as

The maximum number of points h that can be arranged so that f(x) can shatter them

  • Example: what’s the VC dimension of the (zero-centered) circle, f(x;θ) = sign(x1^2 + x2^2 − θ)?

  • VC dim = 1: we can arrange one point that is shattered, but no arrangement of two points works (the previous two-point argument was general)
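This claim is easy to check numerically for a particular arrangement; a sketch reusing the brute-force shattering idea (the point radii and the θ sweep are hypothetical):

```python
import itertools

def circle_predict(theta, x):
    # zero-centered circle classifier: f(x; theta) = sign(x1^2 + x2^2 - theta)
    return 1 if x[0] ** 2 + x[1] ** 2 - theta > 0 else -1

def can_shatter(points, thetas):
    # shattered iff every labeling is achieved by some threshold theta
    return all(
        any(all(circle_predict(t, x) == y for x, y in zip(points, labels)) for t in thetas)
        for labels in itertools.product([-1, 1], repeat=len(points))
    )

thetas = [i / 10 for i in range(-50, 51)]             # sweep the radius threshold
print(can_shatter([(1.0, 0.0)], thetas))              # one point: True
print(can_shatter([(1.0, 0.0), (2.0, 0.0)], thetas))  # two points: False
```

The failure mode matches the slide: with two points at different radii, no single threshold can label the inner point + while labeling the outer point −.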

SLIDE 15

VC Dimension

  • Example: what’s the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1 x1 + θ2 x2 + θ0)?

SLIDE 16

VC Dimension

  • Example: what’s the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1 x1 + θ2 x2 + θ0)?

  • VC dim >= 3? Yes

SLIDE 17

VC Dimension

  • Example: what’s the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1 x1 + θ2 x2 + θ0)?

  • VC dim >= 3? Yes
  • VC dim >= 4?

SLIDE 18

VC Dimension

  • Example: what’s the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1 x1 + θ2 x2 + θ0)?

  • VC dim >= 3? Yes
  • VC dim >= 4? No…

Any line through these points must split one pair (by crossing one of the lines)

SLIDE 19

VC Dimension

  • Example: what’s the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1 x1 + θ2 x2 + θ0)?

  • VC dim >= 3? Yes
  • VC dim >= 4? No…

Any line through these points must split one pair (by crossing one of the lines)

Turns out: for a general linear classifier (perceptron) in d dimensions with a constant term, VC dim = d + 1

SLIDE 20

VC Dimension

  • VC dimension measures the “power” of the learner
  • It does *not* necessarily equal the number of parameters!

– Can define a classifier with many parameters but not much power (how?)
– Can define a classifier with one parameter but lots of power (how?)

  • Lots of work has gone into determining the VC dimension of various learners…

SLIDE 21

Example

  • VC Dim >= 3?
  • VC Dim >= 4?

SLIDE 22

Using VC dimension

  • Used validation / cross-validation to select complexity

[Table: models f1–f6 compared by # of parameters, training error, and cross-validation error]

SLIDE 23

Using VC dimension

  • Used validation / cross-validation to select complexity
  • Use a VC-dimension-based bound on test error similarly
  • “Structural Risk Minimization” (SRM)
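SRM amounts to picking the model that minimizes the VC bound (training error plus the VC penalty term) rather than the training error alone. A sketch of that selection rule; the models, errors, VC dimensions, and sample size below are hypothetical:

```python
import math

def vc_bound(train_err, h, m, delta=0.05):
    # Vapnik-style bound: test error <= train error + sqrt((h(ln(2m/h)+1) + ln(4/delta)) / m)
    penalty = math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)
    return train_err + penalty

# hypothetical models: name -> (training error, VC dimension)
models = {"f1": (0.30, 2), "f2": (0.15, 10), "f3": (0.05, 100)}
m = 1000  # number of training examples

# SRM: choose the model with the smallest bound on test error
best = min(models, key=lambda name: vc_bound(models[name][0], models[name][1], m))
print(best)
```

Note how the winner is neither the simplest model nor the one with the lowest training error: the penalty grows with h, so a low-training-error, high-VC-dimension model can lose to a moderate one.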

[Table: models f1–f6 compared by # of parameters, training error, VC penalty term, and VC test-error bound]

SLIDE 24

Using VC dimension

  • Used validation / cross-validation to select complexity
  • Use VC dimension based bound on test error similarly
  • Other alternatives

– Probabilistic models: likelihood under the model (rather than classification error)
– AIC (Akaike Information Criterion): log-likelihood of training data − # of parameters
– BIC (Bayesian Information Criterion): log-likelihood of training data − (# of parameters) * log(m)

  • Similar to VC dimension: performance + penalty
  • BIC is conservative; SRM very conservative
  • Also, “true Bayesian” methods (take prob. learning…)
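The two criteria can be written directly from the slide's formulas (higher is better in this sign convention); the log-likelihood, parameter count, and sample size below are made up for illustration, and note the common textbook BIC uses (k/2)·log m where the slide's form uses k·log m:

```python
import math

def aic_score(loglik, k):
    # slide's form: log-likelihood of training data minus # of parameters
    return loglik - k

def bic_score(loglik, k, m):
    # slide's form: log-likelihood minus (# of parameters) * log(m)
    return loglik - k * math.log(m)

# hypothetical: a 5-parameter model fit on m = 1000 examples
print(aic_score(-100.0, 5))        # -105.0
print(bic_score(-100.0, 5, 1000))  # about -134.5 (stronger penalty than AIC)
```

For m > e the BIC penalty exceeds AIC's, which is why the slide calls BIC the more conservative criterion.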
