SLIDE 1

Introduction to Machine Learning

  • 2. Basic Tools

Alex Smola & Geoff Gordon, Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701x 10-701

SLIDE 2

This is not a toy dataset

http://wadejohnston1962.files.wordpress.com/2012/09/datainoneminute.jpg

SLIDE 3

Linear Regression

SLIDE 4
Linear Regression

  • Observations x, labels y
  • Minimize squared distance
  • Linear function

f(x) = ax + b

\text{minimize}_{a,b} \; \sum_{i=1}^{m} \frac{1}{2} (a x_i + b - y_i)^2

\partial_a [\ldots] = 0 = \sum_{i=1}^{m} x_i (a x_i + b - y_i) \qquad \partial_b [\ldots] = 0 = \sum_{i=1}^{m} (a x_i + b - y_i)

SLIDE 5

Linear Regression

  • Optimization problem
  • Solving it only requires a matrix inversion.

f(x) = \langle a, x \rangle + b = \langle w, (x, 1) \rangle

\text{minimize}_w \; \sum_{i=1}^{m} \frac{1}{2} \left( \langle w, \bar{x}_i \rangle - y_i \right)^2 \quad \text{where } \bar{x}_i := (x_i, 1)

0 = \sum_{i=1}^{m} \bar{x}_i \left( \langle w, \bar{x}_i \rangle - y_i \right) \;\Longleftrightarrow\; \left[ \sum_{i=1}^{m} \bar{x}_i \bar{x}_i^\top \right] w = \sum_{i=1}^{m} y_i\, \bar{x}_i
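In code, the normal equations amount to a single linear solve. A minimal Octave sketch, assuming xx is an m x 1 vector of inputs and yy the matching labels (the variable names are ours, not the slides'):

    Xbar = [xx, ones(size(xx))];          % augment inputs: x_i -> (x_i, 1)
    w = (Xbar' * Xbar) \ (Xbar' * yy);    % [sum_i xbar_i xbar_i'] w = sum_i y_i xbar_i
    a = w(1); b = w(2);                   % slope and offset of f(x) = ax + b

The backslash solve is numerically preferable to forming an explicit inverse, though the slides speak of a matrix inversion.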

SLIDE 6

Nonlinear Regression

  • Linear model
  • Quadratic model
  • Cubic model
  • Nonlinear model

f(x) = \langle w, (1, x) \rangle
f(x) = \langle w, (1, x, x^2) \rangle
f(x) = \langle w, (1, x, x^2, x^3) \rangle
f(x) = \langle w, \phi(x) \rangle

SLIDE 8
Nonlinear Regression

  • Optimization problem
  • Solving it only requires a matrix inversion.

f(x) = \langle w, \phi(x) \rangle \qquad \text{minimize}_w \; \sum_{i=1}^{m} \frac{1}{2} \left( \langle w, \phi(x_i) \rangle - y_i \right)^2

0 = \sum_{i=1}^{m} \phi(x_i) \left( \langle w, \phi(x_i) \rangle - y_i \right) \;\Longleftrightarrow\; \left[ \sum_{i=1}^{m} \phi(x_i)\, \phi(x_i)^\top \right] w = \sum_{i=1}^{m} y_i\, \phi(x_i)

SLIDE 9

Pseudocode (degree 4)

Training:

    % degree-4 polynomial features for the training inputs xx (m x 1)
    phi_xx = [xx.^4, xx.^3, xx.^2, xx, 1.0 + 0.0 * xx];
    % least squares fit; w comes out as a 1 x 5 row vector
    w = (yy' * phi_xx) / (phi_xx' * phi_xx);

Testing:

    % the same feature map at the test inputs x, then a linear prediction
    phi_x = [x.^4, x.^3, x.^2, x, 1.0 + 0.0 * x];
    y = phi_x * w';
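The same recipe works for any degree d; a hedged generalization (ours, not on the slide) using Octave broadcasting:

    d = 4;                                    % any polynomial degree
    phi_xx = xx .^ (d:-1:0);                  % columns: x.^d, ..., x, 1
    w = (phi_xx' * phi_xx) \ (phi_xx' * yy);  % normal equations, w as a column
    y = (x .^ (d:-1:0)) * w;                  % predictions at the test inputs x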

SLIDE 10

Regression (d=1)

SLIDE 11

Regression (d=2)

SLIDE 12

Regression (d=3)

SLIDE 13

Regression (d=4)

SLIDE 14

Regression (d=5)

SLIDE 15

Regression (d=6)

SLIDE 16

Regression (d=7)

SLIDE 17

Regression (d=8)

SLIDE 18

Regression (d=9)

SLIDE 19

Nonlinear Regression

warning: matrix singular to machine precision, rcond = 5.8676e-19
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 5.86761e-19
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 8x8 matrix, rank = 7
warning: matrix singular to machine precision, rcond = 1.10156e-21
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 1.10145e-21
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 9x9 matrix, rank = 6
warning: matrix singular to machine precision, rcond = 2.16217e-26
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 1.66008e-26
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 10x10 matrix, rank = 5

Why does it fail?
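The failure is numerical: the matrix of monomial features becomes ill-conditioned as the degree grows, so the normal equations are nearly singular. A small sketch that reproduces the effect (the inputs and degree range are our choice):

    xx = linspace(0, 1, 20)';            % 20 inputs on [0, 1]
    for d = 4:9
      Phi = xx .^ (d:-1:0);              % monomial features up to degree d
      printf("d = %d: rcond = %g\n", d, rcond(Phi' * Phi));
    end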

SLIDE 21

Model Selection

  • Underfitting

(model is too simple to explain data)

  • Overfitting

(model is too complicated to estimate reliably from the data)

  • E.g. too many parameters
  • Insufficient confidence to estimate parameters

(failed matrix inverse)

  • Often training error decreases nonetheless
  • Model selection

Need to quantify model complexity vs. data

  • This course: algorithms, model selection, questions
SLIDE 22

Parzen Windows

Parzen

SLIDE 23

Density Estimation

  • Observe some data xi
  • Want to estimate p(x)
  • Find unusual observations (e.g. security)
  • Find typical observations (e.g. prototypes)
  • Classifier via Bayes Rule
  • Need tool for computing p(x) easily

p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x|y)\, p(y)}{\sum_{y'} p(x|y')\, p(y')}

SLIDE 24

Bin Counting

  • Discrete random variables, e.g.
  • English, Chinese, German, French, ...
  • Male, Female
  • Bin counting (record # of occurrences)

            English  Chinese  German  French  Spanish
    male       5        2        3       1
    female     6        3        2       2        1

(25 observations in total)
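The frequencies on the next slide follow by dividing by the total count; a one-line Octave sketch (reading the blank male/Spanish cell as a count of 0 is our assumption):

    counts = [5 2 3 1 0;                 % male (blank cell taken as 0)
              6 3 2 2 1];                % female
    freq = counts / sum(counts(:))       % joint frequencies; the total is 25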

SLIDE 25

Bin Counting

  • Discrete random variables, e.g.
  • English, Chinese, German, French, ...
  • Male, Female
  • Bin counting (record # of occurrences)

            English  Chinese  German  French  Spanish
    male      0.20     0.08     0.12    0.04
    female    0.24     0.12     0.08    0.08     0.04

SLIDE 27

Bin Counting

  • Discrete random variables, e.g.
  • English, Chinese, German, French, ...
  • Male, Female
  • Bin counting (record # of occurrences)

            English  Chinese  German  French  Spanish
    male      0.20     0.08     0.12    0.04
    female    0.24     0.12     0.08    0.08     0.04

not enough data

SLIDE 28
  • Discrete random variables, e.g.
  • English, Chinese, German, French, ...
  • Male, Female
  • ZIP code
  • Day of the week
  • Operating system
  • ...
  • Continuous random variables
  • Income
  • Bandwidth
  • Time

Curse of dimensionality (lite)

Need many bins per dimension, so the total number of bins grows exponentially with the dimension.

SLIDE 30

Density Estimation

  • Continuous domain = infinite number of bins
  • Curse of dimensionality
  • 10 bins on [0, 1] is probably good
  • 10^10 bins on [0, 1]^10 requires high accuracy in the estimate:

probability mass per cell also decreases by a factor of 10^10.

[plots: sample histogram and the underlying density]

SLIDE 31

Bin Counting

SLIDE 32

Bin Counting

SLIDE 33

Bin Counting

SLIDE 34

Bin Counting

can’t we just go and smooth this out?

SLIDE 35

Parzen Windows

  • Naive approach

Use empirical density (delta distributions)

  • This breaks if we see slightly different instances
  • Kernel density estimate

Smear out the empirical density with a nonnegative smoothing kernel k_x(x') satisfying

p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x) \qquad \int_{\mathcal{X}} k_x(x')\, dx' = 1 \ \text{for all } x

SLIDE 36
Parzen Windows

  • Density estimate
  • Smoothing kernels

p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x) \qquad \hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} k_{x_i}(x)

[plots: the four smoothing kernels below, on the interval from -2 to 2]

Gauss: (2\pi)^{-1/2}\, e^{-x^2/2}
Laplace: \tfrac{1}{2}\, e^{-|x|}
Epanechnikov: \tfrac{3}{4}\, \max(0, 1 - x^2)
Uniform: \tfrac{1}{2}\, \chi_{[-1,1]}(x)
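Putting the pieces together, a minimal Octave sketch of the Gaussian case evaluated on a grid (the sample xi, bandwidth r, and grid size are our assumptions):

    % xi: 1 x m row of samples, r: bandwidth
    grid = linspace(min(xi) - 3*r, max(xi) + 3*r, 200)';
    phat = mean(exp(-0.5 * ((grid - xi) / r).^2), 2) / (r * sqrt(2*pi));
    plot(grid, phat);                    % smoothed density estimate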

SLIDE 37

Smoothing

SLIDE 38

Smoothing

    % distance from the query x (d x 1) to each of the m training points (columns of X)
    dist = norm(X - x * ones(1, m), 'columns');
    % Gaussian Parzen window estimate at x (unit bandwidth)
    p = (1/m) * ((2 * pi)^(-d/2)) * sum(exp(-0.5 * dist.^2));

SLIDE 39

Smoothing

SLIDE 40

Smoothing

SLIDE 41

Size matters

[plots: kernel density estimates with bandwidths 0.3, 1, 3, 10, together with the sample and the underlying density]

SLIDE 42

Size matters
Shape matters mostly in theory

  • Kernel width
  • Too narrow: overfits
  • Too wide: smoothes out to an almost constant distribution
  • How to choose?

[plots: estimates at the four kernel widths above]

k_{x_i}(x) = r^{-d}\, h\!\left( \frac{x - x_i}{r} \right)

SLIDE 43

Model Selection

SLIDE 44

Maximum Likelihood

  • Need to measure how well we do
  • For density estimation we care about

\Pr\{X\} = \prod_{i=1}^{m} p(x_i)

  • Finding a density that maximizes Pr{X} will peak at all data points, since x_i explains x_i best ...
  • Maxima are delta functions on the data.
  • Overfitting!

SLIDE 45

[plot: overfit density estimate; density ≈ 0 away from the samples, ≫ 0 at the samples]

Overfitting

Likelihood on the training set is much higher than the typical one.


SLIDE 47

[plot: oversmoothed density estimate on the sample]

Underfitting

Likelihood on the training set is very similar to the typical one. The model is too simple.

SLIDE 48

Model Selection

  • Validation
  • Use some of the data to estimate density.
  • Use other part to evaluate how well it works
  • Pick the parameter that works best
  • Learning Theory
  • Use data to build model
  • Measure complexity and use this to bound

L(X'|X) := \frac{1}{n'} \sum_{i=1}^{n'} \log \hat{p}(x'_i) \qquad \text{(easy, but wasteful)}

\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i) - \mathbf{E}_x\!\left[ \log \hat{p}(x) \right] \qquad \text{(difficult)}
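As a sketch, the validation route looks as follows; the 80/20 split and the helper phat (which evaluates the estimate fitted on one set at the points of another) are our assumptions, not the slides':

    idx = randperm(n);                   % shuffle the n sample indices
    ntr = floor(0.8 * n);                % 80/20 split is our choice
    train = xi(idx(1:ntr));
    val = xi(idx(ntr+1:end));
    % hypothetical helper: kernel density estimate built from train,
    % evaluated at val, with bandwidth r
    L = mean(log(phat(train, val, r)));  % pick the r that maximizes this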

SLIDE 52

Model Selection

  • Leave-one-out Crossvalidation
  • Use almost all data to estimate density.
  • Use single instance to estimate how well it works
  • This has huge variance
  • Average over estimates for all training data
  • Pick the parameter that works best
  • Simple implementation

\frac{1}{n} \sum_{i=1}^{n} \log\!\left[ \frac{n}{n-1}\, p(x_i) - \frac{1}{n-1}\, k(x_i, x_i) \right] \quad \text{where } p(x) = \frac{1}{n} \sum_{i=1}^{n} k(x_i, x)

\log p(x_i \mid X \setminus \{x_i\}) = \log \frac{1}{n-1} \sum_{j \neq i} k(x_i, x_j)
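A direct Octave sketch of the leave-one-out score for a Gaussian kernel (the sample layout and bandwidth r are our assumptions):

    % xi: 1 x n row of samples, r: candidate bandwidth
    K = exp(-0.5 * ((xi' - xi) / r).^2) / (r * sqrt(2*pi));  % K(i,j) = k(x_i, x_j)
    Ksum = sum(K, 2) - diag(K);          % sum over j != i for every i
    loo = mean(log(Ksum / (n - 1)));     % average leave-one-out log-likelihood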

SLIDE 53

Leave-one out estimate

SLIDE 54

Optimal estimate

SLIDE 55

Model Selection

  • k-fold Crossvalidation
  • Partition data into k blocks (typically 10)
  • Use all but one block to compute estimate
  • Use remaining block as validation set
  • Average over all validation estimates
  • Almost unbiased (e.g. via Luntz and Brailovsky, 1969)

(error is for (k-1)/k sized set)

  • Pick best parameter (why must we not check too many?)

\frac{1}{k} \sum_{i=1}^{k} l\!\left( p(X_i \mid X \setminus X_i) \right)
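A sketch of the k-fold loop; the random fold assignment and the helper loglik (validation log-likelihood of the held-out block) are our assumptions:

    k = 10;                              % typical choice, as above
    folds = mod(randperm(n), k) + 1;     % random assignment to k blocks
    score = 0;
    for f = 1:k
      % hypothetical helper: estimate on all blocks but f, score block f
      score += loglik(xi(folds != f), xi(folds == f), r);
    end
    score /= k;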

SLIDE 56

Watson-Nadaraya Estimator

Geoff Watson

SLIDE 57

From density estimation to classification

  • Binary classification
  • Estimate
  • Use Bayes rule
  • Decision boundary

p(x|y = 1) \text{ and } p(x|y = -1)

p(y|x) = \frac{p(x|y)\, p(y)}{p(x)} = \frac{\frac{1}{m_y} \sum_{y_i = y} k(x_i, x) \cdot \frac{m_y}{m}}{\frac{1}{m} \sum_i k(x_i, x)}

p(y = 1|x) - p(y = -1|x) = \frac{\sum_j y_j\, k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)} \quad \text{(local weights)}

SLIDE 58
SLIDE 59

Watson-Nadaraya Classifier

SLIDE 60

Watson-Nadaraya Classifier

    % Watson-Nadaraya vote at the query x: kernel-weighted sum of the labels
    dist = norm(X - x * ones(1, m), 'columns');
    f = sum(y .* exp(-0.5 * dist.^2));    % classify by the sign of f

SLIDE 61

Watson-Nadaraya Regression

  • Binary classification
  • Regression - use same weighted expansion

p(y = 1|x) - p(y = -1|x) = \frac{\sum_j y_j\, k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)} \quad \text{(labels, local weights)}

\hat{y}(x) = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)}
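A minimal sketch of the regression estimate at a single query x (the Gaussian kernel and bandwidth r are our choice):

    dist = norm(X - x * ones(1, m), 'columns');  % distances to all training points
    wgt = exp(-0.5 * (dist / r).^2);             % kernel weights k(x_j, x)
    yhat = sum(y .* wgt) / sum(wgt);             % locally weighted average of labels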

SLIDE 62
SLIDE 63

Watson-Nadaraya regression estimate

SLIDE 64

Nearest Neighbor

SLIDE 65

Nearest Neighbors

  • Table lookup

For a previously seen instance, remember its label

  • Nearest neighbor
  • Pick label of most similar neighbor
  • Slight improvement - use k-nearest neighbors
  • For regression average
  • Really useful baseline!
  • Easy to implement for small amounts of data, as sketched below.
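A brute-force sketch of the k-nearest-neighbor classifier (the data layout and the +/-1 label convention are assumptions):

    % X: d x m training inputs, y: 1 x m row of +/-1 labels, x: query
    dist = norm(X - x * ones(1, m), 'columns');
    [~, order] = sort(dist);             % training points, nearest first
    yhat = sign(sum(y(order(1:k))));     % majority vote among the k nearest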

SLIDE 66

Relation to Watson-Nadaraya

  • Watson Nadaraya estimator
  • Nearest neighbor estimator

Neighborhood function is hard threshold.

\hat{y}(x) = \sum_j y_j\, \frac{k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j\, w_j(x)

(the same expansion in both cases; for nearest neighbors the weight w_j(x) comes from a hard-threshold neighborhood function)

SLIDE 67

1-Nearest Neighbor

SLIDE 68

4-Nearest Neighbors

SLIDE 69

4-Nearest Neighbors Sign

SLIDE 70

If we get more data

  • 1-Nearest Neighbor
  • Converges to the perfect solution if the classes are separated
  • Twice the minimal error rate, 2p(1-p), for noisy problems
  • k-Nearest Neighbor
  • Converges to the perfect solution if the classes are separated (but needs more data)
  • Converges to the minimal error min(p, 1-p) for noisy problems

(use increasing k)

SLIDE 71

1 Nearest Neighbor

  • For a given point x take an ϵ-neighborhood N with probability mass > d/n
  • The probability that at least one of the n points lands in this neighborhood is 1 - e^{-d}, so the chance of an empty neighborhood can be made small
  • Assume that the probability mass doesn't change much within the neighborhood
  • The probability that the labels of the query and its nearest neighbor do not match is p(1-p) + (1-p)p = 2p(1-p)

(up to some approximation error in the neighborhood)

SLIDE 72

k Nearest Neighbor

  • For given point x take ϵ neighborhood N with probability mass > dk/n
  • Small probability that we don’t have at least k points in neighborhood.
  • Assume that probability mass doesn’t change much in neighborhood
  • Bound the probability that the majority vote of the k points does not match the majority label under p

(e.g. via Hoeffding's tail inequality). Show that it vanishes.

  • Error is therefore min(p, 1-p), i.e. Bayes optimal error.
SLIDE 73

Fast lookup

  • KD trees (Moore et al.)
  • Partition space (one dimension at a time)
  • Only search for subset that contains point
  • Cover trees (Beygelzimer et al.)
  • Hierarchically partition space with distance

guarantees

  • No need for nonoverlapping sets
  • Bounded number of paths to follow

(logarithmic time lookup)

SLIDE 74

Silverman's Rule

Bernard Silverman

SLIDE 75

Silverman’s rule

  • Chicken and egg problem
  • Want wide kernel for low density region
  • Want narrow kernel where we have much data
  • Need density estimate to estimate density
  • Simple hack

Use average distance from k nearest neighbors

  • Nonuniform bandwidth for the smoother.

r_i = \frac{r}{k} \sum_{x \in \mathrm{NN}(x_i, k)} \| x_i - x \|
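A sketch of the resulting adaptive bandwidths in one dimension (the sample layout and the global scale r are our assumptions):

    % xi: 1 x n row of samples, k: number of neighbors, r: global scale
    D = abs(xi' - xi);                   % pairwise distances
    Ds = sort(D, 2);                     % each row sorted ascending
    ri = (r / k) * sum(Ds(:, 2:k+1), 2); % skip column 1: distance to itself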

SLIDE 76

Density

true density

SLIDE 77

non adaptive estimate

SLIDE 78

adaptive estimate

SLIDE 79

distance distribution

SLIDE 80
Summary

  • Parzen Windows

Kernels, algorithm

  • Model selection

Crossvalidation, leave one out, bias variance

  • Watson-Nadaraya estimator

Classification, regression, novelty detection

SLIDE 81

Further Reading

  • Cover tree homepage (paper & code)

http://hunch.net/~jl/projects/cover_tree/cover_tree.html

  • http://doi.acm.org/10.1145/361002.361007 (kd trees, original paper)
  • http://www.autonlab.org/autonweb/14665/version/2/part/5/data/moore-tutorial.pdf

(Andrew Moore’s tutorial from his PhD thesis)

  • Nadaraya’s regression estimator (1964)

http://dx.doi.org/10.1137/1109020

  • Watson’s regression estimator (1964)

http://www.jstor.org/stable/25049340

  • Watson-Nadaraya regression package in R

http://cran.r-project.org/web/packages/np/index.html

  • Stone’s k-NN regression consistency proof

http://projecteuclid.org/euclid.aos/1176343886

  • Cover and Hart’s k-NN classification consistency proof

http://www-isl.stanford.edu/people/cover/papers/transIT/0021cove.pdf

  • Tom Cover’s rate analysis for k-NN

Rates of Convergence for Nearest Neighbor Procedures.

  • Sanjoy Dasgupta’s analysis for k-NN estimation with selective sampling

http://cseweb.ucsd.edu/~dasgupta/papers/nnactive.pdf

  • Multiedit & Condense (Dasarathy, Sanchez, Townsend)

http://cgm.cs.mcgill.ca/~godfried/teaching/pr-notes/dasarathy.pdf

  • Geometric approximation via core sets

http://valis.cs.uiuc.edu/~sariel/papers/04/survey/survey.pdf