SLIDE 1

Machine Learning

A Geometric Approach

Professor Liang Huang

Linear Classification: Perceptron

some slides from Alex Smola (CMU)

SLIDE 2

Perceptron

Frank Rosenblatt

SLIDE 3

Perceptron

(Diagram: the perceptron and its relatives: linear regression, SVM, CRF, structured perceptron, multilayer perceptron, deep learning.)

SLIDE 4

Brief History of Perceptron

  • 1959  Rosenblatt: invention of the perceptron
  • 1962  Novikoff: convergence proof
  • 1969* Minsky/Papert: book ("Perceptrons") killed it; the perceptron was considered DEAD
  • 1997  Cortes/Vapnik: SVM (+ max margin, + kernels, + soft margin)
  • 1999  Freund/Schapire: voted/averaged perceptron; revived
  • 2002  Collins: structured perceptron
  • 2003  Crammer/Singer: MIRA (online approximation of max margin; conservative updates; handles the inseparable case)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006  Singer group: aggressive MIRA
  • 2007--2010*  Singer group: Pegasos (subgradient descent, minibatch; spans online, minibatch, and batch)

*mentioned in lectures but optional (the other papers are all covered in detail)

Much of this line of work came from AT&T Research, ex-AT&T researchers, and their students.

SLIDE 5

Neurons

  • Soma (CPU)

Cell body - combines signals

  • Dendrite (input bus)

Combines the inputs from several other nerve cells

  • Synapse (interface)

Interface and parameter store between neurons

  • Axon (output cable)

May be up to 1m long and will transport the activation signal to neurons at different locations

SLIDE 6

Neurons

$$f(x) = \sum_i w_i x_i = \langle w, x \rangle$$

(Diagram: inputs x1, x2, x3, ..., xn are multiplied by synaptic weights w1, ..., wn, summed, and passed through σ(·) to produce the output.)
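As a minimal sketch (not from the slides, names are my own), the weighted combination above is just a dot product; taking the sign as the nonlinearity σ gives the perceptron's decision rule:

```python
import numpy as np

def neuron(w, x, sigma=np.sign):
    """Weighted linear combination <w, x> passed through a nonlinearity sigma."""
    return sigma(np.dot(w, x))

# toy example: 3 inputs, 3 synaptic weights
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.5])
print(neuron(w, x))   # sign(0.5 + 0.0 + 3.0) = 1.0
```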

SLIDE 7

Frank Rosenblatt’s Perceptron

SLIDE 8

Multilayer Perceptron (Neural Net)

SLIDE 9

Perceptron w/ bias

  • Weighted linear combination
  • Nonlinear decision function
  • Linear offset (bias)
  • Linear separating hyperplanes
  • Learning: w and b

$$f(x) = \sigma(\langle w, x \rangle + b)$$

(Diagram: inputs x1, x2, x3, ..., xn, synaptic weights w1, ..., wn, output through σ.)

SLIDE 10

Perceptron w/o bias

  • Weighted linear combination
  • Nonlinear decision function
  • No linear offset (bias): hyperplane through the origin
  • Linear separating hyperplanes
  • Learning: w

$$f(x) = \sigma(\langle w, x \rangle)$$

(Diagram: same network as before, but with an extra constant input x0 = 1 whose weight w0 plays the role of the bias; inputs x0, x1, ..., xn, synaptic weights w0, w1, ..., wn, output through σ.)

SLIDE 11

Augmented Space

(Figure: 1D points on a line through the origin O, and the same points lifted to 2D by the constant coordinate x0 = 1.)

  • can’t separate in 1D from the origin, but can separate in 2D from the origin
  • can’t separate in 2D from the origin, but can separate in 3D from the origin
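A small illustrative sketch (data and names are my own, not from the slides): augmenting each example with a constant coordinate x0 = 1 lets a bias-free perceptron, whose hyperplane must pass through the origin, separate 1D data that is not separable from the origin.

```python
import numpy as np

# 1D data that is NOT separable by a threshold at the origin:
# both points are positive, but they have different labels.
X = np.array([[1.0], [3.0]])
y = np.array([-1, +1])

# augment: prepend a constant feature x0 = 1, so the bias becomes weight w0
X_aug = np.hstack([np.ones((len(X), 1)), X])   # shape (2, 2)

# perceptron without bias in the augmented 2D space
w = np.zeros(2)
for _ in range(100):                           # epochs
    mistakes = 0
    for xi, yi in zip(X_aug, y):
        if yi * np.dot(w, xi) <= 0:            # mistake (or on the boundary)
            w += yi * xi
            mistakes += 1
    if mistakes == 0:
        break

print(w, [np.sign(np.dot(w, xi)) for xi in X_aug])   # a separator through the origin
```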

SLIDE 12

Perceptron

Spam Ham

SLIDE 13

The Perceptron w/o bias

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the (mistaken) training examples
  • Classifier is a linear combination of inner products

initialize w = 0
repeat
  if yi ⟨w, xi⟩ ≤ 0 then
    w ← w + yi xi
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sigma\Big(\sum_{i \in I} y_i \langle x_i, x \rangle\Big)$$

where I is the set of examples on which updates were made.

SLIDE 14

The Perceptron w/ bias

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the (mistaken) training examples
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sigma\Big(\sum_{i \in I} y_i \langle x_i, x \rangle + b\Big)$$

where I is the set of examples on which updates were made.
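A minimal Python sketch of the pseudocode above (function and variable names are my own), trained on a toy dataset:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron with bias: update w, b whenever yi*(<w, xi> + b) <= 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                       # all classified correctly
            break
    return w, b

# toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))   # predictions should match y
```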

SLIDE 15

Demo

(Figure: weight vector w and example xi; bias = 0.)

SLIDE 16

Demo

SLIDE 17

Demo

SLIDE 18

Demo

SLIDE 19

SLIDE 20

Convergence Theorem

  • If there exists some oracle unit vector

$$u : \|u\| = 1 \quad \text{with} \quad y_i (u \cdot x_i) \ge \delta \ \text{ for all } i,$$

then the perceptron converges to a linear separator after a number of updates bounded by

$$R^2/\delta^2 \quad \text{where } R = \max_i \|x_i\|.$$

  • Dimensionality independent
  • Order independent (but order matters in output)
  • Dataset size independent
  • Scales with the ‘difficulty’ of the problem
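For a concrete instance (the numbers are mine, not from the slides): with R = 1 and margin δ = 0.1,

$$\frac{R^2}{\delta^2} = \frac{1}{0.01} = 100,$$

so the perceptron makes at most 100 updates, regardless of dimensionality or dataset size.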

SLIDE 21

Geometry of the Proof

  • part 1: progress (alignment) on oracle projection

Assume wi is the weight vector before the i-th update (on ⟨xi, yi⟩) and that the initial w0 = 0. Using the oracle u with ‖u‖ = 1 and yi (u · xi) ≥ δ for all i:

$$
\begin{aligned}
w_{i+1} &= w_i + y_i x_i \\
u \cdot w_{i+1} &= u \cdot w_i + y_i (u \cdot x_i) \\
u \cdot w_{i+1} &\ge u \cdot w_i + \delta \\
u \cdot w_{i+1} &\ge i\delta \\
\|w_{i+1}\| &= \|u\| \, \|w_{i+1}\| \ge u \cdot w_{i+1} \ge i\delta
\end{aligned}
$$

projection on u increases! (more agreement with the oracle)

(Figure: the update wi+1 = wi + yi xi after a mistake on xi, shown against the oracle direction u and margin δ.)

SLIDE 22

Geometry of the Proof

  • part 2: bound the norm of the weight vector

$$
\begin{aligned}
w_{i+1} &= w_i + y_i x_i \\
\|w_{i+1}\|^2 &= \|w_i + y_i x_i\|^2 = \|w_i\|^2 + \|x_i\|^2 + 2 y_i (w_i \cdot x_i) \\
&\le \|w_i\|^2 + R^2 \\
&\le i R^2
\end{aligned}
$$

(the cross term yi (wi · xi) ≤ 0 because there was a mistake on xi, and ‖xi‖ ≤ R, the radius of the data)

Combine with part 1, ‖wi+1‖ ≥ iδ, to get i²δ² ≤ iR², i.e.

$$i \le R^2 / \delta^2$$

(Figure: the same picture as part 1, with u the unit oracle vector and δ the margin.)

SLIDE 23

Convergence Bound

  • the bound R²/δ² is independent of:
  • dimensionality
  • number of examples
  • starting weight vector w
  • order of examples
  • constant learning rate
  • and is dependent on:
  • separation difficulty
  • feature scale
  • but test accuracy is dependent on:
  • order of examples (shuffling helps)
  • variable learning rate (1/total#errors helps)
  • can you still prove convergence?

SLIDE 24

Hardness: margin vs. size

(Figure: example datasets ranging from hard, with a small margin, to easy, with a large margin.)

SLIDE 25

XOR

  • XOR - not linearly separable
  • Nonlinear separation is trivial
  • Caveat from “Perceptrons” (Minsky & Papert, 1969)

Finding the minimum error linear separator is NP-hard (this killed Neural Networks in the 70s).

SLIDE 26

Brief History of Perceptron

(Timeline repeated from Slide 4.)

SLIDE 27

Extensions of Perceptron

  • Problems with Perceptron
  • doesn’t converge with inseparable data
  • update might often be too “bold”
  • doesn’t optimize margin
  • is sensitive to the order of examples
  • Ways to alleviate these problems
  • voted perceptron and average perceptron
  • MIRA (margin-infused relaxation algorithm)
SLIDE 28

Voted/Avged Perceptron

  • motivation: updates on later examples taking over!
  • voted perceptron (Freund and Schapire, 1999)
  • record the weight vector after each example in D
  • (not just after each update)
  • and vote on a new example using |D| models
  • shown to have better generalization power
  • averaged perceptron (from the same paper)
  • an approximation of voted perceptron
  • just use the average of all weight vectors
  • can be implemented efficiently
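Before moving on, a sketch of the voted perceptron itself (my own code, not the authors' reference implementation; it stores each weight vector with a survival count, which is equivalent to recording w after every example):

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """Return a list of (w, b, c) models; c counts how many examples each model survived."""
    models = []                                    # (weights, bias, survival count)
    w, b, c = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:      # mistake: retire the current model
                models.append((w.copy(), b, c))
                w, b, c = w + yi * xi, b + yi, 1
            else:
                c += 1                             # current model survives one more example
    models.append((w.copy(), b, c))
    return models

def predict_voted(models, x):
    """Each recorded model casts c votes of sign(<w, x> + b)."""
    score = sum(c * np.sign(np.dot(w, x) + b) for w, b, c in models)
    return np.sign(score)
```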
SLIDE 29

Voted Perceptron

SLIDE 30

Voted/Avged Perceptron

test error (low dim - less separable)

SLIDE 31

Voted/Avged Perceptron

test error

(high dim - more separable)

SLIDE 32

Averaged Perceptron

  • voted perceptron is not scalable
  • and does not output a single model
  • avg perceptron is an approximation of voted perceptron

initialize w = 0, b = 0, w′ = 0, c = 0
repeat
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
  end if
  w′ ← w′ + w    (after each example, not each update)
  c ← c + 1
until all classified correctly
output w′ / c
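A direct Python transcription of the running-sum version above (a sketch with my own naming, kept deliberately naive; the next slide shows why this doesn't scale):

```python
import numpy as np

def averaged_perceptron_naive(X, y, epochs=10):
    """Averaged perceptron, naive version: accumulate w after EVERY example."""
    w, b = np.zeros(X.shape[1]), 0.0
    w_sum, b_sum, c = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
            w_sum += w        # after each example, not each update: O(d) work every time
            b_sum += b
            c += 1
    return w_sum / c, b_sum / c
```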

SLIDE 33

Efficient Implementation of Averaging

  • naive implementation (running sum) doesn’t scale
  • very clever trick from Daume (2006, PhD thesis)

$$
\begin{aligned}
w^{(0)} &= 0 \\
w^{(1)} &= \Delta w^{(1)} \\
w^{(2)} &= \Delta w^{(1)} + \Delta w^{(2)} \\
w^{(3)} &= \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} \\
w^{(4)} &= \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}
\end{aligned}
$$

Summing the w(t) counts each update Δw(t) once for every later step, so instead of accumulating w after every example we only need a counter-weighted accumulator wa of the updates, and the average is recovered as w − wa/c:

initialize w = 0, b = 0, wa = 0, c = 0
repeat
  c ← c + 1
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
    wa ← wa + c yi xi
  end if
until all classified correctly
output w − wa/c
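A runnable sketch of the trick (my own transcription, also accumulating the bias, which the pseudocode above omits; the exact counter convention, incrementing c before or after the example, differs across implementations by a harmless off-by-one):

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron with Daume's trick: only touch the accumulator on updates."""
    w, b = np.zeros(X.shape[1]), 0.0
    wa, ba = np.zeros(X.shape[1]), 0.0    # counter-weighted accumulators
    c = 0                                 # example counter
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            c += 1
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
                wa += c * yi * xi         # weight each update by when it happened
                ba += c * yi
    return w - wa / c, b - ba / c         # recover the averaged weights and bias
```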
SLIDE 34

MIRA

  • perceptron often makes too bold updates
  • but hard to tune learning rate
  • the smallest update to correct the mistake?

easy to show:

$$w_{i+1} = w_i + \frac{y_i - w_i \cdot x_i}{\|x_i\|^2} \, x_i$$

$$y_i (w_{i+1} \cdot x_i) = y_i \Big( w_i + \frac{y_i - w_i \cdot x_i}{\|x_i\|^2} \, x_i \Big) \cdot x_i = 1$$

margin-infused relaxation algorithm (MIRA)

(Figure: on a mistake on xi, the perceptron update wi + yi xi can over-correct this mistake, while MIRA moves wi just far enough along xi that yi (wi+1 · xi) = 1, i.e. a functional margin of 1 and a geometric margin of 1/‖xi‖.)
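A sketch of the corresponding training loop (my own code; bias omitted, matching the bias = 0 demos):

```python
import numpy as np

def mira(X, y, epochs=100):
    """MIRA: on a mistake, make the smallest update so that yi * <w, xi> becomes 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:                           # mistake
                w += (yi - np.dot(w, xi)) / np.dot(xi, xi) * xi   # now yi*<w, xi> = 1
                mistakes += 1
        if mistakes == 0:
            break
    return w
```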

SLIDE 35

Perceptron

(Figure: weight vector w and example xi, bias = 0; the perceptron update under-corrects this mistake.)

SLIDE 36

MIRA

(Figure: the same mistake on xi, bias = 0; the perceptron update under-corrects it, while MIRA makes sure that after the update the dot product w · xi = 1, i.e. a margin of 1/‖xi‖.)

$$\min_{w'} \|w' - w\|^2 \quad \text{s.t.} \quad y_i (w' \cdot x_i) \ge 1$$

minimal change to ensure margin

MIRA ≈ 1-step SVM

SLIDE 37

Aggressive MIRA

  • aggressive version of MIRA
  • also update if correct but the margin isn’t big enough
  • functional margin: yi (w · xi)
  • geometric margin: yi (w · xi) / ‖w‖
  • update if the functional margin is ≤ p (0 ≤ p < 1)
  • update rule is the same as MIRA
  • called p-aggressive MIRA (MIRA: p = 0)
  • larger p leads to a larger geometric margin
  • but slower convergence
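The change from plain MIRA is a single condition; a sketch (my own code, bias = 0):

```python
import numpy as np

def aggressive_mira(X, y, p=0.2, epochs=1000):
    """p-aggressive MIRA: update whenever the functional margin yi*<w, xi> is <= p."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= p:                           # margin too small (p = 0 is MIRA)
                w += (yi - np.dot(w, xi)) / np.dot(xi, xi) * xi   # same update rule as MIRA
                updated = True
        if not updated:
            break
    return w
```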

SLIDE 38

Aggressive MIRA

(Figure: decision boundaries found by the perceptron, 0.2-aggressive MIRA (p = 0.2), and 0.9-aggressive MIRA (p = 0.9).)

SLIDE 39

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive

(Figure: decision boundaries for the perceptron, p = 0.2, and p = 0.9.)
SLIDE 40

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive
  • why is this dataset so slow to converge?
  • perceptron: 22, p=0.2: 87, p=0.9: 2,518 epochs

(Figure: data on a 1D line through the origin O, lifted to 2D with the constant coordinate x0 = 1; big margin in 1D, small margin in 2D.)

answer: margin shrinks in augmented space!

SLIDE 41

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive
  • why does this dataset converge so fast?
  • perceptron: 3, p=0.2: 1, p=0.9: 5 epochs

(Figure: data on a 1D line through the origin O, lifted to 2D with the constant coordinate x0 = 1; big margin in 1D, still an OK margin in 2D.)

answer: margin shrinks in augmented space!
SLIDE 42

What if the data is not separable?

  • in practice, data is almost always inseparable
  • wait, what does that mean??
  • perceptron cycling theorem (1970)
  • weights will remain bounded and not diverge
  • use dev set for when to stop (prevents overfitting)
  • higher-order features by combining atomic ones
  • kernels => separable in higher dimensions
SLIDE 43

SLIDE 44

Solving XOR

  • XOR not linearly separable
  • Mapping into 3 dimensions makes it easily solvable

$$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$$
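A quick sketch (my own code) showing that the lifted XOR data becomes separable by a bias-free perceptron: with ±1-coded inputs, the product feature x1·x2 alone already determines the label.

```python
import numpy as np

# XOR with +/-1 coding: label is +1 iff exactly one input is +1
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1])

# lift (x1, x2) -> (x1, x2, x1*x2)
X3 = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# perceptron without bias in the lifted space
w = np.zeros(3)
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X3, y):
        if yi * np.dot(w, xi) <= 0:
            w += yi * xi
            converged = False

print(w)                        # e.g. [0, 0, -2]: the x1*x2 feature does the work
print(np.sign(X3 @ w) == y)     # all True
```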

SLIDE 45

Useful Engineering Tips:

averaging, shuffling, variable learning rate, fixing feature scale

  • averaging helps significantly; MIRA helps a tiny little bit
  • perceptron < MIRA < avg. perceptron ≈ avg. MIRA
  • shuffling the data helps hugely if classes were ordered
  • shuffling before each epoch helps a little bit
  • variable learning rate often helps a little
  • 1/(total#updates) or 1/(total#examples) helps
  • any requirement in order to converge?
  • how to prove convergence now?
  • centering of each feature dim helps
  • why? => R smaller, margin bigger
  • unit variance also helps (why?)
  • 0-mean, 1-var => each feature ≈ a unit Gaussian

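A sketch of the last two tips, centering and unit variance (my own code; this is just per-feature standardization, fit on the training set and reused on the test set):

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and std, computed on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0           # guard against constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """0-mean, 1-var per feature: R gets smaller, the margin relatively bigger."""
    return (X - mu) / sigma

# usage
X_train = np.array([[20.0, 40.0], [45.0, 60.0], [30.0, 20.0]])
mu, sigma = fit_standardizer(X_train)
X_train_std = standardize(X_train, mu, sigma)
```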

SLIDE 46

Useful Engineering Tips:

categorical=>binary, feature bucketing (binning/quantization)

  • HW1 Adult income dataset: <=50K or >50K?
  • 2 numerical features
  • age and hours-per-week
  • Option 1: treat them as numerical features
  • but is older and more hours always better?
  • Option 2: treat them as binary features
  • e.g., age=22, hours=38, ...
  • Option 3: bin them (e.g., age=0-25, hours=41-60, ...)
  • 7 categorical features: convert to binary features
  • country, race, occupation, etc.
  • e.g., country:United_States, education:Doctorate, ...
  • Optional: you can probably add a numerical feature “edu_level”
  • perceptron: ~20% error, avg. perceptron: ~16% error

Age, Workclass, Education, Marital_Status, Occupation, Race, Sex, Hours, Country, Target
40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Male, 60, United-States, >50K
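A sketch of Options 2/3 plus the categorical conversion (my own code, field names, and bin boundaries, following the CSV row above): every feature becomes a binary "field=value" indicator, with the numerical fields bucketed first.

```python
def bucket_age(age):
    """Option 3: bin a numerical feature into coarse ranges."""
    if age <= 25:
        return "0-25"
    elif age <= 45:
        return "26-45"
    elif age <= 65:
        return "46-65"
    return "66+"

def featurize(row):
    """Turn one CSV row (list of strings) into a set of binary feature names."""
    age, workclass, education, marital, occupation, race, sex, hours, country = row[:9]
    feats = {
        "age=" + bucket_age(int(age)),                 # binned numerical feature
        "hours=" + ("<=40" if int(hours) <= 40 else "41-60" if int(hours) <= 60 else "60+"),
        "workclass=" + workclass,                      # categorical -> binary indicators
        "education=" + education,
        "marital=" + marital,
        "occupation=" + occupation,
        "race=" + race,
        "sex=" + sex,
        "country=" + country,
    }
    return feats

row = ["40", "Private", "Doctorate", "Married-civ-spouse", "Prof-specialty",
       "White", "Male", "60", "United-States"]
print(sorted(featurize(row)))
```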

SLIDE 47

Brief History of Perceptron

(Timeline repeated from Slide 4.)