

  1. Machine Learning for NLP: Support Vector Machines. Aurélie Herbelot, 2019. Centre for Mind/Brain Sciences, University of Trento

  2. Support Vector Machines: introduction

  3. Support Vector Machines (SVMs) • SVMs are supervised algorithms for binary classification tasks. • They are derived from ‘statistical learning theory’. • They are founded on mathematical insights which tell us why the classifier works in practice.

  4. Statistical Learning Theory • SLT is a statistical theory of learning (Vapnik 1998). • The main assumption is that there is a certain probability distribution in the training data, which will be found in the test data (the phenomenon is stationary). • The no free lunch theorem: if we don’t make any assumption about how the future is related to the past, we can’t learn. • Different algorithms can be formalised for different types of data distributions.

  5. Statistical Learning Theory and SVMs • In the real world, the complexity of the data usually requires more complex models (such as neural nets), which lose interpretability. • SVMs give the best of both worlds: they can be analysed mathematically, but they also encapsulate several types of more complex algorithms: • polynomial classifiers; • radial basis functions (RBFs); • some neural networks.

  6. SVMs: intuition • SVMs let us define a linear ‘no man’s land’ between two classes. • The no man’s land is defined by a separating hyperplane, and its distance to the closest points in space. • The wider the no man’s land, the better.

  7. SVMs: intuition • Figure from Ben-Hur & Weston, ‘A user’s guide to Support Vector Machines’.

  8. What are support vectors? • Support vectors are points in the data that lie closest to the classification hyperplane. • Intuitively, they are the points that will be most difficult to classify.

  9. The margin • The margin is the no man’s land: the area around the separating hyperplane without points in it. • The bigger the margin is, the better the classification will be (less chance of confusion). • The optimal classification hyperplane is the one with the biggest margin. How will we find it?

  10. Finding the separating hyperplane

  11. Hyperplanes as dot products • A hyperplane can be expressed in terms of a dot product: w·x + b = 0. • E.g., let’s take a simple hyperplane in the form of a line: y = −2x + 3. • This is also expressible in terms of a dot product: wᵀx = 3, where w = (2, 1)ᵀ and x = (x, y)ᵀ (because wᵀx = 2x + y, right?). • In other words, wᵀx − 3 = 0.

  12. Hyperplanes as dot products • The ‘normal’ vector w is perpendicular to the hyperplane. • Points ‘on the right’ of the line give wᵀx − 3 > 0. • Points ‘on the left’ of the line give wᵀx − 3 < 0.
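A quick numeric check of this sign test; a minimal sketch where the example points are my own, not from the slides:

```python
import numpy as np

# Hyperplane (line) y = -2x + 3 rewritten as w.x + b = 0 with w = (2, 1), b = -3
w = np.array([2.0, 1.0])
b = -3.0

def side(point):
    """Sign of w.x + b tells us which side of the line a point lies on."""
    return np.dot(w, point) + b

print(side(np.array([2.0, 2.0])))   #  3.0 > 0: 'on the right' of the line
print(side(np.array([0.0, 0.0])))   # -3.0 < 0: 'on the left' of the line
print(side(np.array([1.0, 1.0])))   #  0.0: exactly on the line
```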

  13. Distance of points to hyperplane • The distance of a point to the separating hyperplane is given by its projection onto the hyperplane. • This distance can be expressed in terms of the vector w (which is normal to the hyperplane). Figure from https://www.svm-tutorial.com/.

  14. Distance of points to hyperplane • The projection vector p is λw. Its length ||p|| is the distance of the point A (in the figure) to the hyperplane.
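That length can be computed as the projection of the point onto the unit normal w/||w||, a standard formula the slide leaves implicit. A minimal sketch, with an illustrative hyperplane and point of my own choosing:

```python
import numpy as np

# Hyperplane w.x + b = 0, here the line y = -2x + 3 from the earlier slides
w = np.array([2.0, 1.0])
b = -3.0

def distance_to_hyperplane(point):
    """||p||: length of the projection of the point onto the unit normal w/||w||,
    i.e. |w.x + b| / ||w||."""
    return abs(np.dot(w, point) + b) / np.linalg.norm(w)

print(distance_to_hyperplane(np.array([3.0, 2.0])))  # 5 / sqrt(5), about 2.236
```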

  15. Distance of points to hyperplane • The entire margin is twice the distance of the hyperplane to the nearest point(s). • So margin = 2||p||, with ||p|| the length of our ‘projection vector’. • But so far we’ve only considered the distance of a single point to the hyperplane. • By setting margin = 2||p|| for a point in one class, we run the risk of either having points of the other class within the margin, or simply not having an optimal hyperplane.

  16. The optimal hyperplane • The optimal hyperplane is in the middle of two hyperplanes H1 and H2 passing through two points of two different classes. • The optimal hyperplane is the one that maximises the margin (the distance between H1 and H2). • So we need to find H1 and H2 such that they linearly separate the data and the distance between H1 and H2 is maximal.

  17. SVMs: intuition • The two lines around the thick black line are H1 and H2. Figure from Ben-Hur & Weston, ‘A user’s guide to Support Vector Machines’.

  18. Defining the hyperplanes • Let H0 be the optimal hyperplane separating the data, with equation H0: w·x + b = 0. • Let H1 and H2 be two hyperplanes with H0 equidistant from H1 and H2: H1: w·x + b = δ and H2: w·x + b = −δ. • For now, those hyperplanes could be anywhere in the space.

  19. Defining the hyperplanes • H1 and H2 should actually separate the data into classes +1 and −1. • We are looking for hyperplanes satisfying the following constraints: H1: w·x_i + b ≥ 1 for x_i ∈ +1 and H2: w·x_i + b ≤ −1 for x_i ∈ −1. • Those conditions mean that there won’t be any points within the margin. • They can be combined into one condition: y_i(w·x_i + b) ≥ 1, where y_i is the class (+1 or −1) for point x_i: if x_i ∈ −1, then y_i (the output) is −1, and w·x_i + b ≤ −1 multiplied by y_i = −1 gives y_i(w·x_i + b) ≥ 1.
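The combined constraint is easy to check numerically; a minimal sketch with a hypothetical weight vector, bias and toy points (none of them from the slides):

```python
import numpy as np

# Hypothetical separating hyperplane and toy labelled points
w = np.array([2.0, 1.0])
b = -3.0

points = np.array([[3.0, 4.0], [2.0, 4.0], [0.0, 0.0], [-1.0, 1.0]])
labels = np.array([1, 1, -1, -1])

# Combined constraint: y_i (w.x_i + b) >= 1 for every training point
margins = labels * (points @ w + b)
print(margins)                # [7. 5. 3. 4.]
print(np.all(margins >= 1))   # True: no point falls inside the margin
```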

  20. Defining the hyperplanes • Figure from https://www.svm-tutorial.com/.

  21. Maximising the margin • It can be shown that the margin m between H1 and H2 can be computed as m = 2/||w|| (see the proof at https://www.svm-tutorial.com/2015/06/svm-understanding-math-part-3/). • This means that maximising the margin will mean minimising the norm ||w||.
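As a tiny numeric illustration of the formula, using the same hypothetical weight vector as in the earlier sketches:

```python
import numpy as np

w = np.array([2.0, 1.0])          # hypothetical weight vector
margin = 2.0 / np.linalg.norm(w)  # m = 2 / ||w||
print(margin)                     # about 0.894: shrinking ||w|| widens the margin
```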

  22. Solving the optimisation problem • Finding the optimal hyperplane thus involves solving the following optimisation problem: minimise ||w|| subject to y_i(w·x_i + b) ≥ 1. • The optimisation computation is complex, but it has a solution w = Σ_s θ_s x_s in terms of a set of parameters θ_s and a subset s of the data points x_s lying on the margin (the support vectors).
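This decomposition can be inspected in a fitted linear SVM. A sketch using scikit-learn on made-up toy data: in SVC, dual_coef_ plays the role of the θ_s parameters (with the class signs folded in), so multiplying it with the support vectors recovers w:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[3.0, 4.0], [2.0, 4.0], [4.0, 3.0],
              [0.0, 0.0], [-1.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C: roughly a hard margin

# w = sum_s theta_s x_s, where dual_coef_ holds the theta_s values
w = clf.dual_coef_ @ clf.support_vectors_
print(clf.support_vectors_)   # the points lying on the margin
print(w, clf.coef_)           # the two ways of obtaining w agree
```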

  23. Solving the optimisation problem • We wanted to satisfy the constraint y_i(w·x_i + b) ≥ 1. • We now know that w = Σ_s θ_s x_s is a solution which also minimises ||w||. • So we can plug our solution into the constraint equation: y_i(w·x_i + b) ≥ 1 ⟺ y_i(Σ_s θ_s (x_i·x_s) + b) ≥ 1.

  24. H0, H1, H2 • So we have now found H1 and H2: H1: Σ_s θ_s (x_i·x_s) + b = 1 and H2: Σ_s θ_s (x_i·x_s) + b = −1. • H0 is in the middle of H1 and H2, so that H0: Σ_s θ_s (x_i·x_s) + b = 0.

  25. The final decision function • The final decision function, expressed in terms of the parameters θ_s and the support vectors x_s, can be written as: f(x) = sign(Σ_s θ_s (x·x_s) + b). • Now, whenever we encounter a new point, we can put it through f(x) to find out its class. • The most important thing about this function is that it depends only on dot products between points and support vectors.
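The decision function can be written out explicitly from a fitted model's support vectors. A self-contained sketch on the same made-up toy data as above (again, dual_coef_ stands in for the θ_s, with class signs folded in):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 4.0], [2.0, 4.0], [4.0, 3.0],
              [0.0, 0.0], [-1.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

def decide(x):
    """f(x) = sign(sum_s theta_s (x . x_s) + b): only support vectors are used."""
    dots = clf.support_vectors_ @ x                       # dot products x . x_s
    return np.sign(clf.dual_coef_[0] @ dots + clf.intercept_[0])

new_point = np.array([2.5, 3.0])
print(decide(new_point), clf.predict([new_point])[0])     # same class either way
```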

  26. Maximal vs soft margin classifier • A soft margin classifier allows us to accept some misclassifications when using an SVM. • Imagine a case where the data is nearly linearly separable, but not quite... • We would still like the classifier to find a separating function, even if some points get misclassified.

  27. The trade-off between margin size and error • Generally, there is a trade-off between minimising the number of points falling ‘in the wrong class’ and maximising the margin.

  28. The hinge loss function • The hinge loss function: max(0, 1 − y_i(w·x_i − b)). • Remember that y_i(w·x_i − b) is the constraint on our hyperplanes. • We want y_i(w·x_i − b) ≥ 1 for proper classification.

  29. The hinge loss function • If x_i lies on the correct side of the hyperplane (y_i(w·x_i − b) ≥ 1), the hinge loss function returns 0. Example: max(0, 1 − 1.2) = 0. • If x_i is on the incorrect side of the hyperplane (y_i(w·x_i − b) < 1), the loss is proportional to the distance of the point to the margin. Examples: max(0, 1 − 0.8) = 0.2 (margin violation); max(0, 1 − (−1.2)) = 2.2 (misclassification).
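The worked examples above can be reproduced with a one-line implementation; a minimal sketch:

```python
def hinge_loss(score):
    """max(0, 1 - y_i (w.x_i - b)), where `score` stands for y_i (w.x_i - b)."""
    return max(0.0, 1.0 - score)

print(round(hinge_loss(1.2), 2))    # 0.0  correct side, outside the margin
print(round(hinge_loss(0.8), 2))    # 0.2  margin violation
print(round(hinge_loss(-1.2), 2))   # 2.2  misclassification
```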

  30. Revised optimisation problem • Taking into account the hinge function, our problem has become one where we must solve: min [ (1/n) Σ_{i=1}^n max(0, 1 − y_i(w·x_i − b)) + λ||w||² ], where λ regulates how many classification errors are acceptable. • Traditionally, SVM classifiers use a parameter C = 1/(2λn) instead of λ. Multiplying the function above by 1/(2λ), we get: min [ C Σ_{i=1}^n max(0, 1 − y_i(w·x_i − b)) + (1/2)||w||² ].
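This C is the regularisation parameter exposed by standard SVM implementations. A sketch with scikit-learn on made-up, overlapping data, showing how C trades margin width against classification errors:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds: not linearly separable (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Smaller C tolerates more margin violations and typically keeps ||w|| small
    # (wide margin); larger C penalises every violation heavily (narrow margin).
    print(C, np.linalg.norm(w), len(clf.support_))
```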

  31. Kernels

  32. The kernel trick • Sometimes, data is not linearly separable in the original space, but it would be if we transformed the datapoints. • Let’s take a simple example. We have the following datapoints (point: class): (−1, 3): +1; (−2, 2): −1; (0.5, 1): +1; (0, −1): −1; (1, 4): +1; (1, 1): −1. • Note that all points of class +1 are ‘inside’ a parabola defined by y = 2x², while the −1 points are ‘around’ the parabola. The points are not linearly separable.
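One way to see the trick at work on these six points. In this sketch the explicit feature map (x, y) ↦ (x², y) is my own illustrative choice (the slides only say that a suitable transformation makes the data separable); a degree-2 polynomial kernel is shown as the implicit alternative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, 3], [-2, 2], [0.5, 1], [0, -1], [1, 4], [1, 1]], dtype=float)
y = np.array([1, -1, 1, -1, 1, -1])

# Explicit feature map (x, y) -> (x**2, y): the parabola y = 2x**2 becomes a
# straight line in the new space, so a linear SVM can now separate the classes.
X_mapped = np.column_stack([X[:, 0] ** 2, X[:, 1]])
linear = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(linear.predict(X_mapped))   # matches y: separable after the mapping

# A degree-2 polynomial kernel performs a comparable mapping implicitly
poly = SVC(kernel="poly", degree=2, coef0=1, C=1e6).fit(X, y)
print(poly.predict(X))            # should also classify the six points correctly
```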

  33. The kernel trick
