
IN5550: Neural Methods in Natural Language Processing
Lecture 2 - Supervised Machine Learning: from Linear Models to Neural Networks
Andrey Kutuzov, Vinit Ravishankar, Jeremy Barnes, Lilja Øvrelid, Stephan Oepen, & Erik Velldal
University of Oslo


Linear classifiers: Simple linear function

f(x; W, b) = x·W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ R^(d_in);
  ◮ each training instance is represented with d_in features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ R^(d_in × d_out);
  ◮ d_out is the dimensionality of the desired prediction (the number of classes);
  ◮ bias vector b ∈ R^(d_out);
  ◮ the bias 'shifts' the function output in some direction.
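
As a minimal sketch (the dimensionalities and values below are chosen purely for illustration), the linear function of equation (1) is a one-liner in numpy:

```python
import numpy as np

d_in, d_out = 5, 3                   # illustrative dimensionalities

rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))   # weight matrix, one column per output class
b = np.zeros(d_out)                  # bias vector

def f(x, W, b):
    """Linear scoring function f(x; W, b) = x.W + b."""
    return x @ W + b

x = rng.normal(size=d_in)            # one instance with d_in features
print(f(x, W, b).shape)              # (3,): one score per output class
```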

Linear classifiers: Training of a linear classifier

f(x; W, b) = x·W + b
θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.
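
As a hedged aside (not from the slides): for classification, 'closest to the gold labels' is often measured simply as accuracy over the n training instances:

```python
import numpy as np

def accuracy(y_hat, y):
    """Fraction of instances where the predicted label matches the gold label."""
    return float(np.mean(np.asarray(y_hat) == np.asarray(y)))

# Toy example: 3 of the 4 predictions agree with the gold labels.
print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))   # 0.75
```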

Linear classifiers

Here, the training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y ∈ {black, red}):

◮ The parameters of f(x; W, b) = x·W + b define the line (or, in higher dimensions, the hyperplane) separating the instances.
◮ This decision boundary is, in effect, our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines for 3 values of b are shown. Which one is the best? (See the sketch below for how b shifts the boundary.)
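
A tiny sketch (with made-up weights, and an assumed mapping from positive scores to 'red') of how the sign of x·W + b places a 2-feature point on one side of the line, and how changing b shifts the decision boundary:

```python
import numpy as np

# Hypothetical parameters for the 2-feature example.
W = np.array([1.0, -2.0])    # one weight per feature
x = np.array([0.5, 0.1])     # a single instance with features [x0, x1]

for b in (-1.0, 0.0, 1.0):   # three candidate bias values, as on the slide
    score = x @ W + b
    label = "red" if score > 0 else "black"     # assumed label mapping
    print(f"b={b:+.1f}: score={score:+.2f} -> {label}")
```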

Linear classifiers: How can we represent our data (X)?

◮ Imagine you have a film review and want to know whether the reviewer likes the film or hates it.
◮ What are the simplest features that would help you decide this?
◮ Words like good, bad, great, terrible, etc.
◮ Maybe actors' names (Meryl Streep, Steven Seagal).
◮ The simplest way to represent these words as features is a bag-of-words representation.

Linear classifiers: Bag of words

◮ Each word from a pre-defined vocabulary D can be a separate feature:
  ◮ how many times the word a appears in document i;
  ◮ or a binary flag {1, 0} for whether a appears in i at all.
◮ This scheme is called 'bag of words' (BoW); a small code sketch follows below.
◮ For example, if we have 1000 words in the vocabulary:
  ◮ x_i ∈ R^1000
  ◮ x_i = [20, 16, 0, 10, 0, ..., 3]
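
A minimal sketch of building a BoW count vector; the small vocabulary and the tokens are made up for illustration:

```python
import numpy as np

def bow_vector(tokens, vocabulary):
    """Count how many times each vocabulary word occurs in the token list."""
    index = {word: i for i, word in enumerate(vocabulary)}
    x = np.zeros(len(vocabulary), dtype=int)
    for token in tokens:
        if token in index:          # out-of-vocabulary tokens are ignored
            x[index[token]] += 1
    return x

vocabulary = ["good", "bad", "great", "terrible", "film"]
tokens = "a great film , a really great cast".lower().split()
print(bow_vector(tokens, vocabulary))   # [0 0 2 0 1]
```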

Linear classifiers

◮ The bag-of-words feature vector of x can be interpreted as a sum of one-hot vectors (o), one per token:
◮ The vocabulary D extracted from the example sentence on the slide contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
◮ etc.
◮ x_i = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are each mentioned 2 times)
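
A short sketch of the same idea: the BoW vector is the sum of per-token one-hot vectors. Only the 10-word vocabulary comes from the slide; the token sequence below is made up and does not reproduce the slide's example sentence.

```python
import numpy as np

vocabulary = ["-", "by", "in", "most", "norway", "road",
              "the", "tourists", "troll", "visited"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(token):
    o = np.zeros(len(vocabulary), dtype=int)
    o[index[token]] = 1
    return o

# Illustrative tokens only; the original slide showed its own example sentence.
tokens = ["the", "troll", "road", "in", "norway"]
x = sum(one_hot(t) for t in tokens)   # BoW vector = sum of one-hot vectors
print(x)                              # [0 0 1 0 1 1 1 0 1 0]
```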

Linear classifiers

Can we interpret the different parts of a learned model as representations of the data?

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ R^(d_in)).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ R^n).
◮ Together, these learned representations form the W matrix, part of θ.
◮ Thus, it contains information both about the instances and about their features (more on this later).
◮ Feature engineering is deciding which features of the instances we will use during training.

Linear classifiers

[Figure: an illustration with sentiment classes negative / positive / neutral and example features such as 'great', 'best', 'terrible', 'worst', 'Seagal', 'the', 'road'.]

Linear classifiers: Overview of Linear Models

Linear classifiers: Output of binary classification

f(x; W, b) = x·W + b

Binary decision (d_out = 1):

◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negative outputs to −1 and all positive outputs to 1 (the sign function).

θ = (W ∈ R^(d_in), b ∈ R^1)
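
A minimal sketch of the binary case with the sign decision rule (the weights, bias, and input are made up):

```python
import numpy as np

def predict_binary(x, W, b):
    """Binary linear classifier: map the raw score x.W + b to -1 or +1."""
    score = x @ W + b
    return 1 if score > 0 else -1

# Hypothetical parameters and a single BoW-style input.
W = np.array([0.4, -0.3, 0.9, -1.2])   # one weight per vocabulary word
b = 0.5
x = np.array([0, 1, 0, 0])             # instance in which only word 1 occurs
print(predict_binary(x, W, b))          # score = -0.3 + 0.5 = 0.2 -> 1
```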

Linear classifiers

[Figure: a worked binary example; the raw score x·W + b comes out to 1.5, and sign(1.5) = 1.]

Linear classifiers: Output of multi-class classification

f(x; W, b) = x·W + b

Multi-class decision (d_out = k):

◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value 1, the others are zeros, for example: ŷ = [0, 0, 1, 0] (for k = 4).

θ = (W ∈ R^(d_in × d_out), b ∈ R^(d_out))
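
A small sketch of the multi-class case (all parameter values are made up): each class gets a raw score, and argmax turns the scores into a one-hot prediction.

```python
import numpy as np

d_in, k = 4, 3                        # 4 features, 3 classes

# Hypothetical parameters.
W = np.array([[ 0.2, -0.1,  0.4],
              [ 1.0,  0.3, -0.5],
              [-0.7,  0.8,  0.1],
              [ 0.0,  0.2,  0.9]])    # shape (d_in, k)
b = np.array([0.1, 0.0, -0.2])

x = np.array([1, 0, 2, 0])            # a BoW-style instance
scores = x @ W + b                    # one raw score per class
y_hat = np.zeros(k, dtype=int)
y_hat[np.argmax(scores)] = 1          # one-hot prediction
print(scores)                         # roughly [-1.1  1.5  0.4]
print(y_hat)                          # [0 1 0]
```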

Linear classifiers

[Figure: a worked multi-class example; the vector of raw class scores is passed through argmax to pick the predicted class.]

Linear classifiers: Log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...using a squashing function, for example the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!

[Figure: plot of σ(x).]
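
A minimal sketch of turning a raw binary score into a probability with the sigmoid (the parameters and input are made up):

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary classifier score f(x) = x.W + b.
W = np.array([0.4, -0.3, 0.9])
b = 0.1
x = np.array([1, 0, 1])
score = x @ W + b
print(sigmoid(score))   # ~0.80: fairly confident in the positive class
```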

Linear classifiers

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4).
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function (a sketch follows below):

ŷ = softmax(x·W + b)    (4)

ŷ[i] = e^((x·W + b)[i]) / Σ_j e^((x·W + b)[j])    (5)
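
A short sketch of the softmax over raw class scores (the scores reuse the made-up values from the example above; subtracting the maximum is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into a probability distribution."""
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.4, 0.1, 0.9, 0.5])   # raw scores for k = 4 classes
probs = softmax(scores)
print(probs, probs.sum())                  # probabilities sum to ~1
print(int(np.argmax(probs)))               # predicted class index: 2
```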
