Convolutional Neural Networks for Language
CS 6956: Deep Learning for NLP
Features from text

Example: Sentiment classification
The goal: Is the sentiment of a sentence positive, negative, or neutral?
  "The film is fun and is host to some truly excellent sequences"
Approach: Train a multiclass classifier.
What features? Some words and ngrams are informative, while some are not.
We need to:
1. Identify informative local information
2. Aggregate it into a fixed-size vector representation
Convolutional Neural Networks
Designed to:
1. Identify local predictors in a larger input
2. Pool them together to create a feature representation
3. And possibly repeat this in a hierarchical fashion
In the NLP context, CNNs help identify predictive ngrams for a task.
Overview
- Convolutional Neural Networks: A brief history
- The two operations in a CNN
  – Convolution
  – Pooling
- Convolution + Pooling as a building block
- CNNs in NLP
- Recurrent networks vs Convolutional networks
Convolutional Neural Networks: Brief history
- Hubel and Wiesel, 1950s/60s: The mammalian visual cortex contains neurons that respond to small regions and specific patterns in the visual field
- Fukushima 1980, Neocognitron: Directly inspired by Hubel and Wiesel
  – Key idea: locality of features in the visual cortex is important; integrate them locally and propagate them to further layers
  – Two operations: a convolutional layer that reacts to specific patterns, and a down-sampling layer that aggregates information
- LeCun 1989-today, Convolutional Neural Network: A supervised version
  – Related to convolution kernels in computer vision
  – Very successful on handwriting recognition and other computer vision tasks
- Has become better over recent years with more data and computation
  – Krizhevsky et al. 2012: Object detection with ImageNet
  – The de facto feature extractor for computer vision
First arose in the context of vision. Hubel and Wiesel shared the Nobel Prize in Physiology or Medicine, 1981.
Convolutional Neural Networks: Brief history
- Introduced to NLP by Collobert et al., 2011
  – Used as a feature extraction system for semantic role labeling
- Since then, several other applications such as sentiment analysis, question classification, etc.
  – Kalchbrenner et al. 2014, Kim 2014
CNN terminology
- Filter
  – A function that transforms an input matrix/vector into a scalar feature
  – A filter is a learned feature detector (also called a feature map)
- Channel
  – In computer vision, color images have red, blue and green channels
  – In general, a channel represents a "view of the input" that captures information about the input independent of other channels
  – For example, different kinds of word embeddings could be different channels
  – Channels could themselves be produced by previous convolutional layers
- Receptive field
  – The region of the input that a filter currently focuses on

The terminology shows the field's computer vision and signal processing origins.
What is a convolution?

Let's see this using an example for vectors. We will generalize this to matrices and beyond, but the general idea remains the same.

An example using vectors:
  A vector 𝐲 = [2, 3, 1, 3, 2, 1]
  A filter 𝐠 of size 𝑜 (here, 𝑜 = 3): 𝐠 = [1, 2, 1]

The filter moves across the vector. At each position, the output is the dot product of the filter with a slice of the vector of that size:

  outputᵢ = Σⱼ gⱼ ⋅ y₍ᵢ₊ⱼ₋⌈ₒ⁄₂⌉₎

where j ranges over the filter positions and 𝐲 is zero-padded at the beginning and the end. The output is also a vector:

  output = [7, 9, 8, 9, 8, 4]
What is a convolution?

The same idea applies to matrices as well. Given an input matrix and a filter, the filter moves across the matrix; at each position, the output is the dot product of the filter with a slice of the matrix of that size. The outputs at all positions together form the result of the convolution. (The original slides show this step by step in figures.)
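The matrix case can be sketched the same way (a minimal illustration; the input `X` and filter `F` are made-up toy values, and this version is a narrow convolution with no padding):

```python
def conv2d(X, F):
    """Narrow 2-D convolution (no padding): slide filter F across matrix X;
    each output entry is the elementwise dot product of F with a slice of X."""
    fr, fc = len(F), len(F[0])
    rows = len(X) - fr + 1
    cols = len(X[0]) - fc + 1
    return [[sum(F[a][b] * X[i + a][j + b]
                 for a in range(fr) for b in range(fc))
             for j in range(cols)]
            for i in range(rows)]

# Toy 4x4 input and 2x2 filter (a diagonal-sum detector)
X = [[1, 2, 3, 0],
     [0, 1, 2, 3],
     [3, 0, 1, 2],
     [2, 3, 0, 1]]
F = [[1, 0],
     [0, 1]]
print(conv2d(X, F))  # → [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

Note that a 4x4 input convolved with a 2x2 filter gives a 3x3 output when there is no padding.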
Pooling: An aggregation operation
- A convolution produces a vector/matrix that captures properties of each window
- Pooling combines this information to produce a down-sampled vector/matrix
  – Typically using the maximum or the average value within a window
- Intuition
  – A filter is a feature detector that discovers how well each window matches a feature of interest
  – The most important features should be recognized regardless of their location
  – Answer: Pool the information from different windows together
What is pooling?

Continuing the vector example: the convolution of 𝐲 = [2, 3, 1, 3, 2, 1] with the filter 𝐠 = [1, 2, 1] gave the output [7, 9, 8, 9, 8, 4]. The pooling operation can be applied using a window as well.

Example 1: Max pooling with window size 3 gives [9, 9, 9, 9]
Example 2: Average pooling with window size 3 gives [8, 8.67, 8.33, 7]
Example 3: Max pooling with window size = length of the vector gives 9

Important note: there are no learned parameters for the pooling operation. It is a deterministic operation.
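The pooling examples above can be sketched as one generic window operation (a minimal illustration; the function name `pool` and the stride-1, no-padding convention are assumptions matching the slides' windows):

```python
def pool(v, size, op=max):
    """Pooling with a sliding window (stride 1, no padding).
    op picks one value per window: max by default, or an average.
    No learned parameters; this is a deterministic operation."""
    return [op(v[i:i + size]) for i in range(len(v) - size + 1)]

avg = lambda w: sum(w) / len(w)

v = [7, 9, 8, 9, 8, 4]   # the convolution output from the slides
print(pool(v, 3))         # max pooling, window 3 → [9, 9, 9, 9]
print(pool(v, 3, avg))    # average pooling, window 3
print(pool(v, len(v)))    # max pooling over the whole vector → [9]
```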
Typical kinds of pooling
- Max pooling
  – Take the maximum value of the results of the convolution
- Average pooling
  – Uses the average to pool instead of the max
- K-max pooling
  – Take the top k values (for a fixed k)
  – A generalization of max pooling
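K-max pooling can be sketched as follows (a minimal illustration; the function name is made up, and it keeps the selected values in their original input order, which is the usual formulation):

```python
def k_max_pool(v, k):
    """K-max pooling: keep the k largest values of v, preserving their
    original order in the input. With k = 1 this is max pooling."""
    idx = sorted(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [v[i] for i in idx]

print(k_max_pool([7, 9, 8, 9, 8, 4], 2))  # → [9, 9]
print(k_max_pool([7, 9, 8, 9, 8, 4], 3))  # → [9, 8, 9]
```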
Convolution + Pooling = one layer

- Input: a matrix. Convolution will operate over windows of this matrix.
  (This could be extended to general tensors as well.)
- The window size defines the receptive field
  – We will refer to the iᵗʰ window as xᵢ
- A filter is defined by some parameters (that will be learned)
  – In general, a matrix u of the same shape as the window, and a bias b
- Convolution: Iterate over all windows and apply the filter
  – Typically with a non-linearity g (e.g. ReLU): pᵢ = g(u ⋅ xᵢ + b)
- Pooling: Aggregate the pᵢ's into a down-sampled version, sometimes a single number
- Typically, there are many filters, each of which is pooled independently
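The whole block above can be sketched end to end (a minimal illustration; the function name, the flattened windows, and all numeric values are made-up toy data, with ReLU as the non-linearity and max as the pooling operation):

```python
def conv_pool_layer(windows, filters, biases):
    """One convolution + pooling block: apply each filter u (with bias b
    and a ReLU non-linearity) to every window x_i, then max-pool each
    filter's outputs independently. Output size = number of filters."""
    relu = lambda z: max(0.0, z)
    dot = lambda u, x: sum(a * b for a, b in zip(u, x))
    features = []
    for u, b in zip(filters, biases):
        p = [relu(dot(u, x) + b) for x in windows]  # convolution: p_i = g(u·x_i + b)
        features.append(max(p))                     # pooling over all positions
    return features

# Toy flattened windows and two filters
windows = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
filters = [[1.0, -1.0], [2.0, 0.0]]
print(conv_pool_layer(windows, filters, [0.0, -1.0]))  # → [1.0, 1.0]
```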
Hyperparameters
- Filter sizes: How big should the filter be?
  – Typically 3x3, 5x5, etc.
- Stride: How does the filter move along the input?
  – It could skip some steps, or not.
- How many filters should there be?
- Padding: Should there be padding or not? If so, should the padding be zeros or random?
- How big should the pooling window be?
- What kind of pooling: average, max, L2 norm?
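Filter size, stride, and padding jointly determine the output size. The standard formula can be sketched as follows (a minimal illustration; the function name is made up):

```python
def conv_output_length(n, filter_size, stride=1, padding=0):
    """Output length of a 1-D convolution over an input of length n,
    with the given filter size, stride, and symmetric padding:
    floor((n + 2*padding - filter_size) / stride) + 1."""
    return (n + 2 * padding - filter_size) // stride + 1

print(conv_output_length(6, 3, stride=1, padding=1))  # → 6 (wide, "same" size)
print(conv_output_length(6, 3))                       # → 4 (narrow, no padding)
print(conv_output_length(6, 3, stride=2))             # → 2 (skipping steps)
```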
Example: LeNet

An example network that uses these building blocks: LeNet-5, proposed by LeCun et al., 1998, for handwriting recognition. It had several levels of convolution + pooling.
Convolutional Neural Networks in NLP
- Goal: To represent a sequence of words as a feature vector
- Approach:
  – Represent the sequence of words by sequence(s) of embeddings
  – Convolve with several filters
  – Pool across the sequence to get a feature vector of a fixed dimensionality
Convolutional Neural Networks in NLP

Suppose we want to classify the sentence "I ate cake today". Each word is mapped to a word embedding, and the sequence is padded at both ends. A filter is applied to each window of embeddings as it moves across the sentence. Convolution with one filter, followed by pooling across the sentence (often max pooling), produces one feature. There can be several filters (sometimes called kernels, or feature maps); convolution with many filters, followed by pooling, gets a feature vector.
Convolution + pooling example
1. Each word is embedded into a 2d vector; the window concatenates them
2. A 6x3 filter with a tanh non-linearity
3. Max pooling over each dimension to produce a 3-dimensional vector
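The three steps above can be sketched as a tiny sentence encoder (a minimal illustration: the function name, the toy 2-d embeddings for "I ate cake today", and all filter weights are made-up values; the 6x3 filter is written as three 6-dimensional filters, one per output dimension):

```python
import math

def text_cnn_features(embeddings, filters, biases, win=3):
    """CNN sentence encoder sketch: concatenate each window of `win`
    word embeddings, apply every filter with a tanh non-linearity,
    then max-pool over all positions. Returns one feature per filter."""
    dim = len(embeddings[0])
    pad = [[0.0] * dim] * (win // 2)          # zero padding at both ends
    seq = pad + list(embeddings) + pad
    # each window is the concatenation of `win` consecutive embeddings
    windows = [sum((seq[i + j] for j in range(win)), [])
               for i in range(len(embeddings))]
    feats = []
    for u, b in zip(filters, biases):
        acts = [math.tanh(sum(a * x for a, x in zip(u, w)) + b)
                for w in windows]             # convolution with tanh
        feats.append(max(acts))               # max pooling across the sentence
    return feats

# "I ate cake today" with toy 2-d embeddings; three 6-d filters → 3 features
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filters = [[0.5] * 6, [1.0, -1.0] * 3, [0.0, 1.0] * 3]
print(text_cnn_features(emb, filters, [0.0, 0.0, 0.0]))
```

The resulting fixed-size vector can then feed a classifier, regardless of the sentence length.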
Examples of convolution + pooling
(Figure from Goldberg 2017.) Think of convolutions as feature extractors:
- A narrow convolution (i.e. without any padding), in the vector concatenation notation
- A wide convolution (i.e. with padding), in the vector stacking notation
Features from text
- If we want to classify text, we need to represent it in some feature space
- We have (at least) two ways to get features from text using a neural network:
  – Recurrent Neural Networks
  – Convolutional Neural Networks
RNNs vs CNNs
- RNNs model non-Markovian dependencies
  – Can look at (effectively) infinite windows around a target word
  – Can capture sequential patterns in such windows
- CNNs capture informative ngrams
  – Also gappy ngrams
  – But also account for local ordering patterns
- How do they compare?
  – Both are trained end-to-end with a task loss
  – RNNs (specifically, BiRNNs) are more popular today…
  – …but this can change: CNNs allow for more parallelism, and so may be better suited for certain hardware/software improvements
RNNs and CNNs as building blocks

Think of them as Lego bricks for constructing larger architectures. Both are computation graphs, so you can mix and match them with other computation graphs to create larger neural networks. They are general tools that can be combined with other ideas we have seen and will see, e.g. contextual embeddings, attention, etc.