

SLIDE 1

CS 6956: Deep Learning for NLP

Convolutional Neural Networks for Language

SLIDES 2-4

Features from text

Example: Sentiment classification.
The goal: Is the sentiment of a sentence positive, negative, or neutral?

  "The film is fun and is host to some truly excellent sequences"

Approach: Train a multiclass classifier. What features? Some words and ngrams are informative, while some are not. We need to:

  1. Identify informative local information
  2. Aggregate it into a fixed-size vector representation
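As a sketch of what "informative local information" looks like, the small helper below (the function name is mine, for illustration only) enumerates the contiguous ngram windows a classifier could draw features from:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent = "The film is fun and is host to some truly excellent sequences".split()
print(ngrams(sent, 2)[:3])  # [('The', 'film'), ('film', 'is'), ('is', 'fun')]
```

A sentence of length L yields L - n + 1 such windows; the question the next slides answer is how to score each window and combine the scores into one fixed-size vector.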

SLIDE 5

Convolutional Neural Networks

Designed to:

  • 1. Identify local predictors in a larger input
  • 2. Pool them together to create a feature representation
  • 3. Possibly repeat this in a hierarchical fashion

In the NLP context, this helps identify predictive ngrams for a task.

SLIDE 6

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks


SLIDE 8

Convolutional Neural Networks: Brief history

First arose in the context of vision.

  • Hubel and Wiesel, 1950s/60s: The mammalian visual cortex contains neurons that respond to small regions and specific patterns in the visual field
  • Fukushima 1980, Neocognitron: Directly inspired by Hubel and Wiesel
    – Key idea: locality of features in the visual cortex is important; integrate them locally and propagate them to further layers
    – Two operations: a convolutional layer that reacts to specific patterns, and a down-sampling layer that aggregates information
  • LeCun 1989-today, Convolutional Neural Network: A supervised version
    – Related to convolution kernels in computer vision
    – Very successful on handwriting recognition and other computer vision tasks
  • Has become better over recent years with more data and computation
    – Krizhevsky et al. 2012: Image classification on ImageNet
    – The de facto feature extractor for computer vision

SLIDE 9

Convolutional Neural Networks: Brief history

  • Hubel and Wiesel, 1950s/60s: The mammalian visual cortex contains neurons that respond to small regions and specific patterns in the visual field

First arose in the context of vision. David H. Hubel and Torsten Wiesel shared the Nobel Prize in Physiology or Medicine, 1981.


SLIDE 13

Convolutional Neural Networks: Brief history

  • Introduced to NLP by Collobert et al., 2011
    – Used as a feature extraction system for semantic role labeling
  • Since then, several other applications such as sentiment analysis, question classification, etc.
    – Kalchbrenner et al. 2014, Kim 2014

SLIDES 14-17

CNN terminology

  • Filter
    – A function that transforms an input matrix/vector into a scalar feature
    – A filter is a learned feature detector (also called a feature map)
  • Channel
    – In computer vision, color images have red, green and blue channels
    – In general, a channel represents a "view of the input" that captures information about the input independent of other channels
      • For example, different kinds of word embeddings could be different channels
      • Channels could themselves be produced by previous convolutional layers
  • Receptive field
    – The region of the input that a filter currently focuses on

This terminology shows the field's computer vision and signal processing origins.

SLIDE 18

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks

SLIDE 19

What is a convolution?

Let's see this using an example for vectors. We will generalize this to matrices and beyond, but the general idea remains the same.

SLIDES 20-32

What is a convolution?

An example using vectors:

  • A vector y = [2, 3, 1, 3, 2, 1]
  • A filter g = [1, 2, 1], of size k (here, k = 3)

The output is also a vector:

  output_i = Σ_j g_j ⋅ y_{i+j-1}

where the sum runs over the k entries of the filter and the indices range over the zero-padded input (padding at the beginning and at the end), so that the output has the same length as y.

The filter moves across the vector. At each position, the output is the dot product of the filter with a slice of the vector of that size. Here the result is [7, 9, 8, 9, 8, 4].

SLIDES 33-43

What is a convolution?

The same idea applies to matrices as well. Given an input matrix and a filter, the filter moves across the matrix. At each position, the output is the dot product of the filter with a slice of the matrix of that size. The result of the convolution is another matrix.

SLIDE 44

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks

SLIDE 45

Pooling: An aggregation operation

  • A convolution produces a vector/matrix that captures properties of each window
  • Pooling combines this information to produce a down-sampled vector/matrix
    – Typically using the maximum or the average value within a window
  • Intuition
    – A filter is a feature detector that discovers how well each window matches a feature of interest
    – The most important features should be recognized regardless of their location
    – Answer: Pool the information from different windows together

SLIDES 46-57

What is pooling?

The pooling operation can be applied using a window as well. Continuing the vector example: convolving y = [2, 3, 1, 3, 2, 1] with the filter g = [1, 2, 1] produced the output [7, 9, 8, 9, 8, 4].

  • Example 1: Max pooling with window size 3 gives [9, 9, 9, 9]
  • Example 2: Average pooling with window size 3 gives [8, 8.7, 8.3, 7] (rounded)
  • Example 3: Max pooling with window size = length of the vector gives a single value, 9

Important note: there are no learned parameters for the pooling operation. It is a deterministic operation.
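All three examples follow one pattern: slide a window and aggregate it with a fixed, parameter-free operation. A sketch (names mine):

```python
def pool(v, size, op):
    """Slide a window of `size` over v (stride 1, no padding) and aggregate
    each window with `op` -- a deterministic operation with no learned
    parameters."""
    return [op(v[i:i + size]) for i in range(len(v) - size + 1)]

conv_out = [7, 9, 8, 9, 8, 4]              # output of the earlier convolution
avg = lambda w: round(sum(w) / len(w), 1)

print(pool(conv_out, 3, max))              # [9, 9, 9, 9]
print(pool(conv_out, 3, avg))              # [8.0, 8.7, 8.3, 7.0]
print(pool(conv_out, len(conv_out), max))  # [9]  (global max pooling)
```

With window size equal to the vector length, pooling collapses the whole sequence to a single number, which is the common choice when a fixed-size feature is needed.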

SLIDE 58

Typical kinds of pooling

  • Max pooling
    – Take the maximum value of the results of the convolution
  • Average pooling
    – Uses the average to pool instead of the max
  • K-max pooling
    – Take the top k values (for a fixed k)
    – A generalization of max pooling
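K-max pooling can be sketched with the standard library; keeping the selected values in their original order is one common design choice (used, for instance, in Kalchbrenner et al. 2014), not the only option:

```python
import heapq

def k_max_pool(v, k):
    """Keep the k largest values of v, returned in their original order."""
    idx = heapq.nlargest(k, range(len(v)), key=v.__getitem__)
    return [v[i] for i in sorted(idx)]

print(k_max_pool([7, 9, 8, 9, 8, 4], 1))  # [9]         (= max pooling)
print(k_max_pool([7, 9, 8, 9, 8, 4], 2))  # [9, 9]
print(k_max_pool([7, 9, 8, 9, 8, 4], 3))  # [9, 8, 9]   (original order kept)
```

With k = 1 this reduces to max pooling, which is why it is a generalization.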

SLIDE 59

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks

SLIDES 60-65

Convolution + Pooling = one layer

  • Input: a matrix. Convolution will operate over windows of this matrix.
    – This could be extended to general tensors as well
  • The window size defines the receptive field
    – We will refer to the i-th window as x_i
  • A filter is defined by some parameters (that will be learned)
    – In general, a matrix u of the same shape as the window, and a bias b
  • Convolution: Iterate over all windows and apply the filter
    – Typically with a non-linearity g (e.g. ReLU): p_i = g(u ⋅ x_i + b)
  • Pooling: Aggregate the p_i's into a down-sampled version, sometimes a single number
  • Typically, there are many filters, each of which is pooled independently
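A minimal sketch of one such layer, putting the pieces together (all names are mine; real implementations use tensor libraries and learn the filters by gradient descent):

```python
def conv_pool_layer(X, filters, biases, win):
    """One convolution + pooling block. For each filter (u, b), compute
    p_i = relu(u . x_i + b) over every window x_i of `win` consecutive rows
    of X (flattened into one vector), then max-pool each filter's p_i
    across all positions, giving one feature per filter."""
    relu = lambda z: max(0.0, z)
    feats = []
    for u, b in zip(filters, biases):
        acts = [relu(sum(a * w for a, w in zip(
                    [v for row in X[i:i + win] for v in row], u)) + b)
                for i in range(len(X) - win + 1)]
        feats.append(max(acts))   # max pooling: one number per filter
    return feats

X = [[1, 0], [0, 1], [1, 1]]      # a 3x2 input matrix
print(conv_pool_layer(X, [[1, 1, 1, 1]], [0.0], 2))  # [3.0]
```

With many filters, the layer's output is one pooled feature per filter: a fixed-size vector regardless of how many rows X has.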

SLIDE 66

Hyperparameters

  • Filter sizes: How big should the filter be?
    – Typically 3x3, 5x5, etc.
  • Stride: How does the filter move along the input?
    – It could skip some steps, or not
  • How many filters should there be?
  • Padding: Should there be padding or not? If so, should the padding be zeros or random?
  • How big should the pooling window be?
  • What kind of pooling: average, max, L2 norm?

SLIDE 67

Example: LeNet

An example network that uses these building blocks: LeNet-5, proposed by LeCun in 1998 for handwriting recognition. It had several levels of convolution + pooling.

SLIDE 68

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks

SLIDE 69

Convolutional Neural Networks in NLP

  • Goal: To represent a sequence of words as a feature vector
  • Approach:
    – Represent the sequence of words by sequence(s) of embeddings
    – Convolve with several filters
    – Pool across the sequence to get a feature vector of a fixed dimensionality

SLIDES 70-80

Convolutional Neural Networks in NLP

Goal: To represent a sequence of words as a feature vector.

Suppose we want to classify the sentence "I ate cake today":

  1. Look up the word embedding for each word, with padding at both ends
  2. Apply a filter to each window of embeddings (convolution)
  3. Pool across the sentence (often max pooling) to get one feature per filter

There can be several filters (sometimes called kernels, or feature maps); convolving with many filters and pooling each across the sentence yields a feature vector.
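The pipeline on these slides can be mocked up end to end. The embeddings and filter weights below are made-up illustrative values, and in practice both would be learned:

```python
# Hypothetical 2-d word embeddings (values invented for illustration).
emb = {"<pad>": [0.0, 0.0], "I": [0.1, 0.3], "ate": [0.7, 0.2],
       "cake": [0.5, 0.9], "today": [0.4, 0.1]}

def sentence_features(words, filters, win=2):
    """Embed each word (padding both ends), convolve every filter over all
    windows of `win` consecutive embeddings (concatenated), and max-pool
    across the sentence: one feature per filter, for any sentence length."""
    pad = ["<pad>"] * (win - 1)
    rows = [emb[w] for w in pad + words + pad]
    feats = []
    for u in filters:
        acts = [sum(a * b for a, b in
                    zip([v for r in rows[i:i + win] for v in r], u))
                for i in range(len(rows) - win + 1)]
        feats.append(max(acts))          # max pooling across the sentence
    return feats

filters = [[1, 0, 0, 1], [0, 1, 1, 0]]   # two hand-set filters, for the demo
f = sentence_features("I ate cake today".split(), filters)
print([round(x, 2) for x in f])          # [1.6, 1.3]
```

The output length equals the number of filters, not the number of words: that is what makes this a fixed-size representation of a variable-length sentence.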

SLIDE 81

Convolution + pooling example

  1. Each word is embedded into a 2d vector; the window concatenates them
  2. A 6x3 filter with a tanh non-linearity
  3. Max pooling over each dimension to produce a 3-dimensional vector

SLIDE 82

Examples of convolution + pooling

Think of convolutions as feature extractors. (Figure from Goldberg 2017: a narrow convolution, i.e. without any padding, in the vector concatenation notation; and a wide convolution, i.e. with padding, in the vector stacking notation.)

SLIDE 83

Overview

  • Convolutional Neural Networks: A brief history
  • The two operations in a CNN
    – Convolution
    – Pooling
  • Convolution + Pooling as a building block
  • CNNs in NLP
  • Recurrent networks vs Convolutional networks

SLIDE 84

Features from text

  • If we want to classify text, we need to represent it in some feature space
  • We have (at least) two ways to get features from text using a neural network:
    – Recurrent Neural Networks
    – Convolutional Neural Networks

SLIDES 85-87

RNNs vs CNNs

  • RNNs model non-Markovian dependencies
    – Can look at (effectively) infinite windows around a target word
    – Can capture sequential patterns in such windows
  • CNNs capture informative ngrams
    – Also gappy ngrams
    – But also account for local ordering patterns
  • How do they compare?
    – Both are trained end-to-end with a task loss
    – RNNs (specifically, BiRNNs) are more popular today… but this can change
    – CNNs allow for more parallelism, and so may be better suited for certain hardware/software improvements

SLIDE 88

RNNs and CNNs as building blocks

Think of them as Lego bricks for constructing larger architectures. Both are computation graphs, so they can be mixed and matched with other computation graphs to create larger neural networks. They are general tools that combine with other ideas we have seen and will see, e.g. contextual embeddings, attention, etc.