SLIDE 1 CS11-747 Neural Networks for NLP
Convolutional Networks for Text
Pengfei Liu
Site https://phontron.com/class/nn4nlp2020/
With some slides by Graham Neubig
SLIDE 2 Outline
- 1. Feature Combinations
- 2. CNNs and Key Concepts
- 3. Case Study on Sentiment Classification
- 4. CNN Variants and Applications
- 5. Structured CNNs
- 6. Summary
SLIDE 3 An Example Prediction Problem: Sentiment Classification
[Figure: two sentences, "I hate this movie" and "I love this movie", each to be rated on a scale from very good to very bad; labels unknown (?)]
SLIDE 4 An Example Prediction Problem: Sentiment Classification
[Figure: the same two sentences and the very good ... very bad rating scale]
SLIDE 5 An Example Prediction Problem: Sentiment Classification
[Figure: the same two sentences and rating scale]
How does our machine do this task?
SLIDE 6 Continuous Bag of Words (CBOW)
[Figure: CBOW, look up a continuous vector for each word of "I hate this movie", sum the vectors, then apply W and a bias to produce scores]
SLIDE 7 Continuous Bag of Words (CBOW)
[Figure: CBOW, look up a continuous vector for each word of "I hate this movie", sum the vectors, then apply W and a bias to produce scores]
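A minimal numpy sketch of the CBOW scorer above; the toy vocabulary, dimensions, and random initialization are illustrative assumptions, not the lecture's code:

```python
import numpy as np

# Toy vocabulary and sizes (illustrative assumptions)
vocab = {"i": 0, "hate": 1, "love": 2, "this": 3, "movie": 4}
emb_dim, n_classes = 8, 5  # 5 classes: very good ... very bad

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), emb_dim))  # embedding look-up table
W = rng.normal(size=(n_classes, emb_dim))   # output weights
b = np.zeros(n_classes)                     # bias

def cbow_scores(words):
    # Look up each word vector, sum them, then one linear layer.
    h = sum(E[vocab[w]] for w in words)
    return W @ h + b

print(cbow_scores(["i", "hate", "this", "movie"]))
```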
SLIDE 8 Deep CBOW
[Figure: Deep CBOW, look up and sum word vectors for "I hate this movie", pass the sum through tanh( W1*h + b1) and tanh( W2*h + b2), then apply W and a bias to produce scores]
transformations followed by activation functions (Multilayer Perceptron, MLP)
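Continuing the CBOW sketch above (same vocab, E, W, b), a sketch of Deep CBOW adds the two tanh layers from the slide; the hidden sizes are assumed:

```python
# Two MLP layers between the CBOW sum and the output layer
# (sizes are illustrative; reuses vocab, E, W, b, rng from above).
W1, b1 = rng.normal(size=(emb_dim, emb_dim)), np.zeros(emb_dim)
W2, b2 = rng.normal(size=(emb_dim, emb_dim)), np.zeros(emb_dim)

def deep_cbow_scores(words):
    h = sum(E[vocab[w]] for w in words)  # CBOW sum
    h = np.tanh(W1 @ h + b1)             # MLP layer 1
    h = np.tanh(W2 @ h + b2)             # MLP layer 2
    return W @ h + b

print(deep_cbow_scores(["i", "hate", "this", "movie"]))
```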
SLIDE 9 What’s the Use of the “Deep”?
- Multiple MLP layers allow us to easily learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)
- e.g. capture things such as “not” AND “hate”
- BUT! Cannot handle “not hate” as an ordered phrase: the bag-of-words sum ignores word order
SLIDE 10
Handling Combinations
SLIDE 11 Bag of n-grams
[Figure: bag of n-grams over "I hate this movie", sum the n-gram vectors and a bias to get scores, then softmax to get probs]
- An n-gram is a contiguous sequence of words
- Each n-gram vector concatenates its word vectors
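A small numpy sketch of the bag-of-n-grams scorer, following the slide's recipe (each n-gram vector concatenates its word vectors); all names and sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "hate": 1, "love": 2, "this": 3, "movie": 4}
E = rng.normal(size=(len(vocab), 8))  # word embeddings

def ngrams(words, n=2):
    # All contiguous n-grams of the sentence.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def bag_of_ngram_probs(words, W, b, n=2):
    # Each n-gram vector is the concatenation of its word vectors.
    h = sum(np.concatenate([E[vocab[w]] for w in ng]) for ng in ngrams(words, n))
    scores = W @ h + b
    e = np.exp(scores - scores.max())
    return e / e.sum()  # softmax over the 5 sentiment classes

W, b = rng.normal(size=(5, 16)), np.zeros(5)
print(bag_of_ngram_probs("i hate this movie".split(), W, b))
```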
SLIDE 12 Why Bag of n-grams?
- Captures combination features in a simple way: “don’t love”, “not the best”
- Works pretty well
SLIDE 13 What Problems w/ Bag of n-grams?
- Same as before: parameter explosion
- No sharing between similar words/n-grams
- Lose the global sequence order
SLIDE 14 What Problems w/ Bag of n-grams?
- Same as before: parameter explosion
- No sharing between similar words/n-grams
- Lose the global sequence order
Other solutions?
SLIDE 15
Neural Sequence Models
SLIDE 16 Neural Sequence Models
Most NLP tasks can be framed as a sequence representation learning problem
SLIDE 17 Neural Sequence Models
char: i-m-p-o-s-s-i-b-l-e word: I-love-this-movie
SLIDE 18 Neural Sequence Models
CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs
SLIDE 19 Neural Sequence Models
CBOW Bag of n-grams
CNNs
RNNs Transformer GraphNNs
SLIDE 20
Convolutional Neural Networks
SLIDE 21 Convolution --> a mathematical operation
Definition of (discrete) convolution: (f * g)[n] = Σ_m f[m] g[n − m]
SLIDE 22 Convolution --> a mathematical operation
Definition of (discrete) convolution: (f * g)[n] = Σ_m f[m] g[n − m]
SLIDE 23 Intuitive Understanding
Input: feature vectors; Filter: learnable parameters; Output: hidden vectors
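To make the input/filter/output picture concrete, a minimal numpy sketch of one narrow 1-D convolution over a sentence of word vectors (sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 8))  # input: 7 words, 8-dim feature vectors
F = rng.normal(size=(3, 8))  # filter: learnable parameters over 3-word windows

# Slide each 3-word window under the filter, producing one hidden value
# per position (narrow convolution: 7 - 3 + 1 = 5 outputs).
h = np.array([np.sum(X[i:i + 3] * F) for i in range(7 - 3 + 1)])
print(h.shape)  # (5,) -> the hidden vector
```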
SLIDE 24
Priors Entailed by CNNs
SLIDE 25 Priors Entailed by CNNs
Local bias:
Words interact with their nearby neighbors
SLIDE 26 Priors Entailed by CNNs
Local bias:
Words interact with their nearby neighbors
SLIDE 27 Priors Entailed by CNNs
Parameter sharing:
The parameters of the composition function are shared across positions.
SLIDE 28
Basics of CNNs
SLIDE 29 Concept: 2d Convolution
- Deals with 2-dimensional signals, e.g., images
SLIDE 30
Concept: 2d Convolution
SLIDE 31
Concept: 2d Convolution
SLIDE 32 Concept: Stride
Stride: the number of units the filter shifts over the input matrix.
SLIDE 33 Concept: Stride
Stride: the number of units the filter shifts over the input matrix.
SLIDE 34 Concept: Stride
Stride: the number of units the filter shifts over the input matrix.
SLIDE 35 Concept: Padding
Padding: how to handle the units at the boundary of the input
SLIDE 36 Concept: Padding
Padding: how to handle the units at the boundary of the input
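A quick PyTorch check of how stride and padding change the output length, following the usual formula floor((m + 2*padding - n) / stride) + 1; the channel sizes here are arbitrary assumptions:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 7)  # (batch, embedding dim, sentence length m=7)

for stride, padding in [(1, 0), (2, 0), (1, 1)]:
    conv = nn.Conv1d(in_channels=8, out_channels=4,
                     kernel_size=3, stride=stride, padding=padding)
    out_len = conv(x).shape[-1]  # floor((m + 2*padding - n) / stride) + 1
    print(f"stride={stride} padding={padding} -> output length {out_len}")
# stride=1 padding=0 -> 5; stride=2 padding=0 -> 3; stride=1 padding=1 -> 7
```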
SLIDE 37 Three Types of Convolutions
Narrow: m=7, n=3, output length m-n+1=5
SLIDE 38 Three Types of Convolutions
Narrow: m=7, n=3, output length m-n+1=5
Equal: m=7, n=3, output length m=7
SLIDE 39 Three Types of Convolutions
Narrow: m=7, n=3, output length m-n+1=5
SLIDE 40 Three Types of Convolutions
Narrow: m=7, n=3, output length m-n+1=5
Equal: m=7, n=3, output length m=7
SLIDE 41 Three Types of Convolutions
Narrow: m=7, n=3, output length m-n+1=5
Equal: m=7, n=3, output length m=7
Wide: m=7, n=3, output length m+n-1=9
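The three types map directly onto numpy's convolution modes; a sketch with the slide's m=7, n=3:

```python
import numpy as np

signal = np.arange(7.0)  # m = 7
filt = np.ones(3)        # n = 3

print(len(np.convolve(signal, filt, mode="valid")))  # narrow: m-n+1 = 5
print(len(np.convolve(signal, filt, mode="same")))   # equal:  m     = 7
print(len(np.convolve(signal, filt, mode="full")))   # wide:   m+n-1 = 9
```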
SLIDE 42 Concept: Multiple Filters
Motivation: each filter represents a unique feature of the convolution window.
SLIDE 43 Concept: Pooling
- Pooling is an aggregation operation, aiming to select informative features
SLIDE 44 Concept: Pooling
- Pooling is an aggregation operation, aiming to select informative features
- Max pooling: “Did you see this feature anywhere in the range?” (most common)
- Average pooling: “How prevalent is this feature over the entire range?”
- k-Max pooling: “Did you see this feature up to k times?”
- Dynamic pooling: “Did you see this feature in the beginning? In the middle? In the end?”
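A short PyTorch sketch of the four variants over a toy feature map (shapes assumed; note that k-max pooling proper keeps the original order of the selected values, which topk here does not):

```python
import torch

H = torch.randn(5, 4)  # 5 positions, 4 feature channels (post-convolution)

max_pooled  = H.max(dim=0).values      # max pooling: strongest value anywhere
mean_pooled = H.mean(dim=0)            # average pooling: prevalence over the range
kmax_pooled = H.topk(2, dim=0).values  # k-max pooling (k=2): top-2 per channel

# Dynamic pooling: split positions into chunks (begin/middle/end), max-pool each.
chunks = torch.chunk(H, 3, dim=0)
dyn_pooled = torch.stack([c.max(dim=0).values for c in chunks])
print(max_pooled.shape, mean_pooled.shape, kmax_pooled.shape, dyn_pooled.shape)
```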
SLIDE 45 Concept: Pooling
[Figure: worked example of max pooling]
SLIDE 46 Concept: Pooling
[Figure: worked examples of max and mean pooling]
SLIDE 47 Concept: Pooling
[Figure: worked examples of max, mean, and k-max pooling]
SLIDE 48 Concept: Pooling
[Figure: worked examples of max, mean, k-max, and dynamic pooling]
SLIDE 49
Case Study: Convolutional Networks for Text Classification (Kim 2014)
SLIDE 50 CNNs for Text Classification (Kim 2014)
- Task: sentiment classification
- Input: a sentence
- Output: a class label (positive/negative)
SLIDE 51 CNNs for Text Classification (Kim 2014)
- Task: sentiment classification
- Input: a sentence
- Output: a class label (positive/negative)
- Model:
- Embedding layer
- Multi-Channel CNN layer
- Pooling layer/Output layer
SLIDE 52 Overview of the Architecture
[Figure: architecture overview: input, embedding look-up (dict), CNN filters, pooling, output]
SLIDE 53 Embedding Layer
Input: look-up table
- Build a look-up table (pre-trained? fine-tuned?)
SLIDE 56
- Conv. Layer
- Stride size?
- 1
SLIDE 57
- Conv. Layer
- Wide, equal, narrow?
SLIDE 58
- Conv. Layer
- Wide, equal, narrow?
- narrow
SLIDE 59
- Conv. Layer
- How many filters?
SLIDE 60
- Conv. Layer
- How many filters?
- 4
SLIDE 62 Output Layer
- MLP layer
- Dropout
- Softmax
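Putting the walkthrough together, a hedged PyTorch sketch in the spirit of this architecture; the hyperparameters (filter widths 3/4/5, 100 filters each, dropout 0.5) follow common settings for the Kim model rather than the slides' 4-filter toy example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """Sketch in the spirit of Kim (2014): embed, convolve with several
    narrow filters (stride 1), max-pool over time, dropout, classify."""
    def __init__(self, vocab_size=10000, emb_dim=128,
                 widths=(3, 4, 5), n_filters=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # pre-trained or fine-tuned
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.drop = nn.Dropout(0.5)
        self.out = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        h = self.drop(torch.cat(pooled, dim=1))
        return self.out(h)                    # softmax is applied in the loss

logits = KimCNN()(torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # (2, 2)
```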
SLIDE 63
CNN Variants
SLIDE 64 Priors Entailed by CNNs
- Local bias
- Parameter sharing
SLIDE 65 Priors Entailed by CNNs
- Local bias
- Parameter sharing
How to handle long-term dependencies? How to handle different types of interactions?
SLIDE 66 Priors Entailed by CNNs
[Table: each prior, its advantage, and its limitation]
SLIDE 67 CNN Variants
- Long-term dependency (limitation of the locality bias): increase receptive fields (dilated convolution)
- Complicated interaction (limitation of parameter sharing): dynamic filters
SLIDE 68 Dilated Convolution
(e.g. Kalchbrenner et al. 2016)
[Figure: dilated convolution over "i _ h a t e _ t h i s _ f i l m", predicting a sentence class (classification), the next char (language modeling), or a word class (tagging)]
- Long-term dependencies with fewer layers
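A PyTorch sketch of stacked dilated convolutions; the dilations 1, 2, 4, 8 and the channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Stacking convolutions with dilation 1, 2, 4, ... grows the receptive
# field exponentially with depth (kernel_size=3 throughout).
x = torch.randn(1, 16, 32)  # (batch, channels, 32 characters)
receptive = 1
for d in (1, 2, 4, 8):
    conv = nn.Conv1d(16, 16, kernel_size=3, dilation=d, padding=d)
    x = torch.relu(conv(x))  # padding=d keeps the length at 32
    receptive += 2 * d       # each layer adds (kernel_size - 1) * dilation
print(x.shape, "receptive field:", receptive)  # 1 + 2*(1+2+4+8) = 31
```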
SLIDE 69 Dynamic Filter CNN (e.g. Brabandere et al. 2016)
- Static filter parameters fail to capture rich interaction patterns.
- Instead, filters are generated dynamically, conditioned on the input.
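A hypothetical minimal sketch of the idea: generate the filter weights from a summary of the input (here, its mean vector) and convolve with them. The generator design is an assumption for illustration, not Brabandere et al.'s exact network:

```python
import torch
import torch.nn.functional as F

emb_dim, width = 8, 3
x = torch.randn(1, emb_dim, 7)  # (batch, channels, length)

gen = torch.nn.Linear(emb_dim, emb_dim * width)  # filter generator (assumed)
context = x.mean(dim=2)                          # (batch, emb_dim) input summary
filt = gen(context).view(1, emb_dim, width)      # one dynamically generated filter

out = F.conv1d(x, filt, padding=width // 2)      # convolve with the generated filter
print(out.shape)  # (1, 1, 7): a single input-conditioned feature map
```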
SLIDE 70
Common Applications
SLIDE 71 CNN Applications
- Word-level CNNs
- Basic unit: word
- Learn the representation of a sentence
- Phrasal patterns
- Char-level CNNs
- Basic unit: character
- Learn the representation of a word
- Extract morphological patterns
SLIDE 72 CNN Applications
- Word-level CNN
- Sentence representation
SLIDE 73 NLP (Almost) from Scratch
(Collobert et al. 2011)
- One of the most important
papers in NLP
- Proposed as early as 2008
SLIDE 74 CNN Applications
- Word-level CNN
- Sentence representation
- Char-level CNN
- Text Classification
SLIDE 75 CNN-RNN-CRF for Tagging
(Ma et al. 2016)
- A classic framework and de-facto standard for
tagging
- Char-CNN is used to learn word representations
(extract morphological information).
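A sketch of a char-CNN word encoder in this spirit; the character vocabulary and filter settings are assumptions loosely following the paper's setup:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Sketch of a char-CNN word encoder: embed characters, convolve,
    max-pool over the character dimension to get one vector per word."""
    def __init__(self, n_chars=100, char_dim=30, n_filters=30, width=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, width, padding=1)

    def forward(self, chars):               # chars: (n_words, word_len)
        x = self.emb(chars).transpose(1, 2) # (n_words, char_dim, word_len)
        return torch.relu(self.conv(x)).max(dim=2).values

word_vecs = CharCNN()(torch.randint(0, 100, (4, 10)))  # 4 words, 10 chars each
print(word_vecs.shape)  # (4, 30), concatenated with word embeddings upstream
```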
SLIDE 76
Structured Convolution
SLIDE 77 Why Structured Convolution?
The man ate the egg.
SLIDE 78 Why Structured Convolution?
The man ate the egg. vanilla CNNs
SLIDE 79 Why Structured Convolution?
The man ate the egg.
- Some convolutional operations are not necessary
- e.g. noun-verb pairs are very informative, but not captured by normal CNNs
vanilla CNNs
SLIDE 80 Why Structured Convolution?
The man ate the egg.
- Some convolutional operations are not necessary
- e.g. noun-verb pairs are very informative, but not captured by normal CNNs
- We would like the model to localize informative features
SLIDE 81 Why Structured Convolution?
The man ate the egg.
- Some convolutional operations are not necessary
- e.g. noun-verb pairs are very informative, but not captured by normal CNNs
- We would like the model to localize informative features
The “structure” provides a stronger prior!
SLIDE 82 Tree-structured Convolution
(Mou et al. 2014, Ma et al. 2015)
- Convolve over parents, grandparents, siblings
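A toy numpy sketch of one tree-convolution step that composes each word with its dependency parent; the tree indices are an illustrative assumption, and Mou et al.'s actual filters also cover grandparents and siblings:

```python
import numpy as np

rng = np.random.default_rng(0)
# "The man ate the egg": parent[i] = head of word i, -1 marks the root
# (an assumed toy dependency tree).
X = rng.normal(size=(5, 8))  # 5 words, 8-dim embeddings
parent = [1, 2, -1, 4, 2]    # the->man, man->ate, ate=root, the->egg, egg->ate

W_child, W_parent = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

# Tree convolution: each word combines with its parent (the root uses
# itself) instead of with its linear neighbors.
H = np.array([np.tanh(W_child @ X[i]
                      + W_parent @ X[parent[i] if parent[i] >= 0 else i])
              for i in range(5)])
print(H.shape)  # (5, 8)
```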
SLIDE 83 Graph Convolution
(e.g. Marcheggiani et al. 2017)
- Convolution is shaped by graph
structure
A tree is a graph with: 1) self-loop connections, 2) dependency connections, 3) reverse connections
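A minimal numpy sketch of one graph-convolution layer over such a graph; Marcheggiani et al.'s formulation additionally uses direction-specific weights and edge gates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 nodes (words), 8-dim features

A = np.eye(5)                # 1) self-loop connections
for head, dep in [(2, 1), (1, 0), (2, 4), (4, 3)]:  # 2) dependency edges (toy tree)
    A[head, dep] = 1
A = np.maximum(A, A.T)       # 3) reverse connections

# One graph-convolution layer: normalize by degree, mix neighbors, transform.
D_inv = np.diag(1.0 / A.sum(axis=1))
W = rng.normal(size=(8, 8))
H = np.tanh(D_inv @ A @ X @ W)  # each word now sees its graph neighbors
print(H.shape)  # (5, 8)
```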
SLIDE 84
Summary
SLIDE 85 Neural Sequence Models
CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs
SLIDE 86 Neural Sequence Models
How do we choose among different neural sequence models?
SLIDE 87 Understand the design philosophy of a model
- Inductive bias: the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered (from Wikipedia)
- Structural bias: a set of prior knowledge incorporated into your model design
SLIDE 88 Structural Bias
- Structural bias: a set of prior knowledge incorporated into your model design
[Figure: local vs. non-local interactions]
SLIDE 89 Structural Bias
- Structural bias: a set of prior knowledge incorporated into your model design
[Figure: sequential, tree, and graph structures]
SLIDE 90 What inductive bias does a neural component entail?
[Table: Locality Bias (Local / Non-local) x Topological Structure (Seq. / Tree / Graph)]
SLIDE 91 What inductive bias does a neural component entail?
CNN: local, sequential; RNN: non-local, sequential
SLIDE 92 What inductive bias does a neural component entail?
Structured CNN: local, tree/graph
SLIDE 93 What inductive bias does a neural component entail?
? (which components fill the remaining cells?)
SLIDE 94 What inductive bias does a neural component entail?
?
SLIDE 95 What inductive bias does a neural component entail?
?
SLIDE 96
Questions?