SLIDE 1

IN5550 – Neural Methods in Natural Language Processing
Convolutional Neural Networks (2:2)

Erik Velldal & Lilja Øvrelid

University of Oslo

3 March 2020

SLIDE 2

Agenda

◮ Brief recap from last week on CNNs.
◮ Extensions of the basic CNN design:
  ◮ Hierarchical convolutions
  ◮ Multiple channels
◮ Design choices and parameter tuning
◮ Use cases: CNNs beyond sentence classification

SLIDE 3

Recap, CNNs for sequences

(Figure from Zhang et al., 2017)

SLIDE 4

Multiple channels

◮ CNNs for images often have multiple ‘channels’.
◮ E.g. 3 channels for an RGB color encoding.
◮ Corresponds to having 3 image matrices and applying different filters to each, summing the results.

SLIDE 5

Multichannel architectures in NLP

◮ Yoon Kim, 2014: CNNs for Sentence Classification
◮ Word embeddings provided in two channels.
◮ Each filter is applied to both channels – shares parameters – and the results are added to form a single feature map.
◮ Gradients are back-propagated through only one of the channels:
  ◮ One copy of the embeddings is kept static, the other is fine-tuned.
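
A minimal PyTorch-style sketch of this two-channel setup (not Kim's original implementation; the max-over-time pooling, the output layer and all sizes are illustrative assumptions, and `pretrained` is assumed to be a 2D tensor of pre-trained word vectors):

```python
import torch
import torch.nn as nn

class TwoChannelCNN(nn.Module):
    """Sketch: one static and one fine-tuned copy of the same pre-trained
    embeddings, filters shared across both channels, feature maps summed."""
    def __init__(self, pretrained, n_filters=100, k=3, n_classes=2):
        super().__init__()
        self.static = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.conv = nn.Conv1d(pretrained.size(1), n_filters, kernel_size=k)  # shared filters
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                                 # tokens: (batch, n)
        maps = sum(self.conv(emb(tokens).transpose(1, 2))      # (batch, n_filters, m)
                   for emb in (self.static, self.tuned))       # summed into one feature map
        pooled, _ = torch.relu(maps).max(dim=2)                # max-over-time pooling
        return self.out(pooled)
```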

SLIDE 6

Multichannel architectures in NLP

◮ The motivation in Kim (2014) is to prevent overfitting by ensuring that the learned vectors do not deviate too far from the originals.
◮ More generally, however, we can view each channel as providing a different representation of the input.
◮ What could correspond to the different channels for text sequences?
  ◮ E.g. embeddings for full-forms, lemmas, PoS, . . .
  ◮ or embeddings from different frameworks, corpora, . . .

SLIDE 7

Context and the receptive field

◮ CNNs improve on CBOW in also capturing ordered context.
◮ But still rather limited; only relationships local to windows of size k.
◮ Due to long-range compositional effects in natural language semantics, we’ll often want to model as much context as feasible.
◮ One option is to just increase the filter size k.
◮ More powerful: a stack of convolution layers applied one after the other:
  ◮ Hierarchical convolutions.

SLIDE 8

Hierarchical convolutions

◮ Let $p_{1:m} = \mathrm{CONV}^{k}_{U,b}(w_{1:n})$ be the result of applying a convolution (with parameters $U$ and $b$) across $w_{1:n}$ with window size $k$.
◮ Can have a succession of $r$ layers that feed into each other:

  $p^{1}_{1:m_1} = \mathrm{CONV}^{k_1}_{U^1,b^1}(w_{1:n})$
  $p^{2}_{1:m_2} = \mathrm{CONV}^{k_2}_{U^2,b^2}(p^{1}_{1:m_1})$
  $\dots$
  $p^{r}_{1:m_r} = \mathrm{CONV}^{k_r}_{U^r,b^r}(p^{r-1}_{1:m_{r-1}})$

◮ The vectors $p^{r}_{1:m_r}$ capture increasingly larger effective windows.
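
A minimal PyTorch sketch of such a stack (all dimensions are made-up assumptions); it only illustrates how each layer consumes the previous layer's output and how the output sequence shrinks:

```python
import torch
import torch.nn as nn

emb_dim, hidden, k = 100, 50, 3
layers = nn.ModuleList([
    nn.Conv1d(emb_dim, hidden, kernel_size=k),   # layer 1: each output sees k words
    nn.Conv1d(hidden, hidden, kernel_size=k),    # layer 2: sees 2k - 1 words of input
    nn.Conv1d(hidden, hidden, kernel_size=k),    # layer 3: sees 3k - 2 words of input
])

w = torch.randn(1, emb_dim, 20)       # a batch with one 20-word "sentence"
p = w
for conv in layers:
    p = torch.relu(conv(p))           # p^i is computed from p^(i-1)
    print(p.shape)                    # (1, 50, 18) -> (1, 50, 16) -> (1, 50, 14)
```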

SLIDE 9

Two-layer hierarchical convolution with k = 2

◮ Two different but related effects of adding layers:
  ◮ Larger receptive field wrt the input at each step: convolutions of successive layers see more of the input.
  ◮ Can learn more abstract feature combinations.

SLIDE 10

Stride

◮ The stride size specifies by how much we shift a filter at each step.
◮ So far we have considered convolutions with a stride size of 1: we slide the window by increments of 1 across the word sequence.
◮ But using larger strides is possible:
  ◮ We can slide the window by increments of e.g. 2 or 3 words at a time.
◮ A larger stride size leads to fewer applications of the filter and a shorter output sequence $p_{1:m}$.
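
A small sketch of the effect of stride on the output length (all sizes are made-up assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 10)                      # (batch, emb_dim, n) with n = 10 words
for stride in (1, 2, 3):
    conv = nn.Conv1d(100, 50, kernel_size=3, stride=stride)
    print(stride, conv(x).shape[-1])             # output length m: 8, 4, 3
```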

SLIDE 11

k = 3 and stride sizes 1, 2, 3

SLIDE 12

Dilated convolutions

◮ A way to increase the effective window size while keeping the number of layers and parameters low.
◮ With dilated convolutions we skip some of the positions within the filters (or, equivalently, introduce zero weights).
◮ I.e. a wider filter region, but with the same number of parameters.
◮ When systematically applied there is no loss in coverage or ‘resolution’.
◮ Hierarchical dilated convolutions make it possible to have large effective receptive fields with just a small number of layers.
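
A short sketch contrasting a dense and a dilated filter of the same width (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 20)                                # (batch, emb_dim, n)
dense = nn.Conv1d(100, 50, kernel_size=3, dilation=1)      # covers a 3-word span
dilated = nn.Conv1d(100, 50, kernel_size=3, dilation=2)    # covers a 5-word span
print(sum(p.numel() for p in dense.parameters()),
      sum(p.numel() for p in dilated.parameters()))        # same number of parameters
print(dense(x).shape[-1], dilated(x).shape[-1])            # 18 vs. 16 output positions
```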

SLIDE 13

3-layer ‘dilated’ hierarchical conv. w/ k = 3, s = k − 1

◮ The same effect can be achieved more efficiently by keeping the filters intact and instead sparsely sampling features using a larger stride size.
◮ E.g. by using hierarchical convolutions with a stride size of k − 1.
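
A back-of-the-envelope sketch of how the effective receptive field grows; the recurrence used below is the standard receptive-field calculation, not a formula from the slides:

```python
def receptive_field(n_layers, k=3, stride=1):
    """Effective receptive field after stacking n_layers convolutions
    that all use window size k and the given stride."""
    rf, jump = 1, 1
    for _ in range(n_layers):
        rf += (k - 1) * jump     # each new layer widens the field
        jump *= stride           # positions drift apart by the stride factor
    return rf

for r in (1, 2, 3, 4):
    print(r, receptive_field(r, stride=1), receptive_field(r, stride=2))
# stride 1 grows linearly (3, 5, 7, 9); stride k - 1 = 2 grows fast (3, 7, 15, 31)
```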

SLIDE 14

Other ‘tricks’

◮ Hierarchical convolutions can be combined with parameter tying:
  ◮ Reusing the same U and b across layers.
  ◮ Allows for using an unbounded number of layers, extending the receptive field to arbitrary-sized inputs.
◮ Skip-connections can be useful for deep CNNs:
  ◮ The output from one layer is passed not only to the next layer but also to subsequent layers in the sequence.
◮ Variations: ResNets, Highway Networks, DenseNets, . . .
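
A minimal sketch combining the two ideas: one convolution whose parameters are reused in every layer, with a skip-connection around each application (sizes and the ReLU placement are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TiedResidualCNN(nn.Module):
    def __init__(self, dim=50, k=3, n_layers=4):
        super().__init__()
        # a single Conv1d: the same U and b are reused across all layers
        self.conv = nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
        self.n_layers = n_layers

    def forward(self, x):                       # x: (batch, dim, n)
        for _ in range(self.n_layers):
            x = x + torch.relu(self.conv(x))    # skip-connection: input added back
        return x
```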

SLIDE 15

Hyperparameters and design choices (1:2)

◮ Hyperparameters: parameters that are specified and not estimated by the learner. Often tuned empirically.

CNN specific:
◮ Number of filters
◮ Window width(s)
◮ Padding
◮ Stride size
◮ Pooling strategy
◮ Pooling regions?
◮ Multiple conv. layers?
◮ Multiple channels?
◮ . . .

NNs in general:
◮ Regularization
◮ Activation function
◮ Number of epochs
◮ Batch size
◮ Choice of optimizer
◮ Loss function
◮ Learning rate schedule
◮ Stopping conditions
◮ . . .

SLIDE 16

Hyperparameters and design choices (2:2)

Embeddings:
◮ Pre-trained vs from scratch
◮ Static vs fine-tuned
◮ Vocab. size
◮ OOV handling
◮ Embedding hyperparameters (dimensionality etc.)
◮ . . .

Text pre-processing:
◮ Segmentation + tokenization
◮ Lemmatization vs full-forms
◮ Various normalization
◮ Additional layers of linguistic analysis: PoS-tagging, dependency parsing, NER, . . .
◮ . . .

Parameter search is important but challenging:
◮ Optimal parametrization usually both data- and task-dependent
◮ Vast parameter space
◮ Many variables co-dependent
◮ Long training times
◮ Need to control for non-determinism

SLIDE 17

How to set hyperparameters

◮ Manually specified
◮ Empirically tune a selected set of parameters:
  ◮ Grid search
  ◮ Random search
  ◮ Various types of guided automated search, e.g. Bayesian optimization
◮ In the extreme: ENAS (Efficient Neural Architecture Search)
  ◮ ‘automatically search for architecture and hyperparameters of deep learning models’
  ◮ Implemented in Google’s AutoML (= expensive and cloud-based)
  ◮ Open-source implementations for PyTorch, Keras, etc. available
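
A toy sketch of random search over a CNN-style hyperparameter space; the space, the number of trials and the scoring stub are all illustrative assumptions, not recommendations from the slides:

```python
import random

space = {
    "n_filters": [50, 100, 200],
    "window": [2, 3, 4, 5],
    "dropout": [0.0, 0.3, 0.5],
    "lr": [1e-4, 3e-4, 1e-3],
}

def train_and_evaluate(config):
    """Stand-in for training a model with this config and returning dev accuracy."""
    return random.random()

best_score, best_config = float("-inf"), None
for _ in range(20):                       # 20 random trials
    config = {name: random.choice(values) for name, values in space.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
print(best_config, best_score)
```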

SLIDE 18

Zhang & Wallace (2017)

◮ Ye Zhang & Byron Wallace @ IJCNLP 2017:
  ◮ A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
◮ ‘Our aim is to identify empirically the settings that practitioners should expend effort tuning, and those that are either inconsequential with respect to performance or that seem to have a “best” setting independent of the specific dataset, and provide a reasonable range for each hyperparameter’
◮ All experiments run 10 times to gauge the effect of non-determinism:
  ◮ Report mean, min and max scores.
◮ Considers one parameter at a time, keeping the others fixed:
  ◮ Ignores the problem of co-dependent variables.
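
A small sketch of this multiple-runs protocol; `run_experiment` is a stand-in stub and does not reproduce any numbers from the paper:

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for a full training run with the given random seed."""
    random.seed(seed)
    return 0.80 + random.random() * 0.05     # placeholder accuracy

scores = [run_experiment(seed) for seed in range(10)]     # 10 repeated runs
print(f"mean={statistics.mean(scores):.3f} "
      f"min={min(scores):.3f} max={max(scores):.3f}")
```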

SLIDE 19

CNN use cases

◮ Document and sentence classification:
  ◮ topic classification
  ◮ authorship attribution
  ◮ spam detection
  ◮ abusive language
  ◮ subjectivity classification
  ◮ question type detection
  ◮ . . .
◮ CNNs for other types of NLP tasks:
  ◮ aspect-based sentiment analysis
  ◮ relation extraction
◮ CNNs over characters instead of words
◮ Understanding CNNs

SLIDE 20

Aspect-based SA

◮ Sentiment directed at a specific aspect of an entity
◮ Subtasks:
  ◮ aspect category detection (laptop#price, laptop#design)
  ◮ sentiment polarity

SLIDE 21

CNN for aspect-based SA

◮ Ruder et al. (2016) follow the architecture of Kim (2014):
  ◮ number of filters: 100
  ◮ window widths: 3, 4, 5 (aspect detection) and 4, 5, 6 (ABSA)
  ◮ dropout: 0.5
  ◮ activation: ReLU
  ◮ embeddings: 300-d pre-trained GloVe embeddings
◮ Aspect detection: multi-label classification (laptop#price, laptop#design)
◮ Sentiment classification takes as input an aspect embedding + word embeddings:
  ◮ aspect embedding: averages the embeddings of the aspect terms (laptop, price)
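
A sketch of building the aspect embedding by averaging the embeddings of the aspect terms; the vocabulary, the indices and the way it is combined with the word embeddings below are illustrative assumptions:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 300)                   # stand-in for 300-d pre-trained GloVe

sentence = torch.tensor([[12, 57, 4, 308, 9]])    # (batch, n) word indices
aspect_terms = torch.tensor([[411, 96]])          # e.g. indices of "laptop", "price"

word_vecs = emb(sentence)                                   # (1, n, 300)
aspect_vec = emb(aspect_terms).mean(dim=1, keepdim=True)    # (1, 1, 300), averaged

# one simple way to feed both to the CNN: attach the aspect vector to every word
combined = torch.cat(
    [word_vecs, aspect_vec.expand(-1, word_vecs.size(1), -1)], dim=-1)
print(combined.shape)                                       # (1, n, 600)
```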

SLIDE 22

Relation extraction

◮ Identifying relations between entities in text
◮ Subtask of an information extraction pipeline

SLIDE 23

Relation extraction

◮ Inventory of relations varies
◮ SemEval shared tasks 2008–2010
◮ SemEval 2010 (Hendrickx et al., 2010) uses nine “general semantic relations”, e.g.:
  ◮ Cause-Effect: those cancers were caused by radiation exposures
  ◮ Product-Producer: a factory manufactures suits
  ◮ Entity-Destination: the boy went to bed
  ◮ etc.
◮ Task: the entities are given; determine the relation between them
◮ Traditionally solved using a range of linguistic features (PoS, WordNet, NER, dependency paths, etc.)

SLIDE 24

Neural relation extraction

◮ Nguyen & Grishman (2015) adapt the CNN architecture of Kim (2014):
  ◮ Pre-trained embeddings (word2vec)
  ◮ Position embeddings:
    ◮ embed the relative distances of each word $x_i$ in the sentence to the two entities of interest $x_{i_1}$ and $x_{i_2}$, i.e. $i - i_1$ and $i - i_2$, into real-valued vectors $d_{i_1}$ and $d_{i_2}$
    ◮ initialized randomly
    ◮ concatenated with the word embeddings
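
A sketch of such position embeddings; the distance clipping, the offset trick and all sizes are illustrative assumptions rather than the exact setup of Nguyen & Grishman:

```python
import torch
import torch.nn as nn

n, max_dist = 8, 50
word_emb = nn.Embedding(10_000, 300)
dist_emb = nn.Embedding(2 * max_dist + 1, 25)      # distances in [-50, 50], shifted to >= 0

tokens = torch.randint(0, 10_000, (1, n))          # (batch, n) word indices
e1, e2 = 2, 5                                      # positions i1, i2 of the two entities
positions = torch.arange(n).unsqueeze(0)           # 0 .. n-1

d1 = (positions - e1).clamp(-max_dist, max_dist) + max_dist   # i - i1, made non-negative
d2 = (positions - e2).clamp(-max_dist, max_dist) + max_dist   # i - i2

x = torch.cat([word_emb(tokens), dist_emb(d1), dist_emb(d2)], dim=-1)
print(x.shape)                                     # (1, 8, 350): input to the convolution
```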

SLIDE 25

Neural relation extraction

(from Nguyen & Grishman, 2015)

SLIDE 26

Neural relation extraction

(from Nguyen & Grishman, 2015)

SLIDE 27

Neural relation extraction

◮ CNNs pick up on local relationships
◮ Challenge: long-distance relations
  ◮ Cause-Effect: The singer, who performed three of the nominated songs, also caused a commotion on the red carpet

SLIDE 28

Character-level CNNs

◮ Zhang, Zhao & LeCun (2015) apply character-level CNNs to text classification
  ◮ Learn directly from characters, without any pre-trained word embeddings
  ◮ Initially, each character is represented by a one-hot encoding
  ◮ Relatively deep network: 9 layers
◮ Compare against several architectures: bag-of-words, bag-of-ngrams, word-based CNN, word-based LSTM
◮ On large-scale datasets (120k to 3,000k instances)
◮ Results indicate that character-level CNNs work best on the largest datasets (several millions of instances)

SLIDE 29

Character-level CNNs

◮ More recently, it is common to combine word embeddings with a character-level component that processes the characters of each word
◮ The result of the character processing is concatenated with the pre-trained word embeddings
◮ Commonly used for various sequence labeling tasks, e.g. PoS-tagging and named entity recognition (Chiu & Nichols, 2016)
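
A minimal sketch of a character-level CNN that builds a word vector and concatenates it with the word's pre-trained embedding; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharCNNWordRepr(nn.Module):
    def __init__(self, n_chars=100, char_dim=25, n_filters=30, k=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=k, padding=1)

    def forward(self, word_vec, char_ids):
        # word_vec: (batch, word_dim) pre-trained embedding of the word
        # char_ids: (batch, word_length) character indices of the word
        c = self.char_emb(char_ids).transpose(1, 2)     # (batch, char_dim, length)
        c, _ = torch.relu(self.conv(c)).max(dim=2)      # max-pool over the characters
        return torch.cat([word_vec, c], dim=-1)         # (batch, word_dim + n_filters)
```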

SLIDE 30

Blackbox NLP

◮ Neural network = black box?
◮ Recent line of research aimed at understanding the representations and computations learnt by neural models
◮ Series of influential workshops (2018, 2019, 2020)
◮ Example topics:
  ◮ Analyzing the network’s response to strategically chosen inputs
  ◮ Modifying neural architectures to make them more interpretable
  ◮ Explaining model predictions

SLIDE 31

Understanding CNNs

Jacovi, Shalom & Goldberg (2019): Understanding Convolutional Neural Networks for Text Classification
◮ Analyze a standard CNN with max-pooling applied to three sentiment datasets
◮ Question: what information about ngrams is captured in the max-pooled vector, and how is it used for the final classification?
◮ Method: analyze the ngram scores that make up the pooled vector on which the classification is based
  ◮ Decompose an ngram’s score as a sum of individual word scores
  ◮ Slot activation vector: captures how much each word in the ngram contributes to the filter’s activation
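
A tiny sketch of the decomposition: the score a width-k filter assigns to an ngram splits into one dot-product term per slot (numbers and sizes are made up):

```python
import torch

k, dim = 3, 4
u = torch.randn(k, dim)          # one convolution filter: one weight vector per slot
ngram = torch.randn(k, dim)      # embeddings of the k words in the ngram

slot_activations = (u * ngram).sum(dim=1)    # how much each word/slot contributes
total_score = slot_activations.sum()         # the filter's response to the ngram
print(slot_activations, total_score)
```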

SLIDE 32

Understanding CNNs

Top-scoring ngrams almost never fully activate all slots in a filter.

SLIDE 33

Understanding CNNs

◮ Each filter captures multiple semantic classes of ngrams (with different slot activation patterns)
◮ A slot may not be maximized because it is used to detect not the presence of a word, but rather its absence (negative ngrams)

SLIDE 34

CNN pros and cons

◮ Can learn to represent large n-grams efficiently:
  ◮ without blowing up the parameter space and without having to represent the whole vocabulary (parameter sharing).
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates on is independent of the others; the entire input can be processed concurrently. (Each filter is also independent.)
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and those layers are computed sequentially.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.
◮ Next week: RNNs.
