SLIDE 1

IN5550 – Neural Methods in Natural Language Processing
Convolutional Neural Networks (2:2)

Erik Velldal & Lilja Øvrelid

University of Oslo

3 March 2020

SLIDE 2

Agenda

◮ Brief recap from last week on CNNs.
◮ Extensions of the basic CNN design:
  ◮ Hierarchical convolutions
  ◮ Multiple channels
◮ Design choices and parameter tuning
◮ Use cases: CNNs beyond sentence classification

SLIDE 3

Recap, CNNs for sequences

(Figure from Zhang et al., 2017)

SLIDE 4

Multiple channels

◮ CNNs for images often have multiple ‘channels’.
◮ E.g. 3 channels for an RGB color encoding.
◮ Corresponds to having 3 image matrices and applying different filters to each, summing the results.

SLIDE 5

Multichannel architectures in NLP

◮ Yoon Kim, 2014: CNNs for Sentence Classification
◮ Word embeddings provided in two channels.
◮ Each filter is applied to both channels – shares parameters – and the results are added to form a single feature map.
◮ Gradients are back-propagated through only one of the channels:
  ◮ One copy of the embeddings is kept static, the other is fine-tuned.
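
A minimal PyTorch-style sketch of this two-channel setup (not Kim's original implementation; the max-over-time pooling, the output layer and all sizes are illustrative assumptions, and `pretrained` is assumed to be a 2D tensor of pre-trained word vectors):

```python
import torch
import torch.nn as nn

class TwoChannelCNN(nn.Module):
    """Sketch: one static and one fine-tuned copy of the same pre-trained
    embeddings, filters shared across both channels, feature maps summed."""
    def __init__(self, pretrained, n_filters=100, k=3, n_classes=2):
        super().__init__()
        self.static = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.conv = nn.Conv1d(pretrained.size(1), n_filters, kernel_size=k)  # shared filters
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                                 # tokens: (batch, n)
        maps = sum(self.conv(emb(tokens).transpose(1, 2))      # (batch, n_filters, m)
                   for emb in (self.static, self.tuned))       # summed into one feature map
        pooled, _ = torch.relu(maps).max(dim=2)                # max-over-time pooling
        return self.out(pooled)
```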

SLIDE 6

Multichannel architectures in NLP

◮ The motivation in Kim (2014) is to prevent overfitting by ensuring that the learned vectors do not deviate too far from the originals.
◮ More generally, however, we can view each channel as providing a different representation of the input.
◮ What could correspond to the different channels for text sequences?
  ◮ E.g. embeddings for full-forms, lemmas, PoS, . . .
  ◮ or embeddings from different frameworks, corpora, . . .

SLIDE 7

Context and the receptive field

◮ CNNs improve on CBOW in also capturing ordered context.
◮ But still rather limited; only relationships local to windows of size k.
◮ Due to long-range compositional effects in natural language semantics, we’ll often want to model as much context as feasible.
◮ One option is to just increase the filter size k.
◮ More powerful: a stack of convolution layers applied one after the other:
  ◮ Hierarchical convolutions.

SLIDE 8

Hierarchical convolutions

◮ Let $p_{1:m} = \mathrm{CONV}^{k}_{U,b}(w_{1:n})$ be the result of applying a convolution (with parameters $U$ and $b$) across $w_{1:n}$ with window size $k$.
◮ Can have a succession of $r$ layers that feed into each other:

  $p^{1}_{1:m_1} = \mathrm{CONV}^{k_1}_{U^1,b^1}(w_{1:n})$
  $p^{2}_{1:m_2} = \mathrm{CONV}^{k_2}_{U^2,b^2}(p^{1}_{1:m_1})$
  $\dots$
  $p^{r}_{1:m_r} = \mathrm{CONV}^{k_r}_{U^r,b^r}(p^{r-1}_{1:m_{r-1}})$

◮ The vectors $p^{r}_{1:m_r}$ capture increasingly larger effective windows.
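
A minimal PyTorch sketch of such a stack (all dimensions are made-up assumptions); it only illustrates how each layer consumes the previous layer's output and how the output sequence shrinks:

```python
import torch
import torch.nn as nn

emb_dim, hidden, k = 100, 50, 3
layers = nn.ModuleList([
    nn.Conv1d(emb_dim, hidden, kernel_size=k),   # layer 1: each output sees k words
    nn.Conv1d(hidden, hidden, kernel_size=k),    # layer 2: sees 2k - 1 words of input
    nn.Conv1d(hidden, hidden, kernel_size=k),    # layer 3: sees 3k - 2 words of input
])

w = torch.randn(1, emb_dim, 20)       # a batch with one 20-word "sentence"
p = w
for conv in layers:
    p = torch.relu(conv(p))           # p^i is computed from p^(i-1)
    print(p.shape)                    # (1, 50, 18) -> (1, 50, 16) -> (1, 50, 14)
```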

SLIDE 9

Two-layer hierarchical convolution with k = 2

◮ Two different but related effects of adding layers:
  ◮ Larger receptive field wrt the input at each step: convolutions of successive layers see more of the input.
  ◮ Can learn more abstract feature combinations.

SLIDE 10

Stride

◮ The stride size specifies by how much we shift a filter at each step.
◮ So far we have considered convolutions with a stride size of 1: we slide the window by increments of 1 across the word sequence.
◮ But using larger strides is possible:
  ◮ We can slide the window by increments of e.g. 2 or 3 words at a time.
◮ A larger stride size leads to fewer applications of the filter and a shorter output sequence $p_{1:m}$.
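
A small sketch of the effect of stride on the output length (all sizes are made-up assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 10)                      # (batch, emb_dim, n) with n = 10 words
for stride in (1, 2, 3):
    conv = nn.Conv1d(100, 50, kernel_size=3, stride=stride)
    print(stride, conv(x).shape[-1])             # output length m: 8, 4, 3
```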

SLIDE 11

k = 3 and stride sizes 1, 2, 3

SLIDE 12

Dilated convolutions

◮ A way to increase the effective window size while keeping the number of layers and parameters low.
◮ With dilated convolutions we skip some of the positions within the filters (or, equivalently, introduce zero weights).
◮ I.e. a wider filter region, but with the same number of parameters.
◮ When systematically applied there is no loss in coverage or ‘resolution’.
◮ Hierarchical dilated convolutions make it possible to have large effective receptive fields with just a small number of layers.
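
A short sketch contrasting a dense and a dilated filter of the same width (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 20)                                # (batch, emb_dim, n)
dense = nn.Conv1d(100, 50, kernel_size=3, dilation=1)      # covers a 3-word span
dilated = nn.Conv1d(100, 50, kernel_size=3, dilation=2)    # covers a 5-word span
print(sum(p.numel() for p in dense.parameters()),
      sum(p.numel() for p in dilated.parameters()))        # same number of parameters
print(dense(x).shape[-1], dilated(x).shape[-1])            # 18 vs. 16 output positions
```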

SLIDE 13

3-layer ‘dilated’ hierarchical conv. w/ k = 3, s = k − 1

◮ The same effect can be achieved more efficiently by keeping the filters intact and instead sparsely sampling features using a larger stride size.
◮ E.g. by using hierarchical convolutions with a stride size of k − 1.
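
A back-of-the-envelope sketch of how the effective receptive field grows; the recurrence used below is the standard receptive-field calculation, not a formula from the slides:

```python
def receptive_field(n_layers, k=3, stride=1):
    """Effective receptive field after stacking n_layers convolutions
    that all use window size k and the given stride."""
    rf, jump = 1, 1
    for _ in range(n_layers):
        rf += (k - 1) * jump     # each new layer widens the field
        jump *= stride           # positions drift apart by the stride factor
    return rf

for r in (1, 2, 3, 4):
    print(r, receptive_field(r, stride=1), receptive_field(r, stride=2))
# stride 1 grows linearly (3, 5, 7, 9); stride k - 1 = 2 grows fast (3, 7, 15, 31)
```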

SLIDE 14

Other ‘tricks’

◮ Hierarchical convolutions can be combined with parameter tying:
  ◮ Reusing the same U and b across layers.
  ◮ Allows for using an unbounded number of layers, extending the receptive field to arbitrary-sized inputs.
◮ Skip-connections can be useful for deep CNNs:
  ◮ The output from one layer is passed not only to the next layer but also to subsequent layers in the sequence.
◮ Variations: ResNets, Highway Networks, DenseNets, . . .
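
A minimal sketch combining the two ideas: one convolution whose parameters are reused in every layer, with a skip-connection around each application (sizes and the ReLU placement are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TiedResidualCNN(nn.Module):
    def __init__(self, dim=50, k=3, n_layers=4):
        super().__init__()
        # a single Conv1d: the same U and b are reused across all layers
        self.conv = nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
        self.n_layers = n_layers

    def forward(self, x):                       # x: (batch, dim, n)
        for _ in range(self.n_layers):
            x = x + torch.relu(self.conv(x))    # skip-connection: input added back
        return x
```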

SLIDE 15

Hyperparameters and design choices (1:2)

◮ Hyperparameters: parameters that are specified and not estimated by the learner. Often tuned empirically.

CNN specific:
◮ Number of filters
◮ Window width(s)
◮ Padding
◮ Stride size
◮ Pooling strategy
◮ Pooling regions?
◮ Multiple conv. layers?
◮ Multiple channels?
◮ . . .

NNs in general:
◮ Regularization
◮ Activation function
◮ Number of epochs
◮ Batch size
◮ Choice of optimizer
◮ Loss function
◮ Learning rate schedule
◮ Stopping conditions
◮ . . .

SLIDE 16

Hyperparameters and design choices (2:2)

Embeddings:
◮ Pre-trained vs from scratch
◮ Static vs fine-tuned
◮ Vocab. size
◮ OOV handling
◮ Embedding hyperparameters (dimensionality etc.)
◮ . . .

Text pre-processing:
◮ Segmentation + tokenization
◮ Lemmatization vs full-forms
◮ Various normalization
◮ Additional layers of linguistic analysis: PoS-tagging, dependency parsing, NER, . . .
◮ . . .

Parameter search is important but challenging:
◮ Optimal parametrization usually both data- and task-dependent
◮ Vast parameter space
◮ Many variables co-dependent
◮ Long training times
◮ Need to control for non-determinism

SLIDE 17

How to set hyperparameters

◮ Manually specified
◮ Empirically tune a selected set of parameters:
  ◮ Grid search
  ◮ Random search
  ◮ Various types of guided automated search, e.g. Bayesian optimization
◮ In the extreme: ENAS (Efficient Neural Architecture Search)
  ◮ ‘automatically search for architecture and hyperparameters of deep learning models’
  ◮ Implemented in Google’s AutoML (= expensive and cloud-based)
  ◮ Open-source implementations for PyTorch, Keras, etc. available
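
A toy sketch of random search over a CNN-style hyperparameter space; the space, the number of trials and the scoring stub are all illustrative assumptions, not recommendations from the slides:

```python
import random

space = {
    "n_filters": [50, 100, 200],
    "window": [2, 3, 4, 5],
    "dropout": [0.0, 0.3, 0.5],
    "lr": [1e-4, 3e-4, 1e-3],
}

def train_and_evaluate(config):
    """Stand-in for training a model with this config and returning dev accuracy."""
    return random.random()

best_score, best_config = float("-inf"), None
for _ in range(20):                       # 20 random trials
    config = {name: random.choice(values) for name, values in space.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
print(best_config, best_score)
```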

SLIDE 18

Zhang & Wallace (2017)

◮ Ye Zhang & Byron Wallace @ IJCNLP 2017:
  ◮ A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
◮ ‘Our aim is to identify empirically the settings that practitioners should expend effort tuning, and those that are either inconsequential with respect to performance or that seem to have a “best” setting independent of the specific dataset, and provide a reasonable range for each hyperparameter’
◮ All experiments run 10 times to gauge the effect of non-determinism:
  ◮ Report mean, min and max scores.
◮ Considers one parameter at a time, keeping the others fixed:
  ◮ Ignores the problem of co-dependent variables.
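
A small sketch of this multiple-runs protocol; `run_experiment` is a stand-in stub and does not reproduce any numbers from the paper:

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for a full training run with the given random seed."""
    random.seed(seed)
    return 0.80 + random.random() * 0.05     # placeholder accuracy

scores = [run_experiment(seed) for seed in range(10)]     # 10 repeated runs
print(f"mean={statistics.mean(scores):.3f} "
      f"min={min(scores):.3f} max={max(scores):.3f}")
```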

SLIDE 19

CNN use cases

◮ Document and sentence classification:
  ◮ topic classification
  ◮ authorship attribution
  ◮ spam detection
  ◮ abusive language
  ◮ subjectivity classification
  ◮ question type detection
  ◮ . . .
◮ CNNs for other types of NLP tasks:
  ◮ aspect-based sentiment analysis
  ◮ relation extraction
◮ CNNs over characters instead of words
◮ Understanding CNNs

SLIDE 20

Aspect-based SA

◮ Sentiment directed at a specific aspect of an entity
◮ Subtasks:
  ◮ aspect category detection (laptop#price, laptop#design)
  ◮ sentiment polarity

SLIDE 21

CNN for aspect-based SA

◮ Ruder et al. (2016) follow the architecture of Kim (2014):
  ◮ number of filters: 100
  ◮ window widths: 3, 4, 5 (aspect detection) and 4, 5, 6 (ABSA)
  ◮ dropout: 0.5
  ◮ activation: ReLU
  ◮ embeddings: 300-d pre-trained GloVe embeddings
◮ Aspect detection: multi-label classification (laptop#price, laptop#design)
◮ Sentiment classification takes as input an aspect embedding + word embeddings:
  ◮ aspect embedding: averages the embeddings of the aspect terms (laptop, price)
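
A sketch of building the aspect embedding by averaging the embeddings of the aspect terms; the vocabulary, the indices and the way it is combined with the word embeddings below are illustrative assumptions:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 300)                   # stand-in for 300-d pre-trained GloVe

sentence = torch.tensor([[12, 57, 4, 308, 9]])    # (batch, n) word indices
aspect_terms = torch.tensor([[411, 96]])          # e.g. indices of "laptop", "price"

word_vecs = emb(sentence)                                   # (1, n, 300)
aspect_vec = emb(aspect_terms).mean(dim=1, keepdim=True)    # (1, 1, 300), averaged

# one simple way to feed both to the CNN: attach the aspect vector to every word
combined = torch.cat(
    [word_vecs, aspect_vec.expand(-1, word_vecs.size(1), -1)], dim=-1)
print(combined.shape)                                       # (1, n, 600)
```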

SLIDE 22

Relation extraction

◮ Identifying relations between entities in text
◮ Subtask of an information extraction pipeline

SLIDE 23

Relation extraction

◮ Inventory of relations varies
◮ SemEval shared tasks 2008–2010
◮ SemEval 2010 (Hendrickx et al., 2010) uses nine “general semantic relations”, e.g.:
  ◮ Cause-Effect: those cancers were caused by radiation exposures
  ◮ Product-Producer: a factory manufactures suits
  ◮ Entity-Destination: the boy went to bed
  ◮ etc.
◮ Task: the entities are given; determine the relation between them
◮ Traditionally solved using a range of linguistic features (PoS, WordNet, NER, dependency paths, etc.)

SLIDE 24

Neural relation extraction

◮ Nguyen & Grishman (2015) adapt the CNN architecture of Kim (2014):
  ◮ Pre-trained embeddings (word2vec)
  ◮ Position embeddings:
    ◮ embed the relative distances of each word $x_i$ in the sentence to the two entities of interest $x_{i_1}$ and $x_{i_2}$, i.e. $i - i_1$ and $i - i_2$, into real-valued vectors $d_{i_1}$ and $d_{i_2}$
    ◮ initialized randomly
    ◮ concatenated with the word embeddings
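
A sketch of such position embeddings; the distance clipping, the offset trick and all sizes are illustrative assumptions rather than the exact setup of Nguyen & Grishman:

```python
import torch
import torch.nn as nn

n, max_dist = 8, 50
word_emb = nn.Embedding(10_000, 300)
dist_emb = nn.Embedding(2 * max_dist + 1, 25)      # distances in [-50, 50], shifted to >= 0

tokens = torch.randint(0, 10_000, (1, n))          # (batch, n) word indices
e1, e2 = 2, 5                                      # positions i1, i2 of the two entities
positions = torch.arange(n).unsqueeze(0)           # 0 .. n-1

d1 = (positions - e1).clamp(-max_dist, max_dist) + max_dist   # i - i1, made non-negative
d2 = (positions - e2).clamp(-max_dist, max_dist) + max_dist   # i - i2

x = torch.cat([word_emb(tokens), dist_emb(d1), dist_emb(d2)], dim=-1)
print(x.shape)                                     # (1, 8, 350): input to the convolution
```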

SLIDE 25

Neural relation extraction

(from Nguyen & Grishman, 2015)

SLIDE 26

Neural relation extraction

(from Nguyen & Grishman, 2015)

SLIDE 27

Neural relation extraction

◮ CNNs pick up on local relationships
◮ Challenge: long-distance relations
  ◮ Cause-Effect: The singer, who performed three of the nominated songs, also caused a commotion on the red carpet

SLIDE 28

Character-level CNNs

◮ Zhang, Zhao & LeCun (2015) apply character-level CNNs to text classification
  ◮ Learn directly from characters, without any pre-trained word embeddings
  ◮ Initially, each character is represented by a one-hot encoding
  ◮ Relatively deep network: 9 layers
◮ Compare against several architectures: bag-of-words, bag-of-ngrams, word-based CNN, word-based LSTM
◮ On large-scale datasets (120k to 3,000k instances)
◮ Results indicate that character-level CNNs work best on the largest datasets (several millions of instances)

SLIDE 29

Character-level CNNs

◮ More recently, it is common to combine word embeddings with a character-level component that processes the characters of each word
◮ The result of the character processing is concatenated with the pre-trained word embeddings
◮ Commonly used for various sequence labeling tasks, e.g. PoS-tagging and named entity recognition (Chiu & Nichols, 2016)
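
A minimal sketch of a character-level CNN that builds a word vector and concatenates it with the word's pre-trained embedding; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharCNNWordRepr(nn.Module):
    def __init__(self, n_chars=100, char_dim=25, n_filters=30, k=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=k, padding=1)

    def forward(self, word_vec, char_ids):
        # word_vec: (batch, word_dim) pre-trained embedding of the word
        # char_ids: (batch, word_length) character indices of the word
        c = self.char_emb(char_ids).transpose(1, 2)     # (batch, char_dim, length)
        c, _ = torch.relu(self.conv(c)).max(dim=2)      # max-pool over the characters
        return torch.cat([word_vec, c], dim=-1)         # (batch, word_dim + n_filters)
```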

SLIDE 30

Blackbox NLP

◮ Neural network = black box?
◮ Recent line of research aimed at understanding the representations and computations learnt by neural models
◮ Series of influential workshops (2018, 2019, 2020)
◮ Example topics:
  ◮ Analyzing the network’s response to strategically chosen inputs
  ◮ Modifying neural architectures to make them more interpretable
  ◮ Explaining model predictions

SLIDE 31

Understanding CNNs

Jacovi, Shalom & Goldberg (2019): Understanding Convolutional Neural Networks for Text Classification
◮ Analyze a standard CNN with max-pooling applied to three sentiment datasets
◮ Question: what information about ngrams is captured in the max-pooled vector, and how is it used for the final classification?
◮ Method: analyze the ngram scores that make up the pooled vector on which the classification is based
  ◮ Decompose an ngram’s score as a sum of individual word scores
  ◮ Slot activation vector: captures how much each word in the ngram contributes to the filter’s activation
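
A tiny sketch of the decomposition: the score a width-k filter assigns to an ngram splits into one dot-product term per slot (numbers and sizes are made up):

```python
import torch

k, dim = 3, 4
u = torch.randn(k, dim)          # one convolution filter: one weight vector per slot
ngram = torch.randn(k, dim)      # embeddings of the k words in the ngram

slot_activations = (u * ngram).sum(dim=1)    # how much each word/slot contributes
total_score = slot_activations.sum()         # the filter's response to the ngram
print(slot_activations, total_score)
```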

SLIDE 32

Understanding CNNs

Top-scoring ngrams almost never fully activate all slots in a filter.

SLIDE 33

Understanding CNNs

◮ Each filter captures multiple semantic classes of ngrams (with different slot activation patterns)
◮ A slot may not be maximized because it is used to detect not the presence of a word, but rather its absence (negative ngrams)

SLIDE 34

CNN pros and cons

◮ Can learn to represent large n-grams efficiently:
  ◮ without blowing up the parameter space and without having to represent the whole vocabulary (parameter sharing).
◮ Easily parallelizable: each ‘region’ that a convolutional filter operates on is independent of the others; the entire input can be processed concurrently. (Each filter is also independent.)
◮ The cost of this is that we have to stack convolutions into deep layers in order to ‘view’ the entire input, and those layers are computed sequentially.
◮ Not designed for modeling sequential language data: does not offer a very natural way of modeling long-range and structured dependencies.
◮ Next week: RNNs.
