

SLIDE 1

IN4080 – 2020 FALL

NATURAL LANGUAGE PROCESSING

Jan Tore Lønning

SLIDE 2

Lecture 13, 9 Nov.

Neural LMs, Recurrent networks, Sequence labeling, Information Extraction, Named-Entity Recognition, Evaluation

SLIDE 3

Today

 Feedforward neural networks
 Neural Language Models
 Recurrent networks
 Information Extraction
 Named Entity Recognition
 Evaluation

SLIDE 4

Last week

 Feedforward neural networks (partly recap)
 Model
 Training
 Computational graphs
 Neural Language Models
 Recurrent networks
 Information Extraction

SLIDE 5

Neural NLP

 (Multi-layered) neural networks
 Using embeddings as word representations
 Example: Neural language model (k-gram): P(x_j | x_{j-k}, …, x_{j-1})
 Use embeddings for representing the x_j's
 Use a neural network for estimating P(x_j | x_{j-k}, …, x_{j-1})

SLIDE 6

[Figure: the feedforward neural language model, from J&M, 3.ed., 2019]

SLIDE 7

Pretrained embeddings

 The last slide uses pretrained embeddings
 Trained with some method: skip-gram, CBOW, GloVe, …
 On some specific corpus
 Can be downloaded from the web
 Pretrained embeddings can also be the input to other tasks, e.g. text classification
 The task of neural language modeling was also the basis for training the embeddings
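Not from the slides, but for concreteness: pretrained embeddings such as GloVe can be fetched with gensim's downloader (the model name below is one of gensim's standard datasets):

```python
import gensim.downloader as api

# Downloads on first use; 100-dimensional GloVe vectors
# trained on Wikipedia + Gigaword.
wv = api.load("glove-wiki-gigaword-100")

print(wv["language"][:5])           # first dimensions of one embedding
print(wv.most_similar("language"))  # nearest neighbours in the vector space
```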

SLIDE 8

Training the embeddings

 Alternatively we may start with one-hot representations of words and train the embeddings as the first layer in our models (= the way we trained the embeddings)
 If the goal is a task different from language modeling, this may result in embeddings better suited for the specific task.
 We may even use two sets of embeddings for each word – one pretrained and one which is trained during the task.

SLIDE 9

Computational graph

 The computational graph of the k-gram neural LM (here k = 3), with embedding matrix E, hidden-layer weights W, output weights U, and biases b[1], b[2]:

u_1 = E x_1,  u_2 = E x_2,  u_3 = E x_3   (embedding lookup)
u = concat(u_1, u_2, u_3)
z_1 = W u + b[1]
a = g(z_1)   (activation function)
z_2 = U a + b[2]
ŷ = softmax(z_2)

 This picture is if we train the embeddings E. With pretrained embeddings, we instead look up the u_i for each word in a table.
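A minimal numpy sketch of this forward pass (the dimensions and the choice of ReLU for the activation g are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, dh = 10_000, 100, 200      # vocab size, embedding dim, hidden dim (assumed)
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (V, d))   # embedding matrix, one row per word
W = rng.normal(0, 0.1, (dh, 3 * d))
b1 = np.zeros(dh)
U = rng.normal(0, 0.1, (V, dh))
b2 = np.zeros(V)

def forward(x1, x2, x3):
    """Distribution over the next word given word indices x1..x3."""
    u = np.concatenate([E[x1], E[x2], E[x3]])  # embedding lookup + concat
    a = np.maximum(0, W @ u + b1)              # hidden layer, ReLU assumed
    return softmax(U @ a + b2)                 # distribution over the vocabulary

y_hat = forward(12, 47, 3)
print(y_hat.shape, y_hat.sum())  # (10000,) 1.0
```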

SLIDE 10

Recurrent networks

SLIDE 11

Today

 Feedforward neural networks
 Recurrent networks
 Model
 Language Model
 Sequence Labeling
 Advanced architecture
 Information Extraction
 Named Entity Recognition
 Evaluation

SLIDE 12

Recurrent neural nets

 Model sequences/temporal phenomena
 A cell may send a signal back to itself – at the next moment in time

[Figure: the network, and its processing unrolled over time; from https://en.wikipedia.org/wiki/Recurrent_neural_network]

SLIDE 13

Forward

 U, V and W are edges with weights (matrices)
 x_1, x_2, …, x_n is the input sequence
 Forward:

1. Calculate h_1 from h_0 and x_1.
2. Calculate y_1 from h_1.
3. Calculate h_t from h_{t-1} and x_t, and y_t from h_t, for t = 1, …, n

From J&M, 3.ed., 2019

SLIDE 14

Forward

 h_t = g(U h_{t-1} + W x_t)
 y_t = f(V h_t)
 g and f are activation functions
 (There are also bias terms which we didn't include in the formulas)

From J&M, 3.ed., 2019
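A minimal sketch of this recurrence in numpy, assuming tanh for g and softmax for f:

```python
import numpy as np

def rnn_forward(X, U, W, V, g=np.tanh):
    """Simple RNN in the slide's notation:
    h_t = g(U h_{t-1} + W x_t),  y_t = f(V h_t) with f = softmax."""
    h = np.zeros(U.shape[0])          # h_0
    outputs = []
    for x in X:                       # X: sequence of input vectors
        h = g(U @ h + W @ x)
        z = V @ h
        e = np.exp(z - z.max())
        outputs.append(e / e.sum())   # softmax output y_t
    return outputs

# Toy dimensions (assumed): input dim 4, hidden dim 3, output dim 5
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=(5, 3))
X = rng.normal(size=(6, 4))           # a sequence of 6 input vectors
ys = rnn_forward(X, U, W, V)
print(len(ys), ys[0].sum())           # 6 outputs, each summing to 1
```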

SLIDE 15

Training

 At each output node:
 Calculate the loss and the δ-term
 Backpropagate the error, e.g.
 the δ-term at h_2 is calculated from the δ-term at h_3 by U and the δ-term at y_2 by V
 Update
 V from the δ-terms at the y_t's, and
 U and W from the δ-terms at the h_t's

From J&M, 3.ed., 2019

SLIDE 16

Remark

 J&M, 3. ed., 2019, sec. 9.1.2 explains this at a high level using vectors and matrices, which is fine
 The formulas, however, are not correct:
 Describing derivatives of matrices and vectors demands a little more care, e.g. one has to transpose matrices
 It is beyond this course to explain how this can be done in detail
 But you should be able to do the actual calculations if you stick to the entries of the vectors and matrices, as we did above (ch. 7).

SLIDE 17

Today

 Feedforward neural networks
 Recurrent networks
 Model
 Language Model
 Sequence Labeling
 Advanced architecture
 Information Extraction
 Named Entity Recognition
 Evaluation

SLIDE 18

RNN Language model

 ŷ = P(x_n | x_1, …, x_{n-1}) = softmax(V h_n)
 In principle:
 unlimited history
 a word depends on all preceding words
 The word x_t is represented by an embedding
 or by a one-hot vector, and the embedding is made by the LM

[Figure: RNN LM unrolled over the input <s> w1 w2; from J&M, 3.ed., 2019]

SLIDE 19

Autoregressive generation

 Generated by probabilities:
 Choose word in accordance with the probability distribution
 Part of more complex models
 Encoder-decoder models
 Translation

From J&M, 3.ed., 2019
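A sketch of the sampling loop, with next_word_dist standing in for any trained LM (the uniform toy model below is just a placeholder):

```python
import numpy as np

def generate(next_word_dist, bos, eos, max_len=20, seed=0):
    """Autoregressive generation: draw each word from the model's
    distribution given the words generated so far."""
    rng = np.random.default_rng(seed)
    words = [bos]
    while len(words) < max_len:
        p = next_word_dist(words)          # P(w | history), a distribution
        w = int(rng.choice(len(p), p=p))
        if w == eos:
            break
        words.append(w)
    return words[1:]                       # drop the <s> symbol

# Placeholder model (assumption): uniform over a 5-word vocabulary
print(generate(lambda history: np.full(5, 0.2), bos=0, eos=4))
```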

SLIDE 20

Today

 Feedforward neural networks
 Recurrent networks
 Model
 Language Model
 Sequence Labeling
 Advanced architecture
 Information Extraction
 Named Entity Recognition
 Evaluation

SLIDE 21

Neural sequence labeling: tagging

 ŷ = P(t_n | x_1, …, x_n) = softmax(V h_n)

From J&M, 3.ed., 2019

SLIDE 22

Sequence labeling

 Actual models for sequence labeling, e.g. tagging, are more complex
 For example, they may take words after the tag into consideration.

SLIDE 23

Today

 Feedforward neural networks
 Recurrent networks
 Model
 Language Model
 Sequence Labeling
 Advanced architecture
 Information Extraction
 Named Entity Recognition
 Evaluation

SLIDE 24

Stacked RNN

 Can yield better results than single layers
 Reason?
 Higher layers of abstraction
 similar to image processing (convolutional nets)

From J&M, 3.ed., 2019

SLIDE 25

Bidirectional RNN

 Example: Tagger
 Considers both preceding and following words

From J&M, 3.ed., 2019

SLIDE 26

LSTM

 Problems for RNNs:
 Keeping track of distant information
 Vanishing gradient
 During backpropagation going backwards through several layers, the gradient approaches 0
 Long Short-Term Memory (LSTM)
 An advanced architecture with additional layers and weights
 We do not consider the details here
 Bi-LSTM (bidirectional LSTM)
 Popular standard architecture in NLP

SLIDE 27

Information extraction

SLIDE 28

Today

 Feedforward neural networks (partly recap)
 Recurrent networks
 Information extraction, IE
 Chunking
 Named Entity Recognition
 Evaluation

SLIDE 29

IE basics

 Bottom-up approach
 Start with unrestricted texts, and do the best you can
 The approach was in particular developed by the Message Understanding Conferences (MUC) in the 1990s
 Select a particular domain and task

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. (Wikipedia)

SLIDE 30

A typical pipeline

[Figure: the IE pipeline, from NLTK]
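The first stages of the pipeline correspond to the ie_preprocess function from the NLTK book (requires the punkt and POS-tagger models, e.g. via nltk.download('punkt')):

```python
import nltk

def ie_preprocess(document):
    """Raw text -> sentences -> tokens -> POS-tagged tokens."""
    sentences = nltk.sent_tokenize(document)                # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]  # tokenization
    return [nltk.pos_tag(s) for s in sentences]             # part-of-speech tagging

print(ie_preprocess("United Airlines said Friday it has increased fares.")[0])
```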

SLIDE 31

Some example systems

 Stanford CoreNLP: http://corenlp.run/
 SpaCy (Python): https://spacy.io/docs/api/
 OpenNLP (Java): https://opennlp.apache.org/docs/
 GATE (Java): https://gate.ac.uk/
 https://cloud.gate.ac.uk/shopfront
 UDPipe: http://ufal.mff.cuni.cz/udpipe
 Online demo: http://lindat.mff.cuni.cz/services/udpipe/
 Collection of tools for NER:
 https://www.clarin.eu/resource-families/tools-named-entity-recognition

SLIDE 32

Today

 Feedforward neural networks (partly recap)
 Recurrent networks
 Information extraction, IE
 Chunking
 Named Entity Recognition
 Evaluation

SLIDE 33

Next steps

 Chunk together words to phrases

SLIDE 34

NP-chunks

 Exactly what is an NP-chunk?
 It is an NP
 But not all NPs are chunks
 Flat structure: no NP-chunk is part of another NP-chunk
 Maximally large
 Opposing restrictions

[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

SLIDE 35

Chunking methods

 Hand-written rules
 Regular expressions
 Supervised machine learning

SLIDE 36

Regular Expression Chunker

 Input: POS-tagged sentences
 Use a regular expression over POS tags to identify NP-chunks
 NLTK example:
 It inserts parentheses

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
"""

SLIDE 37

IOB-tags

 B-NP: First word in an NP
 I-NP: Part of an NP, not the first word
 O: Not part of an NP (phrase)
 Properties:
 One tag per token
 Unambiguous
 Does not insert anything in the text itself
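NLTK can convert between chunk trees and IOB triples; a small illustration with tree2conlltags:

```python
import nltk
from nltk.chunk import tree2conlltags

cp = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
            ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(tree2conlltags(cp.parse(sentence)))
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('cat', 'NN', 'I-NP'),
#  ('sat', 'VBD', 'O'), ('on', 'IN', 'O'), ('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]
```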

SLIDE 38

Assigning IOB-tags

 The process can be considered a form of tagging:
 POS-tagging: word to POS-tag
 IOB-tagging: POS-tag to IOB-tag
 But one may use additional features, e.g. the words themselves
 Can use various types of classifiers
 NLTK uses a MaxEnt classifier (= logistic regression, but the implementation is slow)
 We can modify along the lines of mandatory assignment 2, using scikit-learn; see the sketch below
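A sketch of the scikit-learn route, in the spirit of mandatory assignment 2; the feature set and the data format ([(word, pos, iob), …] per sentence, as in CoNLL-2000) are assumptions for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(sent, i):
    """Features for token i of a POS-tagged sentence [(word, pos), ...]."""
    word, pos = sent[i]
    return {
        "pos": pos,
        "word": word.lower(),
        "prev_pos": sent[i - 1][1] if i > 0 else "<S>",
        "next_pos": sent[i + 1][1] if i < len(sent) - 1 else "</S>",
    }

def train_iob_tagger(train_sents):
    """train_sents: list of sentences of (word, pos, iob) triples."""
    X = [token_features([(w, p) for w, p, _ in s], i)
         for s in train_sents for i in range(len(s))]
    y = [iob for s in train_sents for _, _, iob in s]
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)   # one classification decision per token
    return model
```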

SLIDE 39

[Figure from J&M, 3. ed.]

SLIDE 40

Today

 Feedforward neural networks (partly recap)
 Recurrent networks
 Information extraction, IE
 Chunking
 Named Entity Recognition
 Evaluation

SLIDE 41

Named entities

 Named entity:
 Anything you can refer to by a proper name
 i.e. not all NPs (chunks): high fuel prices
 Maybe a longer NP than just the chunk: Bank of America
 Find the phrases
 Classify them

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

SLIDE 42

Types of NE

 The set of types varies between different systems
 Which classes are useful depends on the application

SLIDE 43

Ambiguities

SLIDE 44

Gazetteer

 Useful: lists of names, e.g.
 Gazetteer: a list of geographical names
 But does not remove all ambiguities
 cf. example

SLIDE 45

Representation (IOB)

SLIDE 46

Feature-based NER

 Similar to tagging and chunking
 You will need features from several layers
 Features may include
 Words, POS-tags, chunk-tags, graphical properties
 and more (see J&M, 3. ed.)

SLIDE 47

Neural sequence labeling: NER

 We can use IOB-tags
 IOB-tagged training data
 RNN
 Similarly to POS-tagging

From J&M, 3.ed., 2019

SLIDE 48

A more advanced model

 Bi-LSTM
 CRF top-layer
 Optimizes the sequence of tags
 In contrast to optimizing individual tags (as we did in mandatory 2)

From J&M, 3.ed., 2019

SLIDE 49

Today

 Feedforward neural networks (partly recap)
 Recurrent networks
 Information extraction, IE
 Named Entity Recognition
 Evaluation
 in general
 chunkers and NER

SLIDE 50

Evaluation measure: Accuracy

 What does accuracy 0.81 tell us?
 Given a test set of 500 documents:
 The classifier will classify 405 correctly
 And 95 incorrectly
 A good measure given:
 The 2 classes are equally important
 The 2 classes are roughly equally sized
 Examples:
 Woman/man
 Movie reviews: pos/neg
slide-51
SLIDE 51

But

52

 For some tasks, the classes aren't equally important

 Worse to loose an important mail than to receive yet another spam mail

 For some tasks the different classes have different sizes.

slide-52
SLIDE 52

Information retrieval (IR)

53

 Traditional IR, e.g. a library

 Goal: Find all the documents on a particular topic out of 100 000 documents,

 Say there are 5

 The system delivers 10 documents: all irrelevant

 What is the accuracy?  For these tasks, focus on

 The relevant documents  The documents returned by the system

 Forget the

 Irrelevant documents which are not returned

slide-53
SLIDE 53

IR - evaluation

54

slide-54
SLIDE 54

Confusion matrix

 Beware what the rows

and columns are:

 NLTKs

ConfusionMatrix swaps them compared to this table

55

SLIDE 55

Evaluation measures

 Accuracy: (tp+tn)/N
 Precision: P = tp/(tp+fp)
 Recall: R = tp/(tp+fn)
 F-score combines P and R:
 F1 = 2PR/(P+R) = 1/((1/P + 1/R)/2)
 F1 is the harmonic mean of P and R
 General form:
 F = 1/(β/P + (1−β)/R), for some 0 < β < 1

                  Is in C
                  Yes   No
Classifier  Yes    tp    fp
            No     fn    tn
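The same measures in code (a small helper; the counts in the example call are made up):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=70, fp=30, fn=20, tn=880))  # (0.95, 0.7, 0.777..., 0.736...)
```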

SLIDE 56

Confusion matrix

 Precision, recall and F-score can be calculated for each class against the rest

SLIDE 57

Today

 Feedforward neural networks (partly recap)
 Recurrent networks
 Information extraction, IE
 Named Entity Recognition
 Evaluation
 in general
 chunkers and NER

SLIDE 58

Evaluation

 Have we found the correct NEs?
 Evaluate precision and recall as for chunking
 For the correctly identified NEs, have we labelled them correctly?

SLIDE 59

Evaluating (IOB-)chunkers

 cp = nltk.RegexpParser("")
 test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
 IOB Accuracy: 43.4%
 Precision: 0.0%
 Recall: 0.0%
 F-Measure: 0.0%
 What do we evaluate?
 IOB-tags? or
 Whole chunks?
 Yields different results
 For IOB-tags:
 Baseline: majority class O yields > 33%
 Whole chunks:
 Which chunks did we find?
 Harder
 Lower numbers

SLIDE 60

Evaluating (IOB-)chunkers

 cp = nltk.RegexpParser("")
 test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
 IOB Accuracy: 43.4%
 Precision: 0.0%
 Recall: 0.0%
 F-Measure: 0.0%

>>> cp = nltk.RegexpParser(r"NP: {<[CDJNP].*>+}")

 IOB Accuracy: 87.7%
 Precision: 70.6%
 Recall: 67.8%
 F-Measure: 69.2%
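A runnable version of these two experiments, assuming the corpus has been fetched with nltk.download('conll2000'); the printed scores are the ones shown above:

```python
import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

# Baseline: a chunker with no rules tags every token O
cp = nltk.RegexpParser("")
print(cp.evaluate(test_sents))   # IOB Accuracy 43.4%, P/R/F 0.0%

# One rule: any run of tokens whose tags start with C, D, J, N or P
# (numbers, determiners, adjectives, nouns, possessives) is an NP-chunk
cp = nltk.RegexpParser(r"NP: {<[CDJNP].*>+}")
print(cp.evaluate(test_sents))   # IOB Accuracy 87.7%, F-Measure 69.2%
```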


SLIDE 62

Next week

 Relation extraction (sec. 17.2)
 Encoder-Decoder Models (sec. 10.1-10.2)