  1. 1 IN4080 – 2020 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning

  2. 2 Neural LMs, Recurrent networks, Sequence labeling, Information Extraction, Named-Entity Recognition, Evaluation Lecture 13, 9 Nov.

  3. Today 3  Feedforward neural networks  Neural Language Models  Recurrent networks  Information Extraction  Named Entity Recognition  Evaluation

  4. Last week 4  Feedforward neural networks (partly recap)  Model  Training  Computational graphs  Neural Language Models  Recurrent networks  Information Extraction

  5. Neural NLP 5  (Multi-layered) neural networks  Example: Neural language model (k- gram)  Using embeddings as word 𝑗−1 representations  𝑄 𝑥 𝑗 | 𝑥 𝑗−𝑙  Use embeddings for representing the 𝑥 𝑗 -s  Use neural network for 𝑗−1 estimating 𝑄 𝑥 𝑗 | 𝑥 𝑗−𝑙

  6. From J&M, 3.ed., 2019 6

  7. Pretrained embeddings 7  The last slide uses pretrained embeddings  Trained with some method: skip-gram, CBOW, GloVe, …  On some specific corpus  Can be downloaded from the web  Pretrained embeddings can also be the input to other tasks, e.g. text classification  The task of neural language modeling was also the basis for training the embeddings

  8. Training the embeddings 8  Alternatively, we may start with one-hot representations of words and train the embeddings as the first layer in our models (= the way the embeddings were trained in the first place)  If the goal is a task different from language modeling, this may result in embeddings better suited for the specific task.  We may even use two sets of embeddings for each word – one pretrained and one which is trained during the task.
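
A minimal sketch (not from the slides) of the two options, with PyTorch assumed as the framework; vocabulary size, dimensions and the random placeholder vectors are made up:

```python
# Sketch: two ways to get an embedding layer in PyTorch (assumed framework).
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 100

# (a) Train the embeddings from scratch as the first layer of the model
#     (equivalent to multiplying a one-hot vector with a weight matrix).
trained_emb = nn.Embedding(vocab_size, emb_dim)

# (b) Start from pretrained vectors (e.g. skip-gram/GloVe) loaded into a tensor;
#     freeze=True keeps them fixed, freeze=False fine-tunes them on the task.
pretrained_vectors = torch.randn(vocab_size, emb_dim)  # placeholder for real vectors
pretrained_emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

word_ids = torch.tensor([2, 17, 5])   # indices of three context words
print(trained_emb(word_ids).shape)    # torch.Size([3, 100])
```

Using two sets of embeddings per word then amounts to concatenating the outputs of (a) and (b) before the next layer.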

  9. Computational graph 10  Computational graph for the feedforward neural LM: the context words are embedded, $\mathbf{v}_i^{[1]} = E\mathbf{x}_i$, and concatenated, $\mathbf{v} = \mathrm{concat}(\mathbf{v}_1^{[1]}, \mathbf{v}_2^{[1]}, \mathbf{v}_3^{[1]})$; a hidden layer with weights $W$ and bias $\mathbf{c}^{[1]}$ and an output layer with weights $U$ and bias $\mathbf{c}^{[2]}$ followed by a softmax give $\hat{\mathbf{y}} = \mathrm{softmax}(U\,g(W\mathbf{v} + \mathbf{c}^{[1]}) + \mathbf{c}^{[2]})$, with a loss term at each output  This picture is if we train the embeddings E  With pretrained embeddings, we look up $\mathbf{v}_i^{[1]}$ in a table for each word
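
The following is a minimal numpy sketch of the forward pass this graph describes, for a 3-gram context; the sizes and the choice of ReLU as the hidden activation are assumptions, not taken from the slides:

```python
# Sketch of the feedforward k-gram LM forward pass from the computational graph:
# concatenate the context embeddings, apply one hidden layer, softmax over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 5000, 50, 64           # vocabulary size, embedding dim, hidden dim (made up)

E = rng.normal(size=(V, d))           # embedding table (trained or pretrained)
W = rng.normal(size=(hidden, 3 * d))  # hidden-layer weights
c1 = np.zeros(hidden)                 # hidden-layer bias
U = rng.normal(size=(V, hidden))      # output-layer weights
c2 = np.zeros(V)                      # output-layer bias

def forward(context_ids):
    """Distribution over the next word given three context word ids."""
    v = np.concatenate([E[i] for i in context_ids])  # v = concat(v1, v2, v3)
    a = np.maximum(0, W @ v + c1)                    # hidden layer (ReLU assumed)
    z = U @ a + c2                                   # output scores
    e = np.exp(z - z.max())                          # numerically stable softmax
    return e / e.sum()

probs = forward([12, 7, 431])    # hypothetical word indices
print(probs.shape, probs.sum())  # (5000,) ~1.0
```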

  10. 11 Recurrent networks

  11. Today 12  Feedforward neural networks  Recurrent networks  Model  Language Model  Sequence Labeling  Advanced architecture  Information Extraction  Named Entity Recognition  Evaluation

  12. Recurrent neural nets 13  Model sequences/temporal phenomena  A cell may send a signal back to itself – at the next moment in time  (Figure: the network, and its processing unrolled over time) https://en.wikipedia.org/wiki/Recurrent_neural_network

  13. Forward 14  Each of U, V and W is a set of edges with weights (a matrix)  $x_1, x_2, \ldots, x_n$ is the input sequence  Forward: 1. Calculate $h_1$ from $h_0$ and $x_1$. 2. Calculate $y_1$ from $h_1$. 3. Calculate $h_i$ from $h_{i-1}$ and $x_i$, and $y_i$ from $h_i$, for $i = 1, \ldots, n$ From J&M, 3.ed., 2019

  14. Forward 15  $\mathbf{h}_t = g(U\mathbf{h}_{t-1} + W\mathbf{x}_t)$  $\mathbf{y}_t = f(V\mathbf{h}_t)$  $g$ and $f$ are activation functions  (There are also bias terms which we didn't include in the formulas) From J&M, 3.ed., 2019
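
A small numpy sketch of these two equations, with tanh assumed for g, softmax for f, and made-up dimensions:

```python
# Simple RNN forward pass: h_t = g(U h_{t-1} + W x_t),  y_t = f(V h_t).
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, d_out = 50, 32, 5000      # made-up sizes
U = rng.normal(size=(d_h, d_h))      # hidden-to-hidden weights
W = rng.normal(size=(d_h, d_in))     # input-to-hidden weights
V = rng.normal(size=(d_out, d_h))    # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    """xs: input vectors x_1..x_n; returns all outputs y_t and the final hidden state."""
    h = np.zeros(d_h)                 # h_0
    ys = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)    # h_t from h_{t-1} and x_t
        ys.append(softmax(V @ h))     # y_t from h_t
    return ys, h

ys, h_n = rnn_forward([rng.normal(size=d_in) for _ in range(4)])
print(len(ys), ys[0].shape)           # 4 (5000,)
```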

  15. Training 16  At each output node: calculate the loss and the $\delta$-term  Backpropagate the error, e.g. the $\delta$-term at $h_2$ is calculated from the $\delta$-term at $h_3$ by U and from the $\delta$-term at $y_2$ by V  Update: V from the $\delta$-terms at the $y_i$-s, and U and W from the $\delta$-terms at the $h_i$-s From J&M, 3.ed., 2019

  16. Remark 17  J&M, 3. ed., 2019, sec 9.1.2 explains this at a high level using vectors and matrices, OK  The formulas, however, are not correct: describing derivatives of matrices and vectors demands a little more care, e.g. one has to transpose matrices  It is beyond this course to explain how this can be done in detail  But you should be able to do the actual calculations if you stick to the entries of the vectors and matrices, as we did above (ch. 7)

  17. Today 18  Feedforward neural networks  Recurrent networks  Model  Language Model  Sequence Labeling  Advanced architecture  Information Extraction  Named Entity Recognition  Evaluation

  18. RNN Language model 19  $\hat{\mathbf{y}} = P(w_n \mid w_1^{n-1}) = \mathrm{softmax}(V\mathbf{h}_n)$  In principle: unlimited history  a word depends on all preceding words  The word $w_i$ is represented by an embedding  or by a one-hot vector, in which case the embedding is made by the LM From J&M, 3.ed., 2019

  19. Autoregressive generation 20  Generated by probabilities:  Choose each word in accordance with the probability distribution  Part of more complex models  Encoder-decoder models  Translation From J&M, 3.ed., 2019
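
A sketch of the generation loop, assuming a hypothetical `step(word_id, h)` function that wraps one RNN time step and returns the next-word distribution together with the new hidden state:

```python
# Autoregressive generation: start from <s>, sample a word from the predicted
# distribution, feed it back in as the next input, and repeat until </s>.
import numpy as np

rng = np.random.default_rng(2)

def generate(step, h0, bos_id, eos_id, max_len=20):
    """step(word_id, h) -> (probs over vocab, new hidden state)  [assumed interface]."""
    h, w = h0, bos_id
    out = []
    for _ in range(max_len):
        probs, h = step(w, h)
        w = rng.choice(len(probs), p=probs)  # choose word in accordance with the distribution
        if w == eos_id:
            break
        out.append(w)
    return out
```

Replacing the sampling line with an argmax gives greedy decoding instead of sampling.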

  20. Today 21  Feedforward neural networks  Recurrent networks  Model  Language Model  Sequence Labeling  Advanced architecture  Information Extraction  Named Entity Recognition  Evaluation

  21. Neural sequence labeling: tagging 22  $\hat{\mathbf{y}} = P(t_n \mid w_1^{n}) = \mathrm{softmax}(V\mathbf{h}_n)$ From J&M, 3.ed., 2019

  22. Sequence labeling 23  Actual models for sequence labeling, e.g. tagging, are more complex  For example, they may take words after the tag into consideration.

  23. Today 24  Feedforward neural networks  Recurrent networks  Model  Language Model  Sequence Labeling  Advanced architecture  Information Extraction  Named Entity Recognition  Evaluation

  24. Stacked RNN 25  Can yield better results than a single layer  Reason?  Higher layers of abstraction  similar to image processing (convolutional nets) From J&M, 3.ed., 2019

  25. Bidirectional RNN 26  Example: Tagger  Considers both preceding and following words From J&M, 3.ed., 2019

  26. LSTM 27  Problems for RNNs:  keeping track of distant information  vanishing gradient: during backpropagation, going backwards through several layers, the gradient approaches 0  Long Short-Term Memory (LSTM):  an advanced architecture with additional layers and weights  we do not consider the details here  Bi-LSTM (bidirectional LSTM):  a popular standard architecture in NLP
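
The sketch below combines the last three slides: a stacked (num_layers=2), bidirectional LSTM tagger in PyTorch. The framework, layer sizes and tagset size are assumptions for illustration, not course code:

```python
# Stacked bidirectional LSTM tagger: one tag score vector per input token.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128, n_tags=17):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # num_layers=2 -> stacked; bidirectional=True -> forward and backward passes
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # 2*hidden: both directions

    def forward(self, word_ids):                   # word_ids: (batch, seq_len)
        states, _ = self.lstm(self.emb(word_ids))  # (batch, seq_len, 2*hidden)
        return self.out(states)                    # tag scores for every token

tagger = BiLSTMTagger()
scores = tagger(torch.randint(0, 10_000, (1, 6)))  # one sentence of 6 word ids
print(scores.shape)  # torch.Size([1, 6, 17])
```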

  27. 28 Information extraction

  28. Today 29  Feedforward neural networks (partly recap)  Recurrent networks  Information extraction, IE  Chunking  Named Entity Recognition  Evaluation

  29. IE basics 30 Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. (Wikipedia)  Bottom-Up approach  Start with unrestricted texts, and do the best you can  The approach was in particular developed by the Message Understanding Conferences (MUC) in the 1990s  Select a particular domain and task

  30. A typical pipeline 31 From NLTK

  31. Some example systems 32  Stanford core nlp: http://corenlp.run/  SpaCy (Python): https://spacy.io/docs/api/  OpenNLP (Java): https://opennlp.apache.org/docs/  GATE (Java): https://gate.ac.uk/  https://cloud.gate.ac.uk/shopfront  UDPipe: http://ufal.mff.cuni.cz/udpipe  Online demo: http://lindat.mff.cuni.cz/services/udpipe/  Collection of tools for NER:  https://www.clarin.eu/resource-families/tools-named-entity-recognition
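
As a usage example for one of the listed systems, a minimal spaCy NER call; it assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pretrained pipeline with tagger, parser, NER
doc = nlp("Jan Tore Lønning teaches IN4080 at the University of Oslo.")

for ent in doc.ents:                 # named entities found by the model
    print(ent.text, ent.label_)      # e.g. PERSON and ORG entities
```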

  32. Today 33  Feedforward neural networks (partly recap)  Recurrent networks  Information extraction, IE  Chunking  Named Entity Recognition  Evaluation

  33. Next steps 34  Chunk words together into phrases

  34. NP-chunks 35 [ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.  Exactly what is an NP-chunk?  Flat structure: no NP-chunk is part of another NP chunk  It is an NP  Maximally large  But not all NPs are chunks  Opposing restrictions

  35. Chunking methods 36  Hand-written rules  Regular expressions  Supervised machine learning

  36. Regular Expression Chunker 37  Input: POS-tagged sentences  Use a regular expression over POS-tags to identify NP-chunks  It inserts parentheses  NLTK example: grammar = r""" NP: {<DT|PP\$>?<JJ>*<NN>} {<NNP>+} """
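
A usage sketch of the grammar with NLTK's RegexpParser, on the POS-tagged example sentence from the NLTK book:

```python
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
      {<NNP>+}                # sequences of proper nouns
"""
chunker = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = chunker.parse(sentence)   # an nltk.Tree with NP subtrees ("parentheses")
print(tree)
# roughly: (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```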

  37. IOB-tags 38  B-NP: first word of an NP  I-NP: part of an NP, not the first word  O: not part of an NP (phrase)  Properties:  one tag per token  unambiguous  does not insert anything in the text itself
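
For illustration (an NLTK book example, not from these slides), the same chunk structure written with brackets and with IOB tags:

  Bracketed:   [NP We ] saw [NP the yellow dog ]
  IOB-tagged:  We/B-NP  saw/O  the/B-NP  yellow/I-NP  dog/I-NP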

  38. Assigning IOB-tags 39  The process can be considered a form of tagging:  POS-tagging: word to POS-tag  IOB-tagging: POS-tag to IOB-tag  But one may use additional features as well, e.g. the words themselves  Can use various types of classifiers  NLTK uses a MaxEnt classifier (= logistic regression, but the implementation is slow)  We can modify this along the lines of mandatory assignment 2, using scikit-learn (see the sketch below)
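
A sketch of such a classifier-based IOB tagger with scikit-learn; the feature set and the toy training data are made up for illustration:

```python
# One training example per token: simple POS-context features -> IOB label.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tags, i):
    """Features for token i of a POS-tagged sentence: previous/current/next POS-tag."""
    return {
        "pos": tags[i],
        "prev_pos": tags[i - 1] if i > 0 else "<s>",
        "next_pos": tags[i + 1] if i < len(tags) - 1 else "</s>",
    }

# Toy training data: POS-tag sequences with their gold IOB labels.
train_sents = [(["DT", "JJ", "NN", "VBD"], ["B-NP", "I-NP", "I-NP", "O"])]
X = [token_features(tags, i) for tags, _ in train_sents for i in range(len(tags))]
y = [label for _, labels in train_sents for label in labels]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([token_features(["DT", "NN", "VBD"], 1)]))  # e.g. ['I-NP']
```

Words (and word shapes) can be added to the feature dictionary in the same way.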

  39. J&M, 3. ed. 40

  40. Today 41  Feedforward neural networks (partly recap)  Recurrent networks  Information extraction, IE  Chunking  Named Entity Recognition  Evaluation
