
Information Extraction Philipp Koehn 28 October 2019 Philipp Koehn - PowerPoint PPT Presentation



  1. Information Extraction Philipp Koehn 28 October 2019 Philipp Koehn Introduction to Human Language Technology: Information Extraction 28 October 2019

  2. Text → Knowledge
     • Human knowledge is stored in text
     • How can we extract this to make it available for processing by machines?

  3. examples

  4. Goal: Build Database of World Leaders
     Country         Position        Person
     United States   president       George Walker Bush
     United States   president       Barack Hussein Obama
     United States   president       Donald Trump
     Germany         chancellor      Gerhard Schröder
     Germany         chancellor      Angela Merkel
     United Kingdom  prime minister  Theresa May
     United Kingdom  prime minister  Alexander Boris de Pfeffel Johnson
     China           president       Hu Jintao
     China           president       Xi Jinping
     India           prime minister  Manmohan Singh
     India           prime minister  Narendra Modi

  5. Extracting Relations
     • From this snippet, we can extract: (United States, president, Barack Hussein Obama)
     • Why is this a hard problem?

  6. Extracting Events
     • Report of soccer game
       – when? where? who? what? why?
       – players involved, information about each player, each goal, audience size, ...?
     • Multiple database tables, connection between entities

  7. structural knowledge

  8. Ontologies

  9. Knowledge Graphs

  10. Frames

  11. Scripts

  12. named entities

  13. Named Entities
      • Essential processing step: identifying named entities
      • Types
        – persons
        – geo-political entities (GPE)
        – events
        – dates
        – numbers

  14. Example
      [PERSON Boris Johnson ] ’s [GPE cabinet ] is divided over how to proceed with [EVENT Brexit ] , as the [PERSON prime minister ] faces the stark choice of pressing ahead with his deal or gambling his premiership on a [DATE pre-Christmas ] general election. The [PERSON prime minister ] told [PERSON MPs ] at [DATE Wednesday ] ’s [EVENT PMQs ] that he was awaiting the decision of the [GPE EU27 ] over whether to grant an extension before settling his next move. Some [PERSON cabinet ministers ] , including the [PERSON [GPE Northern Ireland ] secretary, Julian Smith ] , believe the majority of [NUMBER 30 ] achieved by the [GPE government ] on the second reading of the [EVENT Brexit ] bill on [DATE Tuesday ] suggests [PERSON Johnson ] ’s deal has enough support to carry it through all its stages in [GPE parliament ] .

  15. Named Entity Tagging
      • Problem broken up into two parts
      • Tagging where named entities start and end
        [NE Boris Johnson ] ’s [NE cabinet ] is divided over how to proceed with [NE Brexit ] , as the [NE prime minister ] faces the stark
      • Classification of types
        [PERSON Boris Johnson ] ’s [GPE cabinet ] is divided over how to proceed with [EVENT Brexit ] , as the [PERSON prime minister ] faces the stark

  16. Tagging
      • Convert into BIO sequence (begin / intermediate / other)
        Boris  Johnson  ’s  cabinet  is  divided  over  how  to  proceed  with  Brexit  ,
        B      I        O   B        O   O        O     O    O   O        O     B       O
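The BIO conversion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `spans_to_bio` and `bio_to_spans` and the representation of entities as (start, end) token offsets with an exclusive end are assumptions made for the example.

```python
def spans_to_bio(tokens, spans):
    """Turn entity spans into one BIO tag per token.

    spans holds (start, end) token offsets, end exclusive (an assumed format).
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"                 # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the entity
    return tags


def bio_to_spans(tags):
    """Recover (start, end) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            spans.append((start, i))      # "B" or "O" closes the open span
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            start = i                     # tolerate "I" without a preceding "B"
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = ["Boris", "Johnson", "'s", "cabinet", "is", "divided", "over",
          "how", "to", "proceed", "with", "Brexit", ","]
spans = [(0, 2), (3, 4), (11, 12)]        # Boris Johnson, cabinet, Brexit
tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
```

Because the mapping is reversible, a sequence tagger only has to predict one of three labels per token, and the entity boundaries can be read back off the predicted tags.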

  17. Bayes Rule
      • We want to find the best part-of-speech tag sequence $T$ for a sentence $S$
        $\operatorname{argmax}_T \, p(T \mid S)$
      • Bayes rule gives us
        $p(T \mid S) = \frac{p(S \mid T) \, p(T)}{p(S)}$
      • We can drop $p(S)$ if we are only interested in $\operatorname{argmax}_T$
        $\operatorname{argmax}_T \, p(T \mid S) = \operatorname{argmax}_T \, p(S \mid T) \, p(T)$

  18. Decomposing the Model
      • The mapping $p(S \mid T)$ can be decomposed into
        $p(S \mid T) = \prod_i p(w_i \mid t_i)$
      • $p(T)$ could be called a part-of-speech language model, for which we can use an n-gram model:
        $p(T) = p(t_1) \, p(t_2 \mid t_1) \, p(t_3 \mid t_1, t_2) \cdots p(t_n \mid t_{n-2}, t_{n-1})$
      • We can estimate $p(S \mid T)$ and $p(T)$ with maximum likelihood estimation (and maybe some smoothing)
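Maximum likelihood estimation of the two component models amounts to counting and normalizing. The sketch below assumes a tiny hand-made tagged corpus, a hypothetical function name `estimate_hmm`, and a padding symbol `<s>` for the sentence start so that the first two tags also have a two-tag history. None of these come from the slides, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of p(w_i | t_i) and p(t_i | t_{i-2}, t_{i-1})."""
    emit_counts = defaultdict(Counter)    # tag -> word counts
    trans_counts = defaultdict(Counter)   # (t_{i-2}, t_{i-1}) -> next-tag counts
    for sentence in tagged_sentences:
        history = ("<s>", "<s>")          # assumed start-of-sentence padding
        for word, tag in sentence:
            emit_counts[tag][word] += 1
            trans_counts[history][tag] += 1
            history = (history[1], tag)

    def normalize(counts):
        # Divide each count by the total for its conditioning context.
        return {ctx: {x: c / sum(cs.values()) for x, c in cs.items()}
                for ctx, cs in counts.items()}

    return normalize(emit_counts), normalize(trans_counts)


# Toy corpus, invented for illustration only.
corpus = [
    [("Boris", "B"), ("Johnson", "I"), ("'s", "O"), ("cabinet", "B")],
    [("Johnson", "B"), ("is", "O"), ("divided", "O")],
]
emission, transition = estimate_hmm(corpus)
print(emission["B"])
print(transition[("<s>", "B")])
```

With so little data most events have zero probability, which is why the slide mentions smoothing. A real estimator would reserve some probability mass for unseen words and tag histories.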

  19. Hidden Markov Model (HMM)
      • The model we just developed is a Hidden Markov Model
      • Elements of an HMM model:
        – a set of states (here: the tags)
        – an output alphabet (here: words)
        – initial state (here: beginning of sentence)
        – state transition probabilities (here: $p(t_n \mid t_{n-2}, t_{n-1})$)
        – symbol emission probabilities (here: $p(w_i \mid t_i)$)

  20. Search for the Best Tag Sequence
      • We have defined a model, but how do we use it?
        – given: word sequence
        – wanted: tag sequence
      • If we consider a specific tag sequence, it is straightforward to compute its probability
        $p(S \mid T) \, p(T) = \prod_i p(w_i \mid t_i) \, p(t_i \mid t_{i-2}, t_{i-1})$
      • Problem: if we have on average $c$ choices for each of the $n$ words, there are $c^n$ possible tag sequences, maybe too many to efficiently evaluate
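To make the cost of exhaustive search concrete, the following sketch scores every candidate tag sequence with the product formula above and keeps the best one. The functions `p_emit` and `p_trans` are hand-set toy scores invented for this example (they are not estimated probabilities and not part of the lecture), and `<s>` is an assumed padding symbol for the sentence start.

```python
from itertools import product


def score(words, tags, p_emit, p_trans):
    """Product over i of p(w_i | t_i) * p(t_i | t_{i-2}, t_{i-1})."""
    prob = 1.0
    history = ("<s>", "<s>")              # assumed start-of-sentence padding
    for word, tag in zip(words, tags):
        prob *= p_emit(word, tag) * p_trans(tag, history)
        history = (history[1], tag)
    return prob


def brute_force_best(words, tagset, p_emit, p_trans):
    """Enumerate all len(tagset) ** len(words) sequences and keep the best."""
    candidates = product(tagset, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags, p_emit, p_trans))


def p_emit(word, tag):
    # Toy score: capitalized words look like entity tokens.
    if word[0].isupper():
        return 0.4 if tag in ("B", "I") else 0.2
    return 0.6 if tag == "O" else 0.2


def p_trans(tag, history):
    # Toy score that only looks at the previous tag; "I" cannot follow "O" or the start.
    prev = history[1]
    if prev in ("<s>", "O"):
        return {"B": 0.5, "I": 0.0, "O": 0.5}[tag]
    return {"B": 0.2, "I": 0.4, "O": 0.4}[tag]


tagset = ("B", "I", "O")
words = ["Boris", "Johnson", "'s", "cabinet"]
best = brute_force_best(words, tagset, p_emit, p_trans)
n_candidates = len(tagset) ** len(words)
print(best, n_candidates)
```

Four words and three tags already give 81 candidates; a 30-word sentence would give 3^30, roughly 2 × 10^14, which is the motivation for a dynamic programming search.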

  21. Walking through the States
      • First, we go to state B to emit Boris:
        (Diagram: path START → B in a trellis of states B, I, O over the words Boris, Johnson, ’s, cabinet)

  22. Walking through the States
      • Then, we go to state I to emit Johnson:
        (Diagram: path START → B → I in a trellis of states B, I, O over the words Boris, Johnson, ’s, cabinet)

  23. Walking through the States
      • Of course, there are many possible paths:
        (Diagram: full trellis from START through states B, I, O at each of the words Boris, Johnson, ’s, cabinet)

  24. Viterbi Algorithm
      • Intuition: Since state transitions out of a state only depend on the current state (and not previous states), we can record for each state the optimal path
      • We record:
        – probability of the best path to state $j$ at step $s$ in $\delta_j(s)$
        – backtrace from that state to best predecessor $\psi_j(s)$
      • Stepping through all states at each time step allows us to compute
        – $\delta_j(s+1) = \max_{1 \le i \le N} \delta_i(s) \, p(t_j \mid t_i) \, p(w_s \mid t_j)$
        – $\psi_j(s+1) = \operatorname{argmax}_{1 \le i \le N} \delta_i(s) \, p(t_j \mid t_i) \, p(w_s \mid t_j)$
      • Best final state is $\operatorname{argmax}_{1 \le i \le N} \delta_i(S+1)$, where $S$ here is the number of words; we can backtrack from there
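The recursion on this slide can be sketched for a bigram HMM as follows. This is an illustrative implementation under stated assumptions, not the lecture's code: the probability tables `start_p`, `trans_p` and `emit_p` are toy values invented for the example, and the indexing follows the common convention where the tag reached at a step emits the word at that same step.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a bigram HMM, with its probability."""
    # delta[s][j]: probability of the best path that ends in tag j at word s
    delta = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    # psi[s][j]: best predecessor of tag j at word s (empty for the first word)
    psi = [{}]
    for word in words[1:]:
        scores, back = {}, {}
        for j in tags:
            best_i = max(tags, key=lambda i: delta[-1][i] * trans_p[i][j])
            scores[j] = delta[-1][best_i] * trans_p[best_i][j] * emit_p[j].get(word, 0.0)
            back[j] = best_i
        delta.append(scores)
        psi.append(back)
    # Pick the best final state and follow the backpointers.
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for back in reversed(psi[1:]):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Toy model, invented for illustration only.
tags = ["B", "I", "O"]
start_p = {"B": 0.5, "I": 0.0, "O": 0.5}
trans_p = {                               # trans_p[i][j] = p(t_j | t_i)
    "B": {"B": 0.2, "I": 0.4, "O": 0.4},
    "I": {"B": 0.2, "I": 0.4, "O": 0.4},
    "O": {"B": 0.5, "I": 0.0, "O": 0.5},
}
emit_p = {                                # emit_p[j][w] = p(w | t_j)
    "B": {"Boris": 0.4, "Johnson": 0.2, "cabinet": 0.4},
    "I": {"Johnson": 1.0},
    "O": {"'s": 0.5, "is": 0.5},
}
words = ["Boris", "Johnson", "'s", "cabinet"]
path, prob = viterbi(words, tags, start_p, trans_p, emit_p)
print(path, prob)
```

For n words and c tags this performs on the order of n · c² multiplications instead of scoring c^n sequences. On long sentences the products underflow, so practical implementations usually add log probabilities instead of multiplying.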

  25. entity linking

  26. Same Person
      [PERSON Boris Johnson ] ’s cabinet is divided over how to proceed with Brexit, as the [PERSON prime minister ] faces the stark choice of pressing ahead with his deal or gambling his premiership on a pre-Christmas general election. The [PERSON prime minister ] told MPs at Wednesday’s PMQs that he was awaiting the decision of the EU27 over whether to grant an extension before settling his next move. Some cabinet ministers, including the Northern Ireland secretary, Julian Smith, believe the majority of 30 achieved by the government on the second reading of the Brexit bill on Tuesday suggests [PERSON Johnson ] ’s deal has enough support to carry it through all its stages in parliament.
      • Same person referred to 4 times in 3 different ways
