Statistical Natural Language Processing Dr. Besnik Fetahu Lecture


  1. Statistical Natural Language Processing Dr. Besnik Fetahu

  2. Lecture • Lecture: • Thursdays: 10:00 – 11:30 • Location: MultiMedia Raum, L3S, Appelstr. 9a • Contact: • Dr. Besnik Fetahu • Tel: 17797 • E-mail: fetahu@L3S.uni-hannover.de

  3. Exercises • Exercises: • Thursdays: 11:30 – 13:00 • Location: MultiMedia Raum, L3S, Appelstr. 9a • Contact: • Lijun Lyu • E-mail: lyu@L3S.uni-hannover.de • Finish an average of > 50% of exercises for a 1.0 grade point improvement

  4. Course Info • http://l3s.de/~fetahu/courses/nlp_course_ws2018/ • Administrative Information • Course Slides • Exercise Sheets • Google Group: nlp_luh_2018? • Purpose: • Announcements • Discussions • Questions

  5. Exam Info • Written exam on: • Date & Time: TBA • Duration: 90 minutes • Location: TBA

  6. Literature • Christopher D. Manning and Hinrich Schütze: “Foundations of Statistical Natural Language Processing”. MIT Press, 1999. • Daniel Jurafsky and James H. Martin: “Speech and Language Processing”. Pearson Education, 2000. • Christopher Bishop: “Pattern Recognition and Machine Learning”. Springer, 2006. • Ian Goodfellow, Yoshua Bengio, and Aaron Courville: “Deep Learning”. MIT Press, 2016.

  7. Literature • https://www.tib.eu/en/search/id/TIBKAT%3A188854029/Natural-language-engineering/ • https://www.tib.eu/en/search/id/TIBKAT%3A577240269/Speech-and-language-processing-an-introduction/ • https://www.tib.eu/en/search/id/TIBKAT%3A627718655/Pattern-recognition-and-machine-learning/ • https://www.tib.eu/en/search/id/springer%3Adoi~10.1007%252Fs10710-017-9314-z/Ian-Goodfellow-Yoshua-Bengio-and-Aaron-Courville/

  8. Course Topics • Mathematical Foundations • Linguistic Essentials • Language Models • Hidden Markov Models • Logistic Regression • Part of Speech Tagging • Grammars • Dependency Parsing • Word Representations and Evaluation • Recurrent Neural Networks • Named Entity Recognition, Named Entity Disambiguation, Relation Extraction • Other topics of interest?

  9. 1. Introduction

  10. Natural Language Processing (NLP) • NLP is the task of processing natural language in an automated manner • Language is inherently difficult to process and understand automatically due to: • Ambiguity • Genre/Domain • Spatial, contextual, and temporal aspects • Prior information (speaker background, common sense etc.)

  11. Natural Language Processing • Fundamental questions in the study of language: 1. What kinds of things do people say? 2. What do these things say/ask/request about the world?

  12. Natural Language Processing 1. What kinds of things do people say? • Analyze if something is grammatically correct (structurally well formed) • Measure the frequency of utterances (words, phrases etc.) to determine conventionality
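
Measuring the frequency of utterances can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the regular-expression tokenizer, and the helper name `word_frequencies` are assumptions made for the example and are not part of the lecture material.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Return a Counter of lowercase word tokens found in `text`."""
    # Very rough tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat . the dog sat on the log ."
freqs = word_frequencies(corpus)

# Relative frequency = count / total number of tokens.
total = sum(freqs.values())
relative = {word: count / total for word, count in freqs.items()}

print(freqs.most_common(3))
print(round(relative["the"], 3))
```

On a real corpus the same counts, collected per time period or per genre, give a first handle on how conventional a word or phrase is.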

  13. Natural Language Processing • Language is filled with non-categorical phenomena: • Language change is gradual and can be traced by analyzing a word's frequency and its context: • “while” was used as a noun to indicate time; it is now also used as a complementizer (introducing subordinate clauses) • “gay” was used to indicate happiness (an emotional state); it is now used to indicate sexual orientation • Words can have multiple syntactic and semantic senses: • “bank” can refer to a river bank, a financial institution, etc. • “can” can be a verb or a noun • Probabilistic approaches are well suited to natural language understanding: • They incorporate priors (world priors, contextualized priors) • They cope with incomplete information in a language utterance

  14. Natural Language Processing – cases of ambiguity • Lexical: “I saw a bat” • Syntactic: “Our company is training workers” • Semantic: “John kissed his wife, and so did Sam” • Anaphoric: “Margaret invited Susan for a visit, and she gave her a good lunch.” (she = Margaret; her = Susan) • Non-literal speech: “The price of tomatoes in Des Moines has gone through the roof” (= increased greatly) • Ellipsis: “I am allergic to tomatoes. Also fish.”

  15. NLP – cases of ambiguity • Polysemy – words having multiple senses (e.g., “book”, “bank”, “can” etc.) • Hyponym – represents a typeOf relationship with its hypernym (e.g. “pigeon” and “crow” are hyponyms of “bird”) • Synonyms – words or phrases that mean exactly or nearly the same thing as another word or phrase.

  16. NLP – cases of ambiguity https://en.wikipedia.org/wiki/Homonym

  17. 2. NLP Corpora & Infrastructure

  18. NLP corpora • Brown Corpus - ~1 million tagged words in American English. • Balanced representation of different genres (e.g. politics, sports, etc.) • Penn Treebank – annotated text from the Wall Street Journal. • WordNet - lexical database of English words, where nouns, verbs, adjectives and adverbs are grouped into synsets. • Wikipedia – large corpus of articles for a wide range of topics

  19. NLP corpora • SQuAD - the Stanford Question Answering Dataset is a reading comprehension dataset. • Twenty Newsgroups - a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. • SNLI and MultiNLI - SNLI is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral; MultiNLI follows the same labeling scheme with 433k sentence pairs drawn from multiple genres.

  20. NLP Infrastructure • Programming languages: • Python (preferred), Java • NLP dedicated libraries • NLTK (good for toy examples, use Stanford CoreNLP for better accuracy) • Gensim • Data manipulation libraries • Pandas

  22. 2. NLP tasks

  23. Part of Speech Tagging • POS tagging is the task of labeling each word in a sentence with its appropriate part of speech. • 36 POS tags in Penn Treebank: • Nouns, verbs, prepositions, adjectives etc.
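
As a rough illustration of POS tagging, the sketch below implements a most-frequent-tag baseline: each word receives the tag it carried most often in training data. The four hand-tagged sentences and the helper names `train_baseline` and `tag` are hypothetical and made up for this example; the tag names follow the Penn Treebank convention. The lecture's own taggers (e.g. Hidden Markov Models) go beyond this baseline.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (hypothetical data for illustration).
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("empty", "JJ")],
    [("she", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("he", "PRP"), ("can", "MD"), ("run", "VB")],
]

def train_baseline(sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NN"):
    """Tag each word; unknown words fall back to `default` (an assumption)."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_baseline(tagged_sentences)
tagged = tag(["the", "dog", "can", "swim"], model)
print(tagged)
```

Because the baseline ignores context, it always tags “can” as a modal (MD) here, even in “the can is empty”; resolving such cases is what context-aware taggers are for.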

  24. Named Entity Recognition • NER is the process of identifying words/surface forms that name entities and classifying them into predefined categories (e.g. Person, Location, Organization).

  25. Word Sense Disambiguation • Word sense disambiguation (WSD): determines the correct sense of a word given its context. • Example: “The robot that can recycle a can is useful for the environment.”
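
One classic way to approach WSD is a simplified Lesk-style overlap: pick the sense whose dictionary gloss shares the most words with the context. The sketch below is an assumption-laden illustration, not the method used in the lecture: the two glosses for “can”, the stopword list, and the function names are all invented for the example.

```python
# Hypothetical sense inventory for "can"; glosses written for this example.
SENSES = {
    "can.noun": "a metal container for food drink or waste that people recycle",
    "can.verb": "be able to do something have the ability or skill",
}

# Small stopword list (an assumption) so function words do not count as overlap.
STOPWORDS = {"a", "the", "for", "or", "to", "is", "that", "of", "have", "be", "do"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    overlaps = {
        sense: len(context_set & content_words(gloss))
        for sense, gloss in senses.items()
    }
    return max(overlaps, key=overlaps.get)

print(simplified_lesk("please recycle the empty metal can", SENSES))
print(simplified_lesk("she is able to swim and can dive", SENSES))
```

The two calls use separate contexts on purpose: a sentence that contains both senses of “can” needs a window around each occurrence rather than the whole sentence.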

  26. Phrase Structure Parsing • Phrase structure parsing organizes syntax into constituents or brackets • In general, this involves nested trees
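
To make the idea of nested constituents concrete, the sketch below stores a hand-built parse as nested tuples and prints it in bracketed form. The tuple encoding, the example tree, and the helper names `to_brackets` and `leaves` are assumptions chosen for illustration; no parser is involved.

```python
# A constituent is (label, child, child, ...); a leaf is a plain string.
# This tree was written by hand for illustration.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "robot")),
        ("VP", ("VBZ", "recycles"),
               ("NP", ("DT", "a"), ("NN", "can"))))

def to_brackets(node):
    """Render a nested-tuple tree in bracketed (treebank-like) notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

def leaves(node):
    """Collect the words of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

bracketed = to_brackets(tree)
words = leaves(tree)
print(bracketed)
print(words)
```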

  27. Named Entity Disambiguation • Named entity disambiguation (NED) is the task of resolving surface forms based on their context to entities from a reference database. Credit to: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/

  28. Text Categorization/Classification • Text categorization: given a predetermined set of categories, determine which one a piece of text belongs to (e.g. spam or not spam for e-mails).
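
The spam example can be sketched with a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is swapped in here purely as a simple illustration (the course topics list Logistic Regression instead), and the four training messages, the labels, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (text, label) pairs.
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def train_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

nb_model = train_nb(train)
print(predict("cheap money", *nb_model))
print(predict("meeting tomorrow", *nb_model))
```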

  29. Textual Entailment • Entailment is the task of determining whether a premise (text) supports, rejects, or has no information about a given hypothesis.
      # | TEXT | HYPOTHESIS | TASK | ENTAILMENT
      1 | Regan attended a ceremony in Washington to commemorate the landings in Normandy. | Washington is located in Normandy. | IE | False
      2 | Google files for its long awaited IPO. | Google goes public. | IR | True
      4 | The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%. | The SPD is defeated by the opposition parties. | IE | True
      Credit to: http://www.cs.biu.ac.il/~dagan/TE-Tutorial-ACL07.ppt

  30. Machine Translation • Machine translation (MT) is the task of converting a piece of text from one natural language into another, while preserving the meaning and producing fluent text in the target language.

  31. Question Answering • Question answering (QA) can be open or closed domain; for a given question, the task is to find a textual snippet that answers it. Credit to: https://arxiv.org/pdf/1806.03822.pdf

  32. Other NLP tasks • Co-reference resolution: resolve pronouns to the proper nouns they refer to. • Relation extraction: extract binary (or n-ary) relations from text. • Sentiment analysis: determine if a piece of text has positive or negative sentiment. • Keyword extraction: determine which words in a piece of text are salient. • Language models: models that are able to generate text for a given set of seed words. • Topic modelling: extract the topics in a document. • Word collocations/co-occurrences
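
For the last item, word collocations, one common score is pointwise mutual information (PMI), which compares how often two words occur next to each other with how often they would if they were independent. The sketch below is an illustration under assumptions: the toy token sequence is made up, and PMI is one possible association measure among several.

```python
import math
from collections import Counter

# Toy token sequence, made up for illustration.
tokens = ("new york is a big city . i love new york . "
          "a big dog is in the city .").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
n_uni = len(tokens)
n_bi = len(tokens) - 1

def pmi(w1, w2):
    """PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) ); -inf if never adjacent."""
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")
    p_xy = joint / n_bi
    p_x = unigrams[w1] / n_uni
    p_y = unigrams[w2] / n_uni
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("new", "york"), 3))
print(round(pmi("is", "a"), 3))
```

In this toy data “new york” scores higher than “is a” because the two words appear only together. On small corpora PMI overrates rare pairs, so a minimum count threshold is usually applied.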

  33. 3. Statistical NLP

  34. What is Statistical NLP? • P(to | Sarah drove) • P(“time” is a verb | S = “Time flies like an arrow”) • It involves deriving numerical data from text • Use probabilities to describe events, text, phrase occurrences, tagging etc. • No hard constraints as in categorical grammars • Use of approximation techniques for hard problems
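
A quantity such as P(to | Sarah drove) can be estimated from text by relative frequency: count how often “Sarah drove” is followed by “to” and divide by how often “Sarah drove” is followed by any word. The sketch below assumes a made-up four-sentence corpus and a plain maximum-likelihood estimate without smoothing; the helper names are hypothetical.

```python
from collections import Counter

# Toy corpus, made up for illustration.
corpus = [
    "sarah drove to work",
    "sarah drove to school",
    "sarah drove home",
    "tom drove to work",
]

def ngrams(tokens, n):
    """Yield all consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

context_counts = Counter()       # counts of (w1, w2) followed by some word
continuation_counts = Counter()  # counts of (w1, w2, w3)
for sentence in corpus:
    tokens = sentence.split()
    for w1, w2, w3 in ngrams(tokens, 3):
        context_counts[(w1, w2)] += 1
        continuation_counts[(w1, w2, w3)] += 1

def cond_prob(word, context):
    """Maximum-likelihood estimate of P(word | context), context = (w1, w2)."""
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return continuation_counts[context + (word,)] / context_counts[context]

p_to = cond_prob("to", ("sarah", "drove"))
print(p_to)  # 2 of the 3 "sarah drove" contexts continue with "to"
```

Unseen contexts get probability 0 here, which is one reason smoothing techniques are used in practice.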

  35. What is Statistical NLP? • Human cognition has a probabilistic nature • In language (written or spoken) we are faced with incomplete, uncertain information, and thus interpretation has to be based on probabilities • Humans resolve the high level of ambiguity in real time by incorporating diverse sources of evidence, including frequency information • The goal of computational linguistics is to mimic this behavior and interpret language in terms of probabilities
