Statistical Natural Language Processing - Dr. Besnik Fetahu - Lecture (PowerPoint PPT presentation)

SLIDE 1

Statistical Natural Language Processing

  • Dr. Besnik Fetahu
SLIDE 2

Lecture

  • Lecture:
  • Thursdays: 10:00 – 11:30
  • Location: MultiMedia Raum, L3S, Appelstr. 9a
  • Contact:
  • Dr. Besnik Fetahu
  • Tel: 17797
  • E-mail: fetahu@L3S.uni-hannover.de

SLIDE 3

Exercises

  • Exercises:
  • Thursdays: 11:30 – 13:00
  • Location: MultiMedia Raum, L3S, Appelstr. 9a
  • Contact:
  • Lijun Lyu
  • E-mail: lyu@L3S.uni-hannover.de
  • Finish an average of > 50% of exercises for a 1.0 grade point improvement

SLIDE 4

Course Info

  • http://l3s.de/~fetahu/courses/nlp_course_ws2018/
  • Administrative Information
  • Course Slides
  • Exercise Sheets
  • Google Group: nlp_luh_2018?
  • Purpose:
  • Announcements
  • Discussions
  • Questions

SLIDE 5

Exam Info

  • Written exam on:
  • Date & Time: TBA
  • Duration: 90 minutes
  • Location: TBA

SLIDE 6

Literature

  • Christopher D. Manning and Hinrich Schütze: “Foundations of Statistical Natural Language Processing”. MIT Press, 1999.
  • Dan Jurafsky and James H. Martin: “Speech and Language Processing”. Pearson Education, 2000.
  • Christopher Bishop: “Pattern Recognition and Machine Learning”, 2006.
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville: “Deep Learning”. MIT Press, 2016.

SLIDE 7

Literature

https://www.tib.eu/en/search/id/TIBKAT%3A188854029/Natural-language-engineering/
https://www.tib.eu/en/search/id/TIBKAT%3A577240269/Speech-and-language-processing-an-introduction/
https://www.tib.eu/en/search/id/TIBKAT%3A627718655/Pattern-recognition-and-machine-learning/
https://www.tib.eu/en/search/id/springer%3Adoi~10.1007%252Fs10710-017-9314-z/Ian-Goodfellow-Yoshua-Bengio-and-Aaron-Courville/

SLIDE 8

Course Topics

  • Mathematical Foundations
  • Linguistic Essentials
  • Language Models
  • Hidden Markov Models
  • Logistic Regression
  • Part of Speech Tagging
  • Grammars
  • Dependency Parsing
  • Word Representations and Evaluation
  • Recurrent Neural Networks
  • Named Entity Recognition, Named Entity Disambiguation, Relation Extraction

  • Other topics of interest?

SLIDE 9
  • 1. Introduction

SLIDE 10

Natural Language Processing (NLP)

  • NLP is the task of processing natural language in an automated manner
  • Language is inherently difficult to automatically process and understand due to:
  • Ambiguity
  • Genre/domain
  • Spatial, contextual, and temporal aspects
  • Prior information (speaker background, common sense, etc.)

SLIDE 11

Natural Language Processing

  • Fundamental questions in the study of language:
  • 1. What kinds of things do people say?
  • 2. What do these things say/ask/request about the world?

SLIDE 12

Natural Language Processing

  • 1. What kinds of things do people say?
  • Analyze whether something is grammatically correct (structurally well formed)
  • Measure the frequency of utterances (words, phrases, etc.) to determine conventionality

SLIDE 13

Natural Language Processing

  • Language is filled with non-categorical phenomena:
  • Language changes are gradual and can be traced by analyzing a word's frequency and its context:
  • “while” was used as a noun to indicate time; now it is used as a complementizer (subordinate clauses)
  • “gay” was used to indicate happiness (an emotional state); now it is used to indicate sexual preference
  • Words can have multiple syntactic and semantic senses:
  • “bank” can refer to a river bank, a financial institution, etc.
  • “can” can be a verb or a noun
  • Probabilistic approaches are best suited for natural language understanding:
  • They incorporate priors (world priors, contextualized priors)
  • They handle incomplete information from a language utterance

SLIDE 14

Natural Language Processing – cases of ambiguity

  • Lexical: “I saw a bat”
  • Syntactic: “Our company is training workers”
  • Semantic: “John kissed his wife, and so did Sam”
  • Anaphoric: “Margaret invited Susan for a visit, and she gave her a good lunch.” (she = Margaret; her = Susan)
  • Non-literal speech: “The price of tomatoes in Des Moines has gone through the roof” (= increased greatly)
  • Ellipsis: “I am allergic to tomatoes. Also fish.”

SLIDE 15

NLP – cases of ambiguity

  • Polysemy – a word having multiple senses (e.g., “book”, “bank”, “can”)
  • Hyponym – represents a typeOf relationship with its hypernym (e.g., “pigeon”, “crow” as “birds”)
  • Synonyms – words or phrases that mean exactly or nearly the same thing as another lexeme

SLIDE 16

NLP – cases of ambiguity

https://en.wikipedia.org/wiki/Homonym

SLIDE 17
  • 2. NLP Corpora & Infrastructure

SLIDE 18

NLP corpora

  • Brown Corpus – ~1 million tagged words of American English
  • Balanced representation of different genres (e.g., politics, sports, etc.)
  • Penn Treebank – annotated text from the Wall Street Journal
  • WordNet – lexical database of English words, where nouns, verbs, adjectives, and adverbs are grouped into synsets
  • Wikipedia – large corpus of articles covering a wide range of topics

SLIDE 19

NLP corpora

  • SQuAD – the Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset
  • Twenty Newsgroups – the 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups
  • MultiNLI – a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral

SLIDE 20

NLP Infrastructure

  • Programming languages:
  • Python (preferred), Java
  • NLP-dedicated libraries:
  • NLTK (good for toy examples; use Stanford CoreNLP for better accuracy)
  • Gensim
  • Data manipulation libraries:
  • Pandas

SLIDE 21

SLIDE 22
  • 3. NLP Tasks

SLIDE 23

Part of Speech Tagging

  • POS tagging is the task of labeling each word in a sentence with its appropriate part of speech.
  • 36 POS tags in the Penn Treebank:
  • Nouns, verbs, prepositions, adjectives, etc.
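As a minimal sketch of the task definition, a lexicon-lookup tagger is shown below. The mini-lexicon is invented for illustration; real taggers (e.g. NLTK's default tagger or Stanford CoreNLP) use statistical models over context rather than a single lookup table.

```python
# Toy POS tagger: a lexicon lookup with a default tag.
# The mini-lexicon below is hypothetical, made up for this sketch.
LEXICON = {
    "the": "DT", "a": "DT", "an": "DT",
    "flies": "VBZ", "like": "IN",
    "time": "NN", "arrow": "NN",
}

def pos_tag(tokens):
    """Label each token with its lexicon tag, defaulting to NN (noun)."""
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(pos_tag("Time flies like an arrow".split()))
# [('Time', 'NN'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]
```

Ambiguous words such as “can” (verb vs. noun) are exactly why real taggers condition on surrounding context instead of a fixed lexicon entry.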
SLIDE 24

Named Entity Recognition

  • NER is the process of resolving words/surface forms into a predefined class of named entity categories (e.g., Person, Location, Organization)

SLIDE 25

Word Sense Disambiguation

  • Word sense disambiguation (WSD) determines the correct sense of a word given its context.

“The robot that can recycle a can is useful for the environment.”

SLIDE 26

Phrase Structure Parsing

  • Phrase structure parsing organizes syntax into constituents or brackets
  • In general, this involves nested trees

SLIDE 27

Named Entity Disambiguation

  • Named entity disambiguation (NED) is the task of resolving surface forms, based on their context, to entities from a reference database.

Credit to: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/

SLIDE 28

Text Categorization/Classification

  • Text categorization: for a pre-determined set of categories, determine to which one a piece of text belongs (e.g., spam or not spam for e-mails)
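One common approach to text categorization is a Naive Bayes classifier over word counts. The sketch below uses an invented four-document training set purely for illustration; a real spam filter would be trained on a large labeled corpus.

```python
import math
from collections import Counter

# Hypothetical training data: (category, text) pairs, made up for this sketch.
train = [
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham",  "meeting at noon"),
    ("ham",  "lecture notes attached"),
]

class NaiveBayes:
    def __init__(self, examples):
        self.cat_counts = Counter(cat for cat, _ in examples)
        self.word_counts = {cat: Counter() for cat in self.cat_counts}
        for cat, text in examples:
            self.word_counts[cat].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, text):
        total = sum(self.cat_counts.values())
        best, best_lp = None, float("-inf")
        for cat in self.cat_counts:
            # log P(cat) + sum over words of log P(word | cat)
            lp = math.log(self.cat_counts[cat] / total)
            n = sum(self.word_counts[cat].values())
            for w in text.split():
                # Laplace (add-one) smoothing so unseen words don't zero out.
                lp += math.log((self.word_counts[cat][w] + 1)
                               / (n + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = cat, lp
        return best

nb = NaiveBayes(train)
print(nb.predict("cheap money"))      # spam
print(nb.predict("lecture at noon"))  # ham
```

The Laplace smoothing step matters: without it, a single word unseen in a category would drive that category's probability to zero.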

SLIDE 29

Textual Entailment

  • Entailment is the task of determining whether a premise supports/rejects/has no information about a given hypothesis.

  1. TEXT: Reagan attended a ceremony in Washington to commemorate the landings in Normandy.
     HYPOTHESIS: Washington is located in Normandy. (TASK: IE; ENTAILMENT: False)
  2. TEXT: Google files for its long awaited IPO.
     HYPOTHESIS: Google goes public. (TASK: IR; ENTAILMENT: True)
  4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
     HYPOTHESIS: The SPD is defeated by the opposition parties. (TASK: IE; ENTAILMENT: True)

Credit to: http://www.cs.biu.ac.il/~dagan/TE-Tutorial-ACL07.ppt

SLIDE 30

Machine Translation

  • Machine translation (MT) is the task of converting one piece of natural language from one language to another, while preserving the meaning and producing fluent text in the target language.

SLIDE 31

Question Answering

  • Question answering (QA) can be open- or closed-domain; for a given question, the task is to find a textual snippet which answers the question.

Credit to: https://arxiv.org/pdf/1806.03822.pdf

SLIDE 32

Other NLP tasks

  • Co-reference resolution: resolve pronouns to the proper nouns they refer to
  • Relation extraction: extract binary (n-ary) relations from text
  • Sentiment analysis: determine whether a piece of text has positive or negative sentiment
  • Keyword extraction: determine which are the salient words in a piece of text
  • Language models: models that are able to generate text for a given set of seed words
  • Topic modelling: extract the topics in a document
  • Word collocations/co-occurrences

SLIDE 33
  • 4. Statistical NLP

SLIDE 34

What is Statistical NLP?

  • P(to | Sarah drove)
  • P(time is a verb | S = “Time flies like an arrow”)
  • It involves deriving numerical data from text
  • Use probabilities to describe events, text, phrase occurrences, tagging, etc.
  • No hard constraints as in categorical grammars
  • Use of approximation techniques for hard problems
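Estimates like P(to | Sarah drove) are typically derived from corpus counts. A minimal maximum-likelihood sketch over an invented toy corpus:

```python
from collections import Counter

# Estimate P(w | history) by relative frequency, e.g.
# P(to | Sarah drove) ≈ count("Sarah drove to") / count("Sarah drove").
# The tiny corpus below is made up for illustration.
corpus = ("Sarah drove to work . Sarah drove home . "
          "Sarah drove to town . Sarah walked home .").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(w, h1, h2):
    """Maximum-likelihood estimate of P(w | h1 h2)."""
    return trigrams[(h1, h2, w)] / bigrams[(h1, h2)]

print(p_next("to", "Sarah", "drove"))  # 2/3: "Sarah drove" occurs 3 times, twice followed by "to"
```

Real language models smooth these estimates (as with the Laplace smoothing idea) so that unseen histories do not yield zero or undefined probabilities.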

SLIDE 35

What is Statistical NLP?

  • Human cognition has a probabilistic nature
  • In language (written or spoken) we are faced with incomplete, uncertain information, and thus interpretation has to be based on probabilities
  • Humans resolve the high level of ambiguity in real time by incorporating diverse sources of evidence, including frequency information
  • The goal of computational linguistics is to mimic similar behavior and interpret language in terms of probabilities

SLIDE 36

Zipf’s Law

  • The frequency of any word is inversely proportional to its rank
  • There is a constant k that relates rank and frequency: frequency × rank ≈ k
  • Zipf's law is shown to hold even for randomly generated text
  • This allows us to model the probability distribution of a language and a language generation model (Mandelbrot's law)
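A quick way to check Zipf's law empirically is to sort words by frequency and inspect frequency × rank. The text below is synthetic, built with Zipf-like counts purely for illustration; any sufficiently long natural text works the same way.

```python
from collections import Counter

# Synthetic corpus with roughly Zipfian counts (100, 50, 33, 25 ≈ 100/rank).
text = ("the " * 100 + "of " * 50 + "and " * 33 + "to " * 25).split()

freqs = Counter(text)
ranked = freqs.most_common()  # [(word, count)] sorted by descending frequency

for rank, (word, count) in enumerate(ranked, start=1):
    # Under Zipf's law, count * rank ≈ k for some constant k.
    print(rank, word, count, count * rank)
```

For this corpus every frequency × rank product comes out near 100, the constant k.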

SLIDE 37

Common words in Shakespeare's Hamlet

Unigrams: the 1148, and 970, to 771, of 671, i 635, you 554, a 550, my 514, hamlet 494, in 451

Bigrams: my lord 180, king claudius 121, in the 93, lord polonius 87, queen gertrude 82, lord hamlet 78, to the 72, of the 61, it is 58, i ll 56

Trigrams: my lord hamlet 62, my lord i 21, rosencrantz and guildenstern 18, good my lord 15, i pray you 13, exeunt hamlet act 12, in the castle 12, the castle enter 12, enter king claudius 11, that i have 11

The Tragedy of Hamlet, Prince of Denmark, often shortened to Hamlet (/ˈhæmlɪt/), is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Set in Denmark, the play dramatises the revenge Prince Hamlet is called to wreak upon his uncle, Claudius, by the ghost of Hamlet's father, King Hamlet. Claudius had murdered his own brother and seized the throne, also marrying his deceased brother's widow.

https://en.wikipedia.org/wiki/Hamlet
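N-gram tables like these are produced by sliding a window over the token stream and counting. A short sketch, using the Wikipedia blurb as a stand-in corpus (running it over the full play text would reproduce the counts above):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Stand-in corpus: a lower-cased, punctuation-stripped fragment of the blurb.
text = ("set in denmark the play dramatises the revenge prince hamlet "
        "is called to wreak upon his uncle claudius").split()

print(ngrams(text, 1).most_common(1))    # [(('the',), 2)]
print(ngrams(text, 2)[("the", "play")])  # 1
```

The same `ngrams` helper covers unigrams, bigrams, and trigrams by varying `n`.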

SLIDE 38

Common words in Shakespeare's Hamlet

SLIDE 39

Frames in Shakespeare writings

  • Frames for “Claudius”

Syntactic frame: NNP 107; VB NNP 11; NNP CC 3; NNP NNP 2; NNP 24; PRP VBP 7; VB, PRP 7; PRP VBZ 5

Semantic frame: king 105; enter king 10; varro and 3; exeunt king 2; queen 7; o 5; laertes 3

Syntactic/semantic frame: KING/NNP 105; Enter/VB KING/NNP 10; VARRO/NNP and/CC 2; Exeunt/NNP KING/NNP 2; QUEEN/NNP 7; O/NNP 4; Laertes/NNP 3

SLIDE 40

Basic Issues in text

  • Upper vs. lower case:
  • When is it useful to treat black, Black, and BLACK the same or differently?
  • Tokenization: what is a token?
  • Whitespace-separated words
  • How about compound words?
  • Sentence boundaries:
  • [.\s+]
  • What about acronyms and other forms of abbreviations?
  • Single apostrophes:
  • How can we treat “I’ll”, “Isn’t”, “dog’s”?
  • Hyphenation:
  • Traditionally for text line breaks
  • “E-mail”, or “co-operate” vs. “cooperate”?
  • Homographs:
  • saw, lead, etc.
  • …
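Two of these pitfalls can be made concrete in a few lines. The `[.\s+]`-style splitter and the abbreviation list below are deliberately naive sketches, not a production sentence segmenter:

```python
import re

text = "Dr. Smith isn't here. I'll wait."

# Naive sentence splitting on period-plus-whitespace wrongly
# splits after the abbreviation "Dr."
naive = re.split(r"\.\s+", text)
print(naive)  # three pieces, the first being just 'Dr'

# Abbreviation-aware splitting: skip boundaries whose preceding
# word is a known abbreviation (tiny illustrative list).
ABBREV = {"Dr", "Prof", "etc"}
parts, start = [], 0
for m in re.finditer(r"\.\s+", text):
    last_word = text[:m.start()].rsplit(None, 1)[-1]
    if last_word not in ABBREV:
        parts.append(text[start:m.start() + 1])  # keep the period
        start = m.end()
parts.append(text[start:])
print(parts)  # two sentences, with "Dr. Smith..." kept intact
```

Contractions (“I’ll”, “isn’t”) pose the analogous problem for tokenization: a whitespace tokenizer keeps them as single tokens, while treebank-style tokenizers split them into “I” + “’ll”, “is” + “n’t”.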

SLIDE 41

Next Lecture

  • Mathematical foundations for statistical NLP
  • Review of probability theory
  • Expectation and variance
  • Joint and conditional distributions
  • Standard distributions
  • Information Theory

SLIDE 42

Materials

  • The Jupyter Notebook for the analysis is available from the course web page.
  • Datasets are available for download from the course web page.