Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT - - PowerPoint PPT Presentation
CS344: Introduction to Artificial Intelligence
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 18-19-20: Natural Language Processing (ambiguities and parsing)
Importance of NLP
Text-based computation needs NLP
Machine translation
High-quality information retrieval
Linguistics + Computation
Perspectivising NLP: Areas of AI and their inter-dependencies
Search, Vision, Planning, Machine Learning, Knowledge Representation, Logic, Expert Systems, Robotics, NLP
AI is the forcing function for Computer Science, and NLP of AI
Languages and the speaker population
Language: Population (2001 census; rounded to most significant digit)
Hindi: 450,000,000
Marathi: 72,000,000
Konkani: 7,000,000
Sanskrit: 6,000
Nepali: 13,000,000
Languages and the speaker population (contd.)
Language: Population (2001 census; rounded to most significant digit)
Kashmiri: 5,000,000
Assamese: 13,000,000
Tamil: 60,000,000
Malayalam: 33,000,000
Bodo: 1,000,000
Manipuri: 1,000,000
Great Linguistic Diversity
Major streams
Indo-European, Dravidian, Sino-Tibetan, Austro-Asiatic
Some languages are ranked within the top 20 in the world in terms of the populations speaking them
Interesting “mixed-race” languages
Marathi and Oriya: confluence of the Indo-Aryan and Dravidian families
Urdu: structure from Indo-Aryan (Hindi), vocabulary from Persian and Semitic (Arabic)
आज मेरी परीक्षा है (aaj merii pariikshaa hai) {today I have my examination}
आज मेरा इम्तहान है (aaj meraa imtahaan hai) {today I have my examination}
3 Language Formula
Every state has to implement three languages:
Hindi
The state language (Marathi, Gujarati, Bengali, etc.)
English
Large translation requirement, e.g., at financial year ends
Multilingual Information Access needed for large GoI sector
Legislature, Judiciary, Education, Employment, Agriculture, Healthcare, Cultural
Provide one-stop access and insight into information related to key Government bodies and execution areas. Enable citizens to exercise their fundamental rights and duties.
Science, Housing, Taxes, Travel & Tourism, Banking & Insurance, International, Sports
Need for NLP
Machine Translation
Information Retrieval and Extraction with NLP: better precision and recall
Summarization
Question Answering
Cross-Lingual Search (very relevant for India)
Intelligent interfaces (to robots, databases)
Combined image- and text-based search
Automatic humour analysis and generation
Last but not least, a window into the human mind: language and brain
Roles of Broca’s and Wernicke’s areas
Broadly, Broca’s area is concerned with grammar while Wernicke’s area is concerned with semantics
Damage to the former interferes with grammar, e.g., role confusion under voice change: “Ram was seen by Shyam” is interpreted with Ram as the seer
Damage to Wernicke’s area: the patient finds it difficult to put a name to an entity (which is a tough categorization task)
Evidence of difference between humans and apes in the complexity of language processing: Frontal lobe heavily used in humans ("The brain differentiates human and non-human grammars: Functional localization and structural connectivity" (Volume 103, Number 7, Pages 2458-2463, February 14, 2006)).
MT is needed: Internet Accessibility Pattern
User type (script) | % of world population | % access to the Internet
Latin | 39 | 84
Kanji (CJK) | 22 | 13
Arabic | 9 | 1.2
Brahmi and Indic | 22 | 0.3
Number of Potential users of Internet
[Bar chart: number of Internet users (population in millions, axis from 50 to 450) for English, Japanese, Chinese, French, Spanish, German, Hindi and Indian languages. Series 1: number of Internet users in 2001. Series 2: number of Internet users in 2010 (projected).]
Living Languages
Continent: No. of languages
Africa: 2092
Americas: 1002
Asia: 2269
Europe: 239
Pacific: 1310
Total: 6912
Stages and Challenges of NLP
NLP is concerned with Grounding
Ground the language into perceptual, motor and cognitive capacities.
Grounding
Chair Computer
Grounding faces 3 challenges
Ambiguity.
Co-reference resolution (anaphora is a kind of it).
Ellipsis.
Ambiguity
Chair
Co-reference Resolution
Sequence of commands to the robot: Place the wrench on the table. Then paint it. What does it refer to?
Ellipsis
Sequence of commands to the robot: Move the table to the corner. Also the chair. The second command needs completing by using the first part of the previous command.
Stages of processing (traditional view)
Phonetics and phonology
Morphology
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Pragmatics
Discourse
Phonetics
Processing of speech
Challenges
Homophones: bank (finance) vs. bank (river bank)
Near homophones: maatraa vs. maatra (Hindi)
Word boundary
aajaayenge: aa jaayenge (will come) or aaj aayenge (will come today)
I got [ua]plate
Phrase boundary
Milind Sohoni’s mail announcing this seminar: mtech1 students are especially exhorted to attend as such seminars are integral to one's post-graduate education
Disfluency: ah, um, ahem etc.
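The word-boundary challenge above can be made concrete with a small sketch that enumerates every way to split an unsegmented string into known words. The tiny lexicon of romanised Hindi forms and the function name `segmentations` are assumptions made for illustration only.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (an assumption, not a real dictionary).
LEXICON = {"aa", "aaj", "jaayenge", "aayenge"}

splits = segmentations("aajaayenge", LEXICON)
for words in splits:
    print(" ".join(words))
```

Both readings from the slide survive, so choosing between them needs information beyond the lexicon (prosody or context).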
Morphology
Word formation rules from root words
Nouns: Plural (boy-boys); Gender marking (czar-czarina)
Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa khaaiie)
Crucial first step in NLP
Languages rich in morphology: e.g., Dravidian, Hungarian, Turkish
Languages poor in morphology: Chinese, English
Languages with rich morphology have the advantage of easier processing at higher stages of processing
A task of interest to computer science: Finite State Machines for Word Morphology
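As a rough illustration of finite state machines for word morphology, the sketch below runs a three-state acceptor over morphemes (START, STEM, PLURAL) to analyse regular English plurals such as boy-boys. The stem list, state names and function names are hypothetical; real morphological analysers use much larger lexicons and transducers.

```python
NOUN_STEMS = {"boy", "dog", "chair"}  # hypothetical mini-lexicon

# Transition table: (state, morpheme class) -> next state.
TRANSITIONS = {
    ("START", "noun_stem"): "STEM",
    ("STEM", "plural_s"): "PLURAL",
}
# Accepting states and the analysis each one yields.
ACCEPTING = {"STEM": "singular", "PLURAL": "plural"}


def classify(morpheme):
    """Map a morpheme to its class, or None if it is unknown."""
    if morpheme in NOUN_STEMS:
        return "noun_stem"
    if morpheme == "s":
        return "plural_s"
    return None


def run(morphemes):
    """Run the machine; return the number feature, or None on rejection."""
    state = "START"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, classify(morpheme)))
        if state is None:
            return None
    return ACCEPTING.get(state)


def analyse(word):
    """Try every stem/suffix split and return (stem, number) if accepted."""
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        number = run([stem, suffix] if suffix else [stem])
        if number:
            return stem, number
    return None


print(analyse("boys"), analyse("boy"), analyse("boyz"))
```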
Lexical Analysis
Essentially refers to dictionary access and obtaining the properties of the word, e.g., dog:
noun (lexical property)
take-’s’-in-plural (morph property)
animate (semantic property)
4-legged (-do-)
carnivore (-do-)
Challenge: lexical or word sense disambiguation
Lexical Disambiguation
First step: part of Speech Disambiguation
Dog as a noun (animal)
Dog as a verb (to pursue)
Sense Disambiguation
Dog (as animal)
Dog (as a very detestable person)
Needs word relationships in a context
The chair emphasised the need for adult education
Very common in day to day communications and can occur in the form of single or multiword expressions e.g., Ground breaking ceremony (Prof. Ranade’s email to faculty 14/9/07)
Technological developments bring in new terms, additional meanings/nuances for existing terms
Justify, as in justify the right margin (word processing context)
Xeroxed: a new verb
Digital Trace: a new expression
Communifaking: pretending to talk on a mobile when you are actually not
Discomgooglation: anxiety/discomfort at not being able to access the internet
Helicopter Parenting: over-parenting
Syntax
Structure Detection
[S [NP I] [VP [V like] [NP mangoes]]]
Parsing Strategy
Driven by grammar
S -> NP VP
NP -> N | PRON
VP -> V NP | V PP
N -> mangoes
PRON -> I
V -> like
Challenges: Structural Ambiguity
Scope
The old men and women were taken to safe locations
(old (men and women)) vs. ((old men) and women)
Seen in Amman airport: No smoking areas will allow Hookas inside
Preposition Phrase Attachment
I saw the boy with a telescope
(who has the telescope?)
I saw the mountain with a telescope
(world knowledge: mountain cannot be an instrument of seeing)
I saw the boy with the pony-tail
(world knowledge: pony-tail cannot be an instrument of seeing)
Ubiquitous: today’s newspaper headline “20 years later, BMC pays father 20 lakhs for causing son’s death”
Structural Ambiguity…
Overheard
I did not know my PDA had a phone for 3 months
An actual sentence in the newspaper
The camera man shot the man with the gun when he was near Tendulkar
Headache for parsing: Garden Path sentences
Consider
The horse raced past the garden (sentence complete)
The old man (phrase complete)
Twin Bomb Strike in Baghdad (newspaper heading: complete)
Headache for Parsing
Garden Pathing
The horse raced past the garden fell
The old man the boat
Twin Bomb Strike in Baghdad kill 25 (Times of India 5/9/07)
Semantic Analysis
Representation in terms of Predicate calculus/Semantic Nets/Frames/Conceptual Dependencies and Scripts
John gave a book to Mary
Give action: Agent: John, Object: Book, Recipient: Mary
Challenge: ambiguity in semantic role labeling
(Eng) Visiting aunts can be a nuisance
(Hin) aapko mujhe mithaai khilaanii padegii {either “you will have to feed me sweets” or “I will have to feed you sweets”} (ambiguous in Marathi and Bengali too; not in Dravidian languages)
Pragmatics
Very hard problem
Model user intention
Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train.
Boy (running upstairs and coming back panting): yes sir, they are there.
World knowledge
WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07, yesterday)
Discourse
Processing of a sequence of sentences
Mother to John: John, go to school. It is open today. Should you bunk? Father will be very angry.
Ambiguity of open
Bunk what?
Why will the father be angry?
Complex chain of reasoning and application of world knowledge (father will not be angry if somebody else’s son bunks school)
Ambiguity of father: father as parent or father as headmaster
Complexity of Connected Text
John was returning from school dejected – today was the math test
He couldn’t control the class
Teacher shouldn’t have made him responsible
After all he is just a janitor
ML-NLP
NLP as an ML task
France beat Brazil by 1 goal to 0 in the quarter-final of the world cup football tournament. (English)
phraans ne braazil ko vishwa kap phutbal spardhaa ke kwaartaar phaainal me 1-0 gol ke baraabarii se haraayaa. (Hindi)
Categories of the Words in the Sentence
France beat Brazil by 1 goal to 0 in the quarter final of the world cup football tournament
Function words: by, to, in, the, of
Content words: France, beat, Brazil, 1, goal, quarter, final, world, cup, football, tournament
Further Classification 1/2
Content words: Brazil, beat, France, 1, goal, quarter, final, world, cup, football, tournament
Verb: beat
Noun: Brazil, France, 1, goal, quarter, final, world, cup, football, tournament
Proper noun: Brazil, France
Common noun: 1, goal, quarter, final, world, cup, football, tournament
Further Classification 2/2
Function words: by, to, in, the, of
Determiner: the
Preposition: by, to, in, of
Why all this?
Fundamental and ubiquitous information need:
who did what to whom, by what, when, where, in what manner
Semantic roles
[Figure: semantic roles around the verb beat. Entities: France, Brazil, world cup football, quarter finals, 1 goal to 0. Role labels: agent, patient/theme, manner, time, modifier.]
Semantic Role Labeling: a classification task
France beat Brazil by 1 goal to 0 in the quarter-final of the world cup football tournament
Brazil: agent or object?
Agent: Brazil or France or Quarter Final or World Cup?
Given an entity, what role does it play?
Given a role, which entity plays it?
A lower level of classification: Part of Speech (POS) Tag Labeling
France beat Brazil by 1 goal to 0 in the quarter-final of the world cup football tournament
beat: verb or noun (heart beat, e.g.)?
Final: noun or adjective?
Uncertainty in classification: Ambiguity
Visiting aunts can be a nuisance
Visiting: adjective or gerund (POS tag ambiguity)
Role of aunt:
agent of visit (aunts are visitors)
object of visit (aunts are being visited)
Minimize uncertainty of classification with cues from the sentence
What cues?
Position with respect to the verb:
France to the left of beat and Brazil to the right: agent-object role marking (English)
Case marking:
France ne (Hindi); ne (Marathi): agent role
Brazil ko (Hindi); laa (Marathi): object role
Morphology: haraayaa (Hindi); haravlaa (Marathi): verb POS tag as indicated by the distinctive suffixes
Cues are like attribute-value pairs prompting machine learning from NL data
Constituent ML tasks
Goal: classification or clustering
Features/attributes (word position, morphology, word label etc.)
Values of features
Training data (corpus: annotated or un-annotated)
Test data (test corpus)
Accuracy of decision (precision, recall, F-value, MAP etc.)
Test of significance (sample space to generality)
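The accuracy measures listed above (precision, recall, F-value) can be computed as in the following minimal sketch. The gold and predicted role labels are made-up toy data for the France-Brazil sentence, and the function name is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of (item, label) decisions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy annotations (assumed for illustration).
gold = {("France", "agent"), ("Brazil", "patient"), ("1 goal to 0", "manner")}
predicted = {("France", "agent"), ("Brazil", "agent")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Here one of the two predictions is correct (precision 1/2) and one of the three gold labels is recovered (recall 1/3).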
What is the output of an ML-NLP System (1/2)
Option 1: A set of rules, e.g.,
If the word to the left of the verb is a noun and has the animacy feature, then it is the likely agent of the action denoted by the verb.
The child broke the toy (child is the agent)
The window broke (window is not the agent; inanimate)
What is the output of an ML-NLP System (2/2)
Option 2: a set of probability values
P(agent | word is to the left of verb and has animacy) > P(object | word is to the left of verb and has animacy) > P(instrument | word is to the left of verb and has animacy) etc.
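One simple way such probability values could be obtained is by relative-frequency estimation from an annotated corpus. The sketch below assumes hypothetical counts of roles observed for words that are left of the verb and animate; the numbers are invented for illustration.

```python
from collections import Counter

# Hypothetical counts of the role played by words that are to the left
# of the verb and animate (invented numbers, not corpus statistics).
role_counts = Counter({"agent": 7, "object": 2, "instrument": 1})


def conditional_probabilities(counts):
    """Relative-frequency estimate of P(role | features)."""
    total = sum(counts.values())
    return {role: n / total for role, n in counts.items()}


probs = conditional_probabilities(role_counts)
best_role = max(probs, key=probs.get)
print(probs, "->", best_role)
```

With these counts the ordering agent > object > instrument from the slide falls out directly; a real system would also smooth the estimates for unseen feature combinations.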
How is this different from classical NLP
The burden is on the data as opposed to the human.
[Figure: Classical NLP vs. Statistical NLP. Labels: corpus, text data, linguist, computer, rules, rules/probabilities.]
Classification appears as sequence labeling
A set of Sequence Labeling Tasks: smaller to larger units
Words: Part of Speech tagging, Named Entity tagging, Sense marking
Phrases: Chunking
Sentences: Parsing
Paragraphs: Co-reference annotating
Example of word labeling: POS Tagging
<s> Come September, and the IIT campus is abuzz with new and returning students. </s>
<s> Come_VB September_NNP ,_, and_CC the_DT IIT_NNP campus_NN is_VBZ abuzz_JJ with_IN new_JJ and_CC returning_VBG students_NNS ._. </s>
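Tagged text in the word_TAG format shown above is easy to read back into (word, tag) pairs. The sketch below only parses the existing annotation; it does not perform tagging, and the variable names are illustrative.

```python
tagged = ("Come_VB September_NNP ,_, and_CC the_DT IIT_NNP campus_NN "
          "is_VBZ abuzz_JJ with_IN new_JJ and_CC returning_VBG "
          "students_NNS ._.")

# Split on the last underscore so that tokens like ",_," work too.
pairs = [tuple(token.rsplit("_", 1)) for token in tagged.split()]

# Noun tags in this tagset start with NN (NN, NNS, NNP).
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(pairs[:3])
print(nouns)
```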
Example of word labeling: Named Entity Tagging
<month_name> September </month_name> <org_name> IIT </org_name>
Example of word labeling: Sense Marking
Word | Synset | WN-synset-no
come | {arrive, get, come} | 01947900
. . .
abuzz | {abuzz, buzzing, droning} | 01859419
Example of phrase labeling: Chunking
Come July, and [the IIT campus] is abuzz with [new and returning students].
Example of Sentence labeling: Parsing
[S1 [S [S [VP [VB Come] [NP [NNP July]]]] [, ,] [CC and] [S [NP [DT the] [JJ UJF] [NN campus]] [VP [AUX is] [ADJP [JJ abuzz] [PP [IN with] [NP [ADJP [JJ new] [CC and] [VBG returning]] [NNS students]]]]]] [. .]]]
Parsing of Sentences
Are sentences flat linear structures? Why tree?
Is there a principle in branching?
When should the constituent give rise to children?
What is the hierarchy building principle?
Structure Dependency: A Case Study
Interrogative Inversion
Declarative | Interrogative
(1) John will solve the problem. | Will John solve the problem?
(2) a. Susan must leave. | Must Susan leave?
b. Harry can swim. | Can Harry swim?
c. Mary has read the book. | Has Mary read the book?
d. Bill is sleeping. | Is Bill sleeping?
The section “Structure Dependency: A Case Study” here is adapted from a talk given by Howard Lasnik (2003) at Delhi University.
Interrogative inversion Structure Independent (1st attempt)
(3) Interrogative inversion process: beginning with a declarative, invert the first and second words to construct an interrogative.
Declarative | Interrogative
(4) a. The woman must leave. | *Woman the must leave?
b. A sailor can swim. | *Sailor a can swim?
c. No boy has read the book. | *Boy no has read the book?
d. My friend is sleeping. | *Friend my is sleeping?
Interrogative inversion correct pairings
Compare the incorrect pairings in (4) with the correct pairings in (5):
Declarative | Interrogative
(5) a. The woman must leave. | Must the woman leave?
b. A sailor can swim. | Can a sailor swim?
c. No boy has read the book. | Has no boy read the book?
d. My friend is sleeping. | Is my friend sleeping?
Interrogative inversion Structure Independent (2nd attempt)
(6) Interrogative inversion process: beginning with a declarative, move the auxiliary verb to the front to construct an interrogative.
Declarative | Interrogative
(7) a. Bill could be sleeping. | *Be Bill could sleeping? | Could Bill be sleeping?
b. Mary has been reading. | *Been Mary has reading? | Has Mary been reading?
c. Susan should have left. | *Have Susan should left? | Should Susan have left?
Structure independent (3rd attempt):
(8) Interrogative inversion process: beginning with a declarative, move the first auxiliary verb to the front to construct an interrogative.
Declarative | Interrogative
(9) a. The man who is here can swim. | *Is the man who here can swim?
b. The boy who will play has left. | *Will the boy who play has left?
Structure Dependent Correct Pairings
For the above examples, fronting the second auxiliary verb gives the correct form:
Declarative | Interrogative
(10) a. The man who is here can swim. | Can the man who is here swim?
b. The boy who will play has left. | Has the boy who will play left?
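The attempts above can be replayed mechanically. The sketch below implements the structure-independent rules (3) and (8) over flat word lists and contrasts them with a structure-dependent rule that is told where the subject NP ends; that boundary is supplied by hand here, since finding it is exactly what a parser is for. The auxiliary list and function names are assumptions for illustration.

```python
AUXILIARIES = {"will", "must", "can", "has", "is", "could", "should"}


def swap_first_two(words):
    """Rule (3): invert the first and second words."""
    return [words[1], words[0]] + words[2:]


def front_first_auxiliary(words):
    """Rule (8): move the first auxiliary in the string to the front."""
    for i, word in enumerate(words):
        if word in AUXILIARIES:
            return [word] + words[:i] + words[i + 1:]
    return words


def front_main_auxiliary(subject_np, rest):
    """Structure-dependent rule: front the auxiliary that follows the
    whole subject NP (the NP boundary is given, not computed)."""
    return [rest[0]] + subject_np + rest[1:]


def as_question(words):
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "?"


q4a = as_question(swap_first_two("the woman must leave".split()))
q9a = as_question(front_first_auxiliary("the man who is here can swim".split()))
q10a = as_question(front_main_auxiliary("the man who is here".split(),
                                        "can swim".split()))
print(q4a, q9a, q10a, sep="\n")
```

The first two calls reproduce the starred forms (4a) and (9a); only the rule that refers to the subject constituent yields (10a).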
Natural transformations are structure dependent
(11) Does the child acquiring English learn these properties?
(12) We are not dealing with a peculiarity of English. No known human language has a transformational process that would produce pairings like those in (4), (7) and (9), repeated below:
(4) a. The woman must leave. | *Woman the must leave?
(7) a. Bill could be sleeping. | *Be Bill could sleeping?
(9) a. The man who is here can swim. | *Is the man who here can swim?
Deeper trees needed for capturing sentence structure
[The big book of poems with the blue cover] is on the table.
[NP [D The] [AP big] [N book] [PP of poems] [PP with the blue cover]]
This won't do! Flat structure!
Other languages
English: [NP [D The] [AP big] [N book] [PP of poems] [PP with the blue cover]]
Hindi: [niil jilda vaalii kavita kii kitaab]; tree nodes: NP, PP (niil jilda vaalii), PP (kavita kii), AP (badii), kitaab
Other languages: contd
English: [NP [D The] [AP big] [N book] [PP of poems] [PP with the blue cover]]
Bengali: [niil malaat deovaa kavitar bai ti]; tree nodes: NP, PP (niil malaat deovaa), PP (kavitar), AP (motaa), bai, ti
PPs are at the same level: flat with respect to the head word “book”
[The big book of poems with the blue cover] is on the table.
[NP [D The] [AP big] [N book] [PP of poems] [PP with the blue cover]]
No distinction in terms of dominance or c-command
“Constituency test of Replacement” runs into problems
One-replacement:
I bought the big [book of poems with the blue cover] not the small [one]
One-replacement targets book of poems with the blue cover
Another one-replacement:
I bought the big [book of poems] with the blue cover not the small [one] with the red cover
One-replacement targets book of poems
More deeply embedded structure
[NP [D The] [N’1 [AP big] [N’2 [N’3 [N book] [PP of poems]] [PP with the blue cover]]]]
To target N’1
I want [NP this [N’ big book of poems with the red cover]] and not [NP that [N’ one]]
Bar-level projections
Add intermediate structures
NP -> (D) N’
N’ -> (AP) N’ | N’ (PP) | N (PP)
() indicates optionality
New rules produce this tree
[NP [D The] [N’1 [AP big] [N’2 [N’3 [N book] [PP of poems]] [PP with the blue cover]]]]
N-bar
As opposed to this tree
[NP [D The] [AP big] [N book] [PP of poems] [PP with the blue cover]]
V-bar
What is the element in verbs corresponding to one-replacement for nouns?
do-so or did-so
I [eat beans with a fork]
[VP [V eat] [NP beans] [PP with a fork]]
No constituent that groups together V and NP and excludes PP
Need for intermediate constituents
I [eat beans] with a fork but Ram [does so] with a spoon
[VP [V1’ [V2’ [V eat] [NP beans]] [PP with a fork]]]
VP -> V’
V’ -> V’ (PP)
V’ -> V (NP)
How to target V1’
I [eat beans with a fork], and Ram [does so] too.
[VP [V1’ [V2’ [V eat] [NP beans]] [PP with a fork]]]
VP -> V’
V’ -> V’ (PP)
V’ -> V (NP)
Parsing Algorithms
A simplified grammar
S -> NP VP
NP -> DT N | N
VP -> V ADV | V
A segment of English Grammar
S’ -> (C) S
S -> {NP/S’} VP
VP -> (AP+) (VAUX) V (AP+) ({NP/S’}) (AP+) (PP+) (AP+)
NP -> (D) (AP+) N (PP+)
PP -> P NP
AP -> (AP) A
Example Sentence
People laugh
1 People 2 laugh 3 (the numbers are positions)
Lexicon:
People - N, V (this indicates that both noun and verb are possible for the word “People”)
Laugh - N, V
Top-Down Parsing
Step | State | Backup State | Action
1. | ((S) 1) | - | -
2. | ((NP VP) 1) | - | -
3a. | ((DT N VP) 1) | ((N VP) 1) | -
3b. | ((N VP) 1) | - | -
4. | ((VP) 2) | - | Consume “People”
5a. | ((V ADV) 2) | ((V) 2) | -
6. | ((ADV) 3) | ((V) 2) | Consume “laugh”
5b. | ((V) 2) | - | -
6. | ((.) 3) | - | Consume “laugh”
Termination condition: all inputs over, no symbols remaining.
Note: input symbols can be pushed back.
The number in each state is the position of the input pointer.
Discussion for Top-Down Parsing
This kind of searching is goal driven.
Gives importance to textual precedence (rule precedence).
No regard for data, a priori (useless expansions made).
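The top-down trace for “People laugh” can be reproduced with a small backtracking recogniser over the simplified grammar. This is a sketch: it uses Python recursion in place of an explicit backup-state stack, and the function and variable names are assumptions.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "N"], ["N"]],
    "VP": [["V", "ADV"], ["V"]],
}
LEXICON = {"people": {"N", "V"}, "laugh": {"N", "V"}}


def parse(symbols, words, pos, trace):
    """Try to derive words[pos:] from `symbols`; record each state visited."""
    trace.append((tuple(symbols), pos + 1))  # 1-based input pointer
    if not symbols:
        return pos == len(words)  # success only if all input is consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Expand by each rule in textual order; backtrack on failure.
        for rhs in GRAMMAR[first]:
            if parse(rhs + rest, words, pos, trace):
                return True
        return False
    # `first` is a part-of-speech symbol: try to consume the next word.
    if pos < len(words) and first in LEXICON.get(words[pos], set()):
        return parse(rest, words, pos + 1, trace)
    return False


trace = []
accepted = parse(["S"], ["people", "laugh"], 0, trace)
for state, pointer in trace:
    print(state, pointer)
print("accepted:", accepted)
```

The recorded states follow the same order as steps 1 to 6 of the trace, including the useless DT N expansion that is tried before the data is consulted.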
Bottom-Up Parsing
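Some conventions:
N12: the subscripts represent positions (an N from position 1 to position 2)
S1? -> NP12 ° VP2?: the ? marks an end position that is still unknown; the work to the left of ° is done, while the work to the right of ° remains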
Bottom-Up Parsing (pictorial representation)
People Laugh
1 2 3
N12, V12, N23, V23
NP12 -> N12 °
NP23 -> N23 °
VP12 -> V12 °
VP23 -> V23 °
S1? -> NP12 ° VP2?
S -> NP12 VP23 °
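A minimal sketch of the bottom-up idea: start from the lexical categories of each word and keep applying grammar rules to adjacent completed constituents until nothing new can be built. Edges are stored as (symbol, start, end) with the same 1-based positions as above; the data layout and names are assumptions, and the incomplete (dotted) edges of the picture are not modelled.

```python
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "N")),
    ("NP", ("N",)),
    ("VP", ("V", "ADV")),
    ("VP", ("V",)),
]
LEXICON = {"people": {"N", "V"}, "laugh": {"N", "V"}}


def spans(rhs, edges):
    """List the (start, end) spans covered by adjacent edges matching rhs."""
    if len(rhs) == 1:
        return [(i, j) for sym, i, j in edges if sym == rhs[0]]
    left, right = rhs
    return [(i, j)
            for s1, i, k in edges if s1 == left
            for s2, k2, j in edges if s2 == right and k2 == k]


def bottom_up(words):
    # Seed with every lexical option, e.g. N12 and V12 for "people".
    edges = {(tag, i, i + 1)
             for i, word in enumerate(words, start=1)
             for tag in LEXICON[word]}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, j in spans(rhs, edges):  # a list, so edges can grow
                if (lhs, i, j) not in edges:
                    edges.add((lhs, i, j))
                    changed = True
    return edges


edges = bottom_up(["people", "laugh"])
print(sorted(edges))
print("sentence found:", ("S", 1, 3) in edges)
```

Note that NP23 and VP12 are built even though they take no part in the final S, which is the uncontrolled use of lexical options typical of pure bottom-up parsing.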
Problem with Top-Down Parsing
Left Recursion
Suppose you have a rule A -> AB. Then we will have the expansion as follows:
((A)K) -> ((AB)K) -> ((ABB)K) ……..
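The runaway expansion can be seen by applying the rule to the leftmost symbol a fixed number of times. The step cap below exists only so that the illustration terminates; a naive top-down parser has no such cap. The function name is hypothetical.

```python
def expand_leftmost(symbols, lhs, rhs, max_steps):
    """Repeatedly rewrite the leftmost symbol with the rule lhs -> rhs."""
    history = [list(symbols)]
    for _ in range(max_steps):
        if not symbols or symbols[0] != lhs:
            break  # leftmost symbol no longer matches: expansion stops
        symbols = rhs + symbols[1:]
        history.append(list(symbols))
    return history


# Rule A -> A B: the leftmost symbol is always A again, so no input
# word is ever consumed and the symbol list grows without bound.
history = expand_leftmost(["A"], "A", ["A", "B"], max_steps=3)
for state in history:
    print(state)
```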
Combining top-down and bottom-up strategies
Top-Down Bottom-Up Chart Parsing
Combines advantages of top-down & bottom-up parsing.
Does not work in case of left recursion.
e.g. “People laugh”
People: noun, verb
Laugh: noun, verb
Grammar:
S -> NP VP
NP -> DT N | N
VP -> V ADV | V
Transitive Closure
People laugh
1 2 3
[Chart figure: arcs over positions 1 to 3 labelled with dotted rules built from S -> NP VP, NP -> N, NP -> DT N, VP -> V and VP -> V ADV; one arc is marked success.]
Arcs in Parsing
Each arc represents a chart which records
Completed work (left of °)
Expected work (right of °)
Example
People laugh loudly
1 2 3 4
[Chart figure: arcs over positions 1 to 4 labelled with dotted rules built from S -> NP VP, NP -> N, NP -> DT N, VP -> V and VP -> V ADV.]
Advantage of Combination of Bottom Up & Top Down parsing over either top down or bottom up
In top-down bottom-up parsing:
1. Like top-down parsing, productions are brought in, but unlike top-down parsing, rules are not necessarily expanded.
2. Unlike bottom-up parsing, uncontrolled lexical options (parts of speech) are not even considered.
Dealing With Structural Ambiguity
Multiple parses for a sentence
The man saw the boy with a telescope.
The man saw the mountain with a telescope.
The man saw the boy with the ponytail.
At the level of syntax, all these sentences are ambiguous. But semantics can disambiguate the 2nd & 3rd sentences.
Prepositional Phrase (PP) Attachment Problem
V – NP1 – P – NP2 (here P means preposition)
NP2 attaches to NP1?
or NP2 attaches to V?
Parse Trees for a Structurally Ambiguous Sentence
Let the grammar be:
S -> NP VP
NP -> DT N | DT N PP
PP -> P NP
VP -> V NP PP | V NP
For the sentence “I saw a boy with a telescope”
Parse Tree - 1
[Parse tree figure for “I saw a boy with a telescope”; node labels as extracted: S, NP, VP, N, V, NP, Det, N, PP, P, NP, Det, N]
Parse Tree -2
[Parse tree figure for “I saw a boy with a telescope”; node labels as extracted: S, NP, VP, N, V, NP, Det, N, PP, P, NP, Det, N]
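The two parse trees can be enumerated mechanically from the grammar given for this sentence. The sketch below tries every split point for every rule and collects all trees as nested tuples. One assumption is added and should be read as such: the rule NP -> PRON, so that “I” can be a subject, since the grammar as stated has no rule for it. The lexicon and function names are also illustrative.

```python
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("DT", "N"), ("DT", "N", "PP"), ("PRON",)],  # NP -> PRON assumed
    "PP": [("P", "NP")],
    "VP": [("V", "NP", "PP"), ("V", "NP")],
}
LEXICON = {"i": "PRON", "saw": "V", "a": "DT", "boy": "N",
           "with": "P", "telescope": "N"}


def parses(symbol, words, i, j):
    """All trees for `symbol` spanning words[i:j], as nested tuples."""
    if symbol not in GRAMMAR:  # part-of-speech symbol: must match one word
        if j == i + 1 and LEXICON.get(words[i]) == symbol:
            return [(symbol, words[i])]
        return []
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, words, i, j):
            trees.append((symbol,) + children)
    return trees


def expand(rhs, words, i, j):
    """All ways for the symbol sequence `rhs` to cover words[i:j]."""
    if not rhs:
        return [()] if i == j else []
    results = []
    for k in range(i + 1, j + 1):  # every end point for the first symbol
        for head in parses(rhs[0], words, i, k):
            for tail in expand(rhs[1:], words, k, j):
                results.append((head,) + tail)
    return results


sentence = "i saw a boy with a telescope".split()
trees = parses("S", sentence, 0, len(sentence))
for tree in trees:
    print(tree)
```

One tree attaches the PP under VP (the seeing is done with the telescope) and the other attaches it inside the object NP (the boy has the telescope); syntax alone cannot choose between them.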