


7/13/2012 1 Wordnet and Word Sense Disambiguation

Lecture delivered at the summer school on NLP, IIIT Hyderabad, 10 July 2012, by Pushpak Bhattacharyya, Computer Science and Engineering Department, IIT Bombay, {pb}@cse.iitb.ac.in

Background: NLP, Two Pictures

[Figure: the NLP Trinity. Three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (Knowledge Based; Statistics and Probability: HMM, MEMM, CRF), Language (Hindi, Marathi, English, French). NLP stands alongside Vision and Speech.]

NLP: Thy Name is Disambiguation

  • I went with my friend to the bank to withdraw some money, but was disappointed to find it closed.
  • POS disambiguation: bank (NN/VM), withdraw (NN/VM), closed (JJ/VM)
  • Sense disambiguation: bank (finance/place), withdraw (take_out/go_away)
  • Co-reference disambiguation: it (bank/money)
  • Pro-drop disambiguation: “…<I/friend> was disappointed…”
  • Scope disambiguation: with (my_friend / my_friend_to_the_bank)
slide-2
SLIDE 2

7/13/2012 2

Where there is a will,
There is a way.

Where there is a will,
There are hundreds of relatives.

Stages of processing

  • Phonetics and phonology
  • Morphology
  • Lexical Analysis
  • Syntactic Analysis
  • Semantic Analysis
  • Pragmatics
  • Discourse
slide-3
SLIDE 3

7/13/2012 3

Lexical Disambiguation

First step: Part-of-Speech disambiguation

  • Dog as a noun (animal)
  • Dog as a verb (to pursue)

Sense Disambiguation

  • Dog (as animal)
  • Dog (as a very detestable person)

Needs word relationships in a context

  • The chair emphasised the need for adult education

Very common in day-to-day communication. Satellite channel ad: "Watch what you want, when you want" (two senses of watch). E.g., ground-breaking ceremony / ground-breaking research.

Lexical Analysis

  • Essentially refers to dictionary access and obtaining the properties of the word

e.g. dog: noun (lexical property); takes '-s' in plural (morph property); animate (semantic property); 4-legged (semantic property); carnivore (semantic property). Challenge: lexical or word sense disambiguation
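As a minimal sketch, the dictionary-access step can be modelled as a lookup table from words to their lexical, morphological, and semantic properties (the table contents and property names below are illustrative, not drawn from any real lexicon):

```python
# A toy lexicon: each entry bundles lexical, morphological, and
# semantic properties, as in the "dog" example above.
LEXICON = {
    "dog": {
        "pos": "noun",                      # lexical property
        "plural": "dogs",                   # morph property: takes -s
        "semantic": ["animate", "4-legged", "carnivore"],
    },
}

def lookup(word):
    """Lexical analysis: fetch the properties of a word, if known."""
    return LEXICON.get(word.lower())

props = lookup("dog")
print(props["pos"])        # noun
print(props["semantic"])   # ['animate', '4-legged', 'carnivore']
```

The lookup itself is trivial; the challenge named above is choosing among entries when a word has several, i.e. word sense disambiguation.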

Ambiguity of Multiwords

  • The grandfather kicked the bucket after suffering from cancer.
  • This job is a piece of cake.
  • Put the sweater on.
  • He is the dark horse of the match.

Google translations of the above sentences (each idiom is rendered literally, losing the multiword meaning):

दादा क सर से पीड़त होने क े बाद बाट लात मार. इस काम क े क े क का एक टुकड़ा है. ःवेटर पर रखो. वह मैच क े अंधेरे घोड़ा है.

Ambiguity of Named Entities

  • Bengali: "Chanchal Sarkar is at home" is translated as "Government is restless at home" (*): the name is read as common words (chanchal = restless, sarkar = government).
  • Hindi: दैनिक दबंग दुनिया translated as "everyday bold world"; actually the name of a Hindi newspaper in Indore.
  • High degree of overlap between NEs and MWEs.
  • Treat differently: transliterate, do not translate.

slide-4
SLIDE 4

7/13/2012 4

Challenges in Syntactic Processing: Structural Ambiguity

  • Scope
    1. The old men and women were taken to safe locations: (old (men and women)) vs. ((old men) and women)
    2. No smoking areas will allow hookahs inside
  • Preposition Phrase Attachment
    – I saw the boy with a telescope (who has the telescope?)
    – I saw the mountain with a telescope (world knowledge: a mountain cannot be an instrument of seeing)
    – I saw the boy with the pony-tail (world knowledge: a pony-tail cannot be an instrument of seeing)

Very ubiquitous. Newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

Textual Humour (1/2)

  • 1. Teacher (angrily): Did you miss the class yesterday?
    Student: Not much.
  • 2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.
  • 3. Son: Mother, I broke the neighbour's lamp shade.
    Mother: Then we have to give them a new one.
    Son: No need, aunty said the lamp shade is irreplaceable.
  • 4. Ram: I got a Jaguar car for my unemployed youngest son.
    Shyam: That's a great exchange!
  • 5. Shane Warne should bowl maiden overs, instead of bowling maidens over.

Textual Humour (2/2)

  • It is not hard to meet the expenses nowadays; you find them everywhere.
  • Teacher: What do you think is the capital of Ethiopia?
    Student: What do you think?
    Teacher: I do not think, I know.
    Student: I do not think I know.

Example of WSD

  • Operation, surgery, surgical operation, surgical procedure, surgical

process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1

  • Operation, military operation -- (activity by a military or naval force (as

a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1

  • Operation -- ((computer science) data processing in which the result is

completely specified by a rule (especially the processing that results from a single instruction); "it can perform millions of operations per second") TOPIC->(noun) computer science#1, computing#1

  • mathematical process, mathematical operation, operation -- ((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1

IS WSD NEEDED IN LARGE APPLICATIONS?

slide-5
SLIDE 5

7/13/2012 5

Word ambiguity → topic drift in IR

Query: "Madrid bomb blast case"
Expansion chain: {case, container}, {case, suit, lawsuit}, {suit, apparel}
Drifted topic due to the expanded term; drifted topic due to the inapplicable sense!

[Chart: error percentages due to various factors (transliteration, translation, disambiguation, stemmer, dictionary, ranking) for Hindi-English and Marathi-English retrieval; our observations, CLEF 2007.]
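The drift above can be sketched in a few lines: if "case" is expanded with synonyms from every sense, the chain case → lawsuit → suit → apparel wanders off-topic. The sense inventory below is a toy stand-in, not real WordNet data:

```python
# Toy synonym inventory: each word maps to a list of sense sets.
SYNONYMS = {
    "case": [{"case", "container"}, {"case", "suit", "lawsuit"}],
    "suit": [{"suit", "lawsuit"}, {"suit", "apparel"}],
}

def expand_blindly(term, hops=2):
    """Add synonyms from every sense, hop by hop (no disambiguation)."""
    terms = {term}
    for _ in range(hops):
        new = set()
        for t in terms:
            for synset in SYNONYMS.get(t, []):
                new |= synset
        terms |= new
    return terms

# 'apparel' sneaks in via the inapplicable clothing sense of 'suit'.
print(expand_blindly("case"))
```

With sense disambiguation, expansion would be restricted to the synsets of the one applicable sense at each hop.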

How about WSD and MT?

Zaheer Khan, the India fast bowler, has been ruled out of the remainder of the series against England. He will return to India and will be replaced by left-arm seamer RP Singh. Zaheer picked up a hamstring injury during the first Test at Lord's. He had been withdrawn from the squad for India's recent Test series in the West Indies due to a right ankle injury.

Google's Hindi output: भारत क े तेज गदबाज, जहर खान, इंलड क े खलाफ ौृंखला क े शेष क े बाहर शासन कया गया है. ("ruled" in the administrative sense??) वह भारत लौटने और बाएँ हाथ क े तेज गदबाज आरपी िसंह ारा ूितःथापत कया जाएगा. जहर लॉस म पहले टेःट क े दौरान हैमःशंग चोट उठाया. ("picked up" as lifted??) वह भारत क वेःट इंडज म हाल ह म एक सह ("right" as correct??) टखने क चोट क े कारण टेःट ौृंखला क े िलए टम से वापस ले िलया गया था.

Wordnet

slide-6
SLIDE 6

7/13/2012 6

Psycholinguistic Theory

  • Human lexical memory stores nouns as a hierarchy.
  • Can a canary sing? - Pretty fast response.
  • Can a canary fly? - Slower response.
  • Does a canary have skin? - Slowest response.

Hierarchy: Animal (can move, has skin) → Bird (can fly) → canary (can sing).

Wordnet: a lexical reference system based on psycholinguistic theories of human lexical memory.
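A minimal sketch of the canary hierarchy: properties are stored at the most general node that has them and retrieved by walking upward, so the number of hypernym hops mirrors the graded response times:

```python
# The hierarchy from the slide: child -> parent, plus the properties
# stored at each node.
HIERARCHY = {"canary": "Bird", "Bird": "Animal", "Animal": None}
PROPERTIES = {
    "canary": {"can sing"},
    "Bird":   {"can fly"},
    "Animal": {"can move", "has skin"},
}

def hops_to_property(concept, prop):
    """How many hypernym hops are needed to verify a property
    (a rough analogue of the response-time effect)."""
    hops = 0
    while concept is not None:
        if prop in PROPERTIES[concept]:
            return hops
        concept = HIERARCHY[concept]
        hops += 1
    return None  # property not found anywhere up the chain

print(hops_to_property("canary", "can sing"))  # 0 (fast)
print(hops_to_property("canary", "can fly"))   # 1 (slower)
print(hops_to_property("canary", "has skin"))  # 2 (slowest)
```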

Essential Resource for WSD: Wordnet

The lexical matrix: word meanings (rows M1…Mm) vs. word forms (columns F1…Fn); an entry Ei,j means form Fj expresses meaning Mi.

  M1 (the "rely/depend" sense): E1,1 (depend), E1,2 (bank), E1,3 (rely)
  M2 (the "embankment" sense):  E2,2 (bank), E2,… (embankment)
  M3 (another sense of bank):   E3,2, E3,3
  …
  Mm:                           Em,n

Synonymy runs along a row (one meaning, many forms); polysemy runs down a column (one form, e.g. bank, in many meanings).

Wordnet: History

  • The first wordnet in the world was for English developed

at Princeton over 15 years.

  • The EuroWordNet, a linked structure of European-language wordnets, was built in 1998 over 3 years, with funding from the EC as a mission-mode project.

  • Wordnets for Hindi and Marathi being built at IIT Bombay

are amongst the first IL wordnets.

  • All these are proposed to be linked into the

IndoWordnet which eventually will be linked to the English and the Euro wordnets.

Basic Principle

  • Words in natural languages are polysemous.
  • However, when synonymous words are put

together, a unique meaning often emerges.

  • Use is made of Relational Semantics.
slide-7
SLIDE 7

7/13/2012 7

Lexical and Semantic relations in wordnet

1. Synonymy 2. Hypernymy / Hyponymy 3. Antonymy 4. Meronymy / Holonymy 5. Gradation 6. Entailment 7. Troponymy 1, 3 and 5 are lexical (word to word), rest are semantic (synset to synset).

WordNet Sub-Graph

[Figure: sub-graph around the synset {house, home} (gloss: "a place that serves as the living quarters of one or more families"): hypernymy up to {dwelling, abode}; hyponymy down to hermitage and cottage; meronymy to bedroom, kitchen, guestroom, veranda, backyard.]

Fundamental Design Question

  • Syntagmatic vs. Paradigmatic relations?
  • Psycholinguistics is the basis of the design.
  • When we hear a word, many words come to our mind by association.
  • For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
  • For cat:
    – animal, mammal: paradigmatic
    – mew, purr, furry: syntagmatic

Stated Fundamental Application of Wordnet: Sense Disambiguation

Determination of the correct sense of the word: "The crane ate the fish" vs. "The crane was used to lift the load" (bird vs. machine).

slide-8
SLIDE 8

7/13/2012 8

The problem of Sense tagging

  • Given a corpus, assign the correct sense to the words.
  • This is sense tagging; it needs Word Sense Disambiguation (WSD).
  • Highly important for Question Answering, Machine Translation, and Text Mining tasks.

Classification of Words

Word
  – Content Words: Verb, Noun, Adjective, Adverb
  – Function Words: Preposition, Conjunction, Pronoun, Interjection

Example of sense marking: its need

एक_4187 नए शोध_1138 क े अनुसार_3123 जन लोग_1189 का सामाजक_43540 जीवन_125623 यःत_48029 होता है उनक े दमाग_16168 क े एक_4187 हःसे_120425 म अिधक_42403 जगह_113368 होती है। (According to a new research, those people who have a busy social life, have larger space in a part of their brain). नेचर यूरोसाइंस म छपे एक_4187 शोध_1138 क े अनुसार_3123 कई_4118 लोग_1189 क े दमाग_16168 क े ःक ै न से पता_11431 चला क दमाग_16168 का एक_4187 हःसा_120425 एिमगडाला सामाजक_43540 यःतताओं_1438 क े साथ_328602 सामंजःय_166 क े िलए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोग_1189 पर कया गया जसम उनक उॆ_13159 और दमाग_16168 क साइज़ क े आँकड़े_128065 िलए गए। अमरक_413405 टम_14077 ने पाया_227806 क जन लोग_1189 क सोशल नेटवक ग अिधक_42403 है उनक े दमाग_16168 का एिमगडाला वाला हःसा_120425 बाक_130137 लोग_1189 क तुलना_म_38220 अिधक_42403 बड़ा_426602 है। दमाग_16168 का एिमगडाला वाला हःसा_120425 भावनाओं_1912 और मानिसक_42151 ःथित_1652 से जुड़ा हुआ माना_212436 जाता है।
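The sense-marking format above attaches a synset ID to each disambiguated word with an underscore. A small parser (the format is inferred from the sample, not from a published spec) can recover the (word, sense) pairs:

```python
import re

# Tokens of the form word_synsetID, as in the sense-marked text above.
TOKEN = re.compile(r"^(?P<word>.+)_(?P<sense>\d+)$")

def parse_sense_tagged(text):
    """Return (word, sense_id) pairs; untagged tokens get sense None."""
    out = []
    for tok in text.split():
        m = TOKEN.match(tok)
        if m:
            out.append((m.group("word"), int(m.group("sense"))))
        else:
            out.append((tok, None))
    return out

sample = "एक_4187 नए शोध_1138"
print(parse_sense_tagged(sample))
# [('एक', 4187), ('नए', None), ('शोध', 1138)]
```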

Ambiguity of लोग (People)

  • लोग, जन, लोक, जनमानस, पलक - एक से अिधक

य "लोग क े हत म काम करना चाहए"

– (English synset) multitude, masses, mass, hoi_polloi, people, the_great_unwashed - the common people generally "separate the warriors from the mass" "power to the people"

  • दुिनया, दुिनयाँ, संसार, व, जगत, जहाँ, जहान, ज़माना,

जमाना, लोक, दुिनयावाले, दुिनयाँवाले, लोग - संसार म रहने वाले लोग "महामा गाँधी का समान पूर दुिनया करती है / म इस दुिनया क परवाह नहं करता / आज क दुिनया पैसे क े पीछे भाग रह है"

– (English synset) populace, public, world - people in general considered as a whole "he is a hero in the eyes of the public"
slide-9
SLIDE 9

7/13/2012 9

Basic Principle

  • Words in natural languages are polysemous.
  • However, when synonymous words are put

together, a unique meaning often emerges.

  • Use is made of Relational Semantics.
  • Componential Semantics where each word is a

bundle of semantic features (as in the Schankian Conceptual Dependency system or Lexical Componential Semantics) is to be examined as a viable alternative.

Componential Semantics

  • Consider cat and tiger. Decide on componential attributes: Furry, Carnivorous, Heavy, Domesticable.
  • For cat: (Y, Y, N, Y)
  • For tiger: (Y, Y, Y, N)

Complete and correct attribute sets are difficult to design.
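The cat/tiger example above can be sketched as feature vectors over the four attributes; comparing two words then reduces to comparing their vectors:

```python
# Componential semantics sketch: each word is a bundle of Y/N values
# over the attributes (Furry, Carnivorous, Heavy, Domesticable).
ATTRIBUTES = ("furry", "carnivorous", "heavy", "domesticable")
WORDS = {
    "cat":   ("Y", "Y", "N", "Y"),
    "tiger": ("Y", "Y", "Y", "N"),
}

def differing_features(w1, w2):
    """Attributes on which the two words disagree."""
    return [a for a, v1, v2 in zip(ATTRIBUTES, WORDS[w1], WORDS[w2])
            if v1 != v2]

print(differing_features("cat", "tiger"))  # ['heavy', 'domesticable']
```

The hard part, as the slide notes, is not the comparison but designing a complete and correct attribute inventory in the first place.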

Semantic relations in wordnet

1. Synonymy 2. Hypernymy / Hyponymy 3. Antonymy 4. Meronymy / Holonymy 5. Gradation 6. Entailment 7. Troponymy 1, 3 and 5 are lexical (word to word), rest are semantic (synset to synset).

Synset: the foundation (house)

  • 1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape

Cod"; "she felt she had to get out of the house")

  • 2. house -- (an official assembly having legislative powers; "the legislature has two houses")
  • 3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
  • 4. family, household, house, home, menage -- (a social unit living together; "he moved his family to

Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")

  • 5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be

presented; "the house was full")

  • 6. firm, house, business firm -- (members of a business organization that owns or operates one or

more establishments; "he worked for a brokerage house")

  • 7. house -- (aristocratic family line; "the House of York")
  • 8. house -- (the members of a religious community living together)
  • 9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he

counted the house")

  • 10. house -- (play in which children take the roles of father or mother or children and pretend to interact

like adults; "the children were playing house")

  • 11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal

areas into which the zodiac is divided)

  • 12. house -- (the management of a gambling house or casino; "the house gets a percentage of every

bet")

slide-10
SLIDE 10

7/13/2012 10

Creation of Synsets

Three principles:
  • Minimality
  • Coverage
  • Replaceability

Synset creation (continued)

Home

John’s home was decorated with lights on the occasion of Christmas. Having worked for many years abroad, John returned home.

House

John’s house was decorated with lights on the occasion of Christmas. Mercury is situated in the eighth house of John’s horoscope.

Synsets (continued)

{house} is ambiguous. {house, home} has the sense of a social unit living together; is this the minimal unit? {family, house, home} makes the unit completely unambiguous. For coverage: {family, household, house, home}, ordered according to frequency. Replaceability of the most frequent words is a requirement.

Synset creation

From first principles – Pick all the senses from good standard dictionaries. – Obtain synonyms for each sense. – Needs hard and long hours of work.

slide-11
SLIDE 11

7/13/2012 11

Synset creation (continued)

From the wordnet of another language in the same family – Pick the synset and obtain the sense from the gloss. – Get the words of the target language. – Often the same words can be used. – Operations: translation, insertion and deletion.

Synset+Gloss+Example

Crucially needed for concept explication, wordnet building using another wordnet, and wordnet linking. English Synset: {earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)

Hindi Synset: भूक ं प, भूचाल, भूडोल, जलजला, भूकप, भू-क ं प, भू-कप, ज़लज़ला, भूिमक ं प, भूिमकप - ूाक ृ ितक कारण से पृवी क े भीतर भाग म क ु छ उथल-पुथल होने से ऊपर भाग क े सहसा हलने क बया "२००१ म गुज़रात म आये भूक ं प म काफ़ लोग मारे गये थे" (shaking of the surface of earth; many were killed in the earthquake in Gujarat) Marathi Synset: धरणीक ं प,भूक ं प - पृवीया पोटात ियोभ होऊन पृभाग हालयाची बया "२००१ साली गुजरातमये झालेया धरणीक ं पात अनेक लोक मृयुमुखी पडले"

Semantic Relations

  • Hypernymy and Hyponymy

– Relation between word senses (synsets) – X is a hyponym of Y if X is a kind of Y – Hyponymy is transitive and asymmetrical – Hypernymy is inverse of Hyponymy

(lion->animal->animate entity->entity)
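Because hypernymy is transitive, every ancestor on the chain is a hypernym of the starting synset. A minimal sketch over the lion chain above:

```python
# Hypernymy chain from the slide: lion -> animal -> animate entity -> entity.
HYPERNYM = {
    "lion": "animal",
    "animal": "animate entity",
    "animate entity": "entity",
}

def all_hypernyms(word):
    """Transitive closure of the hypernymy relation for one word."""
    out = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        out.append(word)
    return out

def is_a(x, y):
    """X is a kind of Y if Y appears anywhere above X (transitivity)."""
    return y in all_hypernyms(x)

print(all_hypernyms("lion"))   # ['animal', 'animate entity', 'entity']
print(is_a("lion", "entity"))  # True
print(is_a("entity", "lion"))  # False (asymmetry)
```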

Semantic Relations (continued)

  • Meronymy and Holonymy

– Part-whole relation, branch is a part of tree – X is a meronym of Y if X is a part of Y – Holonymy is the inverse relation of Meronymy: {kitchen} ………………………. {house}

slide-12
SLIDE 12

7/13/2012 12

Lexical Relation

  • Antonymy

– Oppositeness in meaning – Relation between word forms – Often determined by phonetics, word length etc. ({rise, ascend} vs. {fall, descend})

Troponym and Entailment

  • Entailment

{snoring – sleeping}

  • Troponym

{limp, strut – walk} {whisper – talk}

Entailment

Snoring entails sleeping. Buying entails paying.

  • Proper Temporal Inclusion.

The inclusion can go either way: sleeping temporally includes snoring; buying temporally includes paying.

  • Co-extensiveness. (Troponymy)

Limping is a manner of walking.

slide-13
SLIDE 13

7/13/2012 13

Opposition among verbs.

  • {rise, ascend} vs. {fall, descend}
    – tie-untie (do-undo)
    – walk-run (slow, fast)
    – teach-learn (same activity, different perspective)
    – rise-fall (motion upward vs. downward)
  • Opposition and Entailment
    – Hit or miss (both entail aim): backward presupposition.
    – Succeed or fail (both entail try).

The causal relationship.
  • Show-see; give-have.
  • Causation and entailment: giving entails having; feeding entails eating.

Kinds of Antonymy

  Size: Small - Big
  Quality: Good - Bad
  State: Warm - Cool
  Personality: Dr. Jekyll - Mr. Hyde
  Direction: East - West
  Action: Buy - Sell
  Amount: Little - A lot
  Place: Far - Near
  Time: Day - Night
  Gender: Boy - Girl

slide-14
SLIDE 14

7/13/2012 14

Kinds of Meronymy

  Component - Object: Head - Body
  Staff - Object: Wood - Table
  Member - Collection: Tree - Forest
  Feature - Activity: Speech - Conference
  Place - Area: Palo Alto - California
  Phase - State: Youth - Life
  Resource - Process: Pen - Writing
  Actor - Act: Physician - Treatment

Gradation

  State: Childhood, Youth, Old age
  Temperature: Hot, Warm, Cold
  Action: Sleep, Doze, Wake

Metonymy

  • Associated with Metaphors which are epitomes of

semantics

  • Oxford Advanced Learners Dictionary definition: “The

use of a word or phrase to mean something different from the literal meaning”

  • Does it mean Careless Usage?!

Insight from Sanskritic Tradition

  • Power of a word
    – Abhidha, Lakshana, Vyanjana
  • Meaning of "Hall":
    – The hall is packed (abhidha)
    – The hall burst into laughter (lakshana)
    – The hall is full (unsaid: and so we cannot enter) (vyanjana)

slide-15
SLIDE 15

7/13/2012 15

Metaphors in Indian Tradition

  • upamana and upameya
    – upameya: the object being compared
    – upamana: the object with which it is compared
    – Puru was like a lion in the battle with Alexander (Puru: upameya; lion: upamana)

Upamana, rupak, atishayokti

  • upamana: Explicit comparison

– Puru was like a lion in the battle with Alexander

  • rupak: Implicit comparison

– Puru was a lion in the battle with Alexander

  • Atishayokti (exaggeration): upamana and upameya

dropped – Puru’s army fled. But the lion fought on.

Modern study (1956 onwards, Richards et al.)

  • Three constituents of metaphor
    – Vehicle (the item used metaphorically)
    – Tenor (the metaphorical meaning of the former)
    – Ground (the basis for metaphorical extension)
  • "The foot of the mountain"
    – Vehicle: "foot"
    – Tenor: "lower portion"
    – Ground: the spatial parallel between the relationship of the foot to the human body and that of the lower portion of the mountain to the rest of the mountain

Interaction of semantic fields (Haas)

  • Core vs. peripheral semantic fields
  • Interaction of two words in metonymic relation brings in

new semantic fields with selective inclusion of features

  • Leg of a table

– Does not stretch or move – Does stand and support

slide-16
SLIDE 16

7/13/2012 16

Lakoff’s (1987) contribution

  • Source Domain
  • Target Domain
  • Mapping Relations

Mapping Relations: ontological correspondences

  • Anger is heat of fluid in a container

Mapping (ontological correspondences):
  (i) Container → Body
  (ii) Agitation of fluid → Agitation of mind
  (iii) Limit of resistance → Limit of ability to suppress
  (iv) Explosion → Loss of control

Image Schemas

  • Categories: Container Contained
  • Quantity

– More is up, less is down: output rose dramatically; accident rates were lower – Linear scales and paths: Ram is by far the best performer

  • Time

– Stationary event: we are coming to exam time – Stationary observer: weeks rush by

  • Causation: desperation drove her to extreme steps

Patterns of Metonymy

  • Container for contained

– The kettle boiled (water)

  • Possessor for possessed/attribute

– Where are you parked? (car)

  • Represented entity for representative

– The government will announce new targets

  • Whole for part

– I am going to fill up the car with petrol

slide-17
SLIDE 17

7/13/2012 17

Patterns of Metonymy (contd)

  • Part for whole

– I noticed several new faces in the class

  • Place for institution

– Lalbaug witnessed the largest Ganapati

Question: can you have part-part metonymy?

Purpose of Metonymy

  • More idiomatic/natural way of expression

– More natural to say the kettle is boiling as opposed to the water in the kettle is boiling

  • Economy

– Room 23 is answering (but not *is asleep)

  • Ease of access to referent

– He is in the phone book (but not *on the back of my hand)

  • Highlighting of associated relation

– The car in the front decided to turn right (but not *to smoke a cigarette)

Feature sharing not necessary

  • In a restaurant:

– Jalebii ko abhi dudh chaiye (no feature sharing) – The elephant now wants some coffee (feature sharing)

Proverbs

  • Describes a specific event or state of affairs which is applicable metaphorically to a range of events or states of affairs, provided they have the same or sufficiently similar image-schematic structure.

slide-18
SLIDE 18

7/13/2012 18

IndoWordNet

Linked Indian Language Wordnets

[Figure: IndoWordNet, linked Indian-language wordnets: the Hindi Wordnet at the hub, connected to the Marathi, Sanskrit, Bengali, Punjabi, Konkani, Urdu, Gujarati, Oriya, Kashmiri, Dravidian-language and North-East-language wordnets, and to the English Wordnet.]

Size of Indian Language wordnets (June, 2012) 1/2

Assamese    14958   Gauhati University, Guwahati, Assam
Bengali     23765   Indian Statistical Institute, Kolkata, West Bengal
Bodo        15785   Gauhati University, Guwahati, Assam
Gujarati    26580   Dharmsingh Desai University, Nadiad, Gujarat
Kannada      4408   Mysore University, Mysore, Karnataka
Kashmiri    23982   Kashmir University, Srinagar, Jammu and Kashmir
Konkani     25065   Goa University, Panaji, Goa
Malayalam    8557   Amrita University, Coimbatore, Tamil Nadu
Manipuri    16351   Manipur University, Imphal, Manipur
Marathi     24954   IIT Bombay, Mumbai, Maharashtra

Size of Indian Language wordnets (June, 2012) 2/2

Nepali      11713   Assam University, Silchar, Assam
Oriya       31454   Hyderabad Central University, Hyderabad, Andhra Pradesh
Punjabi     22332   Thapar University and Punjabi University, Patiala, Punjab
Sanskrit    18980   IIT Bombay, Mumbai
Tamil        8607   Tamil University, Thanjavur, Tamil Nadu
Telugu      14246   Dravidian University, Kuppam, Andhra Pradesh
Urdu        23071   Jawaharlal Nehru University, New Delhi

slide-19
SLIDE 19

7/13/2012 19

Categories of Synsets (1/2)

  • Universal: synsets which have an indigenous lexeme in all the languages (e.g. Sun, Earth).

  • Pan Indian: Synsets which have indigenous lexeme in all

the Indian languages but no English equivalent (e.g. Paapad).

  • In-Family: Synsets which have indigenous lexeme in the

particular language family (e.g. the term for Bhatija in Dravidian languages).

Categories of Synsets (2/2)

  • Language specific: Synsets which are unique to a

language (e.g. Bihu in Assamese language)

  • Rare: Synsets which express technical terms (e.g. ngram).
  • Synthesized: Synsets created in the language due to

influence of another language (e.g. Pizza).

Need for categorization

  • To bring systematicity to the way wordnet synsets are linked: Universal → Pan Indian → Language Family → Language-specific → Synthesised → Rare

  • All members have finished the Universal and Pan

Indian synsets

Categorization methodology

34378 Hindi synsets were sent to all Indo-wordnet groups in the tool, in which they had these options to categorize:

  • Yes
  • No

Universal synsets: the synsets which were categorized Yes and also have equivalent English words or synsets.

Pan-Indian: the synsets which were categorized Yes and did not have equivalent English words or synsets.

slide-20
SLIDE 20

7/13/2012 20

Expansion approach: linking is a subtle and difficult process

  • To link or not to link
  • While linking:

– face lexical and semantic chasms – Syntactic divergences in the example sentences

  • Change of POS
  • Copula drop (Hindi → Bangla)

Case of Kashmiri

Linking kinship relations and fine-grained concepts:
  Relative → Uncle → Mama / Chacha

Important decision: TWO kinds of linkages
  – Direct: पानी ↔ आब
  – Hypernymy: पानी ↔ ऽेश (linked at the hypernym level)

How to express a concept not present in the language?

slide-21
SLIDE 21

7/13/2012 21

Transliteration: often employed

  • Synset ID: 39; POS: adjective; Synonyms: सनाथ (sanaatha)
  • Gloss: जसका कोई पालन-पोषण या देखभाल करने वाला हो (one who has someone to look after him; opposite of orphan)
  • Example: "सनाथ बालक को अनाथ बालक क मदद करनी चाहए (children who are looked after should help the orphans) / साधक ूभु का हो जाने पर अनाथ नहं रहता, सनाथ हो जाता है"

  • Transliterated and adopted by Bangla and Gujarati

Short phrase: often employed

Bangla Urdu (meaning Inauspicious)

Linking synsets across languages: influence on Hindi Wordnet

Hindi wordnet has to add new synsets to accommodate language specific concepts, e.g., in Gujarati

ભૈરવજપ (bhairav jap) ID :: 103040 CAT :: NOUN CONCEPT :: मो क े िलए जप करते हुए पवत पर से अपने आप को िगराना (Taking God’s name and throwing oneself from atop a mountain to attain liberation) EXAMPLE :: िगरनार क े िशखर पर से याऽक भैरवजप करते थे एसा माना जाता है। (it is thought that pilgrms used to do bhairav jap atop Girnar mountain) SYNSET-HINDI :: भैरवजप

Overview of WSD techniques

slide-22
SLIDE 22

7/13/2012 22

Bird’s eye view

WSD approaches:
  – Machine Learning: Supervised, Semi-supervised, Unsupervised
  – Knowledge Based
  – Hybrid

CFILT - IITB

OVERLAP BASED APPROACHES

  • Require a Machine Readable Dictionary (MRD).
  • Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
  • These features could be sense definitions, example sentences, hypernyms etc.
  • The features could also be given weights.
  • The sense which has the maximum overlap is selected as the contextually appropriate sense.


LESK’S ALGORITHM

From Wordnet

  • The noun ash has 3 senses (first 2 from tagged texts)
  • 1. (2) ash -- (the residue that remains when something is burned)
  • 2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
  • 3. ash -- (strong elastic wood of any of various ash trees; used for

furniture and tool handles and sporting goods such as baseball bats)

  • The verb ash has 1 sense (no senses from tagged texts)
  • 1. ash -- (convert into ashes)


Sense Bag: contains the words in the definition of a candidate sense of the ambiguous word. Context Bag: contains the words in the definition of each sense of each context word.

E.g. “On burning coal we get ash.”
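A minimal runnable sketch of simplified Lesk on the "ash" example: the sense bag is each gloss quoted above, the context bag is just the sentence's words, and a crude suffix-stripper stands in for real stemming (the stoplist and stemmer are illustrative simplifications):

```python
# Simplified Lesk over the three WordNet noun glosses of "ash" above.
SENSES = {
    "ash#1 (residue)": "the residue that remains when something is burned",
    "ash#2 (tree)": "any of various deciduous pinnate-leaved ornamental "
                    "or timber trees of the genus Fraxinus",
    "ash#3 (wood)": "strong elastic wood of any of various ash trees; used for "
                    "furniture and tool handles and sporting goods such as baseball bats",
}
STOP = {"the", "a", "an", "of", "on", "or", "we", "any", "that", "when",
        "and", "for", "such", "as", "get", "is"}

def stem(w):
    # Very crude suffix stripping so burning/burned both become "burn".
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def bag(text, skip=()):
    out = set()
    for raw in text.split():
        w = raw.strip(".,;").lower()
        if w in STOP or w in skip:
            continue
        out.add(stem(w))
    return out

def lesk(sentence, target, senses):
    """Pick the sense whose gloss (sense bag) overlaps the context most."""
    context = bag(sentence, skip={target})   # drop the target word itself
    return max(senses, key=lambda s: len(bag(senses[s]) & context))

print(lesk("On burning coal we get ash.", "ash", SENSES))  # ash#1 (residue)
```

Here "burning" in the context and "burned" in the residue gloss both stem to "burn", so the residue sense wins; without stemming the overlap would be empty, which previews the sensitivity to exact words that the extended algorithm addresses.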

CRITIQUE

  • Proper nouns in the context of an ambiguous word can act as strong disambiguators. E.g. "Sachin Tendulkar" will be a strong indicator of the category "sports": Sachin Tendulkar plays cricket.
  • Proper nouns are not present in the thesaurus. Hence this approach fails to capture the strong clues provided by proper nouns.

  • Accuracy

– 50% when tested on 10 highly polysemous English words.


Extended Lesk’s algorithm

– The original algorithm is sensitive to the exact words in the definition.
– The extension includes glosses of semantically related senses from WordNet (e.g. hypernyms, hyponyms, etc.).
– The scoring function becomes:

    score_ext(S) = Σ_{s′ ∈ rel(S) ∪ {S}} | context(W) ∩ gloss(s′) |

where,
– gloss(S) is the gloss of sense S from the lexical resource.
– context(W) is the gloss of each sense of each context word.
– rel(S) gives the senses related to S in WordNet under some relations.


Example: Extended Lesk

  • “On combustion of coal we get ash”

From Wordnet

  • The noun ash has 3 senses (first 2 from tagged texts)
  • 1. (2) ash -- (the residue that remains when something is burned)
  • 2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
  • 3. ash -- (strong elastic wood of any of various ash trees; used for

furniture and tool handles and sporting goods such as baseball bats)

  • The verb ash has 1 sense (no senses from tagged texts)
  • 1. ash -- (convert into ashes)

Example: Extended Lesk (cntd)

  • “On combustion of coal we get ash”

From Wordnet (through hyponymy)

  • ash -- (the residue that remains when something is burned)

=> fly ash -- (fine solid particles of ash that are carried into the air when fuel is combusted) => bone ash -- (ash left when bones burn; high in calcium phosphate; used as fertilizer and in bone china)
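The hyponym glosses above are exactly what rescues this sentence: the direct gloss of the residue sense says "burned", not "combustion", so plain Lesk finds no overlap, while the fly-ash gloss supplies "combusted". A sketch with toy relation data and crude suffix stemming (so combustion/combusted both reduce to "combust"):

```python
# Extended Lesk sketch: augment each sense's gloss with the glosses of
# related synsets (here, the two hyponyms quoted above).
GLOSS = {
    "ash#1": "the residue that remains when something is burned",
    "ash#2": "any of various deciduous ornamental or timber trees "
             "of the genus Fraxinus",
    "fly_ash": "fine solid particles of ash that are carried into the air "
               "when fuel is combusted",
    "bone_ash": "ash left when bones burn high in calcium phosphate "
                "used as fertilizer",
}
REL = {"ash#1": ["fly_ash", "bone_ash"], "ash#2": []}   # hyponymy links
STOP = {"the", "a", "an", "of", "on", "or", "we", "any", "that", "when",
        "and", "as", "get", "in", "into", "are", "is"}

def stem(w):
    for suf in ("ing", "ed", "ion", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def bag(text, skip=()):
    out = set()
    for raw in text.split():
        w = raw.strip(".,;").lower()
        if w in STOP or w in skip:
            continue
        out.add(stem(w))
    return out

def score_ext(sense, context):
    """score_ext(S) = sum over s' in rel(S) ∪ {S} of |context ∩ gloss(s')|."""
    related = [sense] + REL.get(sense, [])
    return sum(len(bag(GLOSS[s]) & context) for s in related)

context = bag("On combustion of coal we get ash", skip={"ash"})
scores = {s: score_ext(s, context) for s in ("ash#1", "ash#2")}
print(scores)   # ash#1 wins, via 'combusted' in the fly-ash gloss
```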


Critique of Extended Lesk

  • Larger region of matching in WordNet

– Increased chance of Matching BUT – Increased chance of Topic Drift

WALKER’S ALGORITHM

Example sense scores for bank:
  Sense 1 (Finance): money +1, interest +1, annum +1 → total 3
  Sense 2 (Location): total 0

  • A thesaurus-based approach.
  • Step 1: For each sense of the target word, find the thesaurus category to which that sense belongs.
  • Step 2: Calculate the score for each sense using the context words.
    – E.g. The money in this bank fetches an interest of 8% per annum
    – Target word: bank
    – Clue words from the context: money, interest, annum, fetch

A context word adds 1 to the score of a sense when the thesaurus category (topic) of the word matches that of the sense.
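The voting scheme above can be sketched directly; the thesaurus entries and category labels below are a toy stand-in for a real thesaurus:

```python
# Walker's thesaurus-based scoring: each context word votes +1 for a
# sense whose thesaurus category matches the word's category.
THESAURUS = {            # word -> thesaurus categories (toy data)
    "money": {"finance"},
    "interest": {"finance"},
    "fetch": set(),
    "annum": {"finance"},
}
SENSE_CATEGORY = {"bank/finance": "finance", "bank/location": "location"}

def walker(context_words):
    scores = {sense: 0 for sense in SENSE_CATEGORY}
    for w in context_words:
        for sense, cat in SENSE_CATEGORY.items():
            if cat in THESAURUS.get(w, set()):
                scores[sense] += 1
    return scores

print(walker(["money", "interest", "fetch", "annum"]))
# {'bank/finance': 3, 'bank/location': 0}
```

This reproduces the score of 3 for the finance sense in the slide's example: money, interest, and annum vote for finance, while fetch has no category and abstains.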

WSD USING CONCEPTUAL DENSITY (Agirre and Rigau, 1996)

  • Select a sense based on the relatedness of that word-sense to the context.

  • Relatedness is measured in terms of conceptual

distance

– (i.e. how close the concept represented by the word and the concept represented by its context words are)

  • This approach uses a structured hierarchical semantic

net (WordNet) for finding the conceptual distance.

  • The smaller the conceptual distance, the higher the conceptual density.

– (i.e. if all words in the context are strong indicators of a particular concept then that concept will have a higher density.)


CONCEPTUAL DENSITY FORMULA


Wish list

The conceptual distance between two words

should be proportional to the length of the path between the two words in the hierarchical tree (WordNet).

The conceptual distance between two words

should be proportional to the depth of the concepts in the hierarchy.

where:
c = concept
nhyp = mean number of hyponyms
h = height of the sub-hierarchy
m = number of senses of the word and of context words contained in the sub-hierarchy
CD = Conceptual Density (0.2 is the smoothing factor)

[Figure: sub-tree under "entity" containing finance -> money -> bank-1 and location -> bank-2, annotated with d (depth) and h (height) of the concept "location".]
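The slide lists the variables, but the formula image itself did not survive extraction. As commonly stated for Agirre and Rigau's measure (a reconstruction, so treat the exact form with care), the conceptual density of a concept c whose sub-hierarchy contains m relevant senses is

```latex
CD(c, m) = \frac{\sum_{i=0}^{m-1} \mathit{nhyp}^{\,i^{0.20}}}{\sum_{j=0}^{h-1} \mathit{nhyp}^{\,j}}
```

The denominator is the expected total number of concepts in a sub-hierarchy of height h with branching factor nhyp, and the exponent 0.20 is the smoothing factor noted above.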

CONCEPTUAL DENSITY (cntd)


The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.

The CD formula will yield the highest density for the sub-hierarchy containing more senses.

The sense of W contained in the sub-hierarchy with the highest CD will be chosen.

CONCEPTUAL DENSITY (EXAMPLE)

The jury(2) praised the administration(3) and operation(8) of Atlanta Police Department(1).

Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of the resultant concepts (sub-hierarchies).
Step 3: The concept with the highest CD is selected.
Step 4: Select the senses below the selected concept as the correct senses for the respective words.

[Figure: sense lattice over operation, division, administrative_unit, jury, committee, police department, local department, government department, department, administration, body; the two candidate sub-hierarchies score CD = 0.256 and CD = 0.062.]


CRITIQUE

  • Resolves lexical ambiguity of nouns by finding a combination of senses that

maximizes the total Conceptual Density among senses.

  • The Good

– Does not require a tagged corpus.

  • The Bad

– Fails to capture the strong clues provided by proper nouns in the context.

  • Accuracy

– 54% on Brown corpus.


WSD USING RANDOM WALK ALGORITHM (PageRank) (Sinha and Mihalcea, 2007)

[Figure: sense graph for "Bell ring church Sunday" — one vertex per sense (S1, S2, S3 of each word), weighted edges from definition similarity, and a PageRank score attached to each vertex.]

Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each vertex (i.e., of each word sense).
Step 4: Select the vertex (sense) with the highest score.
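Steps 1-2 (graph construction) can be sketched as follows; the glosses are invented toy data, and the similarity is plain gloss-word overlap rather than a full Lesk variant:

```python
# Sketch of sense-graph construction for random-walk WSD (toy glosses).
# One vertex per (word, sense); edge weight = gloss-overlap similarity.
# Senses of the same word are never connected to each other.

def overlap(g1, g2):
    return len(set(g1.split()) & set(g2.split()))

def build_sense_graph(word_senses):
    """word_senses: word -> {sense: gloss}. Returns (vertices, weighted edges)."""
    vertices = [(w, s) for w, senses in word_senses.items() for s in senses]
    edges = {}
    for i, (w1, s1) in enumerate(vertices):
        for (w2, s2) in vertices[i + 1:]:
            if w1 == w2:
                continue  # no edges between senses of the same word
            wgt = overlap(word_senses[w1][s1], word_senses[w2][s2])
            if wgt > 0:
                edges[((w1, s1), (w2, s2))] = wgt
    return vertices, edges

word_senses = {
    "bell": {
        "s1": "a hollow metal device that gives a ring when struck",
        "s2": "the shape of a curve",
    },
    "church": {
        "s1": "a building for worship with a bell that gives a ring",
    },
}
vertices, edges = build_sense_graph(word_senses)
```

Step 3 then runs a PageRank-style walk over `edges`, and Step 4 keeps the highest-scoring sense per word.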


A look at Page Rank (from Wikipedia)

Developed at Stanford University by Larry Page (hence the name PageRank) and Sergey Brin as part of a research project about a new kind of search engine. The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was published in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.

A look at Page Rank (cntd)

PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents, so each document would begin with an estimated PageRank of 0.25. If pages B, C, and D each only link to A, they would each confer their 0.25 PageRank to A. All PageRank in this simplistic system would thus gather to A, because all links would be pointing to A: PR(A) = PR(B) + PR(C) + PR(D) = 0.75.

A look at Page Rank (cntd)

Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083). PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3. In general, PR(U) = Σ_{V ∈ B(U)} PR(V)/L(V), where B(U) is the set of pages linking to U and L(V) is the number of outbound links from V.

A look at Page Rank (damping factor)

The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. PR(U) = (1-d)/N + d · Σ_{V ∈ B(U)} PR(V)/L(V), where N is the size of the document collection.
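The damped formula above can be computed by simple power iteration. A small caveat about the four-page example: on the slide B, C and D link only to A, so to keep the walk well-defined this sketch also gives A outgoing links back to the others (an assumption; a dangling A would need separate handling):

```python
# Power-iteration sketch of the damped PageRank formula:
# PR(u) = (1-d)/N + d * sum over v in B(u) of PR(v)/L(v)

def pagerank(links, d=0.85, iters=50):
    """links: page -> list of pages it links to (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start evenly divided
    for _ in range(iters):
        new = {}
        for p in pages:
            incoming = sum(pr[v] / len(links[v]) for v in pages if p in links[v])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# Four-page universe: B, C, D link only to A; A links back out (assumption).
links = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
pr = pagerank(links)
```

As expected, A accumulates most of the mass while B, C and D stay equal by symmetry, and the scores remain a probability distribution.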

For WSD: Page Rank

  • Given a graph G = (V,E)
  • In(Vi) = predecessors of Vi
  • Out(Vi) = successors of Vi
  • In a weighted graph, the walker selects an outgoing edge at random, with higher-weight edges chosen with proportionally higher probability.


Other Link Based Algorithms

  • HITS algorithm invented by Jon Kleinberg (used by

Teoma and now Ask.com)

  • IBM CLEVER project
  • TrustRank algorithm.

CRITIQUE

  • Relies on random walks on graphs encoding label dependencies.
  • The Good

– Does not require any tagged data (a WordNet is sufficient).
– The weights on the edges capture the definition-based semantic similarities.
– Takes into account global data recursively drawn from the entire graph.

  • The Bad

– Poor accuracy

  • Accuracy

– 54% accuracy on SEMCOR corpus which has a baseline accuracy of 37%.


KB Approaches – Comparisons

Algorithm | Accuracy
WSD using Selectional Restrictions | 44% on Brown Corpus
Lesk's algorithm | 50-60% on short samples of "Pride and Prejudice" and some "news stories"
Extended Lesk's algorithm | 32% on lexical samples from Senseval 2 (wider coverage)
WSD using conceptual density | 54% on Brown corpus
WSD using Random Walk algorithms | 54% on SEMCOR corpus (baseline accuracy: 37%)
Walker's algorithm | 50% when tested on 10 highly polysemous English words


KB Approaches –Conclusions

  • Drawbacks of WSD using Selectional Restrictions

– Needs exhaustive Knowledge Base.

  • Drawbacks of Overlap based approaches

– Dictionary definitions are generally very small.
– Dictionary entries rarely take into account the distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc.; "cigarette" and "ash" never co-occur in a dictionary).
– Suffer from the problem of sparse match.
– Proper nouns are not present in an MRD, so these approaches fail to capture the strong clues provided by proper nouns.

SUPERVISED APPROACHES

NAÏVE BAYES


  • The algorithm finds the winning sense using

ŝ = argmax_{s ∈ senses} Pr(s | V_w)

‘Vw’ is a feature vector consisting of:

  • POS of w
  • Semantic & Syntactic features of w
  • Collocation vector (set of words around it) typically consists of next

word(+1), next-to-next word(+2), -2, -1 & their POS's

  • Co-occurrence vector (number of times w occurs in bag of words around

it)

Applying Bayes rule and the naive independence assumption:

ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(V_w^i | s)

BAYES RULE AND INDEPENDENCE ASSUMPTION

ŝ = argmax_{s ∈ senses} Pr(s | V_w), where V_w is the feature vector.

  • Apply Bayes rule:

Pr(s | V_w) = Pr(s) · Pr(V_w | s) / Pr(V_w)

(Pr(V_w) is the same for every sense, so it can be dropped from the argmax.)

  • Pr(V_w | s) can be approximated by the independence assumption:

Pr(V_w | s) = Pr(V_w^1 | s) · Pr(V_w^2 | s, V_w^1) · … · Pr(V_w^n | s, V_w^1, …, V_w^{n-1}) ≈ Π_{i=1..n} Pr(V_w^i | s)

Thus,

ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(V_w^i | s)

ESTIMATING PARAMETERS

  • Parameters in the probabilistic WSD are:

– Pr(s)
– Pr(V_w^i | s)

  • Senses are marked with respect to a sense repository (WordNet)

Pr(s) = count(s, w) / count(w)

Pr(V_w^i | s) = Pr(V_w^i, s) / Pr(s) = c(V_w^i, s, w) / c(s, w)
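Putting the estimation and the scoring formula together, here is a minimal sketch with a hypothetical three-example training set (real systems train on sense-tagged corpora such as SemCor) and add-one smoothing, which the slides do not specify but which keeps unseen features from zeroing out a product:

```python
import math
from collections import Counter, defaultdict

# Sketch of the Naive Bayes WSD scorer: s_hat = argmax_s Pr(s) * prod_i Pr(v_i|s).
# Training data is a toy list of (sense, context-word bag) pairs.

def train(tagged):
    sense_count = Counter()
    feat_count = defaultdict(Counter)
    vocab = set()
    for sense, feats in tagged:
        sense_count[sense] += 1
        for f in feats:
            feat_count[sense][f] += 1
            vocab.add(f)
    return sense_count, feat_count, vocab

def classify(feats, sense_count, feat_count, vocab):
    total = sum(sense_count.values())
    best, best_lp = None, float("-inf")
    for s, c in sense_count.items():
        lp = math.log(c / total)                           # log Pr(s)
        denom = sum(feat_count[s].values()) + len(vocab)   # add-one smoothing
        for f in feats:
            lp += math.log((feat_count[s][f] + 1) / denom)  # log Pr(v_i|s)
        if lp > best_lp:
            best, best_lp = s, lp
    return best

tagged = [
    ("bank/finance", ["money", "interest", "deposit"]),
    ("bank/finance", ["loan", "money"]),
    ("bank/river",   ["water", "shore", "fish"]),
]
sc, fc, vocab = train(tagged)
```

Summing log-probabilities instead of multiplying avoids floating-point underflow on long feature vectors.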

DECISION LIST ALGORITHM

  • Based on ‘One sense per collocation’ property.

– Nearby words provide strong and consistent clues as to the sense of a target word.

  • Collect a large set of collocations for the ambiguous word.
  • Calculate word-sense probability distributions for all such

collocations.

  • Calculate the log-likelihood ratio
  • Higher log-likelihood = more predictive evidence
  • Collocations are ordered in a decision list, with most predictive

collocations ranked highest.

Log( Pr(Sense-A | Collocation_i) / Pr(Sense-B | Collocation_i) )

Assuming there are only two senses for the word. Of course, this can easily be extended to ‘k’ senses.

[Figure: sample training data and the resultant decision list.]

DECISION LIST ALGORITHM (CONTD.)

Classification of a test sentence is based on the highest ranking collocation found in the test sentence. E.g. …plucking flowers affects plant growth…
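The construction and classification steps can be sketched as follows; the training examples are invented, and the smoothing constant `alpha` is an assumption (the slides do not specify how zero counts are handled):

```python
import math
from collections import defaultdict

# Sketch of decision-list construction for a binary sense distinction.
# Rank collocations by |log(Pr(A|coll)/Pr(B|coll))|; classify with the
# highest-ranked collocation present in the test context.

def build_decision_list(examples, alpha=0.1):
    """examples: list of (sense, collocation list); exactly two senses assumed."""
    counts = defaultdict(lambda: defaultdict(float))
    a, b = sorted({s for s, _ in examples})
    for sense, colls in examples:
        for c in colls:
            counts[c][sense] += 1
    rules = []
    for c, by_sense in counts.items():
        llr = math.log((by_sense[a] + alpha) / (by_sense[b] + alpha))
        rules.append((abs(llr), c, a if llr > 0 else b))
    rules.sort(reverse=True)          # most predictive collocations first
    return rules

def classify(rules, colls):
    for _, c, sense in rules:
        if c in colls:                # first matching rule decides
            return sense
    return None

examples = [
    ("plant/flora",   ["flower", "growth", "garden"]),
    ("plant/flora",   ["flower", "leaf"]),
    ("plant/factory", ["manufacturing", "car", "growth"]),
]
rules = build_decision_list(examples)
```

Note how "growth", seen with both senses, gets a log-likelihood ratio near zero and so sinks to the bottom of the list, while one-sided collocations like "flower" dominate.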


CRITIQUE

  • Harnesses powerful, empirically-observed properties of

language.

  • The Good

– Does not require a large tagged corpus. Simple implementation.
– Simple semi-supervised algorithm which builds on an existing supervised algorithm.
– Easy understandability of the resulting decision list.
– Is able to capture the clues provided by proper nouns from the corpus.

  • The Bad

– The classifier is word-specific.
– A new classifier needs to be trained for every word that you want to disambiguate.

  • Accuracy

– Average accuracy of 96% when tested on a set of 12 highly polysemous words.



Exemplar Based WSD (k-nn)

  • An exemplar based classifier is constructed for each word to be

disambiguated.

  • Step 1: From each sense-marked sentence containing the ambiguous word, a training example is constructed using:

– POS of w as well as POS of neighboring words
– Local collocations
– Co-occurrence vector
– Morphological features
– Subject-verb syntactic dependencies

  • Step2: Given a test sentence containing the ambiguous word, a test

example is similarly constructed.

  • Step3: The test example is then compared to all training examples

and the k-closest training examples are selected.

  • Step4: The sense which is most prevalent amongst these “k”

examples is then selected as the correct sense.
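Steps 1-4 reduce to nearest-neighbour search plus a majority vote. This sketch uses a toy bag-of-words overlap as the similarity measure (the actual feature set above is much richer: POS, collocations, morphology, syntax):

```python
from collections import Counter

# Sketch of exemplar-based (k-NN) WSD with feature-overlap similarity.
# Training exemplars below are invented toy data.

def similarity(a, b):
    return len(set(a) & set(b))

def knn_sense(train, test_feats, k=3):
    """train: list of (sense, feature list). Majority vote over k nearest."""
    ranked = sorted(train, key=lambda ex: similarity(ex[1], test_feats),
                    reverse=True)
    votes = Counter(sense for sense, _ in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ("bank/finance", ["money", "deposit", "account"]),
    ("bank/finance", ["loan", "interest", "money"]),
    ("bank/river",   ["water", "shore", "mud"]),
    ("bank/river",   ["fishing", "water", "grass"]),
]
```

With k=3 a single noisy neighbour cannot flip the decision, which is the usual reason for voting over k exemplars rather than taking the single closest one.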

WSD Using SVMs

  • SVM is a binary classifier which finds a hyperplane with the largest

margin that separates training examples into 2 classes.

  • As SVMs are binary classifiers, a separate classifier is built for each

sense of the word

  • Training Phase: Using a tagged corpus, for every sense of the word an SVM is trained using the following features:

– POS of w as well as POS of neighboring words
– Local collocations
– Co-occurrence vector
– Features based on syntactic relations (e.g. headword, POS of headword, voice of headword, etc.)

  • Testing Phase: Given a test sentence, a test example is constructed

using the above features and fed as input to each binary classifier.

  • The correct sense is selected based on the label returned by each

classifier.
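The one-binary-classifier-per-sense wiring can be sketched as below. To keep the sketch self-contained, a trivial keyword-threshold function stands in for a trained SVM (an assumption: a real system would plug in actual SVMs and break ties by decision margin):

```python
# Control-flow sketch of one-vs-rest sense selection. The stub classifier is
# a hypothetical stand-in for a trained SVM.

def make_stub_classifier(cue_words):
    """Stand-in for a trained binary SVM: fires if any cue word appears."""
    return lambda feats: any(w in feats for w in cue_words)

def disambiguate(classifiers, feats):
    """classifiers: sense -> binary classifier. Returns the senses whose
    classifier fires; a real system would rank them by decision margin."""
    return [s for s, clf in classifiers.items() if clf(feats)]

classifiers = {
    "bank/finance": make_stub_classifier({"money", "loan"}),
    "bank/river":   make_stub_classifier({"water", "shore"}),
}
```

The point of the sketch is the structure: each sense owns its own binary decision, and the test example is fed to every classifier.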

WSD Using Perceptron Trained HMM

  • WSD is treated as a sequence labeling task.
  • The class space is reduced by using WordNet’s super senses instead of

actual senses.

  • A discriminative HMM is trained using the following features:

– POS of w as well as POS of neighboring words
– Local collocations
– Shape of the word and neighboring words, e.g. for s = "Merrill Lynch & Co", shape(s) = Xx*Xx*&Xx

  • Lends itself well to NER, as labels like "person", "location", "time", etc. are included in the super sense tag set.
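One plausible reading of the shape feature in the example above (uppercase to X, lowercase to x, runs of lowercase collapsed to x*, whitespace dropped, other characters kept) reproduces the slide's output; the exact collapsing rules are an assumption:

```python
import re

# Sketch of the word-shape feature: "Merrill Lynch & Co" -> "Xx*Xx*&Xx".
# Collapsing only lowercase runs is one plausible reading of the example.

def shape(s):
    out = []
    for ch in s:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isspace():
            continue          # whitespace is dropped
        else:
            out.append(ch)    # punctuation such as '&' kept verbatim
    return re.sub(r"x{2,}", "x*", "".join(out))
```

Shape features like this let the HMM generalize over unseen capitalized tokens, which is why the super-sense tagger transfers well to NER.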

Supervised Approaches – Comparisons

Approach | Average Precision | Average Recall | Corpus | Average Baseline Accuracy
Naïve Bayes | 64.13% | not reported | Senseval-3 All Words Task | 60.90%
Decision Lists | 96% | not applicable | set of 12 highly polysemous English words | 63.9%
Exemplar-based disambiguation (k-NN) | 68.6% | not reported | WSJ6 containing 191 content words | 63.7%
SVM | 72.4% | 72.4% | Senseval-3 Lexical Sample Task (57 words) | 55.2%
Perceptron-trained HMM | 67.60% | 73.74% | Senseval-3 All Words Task | 60.90%

Supervised Approaches – Conclusions

  • General comments

– Use corpus evidence instead of relying on dictionary-defined senses.
– Can capture important clues provided by proper nouns, because proper nouns do appear in a corpus.

  • Naïve Bayes

– Suffers from data sparseness.
– Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense.
– A large number of parameters need to be trained.

  • Decision Lists

– A word-specific classifier: a separate classifier needs to be trained for each word.
– Uses the single most predictive feature, which eliminates the drawback of Naïve Bayes.

Multilingual resource constrained WSD Long line of work…

  • Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, It takes two to Tango: A Bilingual

Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.

  • Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya, Together We

Can: Bilingual Bootstrapping for WSD, Annual Meeting of the Association of Computational Linguistics (ACL 2011), Oregon, USA, June 2011.

  • Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for

Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.

  • Mitesh Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words

Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.

  • Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Domain-Specific

Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters, 5th International Conference on Global Wordnet (GWC2010), Mumbai, Jan, 2010.

  • Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Projecting Parameters for Multilingual Word Sense Disambiguation, Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2009.

  • Mitesh Khapra, Pushpak Bhattacharyya, Shashank Chauhan, Soumya Nair and Aditya

Sharma, Domain Specific Iterative Word Sense Disambiguation in a Multilingual Setting, International Conference on NLP (ICON08), Pune, India, December, 2008.

Algorithm for WSD


Iterative WSD

Motivated by the Energy expression in Hopfield network

Scoring function

Neuron ↔ Synset
Self-activation ↔ Corpus sense distribution
Weight of connection between two neurons ↔ Weight as a function of corpus co-occurrence and WordNet distance measures between synsets

Iterative WSD

Algorithm 1: performIterativeWSD(sentence)

  • 1. Tag all monosemous words in the sentence.
  • 2. Iteratively disambiguate the remaining words in the sentence in increasing order of their degree of polysemy.
  • 3. At each stage, select that sense for a word which maximizes the score given by the equation below.
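The control flow of Algorithm 1 can be sketched as follows; the Hopfield-style scoring function is abstracted into a callback, and the toy sense inventory and domain-overlap score in the example are hypothetical:

```python
# Control-flow sketch of performIterativeWSD. `score(sense, tagged)` stands in
# for the energy-based scoring function, which is supplied by the caller.

def iterative_wsd(words, senses_of, score):
    """words: tokens; senses_of: word -> list of candidate senses;
    score(sense, tagged): contextual score given already-tagged words."""
    tagged = {}
    # Step 1: tag all monosemous words
    for w in words:
        if len(senses_of[w]) == 1:
            tagged[w] = senses_of[w][0]
    # Step 2: remaining words in increasing order of polysemy
    remaining = sorted((w for w in words if w not in tagged),
                       key=lambda w: len(senses_of[w]))
    # Step 3: pick the score-maximizing sense at each stage
    for w in remaining:
        tagged[w] = max(senses_of[w], key=lambda s: score(s, tagged))
    return tagged

# Toy example: senses carry a "word/domain" label; the score counts how many
# already-tagged words share the candidate sense's domain.
senses_of = {"jury": ["jury/law"], "case": ["case/law", "case/container"]}
domain_score = lambda s, tagged: sum(t.split("/")[1] == s.split("/")[1]
                                     for t in tagged.values())
tags = iterative_wsd(["jury", "case"], senses_of, domain_score)
```

Disambiguating the least polysemous words first means each later, harder decision is made against a larger set of already-fixed anchors.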

Data


[Table: corpus statistics per language-domain pair — #polysemous words (tokens), #monosemous words, #polysemous unique words (types), token-to-type ratio, average degree of WordNet polysemy, average degree of corpus polysemy.]

Performance of different algorithms: monolingual WSD


WSD is costly! (1)

WordNets

  • Princeton Wordnet: ~80000 synsets: 30 man years
  • Eurowordnet: 12 man years on the average for 12

languages

  • Hindi wordnet: 24 man years

– http://www.cfilt.iitb.ac.in/wordnet/webhwn/

  • Indowordnet: getting created; 15 languages; 4 people on the average; close to 15000 synsets done in 1 year

  • Scale of effort really huge
  • Tricky too: when it comes to expanding from one

wordnet to another

Machine Learning based WSD is costly! (2)

Sense Annotated corpora for Machine Learning

  • SemCor: ~200000 sense marked words
  • SemEval/Senseval competition: to generate sense

marked corpora

  • Sense marked corpora created at IIT Bombay

– http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
– English: Tourism (~170000), Health (~150000)
– Hindi: Tourism (~170000), Health (~80000)
– Marathi: Tourism (~120000), Health (~50000)
– 12 man years for each <L,D> combination

Cost-accuracy trade-off. This is the dream: spread from one <L,D> combination to others.


Language Adaptation scenarios

Related Work

(Not mentioning references, because they are too many)

Knowledge-based approaches
Supervised approaches
Unsupervised approaches
Semi-supervised approaches
Hybrid approaches

No single existing solution to WSD completely meets our requirements of multilinguality, high domain accuracy and good performance in the face of limited annotation

Scenario 3: EM based unsupervised Approach




ESTIMATING SENSE DISTRIBUTIONS

If a sense-tagged Marathi corpus were available, we could have estimated the sense distributions directly. But such a corpus is not available.

Framework: Figure 1 and Figure 2 (E-M steps). Points to note…

  • Symmetric formulation
  • E and M steps are identical except for the change in

language

  • Either can be treated as the E-step, making the other the M-step

  • A back-and-forth traversal over translation correspondences

in the two languages

  • Does not require parallel corpus – only in-domain corpus is

needed



In General…


Experimental Setup

  • Languages: Hindi, Marathi
  • Domains: Tourism and Health (largest domain-specific sense tagged corpus)


Algorithms Being Compared

  • EM (our approach)
  • Personalized PageRank (Agirre and Soroa, 2009)
  • State-of-the-art bilingual approach (using Mutual

Information) (Kaji and Morimoto, 2002)

  • Random Baseline
  • Wordnet First sense baseline (supervised baseline)


Results

  • Performs better than other state-of-the-art knowledge based

and unsupervised approaches

  • Does not beat the Wordnet First Sense Baseline which is a

supervised baseline



Error Analysis – Non-Progressive Estimation

  • Some words have the same translations in the

target language across senses

– saagar (Hindi) ↔ samudra (Marathi) ("large water body" as well as "limitless")

  • Such words thus form a closed loop of translations
  • In such cases the algorithm does not progress and

gets stuck with the initial values

  • Same is the case for some language specific words

for which corresponding synsets were not available in the other language

  • Such words accounted for 17-19% of the total words

in the test corpus


Results – Eliminating words which have the problem of Non-Progressive Estimation

  • Results are now closer to Wordnet First Sense Baseline
  • For 2 out of the 4 language domain pairs the results are slightly

better than WFS – remarkable for an unsupervised approach


Demo

  • IIT Bombay’s system:

www.cfilt.iitb.ac.in/UNL_enco

  • Textual Entailment:

http://10.14.26.15:8084/TextualEntailmentInterface/index .jsp

Conclusions (1/2)

  • NLP is all about processing ambiguity,

with WSD as a fundamental task

  • Resource constraint and multilinguality

brings additional challenge

  • Wordnet: Great unifier of India (similar to

Adi Shankaracharya, Bollywood films…)

  • Getting linked with English WN; would like

to link with Eurowordnet

  • Application in MT, Search, Language

teaching, e-commerce


Future work

  • Closer study needed for familially close languages
  • Usage of language specific properties, in particular,

morphology

  • The projection idea can be used in other NLP

problems like POS tagging and Parsing

URLs

  • For resources

www.cfilt.iitb.ac.in

  • For publications

www.cse.iitb.ac.in/~pb

Thank you

Questions and comments?