

slide-1
SLIDE 1

Automatic Construction of WordNets by Using Machine Translation and Language Modeling

Martin Saveski, Igor Trajkovski

Information Society Language Technologies Ljubljana 2010


slide-2
SLIDE 2

Outline

  • WordNet
  • Motivation and Problem Statement
  • Methodology
  • Results
  • Evaluation
  • Conclusion and Future Work


slide-3
SLIDE 3

WordNet

  • Lexical database of the English language
  • Groups words into sets of cognitive synonyms called synsets
  • Each synset contains a gloss and links to other synsets

– Links define the place of the synset in the conceptual space

  • Source of motivation for researchers from various fields


slide-4
SLIDE 4

WordNet Example

Synset: {car, auto, automobile, motorcar}
Gloss: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Hypernym: {motor vehicle, automotive vehicle}
Hyponym: {cab, taxi, hack, taxicab}

slide-5
SLIDE 5

Motivation

  • Plethora of WordNet applications

– Text classification, clustering, query expansion, etc.

  • There is no publicly available WordNet for the Macedonian language

– Macedonian was not included in the EuroWordNet and BalkaNet projects

  • Manual construction is an expensive and labor-intensive process

– Need to automate the process


slide-6
SLIDE 6

Problem Statement

  • Assumptions:

– The conceptual space modeled by the PWN does not depend on the language in which it is expressed
– The majority of concepts exist in both languages, English and Macedonian, but have different notations

  • Given a synset in English, our goal is to find a set of words which lexicalize the same concept in Macedonian

[Diagram: English Synset → find translations which lexicalize the same concept → Macedonian Synset]

slide-7
SLIDE 7

Resources and Tools

  • Resources:

– Princeton implementation of WordNet (PWN) – backbone for the construction
– English-Macedonian Machine Readable Dictionary (in-house-developed) – 182,000 entries

  • Tools:

– Google Translation System (Google Translate)
– Google Search Engine


slide-8
SLIDE 8

Methodology

  1. Finding Candidate Words
  2. Translating the synset gloss
  3. Assigning scores to the candidate words
  4. Selection of the candidate words


slide-9
SLIDE 9

Finding Candidate Words

[Diagram: PWN Synset (W1, W2, W3, …, Wn) → MRD → T(W1) = CW11, CW12, …, CW1s; T(W2) = CW21, CW22, …, CW2k; T(W3) = CW31, CW32, …, CW3j; …; T(Wn) = CWn1, CWn2, …, CWnm → Candidate Words (CW1, CW2, CW3, …, CWy)]

  • T(W1) contains translations of all senses of the word W1
  • Essentially, we have a Word Sense Disambiguation (WSD) problem
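As a sketch of this step (the dictionary entries below are illustrative toy data, not the actual MRD), the candidate set is simply the union of the MRD translations of every word in the synset, which is exactly why translations of all senses get mixed together:

```python
# Toy illustration of step 1: candidate words are the union of the MRD
# translations of every word in the PWN synset. Translations cover *all*
# senses of each English word, which is what creates the WSD problem.
mrd = {  # illustrative English->Macedonian entries (not the real MRD)
    "name":    ["име", "назив", "крсти"],
    "epithet": ["епитет", "навреда"],
}

def candidate_words(synset, dictionary):
    """Union of all dictionary translations of the synset's words."""
    candidates = set()
    for word in synset:
        candidates.update(dictionary.get(word, []))
    return candidates

print(sorted(candidate_words(["name", "epithet"], mrd)))
```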

slide-10
SLIDE 10

Finding Candidate Words (cont.)

[Diagram: PWN Synset (W1, W2, …, Wn) → MRD → Candidate Words (CW1, CW2, …, CWm)]

slide-11
SLIDE 11

Translating the synset gloss

  • Statistical approach to WSD:

– Using the word sense definitions and a large text corpus, we can determine the sense in which the word is used

  • Word Sense Definition = Synset Gloss
  • The gloss translation can be used to measure the correlation between the synset and the candidate words
  • We use Google Translate (EN-MK) to translate the glosses of the PWN synsets


slide-12
SLIDE 12

Translating the synset gloss (cont.)

[Diagram: PWN Synset (W1, W2, …, Wn) → MRD → Candidate Words (CW1, CW2, …, CWm); PWN Synset Gloss → Gloss Translation (T-Gloss)]

slide-13
SLIDE 13

Assigning scores to the candidate words

  • To apply the statistical WSD technique we lack a large, domain-independent text corpus
  • Google Similarity Distance (GSD)

– Calculates the semantic similarity between words/phrases based on the Google result counts

  • We calculate the GSD between each candidate word and the gloss translation
  • The GSD score is assigned to each candidate word
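A minimal sketch of the scoring step, with `gsd` standing in for the real Google-count-based measure (an assumed callable, not the actual implementation):

```python
# Step 3 as a sketch: each candidate word gets exactly one score, its GSD
# similarity to the translated gloss (T-Gloss). `gsd` is an assumed
# stand-in for the real measure computed from Google result counts.
def score_candidates(candidates, t_gloss, gsd):
    """Map each candidate word to its GSD score against T-Gloss."""
    return {cw: gsd(cw, t_gloss) for cw in candidates}
```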


slide-14
SLIDE 14

Assigning scores to the candidate words

[Diagram: PWN Synset (W1, W2, …, Wn) → MRD → Candidate Words (CW1, CW2, …, CWm); PWN Synset Gloss → Gloss Translation (T-Gloss); Google Similarity Distance (GSD) computed between each candidate word and T-Gloss → Similarity Scores GSD(CW1, T-Gloss), GSD(CW2, T-Gloss), …, GSD(CWm, T-Gloss)]

slide-15
SLIDE 15

Selection of the candidate words

  • Selection by using two thresholds:

  1. Score(CW) > T1

– Ensures that the candidate word has a minimum correlation with the gloss translation

  2. Score(CW) > (T2 × MaxScore)

– Discriminates between the words which capture the meaning of the synset and those that do not
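The two-threshold rule can be sketched as follows (T1 = 0.2 and T2 = 0.62 match the worked example later in the deck; the scores in the test are made up):

```python
def select_candidates(scores, t1=0.2, t2=0.62):
    """Two-threshold selection: keep a candidate only if its score beats
    the absolute floor T1 *and* T2 times the best score in the synset."""
    if not scores:
        return []
    max_score = max(scores.values())
    return [cw for cw, s in scores.items() if s > t1 and s > t2 * max_score]
```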


slide-16
SLIDE 16

Selection of the candidate words (cont.)

[Diagram: the scored candidates (GSD(CW1, T-Gloss), …, GSD(CWm, T-Gloss)) pass through Selection, producing the Resulting Synset (CW1, CW2, …, CWk)]

slide-17
SLIDE 17

Example

PWN Synset: {Name, Epithet}
Synset Gloss: a defamatory or abusive word or phrase
Gloss Translation (MK-GLOSS): со клевети или навредлив збор или фраза

Candidate words (from the MRD), English explanations, and GSD scores:

  навреда   – offence, insult                0.78
  епитет    – epithet, in a positive sense   0.49
  углед     – reputation                     0.41
  крсти     – to name somebody               0.40
  назив     – name, title                    0.37
  презиме   – last name                      0.35
  наслов    – title                          0.35
  глас      – voice                          0.34
  име       – first name                     0.33

Selection: T1 = 0.2, T2 = 0.62
Resulting MWN Synset: {Навреда}

slide-18
SLIDE 18

Results of the MWN construction

NB: All words included in the MWN are lemmas

Size of the MWN:

             Nouns    Verbs   Adjectives   Adverbs
  Synsets    22,838   7,256   3,125        57
  Words      12,480   2,786   2,203        84

slide-19
SLIDE 19

Evaluation of the MWN

  • There is no manually constructed WordNet (lack of a gold standard)
  • Manual evaluation:

– Labor-intensive and expensive

  • Alternative method:

– Evaluation by using the MWN in practical applications
– MWN applications were our motivation and ultimate goal


slide-20
SLIDE 20

MWN for Text Classification

  • Easy to measure and compare the performance of the classification algorithms
  • We extended the synset similarity measures to the word-to-word, i.e., text-to-text level

– Leacock and Chodorow (LCH) (node-based)
– Wu and Palmer (WUP) (arc-based)

  • Baseline:

– Cosine Similarity (classical approach)


slide-21
SLIDE 21

MWN for Text Classification (cont.)

  • Classification Algorithm:

– K Nearest Neighbors (KNN)
– Allows the similarity measures to be compared unambiguously

  • Corpus: A1 TV - News Archive (2005-2008)


A1 Corpus, size and categories:

  Category    Articles   Tokens
  Balkan      1,264      159,956
  Economy     1,053      160,579
  Macedonia   3,323      585,368
  Sci/Tech    920        17,775
  World       1,845      222,560
  Sport       1,232      142,958
  TOTAL       9,637      1,289,196

slide-22
SLIDE 22

MWN for Text Classification – Results


[Chart: Text Classification Results (F-Measure, 10-fold cross-validation) per category (Balkan, Economy, Macedonia, Sci/Tech, World, Sport) and weighted average, comparing WUP, Cosine, and LCH similarity; weighted-average F-measures shown: 80.4%, 73.7%, 59.8%]

slide-23
SLIDE 23

Future Work

  • Investigation of the semantic relatedness between the candidate words

– Word clustering prior to assignment to a synset
– Assigning a group of candidate words to the synset

  • Experiments with using the MWN for other applications

– Text Clustering
– Word Sense Disambiguation


slide-24
SLIDE 24

Q&A


Thank you for your attention. Questions?

slide-25
SLIDE 25

Google Similarity Distance

  • Words/phrases acquire meaning from the way they are used in society and from their relative semantics to other words/phrases
  • Formula:

f(x), f(y), f(x, y) – result counts of x, y, and (x, y); N – normalization factor

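The formula itself did not survive extraction, but the variable legend matches the standard Normalized Google Distance (Cilibrasi & Vitányi). A hedged sketch under that assumption (note the GSD scores in the example slide read as similarities, so the deck may apply a further transformation on top of this distance):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance: 0 when x and y always co-occur,
    growing as they co-occur less. fx, fy, fxy are the result counts
    f(x), f(y), f(x, y); n is the normalization factor N (roughly the
    number of indexed pages)."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```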

slide-26
SLIDE 26

Synset similarity metrics

  • Leacock and Chodorow (LCH)

sim_LCH(s1, s2) = −log( len(s1, s2) / (2 · D) )

len – number of nodes from s1 to s2; D – maximum depth of the hierarchy

  • Measures in number of nodes
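A minimal sketch of the LCH measure, sim_LCH = −log(len / (2·D)) (argument names are mine):

```python
import math

def lch_similarity(path_len, max_depth):
    """Leacock-Chodorow similarity: -log(len / (2*D)). `path_len` is the
    number of nodes on the shortest path between the two synsets,
    `max_depth` the maximum depth D of the hierarchy. Shorter paths
    yield higher similarity."""
    return -math.log(path_len / (2 * max_depth))
```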

slide-27
SLIDE 27

Synset similarity metrics (cont.)

  • Wu and Palmer (WUP)

sim_WUP(s1, s2) = 2 · depth(LCS) / (depth(s1) + depth(s2))

LCS – the most specific synset that is an ancestor of both synsets

  • Measures in number of links
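And the WUP measure, likewise as a minimal sketch (the synset depths are assumed given):

```python
def wup_similarity(depth_lcs, depth_s1, depth_s2):
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(s1) + depth(s2)).
    Equals 1 when both synsets coincide with their least common
    subsumer; shrinks as the LCS sits higher above them."""
    return 2 * depth_lcs / (depth_s1 + depth_s2)
```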

slide-28
SLIDE 28

Semantic Word Similarity

  • The similarity of W1 and W2 is defined as the maximum similarity (minimum distance) between:

– the set of synsets containing W1, and
– the set of synsets containing W2
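A sketch of this definition, with `synsets_of` and `synset_sim` as assumed lookups (their real counterparts would be the MWN index and the LCH or WUP measures):

```python
def word_similarity(w1, w2, synsets_of, synset_sim):
    """Maximum synset similarity over all pairs (s1, s2) where w1 is a
    member of s1 and w2 is a member of s2."""
    return max(synset_sim(s1, s2)
               for s1 in synsets_of(w1)
               for s2 in synsets_of(w2))
```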


slide-29
SLIDE 29

Semantic Text Similarity

  • The similarity between texts T1 and T2 is:

– idf – inverse document frequency (measures word specificity)


sim(T1, T2) = 1/2 · [ (Σ_{w ∈ T1} maxSim(w, T2) · idf(w)) / (Σ_{w ∈ T1} idf(w))
                    + (Σ_{w ∈ T2} maxSim(w, T1) · idf(w)) / (Σ_{w ∈ T2} idf(w)) ]
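This text-similarity definition translates almost line-for-line into code (a sketch; `word_sim` and `idf` are assumed callables standing in for the semantic word similarity and the corpus idf):

```python
def text_similarity(t1, t2, word_sim, idf):
    """idf-weighted semantic text similarity: the average of two directed
    scores, each matching every word in one text against its best
    counterpart in the other, weighted by word specificity (idf)."""
    def directed(src, dst):
        num = sum(max(word_sim(w, v) for v in dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```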