Identification of Transliterated Foreign Words in Hebrew Script - - PowerPoint PPT Presentation


Introduction Method Evaluation and Results Summary Identification of Transliterated Foreign Words in Hebrew Script Yoav Goldberg Michael Elhadad CiCLing 2008, Haifa, Israel Yoav Goldberg, Michael Elhadad Foreign Word Identification


slide-1
SLIDE 1

Introduction Method Evaluation and Results Summary

Identification of Transliterated Foreign Words in Hebrew Script

Yoav Goldberg, Michael Elhadad

CICLing 2008, Haifa, Israel

Yoav Goldberg, Michael Elhadad Foreign Word Identification

slide-5
SLIDE 5

A Typical Hebrew Text

Taken from YNET Gossip Section a Few Days Ago

[Screenshot of the article not preserved in this transcript.]

KAST VVLLNTYYN’Z DYY ST DABL SKS

  • Foreign words written in Hebrew script
  • Can’t expect comprehensive dictionary coverage
  • Would like to identify them automatically

slide-11
SLIDE 11

Which words are we looking for?

Names

  • of people, if they are not Israeli / Hebrew / Russian / Amharic / Arabic
      YES: ג'ון/John, רוברט/Robert, ג'ייק/Jake
      NO: יואב/Yoav, גולדברג/Goldberg, אלחדד/Elhadad
      YES: קרן/Karen
      NO: קרן/Keren
      YES: מייקל/Michael (pronounced maykel)
      NO: מיכאל/Michael (pronounced mi-cha-el)
  • of places, in case they are pronounced the same in English
      YES: הוליווד/Hollywood, ניויורק/New-York
      NO: אנגליה/Anglia (England), איטליה/Italya (Italy)
  • of companies/organizations, if they sound non-Hebrew (mostly easy to decide)
      YES: מייקרוסופט/Microsoft, נייק/Nike
      NO: אוסם/Osem, תנובה/Tnuva
  • of months, if they sound the same in English
      YES: אוגוסט/August, ספטמבר/September
      NO: יוני/Yuni (June), יולי/Yuli (July)

slide-17
SLIDE 17

Which words are we looking for?

Cognates, transliterations and borrowings

It revolves around the pronunciation:

  • Some words are clearly Foreign
      Example: קאסט/Cast, דאבל/Double, דיי/Day, טרנד/Trend
  • Foreign origin, but Hebrew-sounding – NO
      Example: אנציקלופדיה/En-ci-klo-pe-di-ya
  • Non-inflected, and pronounced the same – YES
      Example: רדיו/Radio, סקס/Sex
  • Inflected, pronounced the same – YES
      Example: טרנדי/Trendy
  • Inflected, pronounced differently – MAYBE
      Example: אלכוהולי/Alcoholi (vs. Alcoholic)
  • Can be read as Hebrew or Foreign – DEPENDS on context
      Example: בד/Bad vs. cloth, branch; רן/Run vs. sang, proper name

slide-21
SLIDE 21

The approach

We chose to tackle the problem as performing Language Identification at the word level.

  • Language Identification accuracy: > 99%. A solved problem?
  • . . . but it requires about 50 characters
  • . . . and on top of that . . .

slide-23
SLIDE 23

Additional Problems We Have to Face

Hebrew Writing System

  • Vowels are not written in most cases
  • The letters (א, ו, י) can be either vowels or consonants
  • Each of the pairs (p,f), (b,v), (s,sh) is encoded by a single letter (פ, ב, ש)
  • The sounds th and j do not exist
  ⇒ Words are even shorter
  ⇒ Word forms are very ambiguous

(Lack of) Training Data

  • No dictionary available
  • Many ways of transliterating the same word

slide-28
SLIDE 28

Naive Bayes Language Identification

English

  • Assume a language model M generating a word w
  • We have several such models (M_i), one for each language
  • For a given word, we would like to find the language L most likely to generate this word
  • Using Bayes' rule, the word is fixed, and we assume the models are a priori equally likely

Math

  P(w \mid M)

  L = \arg\max_{L} P(M_L \mid w)

  P(M_i \mid w) = \frac{P(w \mid M_i)\, P(M_i)}{P(w)}

  P(M_i \mid w) \propto P(w \mid M_i)\, P(M_i)

  P(M_i \mid w) \propto P(w \mid M_i)
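The decision rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `identify_language`, the labels, and the toy likelihood values are hypothetical and do not come from the talk. With equal priors, picking the model with the highest P(M_i | w) reduces to picking the one with the highest P(w | M_i).

```python
def identify_language(word, models):
    """Return the label whose model assigns `word` the highest likelihood.

    `models` maps a language label to a function word -> P(w | M_i).
    Priors are assumed equal, so they drop out of the arg max.
    """
    return max(models, key=lambda label: models[label](word))


# Toy stand-ins for real language models (made-up numbers, illustration only).
toy_models = {
    "hebrew": lambda w: 0.002 if w == "shalom" else 0.0001,
    "foreign": lambda w: 0.003 if w == "trend" else 0.0002,
}

print(identify_language("shalom", toy_models))
print(identify_language("trend", toy_models))
```

Any real model only has to expose the same word-to-probability interface to plug into this rule.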

slide-29
SLIDE 29


Naive Bayes Language Identification

How do we estimate P(w|Mi) ?


slide-32
SLIDE 32

Probabilistic Model

Generative N-Gram Model

English

  • A Markov Model generating letters
  • Each letter depends on the previous n letters
  • The probabilities for generating a letter are the Model Parameters

Math

Probability of generating a letter c_i:

  P(c_i \mid c_{i-1}, \ldots, c_{i-n})

Probability of generating a word (w = c_1, \ldots, c_N):

  \prod_{n \le i \le N} P(c_i \mid c_{i-1}, \ldots, c_{i-n})

Model Parameters are estimated via Relative Frequency:

  P(c_i \mid c_{i-1}, \ldots, c_{i-n}) = \frac{\#(c_i c_{i-1} \ldots c_{i-n})}{\#(c_{i-1} \ldots c_{i-n})}
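A minimal sketch of the letter n-gram model with relative-frequency estimates follows. Several details are assumptions made for the example rather than facts from the talk: `n` is the context length (the number of previous letters), words are left-padded with a start symbol `^` so the first letters have a full context, and no end-of-word symbol is modeled. The training words are toy data.

```python
from collections import Counter


def train_ngram(words, n, pad="^"):
    """Count (context, letter) events for a letter model with context length n.

    Each word is left-padded with n copies of `pad` (an assumed start
    symbol) so that the first letters also have a full context.
    """
    joint = Counter()    # #(context followed by letter)
    context = Counter()  # #(context)
    for word in words:
        padded = pad * n + word
        for i in range(n, len(padded)):
            ctx, letter = padded[i - n:i], padded[i]
            joint[(ctx, letter)] += 1
            context[ctx] += 1
    return joint, context


def word_probability(word, model, n, pad="^"):
    """Product of relative-frequency estimates P(c_i | previous n letters)."""
    joint, context = model
    padded = pad * n + word
    prob = 1.0
    for i in range(n, len(padded)):
        ctx, letter = padded[i - n:i], padded[i]
        if context[ctx] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= joint[(ctx, letter)] / context[ctx]
    return prob


model = train_ngram(["trend", "trendy", "tree"], n=2)
print(word_probability("trend", model, n=2))
print(word_probability("xyz", model, n=2))
```

The second call shows why unsmoothed higher-order models are brittle: a single unseen letter sequence drives the whole word probability to zero.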

slide-36
SLIDE 36

Probabilistic Model

Non-Traditional Smoothing

English

  • Smoothing is hardly needed for 1- or 2-gram models . . .
  • . . . but we want the power of 3- and 4-gram models.
  • Instead of using a traditional backoff strategy, we linearly combine 4 different n-gram models (this can be viewed as a voting scheme)
  • We didn’t do anything fancy for setting the λs

Math

  P(w \mid M_i) = \lambda_1 P(w \mid M_{i1}) + \lambda_2 P(w \mid M_{i2}) + \lambda_3 P(w \mid M_{i3}) + \lambda_4 P(w \mid M_{i4})
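The linear combination can be sketched as below. The slides do not give the λ values, so the uniform weights here are an assumption, and the component probabilities are made-up numbers standing in for real 1- to 4-gram models of one language.

```python
def interpolated_probability(word, component_probs, weights):
    """Linear combination: sum over k of lambda_k * P(w | M_ik)."""
    if len(component_probs) != len(weights):
        raise ValueError("need one weight per component model")
    return sum(lam * p(word) for lam, p in zip(weights, component_probs))


# Hypothetical component models for one language; each returns P(w | M_ik).
components = [
    lambda w: 0.010,  # 1-gram model
    lambda w: 0.004,  # 2-gram model
    lambda w: 0.0,    # 3-gram model: unseen sequence, probability zero
    lambda w: 0.0,    # 4-gram model: unseen sequence, probability zero
]
uniform = [0.25, 0.25, 0.25, 0.25]  # assumed weights, not taken from the talk

print(interpolated_probability("trend", components, uniform))
```

Because the lower-order models still contribute, the combined score stays non-zero even when the 3- and 4-gram models have never seen the sequence, which is the smoothing effect the slide refers to.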

slide-38
SLIDE 38

Probabilistic Model

Backward Model

English

  • We also add backward-moving models (some of the probabilities are different . . . )

Math

  P(w \mid M_i) = \lambda_1 P(w \mid M_{i1}) + \lambda_2 P(w \mid M_{i2}) + \lambda_3 P(w \mid M_{i3}) + \lambda_4 P(w \mid M_{i4}) + \lambda_5 P(w \mid M_{i2,\mathrm{back}}) + \lambda_6 P(w \mid M_{i3,\mathrm{back}}) + \lambda_7 P(w \mid M_{i4,\mathrm{back}})
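One way to see why a backward model carries different probabilities is to compare a conditional estimate read left to right with the matching one read right to left. The sketch below assumes that "backward" means training on reversed words, which the slides do not spell out; the helper `conditional` and the toy words are hypothetical.

```python
def conditional(words, context, letter):
    """Relative frequency of `letter` directly after the single letter `context`."""
    follow = [w[i + 1] for w in words for i in range(len(w) - 1) if w[i] == context]
    return follow.count(letter) / len(follow) if follow else 0.0


words = ["trend", "trendy", "tree"]
reversed_words = [w[::-1] for w in words]

forward = conditional(words, "e", "n")            # P(n | e), reading left to right
backward = conditional(reversed_words, "n", "e")  # P(e | n), reading right to left

print(forward, backward)
```

The same letter pair gets a different conditional probability in each direction, so the backward models add information rather than repeating the forward ones.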

slide-41
SLIDE 41

Unsupervised Setting: Training Data by Over-generation

Problem

  • We need to count the number of times each n-gram occurs . . .
  • . . . but we don’t have transliterated-Hebrew text
  • . . . and we don’t even have “pure Hebrew” text

Solution 1

  Use the Ben Yehuda Project for estimating pure Hebrew

Solution 2

  Use the Brown Corpus, the CMU pronunciation dictionary and a few simple rules for generating noisy transliterated-Hebrew

slide-47
SLIDE 47

The Over-Generation Process

Raw English Text (6 MB of Brown)

  The Fulton County Grand Jury said Friday an investigation of Atlanta’s recent ...

CMUDICT

  ...
  COUNTS    K AW1 N T S
  COUNTY    K AW1 N T IY0
  COUNTY-2  K AW1 N IY0
  COUP      K UW1
  ...

Phonetic Representation

  *K AW N T IY
  *K AW N IY

Phonetic → Hebrew

  *K   כ
  AW   או, אוו, ואו
  N    נ
  T    ט, ת
  IY   י, אי, ε

⇒ Many Hebrew Transliterations
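The over-generation step can be sketched as a Cartesian product over per-phoneme spelling options. This is an illustration only: the table copies the single example from the slide (with the Hebrew strings restored to reading order, which is an editorial reconstruction), the empty string stands for "phoneme not written", and the `*` word-start marker and stress digits are ignored here. The authors' actual rule set is not shown in the slides.

```python
from itertools import product

# Phoneme -> candidate Hebrew spellings, taken from the slide's example only.
# "" means the phoneme is left unwritten.
PHONEME_TO_HEBREW = {
    "K": ["כ"],
    "AW": ["או", "אוו", "ואו"],
    "N": ["נ"],
    "T": ["ט", "ת"],
    "IY": ["י", "אי", ""],
}


def over_generate(phonemes, table=PHONEME_TO_HEBREW):
    """Return every spelling obtained by picking one option per phoneme."""
    options = [table[p] for p in phonemes]
    return ["".join(choice) for choice in product(*options)]


# CMUdict pronunciation of COUNTY with stress digits removed: K AW N T IY
candidates = over_generate(["K", "AW", "N", "T", "IY"])
print(len(candidates))  # 1 * 3 * 1 * 2 * 3 = 18
```

Every candidate, plausible or not, goes into the training data; the n-gram counts then reflect which letter sequences recur across many transliterations.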

slide-50
SLIDE 50

More about Over-Generation

Isn’t This Just Like Writing Rules?

Not really: Simple Rules → Data → Learning → Complex Rules of a Different Nature

Some Things to Notice

  • Every foreign word is represented in accordance with its actual corpus frequency
  • Many writing variations are taken into account
  ⇒ Complex interactions

slide-52
SLIDE 52

Not All Foreign Words are Created Equal

Observation

  • Many of the transliterated words are Proper Names
  • The sound patterns of proper names are somewhat different from those of regular words

Adding another language model

  ⇒ We actually have 3 languages: Hebrew, Foreign and Foreign-Name
  • Use a heuristic for extracting proper nouns, and estimate a language model for them
  • This is very noisy, but still useful
  • If the classifier decides Foreign or Foreign-Name, we take it to be Foreign
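The three-model decision with the final merge of labels might look like the sketch below. The likelihood functions and their values are hypothetical placeholders; only the label set and the merging rule come from the slide.

```python
LABELS = ("Hebrew", "Foreign", "Foreign-Name")


def classify(word, likelihoods):
    """Three-way decision followed by merging the two foreign labels.

    `likelihoods` maps each label in LABELS to a function word -> P(w | M).
    """
    best = max(LABELS, key=lambda label: likelihoods[label](word))
    return "Hebrew" if best == "Hebrew" else "Foreign"


# Made-up likelihoods, for illustration only.
toy = {
    "Hebrew": lambda w: 1e-6,
    "Foreign": lambda w: 2e-7,
    "Foreign-Name": lambda w: 5e-6 if w == "goldman" else 1e-8,
}

print(classify("goldman", toy))  # the Foreign-Name model wins, reported as Foreign
print(classify("bayit", toy))    # the Hebrew model wins
```

The names model only has to beat the Hebrew model on name-like words; which of the two foreign models wins does not affect the output.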

slide-53
SLIDE 53

Adding a Lexicon

HSpell – a Hebrew Spell Checker

  • Let’s assume a word is Foreign iff it’s not in HSpell ⇒ Not very good results
  • Let’s assume a word is Foreign iff it’s not in HSpell and the statistical model says it’s Foreign ⇒ Better
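The second rule on this slide is a conjunction of a lexicon check and the statistical decision. The sketch below uses a plain Python set as a stand-in for the lexicon and a stub classifier; it does not call HSpell, and all names and words in it are hypothetical.

```python
def is_foreign(word, lexicon, statistical_label):
    """Foreign iff the word is missing from the lexicon AND the model says Foreign."""
    if word in lexicon:
        return False
    return statistical_label(word) == "Foreign"


# Stand-ins for illustration: a tiny lexicon and a stub statistical classifier.
lexicon = {"shalom", "bayit"}


def stub_model(word):
    return "Foreign" if word in {"trend", "bayit"} else "Hebrew"


print(is_foreign("trend", lexicon, stub_model))  # not in lexicon, model says Foreign
print(is_foreign("bayit", lexicon, stub_model))  # in lexicon, so the model is overruled
print(is_foreign("xyz", lexicon, stub_model))    # not in lexicon, model says Hebrew
```

The lexicon can only veto a Foreign decision, never create one, which is why this combination trades recall for precision.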


slide-55
SLIDE 55

Evaluation Setting

The Data

  • 50 articles from the gossip section of YNET
  • 9618 words, 4044 word types
  • Removed prefixes
  • About 10% of the words are Foreign!

  3608  Hebrew words
   251  Foreign Proper Names
   117  Foreign Words
    68  Ambiguous (hard for humans to judge), discarded
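As a quick consistency check on these counts, the four categories add up to the number of word types, and the foreign share comes out close to the quoted figure. The second line assumes the "about 10%" is measured over word types after the ambiguous ones are discarded, which the slide does not state.

```latex
3608 + 251 + 117 + 68 = 4044
\qquad
\frac{251 + 117}{3608 + 251 + 117} = \frac{368}{3976} \approx 9.3\%
```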

slide-58
SLIDE 58

Results - Supervised

5-fold cross-validation experiment

  Experiment   Precision (%)   Recall (%)
  Baseline         17.75          60.42
  Vot              59.72          60.81
  Vot+Back         59.70          60.76

  • Baseline: state-of-the-art Language Identification (Dunning 94)
  • Vot, Vot+Back: our combined (smoothed) models, without and with backward models
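For reference, precision and recall of the kind reported in these tables can be computed as sketched below. The slides do not say which class counts as positive; treating Foreign as the positive class is an assumption, and the labels here are toy data, not the evaluation set.

```python
def precision_recall(gold, predicted, positive="Foreign"):
    """Precision and recall for one class, given parallel label lists."""
    true_pos = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    predicted_pos = sum(p == positive for p in predicted)
    gold_pos = sum(g == positive for g in gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


gold = ["Foreign", "Hebrew", "Hebrew", "Foreign", "Hebrew"]
pred = ["Foreign", "Foreign", "Hebrew", "Hebrew", "Hebrew"]

print(precision_recall(gold, pred))
```

With roughly one foreign word in ten, a system that over-predicts Foreign keeps its recall but loses precision quickly, which matches the pattern of the baseline row.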

slide-63
SLIDE 63

Results – Unsupervised

Training on Ben-Yehuda and Generated Data

  Experiment          Precision (%)   Recall (%)
  Best Supervised         59.70          60.76
  Baseline (3-gram)       58.7           64.9
  Baseline (4-gram)       55.6           65.7
  Vot                     76.3           71.7
  Vot+Back                80.4           71.4
  Vot+Back+Names          80.1           82

  • Baseline: state-of-the-art language identification with 3- and 4-gram models
  • Vot: our model, smoothed
  • Vot+Back: our model, adding a backward model
  • Vot+Back+Names: our model, adding a proper names model

slide-67
SLIDE 67

Results – Unsupervised with a Lexicon

Results Using HSpell

  Experiment              Precision (%)   Recall (%)
  Best w/o lexicon            80.1           82
  HSPELL                      13             85
  Vot+Back+HSPELL             91.86          61.41
  Vot+Back+Names+HSPELL       91.19          70.38

  • HSPELL: only HSpell
  • Vot+Back+HSPELL: HSpell + statistical model (no Proper Names model)
  • Vot+Back+Names+HSPELL: HSpell + statistical model (with Proper Names model)

slide-68
SLIDE 68

Conclusions and Summary

The Good

  • We identify Foreign Words in Hebrew Script
  • We do it quite accurately . . .
  • . . . without annotating Hebrew data
  • Technique is language independent

The Interesting

  • Lots of noisy data is WAY better than little clean data

The Bad

  • Completely context free: some decisions are impossible
  • We don’t treat משׁהוכלב prefixes
  • We didn’t actually try any other language pair

slide-69
SLIDE 69


Thanks
