SLIDE 1

Text Mining in Hebrew

Impact of Morphology Analysis on Topic Analysis and on Search Quality Michael Elhadad, Meni Adler, Yoav Goldberg, Dudi Gabay and Yael Netzer

SLIDE 2

Text in Hebrew

 Extract information from text in Hebrew
 Major immediate obstacle: rich morphology

  Very high number of distinct word forms
  Very high ambiguity

Text Mining and Morphology

SLIDE 3

Morphological Analysis

םלצב

םֶלֶצֱּב (name of an association)
םֶּלַצֱּב (while taking a picture)
םָלָצֱּב (their onion)
םָלִצֱּב (under their shades)
םָלַצֱּב (in a photographer)
םָלַצַב (in the photographer)
םֶלֶצֱּב (in an idol)
םֶלֶצַב (in the idol)

SLIDE 4

How Critical is Morphological Analysis to Text Mining?

 How much does Hebrew morphology affect high-level text analysis tasks?

 Named Entity Recognition
 Information Extraction
 Topic Analysis
 Information Retrieval

SLIDE 5

Topic Analysis and Search

 Topic Analysis

 Unsupervised discovery of topics in a text collection
 Useful to browse a large corpus by theme
 Difficult to evaluate

 Faceted Search

 Useful combination of search and browsing
 Exploratory search (as opposed to fact finding)
 Enabled by topic analysis

SLIDE 6

The Basic Idea


One word, שיא: about 50 distinct forms in the corpus

SLIDE 7

Outline

 Objectives

  Topic Analysis in Hebrew
  Improved Search

 Topic Analysis with LDA
 Obtaining Precise Morphology in Hebrew
 Combining LDA and Morphological Analysis
 Using Topic Models for Search
 Evaluating Topic Models
 Next Steps

SLIDE 8

Objectives

 Input:

 Domain specific text corpus in Hebrew

 Output:

 Topic model:

 Discover “topics” discussed in the corpus
 Recognize topics in unseen text

 Index text collection by topic

 Task:

 Search and browse text collection using topics

SLIDE 9

Example: Rambam’s Mishne Torah

 Corpus of Mishne Torah

 Exhaustive code of Halakha
 Written by Maimonides, 1170-1180
 14 books, 85 sections, 1000 chapters, 15K articles, 350K words
 Creative compilation of laws from multiple sources: Torah, Talmud (Bavli and Yerushalmi), Tosefta, halakhic midrashim (Sifra and Sifre), Geonim

 Synthetic hierarchical organization

SLIDE 10

Problems with Existing Search

 Morphology


A single “ו” and the word is not found…

SLIDE 11

Problems: Explore complex topics

 “רוש“ refers to many complex halakhic topics:

 Damages (חגונ רוש)
 Kosher meat (הטיחש)
 Sacrifices (תונברק)
 Shabat (תבש)
 Calendar (רוש לזמ)

 Queries must be disambiguated

רוש+תבש?

SLIDE 12

Exploratory/Faceted Search

 How to deal with ambiguous query terms?

 Propose refinements according to contexts
 “Do you mean: damages, meat, shabat…”
 Propose facets for query refinement

 Where do the topics (facets) come from?
 How do we disambiguate the query terms?
 Given a disambiguated topic, how do we refine the query?

SLIDE 13

Outline

 Objectives  Topic Analysis with LDA  Obtaining Precise Morphology in Hebrew  Combining LDA and Morphological Analysis  Using Topic Models for Search  Evaluating Topic Models  Next Steps

SLIDE 14

Discovering Topic Models: LDA

 Latent Dirichlet Allocation

 Blei, Ng, and Jordan 2003

 Discovers (unsupervised) topic structures in a document collection

 Topics are modeled as distributions over words
 Probabilistic generative model of text

LDA

SLIDE 15

Topics for רוש

SLIDE 16

Topics for a Document

SLIDE 17

The LDA Model

 Observation: documents are composed of words
 Intuition: documents exhibit multiple topics
 Generative probabilistic model:

  Each document is a mixture of topics
  Each word is drawn from the topics active in the document
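The generative story above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation; the two toy topics ("damages", "shabbat") and their word distributions are invented stand-ins:

```python
import random

def generate_document(n_words, topics, alpha, rng):
    """Sample one document from the LDA generative story: draw a topic
    mixture for the document, then draw each word from a sampled topic."""
    topic_ids = list(topics)
    # Document-level topic mixture theta ~ Dirichlet(alpha, ..., alpha),
    # sampled here via normalized Gamma draws.
    weights = [rng.gammavariate(alpha, 1.0) for _ in topic_ids]
    theta = [w / sum(weights) for w in weights]
    doc = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]   # pick an active topic
        word_dist = topics[z]                          # topic = distribution over words
        doc.append(rng.choices(list(word_dist),
                               weights=list(word_dist.values()))[0])
    return doc

# Invented toy topics (English stand-ins), each a distribution over words.
topics = {"damages": {"ox": 0.5, "pit": 0.3, "payment": 0.2},
          "shabbat": {"candle": 0.4, "rest": 0.4, "work": 0.2}}
doc = generate_document(10, topics, alpha=0.5, rng=random.Random(0))
```

A small alpha concentrates each document on few topics, matching the intuition that a document exhibits a handful of themes.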

SLIDE 18

Structure of the LDA Model

[Figure from (Blei 2008)]

SLIDE 19

Learning an LDA Model from Observations

 Observation: documents and words
 Objective: infer an underlying topic structure

  What are the topics?
  How are the documents divided according to those topics?

SLIDE 20

Graphical Models

[Figure from (Blei 2008)]

SLIDE 21

LDA Graphical Model

[Figure from (Blei 2008)]

SLIDE 22

LDA Generative Process

[Figure from (Blei 2008)]

SLIDE 23

LDA Estimation

[Figure from (Blei 2008)]

SLIDE 24

LDA Approximation

[Figure from (Blei 2008)]

SLIDE 25

Outline

 Objectives  Topic Analysis with LDA  Obtaining Precise Morphology in Hebrew  Combining LDA and Morphological Analysis  Using Topic Models for Search  Evaluating Topic Models  Next Steps

SLIDE 26

Morphological Analysis

Form      Segmentation      Analysis

םֶלֶצֱּב      םלצב      proper-noun
םֶּלַצֱּב      םלצב      verb, infinitive
םָלָצֱּב      לצב-ם      noun, singular, masculine
םָלִצֱּב      ב-לצ-ם      noun, singular, masculine
םָלַצֱּב      ב-םלצ      noun, singular, masculine, absolute
םֶלֶצֱּב      ב-םלצ      noun, singular, masculine, construct
םָלַצַב / םֶלֶצַב      ב-םלצ      noun, definite, singular, masculine

SLIDE 27

Morphological Analyzer

 Implementation

 Corpus based
 Lexicon based

 Analytic
 Synthetic

Analyzer: w1, …, wn ↦ w1:{t11,…,ti1}, …, wn:{tn1,…,tin}
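As a sketch, a lexicon-based analyzer is essentially a lookup from each surface token to the set of analyses the lexicon licenses for it. The transliterated entries below are invented stand-ins for the real lexicon:

```python
# Hypothetical lexicon: each surface token maps to its full set of analyses
# (segmentation, morphological features). Entries are illustrative only.
LEXICON = {
    "bclm": {
        ("bclm", "proper-noun"),
        ("bclm", "verb,infinitive"),
        ("b+clm", "noun,sing,masc,absolute"),
        ("b+clm", "noun,sing,masc,construct"),
        ("b+cl+m", "noun,sing,masc+possessive"),
    },
}

def analyze(token):
    """Return every possible analysis of a token (empty set if unknown)."""
    return LEXICON.get(token, set())

def analyze_sentence(tokens):
    # Each token keeps its full ambiguity set; the disambiguator described
    # next must pick exactly one analysis per token in context.
    return [(w, analyze(w)) for w in tokens]
```

Unknown tokens (empty lookup) are the reason the full system needs a separate unknown-tokens analyzer.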

Morphology

SLIDE 28

Morphological Disambiguation

עידוה םלצב ןוגרא ןמאמה רזועב יתנחבה קחשמה תא םלצב םיקוושב ףטחנ םלצב וניסח םיענה םלצב יעוצקמ םלצב יתלקתנ תונותח םלצב יתשגפ יעוצקמה םלצב יתלקתנ

SLIDE 29

Morphological Disambiguation

Disambiguator: w1:{t11,…,ti1} … wN:{tN1,…,tiN} ↦ w1:tj1 … wn:tjn

SLIDE 30

Hebrew Text Analysis System

http://www.cs.bgu.ac.il/~nlpproj/demo

[Diagram: Tokenizer → Morphological Analyzer (lexicon) + Unknown Tokens Analyzer → Morphological Disambiguator → Proper-name Classifier → Named-Entity Recognizer → Noun-Phrase Chunker; models used: SVM, HMM, ME]

SLIDE 31

Morphological Disambiguation - Methods

 Rule-based vs. stochastic models
 Supervised vs. unsupervised learning
 Exact vs. approximate inference

SLIDE 32

Hidden Markov Model

 S – a set of states (= tags)
 O – a set of output symbols (= tokens)
 μ – a probabilistic model:

  State transition probabilities A = {ai,j}
  Symbol emission probabilities B = {bi,k}

Computational Model

SLIDE 33

HMM- An Example

 S = {start, noun, verb}
 O = {דלי, חרי}
 μ = (A, B)

A (transitions):          B (emissions):
         noun  verb                דלי   חרי
  start  0.8   0.2         noun    0.3   0.7
  noun   0.9   0.1         verb    0.9   0.1
  verb   0.9   0.1

SLIDE 34

Markov Process

[Diagram: Markov process start → noun → noun, emitting דלי then חרי]

SLIDE 35

Decoding

דלי חרי

start → ? → ?

a(start,noun)=0.8   a(start,verb)=0.2   a(noun,noun)=0.9   a(noun,verb)=0.1   a(verb,noun)=0.9   a(verb,verb)=0.1
b(noun,דלי)=0.3   b(verb,דלי)=0.9   b(noun,חרי)=0.7   b(verb,חרי)=0.1
SLIDE 36

Decoding

P(noun, noun) = a(start,noun)·b(noun,דלי)·a(noun,noun)·b(noun,חרי) = 0.8·0.3·0.9·0.7 = 0.1512
P(noun, verb) = a(start,noun)·b(noun,דלי)·a(noun,verb)·b(verb,חרי) = 0.8·0.3·0.1·0.1 = 0.0024
P(verb, noun) = a(start,verb)·b(verb,דלי)·a(verb,noun)·b(noun,חרי) = 0.2·0.9·0.9·0.7 = 0.1134
P(verb, verb) = a(start,verb)·b(verb,דלי)·a(verb,verb)·b(verb,חרי) = 0.2·0.9·0.1·0.1 = 0.0018

דלי חרי

start → ? → ?

SLIDE 38

Decoding

P(noun, noun) = a(start,noun)·b(noun,דלי)·a(noun,noun)·b(noun,חרי) = 0.8·0.3·0.9·0.7 = 0.1512
P(noun, verb) = a(start,noun)·b(noun,דלי)·a(noun,verb)·b(verb,חרי) = 0.8·0.3·0.1·0.1 = 0.0024
P(verb, noun) = a(start,verb)·b(verb,דלי)·a(verb,noun)·b(noun,חרי) = 0.2·0.9·0.9·0.7 = 0.1134
P(verb, verb) = a(start,verb)·b(verb,דלי)·a(verb,verb)·b(verb,חרי) = 0.2·0.9·0.1·0.1 = 0.0018

Viterbi Algorithm (dynamic programming)
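The Viterbi computation can be sketched directly on the toy model. A minimal sketch only; tokens are transliterated ("yeled" for דלי, "yareax" for חרי):

```python
# Toy model from the slides.
STATES = ["noun", "verb"]
a = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
     ("noun", "noun"): 0.9, ("noun", "verb"): 0.1,
     ("verb", "noun"): 0.9, ("verb", "verb"): 0.1}
b = {("noun", "yeled"): 0.3, ("verb", "yeled"): 0.9,
     ("noun", "yareax"): 0.7, ("verb", "yareax"): 0.1}

def viterbi(tokens):
    """Most probable tag sequence by dynamic programming."""
    # delta[s]: best probability of any path ending in state s;
    # psi holds the back-pointers for path recovery.
    delta = {s: a[("start", s)] * b[(s, tokens[0])] for s in STATES}
    psi = []
    for w in tokens[1:]:
        new_delta, back = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: delta[p] * a[(p, s)])
            new_delta[s] = delta[prev] * a[(prev, s)] * b[(s, w)]
            back[s] = prev
        psi.append(back)
        delta = new_delta
    # Best final state, then walk the back-pointers in reverse.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]

path, prob = viterbi(["yeled", "yareax"])  # → (["noun", "noun"], ≈0.1512)
```

This reproduces the enumeration above without ever listing all tag sequences: cost is linear in sentence length instead of exponential.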

דלי חרי

start → ? → ?

SLIDE 39

Parameter Estimation

[Diagram: the example chain start → ? → ? over דלי חרי, with all parameters unknown]
SLIDE 40

Supervised Parameter Estimation

[Diagram: the same chain with the tags annotated (noun, noun) but the parameters unknown]
SLIDE 41

Supervised Parameter Estimation

ai,j = (number of transitions from state i to state j) / (number of transitions from state i)

bi,k = (number of lexical transitions from state i to symbol k) / (number of transitions from state i)
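The counting formulas can be sketched as follows; the transliterated toy corpus is invented for illustration, and emissions are normalized by the number of emissions from tag i (which, for a fully tagged corpus, equals the slide's transition count):

```python
from collections import Counter

def estimate(tagged_sentences):
    """MLE per the slide: a[i,j] = count(i to j) / count(from i);
    b[i,k] = count(i emits k) / count(emissions from i)."""
    trans, emit, out_t, out_e = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "start"
        for word, tag in sent:
            trans[(prev, tag)] += 1   # state-to-state transition
            out_t[prev] += 1          # outgoing mass of the previous state
            emit[(tag, word)] += 1    # lexical (emission) transition
            out_e[tag] += 1
            prev = tag
    a = {ij: c / out_t[ij[0]] for ij, c in trans.items()}
    b = {ik: c / out_e[ik[0]] for ik, c in emit.items()}
    return a, b

# Transliterated toy corpus, invented for illustration.
corpus = [[("yeled", "noun"), ("yareax", "noun")],
          [("yeled", "verb")]]
a, b = estimate(corpus)
```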

SLIDE 42

Unsupervised Parameter Estimation

[Diagram: the example chain with every parameter set to its initial conditions]
SLIDE 43

Unsupervised Parameter Estimation

ai,j = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

bi,k = (expected number of lexical transitions from state i to symbol k) / (expected number of transitions from state i)

SLIDE 44

Parameter Estimation

 Baum-Welch algorithm

  Start with a model built from the initial conditions
  Repeat until convergence:

   Expectation: calculate the expected number of transitions according to the corpus and the current model
   Maximization: re-estimate the model parameters from the expected transition counts
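One Baum-Welch iteration can be sketched with the forward-backward recursions on the toy model. This is a minimal sketch under invented toy data, not the production learner:

```python
STATES = ["noun", "verb"]
a0 = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
      ("noun", "noun"): 0.9, ("noun", "verb"): 0.1,
      ("verb", "noun"): 0.9, ("verb", "verb"): 0.1}
b0 = {("noun", "yeled"): 0.3, ("verb", "yeled"): 0.9,
      ("noun", "yareax"): 0.7, ("verb", "yareax"): 0.1}

def forward(tokens, a, b):
    alpha = [{s: a[("start", s)] * b[(s, tokens[0])] for s in STATES}]
    for w in tokens[1:]:
        alpha.append({s: sum(alpha[-1][p] * a[(p, s)] for p in STATES) * b[(s, w)]
                      for s in STATES})
    return alpha

def backward(tokens, a, b):
    beta = [{s: 1.0 for s in STATES}]
    for w in reversed(tokens[1:]):
        beta.insert(0, {s: sum(a[(s, n)] * b[(n, w)] * beta[0][n] for n in STATES)
                        for s in STATES})
    return beta

def baum_welch_step(corpus, a, b):
    """E-step: expected transition/emission counts; M-step: renormalize."""
    t_count = {k: 0.0 for k in a}
    e_count = {k: 0.0 for k in b}
    for tokens in corpus:
        al, be = forward(tokens, a, b), backward(tokens, a, b)
        z = sum(al[-1][s] for s in STATES)            # sentence likelihood
        for i, w in enumerate(tokens):
            for s in STATES:
                gamma = al[i][s] * be[i][s] / z       # P(state s at position i)
                e_count[(s, w)] += gamma
                if i == 0:
                    t_count[("start", s)] += gamma
            if i + 1 < len(tokens):
                for s in STATES:
                    for n in STATES:                  # xi: P(s at i, n at i+1)
                        t_count[(s, n)] += (al[i][s] * a[(s, n)]
                                            * b[(n, tokens[i + 1])] * be[i + 1][n]) / z
    # M-step: expected counts divided by expected outgoing mass per state.
    a_new = {(i, j): c / sum(v for (p, _), v in t_count.items() if p == i)
             for (i, j), c in t_count.items()}
    b_new = {(i, k): c / sum(v for (p, _), v in e_count.items() if p == i)
             for (i, k), c in e_count.items()}
    return a_new, b_new

a1, b1 = baum_welch_step([["yeled", "yareax"]], a0, b0)
```

Iterating this step is guaranteed not to decrease corpus likelihood, which is why EM works even without annotated data.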

SLIDE 45

Token-based first order HMM

וניסח םיענה םלצב

t3 t2 t1

t1 = prep + noun.sing.masc + possessive
t2 = def + adj.sing.masc
t3 = verb.plural.1.past

SLIDE 46

Token-based first order HMM

וניסח םיענה םלצב

t3 t2 t1

 Tags: English 48, Hebrew 3,561
 State transitions: English 1.8K, Hebrew 855K
 Lexical transitions: English 57K, Hebrew 3.2M

SLIDE 47

Token-based partial second order HMM

וניסח םיענה םלצב

t3 t2 t1

 Tags: English 48, Hebrew 3,561
 State transitions: English 38K, Hebrew 41M
 Lexical transitions: English 57K, Hebrew 3.2M

SLIDE 48

Token-based second order HMM

וניסח םיענה םלצב

t3 t2 t1

 Tags: English 48, Hebrew 3,561
 State transitions: English 38K, Hebrew 41M
 Lexical transitions: English 300K, Hebrew 40M

SLIDE 49

Word-based Model

 Computational Considerations

 Sparse data
 Complexity (number of parameters)

 Linguistic Motivation

 Adequate representation
 Dynamic nature of the language

SLIDE 50

Hebrew Word Definition

 Preposition prefix מ ל כ ב

תיב ,תיבב

 Conjunctions שכב שכל שכ ש ו

תיב ,תיבש תיבו

 Definite article ה

תיב ,תיבה םידמחנ ךכ לכ אלה

 Pronoun suffix

תיב ,ותיב

SLIDE 51

Hebrew Word Definition

 Prepositions

ינפל ,ידי לע תובקעב ,םעטמ ,רדגב ,תרגסמב

 Adverbs

תוריהמב ,הרזחב ,ףתושמב

 Inter token words

דחא הפ ,ןיד ךרוע יפ לע ףא ,ידי לע ,מ דבל ,ל טרפ ,ל ףסונב

SLIDE 52

Word-based first order HMM

וניסחםיענב

לצםה

t1 t2 t3 t4 t5 t6

t1 = prep
t2 = noun.sing.masc
t3 = possessive
t4 = def
t5 = adj.sing.masc
t6 = verb.plural.1.past

SLIDE 53

Word-based first order HMM

וניסחםיענב

 Tags: English 48, Hebrew 362 (token-based: 3,561)
 State transitions: English 1.8K, Hebrew 54K (855K)
 Lexical transitions: English 57K, Hebrew 2.3M (3.2M)

לצםה

t1 t2 t3 t4 t5 t6

SLIDE 54

Word-based partial second order HMM

וניסחםיענב

 Tags: English 48, Hebrew 362 (token-based: 3,561)
 State transitions: English 38K, Hebrew 2.5M (41M)
 Lexical transitions: English 57K, Hebrew 2.3M (3.2M)

לצםה

t1 t2 t3 t4 t5 t6

SLIDE 55

Word-based second order HMM

וניסחםיענב

 Tags: English 48, Hebrew 362 (token-based: 3,561)
 State transitions: English 38K, Hebrew 2.5M (41M)
 Lexical transitions: English 300K, Hebrew 16M (40M)

לצםה

t1 t2 t3 t4 t5 t6

SLIDE 56

Text Representation

ב לצב םלצב

לצ םלצ ם םיענה ם םיענה וניסח םיענה וניסח EOS וניסח EOS EOS

וניסח םיענה ם לצ ב

וניסח םיענה םלצב

SLIDE 57

Initial Conditions

SLIDE 58

Initial Condition Types

 Morpho-lexical

 p(t|w) תא: noun, preposition, pronoun

 Syntagmatic

 p(ti|ti-2,ti-1), e.g. וניסח םיענה םלצב: probability of three consecutive verbs

SLIDE 59

Morpho-lexical approximations

 Morphology-based [Levinger et al. 95]

הריפחה תא תא יל ריבעהל הלוכי תא?

 pronoun  preposition  noun

Initial Conditions

SLIDE 60

Morphology-based Approximations

 Similar word sets

 noun: תאה ,יתא ,םיתא
 pronoun: התא ,םתא ,ןתא
 preposition: /

 The approximation of p(t|w) is based on the frequencies of the similar words of w in the corpus

SLIDE 61

Linear context-based approximations

 Motivation

ךלש 

preposition: ןכלש םכלש ךלש

noun: ןכלש םכלש ךלש

 Method

תוניפ שולש ךלש עבוכל 

p(preposition | עבוכל ,שולש)

p(noun | עבוכל ,שולש)

 Observed data

p(w|c), p(c|w) from the raw corpus (e.g. שולש ךלש עבוכל)
p(t|w) from the similar-words algorithm

 Expectation/Maximization over p(t|w) and p(t|c)

SLIDE 62

Linear-context Model

Notation: w – word, c – context of a word, t – tag

Inputs: raw-text corpus → p(w|c), p(c|w); lexicon → p(t|w)

Expectation: p(t|c) = Σw p(t|w) p(w|c)
Maximization: p(t|w) = Σc p(t|c) p(c|w)
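The two updates can be sketched as one EM iteration. The toy distributions below (two words, two contexts, two tags) are invented for illustration:

```python
def em_step(p_t_w, p_w_c, p_c_w):
    """One iteration of the linear-context updates:
    E-step: p(t|c) = sum_w p(t|w) p(w|c)
    M-step: p(t|w) = sum_c p(t|c) p(c|w)"""
    tags = {t for dist in p_t_w.values() for t in dist}
    # E-step: project the tag distributions of words onto their contexts.
    p_t_c = {c: {t: sum(p_t_w[w].get(t, 0.0) * pw for w, pw in ws.items())
                 for t in tags}
             for c, ws in p_w_c.items()}
    # M-step: project the tag distributions of contexts back onto words.
    p_t_w_new = {w: {t: sum(p_t_c[c][t] * pc for c, pc in cs.items())
                     for t in tags}
                 for w, cs in p_c_w.items()}
    return p_t_c, p_t_w_new

# Invented toy distributions (each row is a proper distribution).
p_t_w = {"w1": {"noun": 0.7, "prep": 0.3}, "w2": {"noun": 1.0}}
p_w_c = {"c1": {"w1": 0.5, "w2": 0.5}, "c2": {"w1": 1.0}}
p_c_w = {"w1": {"c1": 0.4, "c2": 0.6}, "w2": {"c1": 1.0}}
p_t_c, p_t_w_new = em_step(p_t_w, p_w_c, p_c_w)
```

Because every input row is a proper distribution, both updates preserve normalization, so the iteration can be repeated directly.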

SLIDE 63

Syntagmatic Approximations

 Syntagmatic Constraints

 A construct-state form cannot be followed by a verb, preposition, punctuation, existential, modal, or copula: המדקתה תורש ,עב ימענ תא סרפ"מ
 A verb cannot be followed by the preposition לש: םולח לש דלי
 A copula or existential cannot be followed by a verb: ןגב דלי שי ,דומח דלי אוה ינד
 A verb cannot be followed by another verb (with some exceptions): םולח םלח דלי

SLIDE 64

Syntagmatic Approximations

 Initial Transitions

 A small seed of randomly selected sentences (10K annotated tokens)
 Tag trigram and bigram counts (ignoring the tag-word annotations) are used to initialize the p(t|t-2,t-1) distribution

SLIDE 65

POS and full-analysis accuracy by initial condition:

Syntagmatic   Morpho-lexical   POS    Full
Unif          Unif             91.9   87.1
Unif          Morph            92.1   88.0
Unif          Linear           92.4   87.5
Unif          Morph+Linear     92.8   88.0
Pair Const    Unif             92.0   87.8
Pair Const    Morph            91.8   88.1
Pair Const    Linear           92.4   87.5
Pair Const    Morph+Linear     92.8   88.0
Init Trans    Unif             92.6   89.4
Init Trans    Morph            93.0   90.0
Init Trans    Linear           93.0   89.7
Init Trans    Morph+Linear     93.1   90.0

Baseline: Token-based, EM learning

SLIDE 66

Hebrew Tagging - Analysis

 EM learning

  Unsupervised HMM learning on the word model
  EM is very effective for Hebrew: error reduction of 65% over uniform initial conditions

 Morphology-based initial conditions:

  Error reduction of 7.7% over the uniform distribution

 Syntagmatic initial conditions:

  Pair constraints alone have only minor impact
  Initial transition frequencies: error reduction of 16.5% for full analysis; 12.5% for POS tagging

SLIDE 67

Outline

 Objectives  Topic Analysis with LDA  Obtaining Precise Morphology in Hebrew  Combining LDA and Morphological Analysis  Using Topic Models for Search  Evaluating Topic Models  Next Steps

SLIDE 68

Combining LDA and Morphology

 LDA picks up patterns of word co-occurrence in documents
 Heavy morphological variation in Hebrew means we may “miss” co-occurrences if we do not first analyze morphology
 What is the best method to combine LDA and morphological analysis?

SLIDE 69

Combining LDA and Morphology

3 options:

 Ignore morphology – token-based LDA
 Pipeline – resolve morphological ambiguities, then learn LDA
 Joint – learn LDA over distributions of possible morphological analyses
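A minimal sketch of the contrast between the first two options: the pipeline collapses surface variants to lemmas before building the bag of words that LDA sees. The transliterated tokens and the LEMMA map below are invented stand-ins for the disambiguator's output:

```python
from collections import Counter

# Hypothetical lemma map standing in for the morphological disambiguator
# (the "pipeline" option): several surface variants map to one lemma.
LEMMA = {"ish": "ish", "haish": "ish", "veish": "ish",
         "shor": "shor", "hashor": "shor"}

def bag_of_words(tokens, use_morphology):
    """The counts LDA sees: raw tokens (option 1) vs. lemmas (option 2)."""
    units = [LEMMA.get(t, t) for t in tokens] if use_morphology else tokens
    return Counter(units)

doc = ["ish", "haish", "veish", "shor", "hashor"]
raw = bag_of_words(doc, use_morphology=False)        # 5 distinct surface forms
lemmatized = bag_of_words(doc, use_morphology=True)  # collapses to 2 lemmas
```

With lemmas, co-occurrence counts that were spread over many surface forms concentrate on a few units, which is exactly the signal LDA needs.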

SLIDE 70

Joint LDA-Morphology Learning

[Figure: standard token-based LDA]

SLIDE 71

Joint LDA-Morphology Learning

[Figure: joint morphology-LDA, constrained by the tagger decision]

SLIDE 72

Outline

 Objectives  Topic Analysis with LDA  Obtaining Precise Morphology in Hebrew  Combining LDA and Morphological Analysis  Using Topic Models for Search  Evaluating Topic Models  Next Steps

SLIDE 73

Searching with Topics

Combine search and browse:

 Word → Topics: search
 Topic → Documents: browse
 Word → Topics: disambiguate
 Cluster unseen documents based on their topics
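The Word → Topics step can be sketched by ranking candidate topics for a query word with p(z|w) ∝ p(w|z)·p(z). The phi and p_topic values below are invented for illustration, not learned output:

```python
def topics_for_word(word, phi, p_topic):
    """Rank candidate topics (facets) for a query word: p(z|w) ∝ p(w|z) p(z),
    where phi[z] is topic z's word distribution from the topic model."""
    scores = {z: dist.get(word, 0.0) * p_topic[z] for z, dist in phi.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {}  # word unseen by the model
    return dict(sorted(((z, s / total) for z, s in scores.items()),
                       key=lambda kv: -kv[1]))

# Invented toy model: "shor" is mostly a damages word, sometimes calendar.
phi = {"damages": {"shor": 0.20, "pit": 0.10},
       "shabbat": {"shor": 0.01, "candle": 0.30},
       "calendar": {"shor": 0.05, "month": 0.25}}
p_topic = {"damages": 0.4, "shabbat": 0.4, "calendar": 0.2}
facets = topics_for_word("shor", phi, p_topic)
```

The ranked topics are exactly the refinements a faceted interface can offer ("Do you mean: damages, …").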

SLIDE 74

Mishne Torah Topics

 100 topics (the K parameter)
 Each word covers many variations

SLIDE 75

Mishne Torah Topics

 Each word covers many variations

SLIDE 76

Query Disambiguation

 Various topics for the single word רוש

SLIDE 77

Outline

 Objectives  Topic Analysis with LDA  Obtaining Precise Morphology in Hebrew  Combining LDA and Morphological Analysis  Using Topic Models for Search  Evaluating Topic Models  Next Steps

SLIDE 78

How Good are Discovered Topics?

 Difficult to evaluate LDA topics

 Many parameters (many words, many topics)
 Each run gives slightly different results
 How to compare topic models?

 Task-based evaluation

 Use topics for Summarization

 Ontology alignment evaluation

 Compare topics with existing ontology

 Data-oriented evaluation

SLIDE 79

Ontology Alignment

 Mishne Torah has existing structure:

 Hierarchy of Book/Section/Chapter
 Excellent indexes exist
 Compare the indexes with the existing ontology

 We find excellent alignment between topics and Book/Section
 Some topics are “cross-concerns” (witnesses – topic 5)

Evaluation

SLIDE 80

Topic  Documents

 Fits the Rambam’s classification

SLIDE 81

Other Domain: More Noise

 Applied LDA to the InfoMed corpus (www.infomed.co.il), a corpus of “popular medicine” articles
 Need a different evaluation method

SLIDE 82

Data-oriented Evaluation

 Method derived from our previous work on ontology evaluation
 How well does an automatically constructed partition of entities into classes represent reality?
 How can a classification be improved via merge or split operations?
 Example: movie genres, taken from IMDB: drama, comedy, war, sport, action…

SLIDE 83

Text Classifier to Evaluate Classes

Idea: use a set of texts from a given domain as a proxy of the “reality” represented by the ontology.

Hypothesis:

If the ontology indicates that some movies are “clustered” along one of its dimensions, then the documents associated with these movies should also be found to be associated by a text-classification engine trained on the classification induced by the ontology.

Procedure:

Train a classifier for each class we want to evaluate. If the classifier can decide with good accuracy whether a given text belongs to the class, the class is well-defined. In the movies domain, reviews are used as the representative texts.
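The procedure can be sketched with any text classifier; below, a simple nearest-centroid classifier over bags of words stands in for the real engine, and the tiny movie-review data is invented for illustration:

```python
from collections import Counter

def centroid(docs):
    """Average bag-of-words of a class's training documents."""
    total = Counter()
    for d in docs:
        total.update(d)
    return {w: c / len(docs) for w, c in total.items()}

def similarity(doc, cent):
    # Dot product between the document's counts and the class centroid.
    return sum(cent.get(w, 0.0) * c for w, c in Counter(doc).items())

def accuracy(train, test):
    """Held-out accuracy as a proxy for class quality: a class is
    'well-defined' if its documents are classified into it reliably."""
    cents = {label: centroid(docs) for label, docs in train.items()}
    hits = total = 0
    for label, docs in test.items():
        for d in docs:
            pred = max(cents, key=lambda l: similarity(d, cents[l]))
            hits += pred == label
            total += 1
    return hits / total

train = {"crime": [["mafia", "gun"], ["murder", "gun"]],
         "romance": [["love", "wedding"], ["love", "kiss"]]}
test = {"crime": [["gun", "mafia"]], "romance": [["wedding", "love"]]}
acc = accuracy(train, test)
```

A low per-class accuracy signals a class that should be split or merged; a high one signals a coherent class.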

SLIDE 84

Action Comedy Crime Drama Family Foreign Horror Romance Sci-Fi Sports Suspense War

SLIDE 85

Noisy Classes

 What can be done if the classes are very noisy?
 Example: keywords in IMDB

  Keyword lists are open to additions/deletions by users (yet moderated)
  Examples of keywords: Mafia Business New_York Wedding Respect Home Organized_Crime Lawyer Violence … (and the film?)
  Too few movies are associated with any single keyword
  Movies associated with a given keyword are not necessarily related

 Could be the same for LDA topics in noisy domains

SLIDE 86

Cluster Keywords using LDA

 Apply LDA to the keywords
 Divide movies into classes according to the LDA distributions of the movie reviews
 Construct classifiers to evaluate the quality of the LDA classes
 Example of the top keywords of a “good” class: england inheritance london-england based-on-novel mansion london maid class-differences servant 19th-century period-piece butler orphan estate aunt uncle heir victorian-era love marriage

SLIDE 87

[Figure: example movies (Psycho, The Sopranos, Secrets & Lies, 12 Angry Men, The Godfather, Magnolia, The Graduate, …) linked to raw IMDB keywords (Mafia, Business, Neurotic, Racism, Respect, Wedding, Graduate, New_York, Gorilla, Psychiatrist, Home, Violence, Scuba, Organized Crime, Lawyer, Funeral, England, …) and to LDA keyword clusters, e.g.:

 money office business fraud advertising greed secretary hotel businessman computer boss unemployment bank internet salesman blackmail debt employer-employee-relationship factory inheritance
 police murder gangster robbery mafia revenge hitman chase bank-robbery organized-crime violence criminal crime gun heist neo-noir undercover police-officer betrayal corruption
 african-american racism interracial-relationship independent-film racial-slur race-relations prejudice blaxploitation violence urban interracial-romance ghetto based-on-true-story immigrant south-africa los-angeles-california southern-u.s. latino asian-american hip-hop]

SLIDE 88

Conclusions

 Morphological analysis is a critical pre-processing step for most text mining applications in Hebrew
 We obtain accuracy in the 90-95% range on full segmentation/POS/morphological analysis, robust on unknown words
 This enables effective LDA topic analysis in Hebrew
 More work is needed on topic analysis