Text Mining in Hebrew: Impact of Morphology Analysis on Topic Analysis and on Search Quality - PowerPoint PPT Presentation
Michael Elhadad, Meni Adler, Yoav Goldberg, Dudi Gabay and Yael Netzer
Text in Hebrew
Extract information from text in Hebrew. Major immediate obstacle:
Rich morphology: a very high number of distinct word forms, and very high ambiguity.
Text Mining and Morphology
Morphological Analysis
םלצב
םֶלֶצֱּב (name of an association), םֶּלַצֱּב (while taking a picture), םָלָצֱּב (their onion), םָלִצֱּב (under their shades), םָלַצֱּב (in a photographer), םָלַצַב (in the photographer), םֶלֶצֱּב (in an idol), םֶלֶצַב (in the idol)
How Critical is Morphological Analysis to Text Mining?
How much does Hebrew morphology affect high-level text analysis tasks?
Named Entity Recognition Information Extraction Topic Analysis Information Retrieval
Topic Analysis and Search
Topic Analysis
Unsupervised discovery of topics in a text collection. Useful for browsing a large corpus by theme. Difficult to evaluate.
Faceted Search
Useful combination of search and browsing. Exploratory search (as opposed to fact finding). Enabled by topic analysis.
The Basic Idea
One word, שיא, has about 50 distinct forms in the corpus.
Outline
Objectives: Topic Analysis in Hebrew, Improved Search
Topic Analysis with LDA
Obtaining Precise Morphology in Hebrew
Combining LDA and Morphological Analysis
Using Topic Models for Search
Evaluating Topic Models
Next Steps
Objectives
Input:
Domain-specific text corpus in Hebrew
Output:
Topic model:
Discover “topics” discussed in the corpus; recognize topics in unseen text.
Index text collection by topic
Task:
Search and browse text collection using topics
Example: Rambam’s Mishne Torah
Corpus of Mishne Torah
Exhaustive code of Halakha, written by Maimonides, 1170-1180.
14 books, 85 sections, 1000 chapters, 15K articles, 350K words.
Creative compilation of laws from multiple sources: Torah, Talmud (Bavli and Yerushalmi), Tosefta, halakhic midrashim (Sifra and Sifre), Geonim.
Synthetic hierarchical organization
Problems with Existing Search
Morphology
A single "ו" and the word is not found…
Problems: Explore complex topics
"רוש" refers to many complex halakhic topics:
Damages (חגונ רוש), kosher meat (הטיחש), sacrifices (תונברק), Shabbat (תבש), calendar (רוש לזמ)
Queries must be disambiguated: רוש+תבש?
Exploratory/Faceted Search
How to deal with ambiguous query terms?
Propose refinements according to context: “Do you mean: damages, meat, Shabbat…” Propose facets for query refinement.
Where do the topics (facets) come from? How do we disambiguate the query terms? Given a disambiguated topic, how do we refine the query?
Outline
Objectives Topic Analysis with LDA Obtaining Precise Morphology in Hebrew Combining LDA and Morphological Analysis Using Topic Models for Search Evaluating Topic Models Next Steps
Discovering Topic Models: LDA
Latent Dirichlet Allocation
Blei, Ng, and Jordan 2003
Discovers (unsupervised) topic structures in a document collection.
Topics are modeled as distributions over words. Probabilistic generative model of text.
LDA
Topics for רוש
Topics for a Document
The LDA Model
Observation: documents are composed of words.
Intuition: documents exhibit multiple topics.
Generative probabilistic model: each document is a mixture of topics; each word is drawn from the topics active in the document.
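The generative story above can be sketched in a few lines. This is an illustrative toy, not the talk's implementation: the topic names and word distributions are invented for the example, and real LDA would infer them rather than fix them.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (toy values for illustration).
topics = {
    "damages":    {"ox": 0.5, "pit": 0.3, "payment": 0.2},
    "sacrifices": {"ox": 0.2, "altar": 0.5, "priest": 0.3},
}

def sample(dist):
    """Draw one key from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # numerical safety for floating-point rounding

def generate_document(topic_mixture, length):
    """LDA's generative story: for each word position, first pick a topic
    from the document's mixture, then pick a word from that topic."""
    words = []
    for _ in range(length):
        t = sample(topic_mixture)
        words.append(sample(topics[t]))
    return words

doc = generate_document({"damages": 0.7, "sacrifices": 0.3}, 10)
```

Note how the word "ox" can be emitted by both topics; inference has to untangle such shared words, which is exactly where Hebrew's form variation hurts.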
Structure of the LDA Model (figure from Blei 2008)
Learning an LDA Model from Observations
Observation: documents and words.
Objective: infer an underlying topic structure. What are the topics? How are the documents divided among those topics?
Graphical Models (figure from Blei 2008)
LDA Graphical Model (figure from Blei 2008)
LDA Generative Process (figure from Blei 2008)
LDA Estimation (figure from Blei 2008)
LDA Approximation (figure from Blei 2008)
Outline
Objectives Topic Analysis with LDA Obtaining Precise Morphology in Hebrew Combining LDA and Morphological Analysis Using Topic Models for Search Evaluating Topic Models Next Steps
Morphological Analysis
Vocalized form : segmentation (analysis):
םֶלֶצֱּב : םלצב (proper noun)
םֶּלַצֱּב : םלצב (verb, infinitive)
םָלָצֱּב : לצב-ם (noun, singular, masculine)
םָלִצֱּב : ב-לצ-ם (noun, singular, masculine)
םָלַצֱּב / םֶלֶצֱּב : ב-םלצ (noun, singular, masculine, absolute) / ב-םלצ (noun, singular, masculine, construct)
םָלַצַב / םֶלֶצַב : ב-םלצ (noun, definite, singular, masculine)
Morphological Analyzer
Implementation: corpus-based or lexicon-based; analytic or synthetic.
Analyzer: w1,…,wn → w1:{t11,…,t1i} … wn:{tn1,…,tni}
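The analyzer's input/output contract above can be sketched as a small function. This is a hypothetical sketch: the lexicon entry uses a transliterated key ("bclm" standing in for the ambiguous token from the slides) and invented tag names; a real analyzer derives candidates from affix rules plus a lexicon.

```python
# Hypothetical lexicon mapping a surface token to its candidate analyses.
LEXICON = {
    "bclm": [
        ("proper-noun", "bclm"),
        ("verb-infinitive", "bclm"),
        ("noun.sing.masc+possessive", "bcl+m"),
        ("prep+noun.sing.masc", "b+clm"),
    ],
}

def analyze(tokens):
    """Return, for each token w_i, its set of candidate analyses
    {t_i1, ..., t_ik}; the disambiguator later picks one per token."""
    return {w: LEXICON.get(w, [("unknown", w)]) for w in tokens}

analyses = analyze(["bclm"])
```

The disambiguator (next slides) consumes exactly this structure: one candidate set per token, one chosen tag per token out.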
Morphology
Morphological Disambiguation
עידוה םלצב ןוגרא ןמאמה רזועב יתנחבה קחשמה תא םלצב םיקוושב ףטחנ םלצב וניסח םיענה םלצב יעוצקמ םלצב יתלקתנ תונותח םלצב יתשגפ יעוצקמה םלצב יתלקתנ
Morphological Disambiguation
Disambiguator: w1:{t11,…,t1i} … wn:{tn1,…,tni} → w1:tj1 … wn:tjn
Hebrew Text Analysis System
http://www.cs.bgu.ac.il/~nlpproj/demo
Components: Tokenizer, Morphological Analyzer (lexicon-based), Unknown Tokens Analyzer, Morphological Disambiguator, Proper-name Classifier, Named-Entity Recognizer, Noun-Phrase Chunker; models used: HMM, SVM, ME.
Morphological Disambiguation - Methods
Rule-based vs. Stochastic models Supervised vs. Unsupervised learning Exact vs. Approximate inference
Hidden Markov Model
S – a set of states (= tags)
O – a set of output symbols (= tokens)
µ – a probabilistic model:
state transition probabilities A = {a_i,j}
symbol emission probabilities B = {b_i,k}
Computational Model
HMM- An Example
S = {start, noun, verb}
O = {דלי, חרי}
µ = (A, B)

A        noun   verb
start    0.8    0.2
noun     0.9    0.1
verb     0.9    0.1

B        דלי    חרי
noun     0.3    0.7
verb     0.9    0.1
Markov Process
דלי חרי
start → noun → noun
Decoding
דלי חרי
start → ? → ?
a_start,noun = 0.8   a_start,verb = 0.2
a_noun,noun = 0.9    a_noun,verb = 0.1
a_verb,noun = 0.9    a_verb,verb = 0.1
b_noun,דלי = 0.3     b_verb,דלי = 0.9
b_noun,חרי = 0.7     b_verb,חרי = 0.1
P(noun, noun) = a_start,noun · b_noun,דלי · a_noun,noun · b_noun,חרי = 0.8 · 0.3 · 0.9 · 0.7 = 0.1512
P(noun, verb) = a_start,noun · b_noun,דלי · a_noun,verb · b_verb,חרי = 0.8 · 0.3 · 0.1 · 0.1 = 0.0024
P(verb, noun) = a_start,verb · b_verb,דלי · a_verb,noun · b_noun,חרי = 0.2 · 0.9 · 0.9 · 0.7 = 0.1134
P(verb, verb) = a_start,verb · b_verb,דלי · a_verb,verb · b_verb,חרי = 0.2 · 0.9 · 0.1 · 0.1 = 0.0018
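The four path probabilities above can be checked by brute-force enumeration. In this sketch the Hebrew tokens are transliterated ("yld" for דלי, "yrh" for חרי) so the code stays ASCII-only; the model numbers are exactly those of the example.

```python
from itertools import product

# Toy model from the slides.
A = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
     ("noun", "noun"): 0.9, ("noun", "verb"): 0.1,
     ("verb", "noun"): 0.9, ("verb", "verb"): 0.1}
B = {("noun", "yld"): 0.3, ("verb", "yld"): 0.9,
     ("noun", "yrh"): 0.7, ("verb", "yrh"): 0.1}

def path_probability(tags, words):
    """P(tags, words) = product of a(prev, t) * b(t, w) along the path,
    starting from the 'start' state."""
    p, prev = 1.0, "start"
    for t, w in zip(tags, words):
        p *= A[(prev, t)] * B[(t, w)]
        prev = t
    return p

words = ["yld", "yrh"]
scores = {tags: path_probability(tags, words)
          for tags in product(["noun", "verb"], repeat=len(words))}
# scores[("noun", "noun")] = 0.1512, the best tag sequence
```

Enumeration is exponential in sentence length, which motivates the dynamic-programming decoder on the next slide.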
Viterbi Algorithm (dynamic programming)
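A minimal Viterbi decoder for the same toy model (tokens transliterated as before, "yld" = דלי, "yrh" = חרי); it recovers the best path (noun, noun) without enumerating all paths.

```python
# Toy model from the decoding example.
A = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
     ("noun", "noun"): 0.9, ("noun", "verb"): 0.1,
     ("verb", "noun"): 0.9, ("verb", "verb"): 0.1}
B = {("noun", "yld"): 0.3, ("verb", "yld"): 0.9,
     ("noun", "yrh"): 0.7, ("verb", "yrh"): 0.1}
STATES = ["noun", "verb"]

def viterbi(words):
    """Dynamic programming: keep, for each state, the single best tag
    path ending in that state, instead of all |S|^n paths."""
    best = {t: (A[("start", t)] * B[(t, words[0])], [t]) for t in STATES}
    for w in words[1:]:
        best = {t: max([(p * A[(prev, t)] * B[(t, w)], path + [t])
                        for prev, (p, path) in best.items()],
                       key=lambda x: x[0])
                for t in STATES}
    return max(best.values(), key=lambda x: x[0])

prob, path = viterbi(["yld", "yrh"])  # best path: noun noun, prob 0.1512
```

Runtime is O(n·|S|²) versus O(|S|ⁿ) for enumeration, which is what makes decoding feasible with the thousands of Hebrew tags reported later.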
Parameter Estimation
דלי חרי
start → ? → ?   (transition and emission parameters unknown)
Supervised Parameter Estimation
דלי חרי
start → noun → noun   (annotated tags)
Supervised Parameter Estimation
a_i,j = (number of transitions from state i to state j) / (number of transitions from state i)
b_i,k = (number of lexical transitions from state i to symbol k) / (number of transitions from state i)
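The relative-frequency estimates above amount to counting over a tagged corpus. A sketch with a tiny hypothetical corpus (transliterated tokens; real training would use an annotated Hebrew treebank):

```python
from collections import Counter

# Tiny hypothetical tagged corpus: sentences of (word, tag) pairs.
corpus = [[("yld", "noun"), ("yrh", "noun")],
          [("yld", "verb"), ("yrh", "noun")]]

trans, emit, out_of = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "start"
    for word, tag in sentence:
        trans[(prev, tag)] += 1   # state transition i -> j
        emit[(tag, word)] += 1    # lexical transition i -> k
        out_of[prev] += 1         # transitions leaving state i
        prev = tag

def a(i, j):
    """a_{i,j} = count(i -> j) / count(transitions from i)."""
    return trans[(i, j)] / out_of[i]

def b(i, k):
    """b_{i,k} = count(i emits k) / count(emissions from i)."""
    return emit[(i, k)] / sum(v for (s, _), v in emit.items() if s == i)
```

For example, "start" is followed by "noun" in one of the two sentences, so a(start, noun) = 0.5.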
Unsupervised Parameter Estimation
דלי חרי
start → ? → ?   (parameters set to initial conditions)
Unsupervised Parameter Estimation
a_i,j = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
b_i,k = (expected number of lexical transitions from state i to symbol k) / (expected number of transitions from state i)
Parameter Estimation
Baum-Welch algorithm
Start with a model built from the initial conditions, then repeat:
Expectation: calculate the expected number of transitions according to the corpus and the current model.
Maximization: re-estimate the model parameters from the expected transition counts.
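One EM iteration can be sketched with the E-step done by brute-force path enumeration; this is feasible only on toy models (real implementations use the forward-backward algorithm), and the initial-condition numbers here are invented for illustration.

```python
from itertools import product
from collections import defaultdict

STATES = ["noun", "verb"]
# Hypothetical initial-condition model (toy numbers).
A = {("start", "noun"): 0.5, ("start", "verb"): 0.5,
     ("noun", "noun"): 0.5, ("noun", "verb"): 0.5,
     ("verb", "noun"): 0.5, ("verb", "verb"): 0.5}
B = {("noun", "yld"): 0.3, ("verb", "yld"): 0.7,
     ("noun", "yrh"): 0.6, ("verb", "yrh"): 0.4}

def em_step(sentences):
    """One Baum-Welch iteration: expected counts over all tag paths (E),
    then relative-frequency re-estimation (M)."""
    t_cnt = defaultdict(float)  # expected transitions i -> j
    e_cnt = defaultdict(float)  # expected emissions i -> k
    s_cnt = defaultdict(float)  # expected transitions out of i
    for words in sentences:
        paths = list(product(STATES, repeat=len(words)))
        joint = []
        for tags in paths:
            p, prev = 1.0, "start"
            for t, w in zip(tags, words):
                p *= A[(prev, t)] * B[(t, w)]
                prev = t
            joint.append(p)
        z = sum(joint)
        for tags, p in zip(paths, joint):
            post, prev = p / z, "start"  # posterior weight of this path
            for t, w in zip(tags, words):
                t_cnt[(prev, t)] += post
                e_cnt[(t, w)] += post
                s_cnt[prev] += post
                prev = t
    # M-step: relative frequencies of the expected counts.
    new_A = {k: v / s_cnt[k[0]] for k, v in t_cnt.items()}
    emit_tot = defaultdict(float)
    for (t, _), v in e_cnt.items():
        emit_tot[t] += v
    new_B = {k: v / emit_tot[k[0]] for k, v in e_cnt.items()}
    return new_A, new_B

A2, B2 = em_step([["yld", "yrh"], ["yrh", "yrh"]])
```

Each iteration keeps the distributions normalized while increasing corpus likelihood; in practice the loop runs until the likelihood stops improving.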
Token-based first order HMM
וניסח םיענה םלצב
t1 = prep + noun.sing.masc + possessive
t2 = def + adj.sing.masc
t3 = verb.plural.1.past
Token-based first order HMM
וניסח םיענה םלצב (t1 t2 t3)
Tags: English 48, Hebrew 3561
State Transitions: English 1.8K, Hebrew 855K
Lexical Transitions: English 57K, Hebrew 3.2M
Token-based partial second order HMM
וניסח םיענה םלצב (t1 t2 t3)
Tags: English 48, Hebrew 3561
State Transitions: English 38K, Hebrew 41M
Lexical Transitions: English 57K, Hebrew 3.2M
Token-based second order HMM
וניסח םיענה םלצב (t1 t2 t3)
Tags: English 48, Hebrew 3561
State Transitions: English 38K, Hebrew 41M
Lexical Transitions: English 300K, Hebrew 40M
Word-based Model
Computational Considerations
Sparse data; complexity (number of parameters)
Linguistic Motivation
Adequate representation; dynamic nature of the language
Hebrew Word Definition
Preposition prefix: מ ל כ ב (e.g. תיבב ,תיב)
Conjunctions: שכב שכל שכ ש ו (e.g. תיבו ,תיבש ,תיב)
Definite article: ה (e.g. תיבה ,תיב; םידמחנ ךכ לכ אלה)
Pronoun suffix (e.g. ותיב ,תיב)
Hebrew Word Definition
Prepositions: ינפל ,ידי לע ,תובקעב ,םעטמ ,רדגב ,תרגסמב
Adverbs: תוריהמב ,הרזחב ,ףתושמב
Inter-token words: דחא הפ ,ןיד ךרוע ,יפ לע ףא ,ידי לע ,מ דבל ,ל טרפ ,ל ףסונב
Word-based first order HMM
וניסח םיענה םלצב (analyzed as six word-level units t1 … t6)
t1 = prep, t2 = noun.sing.masc, t3 = possessive
t4 = def, t5 = adj.sing.masc, t6 = verb.plural.1.past
Word-based first order HMM
וניסח םיענה םלצב (t1 … t6)
Tags: English 48, Hebrew 362 (token-based: 3561)
State Transitions: English 1.8K, Hebrew 54K (token-based: 855K)
Lexical Transitions: English 57K, Hebrew 2.3M (token-based: 3.2M)
Word-based partial second order HMM
וניסח םיענה םלצב (t1 … t6)
Tags: English 48, Hebrew 362 (token-based: 3561)
State Transitions: English 38K, Hebrew 2.5M (token-based: 41M)
Lexical Transitions: English 57K, Hebrew 2.3M (token-based: 3.2M)
Word-based second order HMM
וניסח םיענה םלצב (t1 … t6)
Tags: English 48, Hebrew 362 (token-based: 3561)
State Transitions: English 38K, Hebrew 2.5M (token-based: 41M)
Lexical Transitions: English 300K, Hebrew 16M (token-based: 40M)
Text Representation
(lattice of alternative segmentations for וניסח םיענה םלצב: each token may be kept whole or split into its prefixes, stem, and suffix; every path ends in EOS)
Initial Conditions
Initial Condition Types
Morpho-lexical: p(t|w) – e.g. תא can be a noun, preposition, or pronoun.
Syntagmatic: p(t_i | t_i-2, t_i-1) – e.g. וניסח םיענה םלצב: the probability of three consecutive verbs.
Morpho-lexical approximations: morphology-based [Levinger et al. 95]
?הריפחה תא תא יל ריבעהל הלוכי תא (pronoun, preposition, noun)
Morphology-based Approximations
Similar word sets:
noun: תאה ,יתא ,םיתא
pronoun: התא ,םתא ,ןתא
preposition: (none)
The approximation of p(t|w) is based on the frequencies of the similar words of w in the corpus.
Linear context-based approximations
Motivation
ךלש
preposition: ןכלש םכלש ךלש
noun: ןכלש םכלש ךלש
Method
תוניפ שולש ךלש עבוכל
p(preposition | עבוכל ,שולש)
p(noun | עבוכל ,שולש)
Observed Data
p(w|c), p(c|w): observed in the raw corpus (שולש ךלש עבוכל)
p(t|w): similar-words algorithm
Expectation/Maximization over p(t|w) and p(t|c)
Linear-context Model
Notation: w – word, c – context of a word, t – tag
Expectation inputs: the raw-text corpus gives p(w|c) and p(c|w); the lexicon gives p(t|w)
Maximization:
p(t|c) = Σ_w p(t|w) · p(w|c)
p(t|w) = Σ_c p(t|c) · p(c|w)
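The two maximization updates can be sketched directly. The distributions here are hypothetical toy numbers: one ambiguous word "slk" (a transliteration standing in for ךלש) observed in two linear contexts c1 and c2.

```python
# Toy distributions (hypothetical numbers).
p_w_given_c = {"c1": {"slk": 1.0}, "c2": {"slk": 1.0}}   # p(w|c) from raw text
p_c_given_w = {"slk": {"c1": 0.5, "c2": 0.5}}            # p(c|w) from raw text
p_t_given_w = {"slk": {"preposition": 0.5, "noun": 0.5}} # p(t|w) from lexicon

def em_iteration(p_t_given_w):
    """p(t|c) = sum_w p(t|w) p(w|c), then p(t|w) = sum_c p(t|c) p(c|w)."""
    p_t_given_c = {}
    for c, words in p_w_given_c.items():
        dist = {}
        for w, pw in words.items():
            for t, pt in p_t_given_w[w].items():
                dist[t] = dist.get(t, 0.0) + pt * pw
        p_t_given_c[c] = dist
    new_p_t_given_w = {}
    for w, contexts in p_c_given_w.items():
        dist = {}
        for c, pc in contexts.items():
            for t, pt in p_t_given_c[c].items():
                dist[t] = dist.get(t, 0.0) + pt * pc
        new_p_t_given_w[w] = dist
    return p_t_given_c, new_p_t_given_w

p_t_c, p_t_w = em_iteration(p_t_given_w)
```

Iterating these two sums lets tag information flow between words that share contexts, which is how the linear-context approximation sharpens the lexicon's p(t|w).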
Syntagmatic Approximations
Syntagmatic Constraints
A construct-state form cannot be followed by a verb, preposition, punctuation, existential, modal, or copula: המדקתה תורש ,עב ימענ תא סרפ"מ
A verb cannot be followed by the preposition לש: םולח לש דלי
A copula or existential cannot be followed by a verb: ןגב דלי שי ,דומח דלי אוה ינד
A verb cannot be followed by another verb (with some exceptions): םולח םלח דלי
Syntagmatic Approximations
Initial Transitions
A small seed of randomly selected sentences (10K annotated tokens).
Tag trigram and bigram counts (ignoring the tag-word annotations) are used to initialize the p(t | t-2, t-1) distribution.
Accuracy (POS tagging / full morphological analysis); baseline: token-based, EM learning.

Syntagmatic   Morpho-lexical   POS    Full
Unif          Unif             91.9   87.1
Unif          Morph            92.1   88.0
Unif          Linear           92.4   87.5
Unif          Morph+Linear     92.8   88.0
Pair Const    Unif             92.0   87.8
Pair Const    Morph            91.8   88.1
Pair Const    Linear           92.4   87.5
Pair Const    Morph+Linear     92.8   88.0
Init Trans    Unif             92.6   89.4
Init Trans    Morph            93.0   90.0
Init Trans    Linear           93.0   89.7
Init Trans    Morph+Linear     93.1   90.0
Hebrew Tagging - Analysis
EM learning
Unsupervised HMM learning on the word model: EM is very effective for Hebrew, with an error reduction of 65% over uniform initial conditions.
Morphology-based initial conditions: error reduction of 7.7% over the uniform distribution.
Syntagmatic initial conditions: pair constraints alone have only minor impact; initial transition frequencies give an error reduction of 16.5% for full analysis and 12.5% for POS tagging.
Outline
Objectives Topic Analysis with LDA Obtaining Precise Morphology in Hebrew Combining LDA and Morphological Analysis Using Topic Models for Search Evaluating Topic Models Next Steps
Combining LDA and Morphology
LDA picks up patterns of word co-occurrence in documents.
Heavy variation in Hebrew could mean we “miss” co-occurrences if we do not first analyze morphology.
What is the best method to combine LDA and morphological analysis?
Combining LDA and Morphology
3 options:
1. Ignore morphology – token-based LDA.
2. Pipeline – resolve morphological ambiguities, then learn LDA.
3. Joint – learn LDA on distributions of possible morphological analyses.
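The three options differ only in how tokens are preprocessed before LDA sees them. A hypothetical sketch (the token key "bclm" and its scored analyses are invented for illustration; a real system would take them from the analyzer and disambiguator):

```python
# Hypothetical scored analyses for one ambiguous token.
ANALYSES = {"bclm": [("b+clm", 0.6), ("bclm", 0.3), ("bcl+m", 0.1)]}

def token_based(tokens):
    """Option 1: ignore morphology; surface tokens go to LDA as-is."""
    return [[(w, 1.0)] for w in tokens]

def pipeline(tokens):
    """Option 2: keep only the disambiguator's single best analysis."""
    return [[max(ANALYSES.get(w, [(w, 1.0)]), key=lambda x: x[1])]
            for w in tokens]

def joint(tokens):
    """Option 3: pass the full distribution over analyses to the
    LDA learner, which resolves ambiguity jointly with topics."""
    return [ANALYSES.get(w, [(w, 1.0)]) for w in tokens]

doc = ["bclm"]
```

Option 2 commits to one analysis per token and can propagate tagger errors; option 3 defers the decision but makes the LDA learner's input (and inference) heavier.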
Joint LDA-Morphology Learning
(figures: standard token-based LDA vs. joint morphology-LDA constrained by the tagger decision)
Outline
Objectives Topic Analysis with LDA Obtaining Precise Morphology in Hebrew Combining LDA and Morphological Analysis Using Topic Models for Search Evaluating Topic Models Next Steps
Outline
Searching with Topics
Combine search and browse:
Search: word → topics
Browse: topic → documents
Disambiguate: word → topics
Cluster unseen documents based on topics
Mishne Torah Topics
100 topics (K parameter)
Each word covers many variations
Query Disambiguation
Various topics for the single word רוש
Outline
Objectives Topic Analysis with LDA Obtaining Precise Morphology in Hebrew Combining LDA and Morphological Analysis Using Topic Models for Search Evaluating Topic Models Next Steps
How Good are Discovered Topics?
Difficult to evaluate LDA topics: many parameters (many words, many topics); each run gives slightly different results; how do we compare topic models?
Task-based evaluation: use topics for summarization.
Ontology-alignment evaluation: compare topics with an existing ontology.
Data-oriented evaluation.
Ontology Alignment
Mishne Torah has an existing structure: a hierarchy of Book/Section/Chapter, and excellent indexes exist. Compare the discovered topics with this existing ontology.
We find excellent alignment between topics and Book/Section.
Some topics are “cross-concerns” (witnesses – topic 5).
Topic Documents
Fits the Rambam’s classification
Other Domain: More Noise
Applied LDA to the InfoMed corpus (www.infomed.co.il), a corpus of “popular medicine” articles.
A different evaluation method is needed.
Data-oriented Evaluation
Method derived from our previous work on ontology evaluation.
How well does an automatically constructed partition of entities into classes represent reality?
How can a classification be improved via merge or split operations?
Example: movie genres, taken from IMDB: drama, comedy, war, sport, action…
Text Classifier to Evaluate Classes
Idea: use a set of texts from a given domain as a proxy of the “reality” represented by the ontology.
Hypothesis: if the ontology indicates that some movies are “clustered” along one of its dimensions, then documents associated with these movies should also be found to be associated by a text-classification engine trained on the classification induced by the ontology.
Procedure: train a classifier for each class we want to evaluate. If the classifier can decide with good accuracy whether a given text belongs to a class, the class is well-defined. In the movies domain, reviews are used as the representative texts.
Classes: Action, Comedy, Crime, Drama, Family, Foreign, Horror, Romance, Sci-Fi, Sports, Suspense, War.
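The procedure can be sketched with a small bag-of-words classifier. This is a toy, assuming hypothetical review snippets and a simple Naive Bayes model; the talk does not specify the classifier, so any text classifier trained on the ontology-induced classes would do.

```python
from collections import Counter
import math

def train_nb(docs_by_class):
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[cls] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return model

def classify(model, doc):
    """Pick the class with the highest log-likelihood for the document."""
    return max(model, key=lambda cls: sum(model[cls].get(w, 0.0) for w in doc))

# Hypothetical review snippets standing in for the movie-review proxy texts.
train = {"crime":  [["mafia", "gun", "heist"], ["police", "murder", "robbery"]],
         "comedy": [["wedding", "joke", "laugh"], ["funny", "family", "laugh"]]}
model = train_nb(train)

prediction = classify(model, ["mafia", "police"])
```

If held-out reviews for a class are classified accurately, the class is taken to be well-defined; classes whose reviews the classifier cannot separate are candidates for merging or splitting.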
Noisy Classes
What can be done if the classes are very noisy?
Example: keywords in IMDB.
Keyword lists are open to additions/deletions by users (yet moderated).
Examples of keywords: Mafia, Business, New_York, Wedding, Respect, Home, Organized_Crime, Lawyer, Violence … (and the film?)
Too few movies are associated with any single keyword, and the movies associated with a given keyword are not necessarily related.
The same could hold for LDA topics in noisy domains.
Cluster Keywords using LDA
Apply LDA to the keywords.
Divide movies into classes according to the LDA distributions of the movies’ reviews.
Construct classifiers to evaluate the quality of the LDA classes.
Example of the top keywords of a “good” class: england inheritance london-england based-on-novel mansion london maid class-differences servant 19th-century period-piece butler orphan estate aunt uncle heir victorian-era love marriage
(figure: films such as Psycho, The Sopranos, The Godfather, Secrets & Lies, 12 Angry Men, Magnolia, The Graduate linked to IMDB keywords such as Mafia, Business, New_York, Wedding, Violence, Organized Crime, Lawyer; example LDA keyword clusters: “money office business fraud advertising greed …”, “police murder gangster robbery mafia revenge …”, “african-american racism interracial-relationship independent-film …”)
Conclusions
Morphological analysis is a critical pre-processing step for most text mining applications in Hebrew.
We obtain accuracy in the 90-95% range on full segmentation/POS/morphological analysis, robust on unknown words.
This enables effective LDA topic analysis in Hebrew.
More work is needed on topic analysis evaluation.