Text Mining in Hebrew: Impact of Morphology Analysis on Topic Analysis and on Search Quality


  1. Text Mining in Hebrew: Impact of Morphology Analysis on Topic Analysis and on Search Quality. Michael Elhadad, Meni Adler, Yoav Goldberg, Dudi Gabay and Yael Netzer

  2. Text in Hebrew • Extract information from text in Hebrew • Major immediate obstacle: rich morphology • Very high number of distinct word forms • Very high ambiguity

  3. Morphological Analysis • The single surface form בצלם has many readings: • בְּצֶלֶם (name of an association) • בְּצַלֵּם (while taking a picture) • בְּצָלָם (their onion) • בְּצִלָּם (under their shade) • בְּצַלָּם (in a photographer) • בַּצַּלָּם (in the photographer) • בְּצֶלֶם (in an idol) • בַּצֶּלֶם (in the idol)

  4. How Critical is Morphological Analysis to Text Mining? • How much does Hebrew morphology affect high-level text analysis tasks? • Named Entity Recognition • Information Extraction • Topic Analysis • Information Retrieval

  5. Topic Analysis and Search • Topic Analysis: unsupervised discovery of topics in a text collection; useful for browsing a large corpus by theme; difficult to evaluate • Faceted Search: a useful combination of search and browsing; supports exploratory search (as opposed to fact finding); enabled by topic analysis

  6. The Basic Idea • One word, איש (man), appears in about 50 distinct forms in the corpus

  7. Outline • Objectives: Topic Analysis in Hebrew, Improved Search • Topic Analysis with LDA • Obtaining Precise Morphology in Hebrew • Combining LDA and Morphological Analysis • Using Topic Models for Search • Evaluating Topic Models • Next Steps

  8. Objectives • Input: a domain-specific text corpus in Hebrew • Output: a topic model that discovers the “topics” discussed in the corpus, recognizes topics in unseen text, and indexes the text collection by topic • Task: search and browse the text collection using topics

  9. Example: Rambam’s Mishne Torah • Corpus of Mishne Torah • Exhaustive code of Halakha • Written by Maimonides, 1170-1180 • 14 books, 85 sections, 1000 chapters, 15K articles, 350K words • Creative compilation of laws from multiple sources: Torah, Talmud (Bavli and Yerushalmi), Tosefta, halakhic midrashim (Sifra and Sifre), Geonim • Synthetic hierarchical organization

  10. Problems with Existing Search • Morphology: a single prefixed ו and the word is not found…

  11. Problems: Explore Complex Topics • שור (ox) is involved in many complex halakhic topics: • Damages (שור נוגח, the goring ox) • Kosher meat (שחיטה) • Sacrifices (קרבנות) • Shabbat (שבת) • Calendar (מזל שור, Taurus) • Queries must be disambiguated: שור+שבת?

  12. Exploratory/Faceted Search • How to deal with ambiguous query terms? • Propose refinements according to contexts: “Do you mean: damages, meat, shabat…” • Propose facets for query refinement • Where do the topics (facets) come from? • How do we disambiguate the query terms? • Given a disambiguated topic, how do we refine the query?

  13. Outline • Objectives • Topic Analysis with LDA • Obtaining Precise Morphology in Hebrew • Combining LDA and Morphological Analysis • Using Topic Models for Search • Evaluating Topic Models • Next Steps

  14. Discovering Topic Models: LDA • Latent Dirichlet Allocation (Blei, Ng and Jordan 2003) • Discovers topic structures in a document collection, unsupervised • Topics are modeled as distributions over words • Probabilistic generative model of text

  15. Topics for שור [figure: example topics discovered by LDA]

  16. Topics for a Document [figure]

  17. The LDA Model • Observation: documents are composed of words • Intuition: documents exhibit multiple topics • Generative probabilistic model: each document is a mixture of topics, and each word is drawn from one of the topics active in the document (see the sketch below)
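To make the generative story concrete, here is a minimal sketch in Python. The topic count, vocabulary size, document length, and Dirichlet hyperparameters are hypothetical toy values, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: K topics, a V-word vocabulary, 50-word documents.
K, V, DOC_LEN = 3, 1000, 50
alpha = np.full(K, 0.1)                           # Dirichlet prior over topic mixtures
topics = rng.dirichlet(np.full(V, 0.01), size=K)  # each row: one topic's word distribution

def generate_document():
    theta = rng.dirichlet(alpha)        # this document's mixture of topics
    doc = []
    for _ in range(DOC_LEN):
        z = rng.choice(K, p=theta)      # pick one of the topics active in the document
        w = rng.choice(V, p=topics[z])  # draw a word from that topic's distribution
        doc.append(w)
    return doc

print(generate_document()[:10])  # ten word ids from one synthetic document
```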

  18. Structure of the LDA Model [figure, from (Blei 2008)]

  19. Learning an LDA Model from Observations • Observations: the documents and their words • Objective: infer the underlying topic structure • What are the topics? • How are the documents divided according to those topics?
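In practice this inference can be run with an off-the-shelf library. A sketch using gensim on a hypothetical toy corpus; this is an illustration only, the talk does not name a toolkit:

```python
from gensim import corpora, models

# Hypothetical toy corpus: each document is a list of (normalized) tokens.
docs = [["שור", "נזק", "תשלום"],
        ["שבת", "מלאכה", "נר"],
        ["שור", "שחיטה", "בשר"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Infer K topics: each topic is a distribution over words,
# each document a mixture over the K topics.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=20)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```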

  20. Graphical Models [figure, from (Blei 2008)]

  21. LDA Graphical Model [figure, from (Blei 2008)]

  22. LDA Generative Process [figure, from (Blei 2008)]

  23. LDA Estimation [figure, from (Blei 2008)]

  24. LDA Approximation [figure, from (Blei 2008)]

  25. Outline • Objectives • Topic Analysis with LDA • Obtaining Precise Morphology in Hebrew • Combining LDA and Morphological Analysis • Using Topic Models for Search • Evaluating Topic Models • Next Steps

  26. Morphological Analysis • בְּצֶלֶם → בצלם, proper-noun • בְּצַלֵּם → בצלם, verb, infinitive • בְּצָלָם → בצל-ם, noun, singular, masculine • בְּצִלָּם → ב-צל-ם, noun, singular, masculine • בְּצַלָּם, בְּצֶלֶם → ב-צלם, noun, singular, masculine (absolute and construct) • בַּצַּלָּם, בַּצֶּלֶם → ב-צלם, noun, definite, singular, masculine

  27. Morphological Analyzer • The analyzer maps each token in w_1,…,w_n to its set of possible analyses: w_1 → {t_11,…,t_1i}, …, w_n → {t_n1,…,t_ni} • Implementation: corpus-based vs. lexicon-based; analytic vs. synthetic
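A minimal sketch of this interface, with a toy hand-written lexicon and hypothetical tag names; the real analyzer is lexicon-based with far richer coverage:

```python
# Toy lexicon: surface form -> list of (segmentation, tag) analyses.
LEXICON = {
    "בצלם": [
        ("בצלם", "proper-noun"),
        ("בצלם", "verb-infinitive"),
        ("בצל-ם", "noun-sg-masc+possessive"),
        ("ב-צל-ם", "prep+noun-sg-masc+possessive"),
        ("ב-צלם", "prep+noun-sg-masc"),
    ],
}

def analyze(tokens):
    # Map w_1..w_n to their analysis sets {t_i1..t_ik}. Unknown tokens get an
    # empty set here; the full system backs off to an unknown-token model.
    return [(w, LEXICON.get(w, [])) for w in tokens]

print(analyze(["בצלם"]))
```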

  28. Morphological Disambiguation • Context determines the right reading of בצלם: • ארגון בצלם הודיע (the organization B’Tselem announced) • בצלם את המשחק הבחנתי בעוזר המאמן (while photographing the game I noticed the assistant coach) • בצלם נחטף בשווקים (their onion was snatched up in the markets) • בצלם הנעים חסינו (in their pleasant shade we took shelter) • נתקלתי בצלם מקצועי (I ran into a professional photographer) • פגשתי בצלם חתונות (I met a wedding photographer) • נתקלתי בצלם המקצועי (I ran into the professional photographer)

  29. Morphological Disambiguation • The disambiguator maps each token w_i, with its analysis set {t_i1,…,t_ik}, to a single selected analysis t_ij
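The same interface in the sketch's terms; the first-analysis baseline below is only a hypothetical stand-in for the HMM-based disambiguator described in the following slides:

```python
def disambiguate_baseline(analyzed):
    # analyzed: list of (token, [analyses]) pairs from analyze() above.
    # Naive stand-in: commit to the first listed analysis for each token;
    # the real component picks the sequence-optimal analysis in context.
    return [(w, analyses[0] if analyses else (w, "unknown"))
            for w, analyses in analyzed]

print(disambiguate_baseline(analyze(["בצלם"])))
```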

  30. Hebrew Text Analysis System • Pipeline: Tokenizer → Morphological Analyzer (lexicon) → Unknown Tokens Analyzer (ME) → Morphological Disambiguator (HMM) → Proper-name Classifier (SVM) → Named-Entity Recognizer (ME) → Noun-Phrase Chunker (SVM) • Demo: http://www.cs.bgu.ac.il/~nlpproj/demo

  31. Morphological Disambiguation: Methods • Rule-based vs. stochastic models • Supervised vs. unsupervised learning • Exact vs. approximate inference

  32. Hidden Markov Model • S: a set of states (= tags) • O: a set of output symbols (= tokens) • μ: a probabilistic model • State transition probabilities A = {a_i,j} • Symbol emission probabilities B = {b_i,k}

  33. HMM: An Example • S = {start, noun, verb} • O = {ילד, ירח} • μ = (A, B) • A: a_start,noun = 0.8, a_start,verb = 0.2; a_noun,noun = 0.9, a_noun,verb = 0.1; a_verb,noun = 0.9, a_verb,verb = 0.1 • B: b_noun,ילד = 0.3, b_noun,ירח = 0.7; b_verb,ילד = 0.9, b_verb,ירח = 0.1
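The same toy model written out as plain Python dictionaries; the decoding sketches below reuse STATES, A, and B:

```python
# States are tags; outputs are tokens. Numbers copied from the slide.
STATES = ["noun", "verb"]

A = {  # transition probabilities: A[i][j] = P(next state j | state i)
    "start": {"noun": 0.8, "verb": 0.2},
    "noun":  {"noun": 0.9, "verb": 0.1},
    "verb":  {"noun": 0.9, "verb": 0.1},
}

B = {  # emission probabilities: B[i][k] = P(token k | state i)
    "noun": {"ילד": 0.3, "ירח": 0.7},
    "verb": {"ילד": 0.9, "ירח": 0.1},
}
```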

  34. Markov Process [figure: the chain start → noun → noun emitting ילד ירח]

  35. Decoding • Given the model parameters, what is the best tag sequence for the sentence ילד ירח? • Transitions: a_start,noun = 0.8, a_start,verb = 0.2, a_noun,noun = 0.9, a_noun,verb = 0.1, a_verb,noun = 0.9, a_verb,verb = 0.1 • Emissions: b_noun,ילד = 0.3, b_noun,ירח = 0.7, b_verb,ילד = 0.9, b_verb,ירח = 0.1

  36. Decoding • P(noun, noun) = a_start,noun · b_noun,ילד · a_noun,noun · b_noun,ירח = 0.8 · 0.3 · 0.9 · 0.7 = 0.1512 • P(noun, verb) = a_start,noun · b_noun,ילד · a_noun,verb · b_verb,ירח = 0.8 · 0.3 · 0.1 · 0.1 = 0.0024 • P(verb, noun) = a_start,verb · b_verb,ילד · a_verb,noun · b_noun,ירח = 0.2 · 0.9 · 0.9 · 0.7 = 0.1134 • P(verb, verb) = a_start,verb · b_verb,ילד · a_verb,verb · b_verb,ירח = 0.2 · 0.9 · 0.1 · 0.1 = 0.0018
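The slide's exhaustive enumeration, reproduced with the STATES, A, and B dictionaries from the sketch above:

```python
from itertools import product

def joint_prob(tags, tokens):
    # P(tags, tokens) = product of transition * emission along the sequence.
    p, prev = 1.0, "start"
    for tag, tok in zip(tags, tokens):
        p *= A[prev][tag] * B[tag][tok]
        prev = tag
    return p

tokens = ["ילד", "ירח"]
for tags in product(STATES, repeat=len(tokens)):
    print(tags, round(joint_prob(tags, tokens), 4))
# ('noun', 'noun') 0.1512   ('noun', 'verb') 0.0024
# ('verb', 'noun') 0.1134   ('verb', 'verb') 0.0018
```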

  38. Decoding • The most probable sequence is (noun, noun), with probability 0.1512 • Enumerating all tag sequences grows exponentially with sentence length; the Viterbi algorithm (dynamic programming) finds the best sequence efficiently
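A compact Viterbi decoder over the same toy model, again reusing STATES, A, and B from above. Exhaustive enumeration costs O(|S|^n); this dynamic program costs O(n · |S|^2):

```python
def viterbi(tokens, states, A, B):
    # delta[t][s]: probability of the best tag sequence that ends in state s
    # after emitting tokens[..t]; back[t][s]: the predecessor on that path.
    delta = [{s: A["start"][s] * B[s][tokens[0]] for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        delta.append({})
        back.append({})
        for s in states:
            best = max(states, key=lambda r: delta[t - 1][r] * A[r][s])
            delta[t][s] = delta[t - 1][best] * A[best][s] * B[s][tokens[t]]
            back[t][s] = best
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]

print(viterbi(["ילד", "ירח"], STATES, A, B))  # (['noun', 'noun'], 0.1512)
```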

  39. Parameter Estimation [figure: the same chain for ילד ירח with all transition and emission probabilities unknown]

  40. Supervised Parameter Estimation [figure: the chain for ילד ירח with the gold tags (noun, noun) observed and the probabilities still to be estimated]

  41. Supervised Parameter Estimation • a_i,j = (number of transitions from state i to state j) / (number of transitions from state i) • b_i,k = (number of lexical transitions from state i to symbol k) / (number of transitions from state i)
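The two counting formulas in code, as a minimal sketch over a hypothetical gold-tagged toy corpus; for the emission denominator it uses tag occurrence counts, which coincide with the slide's transition counts at all non-final positions:

```python
from collections import Counter

def estimate_hmm(tagged_corpus):
    # tagged_corpus: sentences given as lists of (token, tag) pairs with gold tags.
    trans, emit, out, occ = Counter(), Counter(), Counter(), Counter()
    for sentence in tagged_corpus:
        prev = "start"
        for token, tag in sentence:
            trans[prev, tag] += 1   # count of transitions i -> j
            out[prev] += 1          # count of transitions out of state i
            emit[tag, token] += 1   # count of emissions of symbol k from state i
            occ[tag] += 1           # occurrences of state i
            prev = tag
    A = {(i, j): n / out[i] for (i, j), n in trans.items()}
    B = {(i, k): n / occ[i] for (i, k), n in emit.items()}
    return A, B

# Hypothetical two-sentence gold-tagged corpus:
corpus = [[("ילד", "noun"), ("ירח", "noun")],
          [("ילד", "verb"), ("ירח", "noun")]]
A_hat, B_hat = estimate_hmm(corpus)
print(A_hat, B_hat)
```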
