Graph-Based Methods for Multilingual Text and Web Mining - PowerPoint PPT Presentation

Graph-Based Methods for Multilingual Text and Web Mining
Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev
In cooperation with:
Horst Bunke (University of Bern)
Abraham Kandel, Adam Schenker (University of South Florida)
Alex Markov, Marina Litvak, Guy Danon (Ben-Gurion University)
E-mail: mlast@bgu.ac.il
Home Page: http://www.bgu.ac.il/~mlast/
Text Mining Day 2009 at BGU, May 25, 2009

Agenda
• Introduction and Motivation
• Graph-Based Representations of Text and Web Documents
• Graph-Based Categorization and Clustering Algorithms
• The Hybrid Approach to Web Document Categorization
• Graph-Based Keyword Extraction
• Summary
Prof. Mark Last (BGU)

INTRODUCTION AND MOTIVATION

Web Mining Tasks
• Web Mining
– Web Structure Mining
– Web Usage Mining
– Web Content Mining
• Tasks shown in the diagram: PageRank; Information Search and Retrieval; Document Categorization; Document Clustering; Keyword Extraction and Summarization

The Vector-Space Model (Salton et al., 1975)
• A text document is considered a “bag of words (terms / features)”
– Document d_j = (w_1j, …, w_|T|j), where T = (t_1, …, t_|T|) is the set of terms (features) that occur at least once in at least one document (the vocabulary)
• Term: n-gram, single word, noun phrase, keyphrase, etc.
• Term weights: binary, frequency-based, etc.
• Meaningless (“stop”) words are removed
• Stemming operations may be applied
– Leaders => Leader
– Expiring => expire
• The ordering and position of words, as well as document logical structure and layout, are completely ignored
May 29, 2009
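The bag-of-words idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the talk: the stop-word list, the tokenizer, and the function names (tokenize, build_vocabulary, to_vector) are assumptions made for the example, and stemming is left out.

```python
from collections import Counter

# Hypothetical toy stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "the", "of", "and", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def build_vocabulary(documents):
    """T: every term that occurs at least once in at least one document."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def to_vector(document, vocabulary, binary=False):
    """d_j = (w_1j, ..., w_|T|j) with frequency-based or binary weights."""
    counts = Counter(tokenize(document))
    if binary:
        return [1 if counts[t] else 0 for t in vocabulary]
    return [counts[t] for t in vocabulary]


docs = ["The leader of the party", "Party in the park"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

# Word order is ignored: both orderings map to the same vector.
same = to_vector("party leader", vocab) == to_vector("leader party", vocab)
```

Because only counts are kept, any reordering of the same words yields an identical vector, which is the limitation the graph-based model is meant to address.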

Advantages of the Vector-Space Model (based on Joachims, 2002)
• A simple and straightforward representation for English and other languages, where words have a clear delimiter
• Most weighting schemes require a single scan of each document
• A fixed-size vector representation makes unstructured text accessible to most classification algorithms (from decision trees to SVMs)
• Consistently good results in the information retrieval domain (mainly on English corpora)

Limitations of the Vector-Space Model
• Text documents
– Ignoring the word position in the document
– Ignoring the ordering of words in the document
• Web documents
– Ignoring the information contained in HTML tags (e.g., document sections)
• Multilingual documents
– Word separation may be tricky in some languages (e.g., Latin, German, Chinese)
– No comprehensive evaluation on large non-English corpora

Word Separation in Ancient Latin
• The Arch of Titus, Rome (1st century AD)
• Dedication to Julius Caesar (1st century BC)
• Words are separated by triangles

GRAPH-BASED REPRESENTATIONS OF TEXT AND WEB DOCUMENTS
(Introduced in Schenker et al., 2005)

Relevant Definitions (based on Bunke and Kandel, 2000)
• A (labeled) graph G is a 4-tuple G = (V, E, α, β), where V is a set of nodes (vertices), E ⊆ V × V is a set of edges connecting the nodes, α is a function labeling the nodes, and β is a function labeling the edges
• Node and edge IDs are omitted for brevity
• Graph size: |G| = |V| + |E|
[Figure: example graph with node labels A, B, C and edge labels x, y]
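As a rough illustration of the 4-tuple definition, the sketch below stores V, E, α and β in a small Python class and computes the graph size |V| + |E|. The class name LabeledGraph and its methods are hypothetical, and the example edges (A to B labeled x, B to C labeled y) are an assumption, since the slide figure does not survive in the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """A labeled graph G = (V, E, alpha, beta)."""
    nodes: set = field(default_factory=set)          # V
    edges: set = field(default_factory=set)          # E, a subset of V x V
    node_labels: dict = field(default_factory=dict)  # alpha: node -> label
    edge_labels: dict = field(default_factory=dict)  # beta: edge -> label

    def add_node(self, node, label):
        self.nodes.add(node)
        self.node_labels[node] = label

    def add_edge(self, source, target, label):
        # E must stay inside V x V, so both endpoints have to exist already.
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("both endpoints must be nodes of the graph")
        self.edges.add((source, target))
        self.edge_labels[(source, target)] = label

    def size(self):
        """Graph size |G| = |V| + |E|."""
        return len(self.nodes) + len(self.edges)


graph = LabeledGraph()
for node_id, label in enumerate(["A", "B", "C"]):
    graph.add_node(node_id, label)
graph.add_edge(0, 1, "x")  # assumed edge: A -> B, labeled x
graph.add_edge(1, 2, "y")  # assumed edge: B -> C, labeled y
graph_size = graph.size()
```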

The Graph-Based Model of Web Documents – Basic Ideas
• At most one node for each unique term in a document
• If a word B follows a word A, there is a directed edge from A to B
– Unless the words are separated by certain punctuation marks (periods, question marks, and exclamation points)
• Stop words are removed
• Graph size may be limited by including only the most frequent terms
• Stemming
– Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form
• Several variations for node and edge labeling (see the next slides)
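The basic ideas above can be sketched as a small graph builder. This is a minimal sketch under stated assumptions, not the authors' implementation: the stop-word list is a toy one, stemming is omitted, stop words are removed before adjacency is computed, and an edge is added only when both neighbouring terms survive the frequency cut-off. The function name build_simple_graph and the max_nodes parameter are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "has"}


def build_simple_graph(text, max_nodes=None):
    """Return (nodes, edges) with one node per unique term and a directed
    edge (A, B) whenever B directly follows A within a sentence."""
    # Periods, question marks and exclamation points break adjacency.
    sentences = [s for s in re.split(r"[.?!]", text.lower()) if s.strip()]
    token_lists = [
        [w for w in re.findall(r"[a-z0-9]+", s) if w not in STOP_WORDS]
        for s in sentences
    ]
    counts = Counter(w for tokens in token_lists for w in tokens)
    if max_nodes is None:
        nodes = set(counts)
    else:
        # Limit graph size to the most frequent terms.
        nodes = {w for w, _ in counts.most_common(max_nodes)}
    edges = set()
    for tokens in token_lists:
        for a, b in zip(tokens, tokens[1:]):
            if a in nodes and b in nodes:
                edges.add((a, b))
    return nodes, edges


nodes, edges = build_simple_graph("Yahoo news reports. Reuters news service!")
```

Note that the sentence split is deliberately naive: it would also break on periods inside URLs or abbreviations.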

The Standard Representation
• Edges are labeled according to the document section in which one word follows the other
– Title (TI) contains the text related to the document’s title and any provided keywords (meta-data);
– Link (L) is the “anchor text” that appears in clickable hyperlinks on the document;
– Text (TX) comprises any of the visible text in the document (this includes anchor text but not title and keyword text)
[Figure: example graph with nodes YAHOO, NEWS, MORE, SERVICE, REPORTS, REUTERS and edges labeled TI, L, and TX]
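A section-labeled graph of this kind might be built as follows. This is a sketch only: the function name build_standard_graph, the input format (a mapping from section label to raw text), and the sample page contents are assumptions for illustration; the caller decides which text belongs to which section, and stop-word removal and stemming are omitted.

```python
def build_standard_graph(sections):
    """sections maps a section label ("TI", "L", "TX") to its text.
    Returns (nodes, edges), where each edge is (source, target, label)."""
    nodes = set()
    edges = set()
    for label, text in sections.items():
        tokens = text.lower().split()
        nodes.update(tokens)
        # Label each adjacency with the section in which it occurs.
        edges.update((a, b, label) for a, b in zip(tokens, tokens[1:]))
    return nodes, edges


# Hypothetical page contents, loosely modeled on the node names in the figure.
page = {
    "TI": "Yahoo News",
    "L": "More News",
    "TX": "Reuters News Service Reports",
}
page_nodes, page_edges = build_standard_graph(page)
```

Storing the label inside the edge triple lets the same pair of terms be connected by several edges, one per section, which a plain (source, target) pair could not express.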

The Simple Representation
• The graph is based only on the visible text on the page (title and meta-data are ignored)
• Edges are not labeled
[Figure: example graph with nodes NEWS, MORE, REPORTS, REUTERS, SERVICE and unlabeled edges]

Other Representations
• The n-distance Representation
– Look up to n terms ahead and connect the succeeding terms with an edge that is labeled with the distance between them (n)
• The n-simple Representation
– Look up to n terms ahead and connect the succeeding terms with an unlabeled edge
• The Absolute Frequency Representation
– Each node and edge is labeled with an absolute frequency measure
• The Relative Frequency Representation
– Each node and edge is labeled with a relative frequency measure
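The n-distance and n-simple variants differ only in whether the look-ahead distance is kept on the edge. A minimal sketch, assuming an already preprocessed token list for a single sentence (the function names are hypothetical):

```python
def n_distance_edges(tokens, n):
    """Connect each term to the terms up to n positions ahead.
    Each edge is (source, target, distance)."""
    edges = set()
    for i, source in enumerate(tokens):
        for distance in range(1, n + 1):
            if i + distance < len(tokens):
                edges.add((source, tokens[i + distance], distance))
    return edges


def n_simple_edges(tokens, n):
    """Same look-ahead as n-distance, but the edges carry no label."""
    return {(a, b) for a, b, _ in n_distance_edges(tokens, n)}


terms = ["car", "bomb", "explod", "baghdad"]
distance_edges = n_distance_edges(terms, 2)
simple_edges = n_simple_edges(terms, 2)
```

With n = 1, both functions reduce to plain adjacency between consecutive terms.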

Graph-Based Document Representation Example
Source: www.cnn.com, 24/05/2005

Graph-Based Document Representation - Parsing
[Figure: parsed page with the title, link, and text sections marked]

Graph-Based Document Representation - Preprocessing
Title: CNN.com International
Text: A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.
Links: Iraq bomb: Four dead, 110 wounded. FULL STORY.
Callouts on the slide: stop word removal; stemming (e.g., “killed” is conflated to “killing”)

Graph-Based Document Representation - Preprocessing
Title: CNN.com International
Text: A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqis Prime Minister Ibrahim al-Jaafari and his driver were killing in a driver shooting.
Links: Iraqis bomb: Four dead, 110 wounding. FULL STORY.

Standard Graph-Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph with nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD, INTERNATIONAL, CNN; edge labels: TX (Text), L (Link), TI (Title)]
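Selecting the most frequent terms as graph nodes can be sketched with the frequency table from this slide. The function name most_frequent_terms is hypothetical, and how ties between equally frequent terms are broken is not stated on the slide; this sketch simply leaves it to Counter.most_common.

```python
from collections import Counter

# Term frequencies as listed on the slide (lowercased stems).
term_frequencies = Counter({
    "iraq": 3, "kill": 2, "bomb": 2, "wound": 2, "drive": 2,
    "explod": 1, "baghdad": 1, "international": 1, "cnn": 1, "car": 1,
})


def most_frequent_terms(frequencies, limit):
    """Return up to `limit` terms, most frequent first."""
    return [term for term, _ in frequencies.most_common(limit)]


all_ten = most_frequent_terms(term_frequencies, 10)
top_five = most_frequent_terms(term_frequencies, 5)
```

With a limit of 5, the cut falls exactly between the terms that occur twice or more and those that occur once, so the selected set does not depend on tie-breaking.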

Simple Graph-Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph with nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD and unlabeled edges]

GRAPH-BASED CATEGORIZATION AND CLUSTERING ALGORITHMS
(Based on Schenker et al., 2005)
