Graph-Based Methods for Multilingual Text and Web Mining - PowerPoint PPT Presentation



SLIDE 1

Graph-Based Methods for Multilingual Text and Web Mining

Mark Last, Department of Information Systems Engineering, Ben-Gurion University of the Negev. In cooperation with Horst Bunke (University of Bern), Abraham Kandel, Adam Schenker (University of South Florida), Alex Markov, Marina Litvak, Guy Danon (Ben-Gurion University)

E-mail: mlast@bgu.ac.il Home Page: http://www.bgu.ac.il/~mlast/

Text Mining Day 2009 at BGU, May 25, 2009

SLIDE 2

Agenda

  • Introduction and Motivation
  • Graph-Based Representations of Text and Web Documents
  • Graph-Based Categorization and Clustering Algorithms
  • The Hybrid Approach to Web Document Categorization
  • Graph-Based Keyword Extraction
  • Summary
  • Prof. Mark Last (BGU)

SLIDE 3

INTRODUCTION AND MOTIVATION

SLIDE 4

Web Mining Tasks

Web Mining comprises Web Structure Mining, Web Usage Mining, and Web Content Mining

Web Structure Mining: PageRank

Web Content Mining: Information Search and Retrieval, Document Categorization, Document Clustering, Keyword Extraction and Summarization

SLIDE 5

The Vector-Space Model

(Salton et al., 1975)

  • A text document is considered a “bag of words (terms / features)”

– Document dj = (w1j, …, w|T|j), where T = (t1, …, t|T|) is the set of terms (features) that occur at least once in at least one document (the vocabulary)

  • Term: n-gram, single word, noun phrase, keyphrase, etc.
  • Term weights: binary, frequency-based, etc.
  • Meaningless (“stop”) words are removed
  • Stemming operations may be applied

– Leaders => Leader
– Expiring => Expire

  • The ordering and position of words, as well as the document’s logical structure and layout, are completely ignored
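A minimal sketch of the bag-of-words construction described above; the tokenizer and the tiny stop-word list are illustrative assumptions, not the originals:

```python
import re

# Illustrative stop-word list (assumption, not the original one).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lower-case the text and split it into non-stop word tokens."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def build_vectors(documents):
    """Represent each document as a term-frequency vector over the shared vocabulary."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    vectors = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    return vocabulary, vectors

vocabulary, vectors = build_vectors(["The car bomb exploded", "The car is parked"])
print(vocabulary)
print(vectors)  # word order and position are lost; only counts remain
```

Note how the two resulting vectors share one fixed-size vocabulary, which is exactly what makes the representation usable by standard classifiers.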

May 29, 2009

SLIDE 6

Advantages of the Vector-Space Model

(based on Joachims, 2002)

  • A simple and straightforward representation for English and other languages, where words have a clear delimiter
  • Most weighting schemes require a single scan of each document
  • A fixed-size vector representation makes unstructured text accessible to most classification algorithms (from decision trees to SVMs)
  • Consistently good results in the information retrieval domain (mainly on English corpora)

SLIDE 7

Limitations of the Vector-Space Model

  • Text documents

– Ignoring the word position in the document
– Ignoring the ordering of words in the document

  • Web documents

– Ignoring the information contained in HTML tags (e.g., document sections)

  • Multilingual documents

– Word separation may be tricky in some languages (e.g., Latin, German, Chinese, etc.)
– No comprehensive evaluation on large non-English corpora

SLIDE 8

Word Separation in Ancient Latin

The Arch of Titus, Rome (1st Century AD); Dedication to Julius Caesar (1st Century BC). Words are separated by triangles.

SLIDE 9

Introduced in Schenker et al., 2005

GRAPH-BASED REPRESENTATIONS OF TEXT AND WEB DOCUMENTS

SLIDE 10

Relevant Definitions

(Based on Bunke and Kandel, 2000)

  • A (labeled) graph G is a 4-tuple G = (V, E, α, β), where V is a set of nodes (vertices), E ⊆ V × V is a set of edges connecting the nodes, α is a function labeling the nodes, and β is a function labeling the edges

[Example: a three-node graph with node labels A, B, C and edge labels x, y]

  • Node and edge IDs are omitted for brevity
  • Graph size: |G| = |V| + |E|
SLIDE 11

The Graph-Based Model of Web Documents – Basic Ideas

  • At most one node for each unique term in a document
  • If a word B follows a word A, there is a directed edge from A to B

– Unless the words are separated by certain punctuation marks (periods, question marks, and exclamation points)

  • Stop words are removed
  • Graph size may be limited by including only the most frequent terms
  • Stemming

– Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form

  • Several variations for node and edge labeling (see the next slides)
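The construction rules above can be sketched as follows; the stop-word list and tokenizer are illustrative assumptions, and stemming is omitted for brevity:

```python
import re

# Illustrative stop-word list (assumption, not the original one).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "has"}

def build_graph(text, max_nodes=None):
    """Return (nodes, edges) for the directed graph-of-words of `text`:
    one node per unique term, and an edge (A, B) when B follows A within
    a sentence (no edge crosses '.', '?' or '!')."""
    sentences = re.split(r"[.?!]", text.lower())
    tokens_per_sentence = [
        [t for t in re.findall(r"[a-z0-9]+", s) if t not in STOP_WORDS]
        for s in sentences
    ]
    all_tokens = [t for sent in tokens_per_sentence for t in sent]
    # Optionally bound the graph size by keeping only the most frequent terms.
    if max_nodes is not None:
        by_freq = sorted(set(all_tokens), key=lambda t: (-all_tokens.count(t), t))
        keep = set(by_freq[:max_nodes])
    else:
        keep = set(all_tokens)
    nodes, edges = set(), set()
    for sent in tokens_per_sentence:
        filtered = [t for t in sent if t in keep]
        nodes.update(filtered)
        edges.update(zip(filtered, filtered[1:]))
    return nodes, edges

nodes, edges = build_graph("The car bomb exploded. The bomb killed three.")
print(sorted(nodes))
print(sorted(edges))
```

Edges here connect terms that are adjacent after stop-word removal, and the direction of each edge preserves the word order that the vector-space model discards.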

SLIDE 12

The Standard Representation

  • Edges are labeled according to the document section where the words follow each other

– Title (TI) contains the text related to the document’s title and any provided keywords (meta-data)
– Link (L) is the “anchor text” that appears in clickable hyper-links on the document
– Text (TX) comprises any of the visible text in the document (this includes anchor text but not title and keyword text)

[Example graph: nodes YAHOO, NEWS, SERVICE, MORE, REPORTS, REUTERS with edges labeled TI, L, and TX]

SLIDE 13

The Simple Representation

  • The graph is based only on the visible text on the page (title and meta-data are ignored)
  • Edges are not labeled

[Example graph: nodes NEWS, SERVICE, MORE, REPORTS, REUTERS with unlabeled edges]

SLIDE 14

Other Representations

  • The n-distance Representation

– Look up to n terms ahead and connect the succeeding terms with an edge that is labeled with the distance between them

  • The n-simple Representation

– Look up to n terms ahead and connect the succeeding terms with an unlabeled edge

  • The Absolute Frequency Representation

– Each node and edge is labeled with an absolute frequency measure

  • The Relative Frequency Representation

– Each node and edge is labeled with a relative frequency measure

SLIDE 15

Graph-Based Document Representation

Example – Source: www.cnn.com, 24/05/2005

SLIDE 16

Graph-Based Document Representation – Parsing (title, link, text)

SLIDE 17

Graph-Based Document Representation – Preprocessing (Stop Word Removal)

TITLE

CNN.com International

Text

A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.

Links

Iraq bomb: Four dead, 110 wounded. FULL STORY.

SLIDE 18

Graph-Based Document Representation – Preprocessing (Stemming)

TITLE

CNN.com International

Text

A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqis Prime Minister Ibrahim al-Jaafari and his driver were killing in a driver shooting.

Links

Iraqis bomb: Four dead, 110 wounding. FULL STORY.

SLIDE 19

Standard Graph-Based Document Representation

The ten most frequent terms are used.

Frequency | Word
3 | Iraq
2 | Kill
2 | Bomb
2 | Wound
2 | Drive
1 | Explod
1 | Baghdad
1 | International
1 | CNN
1 | Car

[Example graph: nodes IRAQ, KILL, BOMB, WOUND, DRIVE, EXPLOD, BAGHDAD, CAR, CNN, INTERNATIONAL connected by edges labeled TX (text), L (link), and TI (title)]

SLIDE 20

Simple Graph-Based Document Representation

The ten most frequent terms are used.

Frequency | Word
3 | Iraq
2 | Kill
2 | Bomb
2 | Wound
2 | Drive
1 | Explod
1 | Baghdad
1 | International
1 | CNN
1 | Car

[Example graph: the same nodes connected by unlabeled edges]

SLIDE 21

Based on Schenker et al., 2005

GRAPH-BASED CATEGORIZATION AND CLUSTERING ALGORITHMS

SLIDE 22

“Lazy” Document Categorization with Graph-Based Models

  • The Basic k-Nearest Neighbors (k-NN) Algorithm

– Input: a set of labeled training documents, a query document d, and a parameter k defining the number of nearest neighbors to use
– Output: a label indicating the category of the query document d
– Step 1. Find the k nearest training documents to d according to a distance measure
– Step 2. Select the category of d to be the category held by the majority of the k nearest training documents

  • k-Nearest Neighbors with Graphs (Schenker et al., 2005)

– Represent the documents as graphs
– Use a graph-theoretical distance measure
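The two steps above can be sketched as a generic procedure; the toy distance and training data in the demo are illustrative, not from the slides:

```python
from collections import Counter

def knn_classify(training, query_graph, k, distance):
    """training: list of (graph, label) pairs. Returns the majority category
    among the k training graphs nearest to query_graph."""
    # Step 1: find the k nearest training documents to the query.
    neighbors = sorted(training, key=lambda pair: distance(pair[0], query_graph))[:k]
    # Step 2: majority vote over the neighbors' categories.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy demo: graphs as frozensets of edges, with symmetric-difference size as
# a stand-in for a real graph-theoretical distance measure.
d = lambda a, b: len(a ^ b)
train = [(frozenset({("a", "b")}), "news"),
         (frozenset({("a", "b"), ("b", "c")}), "news"),
         (frozenset({("x", "y")}), "sports")]
print(knn_classify(train, frozenset({("a", "b"), ("b", "d")}), k=3, distance=d))
```

Any of the graph distance measures defined on the following slides can be passed in as `distance` unchanged.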

SLIDE 23

Distance between two Graphs

  • Required properties

– (1) boundary condition: d(G1,G2) ≥ 0
– (2) identical graphs have zero distance: d(G1,G2) = 0 → G1 ≅ G2
– (3) symmetry: d(G1,G2) = d(G2,G1)
– (4) triangle inequality: d(G1,G3) ≤ d(G1,G2) + d(G2,G3)

SLIDE 24

Maximum Common Subgraph (mcs)

  • The graph G is a maximum common subgraph (mcs) of G1 and G2 if G is a common subgraph of G1 and G2 and there exists no other common subgraph G′ of G1 and G2 such that |G′| > |G|

[Example: two labeled graphs G1 and G2 and their maximum common subgraph G, where |G| = |V| + |E| = 2 + 1 = 3]

SLIDE 25

Minimum Common Supergraph (MCS)

  • The graph G is a minimum common supergraph (MCS) of G1 and G2 if G is a common supergraph of G1 and G2 and there exists no other common supergraph G′ of G1 and G2 such that |G′| < |G|

[Example: two labeled graphs G1 and G2 and their minimum common supergraph G, where |G| = |V| + |E| = 4 + 2 = 6]

SLIDE 26

MMCSN Distance between two Graphs

  • MMCSN measure (Schenker et al., 2005):

dMMCSN(G1,G2) = 1 − |mcs(G1,G2)| / |MCS(G1,G2)|

  • mcs(G1,G2) - maximum common subgraph
  • MCS(G1,G2) - minimum common supergraph

[Example: for graphs G1 and G2 with |mcs(G1,G2)| = 2 + 1 = 3 and |MCS(G1,G2)| = 5 + 4 = 9, dMMCSN(G1,G2) = 1 − 3/9 = 0.667]
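For document graphs whose nodes are uniquely labeled (one node per unique term), mcs and MCS reduce to set intersection and union, so the MMCSN distance can be sketched without NP-hard subgraph matching. This simplification is an assumption that holds for the representations described in these slides:

```python
def graph_size(nodes, edges):
    return len(nodes) + len(edges)          # |G| = |V| + |E|

def mcs_size(g1, g2):
    """Size of the maximum common subgraph of two uniquely-labeled graphs:
    the intersection of their node and edge sets."""
    return graph_size(g1[0] & g2[0], g1[1] & g2[1])

def MCS_size(g1, g2):
    """Size of the minimum common supergraph: the union of node and edge sets."""
    return graph_size(g1[0] | g2[0], g1[1] | g2[1])

def d_mmcsn(g1, g2):
    """MMCSN distance: 1 - |mcs(G1,G2)| / |MCS(G1,G2)|."""
    return 1.0 - mcs_size(g1, g2) / MCS_size(g1, g2)

# Graphs are (nodes, edges) pairs of sets; the example graphs are illustrative.
g1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C")})
g2 = ({"A", "B", "D"}, {("A", "B"), ("A", "D")})
print(d_mmcsn(g1, g2))
```

Identical graphs get distance 0 and disjoint graphs approach 1, matching the required boundary conditions.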

SLIDE 27

Other Distance Measures

  • Bunke and Shearer (1998):

dMCS(G1,G2) = 1 − |mcs(G1,G2)| / max(|G1|, |G2|)

  • Wallis et al. (2001):

dWGU(G1,G2) = 1 − |mcs(G1,G2)| / (|G1| + |G2| − |mcs(G1,G2)|)

  • Bunke (1997):

dUGU(G1,G2) = |G1| + |G2| − 2|mcs(G1,G2)|

  • Fernández and Valiente (2001):

dMMCS(G1,G2) = |MCS(G1,G2)| − |mcs(G1,G2)|
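Under the same uniquely-labeled-graph assumption (mcs/MCS reduce to set intersection and union), the four measures above can be sketched as:

```python
def size(g):
    return len(g[0]) + len(g[1])            # |G| = |V| + |E|

def mcs(g1, g2):
    return (g1[0] & g2[0], g1[1] & g2[1])   # maximum common subgraph

def MCS(g1, g2):
    return (g1[0] | g2[0], g1[1] | g2[1])   # minimum common supergraph

def d_mcs(g1, g2):        # Bunke and Shearer (1998)
    return 1.0 - size(mcs(g1, g2)) / max(size(g1), size(g2))

def d_wgu(g1, g2):        # Wallis et al. (2001)
    m = size(mcs(g1, g2))
    return 1.0 - m / (size(g1) + size(g2) - m)

def d_ugu(g1, g2):        # Bunke (1997)
    return size(g1) + size(g2) - 2 * size(mcs(g1, g2))

def d_mmcs(g1, g2):       # Fernández and Valiente (2001)
    return size(MCS(g1, g2)) - size(mcs(g1, g2))

# Graphs are (nodes, edges) pairs of sets; the example graphs are illustrative.
g1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C")})
g2 = ({"A", "B", "D"}, {("A", "B"), ("A", "D")})
print(d_mcs(g1, g2), d_wgu(g1, g2), d_ugu(g1, g2), d_mmcs(g1, g2))
```

Note that for uniquely-labeled graphs |MCS| = |G1| + |G2| − |mcs|, so dUGU and dMMCS coincide in this setting.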

SLIDE 28

k-Nearest Neighbors with Graphs

Sample Accuracy Results (Schenker et al., 2004)

Benchmark data set: K-series (Boley et al., 1999) - 2,340 web documents from 20 categories. Source: English news pages hosted at Yahoo!

[Chart: classification accuracy (70%-86%) vs. the number of nearest neighbors k (1-10) for the vector model (cosine and Jaccard) and for graphs with 40, 70, 100, and 150 nodes/graph; the best results with graphs exceed the best results with vectors]

SLIDE 29

k-Nearest Neighbors with Graphs

Average Time to Classify One Document

Method | Average time to classify one document
Vector (cosine) | 7.8 seconds
Vector (Jaccard) | 7.79 seconds
Graphs, 40 nodes/graph | 8.71 seconds
Graphs, 70 nodes/graph | 16.31 seconds
Graphs, 100 nodes/graph | 24.62 seconds

SLIDE 30

“Lazy” Document Categorization with Graph-Based Models

  • Advantages

– Keeps HTML structure information
– Retains the original order of words
– Outperforms the vector-space model with several distance measures

  • Limitation

– Can work only with “lazy” classifiers (such as k-NN), which have a very low classification speed

  • Conclusion

– Graph models cannot be used directly for fast, model-based classification of web documents (e.g., using a decision tree)

  • Solution

– The hybrid approach: represent a document as a vector of sub-graphs (in a few minutes…)

SLIDE 31

The Graph-Based k-Means Clustering Algorithm

Inputs: the set of n data items (represented by graphs) and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented by median graphs) and, for each data item, the cluster (an integer in [1,k]) it belongs to

Step 1. Assign each data item randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the median of the set of graphs of each cluster.
Step 3. Given the new medians, assign each data item to be in the cluster of its closest median, using a graph-theoretic distance measure.
Step 4. Re-compute the medians as in Step 2. Repeat Steps 3 and 4 until the medians do not change.

The median of a set of graphs S (Bunke et al., 2001) is a graph g ∈ S such that g has the lowest average distance to all elements in S:

g = argmin over s ∈ S of (1/|S|) Σ_{i=1..|S|} d(s, Gi)
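The median-graph step (Step 2) and the assignment step (Step 3) can be sketched as follows; the symmetric-difference distance in the demo is a stand-in for a real graph-theoretic distance:

```python
def median_graph(graphs, distance):
    """Return the graph g in `graphs` minimizing the average distance to all
    elements of the set (equivalently, the sum of distances)."""
    return min(graphs, key=lambda g: sum(distance(g, other) for other in graphs))

def assign_clusters(graphs, medians, distance):
    """Step 3 of the algorithm: the index of the closest median for each graph."""
    return [min(range(len(medians)), key=lambda j: distance(g, medians[j]))
            for g in graphs]

# Demo with graphs as frozensets of edges (illustrative data).
d = lambda a, b: len(a ^ b)
S = [frozenset({("a", "b")}),
     frozenset({("a", "b"), ("b", "c")}),
     frozenset({("a", "b"), ("b", "d")})]
print(median_graph(S, d))
```

Because the median is always a member of S, no graph needs to be synthesized, which is what makes k-means usable on graphs at all.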

SLIDE 32

Graph-Based Document Clustering – Comparative Evaluation (Dunn Index)

DI = dmin / dmax

dmin - the minimum distance between any two objects in different clusters
dmax - the maximum distance between any two items in the same cluster

[Chart: the best graph-based methods achieve a higher Dunn Index than the best vector-based method]
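The Dunn Index above can be sketched for any distance function; the 1-D demo data are illustrative:

```python
def dunn_index(clusters, dist):
    """clusters: list of lists of objects. DI = d_min / d_max, where d_min is
    the smallest between-cluster distance and d_max the largest
    within-cluster distance (higher is better)."""
    d_min = min(dist(a, b)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for a in ci for b in cj)
    d_max = max(dist(a, b) for c in clusters for a in c for b in c)
    return d_min / d_max

# Demo with 1-D points and absolute difference as the distance.
clusters = [[0.0, 1.0], [5.0, 6.0]]
print(dunn_index(clusters, lambda a, b: abs(a - b)))  # 4.0 / 1.0 = 4.0
```

For graph clusterings, `dist` would simply be one of the graph distance measures from the earlier slides.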

SLIDE 33

Presented in Markov et al., 2008

THE HYBRID APPROACH TO WEB DOCUMENT CATEGORIZATION

SLIDE 34

The Hybrid Approach to Document Categorization

(Markov et al., 2006)

  • Basic Idea

– Represent a document as a vector of sub-graphs
– Categorize documents with a model-based classifier (e.g., a decision tree), which is much faster than a “lazy” method

  • Naïve Approach

– Select sub-graphs that are most frequent in each category

  • Smart Approach

– Select sub-graphs that are more frequent in a specific category than in other categories

  • Smart Approach with Fixed Threshold

– Select sub-graphs that are frequent in a specific category and more frequent than in other categories

SLIDE 35

Predictive Model Induction with Hybrid Representation (Markov et al., 2006)

Pipeline: web or text documents (a training set of documents with known categories) → graph construction (documents in graph representation) → extraction of the sub-graphs relevant for classification → representation of all documents as vectors with Boolean values for every sub-graph in the set → feature selection (optional): identification of the best attributes (Boolean features) for classification → prediction model induction and extraction of classification rules → document classification
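The Boolean-vector encoding step of the pipeline can be sketched as follows; the sub-graph set and document graph in the demo are illustrative, and graphs are reduced to edge sets for brevity:

```python
def to_boolean_vector(document_edges, subgraphs):
    """subgraphs: list of edge sets already selected as relevant features.
    Returns one Boolean entry per sub-graph: 1 if the document graph
    contains that sub-graph, 0 otherwise."""
    return [int(sub <= document_edges) for sub in subgraphs]

# Illustrative selected sub-graphs and document graph (not from the slides).
subgraphs = [{("arab", "west")},
             {("arab", "west"), ("west", "bank")},
             {("iraq", "bomb")}]
doc = {("arab", "west"), ("west", "bank"), ("bank", "politic")}
print(to_boolean_vector(doc, subgraphs))  # [1, 1, 0]
```

The resulting fixed-size Boolean vectors can then be fed to any model-based classifier, such as C4.5 or Naïve Bayes.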

SLIDE 36

Frequent Subgraph Extraction Example

(based on the FSG algorithm by Kuramochi and Karypis, 2004)

[Diagram: sub-graphs over the terms Arab, West, Bank, and Politic are grown into larger candidates by one-edge extensions of a document graph]
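A hedged sketch of the first level of FSG-style mining - counting the support of single-edge sub-graphs across a document collection; real FSG then grows the frequent candidates by one-edge extensions, which is omitted here:

```python
from collections import Counter

def frequent_edges(document_graphs, min_support):
    """document_graphs: list of edge sets. Returns the single-edge sub-graphs
    whose support (number of documents containing them) is at least
    min_support."""
    support = Counter()
    for edges in document_graphs:
        for edge in set(edges):          # count each edge once per document
            support[edge] += 1
    return {e for e, s in support.items() if s >= min_support}

# Illustrative document graphs over the terms from the slide.
docs = [{("arab", "west"), ("west", "bank")},
        {("arab", "west"), ("arab", "politic")},
        {("arab", "west")}]
print(frequent_edges(docs, min_support=2))
```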

SLIDE 37

Comparative Evaluation

  • Benchmark Data Sets

– K-series (Source: Boley et al., 1999)

  • 2,340 documents and 20 categories
  • Documents in that collection were originally news pages hosted at Yahoo!

– U-series (Source: Craven et al., 1998)

  • 4,167 documents taken from the computer science departments of four different universities: Cornell, Texas, Washington, and Wisconsin
  • 7 major categories: course, faculty, students, project, staff, department, and other
  • Known as the “WebKB Dataset”

  • Dictionary construction

– The N most frequent words in each document were taken for vector / graph construction; that is, exactly the same words in each document were used for both the graph-based and the bag-of-words representations

SLIDE 38

Categorization Accuracy and Speed

[Charts: classification accuracy vs. the number of frequent terms used (20-100) for Bag-of-words, Hybrid Naïve, Hybrid Smart, and Hybrid Smart with Fixed Threshold, for C4.5 and NBC on the K-series and U-series data sets]

Classification speed: C4.5, K-series - 1.2 sec. per 1,000 documents; NBC, K-series - 0.3 sec. per 1,000 documents; C4.5, U-series - 125 sec. per 1,000 documents; NBC, U-series - 1.7 sec. per 1,000 documents

SLIDE 39

Percentage of Multi-Node Subgraphs

[Charts: the relative number of multi-node sub-graphs vs. the number of frequent terms used (20-100) for Hybrid Naïve, Hybrid Smart, and Hybrid Smart with Fixed Threshold, shown for C4.5 and NBC on the K-series and U-series data sets]

SLIDE 40

Litvak and Last (2008)

GRAPH-BASED KEYWORD EXTRACTION

SLIDE 41

Our Methodology

  • A keyword is a word that appears in the document summary
  • Document representation - the “simple” directed graph:

– Unique nodes - non-stop words
– Unlabeled edges - order relationship

  • A → B: B appears after A in the same sentence

  • Keyword extraction as a first stage of extractive summarization

– The most salient words (“keywords”) are extracted in order to generate a summary

SLIDE 42

The “Simple” Graph-Based Document Representation

Example:

Text

&lt;title&gt; Hurricane Gilbert Heads Toward Dominican Coast &lt;/title&gt;
&lt;TEXT&gt; Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. &lt;/TEXT&gt;

[Graph: nodes such as Hurricane, Gilbert, heads, swept, Dominican, Republic, Sunday, civil, defense, alerted, heavily, populated, south, coast, prepare, winds, heavy, rains, seas, storm, approaching, southeast, sustained, 75, mph, gusting, 92, connected by directed order-relationship edges]

SLIDE 43

Keyword Extraction - The Supervised Approach

  • Training a classification algorithm on a repository of summarized documents
  • Each node in a document graph belongs to one of two classes:

– YES - the word is included in the document extractive summary
– NO - otherwise

SLIDE 44

The Supervised approach (cont.)

The features used for node classification:

  • In Degree – number of incoming edges
  • Out Degree – number of outgoing edges
  • Degree – total number of edges
  • Frequency – term frequency of the word represented by the node
  • Frequent words distribution – ∈ {0, 1}, equals 1 iff Frequency ≥ threshold (0.05)
  • Location Score – the average of location scores over all sentences S(N) containing the word N represented by the node, where a sentence’s location score is the reciprocal of the sentence’s position in the text (1/i)
  • Tfidf Score – the tf-idf score of the word represented by the node
  • Headline Score – ∈ {0, 1}, equals 1 iff the document headline contains the word represented by the node
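Most of these features can be read directly off the document graph. The sketch below is illustrative: its argument conventions (a pre-computed relative frequency, 1-based sentence positions, and a headline word set) are assumptions, and the Tfidf Score is left out because it needs corpus-level statistics:

```python
def node_features(node, edges, frequency, sentence_positions, headline_words):
    """edges: set of (source, target) pairs; frequency: relative term
    frequency of `node`; sentence_positions: 1-based indices of the
    sentences containing the word; headline_words: set of headline terms."""
    in_degree = sum(1 for (s, t) in edges if t == node)
    out_degree = sum(1 for (s, t) in edges if s == node)
    return {
        "in_degree": in_degree,
        "out_degree": out_degree,
        "degree": in_degree + out_degree,
        "frequency": frequency,
        "frequent_word": int(frequency >= 0.05),   # threshold from the slide
        # Average of reciprocal sentence positions (1/i).
        "location_score": sum(1.0 / i for i in sentence_positions)
                          / len(sentence_positions),
        "headline_score": int(node in headline_words),
    }

# Illustrative fragment of the "Hurricane Gilbert" graph from the next slide.
edges = {("hurricane", "dominican"), ("toward", "dominican"),
         ("dominican", "republic"), ("dominican", "coast")}
f = node_features("dominican", edges, 2 / 27, [1, 2],
                  {"hurricane", "gilbert", "dominican", "coast"})
print(f)
```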

SLIDE 45

Feature Extraction

Example - Node “Dominican” (in the “Hurricane Gilbert” document graph):

  • In Degree = 2
  • Out Degree = 2
  • Degree = 4
  • Frequency = 2/27 = 0.074
  • Frequent words distribution = 1
  • Location Score = (1/1 + 1/2)/2 = 0.75
  • Tfidf Score = (0.07/1.07) * log2(566/2) = 0.53
  • Headline Score = 1

SLIDE 46

The unsupervised approach

  • Unsupervised text unit extraction in the context of the text summarization task
  • No collection of summarized documents is needed
  • We apply the HITS algorithm to document graphs

SLIDE 47

HITS

Kleinberg, J.M. 1999.

  • For each node, HITS produces two scores - an “authority” score and a “hub” score
  • For the total rank (H) calculation we used four combining functions
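HITS itself can be sketched as a simple power iteration over the directed document graph. The normalization scheme and fixed iteration count below are implementation choices, and the four combining functions from the slide are not reproduced here:

```python
def hits(nodes, edges, iterations=50):
    """edges: set of (source, target) pairs. Returns (authority, hub) dicts."""
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes pointing at n.
        auth = {n: sum(hub[s] for (s, t) in edges if t == n) for n in nodes}
        # Hub score: sum of the authority scores of the nodes n points at.
        hub = {n: sum(auth[t] for (s, t) in edges if s == n) for n in nodes}
        # Normalize by the maximum so the scores stay bounded.
        for scores in (auth, hub):
            m = max(scores.values()) or 1.0
            for n in scores:
                scores[n] /= m
    return auth, hub

# Illustrative fragment of a document graph.
nodes = {"hurricane", "gilbert", "dominican", "coast"}
edges = {("hurricane", "gilbert"), ("hurricane", "dominican"),
         ("gilbert", "dominican"), ("dominican", "coast")}
auth, hub = hits(nodes, edges)
print(max(auth, key=auth.get))
```

The top-ranked nodes by the combined score are then taken as the document keywords.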

SLIDE 48

Experimental Results

  • DUC 2002 collection:

– 566 English texts along with 2-3 summaries per document on average
– The size (|V|) of the syntactic graphs extracted from these texts is 196 on average, varying from 62 to 876

SLIDE 49

Comparison of Supervised and Unsupervised Approaches

  • We consider an unsupervised model based on extracting the top N ranked words for different values of 10 ≤ N ≤ 120
  • The set of the top 2 features - Frequent words distribution and In Degree - is used for NBC

SLIDE 50

SUMMARY

SLIDE 51

Selected Publications

  • A. Schenker, M. Last, H. Bunke, A. Kandel, "Classification of Web Documents Using Graph Matching", International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, 2004, pp. 475-496.
  • A. Schenker, H. Bunke, M. Last, A. Kandel, "Graph-Theoretic Techniques for Web Content Mining", World Scientific, 2005.
  • A. Markov, M. Last, "A Simple, Structure-Sensitive Approach for Web Document Classification", Atlantic Web Intelligence Conference (AWIC 2005), Lodz, Poland, June 2005.
  • A. Markov, M. Last, and A. Kandel, "Fast Categorization of Web Documents Represented by Graphs", in Advances in Web Mining and Web Usage Analysis, O. Nasraoui et al. (Eds.), Springer Lecture Notes in Computer Science (LNCS/LNAI), Vol. 4811, 2007, pp. 56-71.
  • A. Markov, M. Last, and A. Kandel, "The Hybrid Representation Model for Web Document Classification", International Journal of Intelligent Systems, Vol. 23, No. 6, pp. 654-679, 2008.
  • M. Litvak and M. Last, "Graph-Based Keyword Extraction for Single-Document Summarization", Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization (MMIES2), Manchester, UK, August 23, 2008, pp. 17-24.
SLIDE 52

Future Research

  • Enhancing graph representations of text and web documents

– Utilizing POS tagging
– Concept fusion based on available ontologies
– Implementing graph representations for more languages

  • Identification of the most relevant sections in long documents, online forums, etc.
  • Cross-lingual summarization of text documents
  • Topic detection and tracking in the web content
  • Opinion and sentiment mining
SLIDE 53

Hohentwiel, May 2008
