Model-Based Classification of Web Documents Represented by Graphs

Ben-Gurion University of the Negev

Model-Based Classification of Web Documents Represented by Graphs

Alex Markov and Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Abraham Kandel
National Institute for Applied Computational Intelligence, University of South Florida, Tampa, FL, USA

E-mail: mlast@bgu.ac.il
Home Page: http://www.ise.bgu.ac.il/faculty/mlast/

WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA

Content
• Introduction and Motivation
• Graph-based Representation of Web Documents
• The Hybrid Methodology for Web Document Representation and Classification
  – The Naïve Approach
  – The Smart Approach
  – The Smart Approach with Fixed Threshold
• Comparative Evaluation
• Conclusions and Future Research

Motivation
• Most Web document classification algorithms
  – Treat web documents the same way as text documents
    • HTML tags are completely ignored
• The popular Vector-Space model
  – Ignores the word position in the document
  – Ignores the order of words in the document
• Solution – structure-sensitive document representation
  – Graph representation in this research

Text Categorization (TC): Relevant Definitions
• TC – task of assigning a Boolean value from $\{T, F\}$ to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D = (d_1, \dots, d_{|D|})$ is the domain of documents and $C = (c_1, \dots, c_{|C|})$ is the set of pre-defined categories (classes)
• Single Label TC – only one category can be assigned to each document
• Multi Label TC – overlapping categories allowed
• Ranking categorization – degree of relevance of every document to each category is calculated

Graph Based Document Representation Example
[Figure: screenshot of the example web page. Source: www.cnn.com, 24/05/2005]

Graph Based Document Representation - Parsing
[Figure: the example page with three marked sections: title, link, text]

Graph Based Document Representation - Preprocessing
• Steps: stemming, stop word removal
• Title: CNN.com International
• Text: A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.
• Links: Iraq bomb: Four dead, 110 wounded. FULL STORY.

Graph Based Document Representation – Graph Construction

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

Edge labels: TX = Text, L = Link, TI = Title
[Figure: document graph with nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD, INTERNATIONAL and CNN; each edge carries the label of the section (TX, L or TI) in which the two words appear]
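The construction step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes that each unique stem becomes one node and that a directed edge, labeled with the section name, links every term to the term that immediately follows it in the same section. The function name `build_document_graph` and the token sequences are hypothetical and only loosely follow the example page, so the result is not the exact graph drawn on the slide.

```python
from collections import defaultdict

def build_document_graph(sections):
    """Build a labeled directed graph from already stemmed, stop-word-free tokens.

    sections maps a section label (e.g. "TI", "L", "TX") to its token list.
    Assumption: an edge a -> b means b immediately follows a in that section.
    """
    nodes = set()
    edges = defaultdict(set)  # (source, target) -> set of section labels
    for label, tokens in sections.items():
        nodes.update(tokens)  # every unique term is a single node
        for a, b in zip(tokens, tokens[1:]):
            if a != b:  # skip self-loops produced by a repeated word
                edges[(a, b)].add(label)
    return nodes, dict(edges)

# Hypothetical token sequences for the three sections of a page
sections = {
    "TI": ["cnn", "international"],
    "L": ["iraq", "bomb", "wound"],
    "TX": ["car", "bomb", "explod", "baghdad", "kill", "iraq", "wound"],
}
nodes, edges = build_document_graph(sections)
```

Because nodes are unique terms, the same word occurring in several sections is merged into one node, while the section labels on the edges keep the HTML structure information.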

Web Document Classification with Graph-Based Models
• Advantages (Schenker et al., 2004)
  – Keep HTML structure information
  – Retain original order of words
• Limitation
  – Can work only with “lazy” classifiers, which have a very low classification speed
    • Example: k-Nearest Neighbors classifier
• Conclusion
  – Graph models cannot be used directly for model-based classification of web documents
• Solution
  – The hybrid approach: represent a document as a vector of sub-graphs

Graph Based Document Representation – Subgraphs Extraction
• Naïve Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • $t_{\min}$ – threshold (minimum sub-graph frequency)
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find the frequent sub-graphs, i.e. those with Subgraph Class Frequency $SCF > t_{\min}$
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are frequent in a specific category
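A minimal Python sketch of the naïve selection rule follows. It assumes that SCF is the fraction of a class's training graphs that contain the sub-graph, and that each training document has already been reduced to the set of sub-graph identifiers it contains. The function name, the identifiers and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def naive_relevant_subgraphs(docs, t_min):
    """docs: list of (class_label, set_of_subgraph_ids). Returns the combined set."""
    by_class = defaultdict(list)
    for label, subgraphs in docs:
        by_class[label].append(subgraphs)

    relevant = set()
    for label, doc_sets in by_class.items():
        # Number of documents of this class that contain each sub-graph
        counts = Counter(sg for doc in doc_sets for sg in doc)
        for sg, count in counts.items():
            scf = count / len(doc_sets)  # assumed definition of SCF
            if scf > t_min:
                relevant.add(sg)  # combine all classes into one set
    return relevant

docs = [
    ("politics", {"arab-west", "west-bank"}),
    ("politics", {"arab-west"}),
    ("sports", {"goal-score"}),
    ("sports", {"goal-score", "arab-west"}),
]
relevant = naive_relevant_subgraphs(docs, t_min=0.6)
```

In this toy data "arab-west" is kept because it is frequent in the politics class, even though it also occurs in a sports document; the naïve rule looks at one class at a time.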

Graph Based Document Representation – Subgraphs Extraction
• Smart Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • $CR_{\min}$ – minimum Classification Rate
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find sub-graphs with $CR > CR_{\min}$
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are more frequent in a specific category than in other categories

Graph Based Document Representation – Subgraphs Extraction
• Smart with Fixed Threshold Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • $t_{\min}$ – threshold (minimum sub-graph frequency)
    • $CR_{\min}$ – minimum Classification Rate
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find sub-graphs with $SCF > t_{\min}$ and $CR > CR_{\min}$
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are frequent in a specific category and not frequent in other categories
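The two "smart" selection rules can be sketched with one Python function. The slides do not give the formula for the Classification Rate, so the `contrast_rate` function below is only a stand-in chosen for illustration (SCF in the class minus the highest SCF in any other class), not the authors' definition; a different measure can be passed through the `cr` parameter. All names and numbers are hypothetical.

```python
def contrast_rate(scf, label, sg):
    """Stand-in for CR (an assumption, not the authors' formula)."""
    others = [freqs.get(sg, 0.0) for c, freqs in scf.items() if c != label]
    return scf[label].get(sg, 0.0) - max(others, default=0.0)

def select_subgraphs(scf, cr_min, t_min=0.0, cr=contrast_rate):
    """scf: {class_label: {subgraph_id: frequency in that class}}.

    With t_min = 0.0 only the CR test is active (Smart method); a positive
    t_min adds the frequency test (Smart with Fixed Threshold).
    """
    selected = set()
    for label, freqs in scf.items():
        for sg, freq in freqs.items():
            if freq > t_min and cr(scf, label, sg) > cr_min:
                selected.add(sg)
    return selected

scf = {
    "politics": {"arab-west": 1.0, "west-bank": 0.5, "said-today": 0.9},
    "sports": {"goal-score": 1.0, "said-today": 0.8},
}
smart = select_subgraphs(scf, cr_min=0.3)
smart_fixed = select_subgraphs(scf, cr_min=0.3, t_min=0.6)
```

Under this stand-in, "said-today" is dropped by both variants because it is almost equally frequent in both classes, and "west-bank" is dropped only when the fixed frequency threshold is added.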

Predictive Model Induction with Hybrid Representation
• Web or text documents – set of documents with known category (training set)
• Graph construction – documents graph representation
• Sub-graph extraction – extraction of sub-graphs relevant for classification
• Text representation – representation of all documents as vectors with boolean values for every sub-graph in the set
• Feature selection – identification of best attributes (boolean features) for classification
• Creation of prediction model – finally, prediction model construction and extraction of document classification rules
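The vector representation step can be illustrated with a short Python sketch. It assumes that a graph is stored as a set of directed edges between unique node labels, so that "the document contains this sub-graph" reduces to a set-inclusion test; sub-graphs consisting of a single node are left out for brevity. The function name and the sample data are hypothetical.

```python
def to_boolean_vector(doc_edges, features):
    """One boolean per classification-relevant sub-graph, in the order of features.

    doc_edges: set of (source, target) edges of the document graph.
    features: list of frozensets of edges (the selected sub-graphs).
    """
    return [feature <= doc_edges for feature in features]

features = [
    frozenset({("arab", "west")}),
    frozenset({("arab", "west"), ("west", "bank")}),
    frozenset({("goal", "score")}),
]
doc_edges = {("arab", "west"), ("west", "bank"), ("bank", "politic")}
vector = to_boolean_vector(doc_edges, features)
```

Vectors built this way have a fixed length, which is what allows a model-based classifier to be trained on them instead of on the graphs themselves.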

Frequent Subgraphs Extraction: Notations

Notation     Description
G            Set of document graphs
t_min        Subgraph frequency threshold
k            Number of edges in the graph
g            Single graph
sg           Single subgraph
sg_k         Subgraph with k edges
F_k          Set of frequent subgraphs with k edges
E_k          Set of extension subgraphs with k edges
C_k          Set of candidate subgraphs with k edges

Frequent Subgraphs Extraction: Algorithm
(based on the FSG algorithm by Kuramochi and Karypis, 2004)

F_0 ← Detect all frequent 1-node subgraphs (nodes) in G
k ← 1
While F_{k-1} ≠ Ø Do
    For Each subgraph sg_{k-1} ∈ F_{k-1} Do
        For Each graph g ∈ G Do
            If sg_{k-1} is subgraph of g Then
                E_k ← Detect all possible k-edge extensions of sg_{k-1} in g
                For Each subgraph sg_k ∈ E_k Do
                    If sg_k already a member of C_k Then
                        {sg_k ∈ C_k}.Count++
                    Else
                        sg_k.Count ← 1
                        C_k ← sg_k
    F_k ← {sg_k in C_k | sg_k.Count > t_min * |G|}
    k++
Return F_1, F_2, … F_{k-2}
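The level-wise search above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. It assumes graphs with unique node labels stored as a pair of sets (nodes, edges), and it counts each candidate once per graph that contains it, rather than once per extension found. The function name and the three toy graphs are hypothetical.

```python
from collections import defaultdict

def frequent_subgraphs(graphs, t_min):
    """graphs: list of (node_set, edge_set); edges are (source, target) pairs.

    Returns all frequent subgraphs with at least one edge as
    (frozenset_of_nodes, frozenset_of_edges) pairs.
    """
    min_count = t_min * len(graphs)

    # F_0: frequent single nodes
    node_support = defaultdict(int)
    for nodes, _ in graphs:
        for n in nodes:
            node_support[n] += 1
    frequent = {(frozenset([n]), frozenset())
                for n, c in node_support.items() if c > min_count}

    result = []
    while frequent:  # While F_{k-1} is not empty
        support = defaultdict(set)  # candidate -> indices of graphs containing it
        for sg_nodes, sg_edges in frequent:
            for i, (nodes, edges) in enumerate(graphs):
                if not (sg_nodes <= nodes and sg_edges <= edges):
                    continue  # sg_{k-1} is not a subgraph of g
                for a, b in edges - sg_edges:
                    if a in sg_nodes or b in sg_nodes:  # edge touches sg_{k-1}
                        candidate = (sg_nodes | {a, b}, sg_edges | {(a, b)})
                        support[candidate].add(i)
        frequent = {c for c, ids in support.items() if len(ids) > min_count}
        result.extend(frequent)
    return result

graphs = [
    ({"arab", "west", "bank", "politic"},
     {("arab", "west"), ("west", "bank"), ("arab", "politic")}),
    ({"arab", "west", "bank"}, {("arab", "west"), ("west", "bank")}),
    ({"arab", "politic"}, {("arab", "politic")}),
]
found = frequent_subgraphs(graphs, t_min=0.5)
```

With three graphs and a threshold of 0.5, a subgraph must occur in at least two graphs; the search stops at the level where no extension reaches that support.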

Frequent Subgraphs Extraction: Complexity

Subgraph isomorphism
Isomorphism between graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and part of graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
1. Determine that $V_1 \subseteq V_2$ – $O(|V_1| \cdot |V_2|)$
2. Determine that $E_1 \subseteq E_2$ – $O(|V_1|^2)$
Total complexity: $O(|V_1| \cdot |V_2| + |V_1|^2) \le O(|V_2|^2)$

Graph isomorphism
Isomorphism between graphs $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
1. Determine $G_1 \subseteq G_2$ – $O(|V_2|)$
2. Determine $G_2 \subseteq G_1$ – $O(|V_2|)$
Total complexity: $O(|V_2|)$
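Because every node label is unique within a graph, both checks reduce to set comparisons. A minimal Python sketch, assuming a graph is a pair (nodes, edges) and that each edge is stored as a (source, target, section_label) tuple; this storage format and the function names are illustrative assumptions:

```python
def is_subgraph(g1, g2):
    """True if g1 is contained in g2: V1 is a subset of V2 and E1 of E2."""
    (v1, e1), (v2, e2) = g1, g2
    return v1 <= v2 and e1 <= e2

def is_isomorphic(g1, g2):
    """With unique node labels, two graphs match when each contains the other."""
    return is_subgraph(g1, g2) and is_subgraph(g2, g1)

small = ({"arab", "west"}, {("arab", "west", "TX")})
large = ({"arab", "west", "bank"},
         {("arab", "west", "TX"), ("west", "bank", "TX")})
```

The bounds on the slide correspond to comparing elements pairwise; with hash-based sets, as used here, the average cost of each inclusion test is typically lower.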

Frequent Subgraph Extraction Example
[Figure: three panels, "Subgraphs", "Document Graph" and "Extensions", built from the nodes Arab, West, Bank and Politic]
