

SLIDE 1

Model-Based Classification of Web Documents Represented by Graphs

Alex Markov and Mark Last

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Abraham Kandel

National Institute for Applied Computational Intelligence, University of South Florida, Tampa, FL, USA

WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA

E-mail: mlast@bgu.ac.il Home Page: http://www.ise.bgu.ac.il/faculty/mlast/

Ben-Gurion University of the Negev

SLIDE 2

8/27/2006 WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA 2

Content

  • Introduction and Motivation
  • Graph-based Representation of Web Documents
  • The Hybrid Methodology for Web Document Representation and Classification
    – The Naïve Approach
    – The Smart Approach
    – The Smart Approach with Fixed Threshold

  • Comparative Evaluation
  • Conclusions and Future Research

SLIDE 3

Motivation

  • Most Web document classification algorithms
    – Treat web documents the same way as text documents
      • HTML tags are completely ignored
  • The popular Vector-Space model
    – Ignores the word position in the document
    – Ignores the order of words in the document
  • Solution – structure-sensitive document representation
    – Graph representation in this research

SLIDE 4

Text Categorization (TC)

Relevant Definitions

  • TC – the task of assigning a Boolean {T, F} value to each pair $(d_j, c_i) \in D \times C$, where $D = (d_1, \dots, d_{|D|})$ is the domain of documents and $C = (c_1, \dots, c_{|C|})$ is the set of pre-defined categories (classes)
  • Single Label TC – only one category can be assigned to each document
  • Multi Label TC – overlapping categories allowed
  • Ranking categorization
    – Degree of relevance of every document to each category is calculated
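The three categorization variants defined on this slide can be contrasted with a minimal sketch. The relevance scores, the `cutoff` parameter, and all function names are assumptions made for illustration; the slides do not say how scores are produced.

```python
from typing import Dict, List

# Hypothetical relevance scores for one document: category -> degree of relevance.
Scores = Dict[str, float]


def single_label(scores: Scores) -> str:
    """Single-label TC: exactly one category per document (the top-scoring one)."""
    return max(scores, key=scores.get)


def multi_label(scores: Scores, cutoff: float) -> List[str]:
    """Multi-label TC: every category whose score reaches the assumed cutoff."""
    return sorted(c for c, s in scores.items() if s >= cutoff)


def ranking(scores: Scores) -> List[str]:
    """Ranking categorization: categories ordered by decreasing relevance."""
    return sorted(scores, key=scores.get, reverse=True)


doc_scores = {"course": 0.7, "faculty": 0.2, "project": 0.6}
print(single_label(doc_scores))
print(multi_label(doc_scores, cutoff=0.5))
print(ranking(doc_scores))
```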

SLIDE 5

Graph Based Document Representation

Example – Source: www.cnn.com, 24/05/2005

SLIDE 6

Graph Based Document Representation – Parsing

[Figure: the example page with its title, link, and text sections marked]

SLIDE 7

Graph Based Document Representation – Preprocessing

Title

CNN.com International

Text

A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.

Links

Iraq bomb: Four dead, 110 wounded. FULL STORY.

Preprocessing steps: stop word removal, stemming

SLIDE 8

Graph Based Document Representation – Graph Construction

[Figure: document graph with nodes IRAQ, CNN, KILL, DRIVE, BOMB, EXPLOD, CAR, BAGHDAD, INTERNATIONAL, WOUND; edges are labeled by section: TI = Title, L = Link, TX = Text]

Word            Frequency
Iraq            3
Kill            2
Car             1
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
CNN             1
International   1
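A minimal sketch of how such a graph could be built from preprocessed sections. The slides do not spell out the edge rule, so the code assumes one: a directed edge runs from each term to the term that immediately follows it in the same section, and carries that section's label. The function name and the shortened term lists are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source term, target term, section label)


def build_document_graph(sections: Dict[str, List[str]]) -> Tuple[Set[str], Set[Edge]]:
    """Build a directed graph whose nodes are the unique terms of a document.

    `sections` maps a section label (TI, L, TX) to its stemmed, stop-word-free
    terms in reading order. Assumed edge rule: consecutive terms are linked.
    """
    nodes: Set[str] = set()
    edges: Set[Edge] = set()
    for label, terms in sections.items():
        nodes.update(terms)  # each distinct term becomes exactly one node
        for first, second in zip(terms, terms[1:]):
            if first != second:  # assumption: no self-loops
                edges.add((first, second, label))
    return nodes, edges


# Shortened, hand-stemmed version of the CNN example.
sections = {
    "TI": ["cnn", "international"],
    "L": ["iraq", "bomb"],
    "TX": ["car", "bomb", "explod", "baghdad"],
}
nodes, edges = build_document_graph(sections)
print(sorted(nodes))
print(sorted(edges))
```

Because "bomb" occurs in both the link and the text, it is a single node with edges carrying different section labels, which is how the representation keeps HTML structure.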

SLIDE 9

Web Document Classification with Graph-Based Models

  • Advantages (Schenker et al., 2004)
    – Keep HTML structure information
    – Retain original order of words
  • Limitation
    – Can work only with “lazy” classifiers, which have a very low classification speed
      • Example: k-Nearest Neighbors classifier
  • Conclusion
    – Graph models cannot be used directly for model-based classification of web documents
  • Solution
    – The hybrid approach: represent a document as a vector of sub-graphs

SLIDE 10

Graph Based Document Representation – Subgraphs Extraction

  • Naïve Method
    – Input:
      • G – training set of directed graphs with unique nodes
      • t_min – threshold (minimum sub-graph frequency)
    – Output:
      • Set of classification-relevant sub-graphs
    – Process:
      • For each class, find frequent sub-graphs with SCF > t_min, where SCF is the Subgraph Class Frequency
      • Combine all sub-graphs into one set
  • Classification-relevant sub-graphs are frequent in a specific category
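The Naïve method can be sketched as follows. The sketch assumes that SCF is the fraction of a class's training graphs that contain the sub-graph, that every graph is reduced to its set of edges, and that a list of candidate sub-graphs is already available. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
Subgraph = FrozenSet[Edge]


def subgraph_class_frequency(sg: Subgraph, class_docs: List[Set[Edge]]) -> float:
    """Assumed SCF: share of the class's document graphs that contain sg."""
    return sum(1 for doc in class_docs if sg <= doc) / len(class_docs)


def naive_extraction(
    docs_by_class: Dict[str, List[Set[Edge]]],
    candidates: List[Subgraph],
    t_min: float,
) -> Set[Subgraph]:
    """Keep every candidate that is frequent (SCF > t_min) in at least one class."""
    relevant: Set[Subgraph] = set()
    for class_docs in docs_by_class.values():
        for sg in candidates:
            if subgraph_class_frequency(sg, class_docs) > t_min:
                relevant.add(sg)  # the per-class results are merged into one set
    return relevant


ab = frozenset({("a", "b")})
bc = frozenset({("b", "c")})
xy = frozenset({("x", "y")})
docs_by_class = {
    "news": [{("a", "b")}, {("a", "b"), ("b", "c")}],
    "sport": [{("x", "y")}, {("x", "y")}],
}
selected = naive_extraction(docs_by_class, [ab, bc, xy], t_min=0.6)
print(selected == {ab, xy})
```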

SLIDE 11

Graph Based Document Representation – Subgraphs Extraction

  • Smart Method
    – Input:
      • G – training set of directed graphs with unique nodes
      • CR_min – minimum Classification Rate
    – Output:
      • Set of classification-relevant sub-graphs
    – Process:
      • For each class, find sub-graphs with CR > CR_min
      • Combine all sub-graphs into one set
  • Classification-relevant sub-graphs are more frequent in a specific category than in other categories
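The slides do not give the formula for the Classification Rate (CR). The sketch below therefore uses an assumed form, chosen only because it rewards sub-graphs that are frequent inside a class and rare outside it: the in-class frequency multiplied by a logarithmic inverse frequency over the other classes. It should be read as a stand-in, not as the authors' definition. It also assumes at least two classes.

```python
import math
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
Subgraph = FrozenSet[Edge]
Corpus = Dict[str, List[Set[Edge]]]


def classification_rate(sg: Subgraph, target: str, corpus: Corpus) -> float:
    """Assumed CR: in-class frequency times inverse frequency in other classes."""
    in_class = corpus[target]
    scf = sum(1 for doc in in_class if sg <= doc) / len(in_class)
    others = [doc for cls, docs in corpus.items() if cls != target for doc in docs]
    hits = sum(1 for doc in others if sg <= doc)
    # Assumption: a sub-graph never seen elsewhere gets a boosted factor.
    isf = math.log2(len(others) / hits) if hits else math.log2(2 * len(others))
    return scf * isf


def smart_extraction(corpus: Corpus, candidates: List[Subgraph], cr_min: float) -> Set[Subgraph]:
    """Keep candidates whose assumed CR exceeds cr_min for at least one class."""
    return {
        sg
        for target in corpus
        for sg in candidates
        if classification_rate(sg, target, corpus) > cr_min
    }


ab = frozenset({("a", "b")})
bc = frozenset({("b", "c")})
xy = frozenset({("x", "y")})
corpus = {
    "news": [{("a", "b")}, {("a", "b"), ("b", "c")}],
    "sport": [{("x", "y")}, {("x", "y"), ("b", "c")}],
}
print(smart_extraction(corpus, [ab, bc, xy], cr_min=1.0) == {ab, xy})
```

In this toy corpus the sub-graph `bc` appears in both classes, so its assumed CR stays low and it is dropped, while `ab` and `xy` are each confined to one class and are kept.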

SLIDE 12

Graph Based Document Representation – Subgraphs Extraction

  • Smart with Fixed Threshold Method
    – Input:
      • G – training set of directed graphs with unique nodes
      • t_min – threshold (minimum sub-graph frequency)
      • CR_min – minimum Classification Rate
    – Output:
      • Set of classification-relevant sub-graphs
    – Process:
      • For each class, find sub-graphs with SCF > t_min and CR > CR_min
      • Combine all sub-graphs into one set
  • Classification-relevant sub-graphs are frequent in a specific category and not frequent in other categories
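
The difference from the other two methods is only that both conditions must hold at once. A minimal sketch, assuming the per-class SCF and CR values have already been computed and are passed in as a hypothetical `stats` mapping (the sub-graph names and numbers below are made up):

```python
from typing import Dict, Set, Tuple

# Hypothetical precomputed statistics: class -> sub-graph id -> (SCF, CR).
Stats = Dict[str, Dict[str, Tuple[float, float]]]


def smart_fixed_threshold(stats: Stats, t_min: float, cr_min: float) -> Set[str]:
    """Keep sub-graphs that pass both thresholds within the same class."""
    relevant: Set[str] = set()
    for per_class in stats.values():
        for sg, (scf, cr) in per_class.items():
            if scf > t_min and cr > cr_min:
                relevant.add(sg)
    return relevant


stats = {
    "news": {"arab->politic": (0.8, 1.6), "arab->bank": (0.1, 1.9)},
    "sport": {"team->win": (0.7, 0.3), "arab->politic": (0.2, 0.1)},
}
print(smart_fixed_threshold(stats, t_min=0.5, cr_min=1.0))
```

Here `arab->bank` fails the frequency threshold and `team->win` fails the classification-rate threshold, so only `arab->politic` survives.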

SLIDE 13

Predictive Model Induction with Hybrid Representation

  1. Web or text documents: set of documents with known category (training set)
  2. Graph construction: documents' graph representation
  3. Sub-graph extraction: extraction of sub-graphs relevant for classification
  4. Text representation: representation of all documents as vectors with Boolean values for every sub-graph in the set
  5. Feature selection: identification of best attributes (Boolean features) for classification
  6. Creation of prediction model: finally, prediction model construction and extraction of classification rules
  7. Document classification rules
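
The text-representation step of this pipeline can be sketched as follows: each document graph becomes a Boolean vector with one entry per extracted sub-graph. The sketch assumes graphs are reduced to sets of edges over unique node labels, so containment is a set-inclusion test; the function name is hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]


def to_boolean_vectors(
    doc_graphs: List[Set[Edge]], subgraphs: List[FrozenSet[Edge]]
) -> List[List[bool]]:
    """One row per document, one Boolean column per classification-relevant sub-graph."""
    return [[sg <= doc for sg in subgraphs] for doc in doc_graphs]


doc_graphs = [
    {("arab", "politic"), ("arab", "bank")},
    {("west", "arab")},
]
subgraphs = [
    frozenset({("arab", "politic")}),
    frozenset({("west", "arab")}),
    frozenset({("arab", "politic"), ("arab", "bank")}),
]
vectors = to_boolean_vectors(doc_graphs, subgraphs)
print(vectors)
```

The resulting rows are ordinary feature vectors, so they can be passed to feature selection and then to any model-based classifier, such as the decision tree and Naïve Bayes models evaluated later in the deck.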

SLIDE 14

Frequent Subgraphs Extraction: Notations

Notation   Description
G          Set of document graphs
t_min      Subgraph frequency threshold
k          Number of edges in the graph
g          Single graph
sg         Single subgraph
sg_k       Subgraph with k edges
F_k        Set of frequent subgraphs with k edges
E_k        Set of extension subgraphs with k edges
C_k        Set of candidate subgraphs with k edges

SLIDE 15

Frequent Subgraphs Extraction: Algorithm

(based on the FSG algorithm by Kuramochi and Karypis, 2004)

    F_0 ← Detect all frequent 1-node subgraphs (nodes) in G
    k ← 1
    While F_(k-1) ≠ Ø Do
        For Each subgraph sg_(k-1) ∈ F_(k-1) Do
            For Each graph g ∈ G Do
                If sg_(k-1) is subgraph of g Then
                    E_k ← Detect all possible k-edge extensions of sg_(k-1) in g
                    For Each subgraph sg_k ∈ E_k Do
                        If sg_k already a member of C_k Then
                            {sg_k ∈ C_k}.Count++
                        Else
                            sg_k.Count ← 1
                            C_k ← sg_k
        F_k ← {sg_k in C_k | sg_k.Count > t_min * |G|}
        k++
    Return F_1, F_2, … F_(k-2)
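
A runnable sketch of the level-wise procedure above, under stated assumptions: node labels are unique within a graph, a graph is a pair (set of nodes, set of labeled edges), and an extension adds one edge of g that touches a node already in the sub-graph. One deliberate difference from the pseudocode: support is tracked as the set of graphs containing a candidate, so a candidate reached from two different parents in the same graph is counted once. Names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, section label)
Graph = Tuple[Set[str], Set[Edge]]
Subgraph = Tuple[FrozenSet[str], FrozenSet[Edge]]


def frequent_subgraphs(graphs: List[Graph], t_min: float) -> Dict[int, Set[Subgraph]]:
    """Return {k: frequent sub-graphs with k edges}; k = 0 holds single nodes."""
    min_count = t_min * len(graphs)

    # F_0: nodes that occur in more than t_min * |G| graphs.
    node_count: Dict[str, int] = defaultdict(int)
    for g_nodes, _ in graphs:
        for node in g_nodes:
            node_count[node] += 1
    frequent: Dict[int, Set[Subgraph]] = {
        0: {(frozenset([n]), frozenset()) for n, c in node_count.items() if c > min_count}
    }

    k = 1
    while frequent[k - 1]:
        support: Dict[Subgraph, Set[int]] = defaultdict(set)  # candidate -> graph ids
        for sg_nodes, sg_edges in frequent[k - 1]:
            for idx, (g_nodes, g_edges) in enumerate(graphs):
                # Unique node labels make the sub-graph test a set inclusion.
                if not (sg_nodes <= g_nodes and sg_edges <= g_edges):
                    continue
                for edge in g_edges - sg_edges:
                    src, dst, _ = edge
                    if src in sg_nodes or dst in sg_nodes:  # keep it connected
                        candidate = (sg_nodes | {src, dst}, sg_edges | {edge})
                        support[candidate].add(idx)
        frequent[k] = {sg for sg, ids in support.items() if len(ids) > min_count}
        k += 1
    return {size: sgs for size, sgs in frequent.items() if sgs}


graphs = [
    ({"arab", "bank", "politic"}, {("arab", "bank", "TX"), ("arab", "politic", "TX")}),
    ({"arab", "politic", "west"}, {("arab", "politic", "TX"), ("west", "arab", "TX")}),
    ({"arab", "politic"}, {("arab", "politic", "TX")}),
]
result = frequent_subgraphs(graphs, t_min=0.5)
print(result[1])
```

With three graphs and `t_min=0.5`, a sub-graph must occur in at least two graphs; only the single edge from "arab" to "politic" qualifies at k = 1, and no two-edge extension is frequent.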

SLIDE 16

Frequent Subgraphs Extraction: Complexity

Subgraph isomorphism

Isomorphism between graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and part of graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:

1. Determine that $V_1 \subseteq V_2$: $O(|V_1| \cdot |V_2|)$
2. Determine that $E_1 \subseteq E_2$: $O(|V_1|^2)$

Total complexity: $O(|V_1| \cdot |V_2| + |V_1|^2) \le O(|V_2|^2)$

Graph isomorphism

Isomorphism between graphs $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:

1. Determine $G_1 \subseteq G_2$: $O(|V_2|)$
2. Determine $G_2 \subseteq G_1$: $O(|V_2|)$

Total complexity: $O(|V_2|)$
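
Both checks reduce to set comparisons because every node label is unique within a graph. A minimal sketch, assuming a graph is stored as a pair of Python sets (nodes and edges); the function names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]
Graph = Tuple[Set[str], Set[Edge]]


def is_subgraph(g1: Graph, g2: Graph) -> bool:
    """True if g1 is contained in g2: node subset first, then edge subset."""
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    return nodes1 <= nodes2 and edges1 <= edges2


def is_same_graph(g1: Graph, g2: Graph) -> bool:
    """With unique node labels, isomorphism is mutual containment."""
    return is_subgraph(g1, g2) and is_subgraph(g2, g1)


small = ({"arab", "politic"}, {("arab", "politic")})
large = ({"arab", "politic", "bank"}, {("arab", "politic"), ("arab", "bank")})
print(is_subgraph(small, large), is_subgraph(large, small), is_same_graph(small, small))
```

No search over node mappings is needed here, which is what keeps the cost polynomial; for general graphs with repeated labels, subgraph isomorphism is a much harder problem.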

SLIDE 17

Frequent Subgraph Extraction Example

[Figure: a document graph over the nodes Arab, Bank, Politic, and West, shown with its subgraphs and their extensions]

SLIDE 18

Comparative Evaluation

  • Benchmark Data Sets
    – K-series
      • 2,340 documents and 20 categories
      • Documents in those collections were originally news pages hosted at Yahoo
    – U-series
      • 4,167 documents taken from the computer science departments of four different universities: Cornell, Texas, Washington, and Wisconsin
      • 7 major categories: course, faculty, students, project, staff, department, and other
  • Dictionary construction
    – The N most frequent words in each document were taken for vector / graph construction; that is, exactly the same words in each document were used for both the graph-based and the bag-of-words representations
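
The dictionary-construction step can be sketched with the standard library. The sketch assumes the document has already been reduced to stemmed, stop-word-free terms; `top_terms` is a hypothetical helper name, and tie-breaking between equally frequent terms is left to `Counter`.

```python
from collections import Counter
from typing import List


def top_terms(stems: List[str], n: int) -> List[str]:
    """Return the n most frequent terms of one document, most frequent first."""
    return [term for term, _ in Counter(stems).most_common(n)]


stems = ["iraq", "kill", "iraq", "bomb", "kill", "iraq", "car"]
print(top_terms(stems, 2))
```

Feeding the same term list to both the graph builder and the bag-of-words vectorizer is what makes the accuracy comparison between the two representations fair.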

SLIDE 19

Classification Results with C4.5 – K-series data set

[Chart: Accuracy Comparison for C4.5, K-series. x-axis: Frequent Terms Used (20 to 100); y-axis: Classification Accuracy (65% to 80%); series: Bag-of-words, Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

SLIDE 20

Classification Results with C4.5 – U-series data set

[Chart: Accuracy Comparison for C4.5, U-series. x-axis: Frequent Terms Used (20 to 100); y-axis: Classification Accuracy (70% to 85%); series: Bag-of-words, Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

Classification Speed: 0.3 sec. per 1,000 documents
Classification Speed: 1.7 sec. per 1,000 documents

SLIDE 21

Classification Results with Naïve Bayes – K-series data set

[Chart: Accuracy Comparison for NBC, K-series. x-axis: Frequent Terms Used (20 to 100); y-axis: Classification Accuracy (50% to 80%); series: Bag-of-words, Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

SLIDE 22

Classification Results with Naïve Bayes – U-series data set

[Chart: Accuracy Comparison for NBC, U-series. x-axis: Frequent Terms Used (20 to 100); y-axis: Classification Accuracy (50% to 80%); series: Bag-of-words, Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

Classification Speed: 1.2 sec. per 1,000 documents
Classification Speed: 125 sec. per 1,000 documents

SLIDE 23

Percentage of Multi-node Subgraphs

[Charts: Relative Number of Multi Node Graphs, four panels. x-axis: Frequent Terms Used (20 to 100); y-axis: Multi Node Graphs (%); series: Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

  • C4.5, K-series (y-axis range 20% to 100%)
  • NBC, K-series (y-axis range 20% to 100%)
  • C4.5, U-series (y-axis range 0% to 40%)
  • NBC, U-series (y-axis range 0% to 40%)

SLIDE 24

Summary

  • Different document representations were empirically compared in terms of classification accuracy and execution time
  • The proposed hybrid methods were found to be more accurate in most cases and generally much faster than their vector-space and graph-based counterparts

SLIDE 25

Future research

  • Finding optimal parameters for sub-graph extraction:
    – Graph size N
    – t_min for Naïve extraction
    – CR_min for Smart extraction
  • Applying the hybrid methodology to additional classifiers
  • Applying the hybrid methodology to unsupervised learning (clustering)

SLIDE 27

Selected References

  • M. Kuramochi and G. Karypis, "An Efficient Algorithm for Discovering Frequent Subgraphs", IEEE Transactions on Knowledge and Data Engineering, Volume 16, Issue 9, September 2004.
  • A. Schenker, M. Last, H. Bunke, A. Kandel, "Classification of Web Documents Using Graph Matching", International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, 2004.
  • A. Schenker, H. Bunke, M. Last, A. Kandel, "Graph-Theoretic Techniques for Web Content Mining", World Scientific, 2005.
  • A. Markov, M. Last, "A Simple, Structure-Sensitive Approach for Web Document Classification", Atlantic Web Intelligence Conference (AWIC2005), Lodz, Poland, June 2005.
  • A. Markov and M. Last, "Efficient Graph-Based Representation of Web Documents", Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS2005), October 7, 2005, Porto, Portugal.
  • M. Last, A. Markov, and A. Kandel, "Multi-Lingual Detection of Terrorist Content on the Web", Proceedings of the PAKDD'06 International Workshop on Intelligence and Security Informatics (WISI'06), Singapore, April 9, 2006.