2nd Annual Conference of the ICT for EU-India Cross Cultural Dissemination
Valencia, Spain, November 14-15, 2005
Extending Decision Trees for Web Categorisation
Multiparadigm Inductive Programming group (Extensions of Logic Programming group,
Two main objectives:
More and more inductive techniques are needed:
Nowadays, the Web is the most important source for
Web information has special characteristics:
Specific tools are needed to help us handle such variety and
Data mining (or more academically KDD) aims at discovering
Web mining aims at discovering relevant knowledge from the
Web mining is classified into:
Web documents are especially difficult for classical DM
Web mining adapts classical DM techniques or develops
Categorisation aims at finding one or more categories (from a
When the number of possible categories is not very high, a
Some simple approaches to Web document
But this information is also relevant!
Special predicates: has_word(), has_anchor_word(), link_to()
Content information: text + title (bags of words)
Content information: text + title
Structure information: anchor words
Content information: keywords + some text (a few bags of words)
Structure information: hyperlinks
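The three special predicates named above can be illustrated with a small sketch. The `Page` class, the toy URLs and the reading of `has_anchor_word()` as "a word in the anchor text of the page's outgoing links" are assumptions made for illustration; the slides do not define the predicates in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical container for one web page (not from the slides)."""
    url: str
    words: set = field(default_factory=set)          # content: text + title
    anchor_words: set = field(default_factory=set)   # structure: anchor text
    links: set = field(default_factory=set)          # structure: hyperlinks


# Toy collection with invented URLs.
PAGES = {
    "a.example/sports": Page(
        "a.example/sports",
        {"soccer", "league"},
        {"scorers"},
        {"a.example/scorers"},
    ),
    "a.example/scorers": Page("a.example/scorers", {"scorers", "classification"}),
}


def has_word(url, word):
    """True if the word occurs in the page content."""
    return word in PAGES[url].words


def has_anchor_word(url, word):
    """True if the word occurs in the anchor text recorded for the page."""
    return word in PAGES[url].anchor_words


def link_to(source, target):
    """True if the source page has a hyperlink to the target page."""
    return target in PAGES[source].links
```

Keeping content (`words`) and structure (`anchor_words`, `links`) in separate fields mirrors the content/structure distinction drawn in the list above.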
Our proposal:
Temp. | Wind | Sky | Sail?
Hot | Strong | Sunny | Yes
Hot | Weak | Sunny | No
Warm | Strong | Rainy | No
Cold | Strong | Rainy | No
Cold | Weak | Rainy | No
Hot | Strong | Cloudy | Yes
ID3, C4.5, …
[Figure: decision tree; branch labels: sunny, rainy, cloudy, strong, weak]
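As a worked illustration of how an ID3-style learner would pick the first split on the sailing data above, the following sketch computes the information gain of each attribute. The function names are illustrative; the slides only name the algorithms (ID3, C4.5) and do not show this computation.

```python
from collections import Counter
from math import log2

# Rows of the sailing table: (Temp., Wind, Sky, Sail?)
DATA = [
    ("Hot", "Strong", "Sunny", "Yes"),
    ("Hot", "Weak", "Sunny", "No"),
    ("Warm", "Strong", "Rainy", "No"),
    ("Cold", "Strong", "Rainy", "No"),
    ("Cold", "Weak", "Rainy", "No"),
    ("Hot", "Strong", "Cloudy", "Yes"),
]
ATTRIBUTES = {"Temp.": 0, "Wind": 1, "Sky": 2}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy reduction obtained by partitioning rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(DATA, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

On this data Sky has the highest gain: the rainy and cloudy branches are already pure, and only the sunny branch needs a further test.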
Centre splitting (Thornton 1995-2000)
Distance-based method for
One centre is calculated for each
The space is divided according to
The process is iterated and stopped
Requires a single distance between
A simple distance loses information and
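The idea of centre splitting can be sketched as follows. Because the slide text is truncated, several details are assumptions: one centre per class computed as the coordinate-wise mean, assignment of each example to its nearest centre under Euclidean distance, and recursion until a region is pure (or no longer changes). This is a minimal sketch, not Thornton's implementation.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def majority(examples):
    """Most frequent label among (point, label) pairs."""
    labels = sorted({y for _, y in examples})
    return max(labels, key=lambda l: sum(1 for _, y in examples if y == l))


def centre_split(examples, max_depth=10):
    """Build a tree by repeatedly splitting around one centre per class."""
    labels = sorted({y for _, y in examples})
    if len(labels) == 1 or max_depth == 0:
        return {"label": majority(examples)}
    centres = {l: centroid([p for p, y in examples if y == l]) for l in labels}
    regions = {l: [] for l in labels}
    for p, y in examples:
        nearest = min(labels, key=lambda l: dist(p, centres[l]))
        regions[nearest].append((p, y))
    if any(len(r) == len(examples) for r in regions.values()):
        return {"label": majority(examples)}  # the split makes no progress
    children = {
        l: centre_split(r, max_depth - 1) for l, r in regions.items() if r
    }
    return {"centres": centres, "children": children}


def predict(tree, point):
    """Follow the nearest centre at each node until a leaf is reached."""
    while "label" not in tree:
        nearest = min(tree["children"], key=lambda l: dist(point, tree["centres"][l]))
        tree = tree["children"][nearest]
    return tree["label"]


train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
tree = centre_split(train)
print(predict(tree, (0.5, 0.5)), predict(tree, (5.5, 5.0)))
```

The sketch uses one distance over the whole example, which is the limitation noted in the last two bullets.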
Apply the centre splitting technique for each attribute. The centre must be a value of the dataset instead of computing the
Generates decision tree models (distance-based decision tree) in the
Conditions are expressed in terms of distances to prototypes (proximity
Can handle nominal, numerical and structured (complex) attributes.
Defining a metric or a similarity function for each attribute.
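A single split step of this per-attribute scheme might look like the sketch below. It assumes that the prototype of a class is the dataset value with the smallest total distance to the other values of that class (a medoid), and that candidate attributes are compared by the accuracy of the nearest-prototype assignment. The toy rows, attribute names and distance functions are hypothetical and are not taken from the slides.

```python
def medoid(values, distance):
    """Dataset value with the smallest total distance to all the values."""
    return min(values, key=lambda v: sum(distance(v, w) for w in values))


def split_on_attribute(rows, labels, attr, distance):
    """Pick one prototype per class and score the nearest-prototype split."""
    classes = sorted(set(labels))
    prototypes = {
        c: medoid([r[attr] for r, y in zip(rows, labels) if y == c], distance)
        for c in classes
    }
    assigned = [
        min(classes, key=lambda c: distance(r[attr], prototypes[c])) for r in rows
    ]
    accuracy = sum(a == y for a, y in zip(assigned, labels)) / len(rows)
    return prototypes, accuracy


# Hypothetical toy data: one numeric and one set-valued attribute.
rows = [
    {"visits": 12, "keywords": {"soccer", "league"}},
    {"visits": 35, "keywords": {"soccer", "league", "referee"}},
    {"visits": 30, "keywords": {"linux", "shell"}},
    {"visits": 14, "keywords": {"linux", "shell", "ftp"}},
]
labels = ["sport", "sport", "other", "other"]

# One distance per attribute, so nominal, numeric and structured
# attributes can coexist in the same dataset.
distances = {
    "visits": lambda a, b: abs(a - b),
    "keywords": lambda a, b: len(a ^ b),  # size of the symmetric difference
}

results = {
    attr: split_on_attribute(rows, labels, attr, d) for attr, d in distances.items()
}
best = max(results, key=lambda attr: results[attr][1])
print(best, results[best])
```

In this toy case the set-valued attribute separates the classes perfectly, while the numeric one does not, so the split condition would be expressed as proximity to a keyword-set prototype.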
Id. | Daily conn. | Structure | Content | Sport news site?
1 | 10 | {(Math,Topo,Analysis,Logic)↔(invariant,surfaces), (Math,Topo,Analysis,Logic)↔(Lie ope,tangent), (Math,Topo,Analysis,Logic)↔(Gödel,Fuzzy)} | {(Topo,3), (Analysis,5), (Logic,5)} | No
2 | 25 | {(Linux,networking)↔(shell,learners), (Linux,networking)↔(TCP/IP,telnet,ftp)} | {(Linux,3), (php,6), (networking,8)} | No
3 | 30 | {(economy,politics)↔(Dow Jones,Yen), (economy,politics)↔(interview,elections)} | {(economy,3), (politics,4), (law,10)} | No
4 | 38 | {(soccer,championships,leagues)↔(scorers,classif.), (scorers,classif.)↔(best players,best referees)} | {(soccer,9), (league,8)} | No
5 | 41 | {(soccer,champions league)↔(scorers,classif.), (soccer,champions league)↔(matches,semi-final)} | {(soccer,7), (league,5)} | Yes
6 | 32 | {(soccer,champions league)↔(scorers,classif.), (soccer,champions league)↔(matches,referees)} | {(soccer,5), (league,5)} | Yes
After the 1st step… (heuristic: accuracy)
Class No Class No
Class Yes Class No
F.C. Barcelona won…
DBDT has been implemented in WEKA. It includes several distance and (pseudo-)distance functions
In the experiments, lists and sets have been the structured
Document representation
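To make the role of distances over lists and sets concrete, here is a minimal sketch of two standard choices: the edit (Levenshtein) distance for lists and the Jaccard distance for sets. The slides do not say which functions the WEKA implementation provides, so these are stand-in examples, and the function names are hypothetical.

```python
def list_distance(a, b):
    """Edit (Levenshtein) distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute x by y (or match)
                )
            )
        prev = curr
    return prev[-1]


def set_distance(a, b):
    """Jaccard distance: 1 - |a ∩ b| / |a ∪ b| (0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


print(list_distance(["home", "news", "sport"], ["home", "sport"]))
print(set_distance({"soccer", "league"}, {"soccer", "referee"}))
```

Either function can be plugged in as the per-attribute distance, for example an ordered list of words compared by edit distance or a keyword set compared by Jaccard distance.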
First experiment: classifying web sites by topic
mathematics (biographies, technical pages, personal web sites, …)
sports (biographies, news, events, championships, …)
Second experiment: Learning user profiles
several topics (clinical information, music events, …)
documents ranked according to user preferences (hot, medium, cold)
Most accurate result (DBDT)
A general-purpose algorithm
Like ID3, C4.5, CART, etc.
Able to handle structured attributes
Only requires defining a similarity function for each attribute
Applicable to web mining
Web classification/categorisation problems
Instead of “close to {(economy,3),(politics,4), (law,10)}” we would
Defining a generalisation operator in metric spaces.
Recommender systems. Personalisation. Ontology categorisation (using metrics between ontologies). …
Extracting rules from incomprehensible models (black-box
Combination of data mining and simulation. Applications in bioinformatics (complex data). Ranking predictions and evaluating their quality.