Stochastic models for semi-structured document mining
- P. Gallinari
Collaboration with
- G. Wisniewski, L. Denoyer, F. Maes
LIP6, University Pierre & Marie Curie, France
2006-04-27 - LIPN - P. Gallinari 2
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document mapping
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
Model, classify, and cluster structured data
- Domains: chemistry, biology, XML, etc. Models: discriminant (e.g. kernels), generative (e.g. tree densities)
Predict structured outputs
- Domains: natural language parsing, taxonomies, etc. Models: relational learning, large-margin extensions
Learn to associate structured representations (aka tree mapping)
- Domains: databases, semi-structured data
Structure only vs. structure + content. The central issue is complexity: the representation space (#words, #tags, #relations) and, likewise, the search space for structured outputs. Large corpora need simple, approximate methods.
[Figure: example XML document tree with tags <article>, <hdr>, <bdy>, <sec>, <st>, <p>, <fig>, <fgc> and text at the leaves]
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
A document d is generated by first drawing its structure, then its content given the structure:

P(D = d | Θ) = P(S = s_d | Θ) · P(T = t_d | S = s_d, Θ)

where P(S = s_d | Θ) is the structural probability and P(T = t_d | s_d, Θ) is the content probability.
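A minimal sketch of this decomposition, assuming per-node content probabilities; `p_struct` and `p_content` are hypothetical stand-ins for the structural and content models, not the slides' actual parameterization:

```python
import math

def log_p_document(labels, texts, p_struct, p_content):
    """Log-probability of a document under the decomposition
    P(d | Theta) = P(s_d | Theta) * P(t_d | s_d, Theta).
    `labels` is the node-label structure s_d, `texts` the content t_d
    (one text per node); `p_struct` and `p_content` are placeholder
    callables standing in for the structural and content models."""
    log_p = math.log(p_struct(labels))
    for text, label in zip(texts, labels):
        log_p += math.log(p_content(text, label))
    return log_p
```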
Belief Networks
[Figure: example document trees (Document -> Title, Body; sections containing titles and paragraphs), illustrating the belief-network representation]
Three structural models of increasing context:

1. Independent labels:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i)

2. Parent dependency:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i | label(parent(n_d^i)))

3. Parent and left-sibling dependency:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i | label(parent(n_d^i)), label(previous(n_d^i)))

[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
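The parent-conditioned variant can be sketched as follows; the node table, the probability table, and the helper names are illustrative, not from the slides:

```python
# Toy structure: each row is (label, parent index, left-sibling index),
# mirroring the Document -> Intro, Section, Section example.
NODES = [
    ("Document", None, None),
    ("Intro", 0, None),
    ("Section", 0, 1),
    ("Section", 0, 2),
]

def p_structure(nodes, cond_prob):
    """P(s_d) = product over nodes of P(label | context). The context
    policy lives inside `cond_prob`, so the same loop covers the
    independent, parent-only, and parent+sibling variants."""
    p = 1.0
    for label, parent, sib in nodes:
        p *= cond_prob(label,
                       nodes[parent][0] if parent is not None else None,
                       nodes[sib][0] if sib is not None else None)
    return p

# Parent-only conditioning (the second model), with made-up probabilities.
TABLE = {("Document", None): 1.0, ("Intro", "Document"): 0.2,
         ("Section", "Document"): 0.4}
parent_model = lambda label, parent, sib: TABLE[(label, parent)]
```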
Content model: first-order dependency, with a local generative model for each label:

P(t_d | s_d, θ) = ∏_{i=1..|d|} P(t_d^i | s_d^i, θ_{s_d^i})
Example: Document -> Intro, Section, Section; the sections contain paragraphs with texts:
T1 = "This document is an example of a tree-structured document"
T2 = "This is the first section of the document"
T3 = "The first paragraph"
T4 = "The second paragraph"
T5 = "The second section"
T6 = "The third paragraph"

P(d) = P(Intro | Document) · P(Section | Document)^2 · P(Paragraph | Section)^3
     × P(T1 | Intro) · P(T2 | Section) · P(T3 | Paragraph) · P(T4 | Paragraph) · P(T5 | Section) · P(T6 | Paragraph)
Learning

Likelihood maximization:

L = Σ_{d ∈ TRAIN} log P(d | Θ)
  = Σ_{d ∈ TRAIN} log P(s_d | Θ) + Σ_{d ∈ TRAIN} Σ_{i=1..|d|} log P(t_d^i | s_d^i, Θ)
  = L_structure + L_content

Discriminant learning (error minimization) with a logistic function, modeling the class log-odds directly:

log P(x | c) - log P(x | c̄) = Σ_{i=1..n} ( θ_{x_i, pa(x_i)} - θ̄_{x_i, pa(x_i)} )

P(c | x) = 1 / ( 1 + e^{ - Σ_{i=1..n} ( θ_{x_i, pa(x_i)} - θ̄_{x_i, pa(x_i)} ) } )
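The discriminant criterion can be sketched numerically; a minimal version assuming binary labels and a plain dot-product score (feature extraction and the belief-network parameterization are abstracted away):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(data, theta):
    """Discriminant training criterion: model the class posterior
    directly as P(c | x) = sigmoid(theta . x) and minimize the
    resulting log-loss. `data` holds (features, label) pairs with
    labels in {0, 1}."""
    loss = 0.0
    for x, c in data:
        p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
        loss -= math.log(p if c == 1 else 1.0 - p)
    return loss
```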
Fisher score, hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example.

Fisher kernel: K(X, Y) = K(U_X, U_Y)
U_d = ∇_Θ log P(d | Θ)
    = ∇_{θ_s} log P(s_d | Θ) + ∇_{θ_t} log P(t_d | s_d, Θ)

The Fisher vector decomposes into sub-vectors:

U_d = ( ∇_{θ_s} log P(s_d),
        ∇_{θ_{t,l1}} Σ_{i : s_d^i = l1} log P(t_d^i | s_d^i), ...,
        ∇_{θ_{t,l|Λ|}} Σ_{i : s_d^i = l|Λ|} log P(t_d^i | s_d^i) )

- the first sub-vector is the gradient on the structure model;
- each following sub-vector is the gradient for the nodes with label l ∈ Λ.
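The two ingredients, score vector and kernel, can be sketched as below; the finite-difference gradient is an illustrative shortcut (the slides differentiate the belief-network likelihood analytically), and the kernel omits the inverse Fisher information matrix:

```python
def fisher_score(log_p, theta, eps=1e-5):
    """Fisher score U_x = gradient of log P(x | theta) w.r.t. theta,
    approximated by central finite differences for illustration."""
    grad = []
    for j in range(len(theta)):
        hi = list(theta); hi[j] += eps
        lo = list(theta); lo[j] -= eps
        grad.append((log_p(hi) - log_p(lo)) / (2 * eps))
    return grad

def fisher_kernel(u_x, u_y):
    """K(X, Y) = <U_X, U_Y>: the simplest Fisher kernel form."""
    return sum(a * b for a, b in zip(u_x, u_y))
```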
Fisher kernels involve a very large number of parameters. On INEX: flat models have about 200,000 parameters; structure models about 20 million.
A natural setting for modeling semi-structured multimedia documents:
- Structural probability (belief network) + content probability (local generative models)
- Learning with maximum likelihood or cross-entropy; discriminant learning with the Fisher kernel
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
One model for each category. 3 XML corpora + 1 multimedia corpus:
- INEX: 12,000 articles from IEEE, 18 journals
- WebKB: web pages (8K pages), 7 topics (course, department, …)
- WIPO: XML patent documents, categories of patents
- NetProtect (European project): 100,000 web pages, pornographic or not
Structure model vs. Naive Bayes (NB):

INEX          F1 macro   F1 micro
  NB          0.605      0.59
  Structure   0.622      0.619

WebKB         F1 macro   F1 micro
  NB          0.706      0.801
  Structure   0.743      0.827

WIPO          F1 macro   F1 micro
  NB          0.565      0.662
  Structure   0.604      0.677
INEX                      F1 macro   F1 micro
  NB                      0.605      0.59
  Structure model         0.622      0.619
  SVM TF-IDF              0.564      0.534
  Fisher kernel           0.668      0.661
  Discriminant learning   0.600      0.575

WIPO                      F1 macro   F1 micro
  NB                      0.565      0.662
  Structure model         0.604      0.677
  SVM TF-IDF              0.71       0.822
  Fisher kernel           0.715      0.862

WebKB                     F1 macro   F1 micro
  NB                      0.706      0.801
  Structure model         0.743      0.827
  SVM TF-IDF              0.651      0.737
  Fisher kernel           0.738      0.823
  Discriminant learning   0.792      0.868
NetProtect results (recall, confidence intervals in brackets):

Model                                Micro-avg recall    Macro-avg recall
  NB                                 88.4 [87.7;89]      89.9 [89.2;90.4]
  Structure model, text only         92.9 [92.3;93.3]    92.5 [91.9;93]
  Structure model, pictures only     82.7 [81.9;83.4]    83 [82.2;83.7]
  Structure model, text + pictures   94.7 [94.2;95.1]    93.6 [93.1;94]
Director Ang Lee Takes Risks with Mean Green 'Hulk'
LOS ANGELES (Reuters) - Taiwan-born director Ang Lee, perhaps best known for his Oscar-winning "Crouching Tiger, Hidden Dragon," is taking a big risk with the splashy summer popcorn flick … For loyal comic book fans who may think Lee's "Hulk" will be too touchy-feely, think again. "This is a drama, a family drama," said Lee, "but with big action." His slumping shoulders twitch and he laughs …
FAMILY DRAMA, BIG ACTION
The structure model handles both structure and content information, and both carry class information; it also applies to multimedia categorization. Not in this talk: categorization of parts of documents; categorization of trees (structure only).
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
The usual goal is to find groups of thematically similar documents. The task is different for structured documents: what does "similar documents" mean: same structure? same content? both? This is an open question.
Mixture model:

P(d | Θ) = Σ_{i=1..|C|} α_i · P(s_d | c_i, Θ)

EM algorithm (CEM); used on the structure only, on the INEX corpus.
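A toy version of CEM, assuming each document is reduced to a single hashable structure signature (the actual mixture components are belief networks; this representation is my simplification):

```python
from collections import Counter

def cem(docs, k, n_iter=10):
    """Classification EM (CEM) for a k-component mixture, with each
    component reduced to a smoothed multinomial over whole structure
    signatures. `docs` is a list of hashable signatures; returns a
    hard cluster assignment per document."""
    assign = [i % k for i in range(len(docs))]   # arbitrary init
    vocab = len(set(docs))
    for _ in range(n_iter):
        # M-step: mixture weights and smoothed per-cluster frequencies
        counts = [Counter(d for d, a in zip(docs, assign) if a == c)
                  for c in range(k)]
        sizes = [sum(cnt.values()) for cnt in counts]
        # C-step: reassign each document to its most probable cluster
        assign = [max(range(k),
                      key=lambda c: (sizes[c] / len(docs))
                                    * (counts[c][d] + 1) / (sizes[c] + vocab))
                  for d in docs]
    return assign
```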
Different models
[Figure: three example trees (Tree 1, Tree 2, Tree 3) over node labels A, B, C, E]

Estimated production probabilities:
P(A, C | A) = 3/5    P(B | A) = 2/5    P(E, E, A | B) = 2/2    P(B, C | C) = 1/1
The same counts written as weighted grammar rules:
A -> A C [3/5]    A -> B [2/5]    B -> E E A [1]    C -> B C [1]

<!DOCTYPE A [
  <!ELEMENT A (A,C)>
  <!ELEMENT A (B)>
  <!ELEMENT B (E,E,A)>
  <!ELEMENT C (B,C)>
]>
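The production probabilities above can be estimated by simple counting; a sketch over nested-tuple trees (the tree representation is my assumption):

```python
from collections import Counter

def production_probs(trees):
    """Estimate P(children labels | parent label) by counting
    productions over a forest, as in the grammar model
    (e.g. A -> A C [3/5]). A tree is a nested pair
    (label, [subtrees]); leaves have an empty child list."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node
        if children:
            rule_counts[(label, tuple(c[0] for c in children))] += 1
            lhs_counts[label] += 1
            for child in children:
                visit(child)
    for tree in trees:
        visit(tree)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```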
[Plot: micro-average entropy (%) vs. number of clusters (5 to 40), comparing Naive Bayes, Parent, Parent + Grandparent, Parent + Sibling, and Grammar models]
Mixture model of belief networks; among the different models, the grammar model performs best, and it can produce a kind of DTD. Clustering of XML documents remains an ill-defined problem.
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
<Restaurant>
  <Nom>L’olivier</Nom>
  <Description>
    This lovely restaurant, located near the Jaurès metro station at 19 boulevard
    de la Villette in the 19th arrondissement of Paris, serves Italian cuisine,
    notably fresh three-cheese pasta.
  </Description>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>Tokyo Bar</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>Bolivar</Rue>
    <Num>127</Num>
  </Adresse>
  <Plat>Sushi</Plat>
  <Plat>Sashimi</Plat>
</Restaurant>
Problem: querying heterogeneous XML databases requires knowing the correspondence between their structured representations.
Problem: learn from examples how to map heterogeneous sources onto a target schema, preserving the document semantics. Sources: semi-structured, HTML, PDF, flat text, etc. This is a labeled tree mapping problem.
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l’orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>
Central issue: complexity
- Large collections; large feature space (10^3 to 10^6); large, exponential search space

Approach
- Learn generative models of XML target documents from a training set
- Decode unknown source documents according to the learned model
Why use ML for structure matching?
- Multiple sources: variability, documents that do not follow the schema, collection growth, etc.
- Web sources: DTDs and schemas are often unknown or do not exist
Data-centered view (Doan et al.)
- Multiple independent classifier combination; centralized (mediator) or P2P; 1:1 or m:n transformations

Document-centered view
- Document conversion (Xerox): rendering formats (HTML, PDF, etc.) -> XML with a predefined DTD
- Information retrieval (LIP6): content-and-structure queries (e.g. INEX)
Given S_T a target format and d_Sin an input document, find the most probable target document:

d_T = argmax_{d' ∈ S_T} P(d' | d_Sin)

Decoding is performed with a learned transformation model.
The transformation decomposes into a structure part and a content part:

d' = argmax_{d'} P(s_{d'} | s_d, Θ) · P(t_{d'} | s_{d'}, t_d, Θ)
Subtask of structure mapping: the tree structure remains unchanged; learn to automatically label the nodes.

d_final = argmax_{s_{d'}^1, ..., s_{d'}^{|d'|} ∈ Λ} P(s_{d'}^1, ..., s_{d'}^{|d'|} | s_d) · P(t_d^1, ..., t_d^{|d|} | s_{d'}^1, ..., s_{d'}^{|d'|})

[Figure: an unlabeled input tree and its relabeled version over labels A, B, C, E]
[Figure: example tree with child-tag sequences: Document -> (Title, Section, Section); Section -> (Paragraph, Paragraph); Section -> (Paragraph, FootNote); one Paragraph contains an Italic]

Structure model: each node generates the tag sequence of its children:

P(s_d | θ) = ∏_{n ∈ nodes(d)} P(childrentags(n) | tag(n), θ)
Automatic structuring (F. Maes, P. Gallinari): problem setting, etc.
d = (c, s); s = (s_e, s_i)
Segmentation and structuring are performed sequentially: segmentation, then structure extraction.
Segmentation: HMM. Structure:
[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
Hypotheses
- Input document: HTML tags are mostly for visualization; remove the tags and keep only the segmentation (the leaves)
- Transformation: the leaves are the same in the HTML and the XML document
- Target document model: a node's label depends on its context = content, left sibling, father
Probability of the target tree: the document model is a maximum-entropy conditional model learned from a training set of target documents.

P(d_T | d_Sin) = P(d_T^1, ..., d_T^n) = ∏_{i=1..n} P(n_i | c_i, sib(n_i), father(n_i))

[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
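A minimal sketch of such a conditional model for one node; the feature names are illustrative, and on the slide the context is the node's content, left sibling, and father:

```python
import math

def label_posterior(features, weights, labels):
    """Maximum-entropy conditional model P(label | context) for one
    node: a linear score over (feature, label) weights followed by a
    softmax over the candidate labels."""
    scores = {l: sum(weights.get((f, l), 0.0) for f in features)
              for l in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {l: math.exp(scores[l]) / z for l in labels}
```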
Solve

d_T = argmax_{d' ∈ S_T} P(d' | d_Sin)

either exactly with dynamic-programming decoding, or approximately with LASO (Hal Daumé III, ICML 2005).
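Approximate decoding can be sketched as a beam search; `local_score` is a hypothetical local log-score, and this is only in the spirit of LASO-style search, without its learned training procedure:

```python
def beam_decode(nodes, labels, local_score, beam_width=3):
    """Approximate argmax over labelings of a node sequence: keep
    only the `beam_width` best partial hypotheses at each step.
    `local_score(prefix, node, label)` scores extending a partial
    labeling `prefix` by assigning `label` to `node`."""
    hyps = [((), 0.0)]
    for node in nodes:
        expanded = [(prefix + (label,), s + local_score(prefix, node, label))
                    for prefix, s in hyps for label in labels]
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_width]
    return hyps[0][0]
```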
INEX corpus (IEEE collection, XML):
- 12,000 documents (training: 7,800; test: 4,200)
- ≈ 5,000,000 content nodes; 139 tags; mean document depth ≈ 7; vocabulary ≈ 22,000 words
- Test corpus: the Transactions on … series, unlabeled documents (tags removed)
Results (5 tags vs. 139 tags):

            Content   Structure   Struct + Content   Naïve model
5 tags      58%       72.9%       86.5%              79.3%
139 tags    27.8%     49.7%       65.3%              9.5%
[Plots, 5 tags and 139 tags: % of documents vs. % of nodes, comparing Content with all nodes, Structure, Structure and Content, and Content without empty nodes]
Models        Labeling   Segmentation (leaves)   Structuring (internal nodes)
Exact + TMM   92.8%      75.7%                   31.2%
HMM + TMM     91.5%      24.6%                   22.8%

An extreme structuring instance: Exact + TMM here amounts to structuring degraded versions of HTML documents.
Corpora
- IEEE collection / INEX corpus: 12K documents; on average 500 leaf nodes and 200 internal nodes; 139 tags
- Movie DB: 10K movie descriptions (IMDB); on average 100 leaf nodes and 35 internal nodes; 28 tags
- Shakespeare: 39 plays; few documents, but on average 4,100 leaf nodes and 850 internal nodes; 21 tags
- Mini-Shakespeare: 60 randomly chosen scenes from the plays; 85 leaf nodes, 20 internal nodes; 7 tags
Document restructuring is a new problem: a tree transformation problem of high complexity (content + structure), with many different instances. The approach here is based on generative models of target documents.
Challenge
- INEX-DELOS and PASCAL networks of excellence
- Three tasks: Classification, Clustering, Document mapping
- 3 XML corpora: IEEE collection, IMDB (movie descriptions), Wikipedia in 4 languages
- Deadline: June 2006
Web site : http://xmlmining.lip6.fr Email : xmlmining@lip6.fr