Stochastic models for semi-structured document mining (P. Gallinari)


SLIDE 1

Stochastic models for semi-structured document mining

  • P. Gallinari

Collaboration with

  • G. Wisniewski – L. Denoyer – F. Maes

LIP6, University Pierre et Marie Curie, France

SLIDE 2

2006-04-27 - LIPN - P. Gallinari 2

Outline

 Context  Generative tree models  3 problems

 Classification  Clustering  Document mapping

 Experiments  Conclusion and future work

 XML Document Mining Challenge

SLIDE 3

Context - Machine learning in the structured domain

 Model, classify, cluster structured data
   Domains: chemistry, biology, XML, etc.
   Models: discriminant (e.g. kernels), generative (e.g. tree densities)
 Predict structured outputs
   Domains: natural language parsing, taxonomies, etc.
   Models: relational learning, large-margin extensions
 Learn to associate structured representations, aka tree mapping
   Domains: databases, semi-structured data

SLIDE 4

Context - Machine learning in the structured domain

 Structure only vs. structure + content
 Central complexity issue
   Representation space (#words, #tags, #relations)
   Search space for structured outputs (same issue)
 Large corpora need simple and approximate methods

SLIDE 5

Context - XML semi-structured documents

[Figure: an XML article tree with tags <article>, <bdy>, <hdr>, <sec>, <st>, <p>, <fig>, <fgc> and text leaves]

SLIDE 6

Outline

 Context  Generative tree models  3 problems

 Classification  Clustering  Document restructuration

 Experiments  Conclusion and future work

 XML Document Mining Challenge

SLIDE 7

Document model

A document is a pair of structure and content, $d = (s^d, t^d)$. Its probability factors into a structural term and a content term:

$$P(D = d \mid \Theta) = P(S = s^d, T = t^d \mid \Theta) = P(S = s^d \mid \Theta)\, P(T = t^d \mid S = s^d, \Theta)$$

where $P(S = s^d \mid \Theta)$ is the structural probability and $P(T = t^d \mid S = s^d, \Theta)$ the content probability.

SLIDE 8

Document Model: Structure

 Belief Networks

[Figure: example document trees over labels Document, Title, Intro, Section, Paragraph, used to illustrate the three structure models]

Three belief-network factorizations of the structure probability, with increasing context:

$$P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d) \quad \text{(independent labels)}$$

$$P(s^d) = \prod_{i=1}^{|s^d|} P\big(s_i^d \mid label(parent(n_i^d))\big) \quad \text{(parent model)}$$

$$P(s^d) = \prod_{i=1}^{|s^d|} P\big(s_i^d \mid label(parent(n_i^d)), label(previous(n_i^d))\big) \quad \text{(parent and preceding-sibling model)}$$
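As a minimal sketch (hypothetical tree encoding and conditional table, not the talk's implementation), the parent model can be evaluated by walking the tree and multiplying the conditional label probabilities:

```python
import math

def structure_log_prob(tree, cond_prob, parent_label=None):
    """Log-probability of a labeled tree under the parent model:
    P(s^d) = prod_i P(label_i | label(parent_i))."""
    label, children = tree  # a node is (label, [children])
    lp = math.log(cond_prob[(label, parent_label)])
    for child in children:
        lp += structure_log_prob(child, cond_prob, label)
    return lp

# Toy conditional table P(label | parent label); None is the root context.
cond_prob = {
    ("Document", None): 1.0,
    ("Intro", "Document"): 0.3,
    ("Section", "Document"): 0.7,
    ("Paragraph", "Section"): 1.0,
}

doc = ("Document", [("Intro", []),
                    ("Section", [("Paragraph", []), ("Paragraph", [])])])
p = math.exp(structure_log_prob(doc, cond_prob))
# P(s) = 1.0 * 0.3 * 0.7 * 1.0 * 1.0 = 0.21
```

Swapping in the preceding-sibling label as an extra conditioning variable gives the third model above.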

SLIDE 9

Document Model: Content

A model for each node, with a first-order dependency: a local generative model is attached to each label.

$$t^d = (t_1^d, \ldots, t_{|d|}^d)$$

$$P(t^d \mid s^d, \theta) = \prod_{i=1}^{|d|} P(t_i^d \mid s_i^d, \theta)$$

$$P(t_i^d \mid s_i^d, \theta) = P(t_i^d \mid \theta_{s_i^d})$$
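As a small sketch (hypothetical labels, vocabularies and probabilities), the content probability can be computed by scoring each node's text with the local unigram model of its label:

```python
import math

def content_log_prob(nodes, word_probs):
    """P(t^d | s^d, theta) = prod_i P(t_i | theta_{s_i}):
    each node's text is scored by the unigram model of its label."""
    lp = 0.0
    for label, text in nodes:
        theta = word_probs[label]  # local generative model for this label
        for word in text.split():
            lp += math.log(theta[word])
    return lp

# Toy per-label unigram models (hypothetical values).
word_probs = {
    "Title":     {"xml": 0.5, "mining": 0.5},
    "Paragraph": {"xml": 0.2, "mining": 0.1, "documents": 0.7},
}
nodes = [("Title", "xml mining"), ("Paragraph", "xml documents")]
lp = content_log_prob(nodes, word_probs)
# P(t | s) = 0.5 * 0.5 * 0.2 * 0.7 = 0.035
```

Adding this log-probability to the structure log-probability gives the full document score of slide 7.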

SLIDE 10

Final network

[Figure: example network for the tree Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph)), with node texts T1 = "This document is an example of a tree-structured document", T2 = "This is the first section of the document", T3 = "The first paragraph", T4 = "The second paragraph", T5 = "The second section", T6 = "The third paragraph"]

$$P(d) = P(Intro \mid Document)\, P(Section \mid Document)^2\, P(Paragraph \mid Section)^3 \times P(T1 \mid Intro)\, P(T2 \mid Section)\, P(T3 \mid Paragraph)\, P(T4 \mid Paragraph)\, P(T5 \mid Section)\, P(T6 \mid Paragraph)$$

SLIDE 11

Different learning techniques

 Likelihood maximization

$$L = \sum_{d \in TRAIN} \log P(d \mid \Theta) = \sum_{d \in TRAIN} \log P(s^d \mid \Theta) + \sum_{d \in TRAIN} \sum_{i=1}^{|d|} \log P(t_i^d \mid s_i^d, \Theta) = L_{structure} + L_{content}$$

 Discriminant learning
   Logistic function
   Error minimization

$$\log \frac{P(x \mid c)}{P(x \mid \bar{c})} = \sum_{i=1}^{n} \left(\theta_{x_i, pa(x_i)} - \bar{\theta}_{x_i, pa(x_i)}\right), \qquad P(c \mid x) = \frac{1}{1 + e^{-\sum_{i=1}^{n} \left(\theta_{x_i, pa(x_i)} - \bar{\theta}_{x_i, pa(x_i)}\right)}}$$

SLIDE 12

Fisher Kernel

 Fisher score:

$$U_X = \nabla_\theta \log P(X \mid \theta)$$

 Hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example.
 Fisher kernel: K(X, Y) = K(U_X, U_Y)
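For a plain unigram (multinomial) model the Fisher score has a closed form, since d/dθ_w Σ_w n_w log θ_w = n_w/θ_w. A toy sketch (hypothetical vocabulary and parameters) with a linear kernel over the scores:

```python
import math

def fisher_score(counts, theta):
    """U_X = grad_theta log P(X | theta) for a multinomial model:
    the component for word w is n_w / theta_w."""
    return [counts.get(w, 0) / theta[w] for w in sorted(theta)]

def fisher_kernel(counts_x, counts_y, theta):
    """K(X, Y) as a dot product of the two Fisher scores."""
    ux = fisher_score(counts_x, theta)
    uy = fisher_score(counts_y, theta)
    return sum(a * b for a, b in zip(ux, uy))

theta = {"xml": 0.25, "text": 0.75}          # hypothetical unigram model
kxy = fisher_kernel({"xml": 2}, {"xml": 1, "text": 3}, theta)
# U_x components: xml -> 2/0.25 = 8, text -> 0
# U_y components: xml -> 1/0.25 = 4, text -> 3/0.75 = 4
# K = 8*4 + 0*4 = 32
```

In the document model of the next slide the score concatenates one such sub-vector per label, plus a structure sub-vector.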

SLIDE 13

Use with the model

The Fisher score of the document model decomposes into a structure part and one content part per label:

$$U_d = \nabla_\Theta \log P(d \mid \Theta) = \nabla_\Theta \log P(s^d \mid \Theta) + \sum_{l \in \Lambda} \nabla_{\theta_l} \sum_{i=1,\, s_i^d = l}^{|d|} \log P(t_i^d \mid s_i^d, \theta_l)$$

$$U_d = \left( \nabla \log P(s^d \mid \Theta),\; \nabla_{\theta_{l_1}} \sum_{i:\, s_i^d = l_1} \log P(t_i^d \mid s_i^d, \theta_{l_1}),\; \ldots,\; \nabla_{\theta_{l_{|\Lambda|}}} \sum_{i:\, s_i^d = l_{|\Lambda|}} \log P(t_i^d \mid s_i^d, \theta_{l_{|\Lambda|}}) \right)$$

The first sub-vector is the gradient on the structure model; the following sub-vectors are the gradients for the nodes with labels $l_1, \ldots, l_{|\Lambda|}$.

SLIDE 14

Remark

 Fisher kernels: very large number of parameters
 On INEX:
   With flat models: 200,000 parameters
   With structure models: 20 million parameters

SLIDE 15

Conclusion about this family of generative models

 Natural setting for modeling semi-structured multimedia documents
   Structural probability (belief network)
   Content probability (local generative model)
 Learning with maximum likelihood or cross-entropy
 Discriminant learning and Fisher kernel

SLIDE 16

Outline

 Context  Generative tree models  3 problems

 Classification  Clustering  Document restructuration

 Experiments  Conclusion and future work

 XML Document Mining Challenge

SLIDE 17

Classification

 One model for each category
 3 XML corpora + 1 multimedia corpus
   INEX: 12,000 articles from IEEE (18 journals)
   WebKB: Web pages (8K pages); course, department, ... 7 topics
   WIPO: XML patent documents; categories of patents
   NetProtect (European project): 100,000 web pages; pornographic or not

SLIDE 18

Categorization : Generative models

            F1 macro   F1 micro
INEX
  NB        0.605      0.59
  Structure 0.622      0.619
WebKB
  NB        0.706      0.801
  Structure 0.743      0.827
WIPO
  NB        0.565      0.662
  Structure 0.604      0.677

SLIDE 19

Discriminant models

INEX                     F1 macro   F1 micro
  NB                     0.605      0.59
  Structure model        0.622      0.619
  SVM TF-IDF             0.564      0.534
  Fisher kernel          0.668      0.661
  Discriminant learning  0.600      0.575

WIPO                     F1 macro   F1 micro
  NB                     0.565      0.662
  Structure model        0.604      0.677
  SVM TF-IDF             0.71       0.822
  Fisher kernel          0.715      0.862

WebKB                    F1 macro   F1 micro
  NB                     0.706      0.801
  Structure model        0.743      0.827
  SVM TF-IDF             0.651      0.737
  Fisher kernel          0.738      0.823
  Discriminant learning  0.792      0.868

SLIDE 20

Multimedia model

                                     Micro-average recall   Macro-average recall
NB                                   88.4 [87.7; 89]        89.9 [89.2; 90.4]
Structure model with text            92.9 [92.3; 93.3]      92.5 [91.9; 93]
Structure model with pictures        82.7 [81.9; 83.4]      83 [82.2; 83.7]
Structure model, text and pictures   94.7 [94.2; 95.1]      93.6 [93.1; 94]

Director Ang Lee Takes Risks with Mean Green 'Hulk'

LOS ANGELES (Reuters) - Taiwan-born director Ang Lee, perhaps best known for his Oscar-winning "Crouching Tiger, Hidden Dragon," is taking a big risk with the splashy summer popcorn flick …... For loyal comic book fans who may think Lee's "Hulk" will be too touchy-feely, think again. " This is a drama, a family drama," said Lee, "but with big action." His slumping shoulders twitch and he laughs…..

FAMILY DRAMA, BIG ACTION

SLIDE 21

Classification : conclusion

 The structure model can handle both structure and content information
 Both structure and content carry class information
 Multimedia categorization
 Not in this talk:
   Categorization of parts of documents
   Categorization of trees (structure only)

SLIDE 22

Outline

 Context  Generative tree models  3 problems

 Classification  Clustering  Document restructuration

 Experiments  Conclusion and future work

 XML Document Mining Challenge

SLIDE 23

Clustering

 The usual goal is to find groups of thematically similar documents
 The task is different for structured documents:
   What does "similar documents" mean?
     Same structure?
     Same content?
     Both?
   Open question

SLIDE 24

Clustering

 Mixture model:

$$P(d \mid \Theta) = \sum_{i=1}^{|C|} \alpha_i \, P(s^d \mid \Theta_i)$$

 EM algorithm (CEM)
 Used on the structure only, with the INEX corpus
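A compact sketch of CEM under simplifying assumptions (the structure is reduced to a bag of (parent tag, child tag) pairs, toy data, hypothetical helper names; not the experiments' implementation):

```python
import math
import random
from collections import Counter

def cem(docs, k, iters=20, seed=0):
    """Classification EM for the mixture P(d) = sum_i alpha_i P(s_d | Theta_i),
    with each Theta_i a Laplace-smoothed multinomial over (parent, child) pairs."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in docs]  # random hard initialization
    pair_types = sorted({p for d in docs for p in d})
    for _ in range(iters):
        alpha, theta = [], []
        # M-step: re-estimate mixing weights and cluster multinomials.
        for c in range(k):
            members = [d for d, a in zip(docs, assign) if a == c]
            alpha.append((len(members) + 1) / (len(docs) + k))
            counts = Counter(p for d in members for p in d)
            total = sum(counts.values())
            theta.append({p: (counts[p] + 1) / (total + len(pair_types))
                          for p in pair_types})
        # C-step: hard-assign each document to its most probable cluster.
        for i, d in enumerate(docs):
            scores = [math.log(alpha[c]) + sum(math.log(theta[c][p]) for p in d)
                      for c in range(k)]
            assign[i] = scores.index(max(scores))
    return assign

docs = [[("sec", "p"), ("sec", "p")], [("sec", "p")],
        [("fig", "cap")], [("fig", "cap"), ("fig", "cap")]]
labels = cem(docs, k=2)
```

Hard assignment in the C-step is what distinguishes CEM from standard (soft) EM.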

SLIDE 25

Different models

SLIDE 26

The grammar model

[Figure: three example trees (Tree 1, Tree 2, Tree 3) over labels A, B, C, E]

$$P(A, C \mid A) = \tfrac{3}{5}, \quad P(B \mid A) = \tfrac{2}{5}, \quad P(E, E, A \mid B) = \tfrac{2}{2}, \quad P(B, C \mid C) = \tfrac{1}{1}$$

SLIDE 27

Grammar model and DTD

[Figure: the same three trees (Tree 1, Tree 2, Tree 3) over labels A, B, C, E]

$$P(A, C \mid A) = \tfrac{3}{5}, \quad P(B \mid A) = \tfrac{2}{5}, \quad P(E, E, A \mid B) = \tfrac{2}{2}, \quad P(B, C \mid C) = \tfrac{1}{1}$$

The corresponding grammar rules:

A -> A C  [3/5]
A -> B    [2/5]
B -> E E A  [1]
C -> B C    [1]

and the induced DTD:

<!DOCTYPE A [
  <!ELEMENT A (A,C)>
  <!ELEMENT A (B)>
  <!ELEMENT B (E,E,A)>
  <!ELEMENT C (B,C)>
]>
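Rule probabilities of this kind can be estimated by counting expansions in a forest. A minimal sketch with toy trees (hypothetical, not the slide's exact Tree 1-3):

```python
from collections import Counter

def estimate_pcfg(trees):
    """Estimate production probabilities P(children labels | parent label)
    by relative frequency over a forest, as in the grammar model."""
    rule_counts, lhs_counts = Counter(), Counter()

    def walk(node):
        label, children = node  # a node is (label, [children])
        if children:
            rhs = tuple(c[0] for c in children)
            rule_counts[(label, rhs)] += 1
            lhs_counts[label] += 1
            for c in children:
                walk(c)

    for t in trees:
        walk(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy forest: A expands 3 times in total, once as (A, C) and twice as (B).
t1 = ("A", [("A", [("B", [])]), ("C", [])])
t2 = ("A", [("B", [])])
rules = estimate_pcfg([t1, t2])
# rules[("A", ("A", "C"))] == 1/3, rules[("A", ("B",))] == 2/3
```

Reading the estimated rules back out as `<!ELEMENT …>` declarations gives the "kind of DTD" mentioned in the clustering conclusions.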

SLIDE 28

Clustering results

[Plot: micro-average entropy (%) vs. number of clusters (5 to 40) for five models: Naive Bayes, Parent, Parent and Grandparent, Parent and Sibling, Grammar]

SLIDE 29

Example of DTDs

SLIDE 30

Clustering : conclusions

 Mixture model of belief networks
 Different models; the grammar model performs best
 Able to compute a kind of DTD
 Ill-defined problem: what is clustering of XML documents?

SLIDE 31

Outline

 Context  Generative tree models  3 problems

 Classification  Clustering  Document restructuration

 Experiments  Conclusion and future work

 XML Document Mining Challenge

SLIDE 32

Structural heterogeneity

Three heterogeneous descriptions of the same kind of entity:

<Restaurant>
  <Nom>L'olivier</Nom>
  <Description>This nice restaurant, located near the Jaurès metro station at 19 boulevard de la Villette, in the 19th arrondissement of Paris, serves Italian cuisine, notably fresh three-cheese pasta.</Description>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l'orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>Tokyo Bar</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>Bolivar</Rue>
    <Num>127</Num>
  </Adresse>
  <Plat>Sushi</Plat>
  <Plat>Sashimi</Plat>
</Restaurant>

 Problem: querying heterogeneous XML databases (collections, storage, etc.)
 Needs to know the correspondence between the structured representations

SLIDE 33

Document mapping problem

 Problem
   Learn from examples how to map heterogeneous sources onto a predefined target schema
   Preserve the document semantics
   Sources: semi-structured, HTML, PDF, flat text, etc.
 A labeled tree mapping problem

Example source document:

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l'orange, Lapin au miel</Spécialités>
</Restaurant>

and its mapping onto the target schema:

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l'orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>

SLIDE 34

Document mapping problem

 Central issue: complexity
   Large collections
   Large feature space: 10^3 to 10^6
   Large (exponential) search space
 Approach
   Learn generative models of target XML documents from a training set
   Decode unknown sources according to the learned model

SLIDE 35

Learning the correspondence via examples

 Why use ML for structure matching?
   Multiple sources: variability, documents that do not follow the schema, collection growth, etc.
   Web sources: DTDs and schemas are often unknown or do not exist

SLIDE 36

Learning correspondence

 Data-centered view (Doan et al.)
   Multiple independent classifier combination
   Centralized (mediator) or P2P
   1:1 or m:n transformations
 Document-centered view
   Document conversion (Xerox): rendering format (HTML, PDF, etc.) -> predefined XML DTD format
   Information retrieval (LIP6): content-and-structure queries (e.g. INEX)

SLIDE 37

Problem formulation

Given $S_T$ a target format and $d_{s_{in}(d)}$ an input document, find the most probable target document:

$$d^* = \operatorname*{argmax}_{d' \in S_T} P\big(d' \mid d_{s_{in}(d)}\big)$$

Decoding is performed with the learned transformation model.

SLIDE 38

General restructuration model

Given a source document $d = (s^d, t^d)$, find the target document $d' = (s^{d'}, t^{d'})$ maximizing

$$d' = \operatorname*{argmax}_{d'} \; P(s^{d'} \mid s^d, \Theta_1)\, P(t^{d'} \mid s^{d'}, t^d, \Theta_2)$$

SLIDE 39

Instance 1 : Label mapping

 A subtask of structure mapping
   The tree structure remains unchanged
   Learn to automatically label the nodes

$$d' = \operatorname*{argmax}_{s'_1, \ldots, s'_{|d|} \in \Lambda} P(s'_1, \ldots, s'_{|d|} \mid s^d, \Theta_1)\, P(t'_1, \ldots, t'_{|d'|} \mid s'^d, \Theta_2)$$

[Figure: the example trees (Tree 1, Tree 2, Tree 3) with node labels removed; the task is to recover the labels A, B, C, E]

SLIDE 40

Document structure model

[Figure: a document tree in which the node Document has children tags (Title, Section, Section) and the Sections have children tags (Paragraph, Paragraph) and (Paragraph, FootNote)]

$$P(s \mid \theta) = \prod_{n \in nodes(d)} P\big(childrentags(n) \mid tag(n), \theta\big)$$
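A sketch of scoring a tree under this model (hypothetical tag set and rule probabilities):

```python
def structure_prob(tree, rule_probs):
    """P(s | theta) = product over internal nodes of
    P(children tags | tag(n), theta)."""
    label, children = tree  # a node is (label, [children])
    p = 1.0
    if children:
        rhs = tuple(c[0] for c in children)
        p *= rule_probs[(label, rhs)]
        for c in children:
            p *= structure_prob(c, rule_probs)
    return p

# Hypothetical learned probabilities for the slide's example tree.
rule_probs = {
    ("Document", ("Title", "Section", "Section")): 0.5,
    ("Section", ("Paragraph", "Paragraph")): 0.4,
    ("Section", ("Paragraph", "FootNote")): 0.1,
}
doc = ("Document", [("Title", []),
                    ("Section", [("Paragraph", []), ("Paragraph", [])]),
                    ("Section", [("Paragraph", []), ("FootNote", [])])])
p = structure_prob(doc, rule_probs)
# P(s) = 0.5 * 0.4 * 0.1 = 0.02
```

Leaves contribute nothing here; their content is handled by the content model.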

SLIDE 41

PCFG model

[Figure: the same three example trees (Tree 1, Tree 2, Tree 3) over labels A, B, C, E]

$$P(A, C \mid A) = \tfrac{3}{5}, \quad P(B \mid A) = \tfrac{2}{5}, \quad P(E, E, A \mid B) = \tfrac{2}{2}, \quad P(B, C \mid C) = \tfrac{1}{1}$$

Rules: A -> A C [3/5], A -> B [2/5], B -> E E A [1], C -> B C [1]

SLIDE 42

Instance 2: plain text structuring

[Figure: screenshot of a plain-text document ("Structuration automatique", F. Maes, P. Gallinari) to be automatically structured]

SLIDE 43

Stochastic model

d = (c, s) s = (se, si)

SLIDE 44

Sub-optimal approach

 Segmentation and structuring are performed sequentially: segmentation, then structure extraction

SLIDE 45

Models

 Segmentation: HMM
 Structure: [Figure: example tree Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph))]
SLIDE 46

Instance 3 : HTML to XML

 Hypotheses
   Input document
     HTML tags are mostly for visualization
     Remove the tags, keep only the segmentation (leaves)
   Transformation
     Leaves are the same in the HTML and XML documents
     Target document model: a node label depends only on its local context
     Context = content, left sibling, father

SLIDE 47

Problem representation

SLIDE 48

Model and training

 Probability of the target tree
 Document model: a maximum-entropy conditional model learned from a training set of target documents

$$P\big(d_T \mid d_{s_{in}(d)}\big) = P(d_1, \ldots, d_n) = \prod_{i=1}^{n} P\big(n_i \mid c(n_i), sib(n_i), father(n_i)\big)$$

[Figure: example target tree Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph))]
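As an illustrative sketch (hypothetical features and weights; the actual max-ent feature set is not specified here), each node label can be scored from its local context (content, left sibling, father) with a softmax over weighted features:

```python
import math

def maxent_label_probs(weights, labels, features):
    """Conditional model P(label(n) | c(n), sib(n), father(n)):
    sum each candidate label's feature weights, then normalize (softmax)."""
    scores = [sum(weights.get((lab, f), 0.0) for f in features)
              for lab in labels]
    m = max(scores)                       # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {lab: e / z for lab, e in zip(labels, exps)}

labels = ["Title", "Paragraph"]
weights = {  # hypothetical learned weights
    ("Title", "word=abstract"): 2.0,
    ("Title", "father=article"): 1.0,
    ("Paragraph", "sib=Title"): 1.5,
}
probs = maxent_label_probs(weights, labels,
                           ["word=abstract", "sib=None", "father=article"])
```

The product of these per-node conditionals is what the decoding step on the next slide maximizes over candidate trees.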
SLIDE 49

Decoding

 Solve  Exact Dynamic Programming decoding

  • O(|Leaf nodes|3.|tags|)

 Approximate solution with LASO (Hal Daume ICML 2005)

  • O(|Leaf nodes|.|tags||tree nodes|)

) ' ( max arg

) (

'

d in T T

S S d S

d d P d

=

SLIDE 50

Experiments

 INEX corpus: IEEE collection (XML)
   12,000 documents (training: 7,800, test: 4,200)
   ≈ 5,000,000 content nodes
   139 tags
   Mean document depth ≈ 7
   Vocabulary: ≈ 22,000 words
 Test corpus
   Transactions on … series
   Unlabeled documents (tags removed)

SLIDE 51

Instance 1 : Label mapping - results

           Content   Structure   Struct + Content   Naive model
5 tags     58%       72.9%       86.5%              79.3%
139 tags   27.8%     49.7%       65.3%              9.5%

SLIDE 52

Instance 1 : IR adapted measure

[Plots: percentage of documents vs. percentage of correctly labeled nodes, for 5 tags and for 139 tags; curves: Content with all nodes, Structure, Structure and Content, Content without empty nodes]

SLIDE 53

Instance 2: plain text structuring Results

Models        Labeling   Segmentation (leaves)   Structuring (internal nodes)
Exact + TMM   92.8%      75.7%                   31.2%
HMM + TMM     91.5%      24.6%                   22.8%

 An extreme structuring instance
 Exact + TMM: a degraded version of HTML document structuring

SLIDE 54

Instance 3 HTML to XML

 IEEE collection / INEX corpus
   12K documents
   Average: 500 leaf nodes, 200 internal nodes, 139 tags
 Movie DB
   10K movie descriptions (IMDB)
   Average: 100 leaf nodes, 35 internal nodes, 28 tags
 Shakespeare: 39 plays
   Few documents, but: average 4,100 leaf nodes, 850 internal nodes, 21 tags
 Mini-Shakespeare
   60 randomly chosen scenes from the plays
   Average: 85 leaf nodes, 20 internal nodes, 7 tags

SLIDE 55

Performances

SLIDE 56

SLIDE 57

Conclusion

 Document restructuration is a new problem
 A tree transformation problem of high complexity (content + structure)
 Many different instances
 Approach based on generative models of target documents

SLIDE 58

XML Document Mining Challenge 2006

 Challenge
   INEX, Delos and Pascal networks of excellence
 Three tasks
   Classification
   Clustering
   Document mapping
 3 XML corpora
   IEEE collection
   IMDB (movie descriptions)
   Wikipedia in 4 languages
 Deadline: June 2006
 Web site: http://xmlmining.lip6.fr
 Email: xmlmining@lip6.fr