Stochastic models for semi-structured document mining
- P. Gallinari
Collaboration with
- G. Wisniewski, L. Denoyer, F. Maes
LIP6, University Pierre & Marie Curie, France
2006-04-27 - LIPN - P. Gallinari 2
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document mapping
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
Model, classify, and cluster structured data
- Domains: chemistry, biology, XML, etc. Models: discriminant (e.g. kernels), generative (e.g. tree densities)
Predict structured outputs
- Domains: natural language parsing, taxonomies, etc. Models: relational learning, large-margin extensions
Learn to associate structured representations (aka tree mapping)
- Domains: databases, semi-structured data
Structure only vs. structure + content. The central issue is complexity: the representation space (#words, #tags, #relations) and, likewise, the search space for structured outputs. Large corpora need simple, approximate methods.
[Figure: example XML document tree with tags <article>, <hdr>, <bdy>, <sec>, <st>, <p>, <fig>, <fgc> and text at the leaves]
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
A document d is generated by first drawing its structure, then its content given the structure:

P(D = d | Θ) = P(S = s_d | Θ) · P(T = t_d | S = s_d, Θ)

where P(S = s_d | Θ) is the structural probability and P(T = t_d | s_d, Θ) is the content probability.
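A minimal sketch of this decomposition, assuming per-node content probabilities; `p_struct` and `p_content` are hypothetical stand-ins for the structural and content models, not the slides' actual parameterization:

```python
import math

def log_p_document(labels, texts, p_struct, p_content):
    """Log-probability of a document under the decomposition
    P(d | Theta) = P(s_d | Theta) * P(t_d | s_d, Theta).
    `labels` is the node-label structure s_d, `texts` the content t_d
    (one text per node); `p_struct` and `p_content` are placeholder
    callables standing in for the structural and content models."""
    log_p = math.log(p_struct(labels))
    for text, label in zip(texts, labels):
        log_p += math.log(p_content(text, label))
    return log_p
```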
Belief Networks
[Figure: example document trees (Document -> Title, Body; sections containing titles and paragraphs), illustrating the belief-network representation]
Three structural models of increasing context:

1. Independent labels:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i)

2. Parent dependency:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i | label(parent(n_d^i)))

3. Parent and left-sibling dependency:
   P(s_d) = ∏_{i=1..|d|} P(s_d^i | label(parent(n_d^i)), label(previous(n_d^i)))

[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
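The parent-conditioned variant can be sketched as follows; the node table, the probability table, and the helper names are illustrative, not from the slides:

```python
# Toy structure: each row is (label, parent index, left-sibling index),
# mirroring the Document -> Intro, Section, Section example.
NODES = [
    ("Document", None, None),
    ("Intro", 0, None),
    ("Section", 0, 1),
    ("Section", 0, 2),
]

def p_structure(nodes, cond_prob):
    """P(s_d) = product over nodes of P(label | context). The context
    policy lives inside `cond_prob`, so the same loop covers the
    independent, parent-only, and parent+sibling variants."""
    p = 1.0
    for label, parent, sib in nodes:
        p *= cond_prob(label,
                       nodes[parent][0] if parent is not None else None,
                       nodes[sib][0] if sib is not None else None)
    return p

# Parent-only conditioning (the second model), with made-up probabilities.
TABLE = {("Document", None): 1.0, ("Intro", "Document"): 0.2,
         ("Section", "Document"): 0.4}
parent_model = lambda label, parent, sib: TABLE[(label, parent)]
```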
Content model: first-order dependency, with a local generative model for each label:

P(t_d | s_d, θ) = ∏_{i=1..|d|} P(t_d^i | s_d^i, θ_{s_d^i})
Example: Document -> Intro, Section, Section; the sections contain paragraphs with texts:
T1 = "This document is an example of a tree-structured document"
T2 = "This is the first section of the document"
T3 = "The first paragraph"
T4 = "The second paragraph"
T5 = "The second section"
T6 = "The third paragraph"

P(d) = P(Intro | Document) · P(Section | Document)^2 · P(Paragraph | Section)^3
     × P(T1 | Intro) · P(T2 | Section) · P(T3 | Paragraph) · P(T4 | Paragraph) · P(T5 | Section) · P(T6 | Paragraph)
Learning

Likelihood maximization:

L = Σ_{d ∈ TRAIN} log P(d | Θ)
  = Σ_{d ∈ TRAIN} log P(s_d | Θ) + Σ_{d ∈ TRAIN} Σ_{i=1..|d|} log P(t_d^i | s_d^i, Θ)
  = L_structure + L_content

Discriminant learning (error minimization) with a logistic function, modeling the class log-odds directly:

log P(x | c) - log P(x | c̄) = Σ_{i=1..n} ( θ_{x_i, pa(x_i)} - θ̄_{x_i, pa(x_i)} )

P(c | x) = 1 / ( 1 + e^{ - Σ_{i=1..n} ( θ_{x_i, pa(x_i)} - θ̄_{x_i, pa(x_i)} ) } )
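The discriminant criterion can be sketched numerically; a minimal version assuming binary labels and a plain dot-product score (feature extraction and the belief-network parameterization are abstracted away):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(data, theta):
    """Discriminant training criterion: model the class posterior
    directly as P(c | x) = sigmoid(theta . x) and minimize the
    resulting log-loss. `data` holds (features, label) pairs with
    labels in {0, 1}."""
    loss = 0.0
    for x, c in data:
        p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
        loss -= math.log(p if c == 1 else 1.0 - p)
    return loss
```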
Fisher score, hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example.

Fisher kernel: K(X, Y) = K(U_X, U_Y)
U_d = ∇_Θ log P(d | Θ)
    = ∇_{θ_s} log P(s_d | Θ) + ∇_{θ_t} log P(t_d | s_d, Θ)

The Fisher vector decomposes into sub-vectors:

U_d = ( ∇_{θ_s} log P(s_d),
        ∇_{θ_{t,l1}} Σ_{i : s_d^i = l1} log P(t_d^i | s_d^i), ...,
        ∇_{θ_{t,l|Λ|}} Σ_{i : s_d^i = l|Λ|} log P(t_d^i | s_d^i) )

- the first sub-vector is the gradient on the structure model;
- each following sub-vector is the gradient for the nodes with label l ∈ Λ.
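The two ingredients, score vector and kernel, can be sketched as below; the finite-difference gradient is an illustrative shortcut (the slides differentiate the belief-network likelihood analytically), and the kernel omits the inverse Fisher information matrix:

```python
def fisher_score(log_p, theta, eps=1e-5):
    """Fisher score U_x = gradient of log P(x | theta) w.r.t. theta,
    approximated by central finite differences for illustration."""
    grad = []
    for j in range(len(theta)):
        hi = list(theta); hi[j] += eps
        lo = list(theta); lo[j] -= eps
        grad.append((log_p(hi) - log_p(lo)) / (2 * eps))
    return grad

def fisher_kernel(u_x, u_y):
    """K(X, Y) = <U_X, U_Y>: the simplest Fisher kernel form."""
    return sum(a * b for a, b in zip(u_x, u_y))
```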
Fisher kernels involve a very large number of parameters. On INEX: flat models have about 200,000 parameters; structure models about 20 million.
A natural setting for modeling semi-structured multimedia documents:
- Structural probability (belief network) + content probability (local generative models)
- Learning with maximum likelihood or cross-entropy; discriminant learning with the Fisher kernel
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
One model for each category. 3 XML corpora + 1 multimedia corpus:
- INEX: 12,000 articles from IEEE, 18 journals
- WebKB: web pages (8K pages), 7 topics (course, department, …)
- WIPO: XML patent documents, categories of patents
- NetProtect (European project): 100,000 web pages, pornographic or not
Structure model vs. Naive Bayes (NB):

INEX          F1 macro   F1 micro
  NB          0.605      0.59
  Structure   0.622      0.619

WebKB         F1 macro   F1 micro
  NB          0.706      0.801
  Structure   0.743      0.827

WIPO          F1 macro   F1 micro
  NB          0.565      0.662
  Structure   0.604      0.677
INEX                      F1 macro   F1 micro
  NB                      0.605      0.59
  Structure model         0.622      0.619
  SVM TF-IDF              0.564      0.534
  Fisher kernel           0.668      0.661
  Discriminant learning   0.600      0.575

WIPO                      F1 macro   F1 micro
  NB                      0.565      0.662
  Structure model         0.604      0.677
  SVM TF-IDF              0.71       0.822
  Fisher kernel           0.715      0.862

WebKB                     F1 macro   F1 micro
  NB                      0.706      0.801
  Structure model         0.743      0.827
  SVM TF-IDF              0.651      0.737
  Fisher kernel           0.738      0.823
  Discriminant learning   0.792      0.868
NetProtect results (recall, confidence intervals in brackets):

Model                                Micro-avg recall    Macro-avg recall
  NB                                 88.4 [87.7;89]      89.9 [89.2;90.4]
  Structure model, text only         92.9 [92.3;93.3]    92.5 [91.9;93]
  Structure model, pictures only     82.7 [81.9;83.4]    83 [82.2;83.7]
  Structure model, text + pictures   94.7 [94.2;95.1]    93.6 [93.1;94]
Director Ang Lee Takes Risks with Mean Green 'Hulk'
LOS ANGELES (Reuters) - Taiwan-born director Ang Lee, perhaps best known for his Oscar-winning "Crouching Tiger, Hidden Dragon," is taking a big risk with the splashy summer popcorn flick … For loyal comic book fans who may think Lee's "Hulk" will be too touchy-feely, think again. "This is a drama, a family drama," said Lee, "but with big action." His slumping shoulders twitch and he laughs …
FAMILY DRAMA, BIG ACTION
The structure model handles both structure and content information, and both carry class information; it also applies to multimedia categorization. Not in this talk: categorization of parts of documents; categorization of trees (structure only).
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
The usual goal is to find groups of thematically similar documents. The task is different for structured documents: what does "similar documents" mean: same structure? same content? both? This is an open question.
Mixture model:

P(d | Θ) = Σ_{i=1..|C|} α_i · P(s_d | c_i, Θ)

EM algorithm (CEM); used on the structure only, on the INEX corpus.
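A toy version of CEM, assuming each document is reduced to a single hashable structure signature (the actual mixture components are belief networks; this representation is my simplification):

```python
from collections import Counter

def cem(docs, k, n_iter=10):
    """Classification EM (CEM) for a k-component mixture, with each
    component reduced to a smoothed multinomial over whole structure
    signatures. `docs` is a list of hashable signatures; returns a
    hard cluster assignment per document."""
    assign = [i % k for i in range(len(docs))]   # arbitrary init
    vocab = len(set(docs))
    for _ in range(n_iter):
        # M-step: mixture weights and smoothed per-cluster frequencies
        counts = [Counter(d for d, a in zip(docs, assign) if a == c)
                  for c in range(k)]
        sizes = [sum(cnt.values()) for cnt in counts]
        # C-step: reassign each document to its most probable cluster
        assign = [max(range(k),
                      key=lambda c: (sizes[c] / len(docs))
                                    * (counts[c][d] + 1) / (sizes[c] + vocab))
                  for d in docs]
    return assign
```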
Different models
[Figure: three example trees (Tree 1, Tree 2, Tree 3) over node labels A, B, C, E]

Estimated production probabilities:
P(A, C | A) = 3/5    P(B | A) = 2/5    P(E, E, A | B) = 2/2    P(B, C | C) = 1/1
The same counts written as weighted grammar rules:
A -> A C [3/5]    A -> B [2/5]    B -> E E A [1]    C -> B C [1]

<!DOCTYPE A [
  <!ELEMENT A (A,C)>
  <!ELEMENT A (B)>
  <!ELEMENT B (E,E,A)>
  <!ELEMENT C (B,C)>
]>
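The production probabilities above can be estimated by simple counting; a sketch over nested-tuple trees (the tree representation is my assumption):

```python
from collections import Counter

def production_probs(trees):
    """Estimate P(children labels | parent label) by counting
    productions over a forest, as in the grammar model
    (e.g. A -> A C [3/5]). A tree is a nested pair
    (label, [subtrees]); leaves have an empty child list."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node
        if children:
            rule_counts[(label, tuple(c[0] for c in children))] += 1
            lhs_counts[label] += 1
            for child in children:
                visit(child)
    for tree in trees:
        visit(tree)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```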
[Plot: micro-average entropy (%) vs. number of clusters (5 to 40), comparing Naive Bayes, Parent, Parent + Grandparent, Parent + Sibling, and Grammar models]
Mixture model of belief networks; among the different models, the grammar model performs best, and it can produce a kind of DTD. Clustering of XML documents remains an ill-defined problem.
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
- XML Document Mining Challenge
<Restaurant>
  <Nom>L’olivier</Nom>
  <Description>
    This lovely restaurant, located near the Jaurès metro station at 19 boulevard
    de la Villette in the 19th arrondissement of Paris, serves Italian cuisine,
    notably fresh three-cheese pasta.
  </Description>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>Tokyo Bar</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>Bolivar</Rue>
    <Num>127</Num>
  </Adresse>
  <Plat>Sushi</Plat>
  <Plat>Sashimi</Plat>
</Restaurant>
Problem: querying heterogeneous XML databases requires knowing the correspondence between their structured representations.
Problem: learn from examples how to map heterogeneous sources onto a target schema, preserving the document semantics. Sources: semi-structured, HTML, PDF, flat text, etc. This is a labeled tree mapping problem.
<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l’orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>
Central issue: complexity
- Large collections; large feature space (10^3 to 10^6); large, exponential search space

Approach
- Learn generative models of XML target documents from a training set
- Decode unknown source documents according to the learned model
Why use ML for structure matching?
- Multiple sources: variability, documents that do not follow the schema, collection growth, etc.
- Web sources: DTDs and schemas are often unknown or do not exist
Data-centered view (Doan et al.)
- Multiple independent classifier combination; centralized (mediator) or P2P; 1:1 or m:n transformations

Document-centered view
- Document conversion (Xerox): rendering formats (HTML, PDF, etc.) -> XML with a predefined DTD
- Information retrieval (LIP6): content-and-structure queries (e.g. INEX)
Given S_T a target format and d_Sin an input document, find the most probable target document:

d_T = argmax_{d' ∈ S_T} P(d' | d_Sin)

Decoding is performed with a learned transformation model.
The transformation decomposes into a structure part and a content part:

d' = argmax_{d'} P(s_{d'} | s_d, Θ) · P(t_{d'} | s_{d'}, t_d, Θ)
Subtask of structure mapping: the tree structure remains unchanged; learn to automatically label the nodes.

d_final = argmax_{s_{d'}^1, ..., s_{d'}^{|d'|} ∈ Λ} P(s_{d'}^1, ..., s_{d'}^{|d'|} | s_d) · P(t_d^1, ..., t_d^{|d|} | s_{d'}^1, ..., s_{d'}^{|d'|})

[Figure: an unlabeled input tree and its relabeled version over labels A, B, C, E]
[Figure: example tree with child-tag sequences: Document -> (Title, Section, Section); Section -> (Paragraph, Paragraph); Section -> (Paragraph, FootNote); one Paragraph contains an Italic]

Structure model: each node generates the tag sequence of its children:

P(s_d | θ) = ∏_{n ∈ nodes(d)} P(childrentags(n) | tag(n), θ)
Automatic structuring (F. Maes, P. Gallinari): problem setting, etc.
d = (c, s); s = (s_e, s_i)
Segmentation and structuring are performed sequentially: segmentation, then structure extraction.
Segmentation: HMM. Structure:
[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
Hypotheses
- Input document: HTML tags are mostly for visualization; remove the tags and keep only the segmentation (the leaves)
- Transformation: the leaves are the same in the HTML and the XML document
- Target document model: a node's label depends on its context = content, left sibling, father
Probability of the target tree: the document model is a maximum-entropy conditional model learned from a training set of target documents.

P(d_T | d_Sin) = P(d_T^1, ..., d_T^n) = ∏_{i=1..n} P(n_i | c_i, sib(n_i), father(n_i))

[Figure: example tree: Document -> Intro, Section, Section; the Sections contain Paragraph nodes]
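A minimal sketch of such a conditional model for one node; the feature names are illustrative, and on the slide the context is the node's content, left sibling, and father:

```python
import math

def label_posterior(features, weights, labels):
    """Maximum-entropy conditional model P(label | context) for one
    node: a linear score over (feature, label) weights followed by a
    softmax over the candidate labels."""
    scores = {l: sum(weights.get((f, l), 0.0) for f in features)
              for l in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {l: math.exp(scores[l]) / z for l in labels}
```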
Solve

d_T = argmax_{d' ∈ S_T} P(d' | d_Sin)

either exactly with dynamic-programming decoding, or approximately with LASO (Hal Daumé III, ICML 2005).
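Approximate decoding can be sketched as a beam search; `local_score` is a hypothetical local log-score, and this is only in the spirit of LASO-style search, without its learned training procedure:

```python
def beam_decode(nodes, labels, local_score, beam_width=3):
    """Approximate argmax over labelings of a node sequence: keep
    only the `beam_width` best partial hypotheses at each step.
    `local_score(prefix, node, label)` scores extending a partial
    labeling `prefix` by assigning `label` to `node`."""
    hyps = [((), 0.0)]
    for node in nodes:
        expanded = [(prefix + (label,), s + local_score(prefix, node, label))
                    for prefix, s in hyps for label in labels]
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_width]
    return hyps[0][0]
```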
INEX corpus (IEEE collection, XML):
- 12,000 documents (training: 7,800; test: 4,200)
- ≈ 5,000,000 content nodes; 139 tags; mean document depth ≈ 7; vocabulary ≈ 22,000 words
- Test corpus: the Transactions on … series, unlabeled documents (tags removed)
Results (5 tags vs. 139 tags):

            Content   Structure   Struct + Content   Naïve model
5 tags      58%       72.9%       86.5%              79.3%
139 tags    27.8%     49.7%       65.3%              9.5%
[Plots, 5 tags and 139 tags: % of documents vs. % of nodes, comparing Content with all nodes, Structure, Structure and Content, and Content without empty nodes]
Models        Labeling   Segmentation (leaves)   Structuring (internal nodes)
Exact + TMM   92.8%      75.7%                   31.2%
HMM + TMM     91.5%      24.6%                   22.8%

An extreme structuring instance: Exact + TMM here amounts to structuring degraded versions of HTML documents.
Corpora
- IEEE collection / INEX corpus: 12K documents; on average 500 leaf nodes and 200 internal nodes; 139 tags
- Movie DB: 10K movie descriptions (IMDB); on average 100 leaf nodes and 35 internal nodes; 28 tags
- Shakespeare: 39 plays; few documents, but on average 4,100 leaf nodes and 850 internal nodes; 21 tags
- Mini-Shakespeare: 60 randomly chosen scenes from the plays; 85 leaf nodes, 20 internal nodes; 7 tags
Document restructuring is a new problem: a tree transformation problem of high complexity (content + structure), with many different instances. The approach here is based on generative models of target documents.
Challenge
- INEX-DELOS and PASCAL networks of excellence
- Three tasks: Classification, Clustering, Document mapping
- 3 XML corpora: IEEE collection, IMDB (movie descriptions), Wikipedia in 4 languages
- Deadline: June 2006
Web site : http://xmlmining.lip6.fr Email : xmlmining@lip6.fr