
SLIDE 1

Using Wikipedia for Co-clustering Based Cross-domain Text Classification

Pu Wang and Carlotta Domeniconi (George Mason University), Jian Hu (Microsoft Research Asia)

ICDM — Pisa, December 16, 2008

Motivation

  • Labeled data are seldom available, and often too expensive to obtain.
  • Abundant labeled data may exist for a different but related domain.
  • Goal: Use the labeled data as auxiliary information to accomplish the task (classification) in the target domain.

SLIDE 2

Main Idea

  • Leverage the shared dictionary across the in-domain and out-of-domain (target) documents to propagate label information.

[Figure: common words shared by the in-domain documents Di and the out-of-domain documents Do act as a bridge for label propagation.]

Main Idea

  • Enrich the document representation to fill the semantic gap.

[Figure: common words and shared semantic concepts between Di and Do act as a bridge for label propagation.]

SLIDE 3

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Di : in-domain documents
  • Do : out-of-domain documents
  • C : set of class labels
  • W : dictionary of all the words

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Co-clustering of Do and W:

$C_W : \{w_1, \ldots, w_n\} \to \{\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k\} = \hat{W}$

$C_{D_o} : \{d_1, \ldots, d_m\} \to \{\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_{|C|}\} = \hat{D}_o$
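For concreteness, the two mappings can be stored as plain index arrays; a minimal sketch with assumed names and sizes (not from the slides):

    import numpy as np

    n_words, m_docs, k, n_classes = 1000, 200, 64, 3

    # C_W: word index -> word-cluster index (k word clusters).
    word_cluster = np.zeros(n_words, dtype=int)
    # C_Do: out-of-domain document index -> document-cluster index;
    # CoCC keeps exactly |C| document clusters, one per class label.
    doc_cluster = np.zeros(m_docs, dtype=int)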

SLIDE 4

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

[Figure: CoCC initialization and iteration. A Naive Bayes classifier trained on the in-domain documents gives the initial clustering of the out-of-domain documents, with a 1-1 map between class labels and document clusters; the class labels likewise initialize the word clusters. The algorithm then alternates between clustering the words of both in-domain and out-of-domain documents and re-clustering the out-of-domain documents.]

SLIDE 5

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Iterative algorithm that achieves

$\min_{\hat{D}_o, \hat{W}} \; \big\{ \, [\, I(D_o; W) - I(\hat{D}_o; \hat{W}) \,] + \lambda \, [\, I(C; W) - I(C; \hat{W}) \,] \, \big\}$

where $I(D_o; W) - I(\hat{D}_o; \hat{W})$ is the loss in mutual information between documents and words, and $I(C; W) - I(C; \hat{W})$ is the loss in mutual information between class labels and words.

Information Theoretic Co-clustering [Dhillon et al., KDD 03]

$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
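For concreteness, a minimal sketch (not from the slides) of mutual information computed from a joint distribution stored as a 2-D array:

    import numpy as np

    def mutual_information(pxy):
        # I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
        px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (|X|, 1)
        py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, |Y|)
        mask = pxy > 0                        # convention: 0 log 0 = 0
        return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

    # Toy joint distribution over 2 documents x 3 words (sums to 1).
    pxy = np.array([[0.2, 0.1, 0.1],
                    [0.1, 0.2, 0.3]])
    print(mutual_information(pxy))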

SLIDE 6

Definitions used by CoCC:

$f(w) = \sum_{d \in D_o} p(d, w), \qquad f(d|w) = p(d|w) = \frac{f(d, w)}{f(w)}$

$f(d) = \sum_{w \in W} p(d, w), \qquad f(w|d) = p(w|d) = \frac{p(d, w)}{f(d)}$

$\hat{f}(\hat{w}|\hat{d}) = p(\hat{w}|\hat{d}), \quad \hat{f}(\hat{d}|\hat{w}) = p(\hat{d}|\hat{w}), \quad \hat{f}(d|\hat{d}) = p(d|\hat{d}), \quad \hat{f}(w|\hat{w}) = p(w|\hat{w})$

$\hat{f}(d|\hat{w}) = \hat{f}(d|\hat{d})\, \hat{f}(\hat{d}|\hat{w}) = p(d|\hat{d})\, p(\hat{d}|\hat{w})$

$\hat{f}(w|\hat{d}) = \hat{f}(w|\hat{w})\, \hat{f}(\hat{w}|\hat{d}) = p(w|\hat{w})\, p(\hat{w}|\hat{d})$

$g(c, w) = p(c, \hat{w})\, p(w|\hat{w}) = p(c, \hat{w}) \frac{p(w)}{p(\hat{w})}$

$g(w) = \sum_{c \in C} p(c, w), \qquad g(c|w) = p(c|w) = \frac{g(c, w)}{g(w)}$

$\hat{g}(c|\hat{w}) = \frac{\sum_{w \in \hat{w}} p(c|w)\, p(w)}{p(\hat{w})} = \frac{\sum_{w \in \hat{w}} p(c|w)\, p(w)}{\sum_{w \in \hat{w}} p(w)}$
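These quantities fall out of a normalized document-word matrix; a minimal sketch with assumed names (not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.random((6, 12))             # toy doc-word counts for Do
    p_dw = counts / counts.sum()             # joint p(d, w)

    f_w = p_dw.sum(axis=0)                   # f(w)   = sum_{d in Do} p(d, w)
    f_d = p_dw.sum(axis=1)                   # f(d)   = sum_{w in W}  p(d, w)
    f_d_given_w = p_dw / f_w                 # f(d|w) = f(d, w) / f(w)
    f_w_given_d = p_dw / f_d[:, None]        # f(w|d) = p(d, w) / f(d)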

SLIDE 7

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda\,[\, I(C; W) - I(C; \hat{W}) \,] = D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) + \lambda\, D\big(g(C, W)\,\|\,\hat{g}(C, W)\big)$

where $D(p(x)\,\|\,q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ is the KL divergence.

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) = \sum_{\hat{d} \in \hat{D}_o} \sum_{d \in \hat{d}} f(d)\, D\big(f(W|d)\,\|\,\hat{f}(W|\hat{d})\big)$

$D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} f(w)\, D\big(f(D_o|w)\,\|\,\hat{f}(D_o|\hat{w})\big)$

$D\big(g(C, W)\,\|\,\hat{g}(C, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} g(w)\, D\big(g(C|w)\,\|\,\hat{g}(C|\hat{w})\big)$
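A minimal KL-divergence helper matching the definition above (a sketch, not the authors' code):

    import numpy as np

    def kl_divergence(p, q):
        # D(p || q) = sum_x p(x) log[ p(x)/q(x) ]; assumes q > 0 wherever p > 0.
        mask = p > 0                          # convention: 0 log 0 = 0
        return float((p[mask] * np.log(p[mask] / q[mask])).sum())

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(kl_divergence(p, q))                # > 0 unless p == q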

SLIDE 8

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$C^{(t)}_{D_o}(d) = \arg\min_{\hat{d}} \; D\big(f(W|d)\,\|\,\hat{f}^{(t-1)}(W|\hat{d})\big)$

$C^{(t+1)}_{W}(w) = \arg\min_{\hat{w}} \; \Big[\, f(w)\, D\big(f(D_o|w)\,\|\,\hat{f}^{(t)}(D_o|\hat{w})\big) + \lambda\, g(w)\, D\big(g(C|w)\,\|\,\hat{g}^{(t)}(C|\hat{w})\big) \,\Big]$
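A simplified sketch of the document-cluster update (assumed names and shapes; the cluster profile below is an aggregate word distribution, whereas full CoCC composes p(w|ŵ)p(ŵ|d̂), and the analogous word-cluster update adds the λ-weighted class term):

    import numpy as np

    EPS = 1e-12

    def kl(p, q):
        # D(p || q) for two 1-D distributions; EPS guards log(0).
        return float((p * np.log((p + EPS) / (q + EPS))).sum())

    def update_doc_clusters(p_dw, doc_cluster, n_clusters):
        # Re-assign each out-of-domain document d to the cluster d_hat
        # minimizing D(f(W|d) || f_hat(W|d_hat)).
        f_w_d = p_dw / (p_dw.sum(axis=1, keepdims=True) + EPS)     # f(w|d)
        profiles = np.stack([p_dw[doc_cluster == c].sum(axis=0)
                             for c in range(n_clusters)])
        profiles /= profiles.sum(axis=1, keepdims=True) + EPS      # f_hat(w|d_hat)
        return np.array([np.argmin([kl(f, profiles[c])
                                    for c in range(n_clusters)])
                         for f in f_w_d])

    # Toy run: 8 documents over 20 words, |C| = 2 document clusters.
    rng = np.random.default_rng(0)
    p_dw = rng.random((8, 20))
    p_dw /= p_dw.sum()
    doc_cluster = rng.integers(2, size=8)
    for _ in range(5):                        # iterate until assignments stabilize
        doc_cluster = update_doc_clusters(p_dw, doc_cluster, 2)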

Main Idea

  • Enrich the document representation to fill the semantic gap.

[Figure: common words and shared semantic concepts between Di and Do act as a bridge for label propagation.]

SLIDE 9

Building Semantic Kernels from Wikipedia: Overall Approach

Build Thesaurus from Wikipedia

[Figure: the ambiguous concept "Puma" in the thesaurus: the "Felidae" category holds concept "Puma" with redirects "Cougar" and "Mountain Lion"; the "Ford Vehicles" category holds concept "Puma (Car)", linked to "Automobile".]

Build Semantic Kernels

[Figure: enrichment pipeline. Input text: "... The Cougar, also Puma and Mountain lion, is a New World mammal of the Felidae family ...". Steps: search Wikipedia concepts in the document; collect candidate concepts (Puma: 2, ...); disambiguate ("Puma" here means a kind of animal, not a car or a sports brand); enrich the document representation with Wikipedia concepts, yielding e.g. Puma: 2, Cougar: 2, Felines: 0.98, ...]

[Figure: the Wikipedia concept proximity matrix records redirect concepts of "Puma", the membership of concept "Puma" in category "Felidae", and related concepts of "Puma".]

Proximity Matrix

The proximity matrix $P$ is block-structured over terms and concepts: the term-term block is the identity (terms are kept as they are), and the concept-concept block holds the pairwise concept proximities:

$P = \begin{pmatrix} I & 0 \\ 0 & S_c \end{pmatrix}, \qquad S_c = \begin{pmatrix} 1 & a & \cdots & b \\ a & 1 & \cdots & c \\ \vdots & \vdots & \ddots & \vdots \\ b & c & \cdots & 1 \end{pmatrix}$

For associative concepts, the proximity combines three similarities:

$S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$

where $S_{BOW}$ is content-based (bag-of-words) similarity, $S_{OLC}$ is outlink-category-based similarity, and $D_{cat}$ is the distance-based category term.
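A one-line sketch of the combination (array names and toy values are illustrative assumptions):

    import numpy as np

    lam1, lam2 = 0.4, 0.4                          # mixing weights, lam1 + lam2 <= 1
    S_bow = np.array([[1.0, 0.3], [0.3, 1.0]])     # content-based similarity
    S_olc = np.array([[1.0, 0.5], [0.5, 1.0]])     # outlink-category-based similarity
    D_cat = np.array([[0.0, 0.6], [0.6, 0.0]])     # category-tree distance
    S = lam1 * S_bow + lam2 * S_olc + (1 - lam1 - lam2) * (1 - D_cat)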

SLIDE 10

Proximity Matrix

(Proximity matrix and combined similarity $S$ as on the previous slide.)

$P_{ij} = \begin{cases} 1 & \text{if } c_i \text{ and } c_j \text{ are synonyms;} \\ \mu^{-depth} & \text{if } c_i \text{ and } c_j \text{ are hyponyms;} \\ S & \text{if } c_i \text{ and } c_j \text{ are associative concepts;} \\ 0 & \text{otherwise.} \end{cases}$
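A sketch of filling $P$ under these rules; the relation map and the value of µ are hypothetical, for illustration only:

    import numpy as np

    mu = 2.0                      # assumed hyponym decay base (illustrative)
    concepts = ["Puma", "Cougar", "Felidae", "Machine Learning"]
    # Hypothetical relations: (i, j) -> ("syn", None), ("hyp", depth),
    # or ("assoc", S_ij) with S_ij from the combined similarity above.
    relations = {
        (0, 1): ("syn", None),    # "Cougar" redirects to "Puma"
        (0, 2): ("hyp", 1),       # "Puma" is one category level below "Felidae"
        (2, 3): ("assoc", 0.05),  # weakly associated concepts
    }

    n = len(concepts)
    P = np.eye(n)                 # P_ii = 1
    for (i, j), (kind, val) in relations.items():
        if kind == "syn":
            P[i, j] = P[j, i] = 1.0
        elif kind == "hyp":
            P[i, j] = P[j, i] = mu ** (-val)   # mu^(-depth)
        elif kind == "assoc":
            P[i, j] = P[j, i] = val            # combined similarity S
    # All other pairs remain 0.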

Building Semantic Kernels

Example sentence: "Machine learning, statistical learning and data mining are related subjects."

Original BOW vector:
<machine:1, statistical:1, learn:2, data:1, mine:1, relate:1, subject:1>

After finding Wikipedia concepts (terms not covered by a concept are kept as they are):
<relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1>

Multiplying this vector by the concept proximity matrix, e.g.

                              Machine     Statistical  Data     Artificial
                              Learning    Learning     Mining   Intelligence  ...
    Machine Learning          1           0.6276       0.4044   0.1314        ...
    Statistical Learning      0.6276      1            0.2839   0.1146        ...
    Data Mining               0.4044      0.2839       1        0.0792        ...
    Artificial Intelligence   0.1314      0.1146       0.0792   1             ...

gives the enriched document vector, which now also carries related concepts that never appear in the text:
<relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; artificial intelligence:0.3252; ...>

$\tilde{\phi}(d) = \phi(d)\, P$
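A sketch of the mapping φ̃(d) = φ(d)P restricted to the concept block of $P$ (term entries pass through the identity part unchanged); note the product also re-weights the concepts already present, while the slide highlights only the newly inferred "artificial intelligence" weight:

    import numpy as np

    # Concept part of phi(d): machine learning, statistical learning,
    # data mining, artificial intelligence (absent from the text).
    phi_concepts = np.array([1.0, 1.0, 1.0, 0.0])
    P_concepts = np.array([
        [1.0,    0.6276, 0.4044, 0.1314],
        [0.6276, 1.0,    0.2839, 0.1146],
        [0.4044, 0.2839, 1.0,    0.0792],
        [0.1314, 0.1146, 0.0792, 1.0   ],
    ])
    enriched = phi_concepts @ P_concepts
    print(enriched[-1])   # 0.1314 + 0.1146 + 0.0792 = 0.3252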

SLIDE 11

Empirical Evaluation

  • Data sets: 20 Newsgroups and SRAA
  • Methods:
    • CoCC w/ and w/o enrichment
    • NB w/ and w/o enrichment

Cross-domain Classification Precision Rates

    Data Set                      NB (w/o)  CoCC (w/o)  NB (w/)  CoCC (w/)
    rec vs talk                   0.824     0.921       0.853    0.998
    rec vs sci                    0.809     0.954       0.828    0.984
    comp vs talk                  0.927     0.978       0.934    0.995
    comp vs sci                   0.552     0.898       0.673    0.987
    comp vs rec                   0.817     0.915       0.825    0.993
    sci vs talk                   0.804     0.947       0.877    0.988
    rec vs sci vs comp            0.584     0.822       0.635    0.904
    rec vs talk vs sci            0.687     0.881       0.739    0.979
    sci vs talk vs comp           0.695     0.836       0.775    0.912
    rec vs talk vs sci vs comp    0.487     0.624       0.538    0.713
    real vs simulation            0.753     0.851       0.826    0.977
    auto vs aviation              0.824     0.959       0.933    0.992

(w/o = without enrichment, w/ = with enrichment)

SLIDE 12

CoCC with enrichment: Precision as a function of the number of iterations

[Plot: precision (0.4-1.0) vs. number of iterations (1-27); curves for rec vs talk vs sci, rec vs sci vs comp, sci vs talk vs comp, and rec vs talk vs sci vs comp.]

CoCC with enrichment: Precision as a function of λ (sci vs talk vs comp)

[Plot: precision vs. λ ∈ {0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8}; curves for 128, 64, and 16 word clusters.]

SLIDE 13

CoCC with enrichment: Precision as a function of the number of word clusters (sci vs talk vs comp)

[Plot: precision vs. number of word clusters (2, 4, 8, 16, 32, 64, 128, 256, 512); curves for λ = 1, 0.25, and 0.125.]

Conclusions

  • Extended the co-clustering approach to cross-domain text classification by embedding background knowledge from Wikipedia.
  • Future work:
    • Explore alternative representations for the common language substrate.
    • Cross-language text classification.