Using Wikipedia for Co-clustering Based Cross-domain Text Classification
  1. Using Wikipedia for Co-clustering Based Cross-domain Text Classification. Pu Wang and Carlotta Domeniconi, George Mason University; Jian Hu, Microsoft Research Asia. ICDM, Pisa, December 16, 2008.
     Motivation
     • Labeled data are seldom available, and often too expensive to obtain.
     • Abundant labeled data may exist for a different but related domain.
     • Goal: use the labeled data as auxiliary information to accomplish the task (classification) in the target domain.

  2. Main Idea
     • Leverage the dictionary shared by the in-domain documents $D_i$ and the out-of-domain (target) documents $D_o$ to propagate label information through common words.
     • Enrich the document representation to fill the semantic gap: labels propagate from $D_i$ to $D_o$ through common words and semantic concepts.

  3. Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
     • $D_i$ : in-domain documents
     • $D_o$ : out-of-domain documents
     • $C$ : set of class labels
     • $W$ : dictionary of all the words
     Co-clustering of $D_o$ and $W$:
     $D_o : \{ d_1, \dots, d_m \} \to \{ \hat{d}_1, \hat{d}_2, \dots, \hat{d}_{|C|} \} = \hat{D}_o$
     $W : \{ w_1, \dots, w_n \} \to \{ \hat{w}_1, \hat{w}_2, \dots, \hat{w}_k \} = \hat{W}$
     There are as many document clusters as class labels, so each document cluster corresponds to one class in $C$.

  4. Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
     [Diagram: initialization of the out-of-domain document clusters (using the labeled in-domain documents) and of the word clusters (from the in-domain and out-of-domain documents), followed by iterative co-clustering of out-of-domain documents and words; a 1-1 map is maintained between labels and document clusters.]

  5. Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
     • Iterative algorithm that achieves
     $\min_{\hat{D}_o, \hat{W}} \left\{ I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \left( I(C; W) - I(C; \hat{W}) \right) \right\}$
     The first term is the loss in mutual information between documents and words; the second is the loss in mutual information between class labels and words.
     Information-Theoretic Co-clustering [Dhillon et al., KDD 03]:
     $I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
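The mutual information terms in the objective above can be computed directly from a joint distribution. A minimal numpy sketch (the function name `mutual_information` and the toy joint matrices are illustrative, not from the paper):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over rows
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over columns
    mask = p_xy > 0                          # 0 log 0 = 0 by convention
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independent variables carry zero mutual information ...
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(p_indep))  # → 0.0

# ... while a perfectly dependent pair attains log 2.
p_dep = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(p_dep))
```

In CoCC, the rows and columns would be out-of-domain documents and words; clustering rows and columns and recomputing $I$ on the compressed matrix gives the loss terms being minimized.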

  6. Distributions used by CoCC:
     $f(d, w) = p(d, w)$, $\quad f(w) = \sum_{d \in D_o} p(d, w)$, $\quad f(d \mid w) = p(d \mid w) = \frac{f(d, w)}{f(w)}$
     $f(d) = \sum_{w \in W} p(d, w)$, $\quad f(w \mid d) = p(w \mid d) = \frac{f(d, w)}{f(d)}$
     $\hat{f}(\hat{w}) = p(\hat{w})$, $\quad \hat{f}(\hat{d}) = p(\hat{d})$, $\quad \hat{f}(d \mid \hat{d}) = p(d \mid \hat{d})$, $\quad \hat{f}(w \mid \hat{w}) = p(w \mid \hat{w})$
     $\hat{f}(d \mid \hat{w}) = \hat{f}(d \mid \hat{d})\, \hat{f}(\hat{d} \mid \hat{w}) = p(d \mid \hat{d})\, p(\hat{d} \mid \hat{w})$
     $\hat{f}(w \mid \hat{d}) = \hat{f}(w \mid \hat{w})\, \hat{f}(\hat{w} \mid \hat{d}) = p(w \mid \hat{w})\, p(\hat{w} \mid \hat{d})$
     $g(c, w) = p(c, \hat{w})\, p(w \mid \hat{w}) = p(c, \hat{w})\, \frac{p(w)}{p(\hat{w})}$
     $g(w) = \sum_{c \in C} p(c, w)$, $\quad g(c \mid w) = p(c \mid w) = \frac{g(c, w)}{g(w)}$
     $\hat{g}(c \mid \hat{w}) = \frac{\sum_{w \in \hat{w}} p(c \mid w)\, p(w)}{p(\hat{w})} = \frac{\sum_{w \in \hat{w}} p(c \mid w)\, p(w)}{\sum_{w \in \hat{w}} p(w)}$

  7. Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
     The objective can be rewritten in terms of KL divergences:
     $I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \left( I(C; W) - I(C; \hat{W}) \right) = D\big( f(D_o, W) \,\|\, \hat{f}(D_o, W) \big) + \lambda\, D\big( g(C, W) \,\|\, \hat{g}(C, W) \big)$
     where $D\big( p(x) \,\|\, q(x) \big) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.
     The divergences decompose as
     $D\big( f(D_o, W) \,\|\, \hat{f}(D_o, W) \big) = \sum_{\hat{d} \in \hat{D}_o} \sum_{d \in \hat{d}} f(d)\, D\big( f(W \mid d) \,\|\, \hat{f}(W \mid \hat{d}) \big)$
     $D\big( f(D_o, W) \,\|\, \hat{f}(D_o, W) \big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} f(w)\, D\big( f(D_o \mid w) \,\|\, \hat{f}(D_o \mid \hat{w}) \big)$
     $D\big( g(C, W) \,\|\, \hat{g}(C, W) \big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} g(w)\, D\big( g(C \mid w) \,\|\, \hat{g}(C \mid \hat{w}) \big)$

  8. Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
     At each iteration $t$, documents and words are reassigned to the clusters that minimize their contribution to the objective:
     $C_{D_o}^{(t)}(d) = \arg\min_{\hat{d}} D\big( f(W \mid d) \,\|\, \hat{f}^{(t-1)}(W \mid \hat{d}) \big)$
     $C_{W}^{(t+1)}(w) = \arg\min_{\hat{w}} \left[ f(w)\, D\big( f(D_o \mid w) \,\|\, \hat{f}^{(t)}(D_o \mid \hat{w}) \big) + \lambda\, g(w)\, D\big( g(C \mid w) \,\|\, \hat{g}^{(t)}(C \mid \hat{w}) \big) \right]$
     Main Idea (revisited)
     • Enrich the document representation to fill the semantic gap: labels propagate from $D_i$ to $D_o$ through common words and semantic concepts.
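The word-cluster update is a plain argmin over KL divergences. A minimal numpy sketch, under the assumption that the conditional distributions have already been estimated (the function names and the toy profiles are illustrative stand-ins, not the paper's code):

```python
import numpy as np

def kl(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with the 0 log 0 = 0 convention."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def assign_word_cluster(f_do_w, f_do_wc, g_c_w, g_c_wc, f_w, g_w, lam):
    """Return the index of the word cluster minimizing
    f(w) D(f(Do|w) || f^(Do|w^)) + lam * g(w) D(g(C|w) || g^(C|w^))."""
    costs = [f_w * kl(f_do_w, f_do_wc[k]) + lam * g_w * kl(g_c_w, g_c_wc[k])
             for k in range(len(f_do_wc))]
    return int(np.argmin(costs))

# A word whose document profile and class profile both match cluster 0's
# prototypes should be assigned to cluster 0.
f_do_w = np.array([0.9, 0.1])                                  # f(Do | w)
g_c_w = np.array([0.8, 0.2])                                   # g(C | w)
f_do_wc = [np.array([0.85, 0.15]), np.array([0.1, 0.9])]       # f^(Do | w^)
g_c_wc = [np.array([0.75, 0.25]), np.array([0.2, 0.8])]        # g^(C | w^)
print(assign_word_cluster(f_do_w, f_do_wc, g_c_w, g_c_wc, 0.01, 0.01, 0.25))  # → 0
```

The document update has the same shape with a single divergence term, so one sweep over all documents and words implements one CoCC iteration.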

  9. Overall Approach
     Build a thesaurus from Wikipedia:
     • Categories (e.g., "Ford Vehicles", "Felidae") and redirect links.
     • Ambiguous concepts: "Puma" vs. "Puma (Car)". The concept "Puma" belongs to the category "Felidae"; "Puma (Car)" belongs to "Ford Vehicles".
     • Related concepts of "Puma": "Cougar", "Mountain Lion", "Automobile", "Felidae".
     Build semantic kernels: a term-and-concept proximity matrix derived from Wikipedia.
     For a text document:
     • Search for candidate Wikipedia concepts in the document. Example: "... The Cougar, also Puma and Mountain lion, is a New World mammal of the Felidae family ..." yields the candidate concept "Puma" (count 2).
     • Disambiguate: here "Puma" means a kind of animal, not a car or a sports brand.
     • Enrich the document representation with the disambiguated concepts, e.g., Puma: 2, Cougar: 2, Felines: 0.98, ...
     The proximity matrix is the identity on the term block, with pairwise concept proximities in the concept block. Associative concepts are scored by
     $S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$
     (content-based, outlink-category-based, and category-distance-based components, respectively).
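The associative score $S$ above is a convex combination of three per-pair ingredients. A minimal sketch, with arbitrarily chosen weights for illustration (the function name and the example scores are assumptions, not values from the paper):

```python
def associative_similarity(s_bow, s_olc, d_cat, lam1=0.4, lam2=0.4):
    """S = lam1*S_BOW + lam2*S_OLC + (1 - lam1 - lam2)*(1 - D_cat):
    content-based, outlink-category-based, and category-distance-based terms."""
    return lam1 * s_bow + lam2 * s_olc + (1.0 - lam1 - lam2) * (1.0 - d_cat)

# Two concepts with moderate text overlap and shared out-link categories,
# but maximal category distance: the third term vanishes.
print(associative_similarity(0.5, 0.5, 1.0))  # → 0.4
```

Since $D_{cat}$ is a distance, it enters as $(1 - D_{cat})$ so that all three terms reward similarity.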

  10. Proximity Matrix
      The matrix is the identity on the term block; the concept block holds pairwise concept proximities:
      $P_{ij} = \begin{cases} 1 & \text{if } c_i \text{ and } c_j \text{ are synonyms;} \\ \mu^{-depth} & \text{if } c_i \text{ and } c_j \text{ are hyponyms;} \\ S & \text{if } c_i \text{ and } c_j \text{ are associative concepts;} \\ 0 & \text{otherwise.} \end{cases}$
      $S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$
      (content-based, outlink-category-based, and category-distance-based components, respectively).
      Building Semantic Kernels. Example sentence: "Machine learning, statistical learning and data mining are related subjects."
      Original BOW vector: <machine:1, statistical:1, learn:2, data:1, mine:1, relate:1, subject:1>
      Find Wikipedia concepts; terms that match no concept are kept as they are:
      $\phi(d)$ = <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; ...>
      Proximity matrix $P$ (excerpt):
                                Machine Learning   Statistical Learning   Data Mining   Artificial Intelligence
      Machine Learning          1                  0.6276                 0.4044        0.1314
      Statistical Learning      0.6276             1                      0.2839        0.1146
      Data Mining               0.4044             0.2839                 1             0.0792
      Artificial Intelligence   0.1314             0.1146                 0.0792        1
      Enriched document vector: $\tilde{\phi}(d) = \phi(d)\, P$
      = <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; artificial intelligence:0.3252>
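The enrichment step is a single vector-matrix product, $\tilde{\phi}(d) = \phi(d)\,P$. A minimal numpy sketch using the proximity values shown on the slide (concept order: machine learning, statistical learning, data mining, artificial intelligence; the variable names are my own):

```python
import numpy as np

# Concept block of the proximity matrix P, as on the slide.
P = np.array([
    [1.0,    0.6276, 0.4044, 0.1314],   # machine learning
    [0.6276, 1.0,    0.2839, 0.1146],   # statistical learning
    [0.4044, 0.2839, 1.0,    0.0792],   # data mining
    [0.1314, 0.1146, 0.0792, 1.0   ],   # artificial intelligence
])

# Concept part of the document vector: the example sentence mentions the
# first three concepts once each, and "artificial intelligence" not at all.
phi = np.array([1.0, 1.0, 1.0, 0.0])

phi_tilde = phi @ P   # enriched representation
print(phi_tilde[3])   # AI picks up 0.1314 + 0.1146 + 0.0792 = 0.3252
```

This reproduces the slide's enriched weight for "artificial intelligence": a concept absent from the document gains mass proportional to its proximity to the concepts that are present.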

  11. Empirical Evaluation
      • Data sets: 20 Newsgroups and SRAA
      • Methods: CoCC with and without enrichment; Naive Bayes (NB) with and without enrichment

      Cross-domain Classification Precision Rates

      Data Set                       w/o enrichment        w/ enrichment
                                     NB       CoCC         NB       CoCC
      rec vs talk                    0.824    0.921        0.853    0.998
      rec vs sci                     0.809    0.954        0.828    0.984
      comp vs talk                   0.927    0.978        0.934    0.995
      comp vs sci                    0.552    0.898        0.673    0.987
      comp vs rec                    0.817    0.915        0.825    0.993
      sci vs talk                    0.804    0.947        0.877    0.988
      rec vs sci vs comp             0.584    0.822        0.635    0.904
      rec vs talk vs sci             0.687    0.881        0.739    0.979
      sci vs talk vs comp            0.695    0.836        0.775    0.912
      rec vs talk vs sci vs comp     0.487    0.624        0.538    0.713
      real vs simulation             0.753    0.851        0.826    0.977
      auto vs aviation               0.824    0.959        0.933    0.992

  12. CoCC with enrichment: precision as a function of the number of iterations
      [Plot: precision over iterations 1-27 for rec vs talk vs sci, rec vs sci vs comp, sci vs talk vs comp, and rec vs talk vs sci vs comp.]
      CoCC with enrichment: precision as a function of λ (sci vs talk vs comp)
      [Plot: precision for λ ranging from 0.03125 to 8, with 128, 64, and 16 word clusters.]
