
SLIDE 1

Using Wikipedia for Co-clustering Based Cross-domain Text Classification

Pu Wang and Carlotta Domeniconi (George Mason University), Jian Hu (Microsoft Research Asia)

ICDM — Pisa, December 16, 2008

Motivation

  • Labeled data are seldom available, and often too expensive to obtain.
  • Abundant labeled data may exist for a different but related domain.
  • Goal: Use the labeled data as auxiliary information to accomplish the task (classification) in the target domain.

SLIDE 2

Main Idea

  • Leverage the shared dictionary across the in-domain and out-of-domain (target) documents to propagate label information.

[Figure: common words shared by the in-domain documents Di and the out-of-domain documents Do act as a bridge for label propagation.]

Main Idea

  • Enrich the document representation to fill the semantic gap.

[Figure: common words and shared semantic concepts between Di and Do act as a bridge for label propagation.]

SLIDE 3

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Di : in-domain documents
  • Do : out-of-domain documents
  • C : set of class labels
  • W : dictionary of all the words

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Co-clustering of Do and W:

$C_W : \{w_1, \ldots, w_n\} \to \{\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k\} = \hat{W}$

$C_{D_o} : \{d_1, \ldots, d_m\} \to \{\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_{|C|}\} = \hat{D}_o$
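For concreteness, the two mappings can be stored as plain index arrays; a minimal sketch with assumed names and sizes (not from the slides):

    import numpy as np

    n_words, m_docs, k, n_classes = 1000, 200, 64, 3

    # C_W: word index -> word-cluster index (k word clusters).
    word_cluster = np.zeros(n_words, dtype=int)
    # C_Do: out-of-domain document index -> document-cluster index;
    # CoCC keeps exactly |C| document clusters, one per class label.
    doc_cluster = np.zeros(m_docs, dtype=int)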

SLIDE 4

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

[Figure: CoCC initialization and iteration. A Naive Bayes classifier trained on the in-domain documents gives the initial clustering of the out-of-domain documents, with a 1-1 map between class labels and document clusters; the class labels likewise initialize the word clusters. The algorithm then alternates between clustering the words of both in-domain and out-of-domain documents and re-clustering the out-of-domain documents.]

SLIDE 5

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

  • Iterative algorithm that achieves

$\min_{\hat{D}_o, \hat{W}} \; \big\{ \, [\, I(D_o; W) - I(\hat{D}_o; \hat{W}) \,] + \lambda \, [\, I(C; W) - I(C; \hat{W}) \,] \, \big\}$

where $I(D_o; W) - I(\hat{D}_o; \hat{W})$ is the loss in mutual information between documents and words, and $I(C; W) - I(C; \hat{W})$ is the loss in mutual information between class labels and words.

Information Theoretic Co-clustering [Dhillon et al., KDD 03]

$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
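For concreteness, a minimal sketch (not from the slides) of mutual information computed from a joint distribution stored as a 2-D array:

    import numpy as np

    def mutual_information(pxy):
        # I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
        px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (|X|, 1)
        py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, |Y|)
        mask = pxy > 0                        # convention: 0 log 0 = 0
        return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

    # Toy joint distribution over 2 documents x 3 words (sums to 1).
    pxy = np.array([[0.2, 0.1, 0.1],
                    [0.1, 0.2, 0.3]])
    print(mutual_information(pxy))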

SLIDE 6

Definitions used by CoCC:

$f(w) = \sum_{d \in D_o} p(d, w), \qquad f(d|w) = p(d|w) = \frac{f(d, w)}{f(w)}$

$f(d) = \sum_{w \in W} p(d, w), \qquad f(w|d) = p(w|d) = \frac{p(d, w)}{f(d)}$

$\hat{f}(\hat{w}|\hat{d}) = p(\hat{w}|\hat{d}), \quad \hat{f}(\hat{d}|\hat{w}) = p(\hat{d}|\hat{w}), \quad \hat{f}(d|\hat{d}) = p(d|\hat{d}), \quad \hat{f}(w|\hat{w}) = p(w|\hat{w})$

$\hat{f}(d|\hat{w}) = \hat{f}(d|\hat{d})\, \hat{f}(\hat{d}|\hat{w}) = p(d|\hat{d})\, p(\hat{d}|\hat{w})$

$\hat{f}(w|\hat{d}) = \hat{f}(w|\hat{w})\, \hat{f}(\hat{w}|\hat{d}) = p(w|\hat{w})\, p(\hat{w}|\hat{d})$

$g(c, w) = p(c, \hat{w})\, p(w|\hat{w}) = p(c, \hat{w}) \frac{p(w)}{p(\hat{w})}$

$g(w) = \sum_{c \in C} p(c, w), \qquad g(c|w) = p(c|w) = \frac{g(c, w)}{g(w)}$

$\hat{g}(c|\hat{w}) = \frac{\sum_{w \in \hat{w}} p(c|w)\, p(w)}{p(\hat{w})} = \frac{\sum_{w \in \hat{w}} p(c|w)\, p(w)}{\sum_{w \in \hat{w}} p(w)}$
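These quantities fall out of a normalized document-word matrix; a minimal sketch with assumed names (not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.random((6, 12))             # toy doc-word counts for Do
    p_dw = counts / counts.sum()             # joint p(d, w)

    f_w = p_dw.sum(axis=0)                   # f(w)   = sum_{d in Do} p(d, w)
    f_d = p_dw.sum(axis=1)                   # f(d)   = sum_{w in W}  p(d, w)
    f_d_given_w = p_dw / f_w                 # f(d|w) = f(d, w) / f(w)
    f_w_given_d = p_dw / f_d[:, None]        # f(w|d) = p(d, w) / f(d)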

SLIDE 7

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda\,[\, I(C; W) - I(C; \hat{W}) \,] = D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) + \lambda\, D\big(g(C, W)\,\|\,\hat{g}(C, W)\big)$

where $D(p(x)\,\|\,q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ is the KL divergence.

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) = \sum_{\hat{d} \in \hat{D}_o} \sum_{d \in \hat{d}} f(d)\, D\big(f(W|d)\,\|\,\hat{f}(W|\hat{d})\big)$

$D\big(f(D_o, W)\,\|\,\hat{f}(D_o, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} f(w)\, D\big(f(D_o|w)\,\|\,\hat{f}(D_o|\hat{w})\big)$

$D\big(g(C, W)\,\|\,\hat{g}(C, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} g(w)\, D\big(g(C|w)\,\|\,\hat{g}(C|\hat{w})\big)$
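A minimal KL-divergence helper matching the definition above (a sketch, not the authors' code):

    import numpy as np

    def kl_divergence(p, q):
        # D(p || q) = sum_x p(x) log[ p(x)/q(x) ]; assumes q > 0 wherever p > 0.
        mask = p > 0                          # convention: 0 log 0 = 0
        return float((p[mask] * np.log(p[mask] / q[mask])).sum())

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(kl_divergence(p, q))                # > 0 unless p == q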

SLIDE 8

Co-clustering based Classification (CoCC) [Dai et al., KDD 07]

$C^{(t)}_{D_o}(d) = \arg\min_{\hat{d}} \; D\big(f(W|d)\,\|\,\hat{f}^{(t-1)}(W|\hat{d})\big)$

$C^{(t+1)}_{W}(w) = \arg\min_{\hat{w}} \; \Big[\, f(w)\, D\big(f(D_o|w)\,\|\,\hat{f}^{(t)}(D_o|\hat{w})\big) + \lambda\, g(w)\, D\big(g(C|w)\,\|\,\hat{g}^{(t)}(C|\hat{w})\big) \,\Big]$
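A simplified sketch of the document-cluster update (assumed names and shapes; the cluster profile below is an aggregate word distribution, whereas full CoCC composes p(w|ŵ)p(ŵ|d̂), and the analogous word-cluster update adds the λ-weighted class term):

    import numpy as np

    EPS = 1e-12

    def kl(p, q):
        # D(p || q) for two 1-D distributions; EPS guards log(0).
        return float((p * np.log((p + EPS) / (q + EPS))).sum())

    def update_doc_clusters(p_dw, doc_cluster, n_clusters):
        # Re-assign each out-of-domain document d to the cluster d_hat
        # minimizing D(f(W|d) || f_hat(W|d_hat)).
        f_w_d = p_dw / (p_dw.sum(axis=1, keepdims=True) + EPS)     # f(w|d)
        profiles = np.stack([p_dw[doc_cluster == c].sum(axis=0)
                             for c in range(n_clusters)])
        profiles /= profiles.sum(axis=1, keepdims=True) + EPS      # f_hat(w|d_hat)
        return np.array([np.argmin([kl(f, profiles[c])
                                    for c in range(n_clusters)])
                         for f in f_w_d])

    # Toy run: 8 documents over 20 words, |C| = 2 document clusters.
    rng = np.random.default_rng(0)
    p_dw = rng.random((8, 20))
    p_dw /= p_dw.sum()
    doc_cluster = rng.integers(2, size=8)
    for _ in range(5):                        # iterate until assignments stabilize
        doc_cluster = update_doc_clusters(p_dw, doc_cluster, 2)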

Main Idea

  • Enrich the document representation to fill the semantic gap.

[Figure: common words and shared semantic concepts between Di and Do act as a bridge for label propagation.]

SLIDE 9

Building Semantic Kernels from Wikipedia: Overall Approach

Build Thesaurus from Wikipedia

[Figure: the ambiguous concept "Puma" in the thesaurus: the "Felidae" category holds concept "Puma" with redirects "Cougar" and "Mountain Lion"; the "Ford Vehicles" category holds concept "Puma (Car)", linked to "Automobile".]

Build Semantic Kernels

[Figure: enrichment pipeline. Input text: "... The Cougar, also Puma and Mountain lion, is a New World mammal of the Felidae family ...". Steps: search Wikipedia concepts in the document; collect candidate concepts (Puma: 2, ...); disambiguate ("Puma" here means a kind of animal, not a car or a sports brand); enrich the document representation with Wikipedia concepts, yielding e.g. Puma: 2, Cougar: 2, Felines: 0.98, ...]

[Figure: the Wikipedia concept proximity matrix records redirect concepts of "Puma", the membership of concept "Puma" in category "Felidae", and related concepts of "Puma".]

Proximity Matrix

The proximity matrix $P$ is block-structured over terms and concepts: the term-term block is the identity (terms are kept as they are), and the concept-concept block holds the pairwise concept proximities:

$P = \begin{pmatrix} I & 0 \\ 0 & S_c \end{pmatrix}, \qquad S_c = \begin{pmatrix} 1 & a & \cdots & b \\ a & 1 & \cdots & c \\ \vdots & \vdots & \ddots & \vdots \\ b & c & \cdots & 1 \end{pmatrix}$

For associative concepts, the proximity combines three similarities:

$S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$

where $S_{BOW}$ is content-based (bag-of-words) similarity, $S_{OLC}$ is outlink-category-based similarity, and $D_{cat}$ is the distance-based category term.
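A one-line sketch of the combination (array names and toy values are illustrative assumptions):

    import numpy as np

    lam1, lam2 = 0.4, 0.4                          # mixing weights, lam1 + lam2 <= 1
    S_bow = np.array([[1.0, 0.3], [0.3, 1.0]])     # content-based similarity
    S_olc = np.array([[1.0, 0.5], [0.5, 1.0]])     # outlink-category-based similarity
    D_cat = np.array([[0.0, 0.6], [0.6, 0.0]])     # category-tree distance
    S = lam1 * S_bow + lam2 * S_olc + (1 - lam1 - lam2) * (1 - D_cat)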

SLIDE 10

Proximity Matrix

(Proximity matrix and combined similarity $S$ as on the previous slide.)

$P_{ij} = \begin{cases} 1 & \text{if } c_i \text{ and } c_j \text{ are synonyms;} \\ \mu^{-depth} & \text{if } c_i \text{ and } c_j \text{ are hyponyms;} \\ S & \text{if } c_i \text{ and } c_j \text{ are associative concepts;} \\ 0 & \text{otherwise.} \end{cases}$
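A sketch of filling $P$ under these rules; the relation map and the value of µ are hypothetical, for illustration only:

    import numpy as np

    mu = 2.0                      # assumed hyponym decay base (illustrative)
    concepts = ["Puma", "Cougar", "Felidae", "Machine Learning"]
    # Hypothetical relations: (i, j) -> ("syn", None), ("hyp", depth),
    # or ("assoc", S_ij) with S_ij from the combined similarity above.
    relations = {
        (0, 1): ("syn", None),    # "Cougar" redirects to "Puma"
        (0, 2): ("hyp", 1),       # "Puma" is one category level below "Felidae"
        (2, 3): ("assoc", 0.05),  # weakly associated concepts
    }

    n = len(concepts)
    P = np.eye(n)                 # P_ii = 1
    for (i, j), (kind, val) in relations.items():
        if kind == "syn":
            P[i, j] = P[j, i] = 1.0
        elif kind == "hyp":
            P[i, j] = P[j, i] = mu ** (-val)   # mu^(-depth)
        elif kind == "assoc":
            P[i, j] = P[j, i] = val            # combined similarity S
    # All other pairs remain 0.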

Building Semantic Kernels

Example sentence: "Machine learning, statistical learning and data mining are related subjects."

Original BOW vector:
<machine:1, statistical:1, learn:2, data:1, mine:1, relate:1, subject:1>

After finding Wikipedia concepts (terms not covered by a concept are kept as they are):
<relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1>

Multiplying this vector by the concept proximity matrix, e.g.

                              Machine     Statistical  Data     Artificial
                              Learning    Learning     Mining   Intelligence  ...
    Machine Learning          1           0.6276       0.4044   0.1314        ...
    Statistical Learning      0.6276      1            0.2839   0.1146        ...
    Data Mining               0.4044      0.2839       1        0.0792        ...
    Artificial Intelligence   0.1314      0.1146       0.0792   1             ...

gives the enriched document vector, which now also carries related concepts that never appear in the text:
<relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; artificial intelligence:0.3252; ...>

$\tilde{\phi}(d) = \phi(d)\, P$
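A sketch of the mapping φ̃(d) = φ(d)P restricted to the concept block of $P$ (term entries pass through the identity part unchanged); note the product also re-weights the concepts already present, while the slide highlights only the newly inferred "artificial intelligence" weight:

    import numpy as np

    # Concept part of phi(d): machine learning, statistical learning,
    # data mining, artificial intelligence (absent from the text).
    phi_concepts = np.array([1.0, 1.0, 1.0, 0.0])
    P_concepts = np.array([
        [1.0,    0.6276, 0.4044, 0.1314],
        [0.6276, 1.0,    0.2839, 0.1146],
        [0.4044, 0.2839, 1.0,    0.0792],
        [0.1314, 0.1146, 0.0792, 1.0   ],
    ])
    enriched = phi_concepts @ P_concepts
    print(enriched[-1])   # 0.1314 + 0.1146 + 0.0792 = 0.3252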

SLIDE 11

Empirical Evaluation

  • Data sets: 20 Newsgroups and SRAA
  • Methods:
    • CoCC w/ and w/o enrichment
    • NB w/ and w/o enrichment

Cross-domain Classification Precision Rates

    Data Set                      NB (w/o)  CoCC (w/o)  NB (w/)  CoCC (w/)
    rec vs talk                   0.824     0.921       0.853    0.998
    rec vs sci                    0.809     0.954       0.828    0.984
    comp vs talk                  0.927     0.978       0.934    0.995
    comp vs sci                   0.552     0.898       0.673    0.987
    comp vs rec                   0.817     0.915       0.825    0.993
    sci vs talk                   0.804     0.947       0.877    0.988
    rec vs sci vs comp            0.584     0.822       0.635    0.904
    rec vs talk vs sci            0.687     0.881       0.739    0.979
    sci vs talk vs comp           0.695     0.836       0.775    0.912
    rec vs talk vs sci vs comp    0.487     0.624       0.538    0.713
    real vs simulation            0.753     0.851       0.826    0.977
    auto vs aviation              0.824     0.959       0.933    0.992

(w/o = without enrichment, w/ = with enrichment)

SLIDE 12

CoCC with enrichment: Precision as a function of the number of iterations

[Plot: precision (0.4-1.0) vs. number of iterations (1-27); curves for rec vs talk vs sci, rec vs sci vs comp, sci vs talk vs comp, and rec vs talk vs sci vs comp.]

CoCC with enrichment: Precision as a function of λ (sci vs talk vs comp)

[Plot: precision vs. λ ∈ {0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8}; curves for 128, 64, and 16 word clusters.]

SLIDE 13

CoCC with enrichment: Precision as a function of the number of word clusters (sci vs talk vs comp)

[Plot: precision vs. number of word clusters (2, 4, 8, 16, 32, 64, 128, 256, 512); curves for λ = 1, 0.25, and 0.125.]

Conclusions

  • Extended the co-clustering approach to cross-domain text classification by embedding background knowledge from Wikipedia.
  • Future work:
    • Explore alternative representations for the common language substrate.
    • Cross-language text classification.