
1

Instance Weighting for Domain Adaptation in NLP

Jing Jiang & ChengXiang Zhai

University of Illinois at Urbana-Champaign

June 25, 2007

2

Domain Adaptation

  • Many NLP tasks are cast into classification problems
  • Lack of training data in new domains
  • Domain adaptation:

    – POS: WSJ → biomedical text
    – NER: news → blog, speech
    – Spam filtering: public email corpus → personal inboxes

  • Domain overfitting (F1):

  NER Task                                          Train     Test    F1
  to find PER, LOC, ORG from news text              NYT       NYT     0.855
                                                    Reuters   NYT     0.641
  to find gene/protein from biomedical literature   mouse     mouse   0.541
                                                    fly       mouse   0.281

slide-2
SLIDE 2

2

3

Existing Work

  • Domain adaptation: existing work

    – Prior on model parameters [Chelba & Acero 04]
    – Mixture of general and domain-specific distributions [Daumé III & Marcu 06]
    – Analysis of representation [Ben-David et al. 07]

  • Our work

    – A fresh instance weighting perspective
    – A framework that incorporates both labeled and unlabeled instances

4

Outline

  • Analysis of domain adaptation
  • Instance weighting framework
  • Experiments
  • Conclusions
slide-3
SLIDE 3

3

5

The Need for Domain Adaptation

[diagram: source-domain vs. target-domain instances]



7

Where Does the Difference Come From?

p(x, y) = p(x) p(y | x)

  • labeling difference: ps(y | x) vs. pt(y | x) → labeling adaptation
  • instance difference: ps(x) vs. pt(x) → instance adaptation
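Spelling the decomposition out (a restatement, not on the original slide), the source/target mismatch factors into exactly these two ratios:

```latex
\frac{p_t(x,y)}{p_s(x,y)}
  = \underbrace{\frac{p_t(x)}{p_s(x)}}_{\text{instance difference}}
    \cdot
    \underbrace{\frac{p_t(y \mid x)}{p_s(y \mid x)}}_{\text{labeling difference}}
```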

8

An Instance Weighting Solution (Labeling Adaptation)

[diagram: source vs. target domain]

Where pt(y | x) ≠ ps(y | x): remove/demote those source instances.


11

An Instance Weighting Solution (Instance Adaptation: pt(x) < ps(x))

[diagram] Where pt(x) < ps(x): remove/demote those source instances.


14

An Instance Weighting Solution (Instance Adaptation: pt(x) > ps(x))

[diagram] Where pt(x) > ps(x): promote those source instances.



17

An Instance Weighting Solution (Instance Adaptation: pt(x) > ps(x))

  • Labeled target domain instances are useful
  • Unlabeled target domain instances may also be useful

[diagram: promoting instances in the target region]

18

The Exact Objective Function

θt* = argmax_θ ∫_X Σ_{y∈Y} pt(x) pt(y | x) log p(y | x; θ) dx

pt(x) and pt(y | x): unknown true marginal and conditional probabilities in the target domain
log p(y | x; θ): log likelihood (log loss function)


19

Three Sets of Instances

  • Ds: labeled source domain instances
  • Dt,l: labeled target domain instances
  • Dt,u: unlabeled target domain instances

θt* = argmax_θ ∫_X Σ_{y∈Y} pt(x) pt(y | x) log p(y | x; θ) dx

20

Three Sets of Instances: Using Ds

Approximating X ≈ Ds:

argmax_θ (1 / Σi αi βi) Σ_{i=1..Ns} αi βi log p(yi^s | xi^s; θ)

where

  βi = pt(xi^s) / ps(xi^s) — in principle, non-parametric density estimation; in practice, high dimensional data (future work)
  αi = pt(yi^s | xi^s) / ps(yi^s | xi^s) — need labeled target data
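The slide leaves βi to density estimation as future work; a common practical stand-in (not from the talk) is classifier-based density-ratio estimation: train a probabilistic source-vs-target classifier and convert its odds into pt(x)/ps(x). A minimal NumPy sketch with a plain logistic model; all names are illustrative:

```python
import numpy as np

def density_ratio_weights(Xs, Xt, lr=0.1, n_iter=500):
    """Estimate beta_i = p_t(x_i)/p_s(x_i) for each source row of Xs
    via a logistic domain classifier (a standard density-ratio trick,
    not part of the original framework): label source = 0, target = 1,
    then p_t(x)/p_s(x) is proportional to p(d=1|x) / p(d=0|x)."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):                     # batch gradient ascent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # p(d=1 | x)
        g = d - p
        w += lr * (X.T @ g) / len(X)
        b += lr * g.mean()
    p_t = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
    # len(Xs)/len(Xt) corrects for the different pool sizes
    return (len(Xs) / len(Xt)) * p_t / (1.0 - p_t)
```

Source instances that fall where the target data is dense get large odds and hence large βi; instances far from the target region are demoted, matching the promote/demote picture above.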


21

Three Sets of Instances: Using Dt,l

Approximating X ≈ Dt,l:

argmax_θ (1 / Nt,l) Σ_{j=1..Nt,l} log p(yj^t | xj^t; θ)

small sample size, estimation not accurate

22

Three Sets of Instances: Using Dt,u

Approximating X ≈ Dt,u:

argmax_θ (1 / Ct,u) Σ_{k=1..Nt,u} Σ_{y∈Y} γk(y) log p(y | xk^{t,u}; θ)

where γk(y) = pt(y | xk^{t,u}), approximated with pseudo labels (e.g. bootstrapping, EM); Ct,u is a normalization constant


23

Using All Three Sets of Instances

Ds, Dt,l, Dt,u

θt* = argmax_θ ∫_X Σ_{y∈Y} pt(x) pt(y | x) log p(y | x; θ) dx

X ≈ Ds + Dt,l + Dt,u?

24

A Combined Framework

θ̂ = argmax_θ [ (λs / Cs) Σ_{i=1..Ns} αi βi log p(yi^s | xi^s; θ)
             + (λt,l / Ct,l) Σ_{j=1..Nt,l} log p(yj^t | xj^t; θ)
             + (λt,u / Ct,u) Σ_{k=1..Nt,u} Σ_{y∈Y} γk(y) log p(y | xk^{t,u}; θ)
             + log p(θ) ]

λs + λt,l + λt,u = 1 (Cs, Ct,l, Ct,u are normalization constants)

a flexible setup covering both standard methods and new domain adaptive methods
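As a concrete (hypothetical) instantiation, the combined objective can be maximized directly for a binary logistic model p(y=1 | x) = sigmoid(x · θ); the Gaussian prior stands in for log p(θ), and all names are illustrative, a sketch rather than the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_combined(Xs, ys, Xtl, ytl, Xtu, gamma,
                 alpha=None, beta=None, lam=(1/3, 1/3, 1/3),
                 sigma2=10.0, lr=0.5, n_iter=1000):
    """Gradient ascent on the combined objective: a weighted source
    log-likelihood, a labeled-target log-likelihood, a soft
    pseudo-labeled target log-likelihood, and a Gaussian log-prior
    (an illustrative choice of p(theta)).
    gamma[k] = (gamma_k(y=0), gamma_k(y=1))."""
    d = Xs.shape[1]
    if alpha is None:
        alpha = np.ones(len(Xs))
    if beta is None:
        beta = np.ones(len(Xs))
    w = alpha * beta
    # normalizers: here Cs = sum_i alpha_i beta_i, Ctl = Nt,l,
    # Ctu = sum_{k,y} gamma_k(y)
    Cs, Ctl, Ctu = w.sum(), len(Xtl), gamma.sum()
    theta = np.zeros(d)
    for _ in range(n_iter):
        g = np.zeros(d)
        if lam[0] > 0 and Cs > 0:   # weighted source term
            g += lam[0] / Cs * (Xs.T @ (w * (ys - sigmoid(Xs @ theta))))
        if lam[1] > 0 and Ctl > 0:  # labeled target term
            g += lam[1] / Ctl * (Xtl.T @ (ytl - sigmoid(Xtl @ theta)))
        if lam[2] > 0 and Ctu > 0:  # pseudo-labeled target term
            p1 = sigmoid(Xtu @ theta)
            g += lam[2] / Ctu * (Xtu.T @ (gamma[:, 1] - gamma.sum(axis=1) * p1))
        g -= theta / sigma2         # gradient of the Gaussian log-prior
        theta += lr * g
    return theta
```

Setting lam=(1, 0, 0) with unit α and β recovers standard supervised learning on Ds, the first special case on the following slides; the other settings below are obtained the same way.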


25

In the combined objective, set αi = βi = 1, λs = 1, λt,l = λt,u = 0:

Standard Supervised Learning using only Ds

26

In the combined objective, set λt,l = 1, λs = λt,u = 0:

Standard Supervised Learning using only Dt,l


27

In the combined objective, set αi = βi = 1, λs = Ns / (Ns + Nt,l), λt,l = Nt,l / (Ns + Nt,l), λt,u = 0:

Standard Supervised Learning using both Ds and Dt,l

28

Domain Adaptive Heuristic 1: Instance Pruning

In the combined objective, set αi = 0 if (xi^s, yi^s) is predicted incorrectly by a model trained from Dt,l, and αi = 1 otherwise.
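Heuristic 1 amounts to a few lines once some classifier trained on Dt,l is available; this sketch assumes such a model's `predict` function is handed in (names are illustrative, and the experiments later prune a ranked top-k rather than all mismatches):

```python
import numpy as np

def pruning_alphas(predict, Xs, ys):
    """Heuristic 1: alpha_i = 0 if (x_i, y_i) is predicted
    incorrectly by a model trained on Dt,l, and 1 otherwise.
    `predict` is any hypothetical target-trained classifier."""
    preds = np.array([predict(x) for x in Xs])
    return (preds == np.asarray(ys)).astype(float)
```

The resulting vector plugs straight into the α weights of the combined objective, zeroing out source instances the target model disagrees with.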

29

Domain Adaptive Heuristic 2: Dt,l with Higher Weights

In the combined objective, set λs < Ns / (Ns + Nt,l) and λt,l > Nt,l / (Ns + Nt,l).

30

Standard Bootstrapping

In the combined objective, set γk(y) = 1 if p(y | xk) is large, and 0 otherwise.


31

Domain Adaptive Heuristic 3: Balanced Bootstrapping

In the combined objective, set γk(y) = 1 if p(y | xk) is large (0 otherwise), and λs = λt,u = 0.5.
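A sketch of the selection-and-weighting step, assuming the current model supplies a row of p(y | xk) per unlabeled target instance (names and the confidence threshold are illustrative):

```python
import numpy as np

def balanced_bootstrap(probs, n_source, threshold=0.9):
    """Heuristic 3: gamma_k(y) = 1 when max_y p(y | x_k) >= threshold
    (a hard pseudo label), 0 otherwise.  The selected target
    instances as a set get the same total weight as the whole source
    set (lambda_s = lambda_tu = 0.5).  Returns the indices and pseudo
    labels of the promoted instances plus the per-instance weights."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    labels = probs.argmax(axis=1)[keep]
    # 0.5 shared across Ns source instances vs. across the kept ones
    w_source = 0.5 / n_source
    w_target = 0.5 / len(keep) if len(keep) else 0.0
    return keep, labels, w_source, w_target
```

Standard bootstrapping differs only in the λ split: there the few promoted target instances carry far less total weight than the source set, which is what "balanced" fixes.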

32

Experiments

  • Three NLP tasks:

    – POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE)
    – NE type classification: newswire → conversational telephone speech (CTS) and web log (WL) (ACE 2005)
    – Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)


33

Experiments

  • Three heuristics:
  • 1. Instance pruning
  • 2. Dt,l with higher weights
  • 3. Balanced bootstrapping
  • Performance measure: accuracy

34

Instance Pruning

Removing “Misleading” Instances from Ds (accuracy; k = number of instances removed)

POS:

  k       Oncology
  0       0.8630
  8000    0.8709
  16000   0.8714
  all     0.8720

NE Type:

  k      CTS        k      WL
  0      0.7815     0      0.7045
  1600   0.8640     1200   0.6975
  3200   0.8825     2400   0.6795
  all    0.8830     all    0.6600

Spam:

  k     User 1   User 2   User 3
  0     0.6306   0.6950   0.7644
  300   0.6611   0.7228   0.8222
  600   0.7911   0.8322   0.8328
  all   0.8106   0.8517   0.8067

Useful in most cases, but failed in some (WL). When is it guaranteed to work? (future work)

slide-18
SLIDE 18

18

35

Dt,l with Higher Weights

until Ds and Dt,l Are Balanced (accuracy)

POS:

  method        Oncology
  Ds            0.8630
  Ds + Dt,l     0.9349
  Ds + 10Dt,l   0.9429
  Ds + 20Dt,l   0.9443

NE Type:

  method        CTS      WL
  Ds            0.7815   0.7045
  Ds + Dt,l     0.9340   0.7735
  Ds + 5Dt,l    0.9360   0.7820
  Ds + 10Dt,l   0.9355   0.7840

Spam:

  method        User 1   User 2   User 3
  Ds            0.6306   0.6950   0.7644
  Ds + Dt,l     0.9572   0.9572   0.9461
  Ds + 5Dt,l    0.9628   0.9611   0.9601
  Ds + 10Dt,l   0.9639   0.9628   0.9633

Dt,l is very useful; promoting Dt,l is even more useful.

36

Instance Pruning + Dt,l with Higher Weights (accuracy; Ds’ is the pruned source set)

POS:

  method         Oncology
  Ds + 20Dt,l    0.9443
  Ds’ + 20Dt,l   0.9422

NE Type:

  method         CTS      WL
  Ds + 10Dt,l    0.9355   0.7840
  Ds’ + 10Dt,l   0.8950   0.6670

Spam:

  method         User 1   User 2   User 3
  Ds + 10Dt,l    0.9639   0.9628   0.9633
  Ds’ + 10Dt,l   0.9717   0.9478   0.9494

The two heuristics do not always work well together. How to combine heuristics? (future work)


37

Balanced Bootstrapping (accuracy)

POS:

  method               Oncology
  supervised           0.8630
  standard bootstrap   0.8728
  balanced bootstrap   0.8750

NE Type:

  method               CTS      WL
  supervised           0.7781   0.7351
  standard bootstrap   0.8917   0.7498
  balanced bootstrap   0.8923   0.7523

Spam:

  method               User 1   User 2   User 3
  supervised           0.6476   0.6976   0.8068
  standard bootstrap   0.8720   0.9212   0.9760
  balanced bootstrap   0.8816   0.9256   0.9772

Promoting target instances is useful, even with pseudo labels.

38

Conclusions

  • Formally analyzed domain adaptation from an instance weighting perspective
  • Proposed an instance weighting framework for domain adaptation

    – Both labeled and unlabeled instances
    – Various weight parameters

  • Proposed a number of heuristics to set the weight parameters
  • Experiments showed the effectiveness of the heuristics


39

Future Work

  • Combining different heuristics
  • Principled ways to set the weight parameters

    – Density estimation for setting βi

Thank You!