A fundamental task in IE An important and challenging task in - - PDF document


SLIDE 1

Exploiting Domain Structure for Named Entity Recognition

Jing Jiang & ChengXiang Zhai

Department of Computer Science University of Illinois at Urbana-Champaign

Named Entity Recognition

A fundamental task in IE

An important and challenging task in biomedical text mining

Critical for relation mining

Great variation and different gene naming conventions

SLIDE 2

Need for domain adaptation

Performance degrades when test domain differs from training domain

Domain overfitting

Task        NE types       Train    Test   F1
news        LOC, ORG, PER  NYT      NYT    0.855
news        LOC, ORG, PER  Reuters  NYT    0.641
biomedical  gene, protein  mouse    mouse  0.541
biomedical  gene, protein  fly      mouse  0.281

Existing Work

Supervised learning

HMM, MEMM, CRF, SVM, etc. (e.g., [Zhou & Su 02], [Bender et al. 03], [McCallum & Li 03])

Semi-supervised learning

Co-training ([Collins & Singer 1999])

Domain adaptation

External dictionary ([Ciaramita & Altun 2005])

Not seriously studied

SLIDE 3

Outline

Observations

Method

Generalizability-based feature ranking

Rank-based prior

Experiments

Conclusions and future work

Observations

Overemphasis on domain-specific features in the trained model

Fly gene names: wingless, daughterless, eyeless, apexless, …

The feature "suffix -less" is weighted high in the model trained from fly data

Useful for other organisms? In general, no.

May cause generalizable features to be downweighted

SLIDE 4

Observations (continued)

Generalizable features: generalize well in all domains

…decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly)

…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)

The feature "w_{i+2} = expressed" is generalizable: in both the fly and the mouse examples, the word two positions after a gene name (wingless, CD38, PABPC5) is "expressed".
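As a rough illustration of the two kinds of features discussed on these slides, the sketch below extracts a suffix feature and a context-word feature for a candidate token. The function name `token_features`, the feature strings, and the whitespace tokenization are assumptions made for this example only; the slides do not specify the actual feature templates.

```python
def token_features(tokens, i):
    """Return illustrative features for the token at position i.

    Hypothetical feature templates, for illustration only:
      - "suffix=-less": the token ends in "less" (fly-specific in the slides)
      - "w[i+2]=<word>": the word two positions to the right of the token
    """
    features = []
    if tokens[i].endswith("less"):
        features.append("suffix=-less")
    if i + 2 < len(tokens):
        features.append("w[i+2]=" + tokens[i + 2])
    return features


# Sentence fragments taken from the slide examples, split on whitespace.
fly = "decapentaplegic and wingless are expressed in analogous patterns".split()
mouse = "that CD38 is expressed by both neurons and glial cells".split()

print(token_features(fly, 2))    # features for "wingless"
print(token_features(mouse, 1))  # features for "CD38"
```

Both gene mentions share the context feature `w[i+2]=expressed`, while only the fly mention fires the suffix feature.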

SLIDE 5

Generalizability-based feature ranking

Training data: fly, yeast, D3, …, Dm

[Figure: the features are ranked separately within each training domain (ranks 1 to 8 shown per domain); "expressed" and "-less" appear at different ranks in different domains.]

Generalizability score of feature f_j, where r_i(f_j) is the rank of f_j in domain i:

\[ s(f_j) = \min_i \frac{1}{r_i(f_j)} \]

s("expressed") = 1/6 = 0.167
s("-less") = 1/8 = 0.125

Generalizability-based ranking: … expressed (0.167) … -less (0.125) …
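The scoring rule on this slide (the minimum over domains of the reciprocal rank) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy rankings, and the placeholder features `f1` to `f6` are hypothetical, and giving a score of 0 to a feature that is missing from a domain is an assumption.

```python
def generalizability_scores(domain_rankings):
    """Map each feature to the min over domains of 1 / rank (ranks start at 1)."""
    all_features = set()
    for ranking in domain_rankings:
        all_features.update(ranking)

    scores = {}
    for feature in all_features:
        reciprocal_ranks = []
        for ranking in domain_rankings:
            if feature in ranking:
                reciprocal_ranks.append(1.0 / (ranking.index(feature) + 1))
            else:
                # Assumption: a feature missing from a domain scores 0 there.
                reciprocal_ranks.append(0.0)
        scores[feature] = min(reciprocal_ranks)
    return scores


def rank_by_generalizability(domain_rankings):
    """Order features by descending score; ties are broken alphabetically."""
    scores = generalizability_scores(domain_rankings)
    return sorted(scores, key=lambda f: (-scores[f], f))


# Toy per-domain rankings (hypothetical), best feature first.
domain_1 = ["-less", "f1", "expressed", "f2", "f3", "f4", "f5", "f6"]
domain_2 = ["f1", "f2", "f3", "f4", "f5", "expressed", "f6", "-less"]
domain_3 = ["f1", "expressed", "f2", "f3", "f4", "-less", "f5", "f6"]

scores = generalizability_scores([domain_1, domain_2, domain_3])
print(round(scores["expressed"], 3))  # worst rank is 6, so 1/6
print(round(scores["-less"], 3))      # worst rank is 8, so 1/8
```

Taking the minimum means a feature is only scored highly if it ranks well in every training domain, so "-less", which is top-ranked in one toy domain but last in another, ends up below "expressed".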

Feature ranking & learning

labeled training data → supervised learning algorithm → trained classifier

feature ranking (… expressed … … … -less … …) → top k features F

SLIDE 6

Feature ranking & learning

labeled training data → supervised learning algorithm → trained classifier

top k features F: … expressed … … …

Feature ranking & learning

labeled training data → supervised learning algorithm → trained classifier

feature ranking (… expressed … … … -less … …) → prior

Supervised learning algorithm: logistic regression model (MaxEnt)

Prior: rank-based prior variances in a Gaussian prior

SLIDE 7

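Prior variances

Logistic regression model

\[ p(y = k \mid \vec{x}, \vec{\beta}) = \frac{\exp(\vec{\beta}_k \cdot \vec{x})}{\sum_l \exp(\vec{\beta}_l \cdot \vec{x})} \]

MAP parameter estimation

\[ \hat{\vec{\beta}} = \arg\max_{\vec{\beta}} \; p(\vec{\beta}) \prod_{i=1}^{n} p(y_i \mid \vec{x}_i, \vec{\beta}) \]

Prior for the parameters

\[ p(\vec{\beta}) = \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{\beta_j^2}{2\sigma_j^2}\right) \]

σ_j² is a function of r_j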
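To make the MAP objective concrete, the sketch below evaluates the log of the quantity being maximized: the log-likelihood of a multiclass logistic regression model plus the log of a Gaussian prior with one variance per feature. It is an illustration under stated assumptions, not the authors' code: the function name `log_posterior`, the data layout, and sharing each feature's variance across all classes are assumptions.

```python
import math


def log_posterior(weights, examples, variances):
    """Log of p(beta) * prod_i p(y_i | x_i, beta) for logistic regression.

    weights[k][j]  -- weight of feature j for class k
    examples       -- list of (x, y) pairs; x is a list of floats, y a class index
    variances[j]   -- prior variance sigma_j^2 of feature j
                      (assumption: shared by all classes)
    """
    total = 0.0

    # Log-likelihood: sum_i [ beta_{y_i} . x_i - log sum_l exp(beta_l . x_i) ]
    for x, y in examples:
        scores = [sum(w * v for w, v in zip(w_k, x)) for w_k in weights]
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[y] - log_z

    # Log of the zero-mean Gaussian prior with per-feature variances.
    for w_k in weights:
        for w, var in zip(w_k, variances):
            total += -0.5 * math.log(2 * math.pi * var) - w * w / (2 * var)

    return total


# Two classes, one feature, all weights zero (toy numbers).
value = log_posterior([[0.0], [0.0]], [([1.0], 0)], [1.0])
print(value)
```

A small variance for a feature makes the quadratic penalty on its weight steep, so the estimate for that weight is pulled toward zero; a large variance leaves the weight free to fit the data.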

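Rank-based prior

\[ \sigma^2 = \frac{a}{r^{1/b}}, \qquad r = 1, 2, 3, \ldots \]

[Figure: variance σ² plotted against rank r; the curve starts at σ² = a for r = 1 and decreases as r grows.]

important features: large σ²

non-important features: small σ²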

SLIDE 8

Rank-based prior

[Figure: variance σ² = a / r^{1/b} plotted against rank r = 1, 2, 3, …, with curves for b = 2, b = 4 and b = 6.]

a and b are set empirically
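The mapping from rank to prior variance can be sketched in a few lines. The function name `rank_based_variances` and the values of a and b below are hypothetical; the slides only state that a and b are set empirically.

```python
def rank_based_variances(ranked_features, a, b):
    """Assign sigma^2 = a / r**(1/b) to the feature at rank r (r starts at 1)."""
    return {
        feature: a / (rank ** (1.0 / b))
        for rank, feature in enumerate(ranked_features, start=1)
    }


# Toy ranking, most generalizable feature first (hypothetical).
ranking = ["expressed", "f1", "f2", "-less"]

steep = rank_based_variances(ranking, a=1.0, b=2)
flat = rank_based_variances(ranking, a=1.0, b=6)

for feature in ranking:
    print(feature, round(steep[feature], 3), round(flat[feature], 3))
```

The top-ranked feature always receives variance a, and a larger b makes the variance fall off more slowly with rank, so lower-ranked features are penalized less.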

Summary

training data D1, …, Dm
→ individual domain feature ranking: O1, …, Om
→ generalizability-based feature ranking: O'
→ rank-based prior
→ learning: entity tagger E
→ testing: E is applied to the test data

Choosing b for the rank-based prior:

optimal b1 for D1, optimal b2 for D2, …, optimal bm for Dm

b = λ1 b1 + … + λm bm, with weights λ1, …, λm
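The combination of the per-domain optima into a single b is a weighted sum, which can be sketched as below. The function name `combine_b` and all numbers are made up for illustration; the slide does not say how the weights are chosen or whether they sum to one, so they are simply passed in by the caller here.

```python
def combine_b(weights, optimal_bs):
    """Return the weighted sum of the per-domain optimal b values."""
    if len(weights) != len(optimal_bs):
        raise ValueError("need exactly one weight per training domain")
    return sum(w * b for w, b in zip(weights, optimal_bs))


# Two training domains with hypothetical optima and equal weights.
b = combine_b([0.5, 0.5], [2.0, 6.0])
print(b)
```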

SLIDE 9

Experiments

Data set

BioCreative Challenge Task 1B

Gene/protein recognition

3 organisms/domains: fly, mouse and yeast

Experimental setup

2 organisms for training, 1 for testing

Baseline: uniform-variance Gaussian prior

Compared with 3 regular feature ranking methods: frequency, information gain, chi-square

Comparison with Baseline

Exp        Method     Precision  Recall   F1
F+M → Y    Baseline   0.557      0.466    0.508
           Domain     0.575      0.516    0.544
           % Imprv.   +3.2%      +10.7%   +7.1%
F+Y → M    Baseline   0.571      0.335    0.422
           Domain     0.582      0.381    0.461
           % Imprv.   +1.9%      +13.7%   +9.2%
M+Y → F    Baseline   0.583      0.097    0.166
           Domain     0.591      0.139    0.225
           % Imprv.   +1.4%      +43.3%   +35.5%

(F = fly, M = mouse, Y = yeast; Exp = training organisms → test organism)
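The improvement figures in this table are consistent with relative improvement over the baseline, (domain − baseline) / baseline, and the F1 column is consistent with the harmonic mean of precision and recall up to rounding. The short check below recomputes both from the reported numbers; the helper names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, improved):
    """Improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (precision, recall, F1) as reported on the slide, keyed by training organisms.
results = {
    "F+M": {"baseline": (0.557, 0.466, 0.508), "domain": (0.575, 0.516, 0.544)},
    "F+Y": {"baseline": (0.571, 0.335, 0.422), "domain": (0.582, 0.381, 0.461)},
    "M+Y": {"baseline": (0.583, 0.097, 0.166), "domain": (0.591, 0.139, 0.225)},
}

for exp, row in results.items():
    gains = [
        round(relative_improvement(base, dom), 1)
        for base, dom in zip(row["baseline"], row["domain"])
    ]
    print(exp, gains)
```

Most of the gain comes from recall, and the largest relative gain is on the fly test set, where the baseline recall is lowest.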

SLIDE 10

Comparison with regular feature ranking methods

[Figure legend: generalizability-based feature ranking; information gain and chi-square; feature frequency]

Conclusions and Future Work

We proposed

Generalizability-based feature ranking method

Rank-based prior variances

Experiments show

Domain-aware method outperformed baseline method

Generalizability-based feature ranking better than regular feature ranking

Future work: exploit the unlabeled test data

SLIDE 11


Thank you!

The End