
Jing Jiang & ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign

- A fundamental task in IE
- An important and challenging task in biomedical text mining
  - Critical for relation mining
  - Great variation and different gene naming conventions

- Performance degrades when test domain differs from training domain
  - Domain overfitting

| task       | NE types      | train   | test  | F1    |
|------------|---------------|---------|-------|-------|
| news       | LOC, ORG, PER | NYT     | NYT   | 0.855 |
| news       | LOC, ORG, PER | Reuters | NYT   | 0.641 |
| biomedical | gene, protein | mouse   | mouse | 0.541 |
| biomedical | gene, protein | fly     | mouse | 0.281 |

- Supervised learning
  - HMM, MEMM, CRF, SVM, etc. (e.g., [Zhou & Su 02], [Bender et al. 03], [McCallum & Li 03])
- Semi-supervised learning
  - Co-training ([Collins & Singer 1999])
- Domain adaptation
  - External dictionary ([Ciaramita & Altun 2005])
  - Not seriously studied

- Observations
- Method
  - Generalizability-based feature ranking
  - Rank-based prior
- Experiments
- Conclusions and future work

Observations
- Overemphasis on domain-specific features in the trained model
  - "suffix -less" weighted high in the model trained from fly data (wingless, daughterless, eyeless, apexless, …)
  - Useful for other organisms? In general, NO!
  - May cause generalizable features to be downweighted

Observations
- Generalizable features: generalize well in all domains
  - "…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…" (fly)
  - "…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues." (mouse)
  - The feature "w_{i+2} = expressed" (the word two positions after the current token is "expressed") is generalizable

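Generalizability-based feature ranking
- Training data: domains fly, yeast, D_3, …, D_m
- Features are ranked within each domain separately; each feature is then scored by its worst rank across domains:

\[ s(f) = \min_j \frac{1}{r_j(f)} \]

  where r_j(f) is the rank of feature f in domain j.

| rank | fly       | yeast     | D_3       | D_m       |
|------|-----------|-----------|-----------|-----------|
| 1    | …         | …         | …         | …         |
| 2    | …         | …         | …         | …         |
| 3    | -less     | …         | …         | …         |
| 4    | …         | expressed | expressed | …         |
| 5    | …         | …         | …         | expressed |
| 6    | expressed | …         | …         | …         |
| 7    | …         | …         | -less     | …         |
| 8    | …         | -less     | …         | -less     |

- s("expressed") = 1/6 = 0.167
- s("-less") = 1/8 = 0.125
- Resulting generalizability-based ranking: …, expressed (0.167), …, -less (0.125), …

[Diagram: ranked feature list F (…, expressed, …, -less, …) -> top k features -> supervised learning algorithm, with labeled training data -> trained classifier]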
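The scoring rule on this slide, s(f) as the minimum over domains of 1/rank, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `make_ranking` helper, the filler feature names, and the restriction to features that appear in every domain's ranking are all hypothetical. Only the ranks of "expressed" and "-less" are taken from the slide.

```python
from typing import Dict, List


def generalizability_scores(domain_rankings: List[List[str]]) -> Dict[str, float]:
    """Score each feature by s(f) = min_j 1 / r_j(f), with 1-based ranks."""
    # Assumption: only features ranked in every domain receive a score.
    shared = set(domain_rankings[0]).intersection(*domain_rankings[1:])
    scores = {}
    for feature in shared:
        # The minimum of 1/rank is attained at the worst (largest) rank.
        worst_rank = max(r.index(feature) + 1 for r in domain_rankings)
        scores[feature] = 1.0 / worst_rank
    return scores


def make_ranking(domain: str, placed: Dict[int, str], size: int = 8) -> List[str]:
    """Toy ranking: named features at the given ranks, unique fillers elsewhere."""
    return [placed.get(r, f"{domain}_filler_{r}") for r in range(1, size + 1)]


# Ranks of the two named features follow the slide; everything else is filler.
rankings = [
    make_ranking("fly", {3: "-less", 6: "expressed"}),
    make_ranking("yeast", {4: "expressed", 8: "-less"}),
    make_ranking("d3", {4: "expressed", 7: "-less"}),
    make_ranking("dm", {5: "expressed", 8: "-less"}),
]
scores = generalizability_scores(rankings)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Because the score depends only on the worst rank, a feature such as "-less" that is ranked third in one domain still ends up below "expressed", which is ranked moderately well everywhere.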

[Diagram: ranked feature list F (…, expressed, …) -> top k features -> supervised learning algorithm, with labeled training data -> trained classifier]

[Diagram: ranked feature list F (…, expressed, …, -less, …) -> rank-based prior (variances in a Gaussian prior) -> prior for the logistic regression model (MaxEnt) -> supervised learning algorithm, with labeled training data -> trained classifier]

- Logistic regression model

\[ p(y = k \mid x, \beta) = \frac{\exp(\beta_k \cdot x)}{\sum_l \exp(\beta_l \cdot x)} \]

- MAP parameter estimation

\[ \hat{\beta} = \arg\max_{\beta} \; p(\beta) \prod_{i=1}^{n} p(y_i \mid x_i, \beta) \]

  - p(β) is a prior for the parameters; σ_j^2 is a function of the rank r_j

\[ p(\beta) = \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{\beta_j^2}{2\sigma_j^2}\right) \]

- Rank-based prior variance

\[ \sigma^2 = \frac{a}{r^{1/b}} \]

[Plot: variance σ^2 against rank r = 1, 2, 3, …; important features (top ranks) get a large σ^2, non-important features get a small σ^2]
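As a rough illustration of the MAP objective on this slide, the sketch below evaluates the logistic regression probabilities, the zero-mean Gaussian log prior, and their sum in log space. The function names are hypothetical, and sharing one variance per feature across all classes is an assumption made here for simplicity; the slide only indexes the variances by j.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def class_probabilities(x: Vector, betas: Sequence[Vector]) -> List[float]:
    """p(y = k | x, beta) = exp(beta_k . x) / sum_l exp(beta_l . x)."""
    scores = [sum(b * v for b, v in zip(beta_k, x)) for beta_k in betas]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_gaussian_prior(betas: Sequence[Vector], variances: Vector) -> float:
    """log p(beta) for a zero-mean Gaussian with per-feature variance sigma_j^2."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - b * b / (2.0 * var)
        for beta_k in betas
        for b, var in zip(beta_k, variances)
    )


def log_posterior(betas: Sequence[Vector], variances: Vector,
                  data: Sequence[Tuple[Vector, int]]) -> float:
    """log p(beta) + sum_i log p(y_i | x_i, beta), up to an additive constant."""
    log_lik = sum(math.log(class_probabilities(x, betas)[y]) for x, y in data)
    return log_gaussian_prior(betas, variances) + log_lik


# With all-zero weights the model is uniform over the classes.
uniform = class_probabilities([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]])

# The same nonzero weight is penalized far less under a large variance.
loose = log_gaussian_prior([[2.0]], [10.0])
tight = log_gaussian_prior([[2.0]], [0.1])
print(uniform, loose, tight)
```

The last two lines show why the variance matters: a feature with a large prior variance can take a large weight cheaply, while a feature with a small variance is pulled strongly toward zero.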

- Rank-based prior variance

\[ \sigma^2 = \frac{a}{r^{1/b}} \]

  - a and b are set empirically

[Plot: variance σ^2 against rank r = 1, 2, 3, …, with one curve each for b = 2, b = 4 and b = 6]

[Diagram of the full method]
- Training data: domains D_1, …, D_m; test data: E
- Individual domain feature ranking: D_1, …, D_m -> O_1, …, O_m
- Generalizability-based feature ranking: O_1, …, O_m -> O'
- Rank-based prior built from O', with b = λ_1 b_1 + … + λ_m b_m, where b_i is the optimal b for D_i
- Learning -> entity tagger -> testing on E
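The variance schedule and the combination of per-domain values of b can be tried out numerically. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the values a = 1, the equal weights, and the per-domain b values in the example are illustrative assumptions, since the slide only says that a and b are set empirically.

```python
from typing import Sequence


def rank_based_variance(rank: int, a: float, b: float) -> float:
    """sigma^2 = a / rank^(1/b), for rank = 1, 2, 3, ..."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return a / rank ** (1.0 / b)


def combine_b(weights: Sequence[float], per_domain_b: Sequence[float]) -> float:
    """b = w_1 * b_1 + ... + w_m * b_m, one term per training domain."""
    if len(weights) != len(per_domain_b):
        raise ValueError("need exactly one weight per domain")
    return sum(w * b for w, b in zip(weights, per_domain_b))


# Variance at a few ranks for the three b values plotted on the slide (a = 1 assumed).
for b in (2, 4, 6):
    row = [round(rank_based_variance(r, 1.0, b), 3) for r in (1, 10, 100)]
    print(f"b = {b}: {row}")

# Hypothetical equal weights over two training domains with optimal b of 2 and 6.
b_combined = combine_b([0.5, 0.5], [2.0, 6.0])
print(b_combined)
```

A larger b makes the variance fall off more slowly with rank, so more of the lower-ranked features keep a loose prior; a smaller b concentrates the freedom on the top-ranked features.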

Experiments
- Data set
  - BioCreative Challenge Task 1B
  - Gene/protein recognition
  - 3 organisms/domains: fly, mouse and yeast
- Experimental setup
  - 2 organisms for training, 1 for testing
  - Baseline: uniform-variance Gaussian prior
  - Compared with 3 regular feature ranking methods: frequency, information gain, chi-square

Results (F = fly, M = mouse, Y = yeast; "Domain" is the domain-aware method)

| Train | Test | Method   | Precision | Recall | F1     |
|-------|------|----------|-----------|--------|--------|
| F+M   | Y    | Baseline | 0.557     | 0.466  | 0.508  |
| F+M   | Y    | Domain   | 0.575     | 0.516  | 0.544  |
| F+M   | Y    | % Imprv. | +3.2%     | +10.7% | +7.1%  |
| F+Y   | M    | Baseline | 0.571     | 0.335  | 0.422  |
| F+Y   | M    | Domain   | 0.582     | 0.381  | 0.461  |
| F+Y   | M    | % Imprv. | +1.9%     | +13.7% | +9.2%  |
| M+Y   | F    | Baseline | 0.583     | 0.097  | 0.166  |
| M+Y   | F    | Domain   | 0.591     | 0.139  | 0.225  |
| M+Y   | F    | % Imprv. | +1.4%     | +43.3% | +35.5% |
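The "% Imprv." rows are relative improvements over the baseline, which can be checked directly from the reported precision, recall and F1 values. The short script below is only a verification aid; the dictionary layout and function name are assumptions, while the numbers are copied from the results table.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F1)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Values copied from the slide: (baseline, domain-aware) per experiment.
results: Dict[str, Tuple[Scores, Scores]] = {
    "F+M -> Y": ((0.557, 0.466, 0.508), (0.575, 0.516, 0.544)),
    "F+Y -> M": ((0.571, 0.335, 0.422), (0.582, 0.381, 0.461)),
    "M+Y -> F": ((0.583, 0.097, 0.166), (0.591, 0.139, 0.225)),
}

gains: Dict[str, List[float]] = {}
for experiment, (baseline, domain) in results.items():
    gains[experiment] = [
        round(relative_improvement(b, d), 1) for b, d in zip(baseline, domain)
    ]
    print(experiment, gains[experiment])
```

In all three settings most of the F1 gain comes from recall, and the gain is largest when fly is the held-out test domain, where the baseline recall is lowest.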

[Figure: comparison of generalizability-based feature ranking with regular feature ranking methods: feature frequency, information gain and chi-square]

Conclusions and future work
- We proposed
  - Generalizability-based feature ranking method
  - Rank-based prior variances
- Experiments show
  - Domain-aware method outperformed baseline method
  - Generalizability-based feature ranking better than regular feature ranking
- Future work
  - To exploit the unlabeled test data

Thank you!
