Co-Reference in GATE
Andrew Borthwick, Ph.D.
Principal Scientist, Intelius, Inc.
July 28, 2009
What is Co-reference?
• Determine if word A refers to the same real-world entity as word B
• In GATE, divided into two pieces
  – Orthographic: “OrthoMatcher”
    • Person, Organization, Location, and Date
  – Pronominal: Pronominal Coreferencer
    • He/him/his, she/her, it
    • Not “pleonastic it” (“it is raining”)
Why Co-Reference?
• Necessary for information extraction
  – “He received a B.A. in computer science”
  – “Mr. Jones received a B.A. in computer science”
• Word sense disambiguation and named entity resolution
  – ‘Jack London wrote “Call of the Wild”. London is a famous American author.’
  – Resolves “Unknown” annotations in ANNIE
• Web person search
• Field of academic study
Co-reference Engine Background
• I have been maintaining the OrthoMatcher for the past year
• OrthoMatcher is much improved over the previous version
  – Doesn’t assume that all identical strings refer to the same person, i.e. “David” can refer to two different entities in two different places
• One small change to pronominal coref
  – Speed improved by a factor of 50
• Have concentrated on person matching
• Have had to prioritize enhancements
Executing GATE Co-reference
• ANNIE OrthoMatcher
  – Processing resource in the ANNIE CREOLE plug-in
  – Last element, by default, in the GATE pipeline
  – Requires output of the NE Transducer and the ANNIE tokenizer
• ANNIE Pronominal Coreferencer
  – In the ANNIE CREOLE plug-in
  – Not loaded by default in GATE
  – Must follow OrthoMatcher
Results of co-reference
• Co-refs visible in GATE GUI
• “Unknown” updated to appropriate named entity type, if possible
• Co-references marked
  – In the “matches” feature on each named entity
  – In the MatchesAnnots document feature
• Visible in the GATE GUI
• Accessible to downstream PRs via the “matches” feature and AnnotationSet.get()
ANNIE OrthoMatcher
• Tries to find the most recent compatible match for each Annotation Type
  – For Annotation X of Type T, checks every annotation Y of Type T that occurs prior to X
    • Checks four “noMatch” rules. If a noMatch rule fires, X and Y are not a match
    • Checks about 20 match rules of the form matchRule(Annotation1, Annotation2). If any matchRule returns “true”, X and Y match
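
The control flow on this slide can be sketched in a few lines of Python. This is an illustration only, not GATE's Java code: the function name `find_antecedent`, the string-based mentions, and the two toy rules are assumptions made for the example. It also assumes candidates are scanned most-recent-first and that the first compatible one is taken.

```python
from typing import Callable, List, Optional

# A rule takes (x, y) and returns True when it "fires" (hypothetical signature).
Rule = Callable[[str, str], bool]


def find_antecedent(x: str, earlier: List[str],
                    no_match_rules: List[Rule],
                    match_rules: List[Rule]) -> Optional[str]:
    """Return the most recent earlier mention compatible with x, or None."""
    for y in reversed(earlier):                  # most recent candidate first
        if any(rule(x, y) for rule in no_match_rules):
            continue                             # a noMatch rule vetoes the pair
        if any(rule(x, y) for rule in match_rules):
            return y                             # any firing match rule is enough
    return None


# Toy match rules, for illustration only.
def exact_match(x: str, y: str) -> bool:
    return x == y


def tokens_subset(x: str, y: str) -> bool:
    return set(x.split()) <= set(y.split())
```

For example, `find_antecedent("Smith", ["John Smith", "Jane Smith"], [], [exact_match, tokens_subset])` picks "Jane Smith", the more recent of the two compatible candidates.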
No-match rules
• Names are on the spurious match table
  – spur_match.lst
• Incompatible middle names
  – John C. Smith != John Q. Smith
• Incompatible genders
  – John Anderson != Mrs. Anderson
• No forward reference of short name unless short name is unmatched
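
Two of these vetoes, incompatible middle names and incompatible genders, might look roughly like the Python sketch below. It is not the OrthoMatcher implementation: the function names and the tiny `MALE` / `FEMALE` lookup sets are hypothetical stand-ins for whatever gender information the real rules consult.

```python
# Hypothetical lookup sets; real systems would use much larger resources.
MALE = {"mr", "john"}
FEMALE = {"mrs", "ms", "mary"}


def middle_initials(name: str) -> list:
    """Initials of every token between the first and last token."""
    tokens = name.replace(".", "").split()
    return [t[0].upper() for t in tokens[1:-1]]


def incompatible_middle_names(a: str, b: str) -> bool:
    """Fire only when both names carry middle names and they disagree."""
    ma, mb = middle_initials(a), middle_initials(b)
    return bool(ma) and bool(mb) and ma != mb


def gender(name: str):
    """Guess gender from the first token (title or first name), else None."""
    tokens = name.replace(".", "").split()
    if not tokens:
        return None
    first = tokens[0].lower()
    if first in MALE:
        return "male"
    if first in FEMALE:
        return "female"
    return None


def incompatible_genders(a: str, b: str) -> bool:
    """Fire only when both genders are known and they differ."""
    ga, gb = gender(a), gender(b)
    return ga is not None and gb is not None and ga != gb
```

Both rules stay silent when information is missing, so "John Smith" and "John C. Smith" remain eligible for the match rules.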
Forward Reference Rule
• Mark the name annotation whenever a shortened person name is matched with the long form
  – Don’t allow subsequent long forms to match these short forms
  – Can match if the short form is not yet matched
• Not allowed:
  – “John Smith … John … John Robertson”
    • “John Robertson” can’t coref with “John”
• Allowed:
  – “John … John Robertson”
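
A minimal Python sketch of this bookkeeping follows. It is an assumption-laden illustration rather than the OrthoMatcher code: names are plain strings, "short form" is approximated as a strict token subset, and `resolve` is a hypothetical helper that returns a map from each mention index to its antecedent index.

```python
from typing import Dict, List


def resolve(names: List[str]) -> Dict[int, int]:
    """Link each name to an earlier one, honouring the forward reference rule."""
    links: Dict[int, int] = {}
    short_matched = set()   # indices of short forms already tied to a long form
    for i, x in enumerate(names):
        xt = set(x.split())
        for j in range(i - 1, -1, -1):          # most recent candidate first
            yt = set(names[j].split())
            if xt == yt:                        # same name repeated
                links[i] = j
                if j in short_matched:
                    short_matched.add(i)        # the repeat inherits the mark
                break
            if xt < yt:                         # x is a short form of earlier y
                links[i] = j
                short_matched.add(i)
                break
            if yt < xt:                         # x is a long form of earlier short y
                if j in short_matched:
                    continue                    # forward reference is blocked
                links[i] = j
                short_matched.add(j)
                break
    return links
```

With `["John Smith", "John", "John Robertson"]` only "John" gets a link (to "John Smith"), while `["John", "John Robertson"]` links the long form back to the unmatched short form.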
Match Rules
• Exact match on name, match on nickname
  – Runs off a nickname list
  – Note that Christopher = Chris, Christine = Chris, but Christopher != Christine
• All tokens of name A are found in name B
  – John = John Smith, Smith = John Smith
• All tokens of B in name A other than corporate designators and punctuation
  – ACME, Inc. = ACME
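
The three rules above can be approximated in Python as follows. This is a sketch under stated assumptions, not the actual matchRule code: the function names, the in-memory `NICKNAME_PAIRS` set, and the small `CDG` set of corporate designators are all hypothetical. Storing nickname pairs as unordered pairs is what keeps the relation symmetric but not transitive.

```python
import re

# Hypothetical data; the real lists live in the OrthoMatcher word-list files.
NICKNAME_PAIRS = {
    frozenset(("christopher", "chris")),
    frozenset(("christine", "chris")),
}
CDG = {"inc", "corp", "ltd", "co"}   # corporate designators


def tokens(name: str) -> list:
    """Lower-cased word tokens with punctuation dropped."""
    return [t.lower() for t in re.findall(r"\w+", name)]


def nickname_match(a: str, b: str) -> bool:
    """Exact match, or the two names are listed together as a nickname pair."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset((a, b)) in NICKNAME_PAIRS


def all_tokens_found(a: str, b: str) -> bool:
    """Every token of name A occurs in name B."""
    ta, tb = tokens(a), tokens(b)
    return bool(ta) and all(t in tb for t in ta)


def match_ignoring_designators(a: str, b: str) -> bool:
    """Every token of A, apart from corporate designators, occurs in B."""
    core = [t for t in tokens(a) if t not in CDG]
    tb = tokens(b)
    return bool(core) and all(t in tb for t in core)
```

Because Christopher and Christine are each paired only with Chris, `nickname_match("Christopher", "Christine")` is false even though both match "Chris".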
OrthoMatcher Configuration
• Word lists are defined in “definitionFileURL”
  – Format is list_file_url:list_label
• These are the main things to configure
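
As a rough illustration of the list_file_url:list_label format, the Python sketch below reads such lines into a label-to-URL map. The helper name `parse_definition` and the sample file names other than spur_match.lst are assumptions; splitting on the last colon is a design choice made here so that a URL containing its own colon still parses.

```python
from typing import Dict


def parse_definition(text: str) -> Dict[str, str]:
    """Map each list label to its list file URL (hypothetical helper)."""
    lists: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                              # skip blank lines
        url, sep, label = line.rpartition(":")    # split on the LAST colon
        if not sep or not url or not label:
            raise ValueError(f"expected list_file_url:list_label, got {line!r}")
        lists[label] = url
    return lists
```

For instance, a line such as `spur_match.lst:spur_match` yields the label "spur_match" pointing at the file "spur_match.lst".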
Key Word Lists
• alias: Aliases name A for name B
  – These are automatically matched
• spur_match: Name A != name B
  – Automatically non-matched
• Miscellaneous word lists
  – cdg: Corporate designators
  – connector: Connecting words
  – prepos: Prepositions
    • Dept. of Defense = Defense Dept.
Word Lists: Nicknames
• Defined in the nickname word list parameter
• Formatted as
  – Name 1
  – Name 2
  – Substitution likelihood (non-scientific intuition)
  – Male/female variant (not used in OrthoMatcher)
• Name 1 and Name 2 are interchangeable for OrthoMatcher
• minimumNicknameLikelihood parameter defines the minimum substitution likelihood
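
The effect of minimumNicknameLikelihood can be sketched as a simple filter. The Python below is illustrative only: it works on in-memory entries rather than the real list file (whose on-disk layout is not shown on the slide), and it assumes the likelihood is a number where higher means more interchangeable and that an entry exactly at the threshold is kept.

```python
from typing import Iterable, NamedTuple, Set


class NicknameEntry(NamedTuple):
    name1: str
    name2: str
    likelihood: float        # substitution likelihood (scale is an assumption)
    gender_variant: str      # carried along but ignored, as in OrthoMatcher


def usable_pairs(entries: Iterable[NicknameEntry],
                 minimum_nickname_likelihood: float) -> Set[frozenset]:
    """Keep only pairs whose likelihood meets the configured minimum."""
    return {
        frozenset((e.name1.lower(), e.name2.lower()))   # unordered: interchangeable
        for e in entries
        if e.likelihood >= minimum_nickname_likelihood
    }
```

Raising the threshold trades recall for precision: fewer nickname pairs survive, so fewer names are treated as interchangeable.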
Other Key Parameters
• highPrecisionOrgs
  – Use very safe features for matching Orgs
    • ACME, Inc. = ACME is okay
    • Kalamazoo Financial Corporation = Kalamazoo is not
• extLists
  – Defaults to true; false not tested by me
  – If false, tries to derive corporate designators from the document
Customizing OrthoMatcher
• Modify parameters/lists (easy)
• Add or subtract rules (moderately hard)
  – Code a new rule
    • Use matchRule12Name() as a template
  – Add to apply_rules_namematch() in OrthoMatcher.java
  – Recompile GATE
    • From the GATE home directory, do bin/ant clean jar
• Change core logic (hard)
Pronominal Coreferencer
• In ANNIE, but not default ANNIE
  – Run after OrthoMatcher
• Three submodules
  – Quoted speech identification
  – “Pleonastic ‘it’” identification
    • “It is raining”
  – Pronominal coreference
• Only two parameters
  – Inanimate entity types (what you match to ‘it’)
  – resolveIt: Try to match ‘it’?
    • False by default
Core Logic
• Two JAPE phases
  – Identify quoted speech
  – Identify pleonastic it
• Match pronouns to antecedents
  – Match “I”, “me”, “my” inside quoted speech to names outside quoted speech
  – Other than this, pretty much match pronouns with the last referent
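
The matching step can be caricatured in Python as below. This is a simplified sketch, not the Pronominal Coreferencer's code: the `Mention` class and `resolve_pronoun` are hypothetical, quoted-speech detection is assumed to have already set `in_quote`, gender agreement for he/she is an added assumption, and "names outside quoted speech" is approximated as the nearest such name in either direction.

```python
from dataclasses import dataclass
from typing import List, Optional

FIRST_PERSON = {"i", "me", "my"}
PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female"}


@dataclass
class Mention:
    text: str
    is_pronoun: bool
    in_quote: bool = False          # set by an earlier quoted-speech phase
    gender: Optional[str] = None    # known only for some names


def resolve_pronoun(mentions: List[Mention], i: int) -> Optional[int]:
    """Return the index of the antecedent for the pronoun at index i, or None."""
    p = mentions[i]
    word = p.text.lower()
    if word in FIRST_PERSON and p.in_quote:
        # First person inside a quote: nearest name outside quoted speech.
        names = [j for j, m in enumerate(mentions)
                 if not m.is_pronoun and not m.in_quote]
        return min(names, key=lambda j: abs(j - i), default=None)
    wanted = PRONOUN_GENDER.get(word)
    if wanted is None:
        return None                 # e.g. "it" is not resolved in this sketch
    for j in range(i - 1, -1, -1):  # otherwise: the last compatible referent
        m = mentions[j]
        if not m.is_pronoun and m.gender in (wanted, None):
            return j
    return None
```

The backward scan is what makes a missed person tag costly: if the nearest name was not tagged, the pronoun falls through to a referent further away.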
Pronominal Coref Assessment
• Not very good at resolving “it”
• Works reasonably well on personal pronouns
  – Difficult matching cases are relatively rare
  – Usual cause of error is that we fail to tag an antecedent as a person
    • When this happens, can match an antecedent fairly far away
OrthoMatcher Assessment
• Errors are also relatively rare
• On Intelius data, errors often involve family members in appositive phrases
  – “James Madison, Jr., the son of the wealthy Virginia planter, James Madison, Sr., was the fourth president of the United States. Madison . . .”
Engineering Shortcomings
• Difficult to add a new rule. Have to modify source code and add it into the list of firing rules
• No way to rank candidate antecedents
  – Can only say yes/no to a pair. Can’t say “maybe, unless there’s something better”
  – Would like to rank sentence head words higher
• Three coref systems: OrthoMatcher, pronominal, and Yaoyong Li’s ML toolkit
Upcoming GATE 5.1 Enhancement
• Leverage GATE 5.1 ability to run a PR which won’t cross section boundaries
  – Can make a PR only “see” one section at a time
  – Limits errors if you can partition a document into logical sections
Medium-term Engineering Goals
• Introduce the idea of named entity prioritization into OrthoMatcher
• Allow rules to be added and deleted via config files rather than by editing coref source code
• Provide a standard API to all rules