Co-Reference in GATE



  1. Co-Reference in GATE
 Andrew Borthwick, Ph.D.
 Principal Scientist, Intelius, Inc.
 July 28, 2009
 Intelius, Inc.


  2. What is Co-reference?
 • Determine if word A refers to the same real-world entity as word B
 • In GATE, divided into two pieces
 – Orthographic: “OrthoMatcher”
 • Person, Organization, Location, and Date
 – Pronominal: Pronominal Coreferencer
 • He/him/his, she/her, it
 • Not “pleonastic it” (“it is raining”)


  3. Why Co-Reference?
 • Necessary for information extraction
 – “He received a B.A. in computer science”
 – “Mr. Jones received a B.A. in computer science”
 • Word sense disambiguation and named entity resolution
 – ‘Jack London wrote “Call of the Wild”. London is a famous American author.’
 – Resolves “Unknown” annotations in ANNIE
 • Web person search
 • Field of academic study


  4. Co-reference Engine Background
 • I have been maintaining the OrthoMatcher for the past year
 • OrthoMatcher much improved over the previous version
 – Doesn’t assume that all identical strings refer to the same person, i.e. “David” can refer to two different entities in two different places
 • One small change to pronominal coref
 – Speed improved by a factor of 50
 • Have concentrated on person matching
 • Have had to prioritize enhancements


  5. Executing GATE Co-reference
 • ANNIE OrthoMatcher
 – Processing resource in ANNIE CREOLE plug-in
 – Last element, by default, in GATE pipeline
 – Requires output of NE Transducer and ANNIE tokenizer
 • ANNIE Pronominal Coreferencer
 – In ANNIE CREOLE plug-in
 – Not loaded by default in GATE
 – Must follow OrthoMatcher


  6. Results of co-reference
 • Co-refs visible in GATE GUI
 • “Unknown” updated to appropriate named entity type, if possible
 • Co-references marked
 – In the “matches” feature on each named entity
 – In the MatchesAnnots document feature
 • Visible in the GATE GUI
 • Accessible to downstream PRs via “matches” feature and AnnotationSet.get()
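
As a rough illustration of how a downstream PR might walk a co-reference chain, the sketch below uses plain Java maps as a stand-in for an annotation set. It assumes that the “matches” feature holds a list of annotation ids, and it uses a hypothetical “string” feature for the annotation text. None of the types here are the real GATE classes; in GATE the lookup by id would go through AnnotationSet.get().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for an annotation set (annotation id -> feature map).
// These are NOT the GATE classes; only the "matches" feature name comes from the slide.
class MatchesFeatureSketch {

    // Builds one stand-in annotation. "string" is a hypothetical feature holding the text.
    static Map<String, Object> annotation(String text, List<Integer> matches) {
        Map<String, Object> features = new HashMap<>();
        features.put("string", text);
        if (matches != null) {
            features.put("matches", matches);
        }
        return features;
    }

    // Returns the text of every annotation listed in the "matches" feature of the
    // annotation with the given id (assumed here to be a list of annotation ids).
    static List<String> chainTexts(Map<Integer, Map<String, Object>> annotations, int id) {
        List<String> texts = new ArrayList<>();
        Map<String, Object> features = annotations.get(id);
        if (features == null || !(features.get("matches") instanceof List)) {
            return texts; // unknown id, or no co-reference recorded
        }
        for (Object otherId : (List<?>) features.get("matches")) {
            Map<String, Object> other = annotations.get(otherId); // lookup by id
            if (other != null) {
                texts.add((String) other.get("string"));
            }
        }
        return texts;
    }

    // Three annotations; ids 1 and 3 are marked as co-referring.
    static Map<Integer, Map<String, Object>> example() {
        Map<Integer, Map<String, Object>> annotations = new HashMap<>();
        List<Integer> chain = Arrays.asList(1, 3);
        annotations.put(1, annotation("Jack London", chain));
        annotations.put(2, annotation("Call of the Wild", null));
        annotations.put(3, annotation("London", chain));
        return annotations;
    }

    public static void main(String[] args) {
        System.out.println(chainTexts(example(), 3));
    }
}
```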


  7. ANNIE OrthoMatcher
 • Tries to find the most recent compatible match for each Annotation Type
 – For Annotation X of Type T, checks every annotation Y of Type T that occurs prior to X
 • Checks four “noMatch” rules. If a noMatch rule fires, X and Y are not a match
 • Checks c. 20 match rules of the form matchRule(Annotation1, Annotation2). If any matchRule returns “true”, then we match
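
The control flow described above can be sketched roughly as follows. This is a simplification under stated assumptions, not the OrthoMatcher source: mentions are plain strings of one type, candidates are scanned from the most recent backwards, and the two rules (EXACT and VETO_DAVID) are toy stand-ins invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BiPredicate;

class MostRecentMatchSketch {

    // Returns the index of the most recent earlier mention that matches
    // mentions.get(current), or -1 if none does.
    static int findMatch(List<String> mentions, int current,
                         List<BiPredicate<String, String>> noMatchRules,
                         List<BiPredicate<String, String>> matchRules) {
        String x = mentions.get(current);
        for (int i = current - 1; i >= 0; i--) { // most recent candidate first
            String y = mentions.get(i);
            boolean vetoed = false;
            for (BiPredicate<String, String> rule : noMatchRules) {
                if (rule.test(x, y)) { // any noMatch rule firing rejects the pair
                    vetoed = true;
                    break;
                }
            }
            if (vetoed) {
                continue;
            }
            for (BiPredicate<String, String> rule : matchRules) {
                if (rule.test(x, y)) { // any match rule returning true accepts the pair
                    return i;
                }
            }
        }
        return -1;
    }

    // Toy rules, hypothetical stand-ins for the real noMatch and match rules.
    static final BiPredicate<String, String> EXACT = (a, b) -> a.equalsIgnoreCase(b);
    static final BiPredicate<String, String> VETO_DAVID = (a, b) -> a.equalsIgnoreCase("David");

    public static void main(String[] args) {
        List<String> mentions = Arrays.asList("David", "Mary Jones", "david", "Mary Jones");
        List<BiPredicate<String, String>> none = Collections.emptyList();
        System.out.println(findMatch(mentions, 3, none, Arrays.asList(EXACT)));
    }
}
```

Because the vetoes are checked first, a pair rejected by a noMatch rule is never shown to the match rules.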


  8. No-match rules
 • Names are on the spurious match table
 – spur_match.lst
 • Incompatible middle names
 – John C. Smith != John Q. Smith
 • Incompatible genders
 – John Anderson != Mrs. Anderson
 • No forward reference of short name unless short name is unmatched
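
To make the “incompatible middle names” veto concrete, here is a minimal sketch. It assumes names are whitespace-separated first/middle/last strings and compares only the first letter of the middle token. The actual rule in OrthoMatcher works on annotations and may differ in detail.

```java
class MiddleNameRuleSketch {

    // True when both names have the form "first middle last", agree on first and
    // last name, and disagree on the middle initial, e.g. John C. Smith vs John Q. Smith.
    static boolean incompatibleMiddleNames(String a, String b) {
        String[] ta = a.trim().split("\\s+");
        String[] tb = b.trim().split("\\s+");
        if (ta.length != 3 || tb.length != 3) {
            return false; // this sketch only handles three-token names
        }
        boolean sameFirst = ta[0].equalsIgnoreCase(tb[0]);
        boolean sameLast = ta[2].equalsIgnoreCase(tb[2]);
        char middleA = Character.toLowerCase(ta[1].charAt(0));
        char middleB = Character.toLowerCase(tb[1].charAt(0));
        return sameFirst && sameLast && middleA != middleB;
    }

    public static void main(String[] args) {
        System.out.println(incompatibleMiddleNames("John C. Smith", "John Q. Smith"));
    }
}
```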


  9. Forward Reference Rule
 • Mark the name annotation whenever a shortened person name is matched with the long form
 – Don’t allow subsequent long forms to match these short forms
 – Can match if short form is not yet matched
 • Not allowed:
 – “John Smith … John … John Robertson”
 • “John Robertson” can’t coref with “John”
 • Allowed:
 – “John … John Robertson”
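
The two cases above can be reproduced with a small simulation. This is only a sketch of the idea: mentions are plain strings, the sole match rule is token containment, and the shortFormMatched array plays the role of the mark on the name annotation. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ForwardReferenceSketch {

    static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(name.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // Assigns a chain id to each mention, in document order.
    static int[] resolve(List<String> mentions) {
        int n = mentions.size();
        int[] chain = new int[n];
        boolean[] shortFormMatched = new boolean[n]; // the "mark" from the slide
        int nextChain = 0;
        for (int x = 0; x < n; x++) {
            Set<String> tx = tokens(mentions.get(x));
            chain[x] = -1;
            for (int y = x - 1; y >= 0 && chain[x] < 0; y--) {
                Set<String> ty = tokens(mentions.get(y));
                boolean xIsLongForm = tx.size() > ty.size();
                if (xIsLongForm && shortFormMatched[y]) {
                    continue; // forward reference rule: y is already tied to a long form
                }
                if (tx.containsAll(ty) || ty.containsAll(tx)) {
                    chain[x] = chain[y];
                    if (tx.size() < ty.size()) {
                        shortFormMatched[x] = true; // short form x matched an earlier long form
                    }
                    if (xIsLongForm) {
                        shortFormMatched[y] = true; // short form y is now tied to long form x
                    }
                }
            }
            if (chain[x] < 0) {
                chain[x] = nextChain++; // no antecedent found: start a new chain
            }
        }
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(Arrays.asList("John Smith", "John", "John Robertson"))));
        System.out.println(Arrays.toString(resolve(Arrays.asList("John", "John Robertson"))));
    }
}
```

In the first sequence “John Robertson” ends up in its own chain because “John” was already matched to “John Smith”. In the second, “John” is still unmatched, so the long form may attach to it.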



  10. Match Rules
 • Exact match on name, match on nickname
 – Runs off a nickname list
 – Note that Christopher = Chris, Christine = Chris, but Christopher != Christine
 • All tokens of name A are found in name B
 – John = John Smith, Smith = John Smith
 • All tokens of B in name A other than corporate designators and punctuation
 – ACME, Inc. = ACME
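
The two token-based rules above might look something like the following. It is a sketch under assumptions: names are tokenized by splitting on anything that is not a letter or digit (which also drops punctuation), and DESIGNATORS is a small made-up list standing in for the corporate designator word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenMatchSketch {

    // Hypothetical designator list; the real one comes from a configured word list.
    static final Set<String> DESIGNATORS = new HashSet<>(Arrays.asList("inc", "corp", "ltd", "llc"));

    // Lower-cased tokens with punctuation removed.
    static List<String> tokens(String name) {
        List<String> out = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // "All tokens of name A are found in name B": John = John Smith.
    static boolean allTokensFound(String a, String b) {
        List<String> ta = tokens(a);
        return !ta.isEmpty() && tokens(b).containsAll(ta);
    }

    // "All tokens of B in name A other than corporate designators and punctuation":
    // ACME, Inc. = ACME.
    static boolean matchIgnoringDesignators(String a, String b) {
        List<String> ta = tokens(a);
        List<String> tb = tokens(b);
        ta.removeAll(DESIGNATORS);
        tb.removeAll(DESIGNATORS);
        return !tb.isEmpty() && ta.containsAll(tb);
    }

    public static void main(String[] args) {
        System.out.println(allTokensFound("Smith", "John Smith"));
        System.out.println(matchIgnoringDesignators("ACME", "ACME, Inc."));
    }
}
```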


  11. OrthoMatcher Configuration
 • Word lists are defined in “definitionFileURL”
 – Format is list_file_url:list_label
 • These are the main things to configure
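
As an illustration of the list_file_url:list_label format, the sketch below parses such lines into a label-to-file map. It assumes one entry per line and splits on the last colon so that a URL containing a colon would still parse. Apart from spur_match.lst, the file names in the example are illustrative, and this is not the parser GATE itself uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DefinitionFileSketch {

    // Maps each list label to the list file it was declared with.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> labelToFile = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                continue; // blank line, or missing file or label part
            }
            labelToFile.put(trimmed.substring(colon + 1), trimmed.substring(0, colon));
        }
        return labelToFile;
    }

    public static void main(String[] args) {
        // File names other than spur_match.lst are illustrative.
        System.out.println(parse(Arrays.asList("spur_match.lst:spur_match", "cdg.lst:cdg")));
    }
}
```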




  12. Key Word Lists
 • alias: Aliases name A for name B
 – These are automatically matched
 • spur_match: Name A != name B
 – Automatically non-matched
 • Miscellaneous word lists
 – cdg: Corporate designators
 – connector: Connecting words
 – prepos: Prepositions
 • Dept. of Defense = Defense Dept.
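
One plausible reading of the “Dept. of Defense = Defense Dept.” example is that prepositions and connecting words are ignored and the remaining words are compared regardless of order. The sketch below implements that reading only. The word sets are invented stand-ins for the prepos and connector lists, and the real rule may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordOrderSketch {

    // Hypothetical contents for the prepos and connector lists.
    static final Set<String> PREPOSITIONS = new HashSet<>(Arrays.asList("of", "for", "in"));
    static final Set<String> CONNECTORS = new HashSet<>(Arrays.asList("and"));

    // Lower-cased words with punctuation, prepositions and connectors removed, sorted.
    static List<String> contentWords(String name) {
        List<String> words = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !PREPOSITIONS.contains(t) && !CONNECTORS.contains(t)) {
                words.add(t);
            }
        }
        Collections.sort(words);
        return words;
    }

    // True when both names contain the same content words, in any order.
    static boolean sameContentWords(String a, String b) {
        List<String> wa = contentWords(a);
        return !wa.isEmpty() && wa.equals(contentWords(b));
    }

    public static void main(String[] args) {
        System.out.println(sameContentWords("Dept. of Defense", "Defense Dept."));
    }
}
```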


  13. Word Lists: Nicknames
 • Defined in nickname word list parameter
 • Formatted as
 – Name 1
 – Name 2
 – Substitution likelihood (non-scientific intuition)
 – Male/female variant (not used in OrthoMatcher)
 • Name 1 and Name 2 are interchangeable for OrthoMatcher
 • minimumNicknameLikelihood parameter defines minimum substitution likelihood
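
The sketch below shows how such a list and the minimumNicknameLikelihood threshold could interact. Several details are assumptions: the fields are taken to be comma-separated, the likelihoods are made-up numbers on a 0 to 1 scale, and the class name is hypothetical. Pairs are stored directly rather than merged into groups, which is what keeps Christopher and Christine apart even though both map to Chris.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NicknameSketch {

    private final Map<String, Set<String>> pairs = new HashMap<>();

    // Each line: name1,name2,likelihood,gender (the delimiter is an assumption).
    NicknameSketch(List<String> lines, double minimumNicknameLikelihood) {
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length < 3) {
                continue; // malformed entry
            }
            double likelihood = Double.parseDouble(fields[2].trim());
            if (likelihood < minimumNicknameLikelihood) {
                continue; // below the configured threshold
            }
            link(fields[0], fields[1]); // name 1 and name 2 are interchangeable,
            link(fields[1], fields[0]); // so store the pair in both directions
        }
    }

    private void link(String from, String to) {
        pairs.computeIfAbsent(key(from), k -> new HashSet<>()).add(key(to));
    }

    private static String key(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Exact match, or a direct nickname pair. Deliberately not transitive.
    boolean matches(String a, String b) {
        String ka = key(a);
        String kb = key(b);
        Set<String> linked = pairs.getOrDefault(ka, Collections.<String>emptySet());
        return ka.equals(kb) || linked.contains(kb);
    }

    public static void main(String[] args) {
        NicknameSketch nicknames = new NicknameSketch(
                Arrays.asList("Christopher,Chris,0.9,M", "Christine,Chris,0.9,F"), 0.5);
        System.out.println(nicknames.matches("Chris", "Christopher"));
        System.out.println(nicknames.matches("Christopher", "Christine"));
    }
}
```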




  14. Other Key Parameters
 • highPrecisionOrgs
 – Use very safe features for matching Orgs
 • ACME, Inc. = ACME is okay
 • Kalamazoo Financial Corporation = Kalamazoo is not
 • extLists
 – Defaults to true, false not tested by me
 – If false, tries to derive corporate designator from document


  15. Customizing OrthoMatcher
 • Modify parameters/lists (easy)
 • Add or subtract rules (moderately hard)
 – Code a new rule
 • Use matchRule12Name() as a template
 – Add to apply_rules_namematch() in OrthoMatcher.java
 – Recompile GATE
 • From GATE home directory, do bin/ant clean jar
 • Change core logic (hard)


  16. Pronominal Coreferencer
 • In ANNIE, but not default ANNIE
 – Run after OrthoMatcher
 • Three submodules
 – Quoted speech identification
 – “Pleonastic ‘it’” identification
 • “It is raining”
 – Pronominal coreference
 • Only two parameters
 – Inanimated entity types (what you match to ‘it’)
 – resolveIt: Try to match ‘it’?
 • False by default


  17. Core Logic
 • Two JAPE phases
 – Identify quoted speech
 – Identify pleonastic it
 • Match pronouns to antecedents
 – Match “I”, “me”, “my” inside quoted speech to names outside quoted speech
 – Other than this, pretty much match pronouns with last referent
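
A toy version of the matching step described above might look like this. It is a sketch under several assumptions: mentions arrive already classified as names or pronouns and flagged as inside or outside quoted speech, gender and the handling of “it” are ignored, and a quoted “I”, “me” or “my” is linked to the nearest name outside quoted speech in either direction. The slide does not say which name outside the quote is chosen, so that last choice is a guess.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PronounSketch {

    static final class Mention {
        final String text;
        final boolean isPronoun;
        final boolean inQuote;

        Mention(String text, boolean isPronoun, boolean inQuote) {
            this.text = text;
            this.isPronoun = isPronoun;
            this.inQuote = inQuote;
        }
    }

    static final Set<String> FIRST_PERSON = new HashSet<>(Arrays.asList("i", "me", "my"));

    // For each mention, the index of its antecedent, or -1 (names always get -1).
    static int[] resolve(List<Mention> mentions) {
        int[] antecedent = new int[mentions.size()];
        Arrays.fill(antecedent, -1);
        for (int i = 0; i < mentions.size(); i++) {
            Mention m = mentions.get(i);
            if (!m.isPronoun) {
                continue;
            }
            boolean quotedFirstPerson =
                    m.inQuote && FIRST_PERSON.contains(m.text.toLowerCase(Locale.ROOT));
            antecedent[i] = quotedFirstPerson
                    ? nearestNameOutsideQuote(mentions, i)
                    : lastName(mentions, i);
        }
        return antecedent;
    }

    // "Last referent": the closest name before the pronoun.
    static int lastName(List<Mention> mentions, int from) {
        for (int j = from - 1; j >= 0; j--) {
            if (!mentions.get(j).isPronoun) {
                return j;
            }
        }
        return -1;
    }

    // Closest name outside quoted speech, looking both backwards and forwards.
    static int nearestNameOutsideQuote(List<Mention> mentions, int from) {
        for (int d = 1; d < mentions.size(); d++) {
            for (int j : new int[] {from - d, from + d}) {
                if (j >= 0 && j < mentions.size()) {
                    Mention candidate = mentions.get(j);
                    if (!candidate.isPronoun && !candidate.inQuote) {
                        return j;
                    }
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Mr. Jones said he was tired. "I agree," said Mary Smith.
        List<Mention> mentions = Arrays.asList(
                new Mention("Mr. Jones", false, false),
                new Mention("he", true, false),
                new Mention("I", true, true),
                new Mention("Mary Smith", false, false));
        System.out.println(Arrays.toString(resolve(mentions)));
    }
}
```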


  18. Pronominal Coref Assessment
 • Not very good with resolving “it”
 • Works reasonably well on personal pronouns
 – Difficult matching cases are relatively rare
 – Usual cause of error is that we fail to tag an antecedent as a person
 • When this happens, can match an antecedent fairly far away


  19. OrthoMatcher Assessment
 • Errors are also relatively rare
 • On Intelius data, errors often involve family members in appositive phrases
 – “James Madison, Jr., the son of the wealthy Virginia planter, James Madison, Sr., was the fourth president of the United States. Madison . . .”


  20. Engineering Shortcomings
 • Difficult to add a new rule. Have to modify source code and add it into the list of firing rules
 • No way to rank candidate antecedents
 – Can only say yes/no to a pair. Can’t say “maybe, unless there’s something better”
 – Would like to rank sentence head words higher
 • Three coref systems: OrthoMatcher, pronominal, and Yaoyong Li’s ML toolkit


  21. Upcoming GATE 5.1 Enhancement
 • Leverage GATE 5.1 ability to run a PR which won’t cross section boundaries
 – Can make a PR only “see” one section at a time
 – Limits errors if you can partition a document into logical sections


  22. Medium-term Engineering Goals
 • Introduce idea of named entity prioritization into OrthoMatcher
 • Can add and delete rules via config files rather than by editing coref source code
 • Provide standard API to all rules
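
To give a feel for what config-driven rules behind a standard API could look like, here is one possible shape. Everything in it is hypothetical: the NameRule interface, the registry, the rule names and the one-rule-per-line config format are invented for illustration and do not describe an existing or planned GATE interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RuleConfigSketch {

    // Hypothetical common interface that every rule would implement.
    interface NameRule {
        boolean matches(String a, String b);
    }

    // Hypothetical registry of available rules, keyed by the name used in config files.
    static final Map<String, NameRule> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("exactMatch", (a, b) -> a.equalsIgnoreCase(b));
        REGISTRY.put("firstTokenMatch", (a, b) -> firstToken(a).equalsIgnoreCase(firstToken(b)));
    }

    static String firstToken(String name) {
        return name.trim().split("\\s+")[0];
    }

    // Builds the active rule list from config lines: one rule name per line,
    // blank lines and lines starting with '#' are skipped.
    static List<NameRule> load(List<String> configLines) {
        List<NameRule> rules = new ArrayList<>();
        for (String line : configLines) {
            String name = line.trim();
            if (name.isEmpty() || name.startsWith("#")) {
                continue;
            }
            NameRule rule = REGISTRY.get(name);
            if (rule == null) {
                throw new IllegalArgumentException("Unknown rule: " + name);
            }
            rules.add(rule);
        }
        return rules;
    }

    static boolean anyMatch(List<NameRule> rules, String a, String b) {
        for (NameRule rule : rules) {
            if (rule.matches(a, b)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<NameRule> rules = load(Arrays.asList("# active rules", "exactMatch", "firstTokenMatch"));
        System.out.println(anyMatch(rules, "John", "John Smith"));
    }
}
```

With this shape, dropping a rule means deleting a line from the config rather than editing and recompiling the coref source.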

