
Co-Reference in GATE, Andrew Borthwick, Ph.D. - PowerPoint PPT Presentation



  1. Co-Reference in GATE
 Andrew Borthwick, Ph.D.
 Principal Scientist, Intelius, Inc.
 July 28, 2009
 Intelius, Inc.


  2. What is Co-reference?
 • Determine if word A refers to the same real-world entity as word B
 • In GATE, divided into two pieces
   – Orthographic: “OrthoMatcher”
     • Person, Organization, Location, and Date
   – Pronominal: Pronominal Coreferencer
     • He/him/his, she/her, it
     • Not “pleonastic it” (“it is raining”)


  3. Why Co-Reference?
 • Necessary for information extraction
   – “He received a B.A. in computer science”
   – “Mr. Jones received a B.A. in computer science”
 • Word sense disambiguation and named entity resolution
   – ‘Jack London wrote “Call of the Wild”. London is a famous American author.’
   – Resolves “Unknown” annotations in ANNIE
 • Web person search
 • Field of academic study


  4. Co-reference Engine Background
 • I have been maintaining the OrthoMatcher for the past year
 • OrthoMatcher much improved over the previous version
   – Doesn’t assume that all identical strings refer to the same person, i.e. “David” can refer to two different entities in two different places
 • One small change to pronominal coref
   – Speed improved by factor of 50
 • Have concentrated on person matching
 • Have had to prioritize enhancements


  5. Executing GATE Co-reference
 • ANNIE OrthoMatcher
   – Processing resource in ANNIE CREOLE plug-in
   – Last element, by default, in GATE pipeline
   – Requires output of NE Transducer and ANNIE tokenizer
 • ANNIE Pronominal Coreferencer
   – In ANNIE CREOLE plug-in
   – Not loaded by default in GATE
   – Must follow OrthoMatcher


  6. Results of co-reference
 • Co-refs visible in GATE GUI
 • “Unknown” updated to appropriate named entity type, if possible
 • Co-references marked
   – In the “matches” feature on each named entity
   – In the MatchesAnnots document feature
     • Visible in the GATE GUI
 • Accessible to downstream PRs via “matches” feature and AnnotationSet.get()
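
As a rough illustration of the two places where co-references are recorded, the sketch below models them with plain Python dictionaries rather than the GATE API. Only the feature names "matches" and "MatchesAnnots" come from the slide; the annotation ids, the `chains` list, and the helper `coref_chain` are hypothetical, and the exact shape GATE gives these features is an assumption here.

```python
# Toy model of how co-reference results could be stored (not the GATE API).
# Assumption: each annotation's "matches" feature lists the ids of every
# annotation in its chain, and the document-level "MatchesAnnots" feature
# holds all chains.

annotations = {
    1: {"type": "Person", "text": "John Smith", "features": {}},
    2: {"type": "Organization", "text": "ACME, Inc.", "features": {}},
    3: {"type": "Person", "text": "Smith", "features": {}},
    4: {"type": "Organization", "text": "ACME", "features": {}},
}

# Chains a matcher might have produced for this toy document.
chains = [[1, 3], [2, 4]]

document_features = {"MatchesAnnots": chains}
for chain in chains:
    for ann_id in chain:
        # Every member of a chain carries the full list of chain members.
        annotations[ann_id]["features"]["matches"] = list(chain)


def coref_chain(ann_id):
    """Return the texts of all annotations co-referent with ann_id."""
    ids = annotations[ann_id]["features"].get("matches", [ann_id])
    return [annotations[i]["text"] for i in ids]


print(coref_chain(3))
```

A downstream component would follow the same pattern: read the ids from "matches", then look each id up in the annotation set.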


  7. ANNIE OrthoMatcher
 • Tries to find the most recent compatible match for each Annotation Type
   – For Annotation X of Type T, checks every annotation Y of Type T that occurs prior to X
 • Checks four “noMatch” rules. If a noMatch rule fires, X and Y are not a match
 • Checks c. 20 match rules of the form matchRule(Annotation1, Annotation2). If any matchRule returns “true”, then we match
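
The control flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the OrthoMatcher source: annotations are plain dictionaries, and the three rule functions and the `SPURIOUS_PAIRS` table are hypothetical stand-ins for the real rule set.

```python
def find_match(x, earlier, no_match_rules, match_rules):
    """Return the most recent earlier annotation compatible with x, or None.

    x and the items of `earlier` are dicts with "type" and "text" keys;
    `earlier` is in document order.
    """
    for y in reversed(earlier):                 # most recent candidate first
        if y["type"] != x["type"]:
            continue                            # only compare same-type annotations
        if any(rule(x, y) for rule in no_match_rules):
            continue                            # a noMatch rule fired: veto this pair
        if any(rule(x, y) for rule in match_rules):
            return y                            # one successful match rule is enough
    return None


# Hypothetical stand-ins for the real rules.
SPURIOUS_PAIRS = {("smith", "john smith")}

def on_spurious_list(x, y):
    return (x["text"].lower(), y["text"].lower()) in SPURIOUS_PAIRS

def exact_match(x, y):
    return x["text"].lower() == y["text"].lower()

def all_tokens_found(x, y):
    return set(x["text"].lower().split()) <= set(y["text"].lower().split())


earlier = [
    {"type": "Person", "text": "John Smith"},
    {"type": "Location", "text": "Smith"},
    {"type": "Person", "text": "Mary Jones"},
]
x = {"type": "Person", "text": "Smith"}

match = find_match(x, earlier, [], [exact_match, all_tokens_found])
print(match["text"] if match else None)
```

Note that the noMatch rules act as vetoes on a single pair: when one fires, the search simply moves on to the next earlier candidate.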


  8. No-match rules
 • Names are on the spurious match table
   – spur_match.lst
 • Incompatible middle names
   – John C. Smith != John Q. Smith
 • Incompatible genders
   – John Anderson != Mrs. Anderson
 • No forward reference of short name unless short name is unmatched
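
Two of these vetoes, incompatible middle names and incompatible genders, are simple enough to sketch. The version below is a minimal illustration that assumes names are plain strings; the lookup tables and function names are hypothetical and far smaller than anything a real system would use.

```python
# Hypothetical lookup tables, for illustration only.
MALE_FIRST_NAMES = {"john"}
MALE_TITLES = {"mr."}
FEMALE_TITLES = {"mrs.", "ms.", "miss"}


def middle_initial(name):
    """Return the middle token of a three-token name, lower-cased, else None."""
    tokens = name.split()
    if len(tokens) == 3:
        return tokens[1].rstrip(".").lower()
    return None


def incompatible_middle_names(a, b):
    """Veto when both names have a middle name and the initials differ."""
    ma, mb = middle_initial(a), middle_initial(b)
    return ma is not None and mb is not None and ma[:1] != mb[:1]


def gender(name):
    """Guess a gender from the first token (title or first name), else None."""
    tokens = name.split()
    if not tokens:
        return None
    first = tokens[0].lower()
    if first in FEMALE_TITLES:
        return "female"
    if first in MALE_TITLES or first in MALE_FIRST_NAMES:
        return "male"
    return None


def incompatible_genders(a, b):
    """Veto only when both genders are known and they differ."""
    ga, gb = gender(a), gender(b)
    return ga is not None and gb is not None and ga != gb


print(incompatible_middle_names("John C. Smith", "John Q. Smith"))
print(incompatible_genders("John Anderson", "Mrs. Anderson"))
```

Both checks stay silent when information is missing (no middle name, unknown gender), so they only block pairs that are positively inconsistent.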


  9. Forward Reference Rule
 • Mark the name annotation whenever a shortened person name is matched with the long form
   – Don’t allow subsequent long forms to match these short forms
   – Can match if short form is not yet matched
 • Not allowed:
   – “John Smith … John … John Robertson”
     • “John Robertson” can’t coref with “John”
 • Allowed:
   – “John … John Robertson”
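
The bookkeeping behind this rule can be sketched as follows. This is an illustrative reconstruction from the two examples above, not the OrthoMatcher code: it handles only short-form/long-form pairs, and `is_short_form`, `resolve`, and the `matched_short` set are hypothetical names.

```python
def is_short_form(short, long):
    """True if `short` has fewer tokens than `long` and all of them occur in it."""
    s, l = short.lower().split(), long.lower().split()
    return len(s) < len(l) and set(s) <= set(l)


def resolve(names):
    """Link person names (given in document order) to earlier mentions.

    Returns a list of (index, antecedent_index) pairs.
    """
    matched_short = set()   # indices of short names already tied to a long form
    links = []
    for i, name in enumerate(names):
        for j in range(i - 1, -1, -1):          # most recent candidate first
            prev = names[j]
            if is_short_form(name, prev):
                # Backward reference: a short name after its long form.
                links.append((i, j))
                matched_short.add(i)            # mark the short name as taken
                break
            if is_short_form(prev, name) and j not in matched_short:
                # Forward reference: allowed only while the short name is unmatched.
                links.append((i, j))
                matched_short.add(j)
                break
    return links


print(resolve(["John Smith", "John", "John Robertson"]))
print(resolve(["John", "John Robertson"]))
```

In the first call "John" is claimed by "John Smith", so "John Robertson" finds no antecedent; in the second, "John" is still free and the forward reference is accepted.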


  10. Match Rules
 • Exact match on name, match on nickname
   – Runs off a nickname list
   – Note that Christopher = Chris, Christine = Chris, but Christopher != Christine
 • All tokens of name A are found in name B
   – John = John Smith, Smith = John Smith
 • All tokens of B in name A other than corporate designators and punctuation
   – ACME, Inc. = ACME
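
A minimal sketch of these three rules, assuming names are plain strings. The nickname pairs and corporate designators below are tiny hypothetical tables, and the function names are illustrative rather than taken from OrthoMatcher.java.

```python
import string

# Hypothetical entries; the real lists are loaded from word-list files.
NICKNAMES = {("christopher", "chris"), ("christine", "chris")}
CORPORATE_DESIGNATORS = {"inc", "corp", "ltd"}


def tokens(name):
    """Lower-case tokens with surrounding punctuation stripped."""
    cleaned = (t.strip(string.punctuation).lower() for t in name.split())
    return [t for t in cleaned if t]


def nickname_match(a, b):
    """Exact match, or the pair appears in the nickname table in either order."""
    a, b = a.lower(), b.lower()
    return a == b or (a, b) in NICKNAMES or (b, a) in NICKNAMES


def all_tokens_found(a, b):
    """Every token of name a also occurs in name b."""
    return set(tokens(a)) <= set(tokens(b))


def without_designators(name):
    return [t for t in tokens(name) if t not in CORPORATE_DESIGNATORS]


def match_ignoring_designators(a, b):
    """Names agree once corporate designators and punctuation are dropped."""
    core_a = without_designators(a)
    return bool(core_a) and core_a == without_designators(b)


print(nickname_match("Christopher", "Chris"), nickname_match("Christopher", "Christine"))
print(all_tokens_found("Smith", "John Smith"))
print(match_ignoring_designators("ACME, Inc.", "ACME"))
```

Because the nickname table stores pairs rather than equivalence classes, the relation is deliberately not transitive: both long forms map to "Chris" without matching each other.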


  11. OrthoMatcher Configuration
 • Word lists are defined in “definitionFileURL”
   – Format is list_file_url:list_label
 • These are the main things to configure
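
To make the list_file_url:list_label format concrete, here is a small parser sketch. It is an assumption about how such a definition file could be read, not GATE's loader; of the example entries, only spur_match.lst appears in the slides, and the other file names are hypothetical.

```python
def parse_definition_lines(lines):
    """Map each list label to its list file URL.

    Each non-blank line is expected to look like list_file_url:list_label.
    """
    lists = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        # Split on the LAST colon so that URLs containing ':' stay intact.
        url, _, label = line.rpartition(":")
        if not url or not label:
            raise ValueError(f"expected list_file_url:list_label, got {line!r}")
        lists[label] = url
    return lists


# Hypothetical definition file contents.
example = [
    "alias.lst:alias",
    "spur_match.lst:spur_match",
    "",
    "file:/lists/cdg.lst:cdg",
]
print(parse_definition_lines(example))
```

Splitting on the last colon is a design choice made for this sketch so that a full URL can be used on the left-hand side.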


  12. Key Word Lists
 • alias: Aliases name A for name B
   – These are automatically matched
 • spur_match: Name A != name B
   – Automatically non-matched
 • Miscellaneous word lists
   – cdg: Corporate designators
   – connector: Connecting words
   – prepos: Prepositions
     • Dept. of Defense = Defense Dept.


  13. Word Lists: Nicknames
 • Defined in nickname word list parameter
 • Formatted as
   – Name 1
   – Name 2
   – Substitution likelihood (non-scientific intuition)
   – Male/female variant (not used in OrthoMatcher)
 • Name 1 and Name 2 are interchangeable for OrthoMatcher
 • minimumNicknameLikelihood parameter defines minimum substitution likelihood
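
The effect of minimumNicknameLikelihood can be sketched as a filter applied while the list is loaded. The slide does not give the field separator, so the tab-separated layout below is an assumption, as are the sample rows, their likelihood values, and the function names.

```python
def load_nicknames(lines, minimum_likelihood):
    """Return the set of unordered name pairs whose likelihood passes the threshold.

    Assumed row layout: name1 <TAB> name2 <TAB> likelihood <TAB> gender
    """
    pairs = set()
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue                       # skip blank or malformed rows
        name1, name2, likelihood = fields[0], fields[1], float(fields[2])
        # The fourth field (male/female variant) is ignored here.
        if likelihood >= minimum_likelihood:
            # A frozenset makes the pair order-independent.
            pairs.add(frozenset((name1.lower(), name2.lower())))
    return pairs


def is_nickname(a, b, pairs):
    return frozenset((a.lower(), b.lower())) in pairs


# Hypothetical rows with made-up likelihoods.
rows = [
    "Christopher\tChris\t0.9\tM",
    "Christine\tChris\t0.9\tF",
    "Margaret\tPeggy\t0.3\tF",
]
pairs = load_nicknames(rows, minimum_likelihood=0.5)
print(is_nickname("Chris", "Christopher", pairs))
print(is_nickname("Peggy", "Margaret", pairs))
```

Raising the threshold trades recall for precision: unusual substitutions drop out of the table and are never proposed as matches.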


  14. Other Key Parameters
 • highPrecisionOrgs
   – Use very safe features for matching Orgs
     • ACME, Inc. = ACME is okay
     • Kalamazoo Financial Corporation = Kalamazoo is not
 • extLists
   – Defaults to true, false not tested by me
   – If false, tries to derive corporate designator from document
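
The two highPrecisionOrgs examples suggest a distinction that can be sketched as below. The slide does not spell out what happens when the flag is off, so the looser token-subset fallback is an assumption, and the designator table and function names are hypothetical.

```python
import string

CORPORATE_DESIGNATORS = {"inc", "corp", "corporation", "ltd"}  # hypothetical entries


def tokens(name):
    """Lower-case tokens with surrounding punctuation stripped."""
    cleaned = (t.strip(string.punctuation).lower() for t in name.split())
    return [t for t in cleaned if t]


def orgs_match(a, b, high_precision_orgs=True):
    ta, tb = tokens(a), tokens(b)
    core_a = [t for t in ta if t not in CORPORATE_DESIGNATORS]
    core_b = [t for t in tb if t not in CORPORATE_DESIGNATORS]
    if core_a and core_a == core_b:
        return True          # names differ only by designators/punctuation: safe
    if high_precision_orgs:
        return False         # anything riskier is rejected in high-precision mode
    # Assumed looser behaviour: accept when one name's tokens all occur in the other.
    shorter, longer = sorted((set(ta), set(tb)), key=len)
    return bool(shorter) and shorter <= longer


print(orgs_match("ACME, Inc.", "ACME"))
print(orgs_match("Kalamazoo Financial Corporation", "Kalamazoo"))
print(orgs_match("Kalamazoo Financial Corporation", "Kalamazoo", high_precision_orgs=False))
```

The second pair is rejected in high-precision mode because "Financial" is a content word, not a designator, so dropping it changes the name.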


  15. Customizing OrthoMatcher
 • Modify parameters/lists (easy)
 • Add or subtract rules (moderately hard)
   – Code a new rule
     • Use matchRule12Name() as a template
   – Add to apply_rules_namematch() in OrthoMatcher.java
   – Recompile GATE
     • From GATE home directory, do bin/ant clean jar
 • Change core logic (hard)


  16. Pronominal Coreferencer
 • In ANNIE, but not default ANNIE
   – Run after OrthoMatcher
 • Three submodules
   – Quoted speech identification
   – “Pleonastic ‘it’” identification
     • “It is raining”
   – Pronominal coreference
 • Only two parameters
   – Inanimated entity types (what you match to ‘it’)
   – resolveIt: Try to match ‘it’?
     • False by default


  17. Core Logic
 • Two JAPE phases
   – Identify quoted speech
   – Identify pleonastic it
 • Match pronouns to antecedents
   – Match “I”, “me”, “my” inside quoted speech to names outside quoted speech
   – Other than this, pretty much match pronouns with last referent
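
The matching step can be approximated in a few lines. This sketch assumes that quoted speech has already been identified and that pleonastic "it" has been filtered out upstream; the mention dictionaries are hypothetical, and choosing the nearest name outside the quote for first-person pronouns is an assumption, since the slide does not say which outside name is picked.

```python
FIRST_PERSON = {"i", "me", "my"}


def resolve_pronouns(mentions):
    """Link pronouns to antecedent names.

    mentions: list of dicts with "text", "kind" ("name" or "pronoun") and
    "in_quote" keys, in document order.
    Returns {pronoun_index: antecedent_index}.
    """
    links = {}
    for i, m in enumerate(mentions):
        if m["kind"] != "pronoun":
            continue
        if m["in_quote"] and m["text"].lower() in FIRST_PERSON:
            # First-person pronoun inside a quote: look for a name OUTSIDE
            # quoted speech; take the nearest one (assumption).
            outside = [j for j, n in enumerate(mentions)
                       if n["kind"] == "name" and not n["in_quote"]]
            if outside:
                links[i] = min(outside, key=lambda j: abs(j - i))
        else:
            # Everything else: the most recent preceding name ("last referent").
            for j in range(i - 1, -1, -1):
                if mentions[j]["kind"] == "name":
                    links[i] = j
                    break
    return links


# Mentions from: "I met Mary," said John. He left.
mentions = [
    {"text": "I", "kind": "pronoun", "in_quote": True},
    {"text": "Mary", "kind": "name", "in_quote": True},
    {"text": "John", "kind": "name", "in_quote": False},
    {"text": "He", "kind": "pronoun", "in_quote": False},
]
print(resolve_pronouns(mentions))
```

Here "I" skips "Mary" because she is inside the quote, and "He" picks up "John" simply because he is the last name mentioned.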


  18. Pronominal Coref Assessment
 • Not very good with resolving “it”
 • Works reasonably well on personal pronouns
   – Difficult matching cases are relatively rare
   – Usual cause of error is that we fail to tag an antecedent as a person
     • When this happens, can match an antecedent fairly far away


  19. OrthoMatcher Assessment
 • Errors are also relatively rare
 • On Intelius data, errors often involve family members in appositive phrases
   – “James Madison, Jr., the son of the wealthy Virginia planter, James Madison, Sr., was the fourth president of the United States. Madison . . .”


  20. Engineering Shortcomings
 • Difficult to add a new rule. Have to modify source code and add it into the list of firing rules
 • No way to rank candidate antecedents
   – Can only say yes/no to a pair. Can’t say “maybe, unless there’s something better”
   – Would like to rank sentence head words higher
 • Three coref systems: OrthoMatcher, pronominal, and Yaoyong Li’s ML toolkit


  21. Upcoming GATE 5.1 Enhancement
 • Leverage GATE 5.1 ability to run a PR which won’t cross section boundaries
   – Can make a PR only “see” one section at a time
   – Limits errors if you can partition a document into logical sections
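
Why section boundaries limit errors can be shown with a toy matcher that only considers antecedents from the same section. This illustrates the effect only; it is not the GATE 5.1 mechanism, and the section ids, the mention tuples, and the `tokens_subset` rule are hypothetical.

```python
def match_within_sections(mentions, is_match):
    """Link each mention to the most recent matching mention in the SAME section.

    mentions: list of (section_id, text) tuples in document order.
    Returns a list of (index, antecedent_index) pairs.
    """
    links = []
    for i, (section, text) in enumerate(mentions):
        for j in range(i - 1, -1, -1):
            prev_section, prev_text = mentions[j]
            if prev_section != section:
                continue                 # the matcher never "sees" other sections
            if is_match(text, prev_text):
                links.append((i, j))
                break
    return links


def tokens_subset(a, b):
    return set(a.lower().split()) <= set(b.lower().split())


# Two biographies on one page, each mentioning a different "Smith".
mentions = [
    ("bio-1", "John Smith"),
    ("bio-1", "Smith"),
    ("bio-2", "Smith"),
]
print(match_within_sections(mentions, tokens_subset))
```

Without the section check, the "Smith" in the second biography would be linked to the "Smith" in the first, which is exactly the kind of cross-section error the partitioning is meant to prevent.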


  22. Medium-term Engineering Goals
 • Introduce idea of named entity prioritization into OrthoMatcher
 • Can add and delete rules via config files rather than by editing coref source code
 • Provide standard API to all rules

