Co-Reference in GATE, Andrew Borthwick, Ph.D.



SLIDE 1

Co-Reference in GATE

Andrew Borthwick, Ph.D.
Principal Scientist, Intelius, Inc.
July 28, 2009


Intelius, Inc.


SLIDE 2

What is Co-reference?

  • Determine if word A refers to the same real-world entity as word B
  • In GATE, divided into two pieces
    – Orthographic: “OrthoMatcher”
      • Person, Organization, Location, and Date
    – Pronominal: Pronominal Coreferencer
      • He/him/his, she/her, it
      • Not “pleonastic it” (“it is raining”)


SLIDE 3

Why Co-Reference?

  • Necessary for information extraction
    – “He received a B.A. in computer science”
    – “Mr. Jones received a B.A. in computer science”
  • Word sense disambiguation and named entity resolution
    – ‘Jack London wrote “Call of the Wild”. London is a famous American author.’
    – Resolves “Unknown” annotations in ANNIE
  • Web person search
  • Field of academic study


SLIDE 4

Co-reference Engine Background

  • I have been maintaining the OrthoMatcher for the past year
  • OrthoMatcher much improved over previous version
    – Doesn’t assume that all identical strings refer to the same person. I.e. “David” can refer to two different entities in two different places
  • One small change to pronominal coref
    – Speed improved by factor of 50
  • Have concentrated on person matching
  • Have had to prioritize enhancements


SLIDE 5

Executing GATE Co-reference

  • ANNIE OrthoMatcher
    – Processing resource in ANNIE CREOLE plug-in
    – Last element, by default, in GATE pipeline
    – Requires output of NE Transducer and ANNIE tokenizer
  • ANNIE Pronominal Coreferencer
    – In ANNIE CREOLE plug-in
    – Not loaded by default in GATE
    – Must follow OrthoMatcher
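
The ordering constraints on this slide can be checked mechanically. The following is a minimal Python sketch, not GATE's Java API: the resource labels and the `PREREQUISITES` table are illustrative names chosen here, not GATE identifiers.

```python
# Hypothetical labels for the processing resources named on this slide.
PREREQUISITES = {
    "OrthoMatcher": {"Tokeniser", "NE Transducer"},
    "Pronominal Coreferencer": {"OrthoMatcher"},
}


def ordering_problems(pipeline):
    """Return (resource, missing prerequisite) pairs for a list of PR labels."""
    problems = []
    seen = set()
    for pr in pipeline:
        # Every prerequisite must already have appeared earlier in the list.
        for needed in sorted(PREREQUISITES.get(pr, ())):
            if needed not in seen:
                problems.append((pr, needed))
        seen.add(pr)
    return problems
```

An empty result means the order respects both constraints; each reported pair names a resource that runs before something it depends on.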


SLIDE 6

Results of Co-reference

  • Co-refs visible in GATE GUI
  • “Unknown” updated to appropriate named entity type, if possible
  • Co-references marked
    – In the “matches” feature on each named entity
    – In the MatchesAnnots document feature
  • Visible in the GATE GUI
  • Accessible to downstream PRs via “matches” feature and AnnotationSet.get()
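
A downstream consumer typically wants whole co-reference chains rather than per-annotation features. The Python sketch below uses plain dictionaries in place of GATE annotations and assumes that each "matches" feature lists the IDs of the annotations in the same chain; the function name and data layout are illustrative, not part of GATE.

```python
def chains_from_matches(annotations):
    """Group annotation IDs into co-reference chains.

    `annotations` maps an annotation ID to its feature dict. The "matches"
    feature is assumed to list the IDs that belong to the same chain.
    """
    chains = []
    assigned = set()
    for ann_id in sorted(annotations):
        if ann_id in assigned:
            continue
        # An annotation without a "matches" feature forms a chain of its own.
        members = set(annotations[ann_id].get("matches", [])) | {ann_id}
        chains.append(sorted(members))
        assigned |= members
    return chains
```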


SLIDE 7

ANNIE OrthoMatcher

  • Tries to find the most recent compatible match for each Annotation Type
    – For Annotation X of Type T, checks every annotation Y of Type T that occurs prior to X
  • Checks four “noMatch” rules. If a noMatch rule fires, X and Y are not a match
  • Checks c. 20 match rules of the form matchRule(Annotation1, Annotation2). If any matchRule returns “true”, then we match
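
The control flow described on this slide can be sketched in a few lines of Python. This is an illustration of the loop only, not OrthoMatcher's Java code: annotations are reduced to strings, and the rules are arbitrary callables supplied by the caller.

```python
def find_match(x, earlier, no_match_rules, match_rules):
    """Return the most recent earlier annotation compatible with x, or None.

    `earlier` holds same-type annotations in document order; each rule is a
    callable taking (x, y) and returning a bool.
    """
    for y in reversed(earlier):  # most recent candidate first
        if any(rule(x, y) for rule in no_match_rules):
            continue  # a noMatch rule fired: X and Y are not a match
        if any(rule(x, y) for rule in match_rules):
            return y  # any single match rule returning True is enough
    return None
```

Because the no-match rules are consulted first, they act as a veto: a pair that a match rule would accept is still rejected if any no-match rule fires.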


SLIDE 8

No-match Rules

  • Names are on the spurious match table
    – spur_match.lst
  • Incompatible middle names
    – John C. Smith != John Q. Smith
  • Incompatible genders
    – John Anderson != Mrs. Anderson
  • No forward reference of short name unless short name is unmatched
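
Two of these rules, incompatible middle names and incompatible genders, can be illustrated with a small Python sketch. The lookup tables and helper names are stand-ins invented for the example; how OrthoMatcher actually determines gender is not described on the slide.

```python
# Illustrative lookup tables; real gender information would come from elsewhere.
MALE_FIRST_NAMES = {"john"}
MALE_TITLES = {"mr."}
FEMALE_TITLES = {"mrs.", "ms.", "miss"}


def middle_initial(name):
    """Return the lowercased middle initial of a three-token name, else None."""
    tokens = name.split()
    return tokens[1][0].lower() if len(tokens) == 3 else None


def incompatible_middle_names(a, b):
    """Fire only when both names carry a middle initial and the initials differ."""
    ma, mb = middle_initial(a), middle_initial(b)
    return ma is not None and mb is not None and ma != mb


def gender(name):
    """Guess a gender from the first token (title or first name), else None."""
    first = name.split()[0].lower()
    if first in FEMALE_TITLES:
        return "female"
    if first in MALE_TITLES or first in MALE_FIRST_NAMES:
        return "male"
    return None


def incompatible_genders(a, b):
    """Fire only when both genders are known and they disagree."""
    ga, gb = gender(a), gender(b)
    return ga is not None and gb is not None and ga != gb
```

Both rules stay silent when information is missing, so "John Smith" is still free to match "John C. Smith".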


SLIDE 9

Forward Reference Rule

  • Mark the name annotation whenever a shortened person name is matched with the long form
    – Don’t allow subsequent long forms to match these short forms
    – Can match if short form is not yet matched
  • Not allowed:
    – “John Smith … John … John Robertson”
      • “John Robertson” can’t coref with “John”
  • Allowed:
    – “John … John Robertson”
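
The two examples on this slide can be reproduced with a toy resolver. The Python sketch below treats a name as a set of tokens and a short form as a strict subset of a long form; that simplification, and the `locked` bookkeeping, are assumptions made for illustration rather than a description of the actual implementation.

```python
def resolve(mentions):
    """Assign a chain ID to each person mention, in document order."""
    chain_of = []   # chain ID per mention; a chain is named by its first index
    locked = set()  # indexes of short forms already tied to a long form
    for i, name in enumerate(mentions):
        tokens = set(name.split())
        chain = i  # start a new chain unless a match is found
        for j in range(i - 1, -1, -1):  # most recent candidate first
            other = set(mentions[j].split())
            if tokens <= other:
                # Same name, or a short form referring back to a long form.
                chain = chain_of[j]
                if tokens < other or j in locked:
                    locked.add(i)
                break
            if other < tokens and j not in locked:
                # Long form after a short form that is not yet matched.
                chain = chain_of[j]
                locked.add(j)
                break
        chain_of.append(chain)
    return chain_of
```

For "John Smith … John … John Robertson" the middle "John" is locked to "John Smith", so "John Robertson" starts a new chain; for "John … John Robertson" the short form is still free and the two are linked.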



SLIDE 10

Match Rules

  • Exact match on name, match on nickname
    – Runs off a nickname list
    – Note that Christopher = Chris, Christine = Chris, but Christopher != Christine
  • All tokens of name A are found in name B
    – John = John Smith, Smith = John Smith
  • All tokens of B in name A other than corporate designators and punctuation
    – ACME, Inc. = ACME
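
The three rule families above can be sketched as small predicates. This is a Python illustration with made-up word lists and function names, not the matchRule methods themselves.

```python
import string

# Illustrative entries only; the real lists are loaded from configuration.
NICKNAMES = {("christopher", "chris"), ("christine", "chris")}
CORPORATE_DESIGNATORS = {"inc", "corp", "ltd"}


def tokens(name):
    """Lowercase tokens with surrounding punctuation stripped."""
    stripped = (t.strip(string.punctuation).lower() for t in name.split())
    return [t for t in stripped if t]


def nickname_match(a, b):
    """Exact match, or a direct entry in the nickname list (not transitive)."""
    a, b = a.lower(), b.lower()
    return a == b or (a, b) in NICKNAMES or (b, a) in NICKNAMES


def token_subset_match(a, b):
    """All tokens of name A are found in name B."""
    return set(tokens(a)) <= set(tokens(b))


def org_match(a, b):
    """Names agree once corporate designators and punctuation are ignored."""
    def core(name):
        return [t for t in tokens(name) if t not in CORPORATE_DESIGNATORS]
    return core(a) == core(b)
```

Looking up pairs directly, instead of merging names into equivalence classes, is what keeps Christopher and Christine apart even though both match Chris.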


SLIDE 11

OrthoMatcher Configuration

  • Word lists are defined in “definitionFileURL”
    – Format is list_file_url:list_label
  • These are the main things to configure
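
A reader for the list_file_url:list_label format might look like the Python sketch below. It assumes one entry per line and splits on the last colon so that a URL containing a scheme still parses; the function name is hypothetical, and the file names in the usage note are examples (only spur_match.lst appears in these slides).

```python
def parse_definition(text):
    """Map each list label to its list file URL."""
    lists = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the LAST colon: the URL part may itself contain colons.
        url, sep, label = line.rpartition(":")
        if not sep or not url or not label:
            raise ValueError(f"expected list_file_url:list_label, got {line!r}")
        lists[label] = url
    return lists
```

For example, `parse_definition("spur_match.lst:spur_match\ncdg.lst:cdg")` yields a dictionary keyed by the labels `spur_match` and `cdg`.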




SLIDE 12

Key Word Lists

  • alias: Aliases name A for name B
    – These are automatically matched
  • spur_match: Name A != name B
    – Automatically non-matched
  • Miscellaneous word lists
    – cdg: Corporate designators
    – connector: Connecting words
    – prepos: Prepositions
      • Dept. of Defense = Defense Dept.
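
One plausible reading of the "Dept. of Defense = Defense Dept." example is that prepositions and connecting words are dropped before the remaining tokens are compared regardless of order. The slide does not spell out the mechanism, so the Python sketch below is an assumption, with illustrative word lists and function names.

```python
import string

# Illustrative entries standing in for the prepos and connector lists.
PREPOSITIONS = {"of", "for", "in"}
CONNECTORS = {"and", "&"}


def content_tokens(name):
    """Sorted lowercase tokens, minus punctuation, prepositions and connectors."""
    toks = (t.strip(string.punctuation).lower() for t in name.split())
    skip = PREPOSITIONS | CONNECTORS
    return sorted(t for t in toks if t and t not in skip)


def same_ignoring_function_words(a, b):
    """Compare two names by content tokens only, ignoring word order."""
    return content_tokens(a) == content_tokens(b)
```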


SLIDE 13

Word Lists: Nicknames

  • Defined in nickname word list parameter
  • Formatted as
    – Name 1
    – Name 2
    – Substitution likelihood (non-scientific intuition)
    – Male/female variant (not used in OrthoMatcher)
  • Name 1 and Name 2 are interchangeable for OrthoMatcher
  • minimumNicknameLikelihood parameter defines minimum substitution likelihood
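
The effect of minimumNicknameLikelihood can be shown with already-parsed records. In this Python sketch the four-field tuples mirror the fields listed above, but the 0 to 1 likelihood scale, the inclusive threshold, and the sample entries are all assumptions; the on-disk delimiter is not shown on the slide and is deliberately left out.

```python
def load_nicknames(records, minimum_likelihood):
    """Build a symmetric lookup from (name1, name2, likelihood, gender) records."""
    pairs = set()
    for name1, name2, likelihood, _gender in records:  # gender field is unused
        if likelihood >= minimum_likelihood:
            # A frozenset makes the pair order-independent.
            pairs.add(frozenset((name1.lower(), name2.lower())))
    return pairs


def interchangeable(a, b, pairs):
    """True for identical names or for a pair that passed the threshold."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset((a, b)) in pairs
```

Raising the threshold trades recall for precision: low-likelihood substitutions simply never enter the lookup.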




SLIDE 14

Other Key Parameters

  • highPrecisionOrgs
    – Use very safe features for matching Orgs
      • ACME, Inc. = ACME is okay
      • Kalamazoo Financial Corporation = Kalamazoo is not
  • extLists
    – Defaults to true, false not tested by me
    – If false, tries to derive corporate designator from document
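
The contrast between the two organization examples can be expressed as a flag on a matching function. This Python sketch assumes that the looser mode falls back to a token-subset test, which the slide implies but does not state; the designator list and the function names are illustrative.

```python
import string

CORPORATE_DESIGNATORS = {"inc", "corp", "corporation", "ltd"}  # illustrative


def org_tokens(name):
    """Lowercase tokens with surrounding punctuation stripped."""
    stripped = (t.strip(string.punctuation).lower() for t in name.split())
    return [t for t in stripped if t]


def orgs_match(a, b, high_precision=True):
    """Match two organization names, strictly or loosely."""
    ta, tb = org_tokens(a), org_tokens(b)
    core_a = [t for t in ta if t not in CORPORATE_DESIGNATORS]
    core_b = [t for t in tb if t not in CORPORATE_DESIGNATORS]
    if core_a == core_b:
        return True  # differ only by corporate designators: always safe
    if high_precision:
        return False
    # Assumed looser behaviour: one name's tokens contained in the other's.
    return set(ta) <= set(tb) or set(tb) <= set(ta)
```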


SLIDE 15

Customizing OrthoMatcher

  • Modify parameters/lists (easy)
  • Add or subtract rules (moderately hard)
    – Code a new rule
      • Use matchRule12Name() as a template
    – Add to apply_rules_namematch() in OrthoMatcher.java
    – Recompile GATE
      • From GATE home directory, do bin/ant clean jar
  • Change core logic (hard)


SLIDE 16

Pronominal Coreferencer

  • In ANNIE, but not default ANNIE
    – Run after OrthoMatcher
  • Three submodules
    – Quoted speech identification
    – “Pleonastic ‘it’” identification
      • “It is raining”
    – Pronominal coreference
  • Only two parameters
    – Inanimated entity types (what you match to ‘it’)
    – resolveIt: Try to match ‘it’?
      • False by default


SLIDE 17

Core Logic

  • Two JAPE phases
    – Identify quoted speech
    – Identify pleonastic it
  • Match pronouns to antecedents
    – Match “I”, “me”, “my” inside quoted speech to names outside quoted speech
    – Other than this, pretty much match pronouns with last referent
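
The "last referent" heuristic for third-person pronouns can be sketched as a backwards scan. The gender-agreement check in this Python example is an added assumption (the slide only says "last referent"), and the quoted-speech case is left out; all names are illustrative.

```python
PRONOUN_GENDER = {
    "he": "male", "him": "male", "his": "male",
    "she": "female", "her": "female",
}


def resolve_pronoun(pronoun, antecedents):
    """Return the most recent antecedent that agrees with the pronoun.

    `antecedents` lists (name, gender) pairs that occur before the pronoun,
    in document order. Pronouns outside PRONOUN_GENDER are left unresolved.
    """
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    if wanted is None:
        return None
    for name, gender in reversed(antecedents):  # nearest candidate first
        if gender == wanted:
            return name
    return None
```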


SLIDE 18

Pronominal Coref Assessment

  • Not very good with resolving “it”
  • Works reasonably well on personal pronouns
    – Difficult matching cases are relatively rare
    – Usual cause of error is that we fail to tag an antecedent as a person
      • When this happens, can match an antecedent fairly far away


SLIDE 19

OrthoMatcher Assessment

  • Errors are also relatively rare
  • On Intelius data, errors often involve family members in appositive phrases
    – “James Madison, Jr., the son of the wealthy Virginia planter, James Madison, Sr., was the fourth president of the United States. Madison . . .”


SLIDE 20

Engineering Shortcomings

  • Difficult to add a new rule. Have to modify source code and add it into the list of firing rules
  • No way to rank candidate antecedents
    – Can only say yes/no to a pair. Can’t say “maybe, unless there’s something better”
    – Would like to rank sentence head words higher
  • Three coref systems: OrthoMatcher, pronominal, and Yaoyong Li’s ML toolkit


SLIDE 21

Upcoming GATE 5.1 Enhancement

  • Leverage GATE 5.1 ability to run a PR which won’t cross section boundaries
    – Can make a PR only “see” one section at a time
    – Limits errors if you can partition a document into logical sections


SLIDE 22

Medium-term Engineering Goals

  • Introduce idea of named entity prioritization into OrthoMatcher
  • Can add and delete rules via config files rather than by editing coref source code
  • Provide standard API to all rules


SLIDE 23

Longer-term Engineering Goals

  • Allow integrated calls to JAPE transducer
  • Run pronominal and orthographic co-reference engines from same code base with different configurations
  • Provide ML integration, probably via GATE’s ML toolkit
  • New API for recording co-reference
  • Revisit GATE GUI

