1. 5/24/09

   Cross-Language IR
   CISC489/689-010, Lecture #23
   Monday, May 11th
   Ben Carterette

   Cross-Language IR
   • User submits a query in one language, gets results in a different language
   • Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
   • Information may only be available in documents written in one of the languages
   • Highly useful to the intelligence community


2. Approaches to CLIR
   • Translate the documents into the users' language, and let the users submit queries in their own language
   • Translate the users' queries into the target language(s) and use the translated query for retrieval
   • Translate both queries and documents to an "intermediate" language

   Automatic Translation
   • What are some approaches to automatic translation?
     – Language-to-language dictionaries
   • Languages do not translate precisely
     – One word with several meanings in one language might translate to several different words in the other
     – Many words with the same meaning might all translate to a single word
     – A word in one language might only be expressible as a phrase in another (or vice versa)
     – etc.


3. Example
   • English queries to retrieve Spanish documents
   • System works by translating the query to Spanish
   • Query: "bank fraud"
   • Translations of "bank":
     – Orilla (river bank)
     – Terraplen (bank of earth)
     – Banco (bank of clouds)
     – Bateria (bank of lights)
     – Banco (financial institution)
     – Banca (casino bank)
   • Translations of "fraud":
     – Impostor (fraudulent person)
     – Fraude (deception)
   • How would a dictionary-based system know which pair of translations to use?
   • Possibly correct translation: Fraude bancario

   Statistical Approach
   • Instead of trying to translate directly, apply statistical methods
   • Learn "translation probabilities" P(f | e) – the probability of translating string e in language E to string f in language F
   • E.g.:
     – P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …
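To make the ambiguity concrete, here is a minimal sketch of dictionary-based query translation that enumerates every candidate translation of a query and scores it with independent per-word translation probabilities. The tiny dictionary, the probability values, and the `candidate_translations` helper are all hypothetical, invented for illustration:

```python
# Hypothetical sketch: enumerate candidate translations of a query and
# score each as a product of per-word probabilities, P(f | e) = prod P(f_i | e_i).
# The dictionary entries and probabilities below are made up for illustration.
from itertools import product

trans_probs = {
    "bank":  {"orilla": 0.2, "terraplen": 0.1, "banco": 0.5, "banca": 0.2},
    "fraud": {"impostor": 0.3, "fraude": 0.7},
}

def candidate_translations(query):
    """Yield (translated_query, probability) for every word-for-word combination."""
    words = query.split()
    options = [trans_probs[w].items() for w in words]
    for combo in product(*options):
        translation = " ".join(f for f, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield translation, prob

# The highest-scoring candidate under these toy probabilities:
best = max(candidate_translations("bank fraud"), key=lambda t: t[1])
print(best)  # ('banco fraude', 0.35)
```

A plain dictionary has no basis for preferring one combination, which is exactly the gap the statistical approach fills: the probabilities let the system rank `banco fraude` above `orilla impostor`.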


4. Cross-Language Language Model
   • Recall the query-likelihood language model:

     P(Q \mid D) = \prod_{q \in Q} P(q \mid D)
                 = \prod_{q \in Q} \left( (1 - \alpha_D) \frac{tf_{q,D}}{|D|} + \alpha_D \frac{ctf_q}{|C|} \right)

   • Let's adapt this to cross-language retrieval using statistical translation:

     P(Q_f \mid D_e) = \prod_{q_f \in Q_f} P(q_f \mid D_e)
                     = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) P(t_e \mid D_e)
                     = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \left( (1 - \alpha_{D_e}) \frac{tf_{t_e,D_e}}{|D_e|} + \alpha_{D_e} \frac{ctf_{t_e}}{|C_e|} \right)

   Translation Model
   • What is P(q_f | t_e)?
   • The translation model: the probability of translating word t_e in language E to word q_f in language F
   • Where does it come from?
     – Maybe a dictionary approach: every possible translation of t_e has equal probability
     – e.g. P(orilla | bank) = P(banco | bank) = P(banca | bank) = …
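The adapted model above can be sketched directly in code. This is a toy illustration only, assuming word-tokenized documents and a hand-made translation table; `clir_score` and all of the data are hypothetical, not from the lecture:

```python
# Minimal sketch of the cross-language query-likelihood score:
# P(Q_f | D_e) = prod over q_f of sum over t_e of
#   P(q_f | t_e) * [ (1 - alpha) * tf(t_e, D)/|D| + alpha * ctf(t_e)/|C| ]
from collections import Counter

def clir_score(query_f, doc_e, collection_e, trans, alpha=0.5):
    """Likelihood of a foreign-language query given an English document."""
    doc_tf = Counter(doc_e)
    coll_tf = Counter(w for d in collection_e for w in d)
    coll_len = sum(coll_tf.values())
    score = 1.0
    for q_f in query_f:
        term = 0.0
        for t_e, p_trans in trans.get(q_f, {}).items():
            # Smoothed document language model P(t_e | D_e)
            p_doc = ((1 - alpha) * doc_tf[t_e] / len(doc_e)
                     + alpha * coll_tf[t_e] / coll_len)
            term += p_trans * p_doc
        score *= term
    return score

# Toy example: a Spanish query scored against two English "documents".
trans = {"fraude": {"fraud": 1.0}, "banco": {"bank": 0.5, "bench": 0.5}}
docs = [["bank", "fraud", "case"], ["park", "bench", "repair"]]
scores = [clir_score(["banco", "fraude"], d, docs, trans) for d in docs]
print(scores)  # the bank-fraud document scores higher
```

Note how the sum over t_e hedges against translation ambiguity: "banco" contributes through both "bank" and "bench", weighted by their translation probabilities.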


5. Statistical Translation Model
   • An alternative approach: parallel corpora

   Statistical Translation with Parallel Corpora
   • Parallel corpora consist of documents in two or more languages that are known to be translations of one another
   • The parallel corpora are aligned: string e and string f are marked as translations of each other
   • We can use these alignments to estimate a translation model


6. Translation Model
   • To estimate P(q_f | t_e), count the number of aligned string pairs (e, f) such that t_e is a word in e and q_f is a word in f
   • Divide by the total number of strings in language e that contain t_e:

     P(q_f \mid t_e) = \frac{|\{(e, f) : t_e \in e \text{ and } q_f \in f\}|}{|\{e : t_e \in e\}|}

   Simple Alignment Example
   • English sentence: "The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution."
   • Spanish sentence: "El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria."
   • Every pair of words in these two sentences will have some translation probability
   • Over many sentences, the highest probabilities will be for the pairs of words that are most closely related
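The counting estimator above can be sketched as follows. The aligned pairs and the `translation_prob` helper are hypothetical toy data for illustration; real systems estimate over many thousands of aligned sentence pairs:

```python
# Sketch of the count-based estimator on a tiny hand-made "parallel corpus"
# of aligned (English, Spanish) sentence pairs:
#   P(q_f | t_e) = |{(e, f) : t_e in e and q_f in f}| / |{e : t_e in e}|

def translation_prob(q_f, t_e, aligned_pairs):
    """Fraction of aligned pairs containing t_e whose translation contains q_f."""
    containing = [(e, f) for e, f in aligned_pairs if t_e in e]
    if not containing:
        return 0.0
    co_occurring = [(e, f) for e, f in containing if q_f in f]
    return len(co_occurring) / len(containing)

# Toy aligned corpus (word lists stand in for aligned sentences).
pairs = [
    (["the", "bank", "fraud"], ["el", "fraude", "bancario"]),
    (["the", "river", "bank"], ["la", "orilla", "del", "rio"]),
    (["bank", "of", "clouds"], ["banco", "de", "nubes"]),
]

print(translation_prob("fraude", "bank", pairs))  # 1 of 3 pairs -> 0.333...
print(translation_prob("nubes", "clouds", pairs))  # 1 of 1 pairs -> 1.0
```

With only three pairs, "bank" co-occurs equally with "fraude", "orilla", and "banco"; the slide's point is that over many sentences the true translations accumulate much higher counts than incidental co-occurrences.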


7. Alignments
   • Alignments can be much more detailed
   [Images from Brown et al., "The Mathematics of Statistical Machine Translation"]

   Parallel Corpora
   • Where do we get parallel corpora?
     – Find documents that we know to be translations
     – Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
     – European Union law in 22 languages
   • Anything that's not law-related?
     – Wikipedia articles in different languages… not necessarily translations, though


8. CLIR Experiments
   • The CLIR track ran at TREC from 1998 through 2002
   • Languages used include English, German, French, Italian, Chinese, and Arabic
   • Other issues in CLIR:
     – Segmentation, stemming, stopping, and phrases require different approaches in different languages
     – I am going to focus on the high-level problem

   CLIR Experiments
   • In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
   • Documents: 383,872 news articles from Agence France-Presse from 1994–2000
   • Information needs: 25 queries, descriptions, and narratives in English by native Arabic speakers
     – Translated into Arabic and French as well
   • Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)


9. Example Topic
   [The slide shows the topic side by side in Arabic and English; the Arabic text did not survive extraction intact.]

   <num> Number: AR26
   <title> Kurdistan Independence
   <desc> Description: How does the National Council of Resistance relate to the potential independence of Kurdistan?
   <narr> Narrative: Articles reporting activities of the National Council of Resistance are considered on topic. Articles discussing Ocalan's leadership within the context of the Kurdish efforts toward independence are also considered on topic.

   Example Document


10. Results
   • Cross-lingual (English/French to Arabic) vs. monolingual (Arabic to Arabic)
   • BBN, UMass, and IBM used statistical models
   • UMass performance on cross-language is roughly equal to performance on monolingual!
   [Plots from Oard & Gey, "The TREC-2002 Arabic/English CLIR Track"]

   Analysis
   • The translation model is imperfect
     – It assigns probabilities to almost every pair of words
     – There are many errors in translation
   • So how could cross-lingual be almost as good as monolingual?
   • Hypotheses:
     – The translation process disambiguates some terms
     – The translation process smooths query models

