SLIDE 1

5/24/09

Cross-Language IR
CISC489/689-010, Lecture #23
Monday, May 11th
Ben Carterette


Cross-Language IR

  • User submits a query in one language, gets results in a different language
  • Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
  • Information may only be available in documents written in one of the languages
  • Highly useful to the intelligence community

SLIDE 2

Approaches to CLIR

  • Translate the documents into the users' language, and let the users submit queries in their own language
  • Translate the users' queries into the target language(s) and use the translated query for retrieval
  • Translate both queries and documents to an "intermediate" language

Automatic Translation

  • What are some approaches to automatic translation?
    – Language-to-language dictionaries
  • Languages do not translate precisely
    – One word with several meanings in one language might translate to several different words in the other
    – Many words with the same meaning might all translate to a single word
    – A word in one language might only be expressible as a phrase in another (or vice versa)
    – etc.

SLIDE 3

Example

  • English queries to retrieve Spanish documents
  • System works by translating the query to Spanish
  • Query: "bank fraud"
  • Translations of "bank":
    – Orilla (river bank)
    – Terraplen (bank of earth)
    – Banco (bank of clouds)
    – Bateria (bank of lights)
    – Banco (financial institution)
    – Banca (casino bank)
  • Translations of "fraud":
    – Impostor (fraudulent person)
    – Fraude (deception)
  • Possibly correct translation: fraude bancario
  • How would a dictionary-based system know which pair of translations to use?


Statistical Approach

  • Instead of trying to translate directly, apply statistical methods
  • Learn "translation probabilities" P(f | e) – the probability of translating string e in language E to string f in language F
  • E.g.:
    – P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …
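The ambiguity problem above can be made concrete. The sketch below is a dictionary-based translator that, lacking any way to choose between senses, assigns every word-for-word translation of the query equal probability; the dictionary entries are the hypothetical ones from the "bank fraud" example, not a real lexicon.

```python
# Minimal dictionary-based query translation with uniform probabilities.
# DICT is an invented toy bilingual dictionary for illustration only.
from itertools import product

DICT = {
    "bank":  ["orilla", "terraplen", "banco", "bateria", "banca"],
    "fraud": ["impostor", "fraude"],
}

def candidate_translations(query):
    """Enumerate every word-by-word translation of the query; with no
    disambiguation, each of the n combinations gets probability 1/n."""
    options = [DICT[w] for w in query.split()]
    n = 1
    for opts in options:
        n *= len(opts)
    for combo in product(*options):
        yield " ".join(combo), 1.0 / n

cands = dict(candidate_translations("bank fraud"))
# 5 x 2 = 10 candidates, each with probability 0.1 -- "orilla impostor"
# is rated exactly as likely as the correct "banco fraude"
```

This is precisely why a plain dictionary is not enough: the uniform model cannot prefer the correct pair over any other.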


SLIDE 4

Cross-Language Language Model

  • Recall the query-likelihood language model:

    P(Q|D) = \prod_{q \in Q} P(q|D) = \prod_{q \in Q} \left[ (1 - \alpha_D) \frac{tf_{q,D}}{|D|} + \alpha_D \frac{ctf_q}{|C|} \right]

  • Let's adapt this to cross-language retrieval using statistical translation:

    P(Q_f|D_e) = \prod_{q_f \in Q_f} P(q_f|D_e) = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f|t_e) P(t_e|D_e)
               = \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f|t_e) \left[ (1 - \alpha_{D_e}) \frac{tf_{t_e,D_e}}{|D_e|} + \alpha_{D_e} \frac{ctf_{t_e}}{|C_e|} \right]

Translation Model

  • What is P(qf | te)?
  • The translation model: the probability of translating word te in language E to word qf in language F
  • Where does it come from?
    – Maybe a dictionary approach: every possible translation of te has equal probability
    – e.g. P(orilla | bank) = P(banco | bank) = P(banca | bank) = …
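The adapted model can be sketched directly. Everything in this snippet is illustrative: the translation table `TRANS`, the smoothing weight, and the toy document and collection statistics are invented for the sketch, not taken from the lecture.

```python
# Toy sketch of the cross-language query-likelihood model:
# P(Qf|De) = prod over qf of sum over te of P(qf|te) * P(te|De),
# with P(te|De) smoothed between document and collection frequencies.

ALPHA = 0.5  # smoothing weight (alpha_De), chosen arbitrarily here

# P(qf | te): invented probabilities of translating English te to Spanish qf
TRANS = {
    ("banco", "bank"): 0.5, ("orilla", "bank"): 0.5,
    ("fraude", "fraud"): 1.0,
}

def p_term(tf_d, doc_len, ctf, coll_len):
    """Smoothed P(te | De): mix of document and collection frequency."""
    return (1 - ALPHA) * tf_d / doc_len + ALPHA * ctf / coll_len

def cross_lang_likelihood(query_f, doc_stats, coll_stats, coll_len, doc_len):
    """Score a Spanish query against an English document's term counts."""
    total = 1.0
    for qf in query_f:
        s = 0.0
        for (f, te), p_tr in TRANS.items():
            if f == qf:
                s += p_tr * p_term(doc_stats.get(te, 0), doc_len,
                                   coll_stats.get(te, 0), coll_len)
        total *= s
    return total

# Invented stats: doc has tf(bank)=2, tf(fraud)=1, |De|=100, |Ce|=1000
score = cross_lang_likelihood(["banco", "fraude"],
                              {"bank": 2, "fraud": 1},
                              {"bank": 10, "fraud": 5},
                              1000, 100)
```

Note that translation replaces the exact-match term lookup of monolingual query likelihood with a weighted sum over every English term that could have produced the Spanish query word.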


SLIDE 5

Statistical Translation Model

  • An alternative approach: parallel corpora


Statistical Translation with Parallel Corpora

  • Parallel corpora consist of documents in two or more languages that are known to be translations of one another
  • The parallel corpora are aligned: string e and string f are marked as translations of each other
  • We can use these alignments to estimate a translation model

SLIDE 6

Translation Model

  • To estimate P(qf | te), count the number of aligned string pairs (e, f) such that te is a word in e and qf is a word in f
  • Divide by the total number of strings in language e that contain te

    P(q_f|t_e) = \frac{|\{(e, f) : t_e \in e \text{ and } q_f \in f\}|}{|\{e : t_e \in e\}|}

Simple Alignment Example

  • English sentence: "The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution."
  • Spanish sentence: "El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria."
  • Every pair of words in these two sentences will have some translation probability
  • Over many sentences, the highest probabilities will be the pairs of words that are most closely related
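The counting estimate can be sketched in a few lines; the three-sentence "parallel corpus" below is invented for illustration, with each English string paired with its aligned Spanish translation.

```python
# Estimate P(qf | te) by counting over aligned sentence pairs, as in the
# formula above. The tiny corpus is invented; whole-sentence alignment
# only, no word-level alignment needed.

pairs = [
    ("the bank charged a fee", "el banco cobro una cuota"),
    ("fraud at the bank",      "fraude en el banco"),
    ("the river bank",         "la orilla del rio"),
]

def translation_prob(te, qf, aligned_pairs):
    """P(qf|te) = (# aligned pairs with te in e and qf in f)
                  / (# English strings containing te)."""
    containing = [(e, f) for e, f in aligned_pairs if te in e.split()]
    if not containing:
        return 0.0
    joint = [(e, f) for e, f in containing if qf in f.split()]
    return len(joint) / len(containing)

# "bank" co-occurs with "banco" in 2 of the 3 sentences containing it,
# and with "orilla" in 1 of 3
```

Over a large corpus, these ratios concentrate on genuinely related word pairs, exactly as the slide claims for the sentence pair above.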


SLIDE 7

Alignments

  • Alignments can be much more detailed

  [Images from Brown et al., "The Mathematics of Statistical Machine Translation"]


Parallel Corpora

  • Where do we get parallel corpora?
    – Find documents that we know to be translations
    – Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
    – European Union law in 22 languages
  • Anything that's not law-related?
    – Wikipedia articles in different languages… not necessarily translations, though

SLIDE 8

CLIR Experiments

  • A CLIR track ran at TREC from 1998 through 2002
  • Languages used include English, German, French, Italian, Chinese, and Arabic
  • Other issues in CLIR:
    – Segmentation, stemming, stopping, and phrases require different approaches in different languages
    – I am going to focus on the high-level problem


CLIR Experiments

  • In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
  • Documents: 383,872 news articles from Agence France Presse from 1994–2000
  • Information needs: 25 queries, descriptions, and narratives written in English by native Arabic speakers
    – Translated into Arabic and French as well
  • Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)

SLIDE 9

Example Topic

  <num> Number: AR26
  <title> ﺲﻠﺠﻣ ﺔﻣوﺎﻘﳌا ﻲﻨﻃﻮﻟا ﻲﻧﺎﺘﺳدﺮﻜﻟا
  <desc> Description:
  ﻒﻴﻛ ﺮﻈﻨﻳ ﺲﻠﺠﻣ ﺔﻣوﺎﻘﳌا ﺔﻴﻨﻃﻮﻟا ﻰﻟا لﻼﻘﺘﺳﻹا ﻞﻤﺘﶈا داﺮﻛﻼﻟ؟
  <narr> Narrative:
  عﻮﺿﻮﳌا ﻦﻤﻀﺘﻳ صﻮﺼﻧ ﺔﻘﻠﻌﺘﻣ تﺎﻛﺮﺤﺘﺑ ﺲﻠﺠﻣ ﺔﻣوﺎﻘﳌا ﺔﻴﻨﻃﻮﻟا ، تﻻﺎﻘﻣ ثﺪﺤﺘﺗ ﻦﻋ ةدﺎﻴﻗ نﻼﺟوا ﻦﻤﺿ دﻮﻬﺟ داﺮﻛﻻا لﻼﻘﺘﺳﻼﻟ .

  <num> Number: AR26
  <title> Kurdistan Independence
  <desc> Description:
  How does the National Council of Resistance relate to the potential independence of Kurdistan?
  <narr> Narrative:
  Articles reporting activities of the National Council of Resistance are considered on topic. Articles discussing Ocalan's leadership within the context of the Kurdish efforts toward independence are also considered on topic.


Example Document

SLIDE 10

Results

  • BBN, UMass, and IBM used statistical models
  • UMass performance on cross-language is roughly equal to performance on monolingual!

  [Plots: Monolingual (Arabic to Arabic) and Cross-lingual (English/French to Arabic), from Oard & Gey, "The TREC-2002 Arabic/English CLIR Track"]


Analysis

  • The translation model is imperfect
    – It assigns probabilities to almost every pair of words
    – There are many errors in translation
  • So how could cross-lingual be almost as good as monolingual?
  • Hypotheses:
    – The translation process disambiguates some terms
    – The translation process smooths query models

SLIDE 11

IR as Statistical Translation

  • What if we view IR as a translation process?
    – The user inputs a query in English, and the system does "cross-language" retrieval from user-English to system-English
    – This may account for users not using the right keywords in their queries
  • There is no natural translation model, so one must be simulated
  • Berger & Lafferty, SIGIR 1999


IR Translation Model

  • Generate a translation model by aligning simulated queries to relevant documents

SLIDE 12

Results

  • Translation models compared to tf-idf; LM coincides with Model 0
  • Conclusion: statistical translation works at least as well as tf-idf or LM


Translation for Multimedia Retrieval

  • English–Arabic CLIR works
  • English–English CLIR works
  • What about English–multimedia CLIR?
  • "Translate" an image into words to enable retrieval of images by text queries
  • Translation model: P(w | I) is the probability of "translating" image I to word w

SLIDE 13

Image Translation Model

  • Estimating P(w | I) requires two things:
    – A feature-based representation of the image
    – A set of words that "align" with the image
  • Use image segmentation and clustering to form a representation of images
  • Use image captions to align words to images


Image Representation: "Blobs"

  [From Jeon et al., "Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models"]


SLIDE 14

Cross-Media Relevance Model

  • Retrieval is by query likelihood P(Q | I):

    P(Q|I) = \prod_{q \in Q} P(q|I) \approx \prod_{q \in Q} P(q|b_1, \ldots, b_m) \propto \prod_{q \in Q} \sum_{J \in C} P(q|J) P(J) \prod_{i=1}^{m} P(b_i|J)

  • C is the collection of images, J is an image in C, and b1…bm are "blobs"
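A toy rendering of the model above: an unannotated query image is represented only by its blobs, and the sum over captioned collection images J supplies the missing word evidence. The two-image collection and all its probabilities are invented, and P(J) is taken as uniform.

```python
# Cross-media relevance model sketch:
# P(Q|I) ∝ prod_q sum_J P(q|J) P(J) prod_i P(b_i|J), with uniform P(J).
# Each collection image J carries caption-word and blob distributions
# (all values here are made up for illustration).

collection = [
    {"words": {"tiger": 0.5, "grass": 0.5}, "blobs": {1: 0.7, 2: 0.3}},
    {"words": {"sky": 0.6, "plane": 0.4},   "blobs": {3: 0.8, 2: 0.2}},
]

def cmrm_score(query, image_blobs, coll):
    """Score a text query against an image given only its blobs b1..bm."""
    p_J = 1.0 / len(coll)  # uniform prior over collection images
    total = 1.0
    for q in query:
        s = 0.0
        for J in coll:
            p = J["words"].get(q, 0.0) * p_J
            for b in image_blobs:
                p *= J["blobs"].get(b, 0.0)
            s += p
        total *= s
    return total

tiger_vs_tiger_blob = cmrm_score(["tiger"], [1], collection)
tiger_vs_sky_blob = cmrm_score(["tiger"], [3], collection)
```

An image whose blobs co-occur with "tiger" in the collection scores higher for the query "tiger" than one whose blobs only appear in sky/plane images, which is exactly the translation-via-alignment idea carried over to captions and blobs.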


Example Results

  [From Jeon et al., "Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models"]


SLIDE 15

Machine Translation

  • Machine translation (MT) is a problem in NLP/computational linguistics
  • The goal is to automatically translate text in one language to another
  • Different from CLIR with a query translation model in that the CLIR model does not require a "coherent" translation of the query
    – CLIR essentially uses every possible translation
  • Machine translation should provide a single "good" translation that is human-readable

Statistical MT

  • Though MT and CLIR are different problems, the statistical approaches are very similar
  • IBM developed several statistical models for MT
    – "A statistical approach to machine translation", Brown et al. 1990
    – CLIR models based on IBM's models

SLIDE 16

IBM Models

  • Basic idea: to translate a sentence f in language F to a sentence e in language E, estimate P(e | f) using Bayes' Rule
  • The "right" translation is the one with highest probability:

    P(e|f) = \frac{P(f|e) P(e)}{P(f)}

    \hat{e} = \arg\max_e P(f|e) P(e)
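The decision rule can be sketched in a few lines; since P(f) does not depend on e, it drops out of the argmax. The candidate set and all probabilities below are invented for illustration.

```python
# Noisy-channel decision rule: pick the target sentence e maximizing
# P(f|e) * P(e). P(f) is constant across candidates, so it is omitted.

def best_translation(f, candidates, p_f_given_e, p_e):
    """argmax over e of the (unnormalized) posterior P(f|e) * P(e)."""
    return max(candidates,
               key=lambda e: p_f_given_e.get((f, e), 0.0) * p_e.get(e, 0.0))

# Invented channel model P(f|e) and language-model prior P(e)
p_f_given_e = {("banco", "bank"): 0.6, ("banco", "bench"): 0.4}
p_e = {"bank": 0.01, "bench": 0.001}

best = best_translation("banco", ["bank", "bench"], p_f_given_e, p_e)
# "bank": 0.6 * 0.01 = 0.006 beats "bench": 0.4 * 0.001 = 0.0004
```

The split into channel model P(f|e) and prior P(e) is the point of the factorization: the prior lets a fluent target-language model veto translations that the channel model alone would accept.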

IBM Models

  • The key is estimating P(f | e)
  • Brown et al. presented five different models
    – Increasingly complicated; they require a lot of training data in the form of parallel aligned corpora
  • Google machine translation is based on alignment and the IBM models, but also on very large amounts of unaligned data


SLIDE 17

Google Machine Translation

  [Google's translation of the Spanish Wikipedia page for Spain (http://es.wikipedia.org/wiki/Espana)]