Data Selection with Fewer Words
Amittai Axelrod (University of Maryland & Johns Hopkins)
Philip Resnik (University of Maryland)
Xiaodong He (Microsoft Research)
Mari Ostendorf (University of Washington)
Domain*
Data Selection with Fewer Words Amittai Axelrod WMT 2015 2
Here we use “domain” to mean “corpus”.
but improve the input!
some text by its perplexity:
\[ \mathrm{ppl}_{LM_Q}(s) = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log LM_Q(w_i \mid h_i)} = 2^{H_{LM_Q}(s)} \]
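As a minimal sketch of the perplexity definition above, the following assumes the language model has already produced a base-2 log probability for each token given its history. The function name `perplexity` and the toy probabilities are illustrative assumptions, not part of the talk.

```python
import math

def perplexity(log2_probs):
    """Perplexity of one sentence from per-token log2 probabilities.

    log2_probs[i] stands for log2 LM_Q(w_i | h_i); how those values are
    obtained (which LM toolkit, which smoothing) is left open here.
    """
    n = len(log2_probs)
    if n == 0:
        raise ValueError("cannot score an empty sentence")
    cross_entropy = -sum(log2_probs) / n  # H_{LM_Q}(s), in bits per token
    return 2.0 ** cross_entropy

# Hypothetical sentence of four tokens, each assigned probability 1/8.
token_log2_probs = [math.log2(1 / 8)] * 4
ppl = perplexity(token_log2_probs)
print(ppl)
```

A lower perplexity means the model finds the sentence less surprising, which is why the exponent is the per-token cross-entropy.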
by perplexity with in-domain LM.
not match the target task.
(Also called "XEDiff" or "Moore-Lewis")
\[ \operatorname*{argmin}_{s \in POOL} \; H_{LM_{IN}}(s) - H_{LM_{POOL}}(s) \]
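The cross-entropy difference criterion can be sketched as follows. This is a toy illustration only: it swaps in add-one-smoothed unigram models for the n-gram LMs a real system would use, and the helper names (`train_unigram`, `cross_entropy`, `rank_by_xediff`) and example sentences are made up for the sketch.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count tokens; returns (counts, total) as a stand-in for a real LM."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())

def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    toks = sentence.split()
    if not toks:
        raise ValueError("cannot score an empty sentence")
    logp = sum(math.log2((counts[t] + 1) / (total + vocab_size)) for t in toks)
    return -logp / len(toks)

def rank_by_xediff(in_domain, pool):
    """Sort pool sentences by H_IN(s) - H_POOL(s), lowest (best) first."""
    lm_in = train_unigram(in_domain)
    lm_pool = train_unigram(pool)
    # Shared vocabulary size; +1 reserves probability mass for unseen tokens.
    v = len(set(lm_in[0]) | set(lm_pool[0])) + 1
    scored = [(cross_entropy(s, lm_in, v) - cross_entropy(s, lm_pool, v), s)
              for s in pool]
    return sorted(scored)

in_domain = ["the patient took the medicine", "the doctor saw the patient"]
pool = [
    "the doctor gave the patient medicine",
    "stocks fell sharply on monday",
    "the market closed lower",
]
ranked = rank_by_xediff(in_domain, pool)
for score, sentence in ranked:
    print(f"{score:+.3f}  {sentence}")
```

Sentences that look like the in-domain data and unlike the pool as a whole get the most negative scores, so selection takes a prefix of this ranking.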
with bilingual data, and apply to SMT:
training data (1%-20%) yields translation systems that are smaller, cheaper, faster, and (often) better.
and 3 in the in-domain one, is it really 50% more likely?
(Good-Turing smoothing, singleton pruning...)
syntactic structure = proxy for style.
authorship, genre, etc.
to data selection: Replace rare words with POS tags
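Replacing rare words with POS tags might look like the sketch below. It assumes the corpus has already been POS-tagged into (word, tag) pairs, and it uses a single frequency cutoff, `min_count`, as a hypothetical stand-in for whatever rarity criterion is actually applied; the function names and the example sentences are likewise illustrative.

```python
from collections import Counter

def build_keep_vocab(tagged_corpus, min_count):
    """Words seen at least min_count times are kept as words."""
    counts = Counter(word for sent in tagged_corpus for word, _ in sent)
    return {w for w, c in counts.items() if c >= min_count}

def reduce_sentence(tagged_sentence, keep_vocab):
    """Keep frequent words; replace every other word with its POS tag."""
    return [word if word in keep_vocab else tag
            for word, tag in tagged_sentence]

# Toy pre-tagged corpus (tags follow the Penn Treebank style).
tagged_corpus = [
    [("the", "DT"), ("patient", "NN"), ("took", "VBD"), ("aspirin", "NN")],
    [("the", "DT"), ("patient", "NN"), ("refused", "VBD"), ("ibuprofen", "NN")],
]
keep = build_keep_vocab(tagged_corpus, min_count=2)
reduced = [reduce_sentence(s, keep) for s in tagged_corpus]
for sent in reduced:
    print(" ".join(sent))
```

Both toy sentences collapse to the same pattern, "the patient VBD NN", which illustrates how the reduced representation pools statistics across rare words that fill the same slot.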
scores, and re-rank the original general corpus.
but with different corpus representation.
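One way to picture this step: scores are computed on the reduced representation, but the sentences that get selected are the original ones. The sketch below assumes the original and reduced corpora are kept line-aligned and that some scoring function is supplied; `select_by_reduced_scores`, `keep_fraction`, and the toy score values are all hypothetical.

```python
def select_by_reduced_scores(original, reduced, score_fn, keep_fraction):
    """Rank original sentences by the score of their reduced counterparts.

    original and reduced must be line-aligned; lower scores rank first,
    matching the argmin in the cross-entropy difference criterion.
    """
    if len(original) != len(reduced):
        raise ValueError("original and reduced corpora must be line-aligned")
    order = sorted(range(len(original)), key=lambda i: score_fn(reduced[i]))
    n_keep = max(1, int(len(original) * keep_fraction))
    return [original[i] for i in order[:n_keep]]

original = [
    "the patient took aspirin",
    "stocks fell on monday",
    "the doctor saw the patient",
]
reduced = [
    "the patient VBD NN",
    "NNS VBD on NNP",
    "the NN VBD the patient",
]
# Made-up scores standing in for cross-entropy differences.
toy_scores = {reduced[0]: -0.4, reduced[1]: 0.9, reduced[2]: -0.1}
selected = select_by_reduced_scores(original, reduced, toy_scores.get,
                                    keep_fraction=0.67)
print(selected)
```

Sorting indices rather than sentences keeps the two views of the corpus in step, so the selected subset can be passed on with its full vocabulary intact.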
data (6m sentence pairs).
Eliminate 97% of the vocabulary
Lang  Vocab    Kept    %
En    470,154  10,036  2.1%
Zh    729,283  11,440  1.5%
despite (much) smaller selection vocab!
but vocabulary statistics are easy.
generalizing to other unseen words.
picking up words that fill similar roles/patterns in the sentence?
regular data selection.
to ~10k words with robust statistics
vocabulary size "V"
Questions?