

SLIDE 1

Data Selection with Fewer Words

Amittai Axelrod, University of Maryland & Johns Hopkins
Philip Resnik, University of Maryland
Xiaodong He, Microsoft Research
Mari Ostendorf, University of Washington

SLIDE 2

Data Selection with Fewer Words · Amittai Axelrod · WMT 2015

Domain* Adaptation

  • * Defined by construction.
  • Ideally based on some notion of textual similarity:
    • Lexical choice
    • Grammar
    • Topic
    • Style
    • Genre
    • Register
    • Intent
  • Domain = a particular contextual setting.


Here we use “domain” to mean “corpus”.

SLIDE 3

Domain Adaptation

  • Training data doesn’t always match the desired task.
  • Have bilingual:
    • Parliament proceedings
    • Newspaper articles
    • Web scrapings
  • Want to translate:
    • Travel scenarios
    • Facebook updates
    • Real-time conversations
  • Sometimes you want a specific kind of language, not just breadth!

SLIDE 4

Data Selection

  • “Filter Big Data down to Relevant Data.”
  • Use your regular pipeline, but improve the input!
  • Not all sentences are equally valuable...
SLIDE 5

Data Selection

  • For a particular translation task:
    • Identify the most relevant training data.
    • Build a model on only this subset.
  • Goal:
    • Better task-specific performance
    • Cheaper (computation, size, time)
SLIDE 6

Data Selection Algorithm

  • Quantify the domain.
  • Compute the similarity of each sentence in the pool to the in-domain corpus.
  • Sort pool sentences by score.
  • Select the top n%.
SLIDE 7

Data Selection Algorithm

  • Quantify the domain.
  • Compute the similarity of each sentence in the pool to the in-domain corpus.
  • Sort pool sentences by score.
  • Select the top n%.
  • Use the top n% to build a task-specific MT system.
  • Combine with a system trained on in-domain data (optional).
  • Apply the task-specific system to the task.
SLIDE 8

Perplexity-Based Filtering

  • A language model LM_Q measures the likelihood of some text s by its perplexity:

$$\mathrm{ppl}_{LM_Q}(s) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 LM_Q(w_i \mid h_i)} = 2^{H_{LM_Q}(s)}$$

  • Intuition: the average branching factor of the LM.
  • The cross-entropy H (of a text w.r.t. an LM) is log2(ppl): $H_{LM_Q}(s) = \log_2 \mathrm{ppl}_{LM_Q}(s)$.
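As a concrete illustration, here is a minimal sketch of perplexity scoring with a toy unigram LM using add-alpha smoothing (the talk uses real n-gram LMs; the function names and smoothing choice here are illustrative, not the talk's actual setup):

```python
import math
from collections import Counter

def make_unigram_lm(corpus, alpha=0.1):
    """Toy unigram LM with add-alpha smoothing; a stand-in for LM_Q."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of mass for unseen words
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

def cross_entropy(lm, sentence):
    """H_LM(s): average negative log2-probability per word."""
    return -sum(math.log2(lm(w)) for w in sentence) / len(sentence)

def perplexity(lm, sentence):
    """ppl_LM(s) = 2^H_LM(s): lower means the LM finds s more likely."""
    return 2.0 ** cross_entropy(lm, sentence)
```

Text that resembles the LM's training data gets low perplexity; unfamiliar text gets high perplexity, which is what the filtering in the next slides exploits.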

SLIDE 9

Cross-Entropy Difference

  • Perplexity-based filtering:
    • Score and sort sentences in the pool by perplexity under the in-domain LM.
    • Then rank, select, etc.
  • However! By construction, the data pool does not match the target task.

SLIDE 10

Cross-Entropy Difference

  • Score and rank by cross-entropy difference:

$$\operatorname*{argmin}_{s \in POOL} \; H_{LM_{IN}}(s) - H_{LM_{POOL}}(s)$$

  (Also called "XEDiff" or "Moore-Lewis".)

  • Prefer sentences that both:
    • are like the target task, and
    • are unlike the pool average.
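A self-contained sketch of cross-entropy-difference (Moore-Lewis) selection, using toy unigram LMs for brevity (real implementations use higher-order n-gram LMs; function names here are illustrative):

```python
import math
from collections import Counter

def make_lm(corpus, alpha=0.1):
    """Toy unigram LM with add-alpha smoothing."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

def xent(lm, sent):
    """H_LM(s) in bits per word."""
    return -sum(math.log2(lm(w)) for w in sent) / len(sent)

def moore_lewis_select(in_domain, pool, fraction):
    """Rank pool sentences by H_IN(s) - H_POOL(s) (lower = more
    in-domain-like and less pool-typical) and keep the top fraction."""
    lm_in, lm_pool = make_lm(in_domain), make_lm(pool)
    ranked = sorted(pool, key=lambda s: xent(lm_in, s) - xent(lm_pool, s))
    return ranked[: max(1, int(len(ranked) * fraction))]
```

Subtracting the pool cross-entropy is what distinguishes this from plain perplexity filtering: sentences that are merely generic (likely under any LM) no longer score well.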

SLIDE 11

Bilingual Cross-Entropy Diff.

  • Extend the Moore-Lewis similarity score for use with bilingual data, and apply it to SMT:

$$\big(H_{L1}(s_1, LM_{IN}) - H_{L1}(s_1, LM_{POOL})\big) + \big(H_{L2}(s_2, LM_{IN}) - H_{L2}(s_2, LM_{POOL})\big)$$

  • Training on only the most relevant subset of the training data (1%-20%) yields translation systems that are smaller, cheaper, faster, and (often) better.
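The bilingual score is simply the sum of the source-side and target-side Moore-Lewis scores. Assuming the four per-sentence cross-entropies have already been computed by per-language LMs (the function and argument names below are hypothetical), the combination is:

```python
def bilingual_xediff(h_in_l1, h_pool_l1, h_in_l2, h_pool_l2):
    """Bilingual cross-entropy difference for a sentence pair (s1, s2):
    (H_L1(s1, LM_IN) - H_L1(s1, LM_POOL))
      + (H_L2(s2, LM_IN) - H_L2(s2, LM_POOL)).
    Lower scores mean the pair looks in-domain on both sides."""
    return (h_in_l1 - h_pool_l1) + (h_in_l2 - h_pool_l2)
```

Summing the two sides means a pair must look in-domain in both languages to rank highly; a good source sentence cannot compensate for a noisy target side.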

SLIDE 12

Using Fewer Words

  • How much can we trust rare words?
  • If a word is seen 2 times in the general corpus and 3 times in the in-domain one, is it really 50% more likely?
  • Low-frequency words are often ignored (Good-Turing smoothing, singleton pruning, ...).

SLIDE 13

Hybrid word/POS Corpora

  • In stylometry, syntactic structure is a proxy for style.
  • POS-tag n-grams are used as features to determine authorship, genre, etc.
  • Incorporate this idea as a pre-processing step to data selection:


SLIDE 14

Hybrid word/POS Corpora

  • In stylometry, syntactic structure is a proxy for style.
  • POS-tag n-grams are used as features to determine authorship, genre, etc.
  • Incorporate this idea as a pre-processing step to data selection:

Replace rare words with POS tags.

SLIDE 15

Hybrid word/POS Corpora

  • Replace rare words with POS tags:
    • an earthquake in Port-au-Prince
    • an earthquake in NNP
SLIDE 16

Hybrid word/POS Corpora

  • Replace rare words with POS tags:
    • an earthquake in Port-au-Prince
    • an NN in NNP
SLIDE 17

Hybrid word/POS Corpora

  • Replace rare(?) words with POS tags:
    • an earthquake in Port-au-Prince
    • DT NN IN NNP
SLIDE 18

Hybrid word/POS Corpora

  • Replace rare words with POS tags:
    • an earthquake in Port-au-Prince
    • an earthquake in NNP
    • an earthquake in Kodari
SLIDE 19

Hybrid word/POS Corpora

  • Replace rare words with POS tags:
    • an earthquake in Port-au-Prince
    • an earthquake in NNP
    • an earthquake in Kodari
  • Threshold: replace a word if its Count < 10 in either corpus.
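The replacement step above can be sketched as follows, assuming POS-tagged input and per-corpus word counts are already available (the function name and count threshold handling are illustrative):

```python
from collections import Counter

def hybridize(tagged_sents, general_counts, indomain_counts, threshold=10):
    """Replace any word seen fewer than `threshold` times in either
    corpus with its POS tag. `tagged_sents` holds (word, tag) pairs;
    counts are Counters, so unseen words count as zero."""
    return [
        [tag if min(general_counts[w], indomain_counts[w]) < threshold else w
         for w, tag in sent]
        for sent in tagged_sents
    ]
```

Frequent words keep their surface form, so the style-bearing function words survive, while rare content words collapse into POS classes with robust statistics.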

SLIDE 20

Using Fewer Words

  • Use the hybrid word/POS texts instead of the original corpora.
  • Train LMs on the corpora, compute sentence scores, and re-rank the original general corpus.
  • Standard Moore-Lewis / cross-entropy difference, but with a different corpus representation.

SLIDE 21

TED Zh-En Translation

  • Task: translate TED talks, Chinese-to-English, using LDC data (6M sentence pairs).
  • Vocabulary reduction from TED+LDC: eliminate 97% of the vocabulary.

    Lang | Vocab   | Kept   | %
    En   | 470,154 | 10,036 | 2.1%
    Zh   | 729,283 | 11,440 | 1.5%

  • What happens to SMT performance?

SLIDE 22

TED Zh-En Translation

  • Slightly better scores, despite a (much) smaller selection vocabulary!

SLIDE 23

In-Domain Lexical Coverage

  • Up to 10% more in-domain coverage
SLIDE 24

General-Domain Coverage

  • Hybrid-selected data covers 10-15% more of the general lexicon.
SLIDE 25

Hybrid Word/POS Selection

  • Must re-compute for every task/pool, but vocabulary statistics are easy.
  • Aggregating the statistics for rare terms allows generalizing to other unseen words.
  • Perhaps preserving sentence structure, picking up words that fill similar roles/patterns in the sentence?


SLIDE 26

Hybrid Word/POS Selection

  • Replace all rare words with POS tags, then run regular data selection.
  • Reduces the active lexicon by 97%, to ~10k words with robust statistics.
  • Potentially helpful for algorithms bound by vocabulary size "V".
  • The selection LM is 25% smaller.
SLIDE 27

Questions?

SLIDE 28

[ this slide intentionally left blank ]