

  1. Tuning SMT Systems on the Training Set. Chris Dyer, Patrick Simianer, Stefan Riezler, Phil Blunsom, Eva Hasler. Project Report, MT Marathon 2011, FBK Trento.

  2. Tuning SMT Systems on the Training Set
     Goal: Discriminative training using sparse features on the full training set.
     Approach: Picky-picky / elitist learning:
     - Stochastic learning with true random selection of examples.
     - Feature selection according to various regularization criteria.
     - Leave-one-out estimation: leave out the sentence/shard currently being trained on when extracting rules/features in training (sketched below).
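
A minimal sketch of the leave-one-out idea above, assuming per-sentence rule counts are available as Python Counters; the rule strings and the recompute-per-sentence layout are illustrative, not the project's implementation:

    from collections import Counter

    def leave_one_out_counts(per_sentence_counts, i):
        # Rule counts over the whole corpus minus the counts contributed
        # by sentence i -- the estimate used when extracting the grammar
        # for the sentence currently being trained on. A real system
        # would precompute the corpus total once instead of summing here.
        total = Counter()
        for counts in per_sentence_counts:
            total.update(counts)
        total.subtract(per_sentence_counts[i])
        return +total  # unary + drops zero and negative entries

    # Usage with two toy sentences and hypothetical rule identifiers:
    sents = [Counter({"X -> der Mann ||| the man": 2}),
             Counter({"X -> der Mann ||| the man": 1, "X -> Haus ||| house": 1})]
    print(leave_one_out_counts(sents, 0))
    # Counter({'X -> der Mann ||| the man': 1, 'X -> Haus ||| house': 1})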

  3. SMT Framework + Data
     - cdec decoder (https://github.com/redpony/cdec)
     - Hiero SCFG grammars
     - WMT11 news-commentary corpus: 132,755 parallel sentences, German-to-English

  4. Learning Framework: SGD for Pairwise Ranking
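
The slide's formal statement is not reproduced here; as a stand-in, here is a minimal sketch of one perceptron-style pairwise ranking update over sparse feature dicts (the exact loss and learning rate used in the project are assumptions):

    def rank_update(w, f_better, f_worse, eta=0.1):
        # One SGD step for pairwise ranking: if the higher-BLEU hypothesis
        # does not already outscore the worse one under the model, move
        # the weights toward its features and away from the other's.
        score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
        if score(f_better) <= score(f_worse):  # ranking violated or tied
            for k, v in f_better.items():
                w[k] = w.get(k, 0.0) + eta * v
            for k, v in f_worse.items():
                w[k] = w.get(k, 0.0) - eta * v
        return w

    # Usage with two hypothetical sparse feature vectors:
    w = rank_update({}, {"NG:the_man": 1.0}, {"NG:the_mans": 1.0})
    print(w)  # {'NG:the_man': 0.1, 'NG:the_mans': -0.1}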

  5. Constraint Selection = Sampling of Pairs
     Random sampling of pairs from the full chart for pairwise ranking:
     - First sample translations according to their model score.
     - Then sample pairs.
     - Sampling diminishes the problem of learning to discriminate translations that are too close to each other (in terms of sentence-wise approx. BLEU).
     - Sampling also speeds up learning.
     - Lots of variations on sampling are possible ... (one is sketched below)
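
A sketch of the two-stage sampling, approximated over a k-best list rather than the full chart the project uses; the sample sizes and the minimum BLEU gap are made-up illustration values:

    import math, random

    def sample_pairs(kbest, n_trans=100, n_pairs=50, min_gap=0.05):
        # kbest: list of (model_score, sentence_bleu, features) tuples.
        # Stage 1: sample translations with probability proportional to
        # exp(model_score); subtracting the max keeps exp() stable.
        m = max(s for s, _, _ in kbest)
        weights = [math.exp(s - m) for s, _, _ in kbest]
        drawn = random.choices(kbest, weights=weights, k=n_trans)
        # Stage 2: sample pairs, skipping near-ties in approx. BLEU --
        # the pairs that are uninformative to learn to discriminate.
        pairs = []
        for _ in range(20 * n_pairs):  # bounded attempts, not an open loop
            if len(pairs) == n_pairs:
                break
            a, b = random.sample(range(len(drawn)), 2)
            if abs(drawn[a][1] - drawn[b][1]) >= min_gap:
                better, worse = sorted((drawn[a], drawn[b]), key=lambda h: -h[1])
                pairs.append((better, worse))
        return pairs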

  6. Candidate Features
     Efficient computation of features from local rule context:
     - Hiero SCFG rule identifier
     - target n-grams within rule
     - target n-grams with gaps (X) within rule
     - binned rule counts in full training set
     - rule length features
     - rule shape features
     - word alignments in rules
     - ... and many more!
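
An illustrative extraction of a few of these templates from a single rule; the template names, the bigram restriction, and the log-scale binning are assumptions, not cdec's actual feature names:

    def rule_features(lhs, src, tgt, corpus_count):
        # src/tgt: lists of terminals, with nonterminal gaps written as "X".
        feats = {}
        # Hiero SCFG rule identifier.
        feats["RuleID:%s->%s|%s" % (lhs, " ".join(src), " ".join(tgt))] = 1.0
        # Target bigrams within the rule; "NGX" marks the gappy variant
        # that keeps the nonterminal "X" as a token.
        for a, b in zip(tgt, tgt[1:]):
            key = "%s:%s_%s" % ("NG" if "X" not in (a, b) else "NGX", a, b)
            feats[key] = feats.get(key, 0.0) + 1.0
        # Binned count of this rule in the full training set (log-scale).
        feats["CountBin:%d" % min(corpus_count.bit_length(), 8)] = 1.0
        # Rule shape: terminals collapsed to "w", gaps kept as "X".
        feats["Shape:" + "".join("X" if w == "X" else "w" for w in tgt)] = 1.0
        # Rule length.
        feats["TgtLen:%d" % len(tgt)] = 1.0
        return feats

    print(rule_features("X", ["der", "X", "Mann"], ["the", "X", "man"], 37))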

  7. Feature Selection
     ℓ1/ℓ2-regularization:
     - Compute the ℓ2-norm of the column vectors (= the vector over examples/shards for each of the n features), then the ℓ1-norm of the resulting n-dimensional vector.
     - The effect is to choose a small subset of features that are useful across all examples/shards.
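
A worked numpy example of exactly this computation, with a toy matrix of 3 shards x 3 features:

    import numpy as np

    # One weight vector per shard, stacked as rows, so each column is
    # "the vector over shards" for one of the n features.
    W = np.array([[0.9, 0.0,  0.1],
                  [1.1, 0.0, -0.2],
                  [0.8, 0.5,  0.0]])

    col_l2 = np.linalg.norm(W, axis=0)  # l2-norm per feature column
    print(col_l2)        # feature 0 is strong on every shard
    print(col_l2.sum())  # l1-norm of that vector = the l1/l2 mixed norm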

  8. Feature Selection, done properly
     Incremental gradient-based selection of column vectors (Obozinski, Taskar, Jordan: Joint covariate selection and joint subspace selection for multiple classification problems. Stat Comput (2010)).

  9. Feature Selection, quick and dirty
     Combine feature selection with averaging:
     - Keep only those features with a large enough ℓ2-norm computed over examples/shards.
     - Then average feature values over examples/shards.
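
Continuing the toy example, a sketch of the quick-and-dirty variant: threshold the per-feature ℓ2-norm, then average the surviving columns over shards (the threshold value is arbitrary):

    import numpy as np

    def select_and_average(W, tau=0.5):
        # Keep only features whose l2-norm over shards is large enough,
        # then average the surviving feature values over shards.
        norms = np.linalg.norm(W, axis=0)
        keep = norms >= tau
        return keep, W[:, keep].mean(axis=0)

    W = np.array([[0.9, 0.0,  0.1],
                  [1.1, 0.0, -0.2],
                  [0.8, 0.5,  0.0]])
    keep, w_avg = select_and_average(W)
    print(keep)   # [ True  True False]: the weak third feature is dropped
    print(w_avg)  # averaged weights for the two surviving features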

  10. How far did we get in a few days?
      First full training run finished!
      - 150k parallel sentences from news-commentary data, German-to-English
      - pairwise ranking perceptron
