SLIDE 1

Discriminative Training

February 19, 2013


SLIDE 2

Noisy Channels Again

[Diagram: an English source generates e with probability p(e).]

SLIDE 3

Noisy Channels Again

[Diagram: an English source generates e with probability p(e); the channel then produces German g with probability p(g | e).]

SLIDE 4

Noisy Channels Again

[Diagram: English source p(e), channel p(g | e), and a decoder that recovers e from the observed German g.]

e* = argmax_e p(e | g)
   = argmax_e p(g | e) · p(e) / p(g)
   = argmax_e p(g | e) · p(e)


SLIDE 6

Noisy Channels Again

e* = argmax_e p(e | g)
   = argmax_e p(g | e) · p(e) / p(g)
   = argmax_e p(g | e) · p(e)
   = argmax_e log p(g | e) + log p(e)
   = argmax_e wᵀ h(g, e),  where w = [1, 1]ᵀ and h(g, e) = [log p(g | e), log p(e)]ᵀ

SLIDE 8

Noisy Channels Again

Does this look familiar?

SLIDE 9

The Noisy Channel

[Figure: candidate translations plotted in a plane whose axes are the two features, log p(g | e) and log p(e).]

SLIDE 13

As a Linear Model

[Figure: the same candidates in the (log p(g | e), log p(e)) plane, now scored along a weight vector w.]

Improvement 1: change w to find better translations

SLIDE 17

As a Linear Model

[Figure: the same plane; no reorientation of w separates the good candidates from the bad ones.]

Improvement 2: add dimensions to make points separable

SLIDE 18

Linear Models

  • Improve the modeling capacity of the noisy channel in two ways
  • Reorient the weight vector
  • Add new dimensions (new features)
  • Questions
  • What features?
  • How do we set the weights?

e* = argmax_e wᵀ h(g, e)

SLIDE 20

Mann beißt Hund  (German: “man bites dog”)

x BITES y

SLIDE 21

Candidate translations of “Mann beißt Hund”:

  • man bites cat
  • man bite cat
  • man bites dog
  • man chase dog
  • man bite dog
  • man bites dog

x BITES y

SLIDE 28

Feature Classes

Lexical
bank = “River bank” vs. “Financial institution”
Are lexical choices appropriate?

Configurational
Are semantic/syntactic relations preserved?
“Dog bites man” vs. “Man bites dog”

Grammatical
Is the output fluent / well-formed?
“Man bites dog” vs. “Man bite dog”

SLIDE 31

What do lexical features look like?

Mann beißt Hund → man bites cat

First attempt:

score(g, e) = wᵀ h(g, e)

h₁₅₃₄₂(g, e) = 1 if ∃ i, j : gᵢ = Hund and eⱼ = cat; 0 otherwise

SLIDE 32

But what if a cat is being chased by a Hund?

SLIDE 33

What do lexical features look like?

Latent variables enable more precise features:

score(g, e, a) = wᵀ h(g, e, a)

h₁₅₃₄₂(g, e, a) = Σ_{(i,j) ∈ a} [1 if gᵢ = Hund and eⱼ = cat; 0 otherwise]
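A small sketch of the two definitions (the index 15,342 is just the slides' example slot in a long feature vector; the helper names here are mine):

```python
def h_cooccurrence(g, e):
    """First attempt: fires whenever 'Hund' appears anywhere in the source
    and 'cat' anywhere in the hypothesis, even if they are unrelated."""
    return 1 if "Hund" in g and "cat" in e else 0

def h_aligned(g, e, a):
    """Latent-variable version: counts alignment links (i, j) in a that
    actually pair 'Hund' with 'cat'."""
    return sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat")

g = "Mann beißt Hund".split()
e = "man bites cat".split()
a = {(0, 0), (1, 1), (2, 2)}      # word alignment: Hund <-> cat
print(h_cooccurrence(g, e), h_aligned(g, e, a))   # 1 1
```

The second version answers the objection on slide 32: if the cat is merely being chased by a Hund, no alignment link pairs the two words, so the feature stays 0.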

SLIDE 34

Standard Features

  • Target side features
  • log p(e) [ n-gram language model ]
  • Number of words in hypothesis
  • Non-English character count
  • Source + target features
  • log relative frequency e|f of each rule [ log #(e,f) - log #(f) ]
  • log relative frequency f|e of each rule [ log #(e,f) - log #(e) ]
  • “lexical translation” log probability e|f of each rule [ ≈ log pmodel1(e|f) ]
  • “lexical translation” log probability f|e of each rule [ ≈ log pmodel1(f|e) ]
  • Other features
  • Count of rules/phrases used
  • Reordering pattern probabilities

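The relative-frequency features are simple count ratios; a sketch with made-up rule counts (the numbers and names are illustrative only):

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts over extracted rules/phrases.
pair_counts = Counter({("man", "Mann"): 80, ("husband", "Mann"): 20})
e_counts = Counter({"man": 90, "husband": 25})
f_counts = Counter({"Mann": 100})

def log_relfreq_e_given_f(e, f):
    # log p(e|f) = log #(e,f) - log #(f)
    return math.log(pair_counts[(e, f)]) - math.log(f_counts[f])

def log_relfreq_f_given_e(e, f):
    # log p(f|e) = log #(e,f) - log #(e)
    return math.log(pair_counts[(e, f)]) - math.log(e_counts[e])

print(log_relfreq_e_given_f("man", "Mann"))   # log(80/100)
```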

SLIDE 35

Parameter Learning

SLIDE 36

Hypothesis Space

[Figure: decoder hypotheses plotted as points in feature space (h1, h2).]

SLIDE 38

Hypothesis Space

[Figure: the same space, now also marking the reference translations.]

SLIDE 39

Preliminaries

We assume a decoder that computes:

⟨e*, a*⟩ = argmax_⟨e,a⟩ wᵀ h(g, e, a)

And K-best lists of the same, that is:

{⟨eᵢ*, aᵢ*⟩}, i = 1…K = arg ith-max_⟨e,a⟩ wᵀ h(g, e, a)

Standard, efficient algorithms exist for this.
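As a toy stand-in for the assumed decoder (real systems get the K-best by dynamic programming over a packed lattice or hypergraph), one can imagine scoring an explicit candidate list; `space` below is a hypothetical list of ((e, a), h) pairs:

```python
import heapq

def score(w, h):
    """w . h(g, e, a) for a feature vector h."""
    return sum(wi * hi for wi, hi in zip(w, h))

def kbest(space, w, K):
    """The K highest-scoring <e, a> pairs from an explicit candidate list."""
    return heapq.nlargest(K, space, key=lambda pair: score(w, pair[1]))
```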

SLIDE 40

Learning Weights

  • Try to match the reference translation exactly
  • Conditional random field
  • Maximize the conditional probability of the reference translations
  • “Average” over the different latent variables
  • Max-margin
  • Find the weight vector that separates the reference translation from others by the maximal margin
  • Maximal setting of the latent variables

SLIDE 42

Problems

  • These methods give “full credit” when the model exactly produces the reference, and no credit otherwise
  • What is the problem with this?
  • There are many ways to translate a sentence
  • What if we have multiple reference translations?
  • What about partial credit?

SLIDE 44

Cost-Sensitive Training

  • Assume we have a cost function that gives a score for how good/bad a translation ê is relative to the reference set E:

Δ(ê, E) ∈ [0, 1]

  • Optimize the weight vector by making reference to this function
  • We will talk about two ways to do this
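One plausible instantiation of Δ, assuming (this is not specified on the slide) that we take 1 minus a smoothed sentence-level BLEU against the reference set E:

```python
import math
from collections import Counter

def cost(hyp, refs, max_n=4):
    """Delta(hyp, E) in [0, 1]: 0 = perfect match, near 1 = no overlap."""
    hyp = hyp.split()
    refs = [r.split() for r in refs]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        matches = 0
        for ng, c in h_ngrams.items():
            best = max(
                Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))[ng]
                for r in refs
            )
            matches += min(c, best)
        # add-one smoothing keeps the log defined for higher-order n-grams
        log_prec += math.log((matches + 1) / (sum(h_ngrams.values()) + 1)) / max_n
    ref_len = min(len(r) for r in refs)
    brevity = min(1.0, math.exp(1 - ref_len / max(len(hyp), 1)))
    return 1.0 - brevity * math.exp(log_prec)

print(cost("man bites dog", ["man bites dog"]))   # 0.0
```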

SLIDE 45

K-Best List Example

[Figure: the 10-best hypotheses #1–#10 in feature space (h1, h2), ranked along the weight vector w.]

SLIDE 47

K-Best List Example

[Figure: the same 10-best list, with each hypothesis shaded by its cost, binned into 0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0.]

SLIDE 48

Training as Classification

  • Pairwise Ranking Optimization (PRO)
  • Reduce the training problem to binary classification with a linear model
  • Algorithm (a sketch follows below)
  • For i = 1 to N
  • Pick a random pair of hypotheses (A, B) from the K-best list
  • Use the cost function to determine whether A or B is better
  • Create the ith training instance
  • Train a binary linear classifier
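A sketch of the reduction, assuming each K-best entry is a (feature_vector, cost) pair; any off-the-shelf binary linear learner works for the last step, so a few perceptron epochs stand in for it here:

```python
import random

def pro_instances(kbest, n_samples):
    """Sample hypothesis pairs and turn each into one binary instance."""
    data = []
    for _ in range(n_samples):
        (hA, cA), (hB, cB) = random.sample(kbest, 2)
        if cA == cB:
            continue                          # no preference: skip the pair
        y = 1 if cA < cB else -1              # +1 iff A has lower cost (is better)
        x = [a - b for a, b in zip(hA, hB)]   # feature difference hA - hB
        data.append((x, y))
    return data

def train_binary(data, epochs=10):
    """Perceptron on the difference vectors: learn w such that
    w . (hA - hB) > 0 exactly when A is the better hypothesis."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```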

SLIDE 49

[Figures (slides 49–58): pairs of hypotheses sampled from the 10-best list; the cost function labels each sampled pair “Better!” or “Worse!”, and each comparison becomes one binary training instance in (h1, h2) space.]

SLIDE 60

Fit a linear model

[Figure: a linear separator w fit to the labeled pairwise instances in (h1, h2) space.]

SLIDE 61

K-Best List Example

[Figure: the 10-best hypotheses and their cost bins, now ranked along the updated weight vector w.]

SLIDE 62

MERT

  • Minimum Error Rate Training
  • Directly target an automatic evaluation metric
  • BLEU is defined at the corpus level
  • MERT optimizes at the corpus level
  • Downsides
  • Does not deal well with more than ~20 features

SLIDE 63

MERT

Given a weight vector w, any hypothesis ⟨e, a⟩ has a (scalar) score:

m = wᵀ h(g, e, a)

Now pick a search vector v, and consider how the score of this hypothesis changes as we move along it, w_new = w + γv:

m = (w + γv)ᵀ h(g, e, a)
  = wᵀ h(g, e, a) + γ · vᵀ h(g, e, a)
  = aγ + b,  where a = vᵀ h(g, e, a) and b = wᵀ h(g, e, a)

A linear function in 2D!
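A sketch of this reparameterization for a whole K-best list (helper names are mine): each hypothesis's feature vector h becomes one line m(γ) = aγ + b, and the 1-best hypothesis as a function of γ is the upper envelope of those lines, computable with a standard convex-hull sweep:

```python
def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def as_line(w, v, h):
    """(slope, intercept): m(gamma) = dot(v, h) * gamma + dot(w, h)."""
    return dot(v, h), dot(w, h)

def upper_envelope(lines):
    """Indices of the (slope, intercept) lines on the upper envelope,
    ordered left to right in gamma."""
    order = sorted(range(len(lines)), key=lambda i: lines[i])
    hull = []
    def dominated(i, j, k):
        # line j never reaches the top if lines i and k cross above it
        (a1, b1), (a2, b2), (a3, b3) = lines[i], lines[j], lines[k]
        return (b3 - b1) * (a2 - a1) >= (b2 - b1) * (a3 - a1)
    for i in order:
        while hull and lines[hull[-1]][0] == lines[i][0]:
            hull.pop()     # equal slopes: keep the larger intercept (i)
        while len(hull) >= 2 and dominated(hull[-2], hull[-1], i):
            hull.pop()
        hull.append(i)
    return hull
```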

SLIDE 70

MERT

Recall our k-best set {⟨eᵢ*, aᵢ*⟩}, i = 1…K.

[Figure: each k-best hypothesis is one line m = aγ + b in the (γ, m) plane.]

SLIDE 73

MERT

[Figure: the upper envelope of the score lines is piecewise linear; here three hypotheses, ⟨e*_162, a*_162⟩, ⟨e*_28, a*_28⟩, and ⟨e*_73, a*_73⟩, are each 1-best on some interval of γ.]

SLIDE 75

MERT

[Figure: beneath the envelope, the error count is plotted against γ; it is constant on each interval where the 1-best hypothesis does not change.]

SLIDE 78

MERT

[Figure: the resulting errors-vs-γ step function, with the minimizing value of γ marked.]

SLIDE 79

[Figure: errors as a function of γ.]

Let γ* be the value of γ that minimizes errors. Then update:

w_new = γ*v + w
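Putting the pieces together, a simplified single-sentence sweep (a sketch: real MERT accumulates per-interval sufficient statistics for corpus BLEU before choosing γ*; here each hypothesis just carries a scalar cost, and `upper_envelope` is the sketch from slide 63):

```python
def best_gamma(lines, costs):
    """Probe one gamma inside each envelope segment; return the gamma
    whose 1-best hypothesis has the lowest cost."""
    hull = upper_envelope(lines)
    if len(hull) == 1:
        return 0.0                        # one hypothesis wins everywhere
    # breakpoints where the 1-best hypothesis changes
    xs = [(lines[i][1] - lines[j][1]) / (lines[j][0] - lines[i][0])
          for i, j in zip(hull, hull[1:])]
    # one probe per segment: before the first breakpoint, between
    # consecutive breakpoints, and after the last one
    probes = ([xs[0] - 1.0]
              + [(u + v) / 2 for u, v in zip(xs, xs[1:])]
              + [xs[-1] + 1.0])
    gamma_star, _ = min(zip(probes, hull), key=lambda p: costs[p[1]])
    return gamma_star

def update(w, v, gamma_star):
    """w_new = gamma* v + w."""
    return [wi + gamma_star * vi for wi, vi in zip(w, v)]
```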

SLIDE 80

MERT

  • In practice, “errors” are sufficient statistics for evaluation metrics (e.g., BLEU)
  • Can maximize or minimize!
  • The envelope can also be computed using dynamic programming
  • Interesting complexity bounds
  • How do you pick the search direction?

SLIDE 81

Summary

  • Evaluation metrics
  • Figure out how well we’re doing
  • Figure out if a feature helps
  • But ALSO: train your system!
  • What’s a great way to improve translation?
  • Improve evaluation!

SLIDE 82

Thank You!