SLIDE 1

Unsupervised Neural Hidden Markov Models

Ke Tran1, Yonatan Bisk, Ashish Vaswani2, Daniel Marcu and Kevin Knight

USC Information Sciences Institute

1Univ of Amsterdam, 2Google Brain

I am not Ke Tran

https://github.com/ketranm/neuralHMM

SLIDE 2

Bayesian Models

  • HMMs, CFGs, … have been the standard workhorses of the NLP community
  • Generative models lend themselves to unsupervised estimation
  • Bayesian models have elegant, but often very parametrically expensive, smoothing approaches

SLIDE 3

Why Neuralize Bayesian Models?

  • Unsupervised structure learning
  • Simple modular extensions
  • Embeddings and vector representations have been shown to generalize well

SLIDE 4

This is a nice direction

Relevant EMNLP 2016 papers:

  • Online Segment to Segment Neural Transduction. Lei Yu, Jan Buys, and Phil Blunsom.
  • Unsupervised Neural Dependency Parsing. Yong Jiang, Wenjuan Han, and Kewei Tu.

SLIDE 5

Hidden Markov Models

[Figure: HMM graphical model: a Markov chain of hidden states z1, …, zN, where each state zt emits the observed word xt]

Given an observed sequence of text x, the probability of a given token is p(xt|zt) × p(zt|zt−1), and the joint probability of the sequence factorizes as:

p(x, z) = ∏_{t=1}^{n+1} p(zt | zt−1) · ∏_{t=1}^{n} p(xt | zt)
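As a concrete illustration, here is a minimal numpy sketch of this factorization; the toy sizes and the matrices `trans` and `emit` are hypothetical, not taken from the paper:

```python
import numpy as np

K, V = 3, 5  # number of hidden states and vocabulary size (toy values)
rng = np.random.default_rng(0)

# Row-stochastic parameter matrices; an extra state K serves as the
# sentence boundary, giving the product over t = 1 .. n+1 transitions.
trans = rng.dirichlet(np.ones(K + 1), size=K + 1)  # p(zt | zt-1)
emit = rng.dirichlet(np.ones(V), size=K + 1)       # p(xt | zt)

def log_joint(x, z):
    """log p(x, z) = sum_t log p(zt | zt-1) + sum_t log p(xt | zt)."""
    lp, prev = 0.0, K  # start from the boundary state
    for xt, zt in zip(x, z):
        lp += np.log(trans[prev, zt]) + np.log(emit[zt, xt])
        prev = zt
    return lp + np.log(trans[prev, K])  # transition into the final boundary (t = n+1)

print(log_joint(x=[0, 2, 4], z=[1, 0, 2]))
```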

SLIDE 6

Supervised POS Tagging

The orange man will lose the election
DT  JJ     NN  MD   VB   DT  NN

Goal: Predict the correct class for each word in the sentence.
Solution: Count and divide.

p(orange | JJ) = |orange, JJ| / |JJ|        p(JJ | DT) = |DT, JJ| / |DT|

Parameters: V × K (emission), K × K (transition)
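Count-and-divide is just relative-frequency estimation; a minimal sketch with a hypothetical toy corpus:

```python
from collections import Counter

# Toy tagged corpus of (word, tag) pairs (hypothetical data).
corpus = [[("the", "DT"), ("orange", "JJ"), ("man", "NN")],
          [("the", "DT"), ("election", "NN")]]

emit_counts, tag_counts = Counter(), Counter()
for sent in corpus:
    for word, tag in sent:
        emit_counts[(word, tag)] += 1
        tag_counts[tag] += 1

# p(word | tag) = |word, tag| / |tag|
p_emit = {(w, t): c / tag_counts[t] for (w, t), c in emit_counts.items()}
print(p_emit[("orange", "JJ")])  # 1.0 in this toy corpus
```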

SLIDE 7

Simple Supervised Neural HMM

The orange man will lose the election
DT  JJ     NN  MD   VB   DT  NN

Replace the parameter matrices with neural networks + softmax; train with cross entropy.

[Figure: the Emission Network maps a state (JJ) to a distribution over words (orange); the Transition Network maps the previous state (DT) to a distribution over next states (JJ)]
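A minimal PyTorch sketch of this parameterization; the layer sizes and class names are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

K, V, D = 45, 10_000, 128  # states, vocabulary, embedding size (hypothetical)

class EmissionNet(nn.Module):
    """Map a state to a distribution over words: p(xt | zt)."""
    def __init__(self):
        super().__init__()
        self.state_emb = nn.Embedding(K, D)
        self.out = nn.Linear(D, V)

    def forward(self, z):
        return torch.log_softmax(self.out(self.state_emb(z)), dim=-1)

class TransitionNet(nn.Module):
    """Map the previous state to a distribution over next states: p(zt | zt-1)."""
    def __init__(self):
        super().__init__()
        self.state_emb = nn.Embedding(K, D)
        self.out = nn.Linear(D, K)

    def forward(self, z_prev):
        return torch.log_softmax(self.out(self.state_emb(z_prev)), dim=-1)
```

In the supervised case, both networks are trained with cross entropy against the observed tags and words.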

SLIDE 8

Unsupervised Neural HMM

The orange man will lose the election
?    ?     ?   ?    ?    ?   ?

[Figure: the same Emission and Transition Networks as before, but the states are now unobserved]

SLIDE 9

Bayesian POS Tag Induction

The orange man will lose the election
C1  C2     C4  C14  C12  C1  C4

Goal: Discover the set of classes which best models the observed data.
Solution: Baum-Welch.

SLIDE 10

Posteriors

p(zt = i | x): probability of a specific cluster assignment
p(zt = i, zt+1 = j | x): probability of a specific cluster transition

Bayesian update: count and divide.
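These posteriors come from the forward-backward algorithm. A minimal numpy sketch, assuming toy row-stochastic matrices `trans` (K × K), `emit` (K × V), and an initial distribution `init` (the names are mine):

```python
import numpy as np

def forward_backward(obs, trans, emit, init):
    """Return gamma[t, i] = p(zt = i | x) and xi[t, i, j] = p(zt = i, zt+1 = j | x)."""
    T, K = len(obs), trans.shape[0]
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):                                   # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                          # backward pass
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[-1].sum()                                     # p(x)
    gamma = alpha * beta / Z
    xi = (alpha[:-1, :, None] * trans[None] *
          (emit[:, obs[1:]].T * beta[1:])[:, None, :]) / Z
    return gamma, xi
```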

SLIDE 11

Count and Divide

[Figure: EM re-estimation of the emission parameters: initialize p(wi|Cj) (0.3, 0.1, 0.2, 0.4); sum the posteriors p(wi, Cj) over the corpus to get expected counts (50, 2, 4, 35); normalize to obtain p̂(wi|Cj) (0.55, 0.02, 0.04, 0.38)]
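The M-step is count-and-divide over these expected counts. A minimal sketch building on the `forward_backward` function above (a toy, not the authors' code):

```python
import numpy as np

def m_step_emissions(corpus, trans, emit, init, V):
    """Re-estimate p(w | C) from posterior (soft) counts: count and divide."""
    K = trans.shape[0]
    exp_counts = np.zeros((K, V))
    for obs in corpus:                    # each obs is a list of word ids
        gamma, _ = forward_backward(obs, trans, emit, init)
        for t, w in enumerate(obs):
            exp_counts[:, w] += gamma[t]  # soft count of (class, word)
    return exp_counts / exp_counts.sum(axis=1, keepdims=True)
```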

SLIDE 12

Unsupervised Neural HMM

The orange man will lose the election
?    ?     ?   ?    ?    ?   ?

[Figure: the Emission Network is trained toward the posterior p(zt = i | x), and the Transition Network toward p(zt = i, zt+1 = j | x)]

SLIDE 13

Generalized EM

E-Step: compute the surrogate q
M-Step: maximize the expectation

ln p(x|θ) = E_{q(z)}[ln p(x, z|θ)] + H[q(z)] + KL[q(z) || p(z|x, θ)]
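For completeness, this decomposition holds for any distribution q(z); a short derivation (standard EM algebra, not on the slide):

```latex
\begin{aligned}
\ln p(x \mid \theta)
  &= \sum_z q(z)\,\ln p(x \mid \theta)
   = \sum_z q(z)\,\ln \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)} \\
  &= \sum_z q(z)\,\ln \frac{p(x, z \mid \theta)}{q(z)}
   + \sum_z q(z)\,\ln \frac{q(z)}{p(z \mid x, \theta)} \\
  &= \mathbb{E}_{q(z)}[\ln p(x, z \mid \theta)] + H[q(z)]
   + \mathrm{KL}[\,q(z) \,\|\, p(z \mid x, \theta)\,]
\end{aligned}
```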

SLIDE 14

What is the gradient?

Set q(z) = p(z|x, θ), so the KL term vanishes:

E_{q(z)}[ln p(x, z|θ)] + H[q(z)] + KL[q(z) || p(z|x, θ)]  →  E_{q(z)}[ln p(x, z|θ)] + H[q(z)]

Take the derivative w.r.t. θ (the entropy term does not depend on θ):

∇J(θ) = Σ_z p(z|x) ∂ ln p(x, z|θ) / ∂θ

Jason Eisner probably has something to say here
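This gradient is exactly what backpropagation computes if the posteriors are held fixed as weights on the complete-data log likelihood. A hedged PyTorch sketch (the function and variable names are mine):

```python
import torch

def neg_expected_log_joint(log_emit, log_trans, gamma, xi, obs):
    """-E_q[ln p(x, z | theta)] with q = fixed posteriors from forward-backward.

    log_emit:  (K, V) log p(x | z) from the emission network
    log_trans: (K, K) log p(z' | z) from the transition network
    gamma:     (T, K) p(zt = i | x), detached (treated as constants)
    xi:        (T-1, K, K) p(zt = i, zt+1 = j | x), detached
    obs:       length-T list of word ids
    """
    emit_term = (gamma * log_emit[:, obs].T).sum()
    trans_term = (xi * log_trans).sum()
    return -(emit_term + trans_term)

# loss.backward() plus an SGD/Adam update is one generalized M-step.
```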

SLIDE 15

Initial Evaluation

SLIDE 16

Induction Metrics

  • 1-1: Bijection between induced and gold classes
  • M-1: Map induced class to its closest gold class
  • V-M: Harmonic mean of H(c,g) and H(g,c)

Higher numbers are better
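As an example of scoring, many-to-one (M-1) accuracy can be computed as in this small sketch (the function name and toy data are mine):

```python
from collections import Counter

def many_to_one(induced, gold):
    """Map each induced class to its most frequent gold class, then score accuracy."""
    pair_counts = Counter(zip(induced, gold))
    best = {}  # induced class -> most frequent gold class seen with it
    for (c, g), n in pair_counts.items():
        if n > pair_counts.get((c, best.get(c)), 0):
            best[c] = g
    return sum(best[c] == g for c, g in zip(induced, gold)) / len(gold)

print(many_to_one([0, 0, 1, 1], ["DT", "DT", "NN", "VB"]))  # 0.75
```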

SLIDE 17

Evaluation

            1-1    M-1    V-M
HMM         41.4   62.5   53.3
Neural HMM  45.7   59.8   54.2

The neural model has access to no additional information

SLIDE 18

Morphology

[Figure: emission architecture: state embeddings feed a ReLU layer and a softmax over the V-word emission matrix, with word representations from a Char-CNN]

Char-CNN: kernels = {1, 2, 3, 4, 5, 6, 7}, feature_maps = {50, 100, 128, 128, 128, 128, 128}

CNN-based embeddings provide morphological information.
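A hedged PyTorch sketch of a character CNN with those kernel widths and feature-map counts; the character vocabulary and character embedding size are assumptions:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level word encoder: embed characters, convolve, max-pool over time."""
    def __init__(self, n_chars=100, char_dim=15,
                 kernels=(1, 2, 3, 4, 5, 6, 7),
                 feature_maps=(50, 100, 128, 128, 128, 128, 128)):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, f, kernel_size=k)
            for k, f in zip(kernels, feature_maps))

    def forward(self, chars):                 # chars: (batch, word_len), word_len >= 7
        h = self.emb(chars).transpose(1, 2)   # (batch, char_dim, word_len)
        pooled = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)       # (batch, 790) word representation
```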

SLIDE 19

Evaluation

            1-1    M-1    V-M
HMM         41.4   62.5   53.3
Neural HMM  45.7   59.8   54.2
+ Conv      48.3   74.1   66.1

SLIDE 20

Extended Context

Traditional:
  p(zt | zt−1)                    bi-gram transition        K²
  p(zt | zt−1, zt−2)              tri-gram transition       K³
  p(zt | zt−1, zt−2, …, zt−n)     n-gram transition         K^(n+1)

Alternative:
  p(zt | zt−1, xt−1)              previous tag and word       V × K²
  p(zt | zt−1, xt−1, …, x0)       previous tag and sentence   V^t × K²

SLIDE 21

LSTM Context

[Figure: an LSTM reads x1, …, xT and at each position produces a transition matrix Tt−1,t]

The LSTM consumes the sentence and produces a transition matrix: p(zt | zt−1, xt−1, …, x0)
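A hedged PyTorch sketch of this idea; the sizes, names, and the linear projection to a K × K matrix are assumptions:

```python
import torch
import torch.nn as nn

class LSTMTransition(nn.Module):
    """Produce a per-position transition matrix p(zt | zt-1, context) from an LSTM."""
    def __init__(self, V, K, D=128):
        super().__init__()
        self.K = K
        self.emb = nn.Embedding(V, D)
        self.lstm = nn.LSTM(D, D, batch_first=True)
        self.out = nn.Linear(D, K * K)

    def forward(self, x):                         # x: (batch, T) word ids
        h, _ = self.lstm(self.emb(x))             # (batch, T, D)
        # (A one-step shift of h would make the context strictly x_{t-1}, ..., x_0.)
        logits = self.out(h).view(*x.shape, self.K, self.K)
        return torch.log_softmax(logits, dim=-1)  # (batch, T, K, K); row i = p(zt | zt-1 = i)
```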

SLIDE 22

Evaluation

               1-1    M-1    V-M
HMM            41.4   62.5   53.3
Neural HMM     45.7   59.8   54.2
+ Conv         48.3   74.1   66.1
+ LSTM         52.4   65.1   60.4
+ Conv & LSTM  60.7   79.1   71.7
Blunsom 2011          77.4   69.8
Yatbaz 2012           80.2   72.1

SLIDE 23

Types / Cluster

[Figure: word types per induced cluster (axis ticks 3,500, 7,000, 10,500, 14,000) comparing Gold, LSTM, FF, Conv, and Conv+LSTM clusterings]

SLIDE 24

Clusterings

Largest cluster:

  LSTM: of, in, to, for, on, from
  Conv: years, trading, sales, president, companies, prices

Numbers:

  LSTM: %, million, year, share, cents, 1/2
  Conv: million, billion, cents, points, point, trillion

SLIDE 25

What’s a good clustering?

C25: American British National Congress Japan San Federal West Dow
C15: Corp. Inc. Co. Board Group Bank Inc Bush Department

Both clusters map to the gold tag NNP.

SLIDE 26

Future Work

  • Harnessing Extra Data
  • Modifying the objective function
  • Multilingual experiments
  • Using this approach with other generative models
SLIDE 27

Thanks!

https://github.com/ketranm/neuralHMM

Parameter initialization, tricks, and ablations are covered in the paper and in the GitHub README.