Spectral Learning Algorithms for Natural Language Processing Shay - PowerPoint PPT Presentation

Canonical Correlation Analysis (CCA) ◮ Data consists of paired samples: ( x ( i ) , y ( i ) ) for i = 1 . . . n ◮ As in co-training, x ( i ) ∈ R d and y ( i ) ∈ R d ′ are two “views” of a sample point View 1 View 2 x (1) = (1 , 0 , 0 , 0) y (1) = (1 , 0 , 0 , 1 , 0 , 1 , 0) x (2) = (0 , 0 , 1 , 0) y (2) = (0 , 1 , 0 , 0 , 0 , 0 , 1) . . . . . . x (100000) = (0 , 1 , 0 , 0) y (100000) = (0 , 0 , 1 , 0 , 1 , 1 , 1) Spectral Learning for NLP 29

Example of Paired Data: Webpage Classification (Blum and Mitchell, 98) ◮ Determine if a webpage is an course home page course 1 ↓ course home page instructor’s TA’s · · · Announcements · · · − → ← − home home Lectures · · · TAs page page · · · Information · · · ↑ course 2 ◮ View 1. Words on the page: “Announcements”, “Lectures” ◮ View 2. Identities of pages pointing to the page: instructror’s home page, related course home pages ◮ Each view is sufficient for the classification! Spectral Learning for NLP 30

Example of Paired Data: Named Entity Recognition (Collins and Singer, 99) ◮ Identify an entity’s type as either Organization, Person, or Location . . . , says Mr. Cooper, a vice president of . . . ◮ View 1. Spelling features: “Mr.”, “Cooper” ◮ View 2. Contextual features: appositive=president ◮ Each view is sufficient to determine the entity’s type! Spectral Learning for NLP 31

Example of Paired Data: Bigram Model (the, dog) H (I, saw) (ran, to) X Y (John, was) . . . p ( h, x, y ) = p ( h ) × p ( x | h ) × p ( y | h ) ◮ EM can be used to estimate the parameters of the model ◮ Alternatively, CCA can be used to derive vectors which can be used in a predictor     0 . 3 − 1 . 5 . .  .   .  the = ⇒ dog = ⇒ . .     1 . 1 − 0 . 4 Spectral Learning for NLP 32

Projection Matrices ◮ Project samples to lower dimensional space x ∈ R d = ⇒ x ′ ∈ R p ◮ If p is small, we can learn with far fewer samples! Spectral Learning for NLP 33

Projection Matrices ◮ Project samples to lower dimensional space x ∈ R d = ⇒ x ′ ∈ R p ◮ If p is small, we can learn with far fewer samples! ◮ CCA finds projection matrices A ∈ R d × p , B ∈ R d ′ × p ◮ The new data points are a ( i ) ∈ R p , b ( i ) ∈ R p where a ( i ) x ( i ) b ( i ) y ( i ) = A ⊤ = B ⊤ �� p × 1 p × d d × 1 p × 1 p × d ′ d ′ × 1 Spectral Learning for NLP 33

Mechanics of CCA: Step 1 ◮ Compute ˆ C XY ∈ R d × d ′ , ˆ C XX ∈ R d × d , and ˆ C Y Y ∈ R d ′ × d ′ n C XY ] jk = 1 � ( x ( i ) x j )( y ( i ) [ ˆ − ¯ k − ¯ y k ) j n i =1 n C XX ] jk = 1 � ( x ( i ) x j )( x ( i ) [ ˆ − ¯ k − ¯ x k ) j n i =1 n C Y Y ] jk = 1 � [ ˆ ( y ( i ) y j )( y ( i ) − ¯ k − ¯ y k ) j n i =1 x = � y = � i x ( i ) /n and ¯ i y ( i ) /n where ¯ Spectral Learning for NLP 34

Mechanics of CCA: Step 2 C − 1 / 2 C − 1 / 2 ◮ Do SVD on ˆ XX ˆ C XY ˆ ∈ R d × d ′ Y Y C − 1 / 2 ˆ XX ˆ C XY ˆ C − 1 / 2 = U Σ V ⊤ SVD Y Y Let U p ∈ R d × p be the top p left singular vectors. Let V p ∈ R d ′ × p be the top p right singular vectors. Spectral Learning for NLP 37

Mechanics of CCA: Step 3 ◮ Define projection matrices A ∈ R d × p and B ∈ R d ′ × p A = ˆ C − 1 / 2 B = ˆ C − 1 / 2 XX U p Y Y V p ◮ Use A and B to project each ( x ( i ) , y ( i ) ) for i = 1 . . . n : x ( i ) ∈ R d = ⇒ A ⊤ x ( i ) ∈ R p y ( i ) ∈ R d ′ = ⇒ B ⊤ y ( i ) ∈ R p Spectral Learning for NLP 38

Input and Output of CCA x ( i ) = (0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , . . . , 0) ∈ R 50 , 000 ↓ a ( i ) = ( − 0 . 3 . . . 0 . 1) ∈ R 100 y ( i ) = (497 , 0 , 1 , 12 , 0 , 0 , 0 , 7 , 0 , 0 , 0 , 0 , . . . , 0 , 58 , 0) ∈ R 120 , 000 ↓ b ( i ) = ( − 0 . 7 . . . − 0 . 2) ∈ R 100 Spectral Learning for NLP 39

Overview Basic concepts Linear Algebra Refresher Singular Value Decomposition Canonical Correlation Analysis: Algorithm Canonical Correlation Analysis: Justification Lexical representations Hidden Markov models Latent-variable PCFGs Conclusion Spectral Learning for NLP 40

Justification of CCA: Correlation Coefficients ◮ Sample correlation coefficient for a 1 . . . a n ∈ R and b 1 . . . b n ∈ R is � n a )( b i − ¯ i =1 ( a i − ¯ b ) Corr ( { a i } n i =1 , { b i } n i =1 ) = �� n �� n i =1 ( b i − ¯ a ) 2 b ) 2 i =1 ( a i − ¯ a = � b = � i a i /n , ¯ where ¯ i b i /n b Correlation ≈ 1 a Spectral Learning for NLP 41

Simple Case: p = 1 ◮ CCA projection matrices are vectors u 1 ∈ R d , v 1 ∈ R d ′ ◮ Project x ( i ) and y ( i ) to scalars u 1 · x ( i ) and v 1 · y ( i ) Spectral Learning for NLP 42

Simple Case: p = 1 ◮ CCA projection matrices are vectors u 1 ∈ R d , v 1 ∈ R d ′ ◮ Project x ( i ) and y ( i ) to scalars u 1 · x ( i ) and v 1 · y ( i ) ◮ What vectors does CCA find? Answer: � � { u · x ( i ) } n i =1 , { v · y ( i ) } n u 1 , v 1 = arg max Corr i =1 u,v Spectral Learning for NLP 42

Finding the Next Projections ◮ After finding u 1 and v 1 , what vectors u 2 and v 2 does CCA find? Answer: � � { u · x ( i ) } n i =1 , { v · y ( i ) } n u 2 , v 2 = arg max Corr i =1 u,v subject to the constraints � � { u 2 · x ( i ) } n i =1 , { u 1 · x ( i ) } n Corr = 0 i =1 � � { v 2 · y ( i ) } n i =1 , { v 1 · y ( i ) } n Corr = 0 i =1 Spectral Learning for NLP 43

CCA as an Optimization Problem ◮ CCA finds for j = 1 . . . p (each column of A and B ) � � { u · x ( i ) } n i =1 , { v · y ( i ) } n u j , v j = arg max Corr i =1 u,v subject to the constraints � � { u j · x ( i ) } n i =1 , { u k · x ( i ) } n Corr = 0 i =1 � � { v j · y ( i ) } n i =1 , { v k · y ( i ) } n Corr = 0 i =1 for k < j Spectral Learning for NLP 44

Guarantees for CCA H X Y ◮ Assume data is generated from a Naive Bayes model ◮ Latent-variable H is of dimension k , variables X and Y are of dimension d and d ′ (typically k ≪ d and k ≪ d ′ ) ◮ Use CCA to project X and Y down to k dimensions (needs ( x, y ) pairs only!) ◮ Theorem: the projected samples are as good as the original samples for prediction of H (Foster, Johnson, Kakade, Zhang, 2009) ◮ Because k ≪ d and k ≪ d ′ we can learn to predict H with far fewer labeled examples Spectral Learning for NLP 45

Guarantees for CCA (continued) Kakade and Foster, 2007 - cotraining-style setting: ◮ Assume that we have a regression problem: predict some value z given two “views” x and y ◮ Assumption: either view x or y is sufficient for prediction ◮ Use CCA to project x and y down to a low-dimensional space ◮ Theorem: if correlation coefficients drop off to zero quickly, we will need far fewer samples to learn when using the projected representation ◮ Very similar setting to cotraining, but: ◮ No assumption of independence between the two views ◮ CCA is an exact algorithm - no need for heuristics Spectral Learning for NLP 46

Summary of the Section ◮ SVD is an efficient optimization technique ◮ Low-rank matrix approximation ◮ CCA derives a new representation of paired data that maximizes correlation ◮ SVD as a subroutine ◮ Next: use of CCA in deriving vector representations of words (“eigenwords”) Spectral Learning for NLP 47

Overview Basic concepts Lexical representations ◮ Eigenwords found using the thin SVD between words and context capture distributional similarity contain POS and semantic information about words are useful features for supervised learning Hidden Markov Models Latent-variable PCFGs Conclusion Spectral Learning for NLP 48

Uses of Spectral Methods in NLP ◮ Word sequence labeling ◮ Part of Speech tagging (POS) ◮ Named Entity Recognition (NER) ◮ Word Sense Disambiguation (WSD) ◮ Chunking, prepositional phrase attachment, ... ◮ Language modeling ◮ What is the most likely next word given a sequence of words (or of sounds)? ◮ What is the most likely parse given a sequence of words? Spectral Learning for NLP 49

Uses of Spectral Methods in NLP ◮ Word sequence labeling: semi-supervised learning ◮ Use CCA to learn vector representation of words ( eigenwords ) on a large unlabeled corpus. ◮ Eigenwords map from words to vectors, which are used as features for supervised learning. ◮ Language modeling: spectral estimation of probabilistic models ◮ Use eigenwords to reduce the dimensionality of generative models (HMMs,...) ◮ Use those models to compute the probability of an observed word sequence Spectral Learning for NLP 50

The Eigenword Matrix U ◮ U contains the singular vectors from the thin SVD of the bigram count matrix ate cheese ham I You ate 0 1 1 0 0 cheese 0 0 0 0 0 ham 0 0 0 0 0 I 1 0 0 0 0 You 2 0 0 0 0 I ate ham You ate cheese You ate Spectral Learning for NLP 51

The Eigenword Matrix U ◮ U contains the singular vectors from the thin SVD of the bigram matrix ( w t − 1 ∗ w t ) analogous to LSA, but uses context instead of documents ◮ Context can be multiple neighboring words (we often use the words before and after the target) ◮ Context can be neighbors in a parse tree ◮ Eigenwords can also be computed using the CCA between words and their contexts ◮ Words close in the transformed space are distributionally, semantically and syntactically similar ◮ We will later use U in HMMs and parse trees to project words to low dimensional vectors. Spectral Learning for NLP 52

Two Kinds of Spectral Models ◮ Context oblivious ( eigenwords ) ◮ learn a vector representation of each word type based on its average context ◮ Context sensitive ( eigentokens or state ) ◮ estimate a vector representation of each word token based on its particular context using an HMM or parse tree Spectral Learning for NLP 53

Eigenwords in Practice ◮ Work well with corpora of 100 million words ◮ We often use trigrams from the Google n-gram collection ◮ We generally use 30-50 dimensions ◮ Compute using fast randomized SVD methods Spectral Learning for NLP 54

How Big Should Eigenwords Be? ◮ A 40-D cube has 2 40 (about a trillion) vertices. ◮ More precisely, in a 40-D space about 1 . 5 40 ∼ 11 million vectors can all be approximately orthogonal. ◮ So 40 dimensions gives plenty of space for a vocabulary of a million words Spectral Learning for NLP 55

Fast SVD: Basic Method problem Find a low rank approximation to a n × m matrix M . solution Find an n × k matrix A such that M ≈ AA ⊤ M Spectral Learning for NLP 56

Fast SVD: Basic Method problem Find a low rank approximation to a n × m matrix M . solution Find an n × k matrix A such that M ≈ AA ⊤ M Construction A is constructed by: 1. create a random m × k matrix Ω (iid normals) 2. compute M Ω 3. Compute thin SVD of result: UDV ⊤ = M Ω 4. A = U better: iterate a couple times “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” by N. Halko, P. G. Martinsson, and J. A. Tropp. Spectral Learning for NLP 56

Eigenwords for ’Similar’ Words are Close miles 0.2 inches wife sister brother daughter uncle acres pounds son bytes father husband tons 0.1 boss mother girl degrees boy meters barrels 0.0 PC 2 guy farmer doctor woman lawyer man -0.1 teacher citizen pressure stress -0.2 gravity tension temperature permeability density viscosity -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 Spectral Learning for NLP 57

Eigenwords Capture Part of Speech 0.3 river disagree agree 0.2 0.1 house PC 2 word 0.0 boat truck cat carry eat car home talk push listen dog -0.1 sleep drink -0.2 -0.2 0.0 0.2 0.4 Spectral Learning for NLP 58

Eigenwords: Pronouns us 0.3 0.2 PC 2 0.1 our you i we 0.0 them they him he she -0.1 his her -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 Spectral Learning for NLP 59

Eigenwords: Numbers 2007 2008 2006 0.2 2009 2005 2004 2003 2002 2001 1999 0.1 1998 1996 1997 eight nine 1995 seven 2000 five six three two four ten one 0.0 PC 2 -0.1 -0.2 10 9 7 8 6 5 3 4 -0.3 2 1 -0.4 -0.2 0.0 0.2 Spectral Learning for NLP 60

Eigenwords: Names 0.15 tom mike bob joe 0.10 dan michael lisa john david 0.05 liz jennifer betty george paul PC 2 karen richard daniel linda 0.00 robert nancy susan christopher mary helen charles -0.05 thomas dorothy william maria barbara donald patricia betsy elizabeth margaret joseph -0.10 tricia -0.1 0.0 0.1 0.2 Spectral Learning for NLP 61

CCA has Nice Properties for Computing Eigenwords ◮ When computing the SVD of a word × context matrix (as above) we need to decide how to scale the counts ◮ Using raw counts gives more emphasis to common words ◮ Better: rescale ◮ Divide each row by the square root of the total count of the word in that row ◮ Rescale the columns to account for the redundancy ◮ CCA between words and their contexts does this automatically and optimally ◮ CCA ’whitens’ the word-context covariance matrix Spectral Learning for NLP 62

Semi-supervised Learning Problems ◮ Sequence labeleing (Named Entity Recognition, POS, WSD...) ◮ X = target word ◮ Z = context of the target word ◮ label = person / place / organization ... ◮ Topic identification ◮ X = words in title ◮ Z = words in abstract ◮ label = topic category ◮ Speaker identification: ◮ X = video ◮ Z = audio ◮ label = which character is speaking Spectral Learning for NLP 63

Semi-supervised Learning using CCA ◮ Find CCA between X and Z ◮ Recall: CCA finds projection matrices A and B such that = A ⊤ = B ⊤ x x z z �� d × 1 d ′ × 1 k × d k × d ′ k × 1 k × 1 ◮ Project X and Z to estimate hidden state: ( x , z ) ◮ Note: if x is the word and z is its context, then A is the matrix of eigenwords, x is the (context oblivious) eigenword corresponding to work x , and z gives a context-sensitive “eigentoken” ◮ Use supervised learning to predict label from hidden state ◮ and from hidden state of neighboring words Spectral Learning for NLP 64

Theory: CCA has Nice Properties ◮ If one uses CCA to map from target word and context (two views, X and Z ) to reduced dimension hidden state and then uses that hidden state as features in a linear regression to predict a y , then we have provably almost as good a fit in the reduced dimsion (e.g. 40) as in the original dimension (e.g. million word vocabulary). ◮ In contrast, Principal Components Regression (PCR: regression based on PCA, which does not “whiten” the covariance matrix) can miss all the signal [Foster and Kakade, ’06] Spectral Learning for NLP 65

Semi-supervised Results ◮ Find spectral features on unlabeled data ◮ RCV-1 corpus: Newswire ◮ 63 million tokens in 3.3 million sentences. ◮ Vocabulary size: 300k ◮ Size of embeddings: k = 50 ◮ Use in discriminative model ◮ CRF for NER ◮ Averaged perceptron for chunking ◮ Compare against state-of-the-art embeddings ◮ C&W, HLBL, Brown, ASO and Semi-Sup CRF ◮ Baseline features based on identity of word and its neighbors ◮ Benefit ◮ Named Entity Recognition (NER): 8% error reduction ◮ Chunking: 29% error reduction ◮ Add spectral features to discriminative parser: 2.6% error reduction Spectral Learning for NLP 66

Section Summary ◮ Eigenwords found using thin SVD between words and context ◮ capture distributional similarity ◮ contain POS and semantic information about words ◮ perform competitively to a wide range of other embeddings ◮ CCA version provides provable guarantees when used as features in supervised learning ◮ Next: eigenwords form the basis for fast estimation of HMMs and parse trees Spectral Learning for NLP 67

A Spectral Learning Algorithm for HMMs ◮ Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012) ◮ Algorithm relies on singular value decomposition followed by very simple matrix operations ◮ Close connections to CCA ◮ Under assumptions on singular values arising from the model, has PAC-learning style guarantees (contrast with EM, which has problems with local optima) ◮ It is a very different algorithm from EM Spectral Learning for NLP 68

Hidden Markov Models (HMMs) H 1 H 2 H 3 H 4 dog saw the him p ( the dog saw him , 1 2 1 3 ) � �� h 1 ...h 4 x 1 ...x 4 = π (1) × t (2 | 1) × t (1 | 2) × t (3 | 1) Spectral Learning for NLP 69

Hidden Markov Models (HMMs) H 1 H 2 H 3 H 4 dog saw the him p ( the dog saw him , 1 2 1 3 ) � �� h 1 ...h 4 x 1 ...x 4 = π (1) × t (2 | 1) × t (1 | 2) × t (3 | 1) × o ( the | 1) × o ( dog | 2) × o ( saw | 1) × o ( him | 3) ◮ Initial parameters: π ( h ) for each latent state h ◮ Transition parameters: t ( h ′ | h ) for each pair of states h ′ , h ◮ Observation parameters: o ( x | h ) for each state h , obs. x Spectral Learning for NLP 69

Hidden Markov Models (HMMs) H 1 H 2 H 3 H 4 dog saw the him Throughout this section: ◮ We use m to refer to the number of hidden states ◮ We use n to refer to the number of possible words (observations) ◮ Typically, m ≪ n (e.g., m = 20 , n = 50 , 000 ) Spectral Learning for NLP 70

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: f 0 h = π ( h ) Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: � f 0 f 1 t ( h | h ′ ) o ( the | h ′ ) f 0 h = π ( h ) h = h ′ h ′ Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: � f 0 f 1 t ( h | h ′ ) o ( the | h ′ ) f 0 h = π ( h ) h = h ′ h ′ � f 2 t ( h | h ′ ) o ( dog | h ′ ) f 1 h = h ′ h ′ Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: � f 0 f 1 t ( h | h ′ ) o ( the | h ′ ) f 0 h = π ( h ) h = h ′ h ′ � � f 2 t ( h | h ′ ) o ( dog | h ′ ) f 1 f 3 t ( h | h ′ ) o ( saw | h ′ ) f 2 h = h = h ′ h ′ h ′ h ′ Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: � f 0 f 1 t ( h | h ′ ) o ( the | h ′ ) f 0 h = π ( h ) h = h ′ h ′ � � f 2 t ( h | h ′ ) o ( dog | h ′ ) f 1 f 3 t ( h | h ′ ) o ( saw | h ′ ) f 2 h = h = h ′ h ′ h ′ h ′ � f 4 t ( h | h ′ ) o ( him | h ′ ) f 3 h = h ′ h ′ Spectral Learning for NLP 71

HMMs: the forward algorithm H 1 H 2 H 3 H 4 dog saw the him � p ( the dog saw him ) = p ( the dog saw him , h 1 h 2 h 3 h 4 ) h 1 ,h 2 ,h 3 ,h 4 The forward algorithm: � f 0 f 1 t ( h | h ′ ) o ( the | h ′ ) f 0 h = π ( h ) h = h ′ h ′ � � f 2 t ( h | h ′ ) o ( dog | h ′ ) f 1 f 3 t ( h | h ′ ) o ( saw | h ′ ) f 2 h = h = h ′ h ′ h ′ h ′ � � f 4 t ( h | h ′ ) o ( him | h ′ ) f 3 f 4 h = p ( . . . ) = h ′ h h ′ h Spectral Learning for NLP 71

HMMs: the forward algorithm in matrix form H 1 H 2 H 3 H 4 dog saw the him Spectral Learning for NLP 72

HMMs: the forward algorithm in matrix form H 1 H 2 H 3 H 4 dog saw the him ◮ For each word x , define the matrix A x ∈ R m × m as [ A x ] h ′ ,h = t ( h ′ | h ) o ( x | h ) Spectral Learning for NLP 72

HMMs: the forward algorithm in matrix form H 1 H 2 H 3 H 4 dog saw the him ◮ For each word x , define the matrix A x ∈ R m × m as [ A x ] h ′ ,h = t ( h ′ | h ) o ( x | h ) e.g., [ A the ] h ′ ,h = t ( h ′ | h ) o ( the | h ) Spectral Learning for NLP 72

HMMs: the forward algorithm in matrix form H 1 H 2 H 3 H 4 dog saw the him ◮ For each word x , define the matrix A x ∈ R m × m as [ A x ] h ′ ,h = t ( h ′ | h ) o ( x | h ) e.g., [ A the ] h ′ ,h = t ( h ′ | h ) o ( the | h ) ◮ Define π as vector with elements π h , 1 as vector of all ones Spectral Learning for NLP 72

HMMs: the forward algorithm in matrix form H 1 H 2 H 3 H 4 dog saw the him ◮ For each word x , define the matrix A x ∈ R m × m as [ A x ] h ′ ,h = t ( h ′ | h ) o ( x | h ) e.g., [ A the ] h ′ ,h = t ( h ′ | h ) o ( the | h ) ◮ Define π as vector with elements π h , 1 as vector of all ones ◮ Then p ( the dog saw him ) = 1 ⊤ × A him × A saw × A dog × A the × π Forward algorithm through matrix multiplication! Spectral Learning for NLP 72

The Spectral Algorithm: definitions H 1 H 2 H 3 H 4 dog saw the him Define the following matrix P 2 , 1 ∈ R n × n : [ P 2 , 1 ] i,j = P ( X 2 = i, X 1 = j ) Easy to derive an estimate: P 2 , 1 ] i,j = Count ( X 2 = i, X 1 = j ) [ ˆ N Spectral Learning for NLP 73

The Spectral Algorithm: definitions H 1 H 2 H 3 H 4 dog saw the him For each word x , define the following matrix P 3 ,x, 1 ∈ R n × n : [ P 3 ,x, 1 ] i,j = P ( X 3 = i, X 2 = x, X 1 = j ) Easy to derive an estimate, e.g.,: P 3 , dog , 1 ] i,j = Count ( X 3 = i, X 2 = dog , X 1 = j ) [ ˆ N Spectral Learning for NLP 74

Main Result Underlying the Spectral Algorithm ◮ Define the following matrix P 2 , 1 ∈ R n × n : [ P 2 , 1 ] i,j = P ( X 2 = i, X 1 = j ) ◮ For each word x , define the following matrix P 3 ,x, 1 ∈ R n × n : [ P 3 ,x, 1 ] i,j = P ( X 3 = i, X 2 = x, X 1 = j ) Spectral Learning for NLP 75

Main Result Underlying the Spectral Algorithm ◮ Define the following matrix P 2 , 1 ∈ R n × n : [ P 2 , 1 ] i,j = P ( X 2 = i, X 1 = j ) ◮ For each word x , define the following matrix P 3 ,x, 1 ∈ R n × n : [ P 3 ,x, 1 ] i,j = P ( X 3 = i, X 2 = x, X 1 = j ) ◮ SVD ( P 2 , 1 ) ⇒ U ∈ R n × m , Σ ∈ R m × m , V ∈ R n × m Spectral Learning for NLP 75

Main Result Underlying the Spectral Algorithm ◮ Define the following matrix P 2 , 1 ∈ R n × n : [ P 2 , 1 ] i,j = P ( X 2 = i, X 1 = j ) ◮ For each word x , define the following matrix P 3 ,x, 1 ∈ R n × n : [ P 3 ,x, 1 ] i,j = P ( X 3 = i, X 2 = x, X 1 = j ) ◮ SVD ( P 2 , 1 ) ⇒ U ∈ R n × m , Σ ∈ R m × m , V ∈ R n × m ◮ Definition: B x = U ⊤ × P 3 ,x, 1 × V × Σ − 1 �� m × m m × m Spectral Learning for NLP 75

Main Result Underlying the Spectral Algorithm ◮ Define the following matrix P 2 , 1 ∈ R n × n : [ P 2 , 1 ] i,j = P ( X 2 = i, X 1 = j ) ◮ For each word x , define the following matrix P 3 ,x, 1 ∈ R n × n : [ P 3 ,x, 1 ] i,j = P ( X 3 = i, X 2 = x, X 1 = j ) ◮ SVD ( P 2 , 1 ) ⇒ U ∈ R n × m , Σ ∈ R m × m , V ∈ R n × m ◮ Definition: B x = U ⊤ × P 3 ,x, 1 × V × Σ − 1 �� m × m m × m ◮ Theorem: if P 2 , 1 is of rank m , then B x = GA x G − 1 where G ∈ R m × m is invertible Spectral Learning for NLP 75

Why does this matter? ◮ Theorem: if P 2 , 1 is of rank m , then B x = GA x G − 1 where G ∈ R m × m is invertible ◮ Recall p ( the dog saw him ) = 1 ⊤ A him A saw A dog A the π . Forward algorithm through matrix multiplication! Spectral Learning for NLP 76

Why does this matter? ◮ Theorem: if P 2 , 1 is of rank m , then B x = GA x G − 1 where G ∈ R m × m is invertible ◮ Recall p ( the dog saw him ) = 1 ⊤ A him A saw A dog A the π . Forward algorithm through matrix multiplication! ◮ Now note that B him × B saw × B dog × B the Spectral Learning for NLP 76

Spectral Learning Algorithms for Natural Language Processing Shay - PowerPoint PPT Presentation

Spectral Learning Algorithms for Natural Language Processing Shay Cohen 1 , Michael Collins 1 , Dean Foster 2 , Karl Stratos 1 and Lyle Ungar 2 1 Columbia University 2 University of Pennsylvania June 10, 2013 Spectral Learning for NLP 1

Spectral Clustering Spectral Clustering? Spectral methods Methods using eigenvectors of

An Introduction to Spectral Learning Hanxiao Liu November 8, 2013 An Introduction to Spectral

Natural Language Understanding We want to communicate with computers using natural language

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Information Extraction Industrial Natural Language Processing Industrial Natural Language

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Paula

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Outline Language learning Computers Computers Computers Topic 6: CALL Topic 6: CALL Topic 6:

Machine Learning Algorithms for Classification Machine Learning Algorithms for Classification

Natural Language Processing 1 Lecture 11: Language generation and summarisation Katia Shutova

Natural Language Processing 1 Lecture 10: Language generation and summarisation Katia Shutova

Scikit Spectral Learning (SpLearn): a toolbox for the spectral learning of weighted automata Denis

10Hz Spectral Lines Joschua Dilly 10Hz Spectral Lines 2 Introduction Ions 50cm Protons 30cm

AIRS In-flight Spectral Calibration Steve Gaiser 1 Steve Gaiser, AIRS in-orbit spectral

The American Legion Auxiliary Department of Virginia Spring Conference 2020 In the Sp irit of

Preparation for Energy Independence Summit Capitol Hill Day Nissan North America, UPS Cummins

What Clean Cities Coalitions Can Do to Ensure a Clean Transportation Future Nissan North America,

AGA Pilot Committee 1 September 20, 2017 Our mission is to provide all students with an

Track 1 Paper: Good Usability Practices in Scientific Software Development Francisco Queiroz 1

FDRs Use Of The Radio Ray Anaya Daisy Zho Yuan Hsiao Shendy Kurnia Background

Learning Objectives Provide participants with an introduction on the fundamentals of grading

1) Introduction: goal *best resolution during all times: (few failures, stable)

Sambuz

Useful Links

Newsletter

Mail Us