Learning Sentence Embeddings through Tensor Methods


  1. Learning Sentence Embeddings through Tensor Methods. Anima Anandkumar, joint work with Dr. Furong Huang. ACL Workshop 2016.

  2. Representations for Text Understanding
  Word embedding examples: "tree", "soccer" / "football". Word-sequence (sentence) examples: "The weather is good."; "Her life spanned years of incredible change for women."; "Mary lived through an era of liberating reform for women."
  Word embeddings: incorporate short-range relationships; easy to train. Sentence embeddings: incorporate long-range relationships; hard to train.

  3. Various Frameworks for Sentence Embeddings
  Compositional Models (M. Iyyer et al. '15, T. Kenter '16): composition of word embedding vectors, usually simple averaging; the compositional operator (averaging weights) is based on neural nets; weakly supervised (only averaging weights trained from labels) or strongly supervised (joint training).
  Paragraph Vector (Q. V. Le & T. Mikolov '14): augmented representation of paragraph + word embeddings; supervised framework to train the paragraph vector.
  For both frameworks. Pros: simple and cheap to train; can use existing word embeddings. Cons: word order not incorporated; supervised; not universal.

  4. Skip-Thought Vectors for Sentence Embeddings. Learn a sentence embedding based on the joint probability of words, represented using an RNN.

  5. Skip-Thought Vectors for Sentence Embeddings. Learn a sentence embedding based on the joint probability of words, represented using an RNN. Pros: incorporates word order; unsupervised; universal. Cons: requires contiguous long text, lots of data, and slow training time; cannot use domain-specific training. R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, "Skip-Thought Vectors," NIPS 2015.

  6. Convolutional Models for Sentence Embeddings (N. Kalchbrenner, E. Grefenstette, P. Blunsom '14). [Figure: word encoding of a sample sentence (preserving word order) convolved with filters to produce activation maps, followed by max-k pooling and a label output.]

  7. Convolutional Models for Sentence Embeddings (N. Kalchbrenner, E. Grefenstette, P. Blunsom '14). [Figure: same convolutional architecture as the previous slide.] Pros: incorporates word order; detects polysemy. Cons: supervised training; not universal.

  8. Convolutional Models for Sentence Embeddings (F. Huang & A. Anandkumar '15). [Figure: word encoding of a sample sentence expressed as a sum of phrase-template filters convolved with activation maps, preserving word order, with max-k pooling and a label output.]

  9. Convolutional Models for Sentence Embeddings (F. Huang & A. Anandkumar '15). [Figure: same architecture as the previous slide.] Pros: word order, polysemy, unsupervised, universal. Cons: difficulty in training.

  10. Intuition behind the Convolutional Model. Shift invariance is natural in images: image templates appear at different locations. [Figure: dictionary elements and an image composed from them.]

  11. Intuition behind the Convolutional Model. Shift invariance is natural in images: image templates appear at different locations. [Figure: dictionary elements and an image.] Shift invariance in language: phrase templates appear in different parts of a sentence.

  12. Learning Convolutional Dictionary Models
  Input x, phrase templates (filters) f_1, f_2, activations w_1, w_2: x = f_1 ∗ w_1 + f_2 ∗ w_2.

  13. Learning Convolutional Dictionary Models
  Input x, phrase templates (filters) f_1, f_2, activations w_1, w_2: x = f_1 ∗ w_1 + f_2 ∗ w_2.
  Training objective: min_{f_i, w_i} ‖x − Σ_i f_i ∗ w_i‖_2^2.

  14. Learning Convolutional Dictionary Models
  Input x, phrase templates (filters) f_1, f_2, activations w_1, w_2: x = f_1 ∗ w_1 + f_2 ∗ w_2.
  Training objective: min_{f_i, w_i} ‖x − Σ_i f_i ∗ w_i‖_2^2.
  Challenges. Nonconvex optimization: no guaranteed solution in general. Alternating minimization: fix the w_i's to update the f_i's and vice versa; not guaranteed to reach the global optimum (or even a stationary point!). Expensive in the large-sample regime: needs repeated updating of the w_i's.
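
To make the alternating-minimization baseline above concrete, here is a minimal numpy sketch (not from the talk): it assumes 1-D inputs, circular convolutions, and filters and activations of full signal length, and it alternates two least-squares solves; all function and variable names are illustrative.

```python
import numpy as np

def circulant(v):
    """n x n circulant matrix whose columns are circular shifts of v,
    so that circulant(f) @ w equals the circular convolution f * w."""
    return np.stack([np.roll(v, s) for s in range(len(v))], axis=1)

def conv_dict_als(x, L=2, iters=20, seed=0):
    """Toy alternating minimization for  min_{f_i, w_i} ||x - sum_i f_i * w_i||_2^2.
    This is the heuristic baseline discussed on the slide, not the tensor method.
    In practice filters are short and activations sparse; here everything is
    full-length to keep the linear algebra simple."""
    rng = np.random.default_rng(seed)
    n = len(x)
    f = rng.standard_normal((L, n))
    w = rng.standard_normal((L, n))
    for _ in range(iters):
        # Fix the w_i's: x ~ [Cir(w_1) ... Cir(w_L)] [f_1; ...; f_L] is linear in the filters.
        A = np.hstack([circulant(w[i]) for i in range(L)])
        f = np.linalg.lstsq(A, x, rcond=None)[0].reshape(L, n)
        # Fix the f_i's: x ~ [Cir(f_1) ... Cir(f_L)] [w_1; ...; w_L] is linear in the activations.
        B = np.hstack([circulant(f[i]) for i in range(L)])
        w = np.linalg.lstsq(B, x, rcond=None)[0].reshape(L, n)
    recon = sum(circulant(f[i]) @ w[i] for i in range(L))
    return f, w, np.linalg.norm(x - recon)
```

Each alternation solves a least-squares problem exactly, but, as the slide notes, there is no guarantee of reaching a global optimum, and every new sample requires re-estimating its activations w_i.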

  15. Convex vs. Non-convex Optimization. Guarantees exist mostly for convex problems, but non-convex is trending! Images taken from https://www.facebook.com/nonconvex

  16. Convex vs. Non-convex Optimization. Convex: unique optimum, global = local. Non-convex: multiple local optima. Are there guaranteed approaches for reaching the global optimum?

  17. Non-convex Optimization in High Dimensions. Critical/stationary points: x such that ∇_x f(x) = 0 (e.g., f(x, y) = x^2 − y^2 has a stationary point at the origin that is a saddle, neither a minimum nor a maximum). Curse of dimensionality: exponential number of critical points (local maxima, saddle points, local minima). Saddle points slow down progress. Lack of stopping criteria for local search methods. Fast escape from saddle points in high dimensions?

  18. Outline: 1. Introduction 2. Why Tensors? 3. Tensor Decomposition Methods 4. Other Applications 5. Conclusion

  19. Example: Discovering Latent Factors. List of scores for students (Alice, Bob, Carol, Dave, Eve) in different tests (Classics, Physics, Music, Math). Learn hidden factors for verbal and mathematical intelligence [C. Spearman 1904]. Score(student, test) = student_verbal-intlg × test_verbal + student_math-intlg × test_math.

  20. Matrix Decomposition: Discovering Latent Factors. [Figure: the students-by-tests score matrix written as a sum of a verbal component and a math component.] Identifying hidden factors influencing the observations, characterized as matrix decomposition.

  21. Matrix Decomposition: Discovering Latent Factors. [Figure: the same score matrix admits more than one decomposition into two rank-one components.] The decomposition is not necessarily unique, and it cannot be overcomplete.
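
For concreteness, a small numpy sketch of the Spearman-style example (the numbers are made up): a scores matrix generated from two latent factors is exactly rank 2, so a truncated SVD reproduces it, but the recovered factors are only determined up to rotation, which is the non-uniqueness noted above.

```python
import numpy as np

# Hypothetical 5 students x 4 tests score matrix (Alice..Eve vs Classics, Physics,
# Music, Math), generated from exactly two latent factors.
verbal_intlg = np.array([0.9, 0.3, 0.7, 0.2, 0.5])   # per-student verbal intelligence
math_intlg   = np.array([0.2, 0.8, 0.4, 0.9, 0.5])   # per-student math intelligence
test_verbal  = np.array([1.0, 0.2, 0.8, 0.1])        # how much each test loads on verbal
test_math    = np.array([0.1, 0.9, 0.2, 1.0])        # how much each test loads on math

scores = np.outer(verbal_intlg, test_verbal) + np.outer(math_intlg, test_math)

# A rank-2 truncated SVD reproduces the matrix exactly ...
U, s, Vt = np.linalg.svd(scores, full_matrices=False)
approx = (U[:, :2] * s[:2]) @ Vt[:2]
print(np.allclose(approx, scores))   # True: the matrix really is rank 2
# ... but the recovered rank-one factors are only unique up to rotation,
# which is exactly the non-uniqueness issue flagged on this slide.
```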

  22. Tensor: Shared Matrix Decomposition. [Figure: two score matrices over the same students (Alice, Bob, Carol, Dave, Eve) and tests (Classics, Physics, Music, Math), one for oral and one for written exams, each decomposed into verbal and math components.] Shared decomposition with different scaling factors; combine the matrix slices as a tensor.

  23. Tensor Decomposition. [Figure: the students × tests × (written/oral) tensor decomposed into verbal and math rank-one components.] Outer product notation: T = u ⊗ v ⊗ w + ũ ⊗ ṽ ⊗ w̃, i.e., T_{i1,i2,i3} = u_{i1} · v_{i2} · w_{i3} + ũ_{i1} · ṽ_{i2} · w̃_{i3}.
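
A quick numpy check of the outer-product notation above (random vectors, purely illustrative):

```python
import numpy as np

d = 4
u, v, w = (np.random.randn(d) for _ in range(3))
u2, v2, w2 = (np.random.randn(d) for _ in range(3))   # stand-ins for ũ, ṽ, w̃

# T = u ⊗ v ⊗ w + ũ ⊗ ṽ ⊗ w̃, built entrywise with einsum.
T = np.einsum('i,j,k->ijk', u, v, w) + np.einsum('i,j,k->ijk', u2, v2, w2)

print(T.shape)                                                        # (4, 4, 4)
print(np.isclose(T[1, 2, 3], u[1]*v[2]*w[3] + u2[1]*v2[2]*w2[3]))     # True
```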

  24. Identifiability under Tensor Decomposition. T = v_1^{⊗3} + v_2^{⊗3} + ···. Uniqueness of tensor decomposition [J. Kruskal 1977]: the above tensor decomposition is unique when the rank-one pairs are linearly independent; in the matrix case, only when the rank-one pairs are orthogonal.

  25. Identifiability under Tensor Decomposition. T = v_1^{⊗3} + v_2^{⊗3} + ···. Uniqueness of tensor decomposition [J. Kruskal 1977]: the decomposition is unique when the rank-one pairs are linearly independent; in the matrix case, only when the rank-one pairs are orthogonal. [Figure: the components λ_1 a_1 and λ_2 a_2 shown under different rotations.]

  27. Moment-based Estimation
  Matrix: pairwise moments. E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices, E[x ⊗ x] = E[x x^T]; M = u u^T is rank-1 and M_{i,j} = u_i u_j.
  Tensor: higher-order moments. E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}]; T = u ⊗ u ⊗ u is rank-1 and T_{i,j,k} = u_i u_j u_k.
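
The empirical counterparts of these moments are easy to form with einsum; a brief sketch (the sample data here is arbitrary):

```python
import numpy as np

N, d = 10000, 5
X = np.random.randn(N, d)          # N samples of a d-dimensional x

M2 = X.T @ X / N                               # estimate of E[x ⊗ x], a d x d matrix
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / N   # estimate of E[x ⊗ x ⊗ x], a d x d x d tensor

print(M2.shape, M3.shape)          # (5, 5) (5, 5, 5)
```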

  28. Moment Forms for Linear Dictionary Models. [Figure: the linear dictionary model, with the observation written as dictionary elements combined with coefficients.]

  29. Moment Forms for Linear Dictionary Models. [Figure: linear dictionary model.] Independent component analysis (ICA): independent coefficients, e.g., Bernoulli-Gaussian; can be relaxed to sparse coefficients with limited dependency. Fourth-order cumulant: M_4 = Σ_{j ∈ [k]} κ_j a_j ⊗ a_j ⊗ a_j ⊗ a_j.
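
As a sketch of how the fourth-order cumulant M_4 can be estimated, the standard formula for zero-mean data is below (my own illustration, not code from the talk); under the ICA model this tensor has the rank-k decomposition written above.

```python
import numpy as np

def fourth_cumulant(X):
    """Empirical 4th-order cumulant of zero-mean samples X (N x d):
    K[i,j,k,l] = E[x_i x_j x_k x_l] - E[x_i x_j]E[x_k x_l]
                 - E[x_i x_k]E[x_j x_l] - E[x_i x_l]E[x_j x_k]."""
    N = X.shape[0]
    M2 = X.T @ X / N
    M4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / N
    return (M4
            - np.einsum('ij,kl->ijkl', M2, M2)
            - np.einsum('ik,jl->ijkl', M2, M2)
            - np.einsum('il,jk->ijkl', M2, M2))

# Gaussian data has zero excess kurtosis, so its cumulant tensor is (nearly) zero;
# for x = A h with independent non-Gaussian h_j it equals sum_j kappa_j a_j^{⊗4}.
print(np.abs(fourth_cumulant(np.random.randn(20000, 4))).max())   # small
```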

  30. Convolutional Dictionary Model. (a) Convolutional model; (b) reformulated model: x = Σ_i f_i ∗ w_i = Σ_i Cir(f_i) w_i = F* w*, where Cir(f_i) is the circulant matrix of filter f_i, F* collects the matrices Cir(f_i), and w* stacks the activations w_i.
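
A quick numpy check of the reformulation (illustrative only): multiplying by the circulant matrix of a short filter reproduces the circular convolution.

```python
import numpy as np

def circulant(f, n):
    """n x n matrix whose columns are circular shifts of f, zero-padded to length n."""
    f_pad = np.concatenate([f, np.zeros(n - len(f))])
    return np.stack([np.roll(f_pad, s) for s in range(n)], axis=1)

n = 8
f = np.array([1.0, -2.0, 0.5])     # a short "phrase template" filter
w = np.random.randn(n)             # its activation

via_matrix = circulant(f, n) @ w                                    # Cir(f) w
via_fft = np.real(np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(w)))    # circular convolution f * w
print(np.allclose(via_matrix, via_fft))   # True
```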

  31. Moment Forms and Optimization. x = Σ_i f_i ∗ w_i = Σ_i Cir(f_i) w_i = F* w*. Assume the coefficients w_i are independent (convolutional ICA model). The cumulant tensor then has a decomposition with components F*_i: M_3 = (F*_1)^{⊗3} + ··· + shift(F*_1)^{⊗3} + (F*_2)^{⊗3} + ··· + shift(F*_2)^{⊗3} + ···. Learning the convolutional model through tensor decomposition.

  32. Outline: 1. Introduction 2. Why Tensors? 3. Tensor Decomposition Methods 4. Other Applications 5. Conclusion

  33. Notion of Tensor Contraction. Extends the notion of a matrix product. Matrix product: Mv = Σ_j v_j M_j (M_j are the columns of M). Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}.
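
Both operations above are plain index contractions, e.g. with numpy's einsum (illustrative):

```python
import numpy as np

d = 5
M = np.random.randn(d, d)
T = np.random.randn(d, d, d)
u, v = np.random.randn(d), np.random.randn(d)

Mv  = np.einsum('ij,j->i', M, v)          # matrix product: sum_j v_j M[:, j]
Tuv = np.einsum('ijk,i,j->k', T, u, v)    # contraction T(u, v, ·) = sum_{i,j} u_i v_j T[i, j, :]

print(np.allclose(Mv, M @ v), Tuv.shape)  # True (5,)
```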

  34. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖_2^2.

  35. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖_2^2. Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's.

  36. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖_2^2. Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's. Tensor unfolding.

  37. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖_2^2. Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's. Tensor unfolding. [Figure: mode-1 unfolding of the tensor.]
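
To illustrate the key observation on the last few slides, here is a minimal numpy sketch of one ALS step (my own illustration, names hypothetical): with the b_i's and c_i's fixed, the a_i's come from a linear least-squares solve against the mode-1 unfolding of T.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column r is kron(B[:, r], C[:, r])."""
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, B.shape[1])

def als_update_A(T, B, C):
    """One ALS step: with B and C fixed, min ||T - sum_r a_r ⊗ b_r ⊗ c_r||_2^2
    is linear in A, so A is recovered by least squares on the unfolded tensor."""
    T1 = T.reshape(T.shape[0], -1)     # mode-1 unfolding: T1 ≈ A @ khatri_rao(B, C).T
    KR = khatri_rao(B, C)
    return np.linalg.lstsq(KR, T1.T, rcond=None)[0].T

# Toy check: for an exactly rank-2 tensor, the update recovers A given the true B and C.
d, R = 6, 2
A, B, C = (np.random.randn(d, R) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
A_hat = als_update_A(T, B, C)
print(np.allclose(np.einsum('ir,jr,kr->ijk', A_hat, B, C), T))   # True
```

Cycling the same update over A, B, and C in turn gives the full ALS iteration, which never increases the objective but, as the earlier slides note, carries no global-optimality guarantee.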
