Tensor Methods for Feature Learning
Anima Anandkumar, U.C. Irvine
Feature Learning For Efficient Classification
Find good transformations of input for improved classification
Figures used are attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.
Principles Behind Feature Learning
Classification/regression tasks: Predict y given x.
Find feature transform φ(x) to better predict y.
Feature learning: Learn φ(·) from data.
Learning φ(x) from Labeled vs. Unlabeled Samples
Labeled samples {x_i, y_i} and unlabeled samples {x_i}.
Labeled samples should lead to better feature learning φ(·) but are harder to obtain.
Learn features φ(x) through latent variables related to x, y.
Conditional Latent Variable Models: Two Cases
Multi-layer Neural Networks
E[y|x] = σ(A_d σ(A_{d−1} σ(· · · A₂ σ(A₁ x))))
Mixture of Classifiers or GLMs
G(x) := E[y|x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩)
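To make the two cases concrete, here is a minimal numpy sketch of one forward pass through each model; the dimensions, the two-layer depth, and the choice σ = tanh are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 4, 3            # input dim, hidden width, number of mixture components (assumed)
sigma = np.tanh              # illustrative choice of nonlinearity

x = rng.normal(size=d)

# Case 1: multi-layer neural network, E[y|x] = sigma(A2 sigma(A1 x))  (two layers for brevity)
A1 = rng.normal(size=(k, d))
A2 = rng.normal(size=(1, k))
Ey_given_x = sigma(A2 @ sigma(A1 @ x))

# Case 2: mixture of classifiers / GLMs, E[y|x, h] = sigma(<U h, x> + <b, h>)
U = rng.normal(size=(d, r))             # columns u_1, ..., u_r: per-component weight vectors
b = rng.normal(size=r)                  # per-component biases
h = np.eye(r)[rng.integers(r)]          # hidden choice variable h in {e_1, ..., e_r}
Ey_given_xh = sigma((U @ h) @ x + b @ h)

print(Ey_given_x, Ey_given_xh)
```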
Challenges in Learning LVMs
Challenge: Identifiability Conditions
When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Computational Challenges
Maximum likelihood is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample Complexity
Sample complexity needs to be low in the high-dimensional regime.
Guaranteed and efficient learning through tensor methods
Outline
1. Introduction
2. Spectral and Tensor Methods
3. Generative Models for Feature Learning
4. Proposed Framework
5. Conclusion
Classical Spectral Methods: Matrix PCA and CCA
Unsupervised Setting: PCA
For centered samples {x_i}, find projection P with Rank(P) = k s.t.
min_P (1/n) ∑_{i∈[n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of S = Cov(X).
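A minimal numpy sketch of the computation above: the rank-k projection minimizing the reconstruction error comes from the top-k eigenvectors of the sample covariance (the synthetic data and dimensions are placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3                      # sample size, dimension, target rank (assumed)
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))

Xc = X - X.mean(axis=0)                   # center the samples
S = Xc.T @ Xc / n                         # sample covariance Cov(X)
eigvals, eigvecs = np.linalg.eigh(S)      # eigen-decomposition (ascending eigenvalues)
Uk = eigvecs[:, -k:]                      # top-k eigenvectors
P = Uk @ Uk.T                             # rank-k projection minimizing sum ||x_i - P x_i||^2

reconstruction_error = np.mean(np.sum((Xc - Xc @ P) ** 2, axis=1))
print(reconstruction_error)
```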
Supervised Setting: CCA
For centered samples {x_i, y_i}, find
max_{a,b}  a⊤ Ê[x y⊤] b / √( a⊤ Ê[x x⊤] a · b⊤ Ê[y y⊤] b ).
Result: generalized eigen-decomposition; the learned projections are ⟨a, x⟩ and ⟨b, y⟩.
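A corresponding sketch for CCA. Solving the whitened SVD below is equivalent to the generalized eigen-decomposition on the slide; the synthetic data and the small ridge term ε are assumptions added for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dy, eps = 500, 8, 5, 1e-6          # sizes and a small ridge term (assumed)
Z = rng.normal(size=(n, 3))               # shared latent signal so X and Y are correlated
X = Z @ rng.normal(size=(3, dx)) + 0.1 * rng.normal(size=(n, dx))
Y = Z @ rng.normal(size=(3, dy)) + 0.1 * rng.normal(size=(n, dy))

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / n + eps * np.eye(dx)
Cyy = Yc.T @ Yc / n + eps * np.eye(dy)
Cxy = Xc.T @ Yc / n

# Whiten, then SVD: equivalent to the generalized eigen-decomposition formulation.
Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T   # satisfies Wx.T @ Cxx @ Wx = I
Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
a, b = Wx @ U[:, 0], Wy @ Vt[0]                 # leading canonical directions
top_correlation = s[0]                          # maximized correlation between <a,x> and <b,y>
print(top_correlation)
```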
Beyond SVD: Spectral Methods on Tensors
How to learn the mixture models without separation constraints?
◮ PCA uses covariance matrix of data. Are higher order moments helpful?
Unified framework?
◮ Moment-based estimation of probabilistic latent variable models?
SVD gives spectral decomposition of matrices.
◮ What are the analogues for tensors?
Moment Matrices and Tensors
Multivariate Moments
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
Matrix
E[x ⊗ x] ∈ ℝ^{d×d} is a second-order tensor. E[x ⊗ x]_{i₁,i₂} = E[x_{i₁} x_{i₂}]. For matrices: E[x ⊗ x] = E[x x⊤].
Tensor
E[x ⊗ x ⊗ x] ∈ ℝ^{d×d×d} is a third-order tensor. E[x ⊗ x ⊗ x]_{i₁,i₂,i₃} = E[x_{i₁} x_{i₂} x_{i₃}].
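The empirical counterparts of these moments are easy to form with einsum; a short sketch on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))

M1 = X.mean(axis=0)                              # E[x]            in R^d
M2 = np.einsum('ni,nj->ij', X, X) / n            # E[x ⊗ x]        in R^{d×d}
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n     # E[x ⊗ x ⊗ x]    in R^{d×d×d}

# Entry-wise meaning: M3[i1, i2, i3] ≈ E[x_{i1} x_{i2} x_{i3}]
assert np.allclose(M2, X.T @ X / n)              # for matrices, E[x ⊗ x] = E[x x^T]
```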
Spectral Decomposition of Tensors
M₂ = ∑_i λ_i u_i ⊗ v_i = λ₁ u₁ ⊗ v₁ + λ₂ u₂ ⊗ v₂ + · · ·
M₃ = ∑_i λ_i u_i ⊗ v_i ⊗ w_i = λ₁ u₁ ⊗ v₁ ⊗ w₁ + λ₂ u₂ ⊗ v₂ ⊗ w₂ + · · ·
u ⊗ v ⊗ w is a rank-1 tensor since its (i₁, i₂, i₃)-th entry is u_{i₁} v_{i₂} w_{i₃}. Guaranteed recovery (Anandkumar et al. 2012, Zhang & Golub 2001).
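For intuition, a minimal sketch of the symmetric tensor power method with deflation (in the spirit of the guaranteed-recovery results cited above), applied to a tensor built from known orthogonal components; dimensions, iteration counts, and restarts are arbitrary choices.

```python
import numpy as np

def tensor_apply(T, v):
    """Contract a symmetric 3rd-order tensor with v twice: T(I, v, v)."""
    return np.einsum('ijk,j,k->i', T, v, v)

def tensor_power_method(T, n_components, n_iters=100, n_restarts=10, seed=0):
    """Recover (lambda_i, u_i) from T = sum_i lambda_i u_i⊗u_i⊗u_i with orthogonal u_i."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    lams, us = [], []
    for _ in range(n_components):
        best_lam, best_u = -np.inf, None
        for _ in range(n_restarts):               # random restarts to find the top component
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):              # power iteration: u <- T(I, u, u) / ||.||
                u = tensor_apply(T, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)
            if lam > best_lam:
                best_lam, best_u = lam, u
        lams.append(best_lam)
        us.append(best_u)
        T = T - best_lam * np.einsum('i,j,k->ijk', best_u, best_u, best_u)   # deflate
    return np.array(lams), np.array(us)

# Build a tensor with known orthogonal components and check recovery (up to sign).
d, k = 6, 3
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))      # orthonormal columns u_1, ..., u_k
lam_true = np.array([3.0, 2.0, 1.0])
T = sum(l * np.einsum('i,j,k->ijk', u, u, u) for l, u in zip(lam_true, U.T))
lam_est, U_est = tensor_power_method(T, k)
print(lam_est)
```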
Moment Tensors for Conditional Models
Multivariate Moments: Many possibilities...
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y] . . . .
Feature Transformations of the Input: x → φ(x)
How to exploit them? Are moments E[φ(x) ⊗ y] useful?
If φ(x) is a matrix/tensor, we have matrix/tensor moments.
Can carry out spectral decomposition of the moments.
Construct φ(x) based on the input distribution?
Outline
1. Introduction
2. Spectral and Tensor Methods
3. Generative Models for Feature Learning
4. Proposed Framework
5. Conclusion
Score Function of Input Distribution
Score function S(x) := −∇ log p(x)
[Figure: (a) 1-d PDF, p(x) = (1/Z) exp(−E(x)); (b) 1-d score, ∂/∂x log p(x) = −∂/∂x E(x); (c) 2-d score. Figures from Alain and Bengio 2014.]
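The score needs no normalizing constant: for p(x) = (1/Z) exp(−E(x)) in one dimension, ∂/∂x log p(x) = −E′(x) and Z drops out. A tiny sketch with an assumed double-well energy:

```python
import numpy as np

# Unnormalized 1-d model p(x) ∝ exp(-E(x)); the partition function Z never appears.
E  = lambda x: (x**2 - 1.0)**2                 # assumed double-well energy, for illustration
dE = lambda x: 4.0 * x * (x**2 - 1.0)          # E'(x)

grad_logp = lambda x: -dE(x)                   # d/dx log p(x) = -E'(x)
S = lambda x: -grad_logp(x)                    # score function S(x) := -∇ log p(x) = E'(x)

# Finite-difference check: the gradient of log p is computable from -E alone (Z cancels).
xs, eps = np.linspace(-2, 2, 9), 1e-6
fd = ((-E(xs + eps)) - (-E(xs - eps))) / (2 * eps)
assert np.allclose(grad_logp(xs), fd, atol=1e-5)
```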
Why Score Function Features?
S(x) := −∇ log p(x)
Utilizes generative models for input.
Can be learnt from unlabeled data.
Score matching methods work for non-normalized models.
Approximation of score function using denoising auto-encoders
∇ log p(x) ≈ (r*(x + n) − x) / σ², where r* is the optimal denoising auto-encoder and n is the corruption noise with variance σ² (sketch below).
Recall our goal: construct moments E[y ⊗ φ(x)]
Beyond vector features?
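Before moving to matrix- and tensor-valued features, here is a minimal sketch of the denoising-auto-encoder approximation above, using a Gaussian toy density for which the optimal denoiser r* is available in closed form; the noise level σ and the data model are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.1                           # noise level σ (assumed)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sig = A @ A.T + np.eye(d)                   # toy density p(x) = N(mu, Sig)

def r_star(x_noisy):
    """Optimal denoiser E[x | x + n] for Gaussian data and Gaussian noise (closed form)."""
    C = Sig + sigma**2 * np.eye(d)
    return mu + np.linalg.solve(C, (x_noisy - mu).T).T @ Sig   # Sig is symmetric

x = rng.multivariate_normal(mu, Sig)

# The slide's estimator (r*(x + n) - x) / σ², averaged over many noise draws n ~ N(0, σ² I).
N = sigma * rng.normal(size=(200_000, d))
score_dae = ((r_star(x + N) - x) / sigma**2).mean(axis=0)

score_true = -np.linalg.solve(Sig, x - mu)  # exact ∇ log p(x) for the Gaussian
# score_dae ≈ score_true up to the smoothing bias (small for small σ) and Monte-Carlo error.
print(score_dae, score_true)
```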
Matrix and Tensor-valued Features
Higher order score functions
S_m(x) := (−1)^m ∇^(m) p(x) / p(x)
Can be a matrix or a tensor instead of a vector.
Can be used to construct matrix and tensor moments E[y ⊗ φ(x)].
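For a Gaussian input, the first two score functions in this hierarchy have closed forms, which shows how S₂(x) becomes matrix-valued; the finite-difference comparison is only a sanity check of the definition S_m(x) = (−1)^m ∇^(m) p(x) / p(x), with arbitrary parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)
Sinv = np.linalg.inv(Sigma)

def S1(x):                                   # vector-valued score: Σ^{-1}(x - μ)
    return Sinv @ (x - mu)

def S2(x):                                   # matrix-valued score: Σ^{-1}(x-μ)(x-μ)^T Σ^{-1} - Σ^{-1}
    v = Sinv @ (x - mu)
    return np.outer(v, v) - Sinv

# Sanity check of S2 = ∇^(2) p(x) / p(x) by finite differences of the density p.
x, eps = rng.normal(size=d), 1e-4
p = lambda z: multivariate_normal.pdf(z, mean=mu, cov=Sigma)
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
        H[i, j] = (p(x+ei+ej) - p(x+ei-ej) - p(x-ei+ej) + p(x-ei-ej)) / (4 * eps**2)
assert np.allclose(S2(x), H / p(x), atol=1e-3)
```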
Outline
1. Introduction
2. Spectral and Tensor Methods
3. Generative Models for Feature Learning
4. Proposed Framework
5. Conclusion
Operations on Score Function Features
Form the cross-moments: E[y ⊗ S_m(x)].
Our result
E[y ⊗ S_m(x)] = E[∇^(m) G(x)], where G(x) := E[y|x]. Extension of Stein’s lemma.
Extract discriminative directions through spectral decomposition
E[y ⊗ S_m(x)] = E[∇^(m) G(x)] = ∑_{j∈[k]} λ_j · u_j ⊗ u_j ⊗ · · · ⊗ u_j (m times).
Construct σ(u_j⊤ x) for some nonlinearity σ.
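A minimal synthetic sketch of the second-order (m = 2) case: with x ~ N(0, I) the score is S₂(x) = x x⊤ − I, the cross-moment is a matrix, and its eigenvectors give the discriminative directions. The softplus link, the component weights, and all sizes are assumptions; distinct weights keep the matrix eigen-decomposition identifiable (handling equal weights is precisely what the third-order tensor is for).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 3, 200_000                      # dimensions and sample size (assumed)
softplus = lambda t: np.log1p(np.exp(t))

# Ground-truth discriminative directions u_j (orthonormal for simplicity) and weights c_j.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
c = np.array([3.0, 2.0, 1.0])

# Labeled data: x ~ N(0, I), so the second-order score is S2(x) = x x^T - I.
X = rng.normal(size=(n, d))
y = softplus(X @ U) @ c + 0.1 * rng.normal(size=n)

# Cross-moment E[y ⊗ S2(x)]; by the Stein-type identity it equals E[∇^(2) G(x)] = Σ_j λ_j u_j u_j^T.
M2 = (X * y[:, None]).T @ X / n - y.mean() * np.eye(d)

# Spectral decomposition extracts the discriminative directions (up to sign).
eigvals, eigvecs = np.linalg.eigh(M2)
U_hat = eigvecs[:, -k:][:, ::-1]             # top-k eigenvectors ≈ ±u_j
features = softplus(X @ U_hat)               # learned features σ(u_j^T x)
print(np.abs(U_hat.T @ U))                   # ≈ identity up to permutation/sign
```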
Automated Extraction of Discriminative Features
Learning Mixtures of Classifiers/GLMs
A mixture of r classifiers, hidden choice variable h ∈ {e₁, . . . , e_r}. E[y|x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩)
∗ U = [u₁ | u₂ | · · · | u_r] are the weight vectors of the GLMs. ∗ b is the vector of biases.
M₃ = E[y · S₃(x)] = ∑_{i∈[r]} λ_i · u_i ⊗ u_i ⊗ u_i.
First results for learning non-linear mixtures using spectral methods.
Learning Multi-layer Neural Networks
E[y|x] = ⟨a₂, σ(A₁⊤ x)⟩.
Our result
M₃ = E[y · S₃(x)] = ∑_{i∈[r]} λ_i · A_{1,i} ⊗ A_{1,i} ⊗ A_{1,i}, where A_{1,i} is the i-th column of A₁.
Framework Applied to MNIST
Unlabeled data {x_i} → auto-encoder → r(x).
Compute score function S_m(x) using r(x).
Labeled data {(x_i, y_i)}: form cross-moment E[y · S_m(x)].
Spectral/tensor method → u_j’s.
Train SVM with features σ(⟨x_i, u_j⟩).
Tensor T = u₁^⊗3 + u₂^⊗3 + · · ·
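A heavily simplified end-to-end sketch of this pipeline on synthetic stand-in data: a Gaussian density fit plays the role of the auto-encoder r(x), the label model and all sizes are invented, and scikit-learn's LinearSVC stands in for the SVM. It only illustrates the order of the steps, not the MNIST experiment itself.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, k, n_unlab, n_lab = 10, 2, 50_000, 20_000
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Synthetic stand-in for MNIST: two hidden directions drive the (binary) label.
U_true, _ = np.linalg.qr(rng.normal(size=(d, k)))
c = np.array([2.0, 1.0])
sample_x = lambda m: rng.normal(size=(m, d))
sample_y = lambda X: (sigmoid(np.log1p(np.exp(X @ U_true)) @ c - 1.5)
                      > rng.uniform(size=len(X))).astype(float)

# 1. Unlabeled data -> density model (Gaussian fit as a stand-in for the auto-encoder r(x)).
X_u = sample_x(n_unlab)
mu_hat = X_u.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X_u, rowvar=False))

# 2. Score from the fitted model: S2(x) = Sinv (x-mu)(x-mu)^T Sinv - Sinv.
# 3. Labeled data -> cross-moment E[y · S2(x)].
X_l = sample_x(n_lab)
y_l = sample_y(X_l)
V = (X_l - mu_hat) @ Sinv
M2 = (V * y_l[:, None]).T @ V / n_lab - y_l.mean() * Sinv

# 4. Spectral method -> directions u_j spanning the discriminative subspace.
eigvals, eigvecs = np.linalg.eigh(M2)
U_hat = eigvecs[:, np.argsort(-np.abs(eigvals))[:k]]

# 5. Train an SVM on the features sigma(<x, u_j>).
feats = sigmoid(X_l @ U_hat)
clf = LinearSVC(max_iter=10_000).fit(feats, y_l)
print("train accuracy on score-derived features:", clf.score(feats, y_l))
```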