SLIDE 1 Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods
Anima Anandkumar
Joint work with Majid Janzamin and Hanie Sedghi.
U.C. Irvine
SLIDE 2
Learning with Big Data
Learning is finding a needle in a haystack.
SLIDE 3
Learning with Big Data
Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
SLIDE 4
Learning with Big Data
Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning with big data: statistically and computationally challenging!
SLIDE 5 Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, . . .
Maximum likelihood estimation in probabilistic latent variable models
Supervised Learning
Optimizing a neural network with respect to a loss function
[Diagram: network with input, hidden neurons, and output.]
SLIDE 6 Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg...
Images taken from https://www.facebook.com/nonconvex
SLIDE 7 Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg... The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
SLIDE 8
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
SLIDE 9
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima. In high dimensions, possibly exponentially many local optima.
SLIDE 10
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima. In high dimensions, possibly exponentially many local optima. How do we deal with non-convexity?
SLIDE 11 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 12
Training Neural Networks
Tremendous practical impact with deep learning. Algorithm: backpropagation. Highly non-convex optimization.
SLIDE 13
Toy Example: Failure of Backpropagation
[Figure: labeled input samples in (x1, x2) with y = 1 and y = −1; a one-hidden-layer network with units σ(·) and weights w1, w2.]
Goal: binary classification.
Our method: guaranteed risk bounds for training neural networks
SLIDE 16 Backpropagation vs. Our Method
Weights w2 randomly drawn and fixed.
Backprop (quadratic) loss surface:
[Surface plot of the loss over w1(1) and w1(2).]
SLIDE 17 Backpropagation vs. Our Method
Weights w2 randomly drawn and fixed.
Backprop (quadratic) loss surface:
[Surface plot of the backprop loss over w1(1) and w1(2).]
Loss surface for our method:
[Surface plot of our method's loss over w1(1) and w1(2).]
SLIDE 18
Overcoming Hardness of Training
In general, training a neural network is NP-hard. How does knowledge of the input distribution help?
SLIDE 20 Generative vs. Discriminative Models
[Plots: a generative model p(x, y) and a discriminative model p(y|x) over input data x, for classes y = 0 and y = 1.]
Generative models: encode domain knowledge.
Discriminative models: good classification performance.
A neural network is a discriminative model.
Do generative models help in discriminative tasks?
SLIDE 21
Feature Transformation for Training Neural Networks
Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Diagram: x → φ(x) → y]
SLIDE 22
Feature Transformation for Training Neural Networks
Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Diagram: x → φ(x) → y]
Multivariate Moments: Many possibilities, . . .
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .
SLIDE 23
Tensor Notation for Higher Order Moments
Multivariate higher-order moments form tensors. Are there spectral operations on tensors akin to PCA on matrices?
Matrix
E[x ⊗ y] ∈ Rd×d is a second-order tensor. E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}]. For matrices: E[x ⊗ y] = E[xy⊤].
Tensor
E[x ⊗ x ⊗ y] ∈ Rd×d×d is a third-order tensor. E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}]. In general, E[φ(x) ⊗ y] is a tensor. What class of φ(·) is useful for training neural networks?
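These moment tensors are easy to estimate empirically. The following sketch (added for illustration, with synthetic data and hypothetical sizes) forms the sample versions of E[x ⊗ y] and E[x ⊗ x ⊗ y] with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dy = 1000, 5, 3                 # hypothetical sizes: samples, input dim, label dim
X = rng.normal(size=(n, d))           # inputs x_i in R^d (synthetic)
Y = rng.normal(size=(n, dy))          # labels y_i in R^dy (synthetic)

# Empirical second-order moment E[x ⊗ y] = E[x y^T]: a d x dy matrix.
M2 = np.einsum('ni,nj->ij', X, Y) / n            # shape (d, dy)

# Empirical third-order moment E[x ⊗ x ⊗ y], with entries E[x_{i1} x_{i2} y_{i3}].
M3 = np.einsum('ni,nj,nk->ijk', X, X, Y) / n     # shape (d, d, dy)
```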
SLIDE 24
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 27
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function:
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 28
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 29
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S2(x) ∈ Rd×d]
SLIDE 30
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
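For concreteness, when the input distribution is Gaussian the score functions have closed forms: S1(x) = Σ⁻¹(x − μ), and the higher orders are Hermite-like polynomials. The sketch below (an illustration assuming Gaussian inputs, not part of the original slides) computes S1 and S2 for a general Gaussian and S3 for a standard Gaussian:

```python
import numpy as np

def gaussian_scores(x, mu, Sigma):
    """1st and 2nd order score functions for x ~ N(mu, Sigma).

    S1(x) = -grad log p(x) = Sigma^{-1} (x - mu)
    S2(x) = grad^2 p(x) / p(x) = S1(x) S1(x)^T - Sigma^{-1}
    """
    prec = np.linalg.inv(Sigma)
    s1 = prec @ (x - mu)
    s2 = np.outer(s1, s1) - prec
    return s1, s2

def gaussian_score3_standard(x):
    """3rd-order score for x ~ N(0, I): the 3rd Hermite tensor
    S3(x)_{ijk} = x_i x_j x_k - x_i d_{jk} - x_j d_{ik} - x_k d_{ij}."""
    d = x.shape[0]
    eye = np.eye(d)
    s3 = np.einsum('i,j,k->ijk', x, x, x)
    s3 -= np.einsum('i,jk->ijk', x, eye)
    s3 -= np.einsum('j,ik->ijk', x, eye)
    s3 -= np.einsum('k,ij->ijk', x, eye)
    return s3

# Example: standard Gaussian input in R^4
x = np.random.default_rng(1).normal(size=4)
s1, s2 = gaussian_scores(x, np.zeros(4), np.eye(4))
s3 = gaussian_score3_standard(x)
```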
SLIDE 31 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
[Figure: one-hidden-layer network with input x = (x1, . . . , xd), k hidden units σ(·), first-layer weights A1, and output weights a2.]
SLIDE 32 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · uj
SLIDE 33 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · (A1)j
SLIDE 34 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · (A1)j = λ1,1 (A1)1 + λ1,2 (A1)2 + · · ·
SLIDE 35 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M2 = E[y · S2(x)] = ∑j λ2,j · (A1)j ⊗ (A1)j
SLIDE 36 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M2 = E[y · S2(x)] = ∑j λ2,j · (A1)j ⊗ (A1)j = λ2,1 (A1)1 ⊗ (A1)1 + λ2,2 (A1)2 ⊗ (A1)2 + · · ·
SLIDE 37 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
SLIDE 38 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j = λ3,1 (A1)1 ⊗ (A1)1 ⊗ (A1)1 + λ3,2 (A1)2 ⊗ (A1)2 ⊗ (A1)2 + · · ·
SLIDE 39 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Why are tensors required?
Matrix decomposition recovers only the subspace spanned by the weights, not the actual weights. Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
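A quick way to see the first point: any orthogonal rotation of the (weight-scaled) components leaves the second-order moment unchanged, while the third-order tensor generically changes. The toy numpy sketch below, with made-up components, illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.normal(size=(d, k))                 # planted components (columns)
lam = rng.uniform(1.0, 2.0, size=k)         # positive weights

M2 = (A * lam) @ A.T                        # sum_j lam_j a_j a_j^T

# Rotate the weight-scaled components by any orthogonal Q:
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
B = A @ np.diag(np.sqrt(lam)) @ Q           # alternative components, unit weights

print(np.allclose(M2, B @ B.T))             # True: the matrix cannot tell A from B

# The third-order tensors of the two parameterizations differ:
M3     = np.einsum('j,ij,kj,lj->ikl', lam, A, A, A)
M3_alt = np.einsum('ij,kj,lj->ikl', B, B, B)
print(np.allclose(M3, M3_alt))              # False (generically)
```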
SLIDE 40 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Guaranteed learning of the first-layer weights via tensor decomposition. The other parameters are learned via a Fourier technique.
SLIDE 41
NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
SLIDE 42 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
SLIDE 43 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
CP tensor decomposition: M̂3 ≈ sum of rank-1 components.
The rank-1 components are the estimates of the columns of A1.
SLIDE 44 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
CP tensor decomposition: M̂3 ≈ sum of rank-1 components.
The rank-1 components are the estimates of the columns of A1.
Fourier technique ⇒ a2, b1, b2
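Below is a minimal sketch of the tensor stage of this pipeline under simplifying assumptions: the λ3,j and A1 are planted synthetically, the columns of A1 are taken orthonormal so no whitening step is needed, and the CP decomposition is computed with a plain tensor power iteration plus deflation rather than the robust, whitened method of AGHKT'14 used by NN-LiFT. In practice M3 would be the empirical cross-moment (1/n) ∑ yi ⊗ S3(xi) from the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# Hypothetical first-layer weights with orthonormal columns (an assumption made
# purely to keep this sketch short) and stand-in lambda_{3,j} coefficients.
A1, _ = np.linalg.qr(rng.normal(size=(d, k)))
lam = np.array([3.0, 2.0, 1.0])

# Suppose M3 has already been estimated; here we build its population value.
M3 = np.einsum('j,ij,kj,lj->ikl', lam, A1, A1, A1)

def tensor_power(T, n_iter=100, seed=0):
    """One component of a symmetric tensor via u <- T(I, u, u) / ||T(I, u, u)||."""
    u = np.random.default_rng(seed).normal(size=T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ikl,k,l->i', T, u, u)
        u /= np.linalg.norm(u)
    weight = np.einsum('ikl,i,k,l->', T, u, u, u)
    return weight, u

# Recover the k rank-1 components by power iteration plus deflation.
T = M3.copy()
for _ in range(k):
    w, u = tensor_power(T)
    print(w, np.abs(A1.T @ u).max())         # weight and |cosine| with closest column
    T -= w * np.einsum('i,k,l->ikl', u, u, u)
```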
SLIDE 45 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
SLIDE 46 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
Learning the other parameters via a Fourier technique.
SLIDE 47 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
Learning the other parameters via a Fourier technique.
Theorem (JSA’14)
For number of samples n = poly(d, k), we have w.h.p. |f(x) − f̂(x)|^2 ≤ Õ(1/n).
“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
SLIDE 48 Our Main Result: Risk Bounds
Approximating an arbitrary function f(x) with bounded Cf.
n: number of samples; d: input dimension; k: number of neurons.
SLIDE 49 Our Main Result: Risk Bounds
Approximating an arbitrary function f(x) with bounded Cf.
n: number of samples; d: input dimension; k: number of neurons.
Theorem (JSA’14)
Assume Cf is small. Then E[|f(x) − f̂(x)|^2] ≤ O(Cf^2/k) + O(1/n).
Polynomial sample complexity n in terms of the dimensions d, k. Computational complexity: same as SGD with enough parallel processors.
“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
SLIDE 50 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 51
Tractable Learning for LVMs
[Graphical model diagrams: GMM; HMM with hidden states h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, . . . , hk and observations x1, . . . , xd.]
Multiview and Topic Models
SLIDE 52 At Scale Tensor Computations
Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in input dimension and number of samples.
(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015. (2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
SLIDE 53 At Scale Tensor Computations
Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in input dimension and number of samples.
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
BLAS: Basic Linear Algebra Subprograms, highly optimized libraries. Use extended BLAS to minimize data permutation and I/O calls.
(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015. (2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
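As a rough illustration of the FFT-sketch idea (a standard count-sketch construction for rank-1 tensors, not the implementation from the cited papers), the sketch of u ⊗ v ⊗ w can be computed by circularly convolving the count sketches of u, v, and w, so the d³ tensor is never materialized. All names and sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 64                      # input dimension, sketch length

# One independent count-sketch (hash + sign) per tensor mode.
h = [rng.integers(0, b, size=d) for _ in range(3)]
s = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]

def count_sketch(v, mode):
    out = np.zeros(b)
    np.add.at(out, h[mode], s[mode] * v)
    return out

def sketch_rank1(u, v, w):
    """FFT-based sketch of u ⊗ v ⊗ w: circular convolution of the factor sketches."""
    f = np.fft.rfft(count_sketch(u, 0)) \
        * np.fft.rfft(count_sketch(v, 1)) \
        * np.fft.rfft(count_sketch(w, 2))
    return np.fft.irfft(f, n=b)

# Sanity check against sketching the explicit d x d x d tensor entry by entry.
u, v, w = rng.normal(size=(3, d))
T = np.einsum('i,j,k->ijk', u, v, w)
direct = np.zeros(b)
for i in range(d):
    for j in range(d):
        for k in range(d):
            idx = (h[0][i] + h[1][j] + h[2][k]) % b
            direct[idx] += s[0][i] * s[1][j] * s[2][k] * T[i, j, k]

print(np.allclose(direct, sketch_rank1(u, v, w)))   # True
```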
SLIDE 54 Preliminary Results on Spark
In-memory processing in Spark: ideal for iterative tensor methods.
Alternating Least Squares for tensor decomposition:
min_{w,A,B,C} ‖T − ∑_{i=1}^k λi A(:, i) ⊗ B(:, i) ⊗ C(:, i)‖_F
Update rows independently.
[Diagram: tensor slices and factor matrices B, C distributed across workers 1, . . . , k.]
Results on the NYTimes corpus (3 × 10^5 documents, 10^8 words): Spark: 26 mins; Map-Reduce: 4 hrs.
Topic Modeling at Lightning Speeds via Tensor Factorization on Spark by F. Huang, A., under preparation.
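As an illustration of one ALS update (independent of the Spark implementation cited above), the sketch below solves the least-squares problem for factor A against the mode-1 unfolding of the tensor and the Khatri–Rao product of B and C; the rows of the solution are independent, which is what the per-worker row updates exploit. The data is synthetic.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (d2*d3) x k."""
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, B.shape[1])

def als_step(T, B, C):
    """Update A in min ||T - sum_r A(:,r) ⊗ B(:,r) ⊗ C(:,r)||_F with B, C fixed."""
    T1 = T.reshape(T.shape[0], -1)               # mode-1 unfolding, d1 x (d2*d3)
    KR = khatri_rao(B, C)                        # (d2*d3) x k
    A_new, *_ = np.linalg.lstsq(KR, T1.T, rcond=None)
    return A_new.T                               # d1 x k; each row solved independently

# Tiny synthetic check: with B, C fixed at the truth, one step recovers a valid A.
rng = np.random.default_rng(0)
d, k = 6, 2
A0, B0, C0 = rng.normal(size=(3, d, k))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A = als_step(T, B0, C0)
print(np.allclose(np.einsum('ir,jr,kr->ijk', A, B0, C0), T))  # True
```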
SLIDE 55 Convolutional Tensor Decomposition
[Figure: (a) convolutional dictionary model x = ∑_i F*_i ∗ w*_i; (b) reformulated model.]
Cumulant = λ1 (F*_1)^⊗3 + λ2 (F*_2)^⊗3 + · · ·
Efficient methods for tensor decomposition with circulant constraints.
Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A., June 2015.
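The reformulation rests on the identity that circular convolution with a filter equals multiplication by the circulant matrix built from that filter. A small numpy check of this identity (with a made-up filter and coefficients, unrelated to the actual algorithm in the cited paper) is below.

```python
import numpy as np

def circulant(f, n):
    """n x n circulant matrix whose first column is the zero-padded filter f."""
    col = np.zeros(n)
    col[:len(f)] = f
    return np.stack([np.roll(col, shift) for shift in range(n)], axis=1)

rng = np.random.default_rng(0)
n = 8
f = rng.normal(size=3)          # stand-in filter F*
w = rng.normal(size=n)          # stand-in coefficient sequence w*

# x = f (*) w (circular convolution) can equivalently be written as Circ(f) @ w.
fp = np.pad(f, (0, n - len(f)))
x_conv = np.real(np.fft.ifft(np.fft.fft(fp) * np.fft.fft(w)))
x_mat = circulant(f, n) @ w
print(np.allclose(x_conv, x_mat))   # True
```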
SLIDE 56 Reinforcement Learning (RL) of POMDPs
Partially observable Markov decision processes.
Proposed Method
Consider memoryless policies.
Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.
[Diagram: POMDP with hidden states xi, xi+1, xi+2, observations yi, yi+1, rewards ri, ri+1, and actions ai, ai+1.]
[Plot: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.]
Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azzizade, A. Lazaric, A., under preparation.
SLIDE 57 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 58
Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!
SLIDE 59
Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!
Outlook
Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .
A unified framework for tractable non-convex methods with guaranteed convergence to global optima?
SLIDE 60
My Research Group and Resources
Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/