

SLIDE 1

Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes

Zhenwen Dai, Mauricio A. Álvarez, Neil D. Lawrence. Gaussian Process Approximation Workshop, 2017.

SLIDE 2

Motivation

- Machine learning has been very successful in providing tools for learning a function mapping from an input to an output: y = f(x) + ε.
- Modeling in terms of a function mapping assumes a one-to-one or many-to-one mapping between input and output.
- In other words, ideally the input should contain sufficient information to uniquely determine/disambiguate the output, apart from some sensory noise.

SLIDE 3

Data: a Combination of Multiple Scenarios

- In most cases, this assumption does not hold.
- We often collect data as a combination of multiple scenarios, e.g., the voice recordings of multiple persons, or images taken with different models of cameras.
- We only have some labels to identify these scenarios in our data, e.g., we may have the names of the speakers and the specifications of the cameras used.
- These labels are represented as categorical data in some database.

SLIDE 4

How to model these labels?

- A common practice in this case is to ignore the difference between scenarios, but this fails to model the corresponding variations.
- Model each scenario separately.
- Use a one-hot encoding.
- In both of these cases, generalization/transfer to a new scenario is not possible.
- Any better solutions? Latent variable models!

SLIDE 5

A Toy Problem: The Braking Distance of a Car

- We want to model the braking distance of a car in a completely data-driven way.
- Input: the speed when starting to brake.
- Output: the distance that the car moves before it fully stops.
- We know that the braking distance depends on the friction coefficient.
- We can conduct experiments with a set of different tyre and road conditions, each associated with a condition ID.
- How can we model the relation between speed and distance in a data-driven way, so that we can extrapolate to a new condition with only one experiment?

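As a concrete illustration, the toy data can be simulated under the textbook physics relation d = v²/(2μg). This formula, the friction coefficients, and the noise level below are illustrative assumptions, not part of the slides:

```python
import numpy as np

def simulate_braking(speeds, mu, g=9.81, noise_std=0.5, rng=None):
    """Braking distance d = v^2 / (2 * mu * g), plus sensory noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return speeds ** 2 / (2 * mu * g) + rng.normal(0.0, noise_std, size=speeds.shape)

# Hypothetical friction coefficients, one per (tyre, road) condition ID.
conditions = {0: 0.9, 1: 0.6, 2: 0.3}  # e.g. dry, wet, icy
speeds = np.linspace(2.0, 10.0, 50)
data = {d: simulate_braking(speeds, mu) for d, mu in conditions.items()}
```

Lower friction gives longer distances, so the conditions produce clearly different curves over the same speed range.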

SLIDE 6

Common Modeling Choices with Non-parametric Regression

- A straightforward modeling choice is to ignore the difference in conditions. The relation between speed and distance can then be modeled as y = f(x) + ε, f ~ GP.
- Alternatively, we can model each condition separately, i.e., f_d ~ GP, d = 1, ..., D.

[Figure: braking distance vs. speed; a single GP fit (mean, data, confidence) next to per-condition fits against the ground-truth data.]
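A minimal numpy sketch of the two baselines, using an RBF kernel with fixed, made-up hyperparameters (not the slides' actual experiment):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential kernel on scalar inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_mean(x_train, y_train, x_test, noise=0.1):
    """Standard GP regression posterior mean: K_* (K + noise I)^-1 y."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(K, y_train)

x = np.linspace(0.0, 5.0, 20)
y1, y2 = x.copy(), 2.0 * x          # two conditions with different responses
x_test = np.array([2.5])

# Choice 1: one GP over the pooled data, ignoring the condition labels.
pooled = gp_mean(np.concatenate([x, x]), np.concatenate([y1, y2]), x_test)
# Choice 2: an independent GP per condition, f_d ~ GP.
per_cond = [gp_mean(x, y, x_test) for y in (y1, y2)]
```

The pooled fit lands between the two conditions (modeling neither well), while the per-condition fits cannot share any information across conditions.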

SLIDE 7

Modeling the Conditions Jointly

- A probabilistic approach is to assume a latent variable.
- With a latent variable h_d, the relation between speed and distance for condition d is then modeled as

  y = f(x, h_d) + ε,  f ~ GP,  h_d ~ N(0, I).  (1)

- A special Bayesian GPLVM?
- Efficiency: O(N³D³) or O(NDM²).
- The balance among different conditions in inference.

[Figure: braking distance vs. initial speed, ground truth and data per condition; and the learned latent variable plotted against 1/μ.]

SLIDE 8

Latent Variable Multiple Output Gaussian Processes (LVMOGP)

- We propose a new model which assumes the covariance matrix can be decomposed as a Kronecker product of the covariance matrix of the latent variables, K^H, and the covariance matrix of the inputs, K^X.
- The probabilistic distributions of LVMOGP are defined as

  p(Y: | F:) = N(Y: | F:, σ²I),
  p(F: | X, H) = N(F: | 0, K^H ⊗ K^X),  (2)

  where the latent variables H have unit Gaussian priors, h_d ~ N(0, I).
- This is a special case of the model in (1).
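The Kronecker-structured covariance in (2) can be sketched in a few lines of numpy; the kernel choices and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 10                      # number of conditions and inputs
H = rng.normal(size=(D, 2))       # latent variables, h_d ~ N(0, I)
X = np.linspace(0.0, 5.0, N)

# RBF covariances over the latent space (K^H) and the input space (K^X).
KH = np.exp(-0.5 * ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1))
KX = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)

# p(F: | X, H) = N(F: | 0, K^H ⊗ K^X): full covariance over all (d, n) pairs.
K = np.kron(KH, KX)
F = rng.multivariate_normal(np.zeros(D * N), K + 1e-8 * np.eye(D * N))
```

Each draw F stacks one function value per (condition, input) pair; conditions with nearby latent vectors h_d get strongly correlated functions.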

SLIDE 9

Scalable Variational Inference

- Sparse GP approximation with inducing variables U ∈ R^(M_X × M_H):

  log p(Y | X, H) ≥ ⟨log p(Y: | F:)⟩_q(F|U)q(U) + ⟨log [ p(F | U, X, H) p(U) / (q(F|U) q(U)) ]⟩_q(F|U)q(U)

- Lower bounding the marginal likelihood:

  log p(Y | X) ≥ F − KL(q(U) ‖ p(U)) − KL(q(H) ‖ p(H)).  (3)
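Both KL terms in (3) are between Gaussians and therefore have the standard closed form; a minimal sketch, with a zero-mean prior N(0, K) standing in for p(U):

```python
import numpy as np

def kl_gaussians(m, S, K):
    """KL( N(m, S) || N(0, K) ) in closed form, for full covariances S and K."""
    Kinv_S = np.linalg.solve(K, S)            # K^-1 S
    mahal = m @ np.linalg.solve(K, m)         # m^T K^-1 m
    logdet = np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (np.trace(Kinv_S) + mahal - len(m) + logdet)
```

A quick sanity check: when q matches the prior exactly, the KL is zero, and it is strictly positive otherwise.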

SLIDE 10

Closed-form Variational Lower Bound (SVI-GP)

- It is known that the optimal posterior distribution q(U) is a Gaussian distribution [Titsias, 2009, Matthews et al., 2016]. With an explicit Gaussian definition of q(U) = N(U | M, Σ^U), the integral in F has a closed-form solution:

  F = −(ND/2) log 2πσ² − (1/2σ²) Y:ᵀ Y:
      − (1/2σ²) Tr( K_uu⁻¹ Φ K_uu⁻¹ (M: M:ᵀ + Σ^U) )
      + (1/σ²) Y:ᵀ Ψ K_uu⁻¹ M:
      − (1/2σ²) ( ψ − tr(K_uu⁻¹ Φ) ),

  where ψ = ⟨tr(K_ff)⟩_q(H), Ψ = ⟨K_fu⟩_q(H) and Φ = ⟨K_fuᵀ K_fu⟩_q(H).
- The computational complexity of the closed-form solution is O(N D M_X² M_H²).
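The kernel expectations ψ, Ψ and Φ have closed forms for standard kernels such as the RBF; purely to illustrate what these statistics are, here is a Monte Carlo version with made-up sizes and q(H) parameters:

```python
import numpy as np

def k(a, b):
    # RBF kernel between scalar locations.
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2)

rng = np.random.default_rng(1)
Z = np.linspace(-2.0, 2.0, 5)              # inducing inputs, M = 5
mu, s = np.zeros(4), 0.1 * np.ones(4)      # q(H) = N(mu, diag(s^2)), D = 4

Hs = mu + s * rng.normal(size=(2000, 4))   # samples from q(H)
Psi = np.mean([k(h, Z) for h in Hs], axis=0)              # Ψ = <K_fu>_q(H)
Phi = np.mean([k(h, Z).T @ k(h, Z) for h in Hs], axis=0)  # Φ = <K_fu^T K_fu>_q(H)
psi = np.mean([np.trace(k(h, h)) for h in Hs])            # ψ = <tr(K_ff)>_q(H)
```

In an actual implementation the analytic expectations replace the sampling; the shapes (Ψ is D × M, Φ is M × M, ψ is a scalar) are the same either way.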

SLIDE 11

More Efficient Formulation

- The Kronecker product decomposition of the covariance matrices is not yet exploited.
- Firstly, the expectation computation can be decomposed:

  ψ = ψ^H tr(K^X_ff),  Ψ = Ψ^H ⊗ K^X_fu,  Φ = Φ^H ⊗ ( (K^X_fu)ᵀ K^X_fu ),  (4)

  where ψ^H = ⟨tr(K^H_ff)⟩_q(H), Ψ^H = ⟨K^H_fu⟩_q(H) and Φ^H = ⟨(K^H_fu)ᵀ K^H_fu⟩_q(H).
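The decomposition in (4) follows because the kernel over (latent, input) pairs is a product kernel, so its cross-covariance matrices factor into Kronecker products. A small numerical check of that identity, using illustrative scalar RBF kernels:

```python
import numpy as np

def ks(a, b):
    # Scalar RBF kernel value.
    return float(np.exp(-0.5 * (a - b) ** 2))

H, X = np.array([0.0, 1.0]), np.array([0.0, 1.0, 2.0])  # latents and inputs
ZH, ZX = np.array([-1.0, 0.5]), np.array([0.0, 1.0])    # inducing latents/inputs

# Per-factor cross-covariances K^H_fu and K^X_fu.
KH_fu = np.array([[ks(h, zh) for zh in ZH] for h in H])
KX_fu = np.array([[ks(x, zx) for zx in ZX] for x in X])

# Naively built cross-covariance of the product kernel over all (h, x) pairs...
full = np.array([[ks(h, zh) * ks(x, zx) for zh in ZH for zx in ZX]
                 for h in H for x in X])
# ...equals the Kronecker product of the per-factor matrices, as in (4).
assert np.allclose(full, np.kron(KH_fu, KX_fu))
```

Because the factorization holds entry-wise, the expectation over q(H) only touches the K^H factor, which is exactly what (4) states.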

SLIDE 12

More Efficient Formulation

- Secondly, we assume a Kronecker product decomposition of the covariance matrix of q(U), i.e., Σ^U = Σ^H ⊗ Σ^X.
- This reduces the number of variational parameters in the covariance matrix from M_X² M_H² to M_X² + M_H².
- The direct computation of Kronecker products is completely avoided:

  F = −(ND/2) log 2πσ² − (1/2σ²) Y:ᵀ Y:
      − (1/2σ²) tr( Mᵀ ( (K^X_uu)⁻¹ Φ^X (K^X_uu)⁻¹ ) M (K^H_uu)⁻¹ Φ^H (K^H_uu)⁻¹ )
      − (1/2σ²) tr( (K^H_uu)⁻¹ Φ^H (K^H_uu)⁻¹ Σ^H ) tr( (K^X_uu)⁻¹ Φ^X (K^X_uu)⁻¹ Σ^X )
      ...
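The claim that Kronecker products are never formed rests on standard identities such as tr((A ⊗ B)(C ⊗ D)) = tr(AC) tr(BD); a quick numerical check with arbitrary matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, CA = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
B, CB = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# Direct evaluation: builds the 12x12 Kronecker products explicitly.
direct = np.trace(np.kron(A, B) @ np.kron(CA, CB))
# Factored evaluation: only ever touches the small per-factor matrices.
factored = np.trace(A @ CA) * np.trace(B @ CB)
assert np.isclose(direct, factored)
```

Applied to the trace terms of F, this lets each term be computed from the M_H × M_H and M_X × M_X factors separately, instead of the full (M_X M_H)-sized matrices.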

SLIDE 13

Prediction

- Given a set of new inputs X* together with a set of new scenarios H*, the prediction of the noiseless observation F* can be computed in closed form:

  q(F*: | X*, H*) = ∫ p(F*: | U:, X*, H*) q(U:) dU:
                  = N( F*: | K_f*u K_uu⁻¹ M:, K_f*f* − K_f*u K_uu⁻¹ K_f*uᵀ + K_f*u K_uu⁻¹ Σ^U K_uu⁻¹ K_f*uᵀ ).

- For a regression problem, we are often more interested in predicting for the existing conditions from the training data. We can approximate the prediction by integrating the above prediction equation with q(H):

  q(F*: | X*) = ∫ q(F*: | X*, H) q(H) dH.
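A sketch of the noiseless predictive equation above, with a hypothetical helper name `predict_f` and a small jitter added for numerical stability (both are implementation choices, not from the slides):

```python
import numpy as np

def predict_f(Kfu, Kuu, Kff, M, SU, jitter=1e-8):
    """Posterior of F* given q(U) = N(M, SU), following the slide's equation."""
    # A = K_{f*u} K_uu^{-1}, computed via a solve rather than an explicit inverse.
    A = np.linalg.solve(Kuu + jitter * np.eye(len(Kuu)), Kfu.T).T
    mean = A @ M
    cov = Kff - A @ Kfu.T + A @ SU @ A.T
    return mean, cov
```

A useful sanity check: with M = 0 and Σ^U = K_uu (i.e., q(U) equal to the prior p(U)), the prediction reverts to the prior, giving zero mean and covariance K_f*f*.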

SLIDE 14

Missing Data

- The model described previously assumes that the N different inputs are observed in all of the D different conditions.
- In real-world problems, we often collect data at a different set of inputs for each scenario, i.e., for each condition d, d = 1, ..., D.
- The proposed model can be extended to handle this case by reformulating F as

  F = Σ_{d=1}^{D} [ −(N_d/2) log 2πσ_d² − (1/2σ_d²) Y_dᵀ Y_d
      − (1/2σ_d²) Tr( K_uu⁻¹ Φ_d K_uu⁻¹ (M: M:ᵀ + Σ^U) )
      + (1/σ_d²) Y_dᵀ Ψ_d K_uu⁻¹ M:
      − (1/2σ_d²) ( ψ_d − tr(K_uu⁻¹ Φ_d) ) ],

  where Φ_d = Φ^H_d ⊗ ( (K^X_{f_d u})ᵀ K^X_{f_d u} ), Ψ_d = Ψ^H_d ⊗ K^X_{f_d u} and ψ_d = ψ^H_d tr(K^X_{f_d f_d}).

SLIDE 15

Related Works

- Multiple Output Gaussian Processes / Multi-task Gaussian Processes: Álvarez et al. [2012], Goovaerts [1997], Bonilla et al. [2008].
- Our method reduces the computational complexity to O(max(N, M_H) max(D, M_X) max(M_X, M_H)) when there are no missing data.
- An additional advantage of our method is that it can easily be parallelized using mini-batches, as in [Hensman et al., 2013].
- The idea of modeling latent information about different conditions jointly with the modeling of data points is related to the style and content model of Tenenbaum and Freeman [2000].

SLIDE 16

Experiments on Synthetic Data

- 100 different uniformly sampled input locations (50 for training and 50 for testing), where each corresponds to 40 different conditions. Observation noise with variance 0.3 is added to the training data.
- We compare LVMOGP with two other methods: GP with independent output dimensions (GP-ind) and LMC (with a full-rank coregionalization matrix).
- First dataset: no missing data.

[Figure: RMSE of GP-ind, LMC and LVMOGP on the synthetic data without missing data.]

SLIDE 17

Experiments on Synthetic Data with Missing Data

- To generate a dataset with uneven numbers of training data in different conditions, we group the conditions into 10 groups. Within each group, the numbers of training data in the four conditions are generated through a three-step stick-breaking procedure with a uniform prior distribution (200 data points in total).
- We compare LVMOGP with two other methods: GP with independent output dimensions (GP-ind) and LMC (with a full-rank coregionalization matrix).
- RMSE: GP-ind 0.43 ± 0.06, LMC 0.47 ± 0.09, LVMOGP 0.30 ± 0.04.

[Figure: RMSE of GP-ind, LMC and LVMOGP on the synthetic data with missing data, and the train/test predictions of each method.]

SLIDE 18

Experiment on Servo Data

- We apply our method to a servo modeling problem, in which the task is to predict the rise time of a servomechanism in terms of two (continuous) gain settings and two (discrete) choices of mechanical linkages [Quinlan, 1992].
- The two choices of mechanical linkages: 5 types of motors and 5 types of lead screws.
- We take 70% of the dataset as training data and the rest as test data, and randomly generate 20 partitions.
- RMSE: GP-WO 1.03 ± 0.20, GP-ind 1.30 ± 0.31, GP-OH 0.73 ± 0.26, LMC 0.69 ± 0.35, LVMOGP 0.52 ± 0.16.

[Figure: RMSE of GP-WO, GP-ind, GP-OH, LMC and LVMOGP on the servo data.]

SLIDE 19

Experiment on Sensor Imputation

- We apply our method to impute multivariate time series data with massive missing data. We take an in-house multi-sensor recording including a list of sensor measurements such as temperature, carbon dioxide, humidity, etc. [Zamora-Martínez et al., 2014].
- The measurements are recorded every minute for roughly a month and smoothed with 15-minute means.
- We mimic the scenario of massive missing data by randomly removing 95% of the data entries and aim at imputing all the missing values.
- RMSE: GP-ind 0.85 ± 0.09, LMC 0.59 ± 0.21, LVMOGP 0.45 ± 0.02.

[Figure: RMSE of GP-ind, LMC and LVMOGP on the sensor imputation task.]

SLIDE 20

Conclusion

- Common practices such as one-hot encoding cannot efficiently model the relation among different conditions and are not able to generalize to a new condition at test time.
- We propose to solve this problem in a principled way, where we learn the latent information of the conditions in a latent space as part of the regression model.
- By exploiting the Kronecker product decomposition in the variational posterior, our inference method is able to achieve the same computational complexity as sparse GPs with independent observations.
- As shown repeatedly in the experiments, the Bayesian inference of the latent variables in LVMOGP avoids the overfitting problem of LMC.

SLIDE 21

Reference

Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012. ISSN 1935-8237. doi: 10.1561/2200000036. URL http://dx.doi.org/10.1561/2200000036.

Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In John C. Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, NIPS, volume 20, 2008.

Pierre Goovaerts. Geostatistics For Natural Resources Evaluation. Oxford University Press, 1997.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, 2013.

Alexander G. D. G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In AISTATS, 2016.

J. R. Quinlan. Learning with continuous classes. In Australian Joint Conference on Artificial Intelligence, pages 343–348, 1992.

J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12:1473–1483, 2000.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.

F. Zamora-Martínez, P. Romeu, P. Botella-Rocamora, and J. Pardo. On-line learning of indoor temperature forecasting models towards energy efficiency. Energy and Buildings, 83:162–172, 2014.