
slide-1
SLIDE 1

Applications of optimal transport to machine learning and signal processing

Presented by Nicolas Courty, Maître de conférences HDR / Université de Bretagne Sud, Laboratoire IRISA

http://people.irisa.fr/Nicolas.Courty/

slide-2
SLIDE 2

Motivations

  • Optimal transport is a perfect tool to compare empirical probability distributions
  • In the context of machine learning/signal processing, one often has to deal with collections of samples that can be interpreted as probability distributions

slide-3
SLIDE 3

Motivations

  • Optimal transport is a perfect tool to compare empirical probability distributions
  • In the context of machine learning/signal processing, one often has to deal with collections of samples that can be interpreted as probability distributions

with proper normalization: the spectrum of a piano note becomes a probability distribution
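The normalization mentioned above is just rescaling to unit mass; a minimal pure-Python sketch (the spectrum values and the helper name `to_distribution` are illustrative, not from the slides):

```python
def to_distribution(spectrum):
    """Normalize a non-negative magnitude spectrum so it sums to 1,
    turning it into an empirical probability distribution."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return [x / total for x in spectrum]

# A toy magnitude spectrum of a note (values are illustrative only)
spectrum = [0.0, 4.0, 1.0, 2.0, 0.5, 0.5]
p = to_distribution(spectrum)
print(p)       # [0.0, 0.5, 0.125, 0.25, 0.0625, 0.0625]
print(sum(p))  # 1.0
```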

slide-4
SLIDE 4

Motivations

  • I will showcase 2 successful examples of application of OT in the context of machine learning and signal processing
  • First one: OT for transfer learning (domain adaptation)
      • using the coupling to interpolate multidimensional data
      • special note on the out-of-sample problem
  • Second: OT for music transcription
      • using the metric to adapt to the specificity of the data
slide-5
SLIDE 5

Forenote on implementation

  • All these examples have been implemented using POT, the Python Optimal Transport toolbox
  • Available here: https://github.com/rflamary/POT
  • Some use cases will be given along the examples
slide-6
SLIDE 6

Optimal Transport for domain adaptation

introduction to domain adaptation
regularization helps
out-of-sample formulation

Joint work with Rémi Flamary, Devis Tuia, Alain Rakotomamonjy, Michael Perrot, Amaury Habrard

slide-7
SLIDE 7

Domain Adaptation problem

Traditional machine learning hypothesis

  • We have access to training data.
  • Probability distributions of the training set and the testing set are the same.
  • We want to learn a classifier that generalizes to new data.

Our context

  • Classification problem with data coming from different sources (domains).
  • Distributions are different but related.

slide-8
SLIDE 8

Domain Adaptation problem

[Figure: feature extraction from the Amazon and DSLR domains, and the resulting probability distribution functions over the domains]

Our context

  • Classification problem with data coming from different sources (domains).
  • Distributions are different but related.

slide-9
SLIDE 9

Unsupervised domain adaptation problem

[Figure: source domain (Amazon, with labels) and target domain (DSLR, no labels) after feature extraction; the decision function learned on the source does not work on the target]

Problems

  • Labels are only available in the source domain, and classification is conducted in the target domain.
  • A classifier trained on the source domain data performs badly in the target domain.

slide-10
SLIDE 10

Domain adaptation short state of the art

Reweighting schemes [Sugiyama et al., 2008]

  • Distribution change between domains.
  • Reweight samples to compensate for this change.

Subspace methods

  • Data is invariant in a common latent subspace.
  • Minimization of a divergence between the projected domains [Si et al., 2010].
  • Use additional label information [Long et al., 2014].

Gradual alignment

  • Alignment along the geodesic between source and target subspaces [R. Gopalan and Chellappa, 2014].
  • Geodesic flow kernel [Gong et al., 2012].

slide-11
SLIDE 11

Generalization error in domain adaptation

Theoretical bounds [Ben-David et al., 2010]

The error made by a given classifier in the target domain is upper-bounded by the sum of three terms:

  • Error of the classifier in the source domain;
  • Divergence measure between the pdfs of the two domains;
  • A third term measuring how much the classification tasks are related to each other.

Our proposal [Courty et al., 2016]

  • Model the discrepancy between the distributions through a general transformation.
  • Use optimal transport to estimate the transportation map between the two distributions.
  • Use regularization terms for the optimal transport problem that exploit labels from the source domain.

slide-12
SLIDE 12

Optimal transport for domain adaptation

[Figure: dataset (class 1 / class 2 samples), optimal transport of the samples, and classification on the transported samples]

Assumptions

  • There exists a transport T between the source and target domains.
  • The transport preserves the conditional distributions: Ps(y|xs) = Pt(y|T(xs)).

3-step strategy

  1. Estimate the optimal transport between the distributions.
  2. Transport the training samples onto the target distribution.
  3. Learn a classifier on the transported training samples.
slide-13
SLIDE 13

Optimal Transport for domain adaptation

introduction to domain adaptation
regularization helps
out-of-sample formulation
slide-14
SLIDE 14

Optimal transport for empirical distributions

Empirical distributions

    μ_s = Σ_{i=1..n_s} p_i^s δ_{x_i^s},    μ_t = Σ_{i=1..n_t} p_i^t δ_{x_i^t}    (4)

  • δ_{x_i} is the Dirac at location x_i ∈ R^d, and p_i^s and p_i^t are probability masses.
  • Σ_i p_i^s = Σ_i p_i^t = 1; in this work p_i^s = 1/n_s and p_i^t = 1/n_t.
  • Samples are stored in matrices X_s = [x_1^s, …, x_{n_s}^s]^T and X_t = [x_1^t, …, x_{n_t}^t]^T.
  • The cost is set to the squared Euclidean distance C_{i,j} = ||x_i^s − x_j^t||².
  • Same optimization problem, different C.
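Building the squared Euclidean cost matrix between two toy point clouds can be sketched in plain Python (the sample coordinates and the helper name `sq_euclidean_cost` are illustrative):

```python
def sq_euclidean_cost(Xs, Xt):
    """Cost matrix C with C[i][j] = ||xs_i - xt_j||^2 (squared Euclidean)."""
    return [[sum((a - b) ** 2 for a, b in zip(xs, xt)) for xt in Xt]
            for xs in Xs]

# Toy source/target samples in R^2, with uniform masses 1/ns and 1/nt
Xs = [(0.0, 0.0), (1.0, 0.0)]
Xt = [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0)]
ps = [1.0 / len(Xs)] * len(Xs)
pt = [1.0 / len(Xt)] * len(Xt)
C = sq_euclidean_cost(Xs, Xt)
```

POT provides the same construction as `ot.dist(Xs, Xt)` for dense arrays.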

slide-15
SLIDE 15

Efficient regularized optimal transport

[Figure: transportation cost matrix C and optimal matrix γ obtained with Sinkhorn]

Entropic regularization [Cuturi, 2013]

    γ_0^λ = argmin_{γ ∈ P} ⟨γ, C⟩_F + λ h(γ),    (5)

where h(γ) = Σ_{i,j} γ(i, j) log γ(i, j) is the negative entropy of γ.

  • Entropy introduces smoothness in γ_0^λ.
  • Sinkhorn-Knopp algorithm (efficient implementation in parallel, on GPU).
  • General framework using Bregman projections [Benamou et al., 2015].
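A minimal pure-Python sketch of the Sinkhorn-Knopp iterations (for illustration only; the toy histograms and cost are made up):

```python
import math

def sinkhorn(a, b, C, reg, n_iter=1000):
    """Sinkhorn-Knopp: entropy-regularized OT between histograms a and b.
    Returns a coupling gamma whose marginals (approximately) match a and b."""
    K = [[math.exp(-c / reg) for c in row] for row in C]  # Gibbs kernel
    u = [1.0] * len(a)
    v = [1.0] * len(b)
    for _ in range(n_iter):
        # alternate scaling of rows and columns to match the marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]

a = [0.5, 0.5]
b = [0.5, 0.5]
C = [[0.0, 1.0], [1.0, 0.0]]
g = sinkhorn(a, b, C, reg=0.1)
# the coupling concentrates on the cheap diagonal, smoothed by the entropy term
```

POT exposes an efficient version of this solver as `ot.sinkhorn(a, b, M, reg)`.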

slide-16
SLIDE 16

Transporting the discrete samples

Barycentric mapping [Ferradans et al., 2014]

  • The mass of each source sample is spread onto the target samples (line i of γ_0).
  • Each source sample becomes a weighted sum of Diracs (impractical for ML).
  • We estimate the transported position of each source sample with:

    x̂_i^s = argmin_x Σ_j γ_0(i, j) c(x, x_j^t).    (6)

  • Position of the transported samples for the squared Euclidean loss:

    X̂_s = diag(γ_0 1_{n_t})^{−1} γ_0 X_t   and   X̂_t = diag(γ_0^T 1_{n_s})^{−1} γ_0^T X_s.    (7)
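Equation (7) can be sketched in plain Python; the toy coupling and target samples below are illustrative:

```python
def barycentric_map(gamma, Xt):
    """Transported source positions for the squared Euclidean loss (Eq. 7):
    each source sample moves to the gamma-weighted average of the target
    samples it sends mass to, i.e. diag(gamma 1)^{-1} gamma Xt."""
    d = len(Xt[0])
    out = []
    for row in gamma:
        mass = sum(row)  # row marginal of the coupling
        out.append(tuple(
            sum(row[j] * Xt[j][k] for j in range(len(Xt))) / mass
            for k in range(d)))
    return out

# Toy coupling: source sample 0 sends all mass to target 0,
# source sample 1 splits its mass evenly between the two targets
gamma = [[0.5, 0.0], [0.25, 0.25]]
Xt = [(0.0, 0.0), (2.0, 2.0)]
print(barycentric_map(gamma, Xt))  # [(0.0, 0.0), (1.0, 1.0)]
```

In POT, the domain-adaptation classes such as `ot.da.SinkhornTransport` apply this barycentric transform through their `transform` method.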

slide-17
SLIDE 17

In POT

slide-18
SLIDE 18

In POT: LP and Sinkhorn solvers

slide-19
SLIDE 19

Regularization for domain adaptation

Optimization problem

    min_{γ ∈ P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω(γ),    (8)

where

  • Ω_s(γ) is the entropic regularization [Cuturi, 2013],
  • η ≥ 0 and Ω(·) is a DA regularization term,
  • regularization avoids overfitting in high dimension and encodes additional information.

Regularization terms for domain adaptation Ω(γ)

  • Class-based regularization [Courty et al., 2014] to encode the source label information.
  • Graph regularization [Ferradans et al., 2014] to promote the conservation of local sample similarity.
  • Semi-supervised regularization when some target samples have known labels.

slide-20
SLIDE 20

Entropic regularization

Entropic regularization [Cuturi, 2013]

    Ω_s(γ) = Σ_{i,j} γ(i, j) log γ(i, j)

  • Extremely efficient optimization scheme (Sinkhorn-Knopp).
  • Solution is not sparse anymore due to the regularization.
  • Strong regularization forces the samples to concentrate on the center of mass of the target samples.


slide-22
SLIDE 22

Class-based regularization

Group lasso regularization [Courty et al., 2016]

  • We group the components of γ using the classes from the source domain:

    Ω_c(γ) = Σ_j Σ_c ||γ(I_c, j)||_q^p,    (9)

  • I_c contains the indices of the rows related to the samples of class c in the source domain.
  • ||·||_q^p denotes the ℓ_q norm to the power p.
  • For p ≤ 1, we encourage a target domain sample j to receive mass only from "same class" source samples.
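A pure-Python sketch of evaluating the group-lasso term (9) with q = 1 and p = 1/2, on two made-up couplings: mass grouped by class is penalized less than mass spread across classes (function name and values are illustrative):

```python
def group_lasso_penalty(gamma, classes, p=0.5, q=1.0):
    """Omega_c(gamma) = sum_j sum_c ||gamma(I_c, j)||_q^p  (Eq. 9).
    classes[i] is the class label of source sample i."""
    labels = set(classes)
    total = 0.0
    for j in range(len(gamma[0])):          # target samples
        for c in labels:                     # class groups I_c
            norm_q = sum(abs(gamma[i][j]) ** q
                         for i in range(len(gamma))
                         if classes[i] == c) ** (1.0 / q)
            total += norm_q ** p
    return total

classes = [0, 0, 1, 1]
# each target column receives mass from a single class
grouped = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
# same total mass, but spread over both classes
spread = [[0.125, 0.125]] * 4
print(group_lasso_penalty(grouped, classes) < group_lasso_penalty(spread, classes))  # True
```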


slide-24
SLIDE 24

Optimization problem

    min_{γ ∈ P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω(γ)

Special cases

  • η = 0: Sinkhorn-Knopp [Cuturi, 2013].
  • λ = 0 and Laplacian regularization: large quadratic program solved with conditional gradient [Ferradans et al., 2014].
  • Non-convex group lasso ℓ_p − ℓ_1: majorization-minimization with Sinkhorn-Knopp [Courty et al., 2014].

General framework with convex regularization Ω(γ)

  • Can we use the efficient Sinkhorn-Knopp scaling to solve the global problem?
  • Yes, using the generalized conditional gradient [Bredies et al., 2009].
  • Linearization of the second regularization term, but not of the entropic regularization.

slide-25
SLIDE 25

Simulated problem with controllable complexity

Two moons problem [Germain et al., 2013]

  • Two entangled moons with a rotation between the domains.
  • The rotation angle allows control of the adaptation difficulty.
  • Comparison with Domain Adaptation SVM [Bruzzone and Marconcini, 2010] and PBDA [Germain et al., 2013].

OT domain adaptation:

  • OT-exact: non-regularized OT.
  • OT-IT: entropic reg.
  • OT-GL: group lasso + entropic reg.
  • OT-Lap: Laplacian + entropic reg.

slide-26
SLIDE 26

Results on the two moons dataset

Average prediction error, rotation angles 10°–90° (some cells are missing):

SVM (no adapt.)  0.104  0.24   0.312  0.4    0.764  0.828
DASVM            0.259  0.284  0.334  0.747  0.820
PBDA             0.094  0.103  0.225  0.412  0.626  0.687
OT-exact         0.028  0.065  0.109  0.206  0.394  0.507
OT-IT            0.007  0.054  0.102  0.221  0.398  0.508
OT-GL            0.013  0.196  0.378  0.508
OT-Lap           0.004  0.062  0.201  0.402  0.524

Discussion

  • Average prediction error for adaptation with rotations from 10° to 90°.
  • Clear advantage of the optimal transport techniques.
  • Regularization helps (a lot) up to 40°.
  • 90° is the theoretical limit (positive definite Jacobian of the transformation).

slide-27
SLIDE 27

Results on the two moons dataset

[Figure: decision boundaries for (a) rotation=10°, (b) rotation=30°, (c) rotation=50°, (d) rotation=70°]

Discussion

  • Average prediction error for adaptation with rotations from 10° to 90°.
  • Clear advantage of the optimal transport techniques.
  • Regularization helps (a lot) up to 40°.
  • 90° is the theoretical limit (positive definite Jacobian of the transformation).

slide-28
SLIDE 28

Visual adaptation datasets

Datasets

  • Digit recognition: MNIST vs USPS (10 classes, d=256, 2 domains).
  • Face recognition: PIE dataset (68 classes, d=1024, 4 domains).
  • Object recognition: Caltech-Office dataset (10 classes, d=800/4096, 4 domains).

Numerical experiments

  • Comparison with the state of the art on the 3 datasets.
  • Comparison on object recognition with deep invariant features.
  • Semi-supervised extension.

slide-29
SLIDE 29

Comparison on vision datasets

Mean accuracy (ACC) and number of times best (in parentheses) per dataset:

Method     Digits           Faces            Objects
1NN        48.66            26.22            28.47
PCA        42.94            34.55            37.98
GFK        52.56            26.15            39.21
TSL        47.22            36.10            42.97 (1)
JDA        57.30            56.69 (7)        44.34 (1)
OT-exact   49.96            50.47            36.69
OT-IT      59.20            54.89            42.30
OT-Lap     61.07            56.10 (3)        43.20
OT-LpLq    64.11 (1)        55.45            46.42 (1)
OT-GL      63.90 (1)        55.88 (2)        47.70 (9)

Discussion

  • We report mean accuracy (ACC) and the number of times the method was the best among all possible adaptation pairs.
  • OT works very well on digits and object recognition (+7% and +3% w.r.t. JDA).
  • Good but not best on face recognition (−0.5% w.r.t. JDA).

slide-30
SLIDE 30

In POT

slide-31
SLIDE 31

In POT

slide-32
SLIDE 32

Optimal Transport for domain adaptation

introduction to domain adaptation
regularization helps
out-of-sample formulation
slide-33
SLIDE 33

Mapping estimation for discrete optimal transport

[Figure: dataset (class 1 / class 2 samples), optimal transport of the samples, and classification on the transported samples]

Why estimate the mapping?

  • Out-of-sample problem.
  • Avoids solving the optimization problem every time the dataset changes.
  • Transporting a very large number of samples.
  • Interpretability (depending on the mapping model).

How to estimate the mapping?

  • Go back to the Monge formulation? No!
  • We can use the barycentric mapping on the data samples.
  • We want to fit the barycentric mapping but also introduce smoothness.

slide-34
SLIDE 34

Mapping estimation

Problem formulation [Perrot et al., 2016]

    argmin_{T ∈ H, γ ∈ P} f(γ, T) = λ_γ ⟨γ, C⟩_F + ||T(X_s) − n_s γ X_t||_F² + λ_T R(T)    (10)

(the three terms are the OT loss, the mapping data-fitting term, and the mapping regularization)

where

  • X_s = [x_1^s, …, x_{n_s}^s]^T and X_t = [x_1^t, …, x_{n_t}^t]^T are the source and target datasets,
  • T(·) is applied to each element (row) of the above matrices,
  • n_s γ X_t is the barycentric mapping of the source samples with uniform weights,
  • H is the space of transformations (more details later),
  • R(·) is a regularization term controlling the complexity of T.

Convexity and optimization

  • Problem (10) is jointly convex if R(·) is convex and H is a convex set.
  • We propose a block coordinate descent to solve the problem.

slide-35
SLIDE 35

Mapping estimation interpretation

Regression problem

    argmin_{T ∈ H, γ ∈ P} f(γ, T) = λ_γ ⟨γ, C⟩_F + ||T(X_s) − n_s γ X_t||_F² + λ_T R(T)

(read with the last two terms as data fitting and regularization)

  • The mapping aims at fitting the barycentric mapping.
  • Allows a mapping model that can be reused (out of sample).
  • Can we do OT, then estimation [Perrot and Habrard, 2015]?

Regularized optimal transport

    argmin_{T ∈ H, γ ∈ P} f(γ, T) = λ_γ ⟨γ, C⟩_F + ||T(X_s) − n_s γ X_t||_F² + λ_T R(T)

(read with the first term as the OT loss and the last two terms as an OT regularization)

  • Adapts OT to the mapping.
  • Model-based regularization for OT.

slide-36
SLIDE 36

Mapping family H

Linear transformations

    H = { T : ∀x ∈ Ω, T(x) = x^T L }    (11)

  • L is a d × d real matrix.
  • R(T) = ||L − I||_F² where I is the identity matrix.
  • The update is a classical linear least squares regression.

Nonlinear transformations

    H = { T : ∀x ∈ Ω, T(x) = k_{X_s}(x^T) L }    (12)

  • k_{X_s}(x^T) = [k(x, x_1^s), k(x, x_2^s), …, k(x, x_{n_s}^s)].
  • k(·, ·) is a positive definite kernel.
  • L is an n_s × d real matrix.
  • The update is a classical kernel least squares regression.

For both models we can add a bias to get affine transformations.
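For the linear family, each block-coordinate update of T is a regularized least squares problem, (X_sᵀ X_s + λI) L = X_sᵀ B + λI with B = n_s γ X_t. A one-dimensional sketch under made-up data (the function name and values are illustrative):

```python
def linear_map_update(xs, b, lam):
    """One block-coordinate update of the linear map (d = 1 sketch):
    minimize sum_i (xs[i] * L - b[i])^2 + lam * (L - 1)^2,
    where b holds the barycentric targets n_s * (gamma Xt) and
    (L - 1)^2 is the 1-D analogue of R(T) = ||L - I||_F^2.
    Closed form: L = (sum x_i b_i + lam) / (sum x_i^2 + lam)."""
    num = sum(x * t for x, t in zip(xs, b)) + lam
    den = sum(x * x for x in xs) + lam
    return num / den

xs = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]                        # targets are exactly 2 * xs
print(linear_map_update(xs, b, lam=0.0))   # 2.0 (pure least squares)
# heavier regularization pulls L back toward the identity (L = 1)
print(linear_map_update(xs, b, lam=1e6))
```

In POT, this joint mapping/coupling estimation is available as `ot.da.MappingTransport`.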

slide-37
SLIDE 37

Illustrative example

Clown 2D dataset

  • Clearly a non-linear mapping.
  • The mapping model can control the barycentric mapping.

slide-38
SLIDE 38

Domain adaptation: Caltech-Office dataset

Task    1NN   GFK   SA    OT    L1L2  OTE   OTLin T/γ    OTLinB T/γ   OTKer T/γ    OTKerB T/γ
D → W   89.5  93.3  95.6  77.0  95.7  95.7  97.3 / 97.3  97.3 / 97.3  98.4 / 98.5  98.5 / 98.5
D → A   62.5  77.2  88.5  70.8  74.9  74.8  85.7 / 85.7  85.8 / 85.8  89.9 / 89.9  89.5 / 89.5
D → C   51.8  69.7  79.0  68.1  67.8  68.0  77.2 / 77.2  77.4 / 77.4  69.1 / 69.2  69.3 / 69.3
W → D   99.2  99.8  99.6  74.1  94.4  94.4  99.4 / 99.4  99.8 / 99.8  97.2 / 97.2  96.9 / 96.9
W → A   62.5  72.4  79.2  67.6  71.3  71.3  81.5 / 81.5  81.4 / 81.4  78.5 / 78.3  78.5 / 78.8
W → C   59.5  63.7  55.0  63.1  67.8  67.8  75.9 / 75.9  75.4 / 75.4  72.7 / 72.7  65.1 / 63.3
A → D   65.2  75.9  83.8  64.6  70.1  70.5  80.6 / 80.6  80.4 / 80.5  65.6 / 65.5  71.9 / 71.5
A → W   56.8  68.0  74.6  66.8  67.2  67.3  74.6 / 74.6  74.4 / 74.4  66.4 / 64.8  70.0 / 68.9
A → C   70.1  75.7  79.2  70.4  74.1  74.3  81.8 / 81.8  81.6 / 81.6  84.4 / 84.4  84.5 / 84.5
C → D   75.9  79.5  85.0  66.0  69.8  70.2  87.1 / 87.1  87.2 / 87.2  70.1 / 70.0  78.6 / 78.6
C → W   65.2  70.7  74.4  59.2  63.8  63.8  78.3 / 78.3  78.5 / 78.5  80.0 / 80.4  73.5 / 73.4
C → A   85.8  87.1  89.3  75.2  76.6  76.7  89.9 / 89.9  89.7 / 89.7  82.4 / 82.2  83.6 / 83.5
Mean    70.3  77.8  81.9  68.6  74.5  74.6  84.1 / 84.1  84.1 / 84.1  79.6 / 79.4  80.0 / 79.7

Discussion

  • Visual adaptation on deep learning DA features (decaf6 [Donahue et al., 2014]).
  • Parameter validation performed using circular validation.
  • Clear advantage to the mapping estimation methods.

slide-39
SLIDE 39

Seamless copy in images

Poisson image editing [Pérez et al., 2003]

  • Let f_t be the target image, f_s the source image, and Ω a region of the image.
  • Poisson editing aims at solving for f with Dirichlet boundary conditions:

    min_f ∫∫_Ω |∇f − v|²  with  f|∂Ω = f_t|∂Ω.    (13)

  • Here v = ∇f_s|_Ω is given as the gradient of the source image f_s over Ω.
  • Equivalent to solving the following Poisson equation [Pérez et al., 2003]:

    Δf = div v over Ω,  with  f|∂Ω = f_t|∂Ω.    (14)

  • Using a first-order discretization, the problem is a large sparse linear system.
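In 1-D the discretized Poisson equation becomes a tridiagonal system, which the Thomas algorithm solves in linear time; a self-contained sketch (grid and values are illustrative, and real Poisson editing works on a 2-D sparse system instead):

```python
def solve_poisson_1d(g, left, right):
    """Solve the 1-D discrete Poisson equation f'' = g with Dirichlet
    boundary values f[0] = left, f[-1] = right, i.e. the tridiagonal system
    f[i-1] - 2 f[i] + f[i+1] = g[i], via the Thomas algorithm."""
    n = len(g)                      # number of interior unknowns
    d = list(g)                     # right-hand side, folding in boundaries
    d[0] -= left
    d[-1] -= right
    c = [0.0] * n                   # modified upper diagonal
    for i in range(n):              # forward sweep (diag -2, off-diags 1)
        denom = -2.0 - (c[i - 1] if i else 0.0)
        c[i] = 1.0 / denom
        d[i] = (d[i] - (d[i - 1] if i else 0.0)) / denom
    f = [0.0] * n                   # back substitution
    f[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        f[i] = d[i] - c[i] * f[i + 1]
    return [left] + f + [right]

# f(x) = x^2 has f'' = 2; boundaries f(0) = 0, f(4) = 16 on a unit grid
print(solve_poisson_1d([2.0, 2.0, 2.0], 0.0, 16.0))
```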


slide-41
SLIDE 41

Seamless copy with gradient adaptation

Poisson image editing with gradient adaptation

  • Poisson image editing leads to false colors in practice.
  • We propose to adapt the gradients from the source to the target domain:

    Δf = div T_{s→t}(v) over Ω,  with  f|∂Ω = f_t|∂Ω.    (15)

  • T_{s→t} : R⁶ → R⁶ is the mapping between gradients of the source and target images in the domain.


slide-46
SLIDE 46

In POT

slide-47
SLIDE 47

Optimal Transport for music transcription

introduction to the problem
a solution with OT
some results

Joint work with Rémi Flamary, Cédric Févotte, Valentin Emiya

slide-48
SLIDE 48

Automatic music transcription: tracking note spectra

slide-49
SLIDE 49

Short-term spectrum of notes

slide-50
SLIDE 50

Baseline: PLCA (Smaragdis et al., 2006)

(from Smaragdis 2013)

Estimate the transcription H = [h_1, …, h_N] ∈ R_+^{K×N} from V ∈ R_+^{M×N} and W ∈ R_+^{M×K} by solving

    min_{H ≥ 0} D_KL(V | WH)  s.t.  ∀n, ||h_n||_1 = 1

where D_KL(v | v̂) = Σ_i v_i log(v_i / v̂_i) and D_KL(V | V̂) = Σ_n D_KL(v_n | v̂_n).
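The single-frame unmixing above can be sketched with the standard multiplicative/EM update for KL minimization under a simplex constraint (a sketch, not the exact PLCA implementation; templates and observation are toy values):

```python
def plca_unmix(v, W, n_iter=500):
    """Estimate activations h >= 0 with ||h||_1 = 1 minimizing D_KL(v | W h),
    via multiplicative (EM-style) updates for a single frame.
    W is M x K with l1-normalized columns; v is an l1-normalized spectrum."""
    M, K = len(W), len(W[0])
    h = [1.0 / K] * K
    for _ in range(n_iter):
        vhat = [sum(W[i][k] * h[k] for k in range(K)) for i in range(M)]
        h = [h[k] * sum(W[i][k] * v[i] / vhat[i]
                        for i in range(M) if v[i] > 0)
             for k in range(K)]
        s = sum(h)                      # renormalize onto the simplex
        h = [x / s for x in h]
    return h

# Two "note templates"; the observation is exactly the first template,
# so the activations should concentrate on component 0
W = [[0.6, 0.1],
     [0.3, 0.3],
     [0.1, 0.6]]
v = [0.6, 0.3, 0.1]
h = plca_unmix(v, W)
```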

slide-51
SLIDE 51

Comparing two note spectra

slide-52
SLIDE 52

Comparing note spectra with usual metrics

Usual metrics (Euclidean, KL, IS) are separable:

    d_p(u, v) = Σ_i |u_i − v_i|^p,    d_KL(u, v) = Σ_i u_i log(u_i / v_i)

Separability is good for designing solvers like PLCA, but...

  • The actual comparison is frequency-wise, with variability in amplitudes.
  • Any variability in frequency is measured frequency-wise as a variability in amplitude.
  • Some partials of a true note may be missed:
      • the true note may not be well estimated
      • other notes may be estimated: octave, fifth, and so on
slide-53
SLIDE 53

Variability in frequency and amplitude

  • Variability in f0 due to tuning
  • Variability in peak shape due to window choice
  • Variability in peak shape due to modulations:
      • f0 modulation: varying pitch
      • beats due to multiple strings
      • notes at unison from various players
  • Variability in frequency distribution due to inharmonicity: f_h = h f_0 √(1 + βh²)
  • Variability in amplitudes due to timbre
  • Variability in amplitudes in time due to attenuation and beats

slide-54
SLIDE 54

Optimal Transport for music transcription

introduction to the problem
a solution with OT
some results

slide-55
SLIDE 55

Objective: finding the optimal transport from u to v

Let us consider two vectors u and v to be compared by OT (e.g., two magnitude spectra). What is the best way to transport energy from u to v? Main issues:

  1. How to transport energy from u to v?
     → using a transportation matrix T.
  2. What does it cost?
     → specify a (unitary-)cost matrix C.
  3. How to find the optimal transportation?
     → by solving a linear program.

slide-56
SLIDE 56

Transportation matrices T

Let u ∈ R_+^{Nu} and v ∈ R_+^{Nv} such that ||u||_1 = ||v||_1 = 1.
We want to transport u to v. Let t_ij be the part of u_i transported to v_j:

[Figure: a transportation matrix T between u (rows) and v (columns)]

Transportation from u to v is valid iff

  • For any i, u_i is distributed among all v_j's: Σ_j t_ij = u_i, i.e., T 1_{Nv} = u.
  • For any j, all contributions to v_j sum up to v_j: Σ_i t_ij = v_j, i.e., T^T 1_{Nu} = v.
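The two marginal constraints are easy to check; a small sketch with made-up matrices:

```python
def is_valid_transport(T, u, v, tol=1e-9):
    """Check the two marginal constraints: T 1 = u (rows) and T^T 1 = v (columns)."""
    rows_ok = all(abs(sum(row) - ui) <= tol for row, ui in zip(T, u))
    cols_ok = all(abs(sum(T[i][j] for i in range(len(T))) - vj) <= tol
                  for j, vj in enumerate(v))
    return rows_ok and cols_ok

u = [0.4, 0.6]
v = [0.5, 0.5]
T_good = [[0.4, 0.0], [0.1, 0.5]]   # rows sum to u, columns sum to v
T_bad  = [[0.4, 0.0], [0.5, 0.1]]   # rows match u, but columns give [0.9, 0.1]
print(is_valid_transport(T_good, u, v))  # True
print(is_valid_transport(T_bad, u, v))   # False
```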

slide-57
SLIDE 57

Transportation matrices T

Let u ∈ R_+^{Nu} and v ∈ R_+^{Nv} such that ||u||_1 = ||v||_1 = 1.
We want to transport u to v. Let t_ij be the part of u_i transported to v_j.

Definition: set of transportation matrices for (u, v)

    Θ ≜ { T ∈ R_+^{Nu×Nv} : T 1_{Nv} = u and T^T 1_{Nu} = v }

slide-58
SLIDE 58

Cost matrices C

Let c_ij ≥ 0 be the cost to transport one unit from u_i to v_j: one may choose all c_ij's and gather them into a matrix C ∈ R_+^{Nu×Nv}. Examples to compare two spectra:

  • Quadratic cost C2: c_ij = |f_i − f_j|^p (p > 0). Only allows local displacements.
  • Harmonic cost Ch: allows displacement of the observed energy to any possible f0 candidate.

→ Transporting t_ij from u_i to v_j costs c_ij t_ij.

slide-59
SLIDE 59

Optimal transportation divergence as an optimization problem

Given a cost matrix C, how do we find the optimal transportation from u to v?
→ Find T ∈ Θ such that the total cost Σ_ij c_ij t_ij is minimal.

Optimal transportation divergence

    D_C(u | v) ≜ min_{T ≥ 0} ⟨T, C⟩  s.t.  T 1_{Nv} = u and T^T 1_{Nu} = v

where ⟨T, C⟩ = Σ_ij c_ij t_ij.

  • This is a linear program with convex constraints.
  • Computing D_C(u | v) implies solving an optimization problem.
  • Particular case c_ij = |f_i − f_j|^p: D_C(u | v) is a metric called the Wasserstein distance or earth mover's distance.
  • In the general case, D_C(u | v) is not a metric; we call it a divergence.

slide-60
SLIDE 60

From PLCA to optimal spectral transportation with a fixed dictionary W

PLCA

    min_{H ≥ 0} D_KL(V | WH)  s.t.  ∀n, ||h_n||_1 = 1

Unmixing with OT

    min_{H ≥ 0} D_C(V | WH)  s.t.  ∀n, ||h_n||_1 = 1

  • C may be adjusted to allow local displacement (e.g., the quadratic cost c_ij = (f_i − f_j)²).
  • Requires the columns of W to be appropriate note templates.
  • Not robust to variability in spectral envelopes.

slide-61
SLIDE 61

Harmonic-invariant transportation with a Dirac dictionary

Principle: allow energy at f_i to be transported to a fundamental frequency f_j = f_i/q for any positive integer q. Harmonic-invariant cost Ch defined as

    c_ij = min_{q = 1, …, ⌈f_i/f_j⌉} (f_i − q f_j)² + ε 1_{q≠1},

where ε is a small positive value. Main features:

  • the term ε 1_{q≠1} discriminates octaves
  • the dictionary W can be composed of Diracs: w_ik = 1_{f_i = ν_k}, where ν_k is the fundamental frequency of the k-th note
  • such a dictionary allows significant algorithmic and computational enhancements
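A sketch of building the harmonic-invariant cost (the function name, frequencies and ε are illustrative):

```python
import math

def harmonic_cost(freqs, f0s, eps=1e-2):
    """Harmonic-invariant cost: c[i][j] = min over positive integers q
    (up to ceil(f_i / f_j)) of (f_i - q * f_j)^2, plus a small penalty eps
    whenever q != 1, so energy at any harmonic of an f0 candidate f_j
    is cheap to move down to f_j while octaves stay distinguishable."""
    C = []
    for fi in freqs:
        row = []
        for fj in f0s:
            qmax = max(1, math.ceil(fi / fj))
            best = min((fi - q * fj) ** 2 + (eps if q != 1 else 0.0)
                       for q in range(1, qmax + 1))
            row.append(best)
        C.append(row)
    return C

# Energy observed at 880 Hz is cheap to assign to an f0 of 440 Hz (q = 2),
# but expensive to assign to 500 Hz
C = harmonic_cost([880.0], [440.0, 500.0])
```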

slide-62
SLIDE 62

OT unmixing with a pre-learned dictionary and quadratic cost

Original problem:

    min_{H ≥ 0} D_C(V | WH)  s.t.  ∀n, ||h_n||_1 = 1

Using separability in time (n) and introducing the transportation matrix, it is equivalent to solve, for any n,

    min_{h_n ≥ 0, T ≥ 0} ⟨T, C⟩  s.t.  T 1_M = v_n  and  T^T 1_M = W h_n

  • this is a linear program
  • with a large number of variables (M² + K ≈ 10⁵)


slide-67
SLIDE 67

OT unmixing with a Dirac dictionary and harmonic cost

Dimension reduction of T and C:

  • K < M notes in the Dirac dictionary W
  • one non-zero coefficient per column of W ⇒ M − K zeros in ṽ = W h ⇒ zeros in the related columns of T ⇒ T and C can be reduced to their useful columns T̃ and C̃

Resulting problem: for any n,

    min_{h_n ≥ 0, T̃ ≥ 0} ⟨T̃, C̃⟩  s.t.  T̃ 1_K = v_n  and  T̃^T 1_M = h_n

plus subsequent decoupling w.r.t. the rows of T̃ ⇒ O(M) per frame (PLCA: O(KM) per iteration).
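With the Dirac dictionary, the reduced per-frame problem decouples over frequency bins and has the closed-form solution sketched below: each bin's mass goes entirely to its cheapest note (toy cost values are illustrative):

```python
def ost_unmix(v, Ctilde):
    """Closed-form OST with a Dirac dictionary: the per-frame problem
    decouples over frequency bins, so each bin i sends all of its mass
    v[i] to the note k with the smallest reduced cost Ctilde[i][k];
    the activation h collects the mass received by each note."""
    K = len(Ctilde[0])
    h = [0.0] * K
    for vi, row in zip(v, Ctilde):
        k = min(range(K), key=lambda j: row[j])  # cheapest note for this bin
        h[k] += vi
    return h

# Two notes; bins 0 and 2 are cheapest for note 0, bin 1 for note 1,
# so note 0 receives the mass of bins 0 and 2
Ctilde = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9]]
v = [0.5, 0.3, 0.2]
h = ost_unmix(v, Ctilde)
```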

slide-68
SLIDE 68

Adding regularisation

Entropic regularisation (OSTe):

  • add the penalty λ Σ_{i,k} t̃_ik log(t̃_ik)
  • computational complexity per frame in O(KM)

Group regularisation (OSTg):

  • add the penalty λ Σ_k √(||t̃_k||_1)
  • majorization-minimization algorithm (since there is no closed-form solution)

Using both regularisations simultaneously is also possible.

slide-69
SLIDE 69

Optimal Transport for music transcription

introduction to the problem
a solution with OT
some results

slide-70
SLIDE 70

Toy experiments: settings

  • Synthetic dictionary: 8 harmonic spectral templates with a Gaussian-shaped window and an exponential decay in the spectral envelope
  • Observation 1 generated by mixing the 1st and 4th components with a perturbation in frequency
  • Observation 2 generated by mixing the 1st and 6th components with a perturbation in the spectral envelope
  • ℓ1-error performance: ||ĥ − h_true||_1
slide-71
SLIDE 71

Toy experiments: unmixing with shifted fundamental frequencies

Method     PLCA   OTh    OST    OSTg   OSTe   OSTe+g
ℓ1 error   0.900  0.340  0.534  0.021  0.660  0.015
Time (s)   0.057  6.541  0.006  0.007  0.007  0.013

slide-72
SLIDE 72

Toy experiments: unmixing with wrong harmonic amplitudes

Method     PLCA   OTh    OST    OSTg   OSTe   OSTe+g
ℓ1 error   0.791  0.430  0.971  0.045  0.911  0.048
Time (s)   0.019  6.529  0.006  0.006  0.005  0.010

slide-73
SLIDE 73

Transcription of real musical data: results

Recognition performance (F-measure values) and average computational unmixing times

MAPS dataset file IDs      PLCA    PLCA+noise  OST    OST+noise  OSTe   OSTe+noise
chpn_op25_e4_ENSTDkAm      0.679   0.671       0.566  0.564      0.695  0.695
mond_2_SptkBGAm            0.616   0.713       0.470  0.534      0.610  0.607
mond_2_SptkBGCl            0.645   0.687       0.583  0.676      0.695  0.730
muss_1_ENSTDkAm            0.613   0.478       0.513  0.550      0.671  0.667
muss_2_AkPnCGdD            0.587   0.574       0.531  0.611      0.667  0.675
mz_311_1_ENSTDkCl          0.561   0.593       0.580  0.628      0.625  0.665
mz_311_1_StbgTGd2          0.663   0.617       0.701  0.718      0.747  0.747
Average                    0.624   0.619       0.563  0.612      0.673  0.684
Time (s)                   14.861  15.420      0.004  0.005      0.210  0.202

slide-74
SLIDE 74

Conclusions and future works

Conclusions

  • OT models are able to model variability in amplitude and frequency
  • they do not require the design of a sophisticated dictionary
  • computationally efficient solutions are provided

A Python implementation of OST and a real-time demonstrator are available at https://github.com/rflamary/OST

Future works

  • design new cost matrices C
  • add time structure to the model
  • larger experiments needed

slide-75
SLIDE 75

References I

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. (2010). A theory of learning from different domains. Machine Learning, 79(1-2):151–175.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138.

Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193.

Bruzzone, L. and Marconcini, M. (2010). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

slide-76
SLIDE 76

References II

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transportation. In NIPS, pages 2292–2300.

Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014). DeCAF: a deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, pages 647–655.

Ferradans, S., Papadakis, N., Peyré, G., and Aujol, J.-F. (2014). Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3).

Flamary, R., Févotte, C., Courty, N., and Emiya, V. (2016). Optimal spectral transportation with application to music transcription. In Neural Information Processing Systems (NIPS).

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061.

slide-77
SLIDE 77

References III

Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2013). A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pages 738–746, Atlanta, USA.

Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE.

Hoffman, J., Rodner, E., Donahue, J., Saenko, K., and Darrell, T. (2013). Efficient learning of domain-invariant image representations. In International Conference on Learning Representations.

Kantorovich, L. (1942). On the translocation of masses. C.R. (Doklady) Acad. Sci. URSS (N.S.), 37:199–201.

Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. (2013). Transfer feature learning with joint distribution adaptation. In ICCV, pages 2200–2207.

Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. (2014). Transfer joint matching for unsupervised domain adaptation. In CVPR, pages 1410–1417.

slide-78
SLIDE 78

References IV

Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale.

Nakhostin, S., Courty, N., Flamary, R., Tuia, D., and Corpetti, T. (2016). Supervised planetary unmixing with optimal transport. In Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS).

Pérez, P., Gangnet, M., and Blake, A. (2003). Poisson image editing. ACM Transactions on Graphics, 22(3).

Perrot, M., Courty, N., Flamary, R., and Habrard, A. (2016). Mapping estimation for discrete optimal transport. In Neural Information Processing Systems (NIPS).

Perrot, M. and Habrard, A. (2015). Regressive virtual metric learning. In Advances in Neural Information Processing Systems, pages 1810–1818.

Gopalan, R., Li, R., and Chellappa, R. (2014). Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear.

slide-79
SLIDE 79

References V

Redko, I., Habrard, A., and Sebban, M. (2016). Theoretical analysis of domain adaptation with optimal transport. ArXiv e-prints.

Rolet, A., Cuturi, M., and Peyré, G. (2016). Fast dictionary learning with a smoothed Wasserstein loss. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 630–638.

Rousselle, D. and Canu, S. (2015). Optimal transport for semi-supervised domain adaptation. In ESANN.

Rubner, Y., Tomasi, C., and Guibas, L. (1998). A metric for distributions with applications to image databases. In ICCV, pages 59–66.

Si, S., Tao, D., and Geng, B. (2010). Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942.

Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66.

slide-80
SLIDE 80

References VI

Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., and Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440.

Tuia, D., Flamary, R., Rakotomamonjy, A., and Courty, N. (2015). Multitemporal classification without new labels: a solution with optimal transport. In 8th International Workshop on the Analysis of Multitemporal Remote Sensing Images.

Zen, G., Ricci, E., and Sebe, N. (2014). Simultaneous ground metric learning and matrix factorization with earth mover's distance. In ICPR, pages 3690–3695.