Rémi Gribonval Inria Rennes - Bretagne Atlantique
remi.gribonval@inria.fr
Contributors & Collaborators
Anthony Bourrier, Nicolas Keriven, Yann Traonmilin, Tomer Peleg, Gilles Puy, Mike Davies, Patrick Perez, Gilles Blanchard
Agenda
- From Compressive Sensing to Compressive Learning?
- Information-preserving projections & sketches
- Compressive Clustering / Compressive GMM
- Conclusion
Machine Learning
Available data
- training collection of feature vectors = point cloud
Goals
- infer parameters to achieve a certain task
- generalization to future samples drawn from the same probability distribution
Examples
- PCA: principal subspace
- Dictionary learning: dictionary
- Clustering: centroids
- Classification: classifier parameters (e.g. support vectors)
Challenging dimensions
Point cloud = large matrix of feature vectors X = [x1 x2 … xN]
- high feature dimension n
- large collection size N
Challenge: compress before learning?
Compressive Machine Learning?
Point cloud = large matrix of feature vectors X = [x1 x2 … xN]
Reduce the feature dimension: Y = MX = [y1 y2 … yN]
- (random) feature projection [Calderbank & al 2009, Reboredo & al 2013]
- exploits / needs a low-dimensional feature model
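As a quick illustration, here is a minimal numpy sketch of the random feature projection Y = MX; the dimensions and the Gaussian choice of M are illustrative assumptions. Note that the number of columns N is unchanged, which is the limitation discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, m = 1000, 5000, 50                   # feature dimension, collection size, reduced dimension
X = rng.normal(size=(n, N))                # point cloud: one column per feature vector x_i
M = rng.normal(size=(m, n)) / np.sqrt(m)   # random feature projection (illustrative choice)
Y = M @ X                                  # Y = MX: shorter features, but still N columns
print(X.shape, Y.shape)                    # (1000, 5000) -> (50, 5000)
```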
Challenges of large collections
Feature projection (Y = MX) has limited impact: it reduces the feature dimension n, not the collection size N
“Big Data” challenge: compress the collection size
Compressive Machine Learning?
Point cloud = … empirical probability distribution
Reduce the collection dimension:
- coresets, see e.g. [Agarwal & al 2003, Feldman 2010]
- sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005]
Sketching operator M: X ↦ z ∈ R^m
- nonlinear in the feature vectors
- linear in their probability distribution
Example: Compressive Clustering
X → sketch z ∈ R^m → recovery algorithm → estimated centroids
[Figure: estimated centroids vs. ground truth; N = 1000, n = 2, m = 60]
Computational impact of sketching (Ph.D. A. Bourrier & N. Keriven)
[Figures: computation time (s) and memory (bytes) vs. collection size N]
The Sketch Trick
Data distribution: x_i ∼ p(x)
Sketch: z_ℓ = (1/N) Σ_{i=1}^{N} h_ℓ(x_i) ≈ E[h_ℓ(X)] = ∫ h_ℓ(x) p(x) dx
The sketch is nonlinear in the feature vectors but linear in the distribution p(x): a linear “projection” of p.
Analogy
- Signal Processing: signal space (x) → observation space (y) via a linear “projection” M: inverse problems, compressive sensing
- Machine Learning: probability space (p) → sketch space (z) via a linear “projection” M: method of moments, compressive learning
Information preservation?
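To make the sketch z_ℓ = (1/N) Σ_i h_ℓ(x_i) concrete, here is a minimal numpy sketch of the computation. The choice of h as low-order moments (echoing the method-of-moments analogy above) is an illustrative assumption; the talk later samples the characteristic function instead.

```python
import numpy as np

def sketch(X, h):
    """Empirical sketch: z_l = (1/N) * sum_i h_l(x_i), i.e. the average of h over the collection."""
    return h(X).mean(axis=0)

def moment_features(X):
    """Illustrative (hypothetical) choice of h: first moments and upper-triangular second moments."""
    n = X.shape[1]
    iu = np.triu_indices(n)
    second = (X[:, :, None] * X[:, None, :])[:, iu[0], iu[1]]
    return np.concatenate([X, second], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))             # N = 1000 feature vectors in dimension n = 2
z = sketch(X, moment_features)             # z estimates E[h(X)] = integral of h(x) p(x) dx
print(z.shape)                             # (n + n*(n+1)/2,) = (5,)
```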
The Sketch Trick (same setting): Dimension reduction?
Information preserving projections
Stable recovery
Linear “projection” M: signal space R^n → observation space R^m, y = Mx, with m ≪ n
Model set Σ = signals of interest
Example: set of k-sparse vectors, Σ_k = {x ∈ R^n : ‖x‖_0 ≤ k}
Recovery algorithm = “decoder” ∆
Ideal goal: build a decoder with the instance optimality guarantee [Cohen & al 2009]
‖x − ∆(Mx + e)‖ ≤ C‖e‖, ∀x ∈ Σ
Are there such decoders?
Stable recovery of k-sparse vectors
Typical decoders
- ℓ1 minimization, ∆(y) := arg min_{x: Mx=y} ‖x‖_1: LASSO [Tibshirani 1994], Basis Pursuit [Chen & al 1999]
- greedy algorithms: (Orthogonal) Matching Pursuit [Mallat & Zhang 1993], Iterative Hard Thresholding (IHT) [Blumensath & Davies 2009], …
Guarantees
Assume the Restricted Isometry Property (RIP) [Candès & al 2004]:
1 − δ ≤ ‖Mz‖_2² / ‖z‖_2² ≤ 1 + δ whenever ‖z‖_0 ≤ 2k
Then: exact recovery, stability to noise, robustness to model error.
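Since IHT is listed among the decoders, here is a minimal sketch of its iteration x ← H_k(x + μ Mᵀ(y − Mx)); the step size and fixed iteration count are simplistic assumptions, not the tuned algorithm of [Blumensath & Davies 2009].

```python
import numpy as np

def iht(y, M, k, n_iter=500):
    """Minimal Iterative Hard Thresholding: x <- H_k(x + mu * M^T (y - M x))."""
    m, n = M.shape
    mu = 1.0 / np.linalg.norm(M, 2) ** 2      # crude step size (assumption, not a tuned rule)
    x = np.zeros(n)
    for _ in range(n_iter):
        x = x + mu * M.T @ (y - M @ x)        # gradient step on 0.5 * ||y - Mx||^2
        small = np.argsort(np.abs(x))[:-k]    # indices of all but the k largest entries
        x[small] = 0.0                        # hard thresholding H_k
    return x

rng = np.random.default_rng(1)
n, m, k = 100, 40, 5
M = rng.normal(size=(m, n)) / np.sqrt(m)      # random Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
x_hat = iht(M @ x_true, M, k)
print(np.linalg.norm(x_hat - x_true))         # small when recovery succeeds
```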
Stable recovery: low-dimensional models
Same setting: linear “projection” M: R^n → R^m with m ≪ n, model set Σ = signals of interest
Examples of model sets
- sparse vectors
- sparse in a dictionary D
- co-sparse in an analysis operator A: total variation, physics-driven sparse models …
- low-rank matrix or tensor: matrix completion, phase retrieval, blind sensor calibration …
- manifold / union of manifolds: detection, estimation, localization, mapping …
- matrix with sparse inverse: Gaussian graphical models
- given point cloud: database indexing
- Gaussian Mixture Model …
General stable recovery
Signal space: an arbitrary vector space H (instead of R^n)
Linear “projection” M: H → R^m
Model set = signals of interest: an arbitrary set Σ ⊂ H
Recovery algorithm = “decoder” ∆
Ideal goal: build a decoder with the instance optimality guarantee [Cohen & al 2009]
‖x − ∆(Mx + e)‖ ≤ C‖e‖, ∀x ∈ Σ
Are there such decoders?
Stable recovery from arbitrary model sets
Definition: (general) Restricted Isometry Property (RIP) on the secant set Σ − Σ := {x − x′ : x, x′ ∈ Σ}
α‖z‖ ≤ ‖Mz‖ ≤ β‖z‖ whenever z ∈ Σ − Σ
(up to renormalization, α = √(1 − δ), β = √(1 + δ))
Theorem 1: the RIP is necessary
- the RIP holds as soon as there exists an instance optimal decoder
Theorem 2: the RIP is sufficient
- the RIP implies the existence of a decoder with performance guarantees:
  ‖x − ∆(Mx + e)‖ ≤ C(δ)‖e‖ + C′(δ) d(x, Σ)
- exact recovery, stability to noise, bonus: robustness to model error (d(x, Σ) = distance to the model set)
- [Cohen & al 2009] for Σ_k, [Bourrier & al 2014] for arbitrary Σ
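A small numerical illustration of the lower and upper RIP constants on the secant set, under the assumption of a k-sparse model and a Gaussian M (not part of the talk): secants of k-sparse vectors are at most 2k-sparse, so we can sample them directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k, trials = 100, 40, 5, 2000
M = rng.normal(size=(m, n)) / np.sqrt(m)       # random Gaussian projection

ratios = []
for _ in range(trials):
    # a secant z = x - x' of two k-sparse vectors is at most 2k-sparse
    z = np.zeros(n)
    z[rng.choice(n, 2 * k, replace=False)] = rng.normal(size=2 * k)
    ratios.append(np.linalg.norm(M @ z) / np.linalg.norm(z))

# empirical alpha and beta over the sampled secants
print(min(ratios), max(ratios))
```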
Compressive Learning Examples
Compressive Machine Learning
Point cloud = empirical probability distribution; reduce the collection dimension = sketching
Sketching operator M: z_ℓ = (1/N) Σ_{i=1}^{N} h_ℓ(x_i), 1 ≤ ℓ ≤ m, z ∈ R^m
How to choose an information preserving sketch?
Example: Compressive Clustering
Goal: find k centroids
Standard approach = K-means
Sketching approach
- p(x) is spatially localized, so we need “incoherent” sampling: choose Fourier sampling
- sample the characteristic function: z_ℓ = (1/N) Σ_{i=1}^{N} e^{j ω_ℓ^⊤ x_i}, with frequencies ω_ℓ ∈ R^n
- how to choose the sampling frequencies? see poster N. Keriven
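A minimal numpy version of this sampled characteristic function sketch; drawing the frequencies ω_ℓ i.i.d. Gaussian is an illustrative assumption, the actual frequency design being the subject of the poster.

```python
import numpy as np

def characteristic_sketch(X, Omega):
    """z_l = (1/N) * sum_i exp(j * omega_l^T x_i): empirical characteristic function at the rows of Omega."""
    return np.exp(1j * X @ Omega.T).mean(axis=0)

rng = np.random.default_rng(3)
N, n, m, k = 1000, 2, 60, 4
centroids = rng.uniform(-5, 5, size=(k, n))
X = centroids[rng.integers(k, size=N)] + 0.3 * rng.normal(size=(N, n))   # k well-separated clusters

Omega = rng.normal(size=(m, n))              # illustrative i.i.d. Gaussian frequencies (assumption)
z = characteristic_sketch(X, Omega)          # m = 60 complex numbers summarize N = 1000 points
print(z.shape, z.dtype)
```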
Example: Compressive Clustering (continued)
Goal: find k centroids from a sampled characteristic function sketch (N = 1000, n = 2, m = 60)
[Figure: estimated centroids vs. ground truth]
Density model = GMM with variance = identity:
p ≈ Σ_{k=1}^{K} α_k p_{θ_k}, hence z = Mp ≈ Σ_{k=1}^{K} α_k M p_{θ_k}
Recovery algorithm = “decoder” ∆, inspired by Iterative Hard Thresholding
Compressive Hierarchical Splitting (CHS) = extension to general GMMs, similar to OMP with Replacement
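For illustration only, such a sketch can be decoded by fitting the mixture parameters to the sketch in the least-squares sense. The snippet below uses a generic scipy optimizer rather than the IHT-inspired decoder or CHS from the talk, and the fixed variance σ and Gaussian frequencies are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, n, m, k, sigma = 1000, 2, 60, 4, 0.3
true_centroids = rng.uniform(-5, 5, size=(k, n))
X = true_centroids[rng.integers(k, size=N)] + sigma * rng.normal(size=(N, n))
Omega = rng.normal(size=(m, n))                              # sampling frequencies (assumption)
z = np.exp(1j * X @ Omega.T).mean(axis=0)                    # empirical sketch

def component_sketch(c):
    # characteristic function of N(c, sigma^2 I) sampled at the rows of Omega
    return np.exp(1j * Omega @ c - 0.5 * sigma**2 * np.sum(Omega**2, axis=1))

def objective(params):
    C = params[:k * n].reshape(k, n)                         # candidate centroids
    alpha = np.abs(params[k * n:])
    alpha = alpha / alpha.sum()                              # mixture weights
    model = sum(a * component_sketch(c) for a, c in zip(alpha, C))
    return np.linalg.norm(z - model) ** 2                    # residual in the sketch domain

x0 = np.concatenate([X[rng.choice(N, k)].ravel(), np.ones(k)])   # init centroids on random points
res = minimize(objective, x0, method="Nelder-Mead",
               options={"maxiter": 20000, "fatol": 1e-10})
print(res.x[:k * n].reshape(k, n))                           # estimated centroids (up to permutation)
print(true_centroids)
```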
Application: Speaker Verification Results (DET-curves)
Setting
- MFCC coefficients x_i ∈ R^12, ~1000 hours of speech, ~50 GB
- N = 300 000 000 feature vectors; N = 60 000 000 after silence detection
- maximum collection size manageable by EM: N = 300 000
Results with CHS
[Figures: DET curves for EM and for CHS]
- m = 500: close to EM, 7 200 000-fold compression
- m = 1 000: same as EM, 3 600 000-fold compression (two 40-L QR codes)
- m = 5 000: better than EM, exploits the whole collection, 720 000-fold compression (fits 80 on a 3½″ floppy disk)
Computational Efficiency
Computational Aspects
Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^{N} e^{j ω_ℓ^⊤ x_i}
Pipeline: X → WX → h(WX) with h(·) = e^{j(·)} → average → z
~ one-layer random neural net; DNN ~ hierarchical sketching? see also [Bruna & al 2013, Giryes & al 2015]
Privacy-preserving: sketch and forget
Computational Aspects (continued)
Streaming algorithms: one pass over the data, online update of the sketch
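A minimal sketch of the one-pass, online update (a running mean of h(x_i), here with the complex-exponential feature map from above); the generator-based stream is an illustrative assumption.

```python
import numpy as np

def streaming_sketch(stream, Omega):
    """One pass over the data, online update: running mean of exp(j * Omega @ x_i), O(m) memory."""
    z, count = np.zeros(Omega.shape[0], dtype=complex), 0
    for x in stream:
        count += 1
        z += (np.exp(1j * Omega @ x) - z) / count
    return z

rng = np.random.default_rng(5)
Omega = rng.normal(size=(60, 2))
z = streaming_sketch((rng.normal(size=2) for _ in range(10_000)), Omega)   # items seen once, never stored
print(z[:3])
```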
Computational Aspects (continued)
Distributed computing: decentralized (HADOOP) / parallel (GPU)
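Because the sketch is an average of per-item contributions, sketches computed on separate partitions of the collection can be merged by a weighted average; a minimal illustration (the three-way partition is an assumption).

```python
import numpy as np

def merge_sketches(partial):
    """Combine per-partition sketches (z_j, N_j) into the sketch of the whole collection."""
    total = sum(N for _, N in partial)
    return sum(N * z for z, N in partial) / total

rng = np.random.default_rng(6)
Omega = rng.normal(size=(60, 2))
sketch = lambda X: np.exp(1j * X @ Omega.T).mean(axis=0)

parts = [rng.normal(size=(N, 2)) for N in (300, 500, 200)]        # three data partitions (assumption)
z_merged = merge_sketches([(sketch(X), len(X)) for X in parts])
z_full = sketch(np.vstack(parts))
print(np.allclose(z_merged, z_full))                              # True: sketches merge exactly
```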
Conclusion
Projections & Learning
- Compressive sensing: random projections of data items (signal space x → observation space y via M); reduces the dimension of data items
- Compressive learning with sketches: random “projections” of collections (probability space p → sketch space z via M); nonlinear in the feature vectors, linear in their probability distribution; reduces the size of the collection
Summary
Compressive clustering & Compressive GMM
- Bourrier, G. & Perez, Compressive Gaussian Mixture Estimation. ICASSP 2013
- Keriven & G., Compressive Gaussian Mixture Estimation by Orthogonal Matching Pursuit with Replacement. SPARS 2015, Cambridge, United Kingdom
- Keriven & al, Sketching for Large-Scale Learning of Mixture Models (draft)
Unified framework covering projections & sketches: instance optimal decoders ↔ Restricted Isometry Property
- Bourrier & al, Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. IEEE Transactions on Information Theory, 2014
Challenge: compress before learning? Information preservation? Details: poster N. Keriven
Recent / ongoing work / challenges
- Sufficient dimension for the RIP: m = O(d_B(Σ − Σ)). Dimension reduction? Decoders? Details: poster G. Puy
  Puy, Davies & G., Recipes for stable linear embeddings from Hilbert spaces to ℝ^m, hal-01203614; see also EUSIPCO 2015 and [Dirksen 2014]
- RIP for sketches in RKHS applied to compressive GMM: upcoming, Keriven, Bourrier, Perez & G.
- Compressive statistical learning: intrinsic dimension of PCA and other related learning tasks: work in progress, Blanchard & G.
- RIP-based guarantees for general (convex & nonconvex) regularizers: Traonmilin & G., Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all, arXiv:1510.00504; extends the sharp RIP constant 1/√2 [Cai & Zhang 2014] beyond sparsity (low-rank, block/structured, …)
✓ theoretical and algorithmic foundations of large-scale machine learning & signal processing
✓ funded by ERC project PLEASE
Interested? Join the team!