SLIDE 1
On the Statistical Rate of Nonlinear Recovery in Generative Models with Heavy-tailed Data
Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang
University of Southern California, Princeton University, and Northwestern University
June 12th, 2019
SLIDE 2
SLIDE 3
Generative Model vs Sparsity in Signal Recovery
Classical sparsity: the structure of the signal depends on the choice of basis.
Generative model: explicit parametrization of the low-dimensional signal manifold.
SLIDE 4
Generative Model vs Sparsity in Signal Recovery
Previous works: [Bora et al. 2017], [Hand et al. 2018], [Mardani et al. 2017].
SLIDE 5
Nonlinear Recovery via Generative Models
Given: a generative model $G : \mathbb{R}^k \to \mathbb{R}^d$ and a measurement matrix $X \in \mathbb{R}^{m \times d}$.
SLIDE 6
Nonlinear Recovery via Generative Models
Goal: recover $G(\theta^*)$ up to scaling from nonlinear observations $y = f(XG(\theta^*))$.
SLIDE 7
Nonlinear Recovery via Generative Models
Challenges:
1. High-dimensional recovery: $k \ll d$, $m \ll d$.
SLIDE 8
Nonlinear Recovery via Generative Models
Challenges:
2. Non-Gaussian $X$ and unknown nonlinearity $f$.
SLIDE 9
Nonlinear Recovery via Generative Models
Challenges:
3. Observations $y$ can be heavy-tailed.
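The setup and the three challenges can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the two-layer zero-bias ReLU generator, the logistic design, the link `f`, the Student-t noise, and all dimensions are hypothetical choices made for illustration, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, m = 5, 500, 100  # latent dim, signal dim, number of measurements (k << d, m << d)

# Hypothetical generator: two-layer ReLU network with zero bias.
W1 = rng.standard_normal((20, k)) / np.sqrt(k)
W2 = rng.standard_normal((d, 20)) / np.sqrt(20)

def G(theta):
    """Zero-bias ReLU generator mapping R^k to R^d."""
    return np.maximum(W2 @ np.maximum(W1 @ theta, 0.0), 0.0)

theta_star = rng.standard_normal(k)
signal = G(theta_star)

# Non-Gaussian design: i.i.d. standard logistic entries (illustrative choice).
X = rng.logistic(size=(m, d))

def f(z):
    """Link function; treated as unknown by the estimators (illustrative choice)."""
    return z + np.tanh(z)

# Heavy-tailed noise: Student-t with 5 degrees of freedom has finite
# moments only of order strictly below 5.
y = f(X @ signal) + rng.standard_t(df=5, size=m)
```

Because the generator has no bias terms, it is positively homogeneous (`G(c * theta) == c * G(theta)` for `c > 0`), which is consistent with recovery only being possible up to scaling.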
SLIDE 10
Our Method: Stein + Adaptive Thresholding
Suppose the rows of $X := [X_1, \dots, X_m]^\top \in \mathbb{R}^{m \times d}$ have density $p : \mathbb{R}^d \to \mathbb{R}$. Define the (row-wise) score transformation:
$$S_p(X) := [S_p(X_1), \dots, S_p(X_m)]^\top = [\nabla \log p(X_1), \dots, \nabla \log p(X_m)]^\top.$$
SLIDE 11
Our Method: Stein + Adaptive Thresholding
(First-order) Stein’s identity: when $\mathbb{E} f'(\langle X_i, G(\theta^*) \rangle) > 0$,
$$\mathbb{E}\left[ S_p(X)^\top y \right] \propto G(\theta^*).$$
(Second-order) Stein’s identity: when $\mathbb{E} f''(\langle X_i, G(\theta^*) \rangle) > 0$ and $\delta$ is a constant,
$$\mathbb{E}\left[ S_p(X)^\top \operatorname{diag}(y)\, S_p(X) \right] \propto G(\theta^*) G(\theta^*)^\top + \delta \cdot I_{d \times d}.$$
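The first-order identity can be checked numerically by Monte Carlo. This is a minimal sketch assuming i.i.d. standard logistic coordinates, whose log-density has derivative $-\tanh(x/2)$. The fixed unit vector `beta` stands in for $G(\theta^*)$, and the link `z + tanh(z)` is a hypothetical choice. Integration by parts gives $\mathbb{E}[y \, \nabla \log p(X_i)] = -\mathbb{E} f'(\langle X_i, \beta \rangle)\, \beta$ for a smooth density, so the sketch averages against the negated gradient to obtain a positive proportionality constant. That sign handling is an assumption of this sketch, not something the slide specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

def score_logistic(X):
    """Gradient of the log-density for i.i.d. standard logistic coordinates."""
    return -np.tanh(X / 2.0)

# Stand-in for G(theta*): a fixed unit vector (hypothetical).
beta = np.array([3.0, -1.0, 2.0, 0.0, 1.0])
beta /= np.linalg.norm(beta)

X = rng.logistic(size=(n, d))
z = X @ beta
y = z + np.tanh(z)  # hypothetical link with f' > 0 everywhere

S = score_logistic(X)
first_moment = -(S.T @ y) / n  # empirical version of -E[y * grad log p(X)]

# Cosine similarity between the empirical moment and the target direction.
cosine = first_moment @ beta / np.linalg.norm(first_moment)
print(f"cosine similarity with beta: {cosine:.4f}")
```

A cosine close to 1 indicates that the score-weighted average points along `beta`, even though the estimator never uses the form of the link.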
SLIDE 12
Our Method: Stein + Adaptive Thresholding
Adaptive thresholding: suppose $\|y_i\|_{L_q} < \infty$ for some $q > 4$, and let $\tau_m \propto m^{2/q}$. Define
$$\tilde{y}_i = \operatorname{sign}(y_i) \cdot (|y_i| \wedge \tau_m), \quad i \in \{1, 2, \dots, m\}.$$
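The thresholding step is a symmetric clip of each observation at a level that grows with the sample size. A minimal sketch follows; the function name `truncate` and the proportionality constant `c` are hypothetical, since the slide only gives the rate $m^{2/q}$.

```python
import numpy as np

def truncate(y, q, c=1.0):
    """Clip observations at tau_m = c * m**(2/q), keeping their signs."""
    m = y.shape[0]
    tau = c * m ** (2.0 / q)
    y_trunc = np.sign(y) * np.minimum(np.abs(y), tau)
    return y_trunc, tau

y = np.array([-100.0, -1.0, 0.0, 2.0, 50.0])
y_trunc, tau = truncate(y, q=5.0)
print(tau, y_trunc)
```

Values whose magnitude is below `tau` pass through unchanged, while outliers are pulled back to `±tau`. The result is the same as `np.clip(y, -tau, tau)`.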
SLIDE 13
Our Method: Stein + Adaptive Thresholding
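Least-squares estimator (with $\tilde{y}$ the thresholded observations):
$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^k} \left\| G(\theta) - \frac{1}{m} S_p(X)^\top \tilde{y} \right\|_2^2.$$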
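For intuition, the least-squares estimator can be sketched in the simplest case of a linear generator $G(\theta) = W\theta$, where the minimization reduces to ordinary least squares. This is an illustrative stand-in, not the ReLU setting of the talk, which would need an iterative solver. The logistic design, the link, the noise, the constant in `tau`, and the use of the negated log-density gradient `tanh(X / 2)` (chosen so that the proportionality constant is positive) are all assumptions of the sketch.

```python
import numpy as np

def least_squares_linear(W, S, y_trunc):
    """Solve min_theta ||W theta - (1/m) S^T y_trunc||_2^2 for G(theta) = W theta."""
    m = S.shape[0]
    target = S.T @ y_trunc / m
    theta_hat, *_ = np.linalg.lstsq(W, target, rcond=None)
    return theta_hat

rng = np.random.default_rng(0)
k, d, m, q = 3, 50, 20_000, 4.5

W, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns
theta_star = rng.standard_normal(k)
theta_star /= np.linalg.norm(theta_star)
signal = W @ theta_star                            # unit-norm target signal

X = rng.logistic(size=(m, d))                      # non-Gaussian design
z = X @ signal
y = z + np.tanh(z) + rng.standard_t(df=5, size=m)  # unknown link, heavy-tailed noise

tau = m ** (2.0 / q)                               # thresholding level, constant set to 1
y_trunc = np.clip(y, -tau, tau)

S = np.tanh(X / 2.0)                               # negated score of the logistic density
theta_hat = least_squares_linear(W, S, y_trunc)

estimate = W @ theta_hat
cosine = estimate @ signal / np.linalg.norm(estimate)
print(f"cosine similarity with the true signal: {cosine:.4f}")
```

Only the direction of `estimate` is meaningful here, since the scale absorbs the unknown constant from Stein's identity.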
SLIDE 14
Our Method: Stein + Adaptive Thresholding
Main performance theorem:
Theorem (Wei, Yang and Wang, 2019)
For any accuracy level $\varepsilon \in (0, 1]$, suppose (1) $\mathbb{E} f'(\langle X_i, G(\theta^*) \rangle) > 0$, (2) the generative model $G$ is a ReLU network with zero bias, and (3) the number of measurements satisfies $m \propto k \varepsilon^{-2} \log d$. Then, with high probability,
$$\left\| \frac{G(\hat{\theta})}{\|G(\hat{\theta})\|_2} - \frac{G(\theta^*)}{\|G(\theta^*)\|_2} \right\|_2 \le \varepsilon.$$
Similar results hold for more general Lipschitz generators $G$.
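The error metric and the sample-size scaling in the theorem can be written down directly. This is a small sketch: the helper names are hypothetical, and the constant `C` is a placeholder because the theorem only states a proportionality.

```python
import numpy as np

def normalized_error(g_hat, g_star):
    """Euclidean distance between the unit-normalized estimate and target."""
    g_hat = np.asarray(g_hat, dtype=float)
    g_star = np.asarray(g_star, dtype=float)
    return np.linalg.norm(g_hat / np.linalg.norm(g_hat)
                          - g_star / np.linalg.norm(g_star))

def measurements_scale(k, eps, d, C=1.0):
    """Scaling m ~ C * k * eps**-2 * log(d); C is an unspecified placeholder."""
    return C * k * np.log(d) / eps ** 2

# Rescaling the estimate does not change the error.
print(normalized_error([2.0, 0.0], [5.0, 0.0]))
# Halving the target accuracy quadruples the required number of measurements.
print(measurements_scale(10, 0.05, 1000) / measurements_scale(10, 0.1, 1000))
```

The metric is invariant to positive rescaling of either argument, which matches the goal of recovering $G(\theta^*)$ only up to scaling.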
SLIDE 15
Our Method: Stein + Adaptive Thresholding
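PCA-type estimator (with $\tilde{y}$ the thresholded observations):
$$\hat{\theta} \in \operatorname*{argmax}_{\|G(\theta)\|_2 = 1} G(\theta)^\top S_p(X)^\top \operatorname{diag}(\tilde{y})\, S_p(X)\, G(\theta).$$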
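In the special case of a linear generator $G(\theta) = W\theta$ with orthonormal columns, the constraint $\|G(\theta)\|_2 = 1$ becomes $\|\theta\|_2 = 1$, and the PCA-type estimator reduces to a top-eigenvector computation. The sketch below covers only that illustrative case: the function name is hypothetical, and a ReLU generator would need an iterative solver instead.

```python
import numpy as np

def pca_estimator_linear(W, S, y_trunc):
    """Top eigenvector of W^T [(1/m) S^T diag(y_trunc) S] W.

    Assumes W has orthonormal columns, so that ||W theta||_2 = ||theta||_2.
    The 1/m factor does not change the maximizer.
    """
    m = S.shape[0]
    M = S.T @ (y_trunc[:, None] * S) / m
    eigvals, eigvecs = np.linalg.eigh(W.T @ M @ W)  # eigenvalues in ascending order
    return eigvecs[:, -1]

# Tiny deterministic example: d = 3, k = 2, m = 2.
W = np.eye(3)[:, :2]
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
y_trunc = np.array([1.0, 1.0])

theta_hat = pca_estimator_linear(W, S, y_trunc)
print(W @ theta_hat)
```

The eigenvector is determined only up to sign, so an estimate obtained this way identifies the signal direction up to a factor of ±1.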
SLIDE 16
Our Method: Stein + Adaptive Thresholding
Main performance theorem:
Theorem (Wei, Yang and Wang, 2019)
For any accuracy level $\varepsilon \in (0, 1]$, suppose (1) $\mathbb{E} f''(\langle X_i, G(\theta^*) \rangle) > 0$, (2) the generative model $G$ is a ReLU network with zero bias, and (3) the number of measurements satisfies $m \propto k \varepsilon^{-2} \log d$. Then, with high probability,
$$\left\| G(\hat{\theta}) - \frac{G(\theta^*)}{\|G(\theta^*)\|_2} \right\|_2 \le \varepsilon.$$
Similar results hold for more general Lipschitz generators $G$.
SLIDE 17