SLIDE 1

Nonconvex Demixing from Bilinear Measurements

Yuanming Shi

SLIDE 2

Outline

 Motivations

  • Blind deconvolution meets blind demixing

 Two vignettes:

  • A: Implicitly regularized Wirtinger flow (Why nonconvex optimization? Implicitly regularized Wirtinger flow)

  • B: Matrix optimization over manifolds (Why manifold optimization? Riemannian optimization for blind demixing)

SLIDE 3

Motivations: Blind deconvolution meets blind demixing

SLIDE 4

Blind deconvolution

 In many science and engineering problems, the observed signal can be modeled as $y = f \ast g$, where $\ast$ is the convolution operator

  • $f$ is a physical signal of interest
  • $g$ is the impulse response of the sensory system

 Applications: astronomy, neuroscience, image processing, computer vision, wireless communications, microscopy data processing,…

 Blind deconvolution: estimate $f$ and $g$ given $y$

SLIDE 5

Image deblurring

 Blurred images due to camera shake can be modeled as a convolution of the latent sharp image and a kernel capturing the motion of the camera

Figure: blurring kernel and natural (sharp) image

How to find the high-resolution image and the blurring kernel simultaneously?

  • Fig. credit: Chi
SLIDE 6

Microscopy data analysis

 Defects: the electronic structure of the material is contaminated by randomly and sparsely distributed “defects”

How to determine the locations and characteristic signatures of the defects?

Figure: doped graphene

  • Fig. credit: Wright
SLIDE 7

Blind demixing

 The received measurement consists of the sum of all convolved signals: $y = \sum_{i=1}^{s} f_i \ast g_i$

 Applications: IoT, dictionary learning, neural spike sorting,…

 Blind demixing: estimate $\{f_i\}$ and $\{g_i\}$ given $y$

Figures: low-latency communication for IoT; convolutional dictionary learning (multi-kernel)

SLIDE 8

Convolutional dictionary learning

 The observation signal is the superposition of several convolutions

Figures: experiment on a synthetic image; experiment on a microscopy image

How to recover multiple kernels and the corresponding activation signals?

  • Fig. credit: Wright
SLIDE 9

Low-latency communications for IoT

 Packet structure: metadata (preamble (PA) and header (H)) and data

 Proposal: transmitters just send overhead-free signals, and the receiver can still extract the information

Figures: long data packet in current wireless systems; short data packet in IoT

How to detect data without channel estimation in multi-user environments?

SLIDE 10

Demixing from bilinear model?

SLIDE 11

Bilinear model

 Translate into the frequency domain…

 Subspace assumptions: $f_i$ and $g_i$ lie in some known low-dimensional subspaces, with low-dimensional representations $h_i \in \mathbb{C}^{K}$ and $x_i \in \mathbb{C}^{N}$, where $K, N \ll L$

 Demixing from bilinear measurements:

$$y_j = \sum_{i=1}^{s} b_j^{\mathsf{H}} h_i \, x_i^{\mathsf{H}} a_{ij}, \qquad j = 1, \dots, L$$

$B = [b_1, \dots, b_L]^{\mathsf{H}}$: partial Fourier basis

SLIDE 12

An equivalent view: low-rank factorization

 Lifting: introduce $M_i = h_i x_i^{\mathsf{H}}$ to linearize the bilinear constraints

 Low-rank matrix optimization problem
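
A possible way to write the lifted problem in the notation above (my formulation, offered as a sketch rather than the slide's exact expression):

$$\min_{\{M_i\}_{i=1}^{s}} \; \sum_{j=1}^{L} \Big| \sum_{i=1}^{s} b_j^{\mathsf{H}} M_i \, a_{ij} - y_j \Big|^2 \quad \text{s.t.} \quad \operatorname{rank}(M_i) = 1, \; i = 1, \dots, s$$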

SLIDE 13


Convex relaxation

 Ling and Strohmer (TIT’2017) proposed to solve the nuclear norm minimization problem:

$$\min_{\{M_i\}} \; \sum_{i=1}^{s} \|M_i\|_{*} \quad \text{s.t.} \quad \sum_{i=1}^{s} b_j^{\mathsf{H}} M_i \, a_{ij} = y_j, \; j = 1, \dots, L$$

  • Sample-efficient: relatively few samples suffice for exact recovery, provided the signals are incoherent w.r.t. the partial Fourier basis $B$

  • Computationally expensive: SDP in the lifting space

Can we solve the nonconvex matrix optimization problem directly?

SLIDE 14


Vignette A: Implicitly regularized Wirtinger flow

SLIDE 15

Why nonconvex optimization?

SLIDE 16

Nonconvex problems are everywhere

 Empirical risk minimization is usually nonconvex

  • low-rank matrix completion
  • blind deconvolution/demixing
  • dictionary learning
  • phase retrieval
  • mixture models
  • deep learning

SLIDE 17

Nonconvex optimization may be super scary

 Challenges: saddle points, local optima, bumps,…

 Fact: they are usually solved on a daily basis via simple algorithms like (stochastic) gradient descent


  • Fig. credit: Chen
SLIDE 18

Statistical models come to rescue

 Blessings: when data are generated by certain statistical models, problems are often much nicer than worst-case instances


  • Fig. credit: Chen
SLIDE 19

First-order stationary points

 Saddle points and local minima:

Figure: local minima vs. saddle points/local maxima

SLIDE 20

First-order stationary points

 Applications: PCA, matrix completion, dictionary learning, etc.

  • Local minima: either all local minima are global minima, or all local minima are as good as global minima

  • Saddle points: very poor compared to global minima; several such points

 Bottom line: local minima are much more desirable than saddle points


How to escape saddle points efficiently?

SLIDE 21

Statistics meets optimization

 Proposal: separation of landscape analysis and generic algorithm design


landscape analysis (statistics): all local minima are global minima

  • dictionary learning (Sun et al. ’15)
  • phase retrieval (Sun et al. ’16)
  • matrix completion (Ge et al. ’16)
  • synchronization (Bandeira et al. ’16)
  • inverting deep neural nets (Hand et al. ’17)
  • ...

generic algorithms (optimization): all the saddle points can be escaped

  • gradient descent (Lee et al. ’16)
  • trust region method (Sun et al. ’16)
  • perturbed GD (Jin et al. ’17)
  • cubic regularization (Agarwal et al. ’17)
  • Natasha (Allen-Zhu ’17)
  • ...

Issue: conservative computational guarantees for specific problems (e.g., phase retrieval, blind deconvolution, matrix completion)

  • Fig. credit: Chen
SLIDE 22

Solution: blending landscape and convergence analysis


implicitly regularized Wirtinger flow

SLIDE 23

A natural least-squares formulation

 Goal: demixing from bilinear measurements

  • Pros: computationally efficient in the natural parameter space
  • Cons: the objective is nonconvex: bilinear constraint, scaling ambiguity

Given: the measurements $\{y_j\}$ and the design vectors $\{b_j\}$, $\{a_{ij}\}$
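
In the notation of the bilinear model above, the resulting least-squares objective can be sketched as (my formulation):

$$f(\{h_i, x_i\}_{i=1}^{s}) = \sum_{j=1}^{L} \Big| \sum_{i=1}^{s} b_j^{\mathsf{H}} h_i \, x_i^{\mathsf{H}} a_{ij} - y_j \Big|^2$$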

SLIDE 24

Wirtinger flow

 Least-squares minimization via Wirtinger flow (Candes, Li, Soltanolkotabi ’14)

  • Spectral initialization by the top eigenvector of a matrix constructed from the data
  • Gradient iterations (a simplified sketch follows below)
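
A minimal illustrative sketch of this recipe in Python, for a simplified real-valued, single-component bilinear model $y_j = (b_j^{\mathsf{T}} h)(a_j^{\mathsf{T}} x)$. The dimensions, step size, and iteration count below are assumptions for illustration; the actual blind-demixing algorithm is complex-valued and handles multiple components.

```python
import numpy as np

# Toy Wirtinger-flow-style recipe: spectral initialization + plain gradient descent
# on a real-valued, single-component bilinear model y_j = (b_j^T h)(a_j^T x).
rng = np.random.default_rng(0)
K, N, L = 10, 10, 400
h_true, x_true = rng.standard_normal(K), rng.standard_normal(N)
B, A = rng.standard_normal((L, K)), rng.standard_normal((L, N))
y = (B @ h_true) * (A @ x_true)          # bilinear measurements

# Spectral initialization: top singular vectors of M = sum_j y_j b_j a_j^T,
# whose expectation is proportional to h_true x_true^T.
M = B.T @ (y[:, None] * A)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
h, x = np.sqrt(s[0]) * U[:, 0], np.sqrt(s[0]) * Vt[0]

# Plain gradient iterations on the least-squares loss (no explicit regularization).
eta = 0.2 / (L * s[0])                   # conservative constant step size
for _ in range(500):
    r = (B @ h) * (A @ x) - y            # residuals
    h, x = h - eta * (B.T @ (r * (A @ x))), x - eta * (A.T @ (r * (B @ h)))

# Evaluate up to the inherent scaling ambiguity (h, x) -> (c h, x / c).
err = np.linalg.norm(np.outer(h, x) - np.outer(h_true, x_true))
print("relative error:", err / np.linalg.norm(np.outer(h_true, x_true)))
```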

SLIDE 25

Two-stage approach

 Initialize within a local basin sufficiently close to the ground truth (i.e., the region is strongly convex, with no saddle points or spurious local minima)

 Iterative refinement via some iterative optimization algorithm


  • Fig. credit: Chen
SLIDE 26

Gradient descent theory

 Two standard conditions that enable geometric convergence of GD (standard statement below)

  • (local) restricted strong convexity
  • (local) smoothness
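
In standard form (stated for a generic objective $f$ with minimizer $z^{\star}$; this is textbook gradient-descent analysis rather than anything specific to blind demixing): if, locally around $z^{\star}$,

$$\alpha I \preceq \nabla^2 f(z) \preceq \beta I,$$

then gradient descent with step size $\eta = 1/\beta$ contracts the distance geometrically, provided the iterates stay in that region:

$$\|z_{t+1} - z^{\star}\|_2^2 \le \big(1 - \alpha/\beta\big)\, \|z_t - z^{\star}\|_2^2.$$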

SLIDE 27

Gradient descent theory

 Question: which region enjoys both strong convexity and smoothness?

  • the iterate is not far away from the ground truth (region of convexity)

  • the iterate is incoherent w.r.t. the sampling vectors (incoherence region, for smoothness)


Prior works suggest enforcing regularization (e.g., regularized loss [Ling & Strohmer’17]) to promote incoherence

SLIDE 28

Our finding: WF is implicitly regularized

 WF (GD) implicitly forces the iterates to remain incoherent with the sampling vectors

  • cannot be derived from generic optimization theory
  • relies on finer statistical analysis of the entire trajectory of GD

Figure: region of local strong convexity and smoothness

SLIDE 29

Key proof idea: leave-one-out analysis

 Introduce leave-one-out iterates by running WF without the l-th sample

 The leave-one-out iterate is independent of the l-th sampling vector

 The leave-one-out iterate stays close to the true iterate, hence the true iterate is nearly independent of (i.e., nearly orthogonal to) the l-th sampling vector

SLIDE 30

Theoretical guarantees

 With i.i.d. Gaussian design, WF (regularization-free) achieves

  • Incoherence
  • Near-linear convergence rate

 Summary (compared against [Ling & Strohmer’17]):

  • Sample size:
  • Stepsize:
  • Computational complexity:

SLIDE 31

Numerical results

 Stepsize:
 Number of users:
 Sample size:


Linear convergence: WF attains $\epsilon$-accuracy within $O(\log(1/\epsilon))$ iterations

SLIDE 32

Is carefully-designed initialization necessary?

SLIDE 33

Numerical results of randomly initialized WF

 Stepsize:
 Number of users:
 Sample size:
 Initial point:

Randomly initialized WF enters the local basin within a small number of iterations

SLIDE 34

Analysis: population dynamics

 Signal strength: defined via the alignment parameter
 Size of the residual component
 State evolution

Population level (infinite sample): local basin

SLIDE 36

Analysis: finite-sample analysis

 Population-level analysis holds approximately if the finite-sample deviation is well-controlled

 The deviation is well-controlled if the iterate is independent of the sampling vectors

 Key analysis ingredient: show the iterate is “nearly independent” of each sampling vector


  • Fig. credit: Chen

The deviation is well-controlled in this region

SLIDE 37

Theoretical guarantees

 With i.i.d. Gaussian design, WF with random initialization achieves:

 Summary:

  • Stepsize:
  • Sample size:
  • Stage I: reach the local basin within relatively few iterations
  • Stage II: linear convergence
  • Computational complexity:

SLIDE 38

Vignette B: Matrix optimization over manifolds


Optimization over Riemannian Manifolds (non-Euclidean geometry)

SLIDE 39

Why manifold optimization?

SLIDE 40

What is manifold optimization?

 Manifold (or manifold-constrained) optimization problem: $\min_{x \in \mathcal{M}} f(x)$

  • $f$ is a smooth function
  • $\mathcal{M}$ is a Riemannian manifold: spheres, orthonormal bases (Stiefel), rotations, positive definite matrices, fixed-rank matrices, Euclidean distance matrices, semidefinite fixed-rank matrices, linear subspaces (Grassmann), phases, essential matrices, fixed-rank tensors, Euclidean spaces...

SLIDE 41

Convergence results of manifold optimization

 Convergence guarantees for Riemannian trust regions

  • Global convergence to second-order critical points
  • Quadratic convergence rate locally
  • Reach an approximate second-order stationary point (defined below) in finitely many iterations under Lipschitz assumptions [Cartis & Absil’16]
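
A standard way to make that target precise (general definition with tolerances $\varepsilon_g, \varepsilon_H > 0$, not a statement specific to these slides): an approximate second-order stationary point $x \in \mathcal{M}$ satisfies

$$\|\operatorname{grad} f(x)\| \le \varepsilon_g \quad \text{and} \quad \lambda_{\min}\big(\operatorname{Hess} f(x)\big) \ge -\varepsilon_H.$$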

Escape strict saddle points by finding second-order stationary points

SLIDE 42

Recent applications of manifold optimization

 High-dimensional data analysis: matrix/tensor completion/recovery: [Vandereycken’13], [Boumal-Absil’15], [Kasai-Mishra’16]; phase retrieval: [Sun-Qu-Wright’17]; community detection: [Boumal’16], [Bandeira-Boumal-Voroninski’16],…

 Machine and deep learning: Gaussian mixture models: [Hosseini-Sra’15]; dictionary learning: [Sun-Qu-Wright’17]; deep metric learning: [Roy-Mhammedi-Harandi’18],…

 Wireless transceiver design: [Shi-Zhang-Letaief’16], [Yu-Shen-Zhang-Letaief’16], [Shi-Mishra-Chen’17],…


Exploit manifold geometry to address non-convex problems

SLIDE 43

The power of manifold optimization paradigms

 Generalize the Euclidean gradient (Hessian) to the Riemannian gradient (Hessian)

 We need Riemannian geometry: 1) linearize the search space $\mathcal{M}$ around the current point $x$ into a tangent space $T_x\mathcal{M}$; 2) pick a metric on $T_x\mathcal{M}$ to give intrinsic notions of gradient and Hessian
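
With this notation, a Riemannian gradient step takes the standard form (for an embedded submanifold with the metric inherited from the ambient Euclidean space, where $R_x$ denotes a retraction and $\eta$ a step size):

$$\operatorname{grad} f(x_k) = P_{T_{x_k}\mathcal{M}}\big(\nabla f(x_k)\big), \qquad x_{k+1} = R_{x_k}\big(-\eta \operatorname{grad} f(x_k)\big).$$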

Figure: Euclidean gradient, Riemannian gradient, and retraction operator

SLIDE 44

An excellent book: Optimization Algorithms on Matrix Manifolds; a MATLAB toolbox is also available

SLIDE 45

Taking a close look at gradient descent

SLIDE 46

Optimization on the manifold: main idea

SLIDE 50

Example: Rayleigh quotient

 Optimization over the sphere manifold $\mathcal{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$, with $A$ a symmetric matrix: $\min_{x \in \mathcal{S}^{n-1}} x^{\mathsf{T}} A x$

  • The cost function is smooth on $\mathcal{S}^{n-1}$

 Step 1: Compute the Euclidean gradient in $\mathbb{R}^n$: $\nabla f(x) = 2 A x$

 Step 2: Compute the Riemannian gradient on $\mathcal{S}^{n-1}$ by projecting onto the tangent space using the orthogonal projector $P_x = I - x x^{\mathsf{T}}$: $\operatorname{grad} f(x) = (I - x x^{\mathsf{T}})\, 2 A x$ (a small code sketch follows below)
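
A compact sketch of these two steps as Riemannian gradient descent in Python; the normalization retraction, step size, and iteration count are illustrative choices (assumptions), not taken from the slides.

```python
import numpy as np

# Riemannian gradient descent for the Rayleigh quotient f(x) = x^T A x on the sphere.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric matrix

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # start on the sphere
eta = 0.1 / np.linalg.norm(A, 2)       # conservative step size

for _ in range(2000):
    egrad = 2 * A @ x                  # Step 1: Euclidean gradient
    rgrad = egrad - (x @ egrad) * x    # Step 2: project onto the tangent space, (I - x x^T) egrad
    x = x - eta * rgrad                # move along the negative Riemannian gradient
    x /= np.linalg.norm(x)             # retraction: map back onto the sphere

# x approximates the eigenvector associated with the smallest eigenvalue of A
print("Rayleigh quotient:", x @ A @ x)
print("smallest eigenvalue:", np.linalg.eigvalsh(A)[0])
```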

SLIDE 51

Riemannian optimization for blind demixing

SLIDE 52

Blind demixing via low-rank optimization

 Linear mapping: from the bilinear model to a linear model (sketched below)

  • Proposal: (non-convex) low-rank optimization problem
  • Challenges: nonconvex constraints, complex asymmetric matrices
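
One possible way to write the linear mapping (my notation, consistent with the bilinear model above): define $\mathcal{A}_i : \mathbb{C}^{K \times N} \to \mathbb{C}^{L}$ by $[\mathcal{A}_i(M)]_j = b_j^{\mathsf{H}} M a_{ij}$, so that

$$y = \sum_{i=1}^{s} \mathcal{A}_i\big(h_i x_i^{\mathsf{H}}\big),$$

which is linear in the lifted variables $M_i = h_i x_i^{\mathsf{H}}$.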

SLIDE 53

Blind demixing via Riemannian optimization

 Handle complex asymmetric matrices

  • Define a linear map that relates the Hermitian lifted variables to the measurements

 Matrix optimization over the product manifolds

  • Key observations: the set of rank-one Hermitian positive semidefinite matrices is a manifold; multiple rank-one constraints construct a product manifold (see the note below)
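
A standard way to see the key observation (a general fact, not specific to these slides): every rank-one Hermitian positive semidefinite matrix factors as

$$M = z z^{\mathsf{H}}, \qquad z \in \mathbb{C}^{n} \setminus \{0\},$$

with $z$ determined up to a global phase, so the set of such matrices is a smooth (quotient) manifold, and imposing one such constraint for each $i = 1, \dots, s$ yields a product manifold.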

SLIDE 54


Riemannian optimization over product manifolds

 Elementwise extension principles

  • The manifold topology of the product manifold is equivalent to the product topology

SLIDE 55

Element-wise optimization-related ingredients

 Riemannian optimization for blind demixing

SLIDE 56

Numerical results

 Optimize over the product of multiple rank-one Hermitian positive semidefinite matrices


Riemannian algorithms: 1) exploit the rank structure in a principled way; 2) develop second-order algorithms systematically; 3) scalable, SVD-free

SLIDE 57

Concluding remarks

 Implicitly regularized Wirtinger flow

  • Implicit regularization: vanilla gradient descent automatically forces iterates to stay incoherent
  • Even the simplest nonconvex methods are remarkably efficient under suitable statistical models

 Matrix optimization over manifolds

  • Exploit the manifold geometry of multiple rank-one Hermitian positive semidefinite matrices
  • Develop second-order algorithms systematically: escape saddle points, quadratic convergence rate

 Future works: sparse blind demixing, convolutional dictionary learning [Wright, CVPR’17], convolutional neural networks [Papyan et al., SPM’18],…

SLIDE 58

Reference

 J. Dong and Y. Shi, “Nonconvex demixing from bilinear measurements,” IEEE Trans. Signal Process., vol. 66, no. 19, pp. 5152-5166, Oct. 2018.

 J. Dong, K. Yang, and Y. Shi, “Blind demixing for low-latency communication,” IEEE Trans. Wireless Commun., vol. 18, no. 2, pp. 897-911, Feb. 2019.

 J. Dong, Y. Shi, and Z. Ding, “Blind over-the-air computation and data fusion via provable Wirtinger flow,” https://arxiv.org/abs/1811.04644.

 J. Dong and Y. Shi, “Blind demixing via Wirtinger flow with random initialization,” in Proc. Int. Conf. Artificial Intell. Stat. (AISTATS), 2019.

SLIDE 59

Thanks