SLIDE 1

Dictionary learning in geoscience

Michael Bianco

UCSD Noise Lab, Scripps Institution of Oceanography noiselab.ucsd.edu

5/9/18

SLIDE 2

Dictionary learning

  • Means of estimating sparse causes for given classes of signals, e.g. natural images, audio
  • Originated in neuroscience to estimate the structure of V1 visual cortex cells from natural images
  • Useful for regularizing the general image denoising inverse problem, but with only recent applications in the geosciences:
  • Seismic survey image denoising
  • Dictionary learning of ocean sound speed profiles (SSPs)

Olshausen 2009; Beckouche 2014

[Figure: learned SSP dictionary atoms, amplitude vs. depth (m); Bianco and Gerstoft 2017]

SLIDE 3


Background: sparse modeling of arbitrary signal y

  • Measurement vector y is expressed as a sparse linear combination of columns, or "atoms", from dictionary D: $y \approx Dx$
  • y could be (for example) segments of speech or vectorized 2D image patches
  • Dictionary atoms represent elemental patterns that generate y, e.g. wavelets, or atoms learned from the data using dictionary learning
  • x is estimated using a sparsity-inducing constraint, for example $\ell_0$-"norm" regularization: $\hat{x} = \arg\min_x \|y - Dx\|_2^2 \;\text{s.t.}\; \|x\|_0 \le T$
  • The $\ell_0$ "norm" $\|x\|_0$ "counts" the number of non-zero coefficients (a minimal coding sketch follows below)
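As a concrete illustration, a minimal sparse-coding sketch (assuming Python with NumPy and scikit-learn; the random Gaussian dictionary here is purely illustrative, not learned):

```python
# Sketch: recover sparse coefficients x for a signal y = Dx via
# orthogonal matching pursuit (OMP). Sizes are illustrative.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, K, T = 64, 128, 5                       # signal length, # atoms, sparsity

D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms, by convention

x_true = np.zeros(K)
x_true[rng.choice(K, T, replace=False)] = rng.standard_normal(T)
y = D @ x_true                             # y: T-sparse combination of atoms

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=T, fit_intercept=False)
omp.fit(D, y)
print("true support:     ", np.sort(np.flatnonzero(x_true)))
print("recovered support:", np.sort(np.flatnonzero(omp.coef_)))
```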
SLIDE 5

Background: sparsity and dictionary learning

[Figure: "natural" images with patches shown in magenta; learn dictionary D describing them]

  • Dictionary learning obtains "optimal" sparse modeling dictionaries directly from data
  • Dictionary learning was developed in neuroscience (a.k.a. sparse coding) to help understand mammalian visual cortex structure
  • Assumes (1) redundancy in the data: image patches are repetitions of a smaller set of elemental shapes; and (2) sparsity: each patch is represented with few atoms from the dictionary

Olshausen 2009

Each patch is a signal; the set of all patches forms the training data. Sparse model: each patch is composed of a few atoms from D (a patch-extraction sketch follows below).
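For concreteness, a minimal sketch of building the training set of vectorized patches (assuming Python with scikit-learn; the random stand-in image and patch size are illustrative):

```python
# Sketch: turn an image into a matrix Y whose columns are vectorized
# 10x10 patches, i.e. the training signals for dictionary learning.
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

image = np.random.default_rng(2).random((512, 512))   # stand-in "natural" image
patches = extract_patches_2d(image, (10, 10), max_patches=5000, random_state=0)
Y = patches.reshape(len(patches), -1).T               # (100, 5000): columns are patches
print(Y.shape)
```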

SLIDE 7

Olshausen and Field 1997: image model with sparse prior

Assume each image patch is described by the linear system $I = \Phi a + \nu$ (bases $\Phi$, coefficients $a$, noise $\nu$). Goal: estimate the bases from observations.

[Figure: sample natural image patches]

The probability of an image patch arising from bases $\Phi$ follows from the posterior $p(a\,|\,I,\Phi) \propto p(I\,|\,a,\Phi)\, p(a)$, with a Gaussian likelihood and an independent, sparse prior on the coefficients.

SLIDE 8

Olshausen and Field 1997: sparse prior induces sparse coefficients

Sparsity-inducing prior: the "Cauchy distribution", with penalty $S(a) = \log(1 + (a/\sigma)^2)$ (the negative log of the Cauchy density). The derivative of the prior induces sparsity in the solution, as we'll see…

SLIDE 9

Olshausen and Field 1997: derivation of the error function

Learn basis functions by minimizing the Kullback-Leibler (KL) divergence between the distribution of true images $p^*(I)$ and the distribution of images reproduced by the model $p(I\,|\,\Phi)$. Since $p^*(I)$ is fixed, the KL divergence is minimized by maximizing the log-likelihood (or minimizing the negative log-likelihood) of image patches generated from the model.

Given: $D_{KL} = \int p^*(I) \log \frac{p^*(I)}{p(I\,|\,\Phi)}\, dI$

Obtain: the error function $E = -\langle \log p(I\,|\,\Phi) \rangle$, the average negative log-likelihood under $p^*(I)$.


SLIDE 12

Olshausen and Field 1997: gradients for network model

Rewriting the error function, take derivatives to find the gradient. The coefficients $a$ are updated with the network dynamics (inner loop); the bases $\Phi$ are updated with gradient descent (outer loop).

"Hebbian" update: $\Delta\Phi \propto (I - \Phi a)\, a^T$ (a compact sketch of the two loops follows below)

SLIDE 13

From Olshausen ’97 method, obtain dictionary atoms that resemble cells from mammalian visual cortex

[Figure: natural image patches (left) and learned dictionary elements (right)]

SLIDE 15

Nice to have atoms like cells, but what else is dictionary learning useful for?

Image restoration tasks: denoising and inpainting (a.k.a. matrix completion)

Elad 2006; Mairal 2009

SLIDE 16

Olshausen and Field 1997: gradients for network model

The model can be rephrased with a Laplacian prior. The coefficients are calculated using gradient descent, then the dictionary is updated from the reconstruction error.

This idea of iterative refinement is familiar: solve for the coefficients, then update the basis functions.

SLIDE 17

Vector Quantization and K-means

Vector quantization (VQ): a means of compressing a set of data observations by assigning each to its nearest codebook entry (2D example). K-means: finds the optimal codebook for VQ (a toy sketch follows below).
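A toy sketch (assuming Python with scikit-learn; data and codebook size are illustrative):

```python
# Sketch: vector quantization with a K-means codebook on 2D toy data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
Y = rng.standard_normal((500, 2))        # 500 two-dimensional observations

km = KMeans(n_clusters=8, n_init=10, random_state=1).fit(Y)
codebook = km.cluster_centers_           # the VQ codebook (8 codewords)

labels = km.predict(Y)                   # nearest-codeword assignment
Y_vq = codebook[labels]                  # each point replaced by its codeword
print("mean quantization error:", np.mean(np.linalg.norm(Y - Y_vq, axis=1)))
```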

SLIDE 18

Relationship to sparse coding

  • Operators: the sparse processor generalizes VQ; coding with a single atom and a scalar gain is gain-shape VQ
  • Objective: dictionary learning generalizes K-means (and gain-shape VQ) to the sparse model

SLIDE 19

Background: a basic dictionary learning framework

[Figure: image with patches shown in magenta]

Given a set of patches, learn a dictionary D describing them.

Dictionary learning objective (standard $\ell_0$-constrained form): $\min_{D,X} \|Y - DX\|_F^2 \;\text{s.t.}\; \|x_i\|_0 \le T \;\forall i$

The objective is solved as a simple alternating optimization problem:
  • 1. Solve for the sparse coefficients using a sparse solver
  • 2. Solve for the dictionary D using the sparse coefficients from step (1)
… repeat until convergence

SLIDE 20

MOD algorithm: extending K-means to the dictionary learning problem

Method of Optimal Directions (MOD) [Engan 2000]:
  • 1. COEFFICIENTS: Solve for coefficients X = [x_1 … x_i] for fixed Q using orthogonal matching pursuit (OMP)
  • 2. DICTIONARY UPDATE: Solve for dictionary Q = [q_1 … q_i] by least squares, $\hat{Q} = Y X^T (X X^T)^{-1}$, normalizing dictionary entries to unit norm
… repeat until convergence

Simple and flexible, but with a few drawbacks (a minimal sketch follows below):
  • computationally expensive to invert the coefficient matrix product $X X^T$
  • since the coefficients in X are kept fixed during the dictionary update, convergence is slow
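A minimal MOD sketch (assumptions: Python/NumPy with scikit-learn's OMP for the coefficient step; Y holds training signals in columns; a pseudo-inverse guards against a singular $X X^T$):

```python
# Sketch of MOD: alternate OMP sparse coding with the closed-form
# least-squares dictionary update Q = Y X^T (X X^T)^(-1).
import numpy as np
from sklearn.decomposition import SparseCoder

def mod(Y, K, T, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((Y.shape[0], K))
    Q /= np.linalg.norm(Q, axis=0)
    for _ in range(n_iter):
        # (1) COEFFICIENTS: sparse-code every column of Y with OMP
        coder = SparseCoder(dictionary=Q.T, transform_algorithm="omp",
                            transform_n_nonzero_coefs=T)
        X = coder.transform(Y.T).T                  # (K, M) coefficient matrix
        # (2) DICTIONARY UPDATE: least squares, then unit-norm atoms
        Q = Y @ X.T @ np.linalg.pinv(X @ X.T)
        Q /= np.linalg.norm(Q, axis=0) + 1e-12
    return Q, X
```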

SLIDE 21
K-SVD algorithm

K-SVD [Aharon 2006]: learn the optimal dictionary for sparse representation of data.
  • 1. Solve for coefficients X = [x_1 … x_i] for fixed Q using OMP
  • 2. Solve for dictionary Q = [q_1 … q_K], updating both Q and X, one atom at a time, from the SVD of the representation error:

$\|Y - QX\|_F = \Big\| \Big( Y - \sum_{j \neq k} q_j x_j^T \Big) - q_k x_k^T \Big\|_F = \| E_k - q_k x_k^T \|_F$

Update $q_k$, $x_k$ by the rank-1 SVD approximation $E_k = U S V^T$: $q_k = U(:,1)$, $x_k^T = S(1,1)\, V(:,1)^T$
… repeat until convergence (an atom-update sketch follows below)
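A sketch of the K-SVD atom update alone (Python/NumPy; restricting the update to the signals that actually use atom k keeps the sparsity pattern fixed, as in Aharon 2006):

```python
# Sketch: K-SVD update of one atom q_k and its coefficient row x_k.
import numpy as np

def ksvd_atom_update(Y, Q, X, k):
    """Y: (n, M) signals; Q: (n, K) dictionary; X: (K, M) coefficients."""
    users = np.flatnonzero(X[k])              # signals whose code uses atom k
    if users.size == 0:
        return Q, X
    # Representation error without atom k's contribution, on those signals
    E_k = Y[:, users] - Q @ X[:, users] + np.outer(Q[:, k], X[k, users])
    U, S, Vt = np.linalg.svd(E_k, full_matrices=False)
    Q[:, k] = U[:, 0]                         # q_k = U(:,1)
    X[k, users] = S[0] * Vt[0]                # x_k^T = S(1,1) V(:,1)^T
    return Q, X
```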


SLIDE 23

Image restoration tasks. Denoising: the dictionary is learned from the noisy patches of the specific image being restored.

Solved using a block-coordinate descent algorithm (also two steps): (1) sparse-code the noisy patches over the current dictionary; (2) update the dictionary. The denoised image is formed by averaging the overlapping patch reconstructions (a sketch follows below).

Elad 2006
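A denoising sketch in this spirit (assumptions: Python with scikit-learn, whose mini-batch dictionary learner stands in for the K-SVD of Elad 2006; `noisy` is a 2D grayscale array):

```python
# Sketch: patch-based denoising. Learn a dictionary from the noisy
# patches, sparse-code them, and average the overlapping reconstructions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import (extract_patches_2d,
                                              reconstruct_from_patches_2d)

def denoise(noisy, patch=8, n_atoms=100, T=2):
    P = extract_patches_2d(noisy, (patch, patch)).reshape(-1, patch * patch)
    mean = P.mean(axis=1, keepdims=True)
    P = P - mean                                   # remove per-patch DC level
    dl = MiniBatchDictionaryLearning(n_components=n_atoms,
                                     transform_algorithm="omp",
                                     transform_n_nonzero_coefs=T)
    code = dl.fit(P).transform(P)                  # learn D, sparse-code patches
    P_hat = (code @ dl.components_ + mean).reshape(-1, patch, patch)
    return reconstruct_from_patches_2d(P_hat, noisy.shape)  # average overlaps
```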

SLIDE 24

Why not just use neural networks?

Burger 2012: a multi-layer perceptron (MLP) competes with state-of-the-art denoising algorithms, using 362 million training samples (~one month of GPU time)… at least in geoscience (seismics, ocean acoustics) we rarely have this much training data.

Wipf 2018

[Figure: adaptive image denoising-like vs. MLP-like approaches]

SLIDE 25

Why not just use neural networks? (cont’d)

SLIDE 26

Dictionary learning of ocean sound speed profiles (Bianco and Gerstoft 2017)

  • Acoustic observations from the ocean contain information about the ocean environment
  • The inversion of environment parameters is limited by physics and signal processing assumptions

[Figure: source (active or noise), hydrophones, sound speed profile c(z), and layers ⍴1, c1 and ⍴2, c2]

SLIDE 27

Sound speed profiles

  • Sound speed profiles (SSPs) in the ocean are often highly variable, with fine-scale fluctuations
  • Acoustic inversion of SSPs is ill-posed and traditionally regularized using EOFs (= PCA in this case)
  • Dictionaries obtained via unsupervised learning may provide a better representation of SSP dynamics

[Figure: HF-97 SSP data, sound speed (m/s) vs. depth (m) and UTC hour]

SLIDE 28

Dictionary learning of sound speed profiles

[Figure: HF-97 SSP variability, sound speed (m/s) vs. depth (m) and UTC hour; learned dictionary atoms, amplitude vs. depth (m)]

Each profile $c_m$ has the mean profile $\bar{c}$ removed before learning: $y_m = c_m - \bar{c}$

HF-97 Experiment:
  • 30 hours of SSP data
  • Used 1000 profiles for dictionary learning
  • K = 30 point SSPs (interpolated from 15 measurements)

Bianco and Gerstoft JASA 2017 (published). A learning sketch follows below.
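A sketch of this setup (assumptions: Python with scikit-learn standing in for the paper's K-SVD; the filename `hf97_ssp.npy` and array shape are placeholders):

```python
# Sketch: learn an overcomplete SSP dictionary from mean-removed profiles.
import numpy as np
from sklearn.decomposition import DictionaryLearning

ssp = np.load("hf97_ssp.npy")               # hypothetical (1000, 30) array
y = ssp - ssp.mean(axis=0)                  # y_m = c_m - c_bar
dl = DictionaryLearning(n_components=90,    # N = 90 atoms, as on a later slide
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=1)   # T = 1 atom per profile
codes = dl.fit_transform(y)                 # sparse coefficients (1000, 90)
atoms = dl.components_                      # learned dictionary entries (90, 30)
```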

SLIDE 29


Example: Denoising alphabet with K-SVD algorithm

[Figure panels: true alphabet; recovered alphabet (no noise, K-SVD); recovered alphabet (noise std = 0.5, K-SVD); recovered alphabet (no noise, PCA)]

SLIDE 30

SSP reconstruction error using Dictionary Learning

[Figure: mean error ME (m/s) vs. number of coefficients T, comparing EOF (OMP), EOF (LS), and LD (N = 90)]

  • One entry from the Learned Dictionary fits the SSP data better than 6 EOFs
  • Learned dictionary (LD) reconstruction error is less than 50% of the EOF error

Based on 1000 profiles from HF-97

LS: least squares; OMP: sparse processor. Mean error: $\mathrm{ME} = \frac{1}{KM} \|Y - \hat{Y}\|_1$

SLIDE 31

SSP reconstruction using Dictionary Learning

HF-97: one coefficient from the Learned Dictionary vs. one EOF coefficient

SLIDE 32

Learning a dictionary from the HF-97 SSP variation

Q is randomly initialized and converges within 15 iterations.

SLIDE 33

LD solution space much smaller than EOFs

[Figure: number of possible solutions S (up to $10^{25}$) vs. T, for $S_{\text{fixed}}$ and $S_{\text{comb}}$]

Inversion for SSP, assuming a potentially non-linear mapping:
  • EOF solution: T leading-order coefficients (fixed indices), $S_{\text{fixed}} = H^T$
  • LD solution: T non-zero coefficients (combinatorial indices), $S_{\text{comb}} = H^T \frac{N!}{T!\,(N-T)!}$

  • Since 6 EOFs or 1 LD entry is required, if the coefficients are discretized into H = 100 values, the number of possible solutions is

EOFs: $S_{\text{fixed}} = 100^6 = 10^{12}$ solutions; LD: $S_{\text{comb}} = 100 \times 90 \approx 10^4$ solutions

SLIDE 34

Dictionary learning in travel time tomography

Bianco and Gerstoft 2018

[Figure: Long Beach seismic array (5200 stations, ~14 million rays), footprint ~10 km x 7 km; 1 Hz Rayleigh wave phase speed map from LST]

  • The Earth contains both smooth and discontinuous variations in slowness (e.g. Moho, faults) at multiple spatial scales
  • Most existing travel time inversion methods are ad hoc: they regularize the inversion assuming exclusively smooth or exclusively discontinuous slownesses
  • We propose a locally-sparse 2D travel time tomography (LST) method with three main ingredients:
  • Sparsity constraint on slowness patches
  • Dictionary learning (unsupervised machine learning)
  • Damped least squares regularization on the overall slowness map
slide-35
SLIDE 35

Consider a simple travel time model

[Figure: 2D slowness map with pixels x1, x2, x3, x4; axes in range (length)]

Slowness is inverse wave speed, $s = 1/c$. For a slowness field, the travel time along ray $i$ is $t_i = \int_{\text{ray}\,i} s\, dl$. For straight rays we get the simple formulation $t = As$, where the "tomography matrix" A holds the length of each ray in each pixel (a construction sketch follows the list below).

  • Propose LST tomography ingredients:
  • Sparsity constraint on slowness patches
  • Dictionary learning (unsupervised machine learning)
  • Damped least squares regularization on overall slowness map
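A minimal sketch of assembling A (Python/NumPy; per-pixel ray lengths are estimated by crude subsampling, purely for illustration; production codes trace exact cell crossings):

```python
# Sketch: build the straight-ray tomography matrix A, where A[i, n]
# approximates the length of ray i inside pixel n; then t = A @ s.
import numpy as np

def tomography_matrix(src, rec, W1, W2, n_sub=200):
    """src, rec: (M, 2) ray endpoints in pixel units; A: (M, W1*W2)."""
    A = np.zeros((len(src), W1 * W2))
    for i in range(len(src)):
        pts = src[i] + np.linspace(0, 1, n_sub)[:, None] * (rec[i] - src[i])
        seg = np.linalg.norm(rec[i] - src[i]) / n_sub    # length per sample
        ix = np.clip(pts.astype(int), [0, 0], [W1 - 1, W2 - 1])
        np.add.at(A[i], ix[:, 0] * W2 + ix[:, 1], seg)   # accumulate lengths
    return A   # travel times: t = A @ s, with s the slowness per pixel
```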
SLIDE 36

Proposed locally-sparse tomography (LST) basics

The LST approach's three ingredients are classified as local and global models:
1. Sparsity constraint on slowness patches ("local" model)
2. Dictionary learning, unsupervised machine learning ("local" model)
3. Damped least squares regularization on the overall slowness map ("global" model)

"Local" model: models small-scale features as patches. "Global" model: models larger-scale features with damped least squares.

[Figure: synthetic "checkerboard" slowness example]


SLIDE 38

Local model: slowness patches related to dictionary entries

[Figure: dictionaries learned from 10x10 pixel patches of a slowness map and of a natural image; Olshausen 2009]

SLIDE 39

LST slowness image and sampling

Slowness map and sampling:
  • Discrete slowness map of N = W1 x W2 pixels, with overlapping pixel "patches"
  • M straight-ray paths (x's are stations)

[Figure: "local" and "global" models; straight-ray tomography matrix; slowness dictionary]

SLIDE 40

Formulation of LST and algorithm

Bayesian MAP objective, solved via block-coordinate descent:
  • Global model: the global slowness is solved by damped least squares (a sketch of this step follows below)
  • Local model: sparse coding and dictionary learning, decoupled from the MAP objective
  • The sparse slowness at each pixel n is then solved from the local patch model

Dictionary learning by the iterative thresholding and signed K-means (ITKM) algorithm, Schnass 2015

SLIDE 41

Synthetic slownesses and dictionaries

[Figure: checkerboard and "fault" slowness profiles; ray sampling (64 stations, 2016 rays); ray density; prescribed and learned dictionaries]

SLIDE 42

LST vs. conventional method: synthetic inversions without noise

Compared with a conventional tomography method (Rodgers 2000).

[Figure: true and estimated slowness (s/km) vs. range (km) for the checkerboard and fault profiles; slowness RMSE (s/km) written on the 2D estimates]

Each example took ~5 min on a MacBook Pro.

SLIDE 43

LST vs. conventional method: synthetic inversions with travel time noise (checkerboard)

Compared with a conventional tomography method (Rodgers 2000).

  • Slowness RMSE (s/km) written on the 2D estimates
  • Noise is Gaussian with standard deviation equal to 2% of the mean travel time
SLIDE 44

LST vs. conventional method: synthetic inversions with travel time noise (fault profile)

Compared with a conventional tomography method (Rodgers 2000).

  • Slowness RMSE (s/km) written on the 2D estimates
  • Noise is Gaussian with standard deviation equal to 2% of the mean travel time
SLIDE 45

Imaging Long Beach, CA using LST: Big Data task

[Figure: Long Beach array footprint, ~10 km x 7 km]

  • In March 2011, 5200 seismic stations were deployed in Long Beach, California over a 70 km² area
  • Ambient seismic noise cross-correlations were obtained for all unique virtual source-receiver pairs (~14 million ray paths) using 3 weeks of data
  • We consider only the 1 Hz vertical component data, corresponding to Rayleigh surface waves (from Lin et al. 2013)
  • After quality control, ~8 million ray paths remained
slide-46
SLIDE 46

1Hz Rayleigh wave phase speed from LST

[Figure: high-resolution LST phase speed map from 8 million cross-correlations; Long Beach array footprint ~10 km x 7 km]

  • For LST we generate a 300x200 pixel slowness map with 8 million rays (tomography matrix A has dimensions M = 8 million, N = 60,000)
  • 10 iterations, using ~2 CPU-hours
  • Since we are not imposing global correlations on pixels, we can treat A as a sparse matrix and get a fast inversion for the global model (which is the bottleneck)
  • The Newport-Inglewood fault network is shown as a black line
SLIDE 47

LST comparison with eikonal tomography (Lin et al. 2013)

[Figure: eikonal tomography vs. LST phase speed maps]

  • We observe the same general trends between eikonal tomography and LST
  • LST gives improved contrast along fault lines, for example near Signal Hill
  • The LST results are preliminary and can likely be improved with more careful preprocessing (future work)