Sparse Proteomics Analysis (SPA): Toward a Mathematical Theory for Feature Selection from Forward Models


SLIDE 1

Sparse Proteomics Analysis (SPA)

Toward a Mathematical Theory for Feature Selection from Forward Models

Martin Genzel, Technische Universität Berlin

Winter School on Compressed Sensing, December 5, 2015

SLIDE 2

Outline

1. Biological Background
2. Sparse Proteomics Analysis (SPA)
3. Theoretical Foundation by High-dimensional Estimation Theory

Martin Genzel · Sparse Proteomics Analysis (SPA) · WiCoS 2015


SLIDE 4

What is Proteomics?

The pathological mechanisms of many diseases, such as cancer, are manifested at the level of protein activity. To improve clinical treatment options and early diagnostics, we need to understand protein structures and their interactions!

Proteins are long chains of amino acids, controlling many biological and chemical processes in the human body. The entire set of proteins present at a certain point in time is called a proteome. Proteomics is the large-scale study of the human proteome.

http://www.topsan.org/Proteins/JCSG/3qxb

SLIDE 7

What is Mass Spectrometry? How to "capture" a proteome?

Mass spectrometry (MS) is a popular technique to detect the abundance of proteins in samples (blood, urine, etc.).

(Figure, schematic work-flow: a laser ionizes the sample, a detector records the ions, and the result is a mass spectrum of intensity (cts) versus mass (m/z).)

SLIDE 10

Real-World MS-Data

(Figure: real-world mass spectrum, intensity (cts) vs. mass (m/z).)

MS-vector: x = (x_1, ..., x_d) ∈ R^d, with d ≈ 10^4 ... 10^6. Index ≙ mass/feature; entry ≙ intensity/amplitude.

SLIDE 13

Feature Selection from MS-Data

Goal: Detect a small set of features (a disease fingerprint) that allows for an appropriate distinction between the diseased and the healthy group.

(Figure, schematic work-flow: blood samples from healthy and diseased individuals → mass spectra, intensity (cts) vs. mass (m/z) → comparison/feature selection → disease fingerprint.)

SLIDE 16

Mathematical Problem Formulation

Supervised learning: We are given n samples (x_1, y_1), ..., (x_n, y_n), where
- x_k ∈ R^d is the mass spectrum of the k-th patient,
- y_k ∈ {−1, +1} is the health status of the k-th patient (healthy = +1, diseased = −1).

Goal: Learn a feature vector ω ∈ R^d that
- is sparse, i.e., has few non-zero entries (⇒ stability, avoids overfitting), and
- has entries corresponding to peaks that are highly correlated with the disease (⇒ interpretability, biological relevance).

SLIDE 21

How to learn a fingerprint ω?


SLIDE 23

Sparse Proteomics Analysis (SPA)

Sparse Proteomics Analysis is a generic framework to meet this challenge.

Input: Sample pairs (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}

Compute:
1. Preprocessing (smoothing, standardization)
2. Feature selection (LASSO, ℓ1-SVM, robust 1-bit CS)
3. Postprocessing (sparsification)

Output: Sparse feature vector ω ∈ R^d
⇒ Biomarker extraction, dimension reduction

(Figure: blood sample → mass spectrum, intensity (cts) vs. mass (m/z) → biomarker identification.)

Rest of this talk: the feature-selection step and its theoretical guarantees.

SLIDE 26

Feature Selection (Geometric Intuition)

Linear separation model: Find a feature vector ω ∈ R^d such that y_k = sign(⟨x_k, ω⟩) for "many" k ∈ {1, ..., n}. Moreover, ω should be sparse and interpretable.

SLIDE 27

Feature Selection via the LASSO

The LASSO (Tibshirani '96):

    min_{ω ∈ R^d} Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²  subject to  ∥ω∥_1 ≤ R

A multivariate approach, originally designed for linear regression models: y_k ≈ ⟨x_k, ω⟩, k = 1, ..., n. But it is also applicable to non-linear models (→ next part). Later: R ≈ √s to allow for s-sparse solutions (with unit norm).
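The ℓ1-constrained LASSO above is not the penalized form most libraries ship, but it can be solved with a few lines of numpy via projected gradient descent, using the standard Euclidean projection onto the ℓ1-ball. This is a minimal illustrative sketch, not the implementation used in the talk:

```python
import numpy as np

def project_l1_ball(v, R):
    # Euclidean projection of v onto the l1-ball {w : ||w||_1 <= R}.
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # sorted magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.max(idx[u - (css - R) / idx > 0])
    theta = (css[rho - 1] - R) / rho          # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_constrained(X, y, R, n_iter=2000):
    # Projected gradient descent for min ||y - X w||^2 s.t. ||w||_1 <= R.
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = project_l1_ball(w - step * grad, R)
    return w
```

With R set to the ℓ1-norm of the true sparse vector and noiseless data, the iterates converge to that vector.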

SLIDE 29

Some Numerical Results

5-fold cross-validation for real-world pancreas data (156 samples):

1. Learn a feature vector ω by SPA, using 80% of the samples.
2. Classify the remaining 20% of the samples with an ordinary SVM, after projecting onto supp(ω).
3. Iterate this procedure 12 times for random partitions.

(Figure: classification accuracy for different sparsity levels s = #supp(ω).)
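The evaluation loop in steps 1–3 can be sketched as follows. This is an illustrative, dependency-free stand-in (all function names are hypothetical): a nearest-centroid classifier replaces the ordinary SVM, and the learned support is taken as given:

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, Xte):
    # Dependency-free stand-in for the "ordinary SVM" on the selected features.
    c_pos = Xtr[ytr == +1].mean(axis=0)
    c_neg = Xtr[ytr == -1].mean(axis=0)
    d_pos = np.linalg.norm(Xte - c_pos, axis=1)
    d_neg = np.linalg.norm(Xte - c_neg, axis=1)
    return np.where(d_pos <= d_neg, +1, -1)

def cv_accuracy(X, y, support, n_splits=12, test_frac=0.2, seed=0):
    # Random 80/20 partitions; classify after projecting onto supp(w).
    rng = np.random.default_rng(seed)
    n = len(y)
    Xs = X[:, support]          # projection onto the selected features
    accs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_te = int(test_frac * n)
        te, tr = perm[:n_te], perm[n_te:]
        pred = nearest_centroid_predict(Xs[tr], y[tr], Xs[te])
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))
```

On data where one selected feature separates the classes, the averaged accuracy over the 12 random partitions is close to the Bayes accuracy of that feature.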

SLIDE 30

But what about theoretical guarantees?


SLIDE 32

Toward a Theoretical Foundation of SPA

Linear separation model (explains the observations/labels):

    y_k = sign(⟨x_k, ω_0⟩),  k = 1, ..., n

Forward model (explains the random distribution of the data):

    x_k = Σ_{m=1}^M s_{m,k} a_m + n_k,  k = 1, ..., n

- a_m ∈ R^d: deterministic feature atom, a sampled Gaussian peak of the form t ↦ exp(−(t − d_m)² / γ_m) with center d_m and width γ_m
- s_{m,k} ∈ R: random latent factor specifying the peak amplitude
- n_k ∈ R^d: random baseline noise

Question: Supposing that sufficiently many samples are given, can we learn the sparse fingerprint ω_0?

Problem: The vector ω_0 is not unique, because some features are perfectly correlated ⇒ no hope for support recovery or approximation.

Idea: Separate the fingerprint from its data representation!
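The forward model above is easy to simulate. The following numpy sketch generates synthetic MS-like data from sampled Gaussian peak atoms; all parameter values (d, M, peak width, noise level) and the 1-sparse fingerprint are illustrative choices, not values from the talk:

```python
import numpy as np

def gaussian_atom(d, center, width):
    # Sampled Gaussian peak a_m: t -> exp(-(t - d_m)^2 / gamma_m).
    t = np.arange(d)
    return np.exp(-((t - center) ** 2) / width)

def sample_ms_data(n, d=500, M=10, sigma=0.05, seed=0):
    # Forward model: x_k = sum_m s_{m,k} a_m + n_k;  labels y_k = sign(<x_k, w0>).
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0, d, size=M)
    A = np.stack([gaussian_atom(d, c, width=20.0) for c in centers])  # M x d atoms
    S = rng.standard_normal((n, M))           # latent peak amplitudes s_{m,k}
    N = sigma * rng.standard_normal((n, d))   # baseline noise n_k
    X = S @ A + N
    w0 = np.zeros(d)
    w0[int(centers[0])] = 1.0                 # a 1-sparse fingerprint, for illustration
    y = np.where(X @ w0 >= 0, 1, -1)
    return X, y, A, w0
```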

SLIDE 40

Combining the Models

    x_k = Σ_{m=1}^M s_{m,k} a_m + n_k,  k = 1, ..., n

Assumptions:
- s_k := (s_{1,k}, ..., s_{M,k}) ∼ N(0, I_M) — peak amplitudes
- n_k ∼ N(0, σ²I_d) — noise vector
- a_1, ..., a_M ∈ R^d — arbitrary (peak) atoms, collected as rows of the dictionary D := [a_1^⊤; ...; a_M^⊤] ∈ R^{M×d}

Put this into the classification model:

    y_k = sign(⟨x_k, ω_0⟩) = sign(⟨Σ_{m=1}^M s_{m,k} a_m + n_k, ω_0⟩)
        = sign(⟨D^⊤ s_k + n_k, ω_0⟩)
        = sign(⟨s_k, Dω_0⟩ + ⟨n_k, ω_0⟩) = sign(⟨s_k, z_0⟩ + ⟨n_k, ω_0⟩),  where z_0 := Dω_0
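The key step in this derivation is moving the dictionary across the inner product, ⟨D^⊤ s, ω⟩ = ⟨s, Dω⟩, which a few lines of numpy confirm in the noiseless case:

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 5, 12
D = rng.standard_normal((M, d))   # dictionary with atom rows a_m^T
s = rng.standard_normal(M)        # latent amplitudes s_k
w = rng.standard_normal(d)        # candidate fingerprint omega

lhs = (D.T @ s) @ w   # <x, w> with x = D^T s (noiseless sample)
rhs = s @ (D @ w)     # <s, z> with z = D w
assert np.isclose(lhs, rhs)
```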

SLIDE 43

Signal Space vs. Coefficient Space

Let us first assume that n_k = 0 (no baseline noise), so that

    x_k = Σ_{m=1}^M s_{m,k} a_m = D^⊤ s_k.

Then y_k = sign(⟨x_k, ω_0⟩) = sign(⟨s_k, z_0⟩), where z_0 = Dω_0.

- z_0 has a (non-unique) representation in the dictionary D with sparse coefficients ω_0.
- z_0 "lives" in the signal space R^M (independent of the specific data type).
- ω_0 "lives" in the coefficient space R^d (data dependent).

⇒ Try to show a recovery result for z_0!

SLIDE 48

What Does This Mean for the LASSO?

    y_k = sign(⟨x_k, ω_0⟩) = sign(⟨s_k, z_0⟩)  with z_0 = Dω_0

SPA via the LASSO:

    min_{ω ∈ R·B_1^d} (1/n) Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²        ← solvable in practice!

Substituting z := Dω, so that ⟨x_k, ω⟩ = ⟨s_k, z⟩:

    = min_{z ∈ R·DB_1^d} (1/n) Σ_{k=1}^n (y_k − ⟨s_k, z⟩)²      ← solvable in theory!

Warning: The minimizers "live" in different spaces!
Warning: We know neither D nor s_k, but just their product.

Idea: Apply results for the K-LASSO with K = R·DB_1^d!

SLIDE 53

A Simplified Version of Roman Vershynin's Result

Theorem (Plan, Vershynin '15). Suppose that s_k ∼ N(0, I_M), z_0 ∈ S^{M−1}, and the observations follow y_k = sign(⟨s_k, z_0⟩), k = 1, ..., n. Put μ = √(2/π) and assume that μz_0 ∈ K, where K is convex, and n ≳ w(K)². Then, with high probability, the solution ẑ of the K-LASSO satisfies

    ∥ẑ − μz_0∥_2 ≲ w(K)/√n.

Here the (global) mean width of a bounded set K ⊂ R^M is given by w(K) = E sup_{u ∈ K} ⟨g, u⟩, where g ∼ N(0, I_M).

- Assume that K = μR·DB_1^d ⇒ z_0 = Dω_0 for some ω_0 ∈ R·B_1^d.
- Assume that the columns of D are normalized. Then w(K) ≲ R·√(log d).
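The mean width bound w(K) ≲ R·√(log d) can be checked numerically for the ℓ1-ball itself: since sup_{u ∈ B_1^d} ⟨g, u⟩ = max_i |g_i|, a Monte-Carlo estimate takes a few lines (an illustrative sketch, not part of the talk):

```python
import numpy as np

def mean_width_l1_ball(d, R=1.0, n_mc=2000, seed=3):
    # w(R*B_1^d) = E sup_{u in R*B_1^d} <g, u> = R * E max_i |g_i|,  g ~ N(0, I_d).
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_mc, d))
    return R * float(np.mean(np.max(np.abs(G), axis=1)))
```

For d = 10^4 the estimate comes out on the order of √(2 log d) ≈ 4.3, matching the logarithmic growth used on the slide.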

SLIDE 55

A Recovery Guarantee for SPA

Theorem (G. '15). Suppose that s_k ∼ N(0, I_M). Let z_0 ∈ S^{M−1} and assume that there exists R > 0 such that z_0 = Dω_0 for some ω_0 ∈ R·B_1^d. Suppose the observations follow

    y_k = sign(⟨s_k, z_0⟩) = sign(⟨x_k, ω_0⟩),  k = 1, ..., n,

and the number of samples satisfies n ≳ R²·log(d). Then, with high probability, the solution of the LASSO

    ẑ = D·ω̂ = D·argmin_{ω ∈ R·B_1^d} (1/n) Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²

satisfies

    ∥Dω̂ − √(2/π)·Dω_0∥_2 = ∥ẑ − √(2/π)·z_0∥_2 ≲ (R²·log(d)/n)^{1/4}.
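The scaling factor √(2/π) in the guarantee can be made concrete with a quick simulation. Under the model y_k = sign(⟨s_k, z_0⟩) with Gaussian s_k, even the simple averaged estimator (1/n) Σ_k y_k s_k converges to √(2/π)·z_0, a standard fact for Gaussian 1-bit observations. This is a sanity check of the model, not the constrained LASSO of the theorem:

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 20_000, 20
z0 = rng.standard_normal(M)
z0 /= np.linalg.norm(z0)               # z0 on the unit sphere S^{M-1}

S = rng.standard_normal((n, M))        # s_k ~ N(0, I_M), stacked as rows
y = np.sign(S @ z0)                    # 1-bit observations y_k = sign(<s_k, z0>)

z_hat = (y[:, None] * S).mean(axis=0)  # (1/n) sum_k y_k s_k  ->  sqrt(2/pi) z0
```

With n = 20000 samples the estimate lies within a few percent of √(2/π)·z_0 ≈ 0.80·z_0.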

SLIDE 58

Practical Relevance for MS-Data?

Extensions:
- Baseline noise n_k ∼ N(0, σ²I_d)
- Non-trivial covariance matrix, i.e., s_k ∼ N(0, Σ)
- Adversarial bit-flips in the model y_k = sign(⟨x_k, ω_0⟩)

How to achieve normalized columns in D? How to guarantee that R ≈ √s, i.e., that s-sparse vectors are allowed?
→ Standardize the data (centering + normalizing).

Given ω̂, how to switch over to the signal space? (D is unknown.)
→ Identify supp(ω̂) with peaks (manual approach).

Message of this talk: An s-sparse disease fingerprint can be accurately recovered from only O(s·log(d)) samples!
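The standardization step suggested above (centering + normalizing each feature so that the implicit dictionary columns are roughly normalized) can be sketched in a few lines of numpy; this is one plausible reading of the slide, not the talk's exact preprocessing:

```python
import numpy as np

def standardize(X, eps=1e-12):
    # Center each feature across samples, then scale it to unit Euclidean norm,
    # so that the columns of the data matrix are normalized.
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    return Xc / np.maximum(norms, eps)   # eps guards constant (zero-variance) features
```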

SLIDE 60

THANK YOU FOR YOUR ATTENTION!

Further Reading:
- M. Genzel. Sparse Proteomics Analysis: Toward a Mathematical Foundation of Feature Selection and Disease Classification. Master's Thesis, 2015.
- Y. Plan, R. Vershynin. The generalized Lasso with non-linear observations. arXiv:1502.04071, 2015.

SLIDE 61

What to Do Next?

- Development of an abstract framework → What kinds of properties should the dictionary D have?
- Extension/generalization of the results → More complicated models and algorithms
- Numerical verification of the theory
- Other examples from real-world applications → Bio-informatics, neuro-imaging, astronomy, chemistry, ...
- Dictionary learning / factor analysis → What can we learn about D?