On Inferences from Completed Data, Jamie Haddock, February 14, 2019


slide-1
SLIDE 1

On Inferences from Completed Data

Jamie Haddock February 14, 2019

Computational and Applied Mathematics, UCLA

joint with 2019 UCLA REU group (D. Molitor, D. Needell, S. Sambandam, J. Song, S. Sun)

1

slide-5
SLIDE 5

Motivation

MyLymeData is a large collection of Lyme disease patient survey data collected by LymeDisease.org (∼12,000 patients, 100s of questions)

  • data is highly incomplete due to branching structure of surveys and missing responses

  • research questions of interest do not require individual entries

Question: Can we perform statistical inferences on imputed data?

2

slide-6
SLIDE 6

Main Question

3

slide-10
SLIDE 10

Sampling and Imputation Techniques

Uniform Sampling: Sample each entry with uniform probability p.

Structured Sampling: Sample zero and nonzero entries with probabilities p0 and p1, respectively.

Nuclear Norm Minimization (NNM): $\min_X \|X\|_*$ s.t. $X_{ij} = M_{ij}$ for all $(i, j) \in \Omega$

ℓ1-Regularized Nuclear Norm Minimization (ℓ1-NNM): $\min_X \|X\|_* + \alpha \|X_{\Omega^C}\|_1$ s.t. $X_{ij} = M_{ij}$ for all $(i, j) \in \Omega$
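The two sampling schemes can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names (`uniform_mask`, `structured_mask`, `sampled_fraction`) and the tiny example matrix are assumptions made here and are not taken from the talk.

```python
import random


def uniform_mask(m, n, p, rng):
    """Uniform sampling: observe each entry independently with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(m)]


def structured_mask(M, p0, p1, rng):
    """Structured sampling: observe zero entries w.p. p0, nonzero entries w.p. p1."""
    return [[rng.random() < (p0 if x == 0 else p1) for x in row] for row in M]


def sampled_fraction(mask):
    """Proportion of entries that were observed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


rng = random.Random(0)  # fixed seed so the sketch is reproducible
M = [[0.0, 0.7, 0.0],
     [0.2, 0.0, 0.9]]   # hypothetical toy matrix

omega_uniform = uniform_mask(2, 3, 0.5, rng)
# With p0 = 0 and p1 = 1 the mask is exactly the support of M.
omega_structured = structured_mask(M, p0=0.0, p1=1.0, rng=rng)
```

The mask plays the role of the observed index set Ω in the two optimization problems; a solver for NNM or ℓ1-NNM would take `M` and the mask as input.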

4

slide-11
SLIDE 11

Simple Inferences

Entrywise Mean λ(M): mean of the entries of M

  • Entrywise mean error: $E_\lambda = |\lambda(\hat{M}) - \lambda(M)|$.

Row Mean µ(M): average row of M

  • Normalized row mean error: $E_\mu = \dfrac{\|\mu(\hat{M}) - \mu(M)\|_2}{\|\mu(M)\|_2}$.

⊲ original matrix, $M$

⊲ recovered matrix, $\hat{M}$
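As a small worked example, the two inference errors can be computed directly from their definitions. The sketch below uses plain Python lists; the function names and the 2 × 2 matrices are illustrative assumptions, not data from the talk.

```python
import math


def entrywise_mean(M):
    """lambda(M): mean of all entries of M."""
    return sum(sum(row) for row in M) / (len(M) * len(M[0]))


def row_mean(M):
    """mu(M): the average row of M, i.e. the vector of column means."""
    m = len(M)
    return [sum(col) / m for col in zip(*M)]


def entrywise_mean_error(M, M_hat):
    """E_lambda = |lambda(M_hat) - lambda(M)|."""
    return abs(entrywise_mean(M_hat) - entrywise_mean(M))


def row_mean_error(M, M_hat):
    """E_mu = ||mu(M_hat) - mu(M)||_2 / ||mu(M)||_2."""
    mu, mu_hat = row_mean(M), row_mean(M_hat)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_hat, mu)))
    den = math.sqrt(sum(b ** 2 for b in mu))
    return num / den


M = [[1.0, 2.0], [3.0, 4.0]]      # hypothetical original matrix
M_hat = [[1.0, 2.0], [3.0, 2.0]]  # hypothetical recovered matrix

e_lambda = entrywise_mean_error(M, M_hat)  # |2.0 - 2.5| = 0.5
e_mu = row_mean_error(M, M_hat)            # ||(0, -1)||_2 / ||(2, 3)||_2
```

Errors in individual entries can cancel in these aggregates: for a recovered matrix `[[2.0, 1.0], [3.0, 4.0]]`, two entries are wrong but the entrywise mean error is 0.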

5

slide-16
SLIDE 16

Experimental Design - Synthetic Data

⊲ 30 × 30 rank 5 matrix generated as product of sparse matrices with nonzero entries sampled uniformly from [0, 1]

⊲ each trial consists of sampling, completion, and inference on original and completed matrices

  • matrix is sampled via uniform sampling and structured sampling (with listed p0), and completed with NNM and ℓ1-NNM respectively

  • ℓ1 regularization parameter α is chosen in {0.05, 0.1, 0.2, ..., 0.5} to minimize matrix recovery error

⊲ matrix recovery error and inference errors averaged over 10 trials
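One way such a synthetic matrix could be generated is sketched below. The slides do not state how sparse the factors are, so the `density` parameter (and its value 0.3), the seed, and the helper names are assumptions made purely for illustration. The product of a 30 × 5 and a 5 × 30 factor has rank at most 5.

```python
import random


def sparse_factor(rows, cols, density, rng):
    """Each entry is Uniform[0, 1] with probability `density`, otherwise 0."""
    return [[rng.uniform(0.0, 1.0) if rng.random() < density else 0.0
             for _ in range(cols)] for _ in range(rows)]


def matmul(A, B):
    """Plain-Python matrix product of list-of-lists matrices."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def synthetic_matrix(m=30, n=30, r=5, density=0.3, seed=0):
    """Product of two sparse factors; the result has rank at most r."""
    rng = random.Random(seed)
    A = sparse_factor(m, r, density, rng)  # m x r
    B = sparse_factor(r, n, density, rng)  # r x n
    return matmul(A, B)


M = synthetic_matrix()
```

Because the factors are sparse and nonnegative, the product typically contains exact zeros, which is what makes the distinction between p0 and p1 in structured sampling meaningful.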

6

slide-17
SLIDE 17

Synthetic Data

⊲ p0 = 0

⊲ ω is the proportion of entries sampled

7

slide-18
SLIDE 18

Synthetic Data

⊲ p0 = 0.2

⊲ ω is the proportion of entries sampled

8

slide-19
SLIDE 19

Synthetic Data

⊲ p0 = 0.4

⊲ ω is the proportion of entries sampled

9

slide-24
SLIDE 24

Experimental Design - MyLymeData

⊲ complete 30 × 16 submatrix of MyLymeData

⊲ each trial consists of sampling, completion, and inference on original and completed matrices

  • matrix is sampled via uniform sampling and structured sampling (with listed p0), and completed with NNM and ℓ1-NNM respectively

  • ℓ1 regularization parameter α is chosen in {0.05, 0.1, 0.2, ..., 0.5} to minimize matrix recovery error

⊲ matrix recovery error and inference errors averaged over 10 trials
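The trial protocol (sample, complete, infer, then average over trials) can be sketched as a small harness. Several pieces here are assumptions: the slides do not define the matrix recovery error, so a relative Frobenius error is used; and `mean_fill` is a deliberately simple placeholder completion, not NNM or ℓ1-NNM. Any real solver with the same `(M, mask) -> M_hat` shape could be passed in instead.

```python
import math
import random


def frobenius_error(M, M_hat):
    """Assumed recovery error: ||M - M_hat||_F / ||M||_F."""
    num = sum((a - b) ** 2 for r, s in zip(M, M_hat) for a, b in zip(r, s))
    den = sum(a ** 2 for r in M for a in r)
    return math.sqrt(num / den)


def entrywise_mean_error(M, M_hat):
    """|lambda(M_hat) - lambda(M)| for the entrywise mean lambda."""
    size = len(M) * len(M[0])
    return abs(sum(map(sum, M_hat)) - sum(map(sum, M))) / size


def mean_fill(M, mask):
    """PLACEHOLDER completion (not NNM): fill unobserved entries with the observed mean."""
    obs = [x for r, k in zip(M, mask) for x, keep in zip(r, k) if keep]
    fill = sum(obs) / len(obs) if obs else 0.0
    return [[x if keep else fill for x, keep in zip(r, k)]
            for r, k in zip(M, mask)]


def run_trials(M, p, complete, trials=10, seed=0):
    """Average recovery and inference errors over `trials` uniform samplings."""
    rng = random.Random(seed)
    rec, inf = 0.0, 0.0
    for _ in range(trials):
        mask = [[rng.random() < p for _ in row] for row in M]  # sampling
        M_hat = complete(M, mask)                              # completion
        rec += frobenius_error(M, M_hat)                       # recovery error
        inf += entrywise_mean_error(M, M_hat)                  # inference error
    return rec / trials, inf / trials


M = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]  # hypothetical toy matrix
avg_rec, avg_inf = run_trials(M, p=0.5, complete=mean_fill)
```

Selecting α for ℓ1-NNM would add an outer loop over the candidate values that keeps the one with the smallest recovery error.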

10

slide-25
SLIDE 25

MyLymeData

⊲ p0 = 0

⊲ ω is the proportion of entries sampled

11

slide-26
SLIDE 26

MyLymeData

⊲ p0 = 0.2

⊲ ω is the proportion of entries sampled

12

slide-27
SLIDE 27

MyLymeData

⊲ p0 = 0.4

⊲ ω is the proportion of entries sampled

13

slide-28
SLIDE 28

Preliminary Error Bounds

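Inference        Error Bound

Entrywise Mean   $|\lambda(M) - \lambda(\hat{M})| \le (mn)^{-\frac{1}{q}} \|M - \hat{M}\|_q$

Row Mean         $\|\mu(M) - \mu(\hat{M})\|_q \le \left(\dfrac{n^{q-1}}{m}\right)^{\frac{1}{q}} \|M - \hat{M}\|_q$

⊲ $M \in \mathbb{R}^{m \times n}$

⊲ recovered matrix, $\hat{M}$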
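The bounds can be checked numerically on random matrices. The slide does not say which matrix norm $\|\cdot\|_q$ denotes; the sketch below assumes the entrywise ℓq norm, under which both inequalities hold. The function names and the random test matrices are illustrative only.

```python
import random


def q_norm(values, q):
    """l_q norm of a flat sequence of numbers."""
    return sum(abs(v) ** q for v in values) ** (1.0 / q)


def check_bounds(M, M_hat, q):
    """Return (mean gap, mean bound, row-mean gap, row-mean bound)."""
    m, n = len(M), len(M[0])
    D = [[a - b for a, b in zip(r, s)] for r, s in zip(M, M_hat)]
    flat = [d for row in D for d in row]
    d_norm = q_norm(flat, q)  # ASSUMPTION: entrywise q-norm of M - M_hat

    mean_gap = abs(sum(flat)) / (m * n)                      # |lambda(M) - lambda(M_hat)|
    row_gap = q_norm([sum(col) / m for col in zip(*D)], q)   # ||mu(M) - mu(M_hat)||_q

    mean_bound = (m * n) ** (-1.0 / q) * d_norm
    row_bound = (n ** (q - 1) / m) ** (1.0 / q) * d_norm
    return mean_gap, mean_bound, row_gap, row_bound


rng = random.Random(0)
M = [[rng.random() for _ in range(3)] for _ in range(4)]
M_hat = [[x + rng.uniform(-0.1, 0.1) for x in row] for row in M]
mean_gap, mean_bound, row_gap, row_bound = check_bounds(M, M_hat, q=2)
```

Both bounds scale with the recovery error $\|M - \hat{M}\|_q$ but shrink with the matrix dimensions, which is consistent with inference errors being smaller than the recovery error itself.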

14

slide-29
SLIDE 29

Entrywise Mean Simulation

15

slide-30
SLIDE 30

Row Mean Simulation

16

slide-33
SLIDE 33

Conclusions and Future Directions

  • inference errors can be smaller than the associated matrix recovery errors

  • structured sampling and ℓ1-NNM often result in better matrix and inference recovery than uniform sampling and NNM

  • develop exact recovery guarantees for ℓ1-NNM on matrices with observed entries selected via structured sampling

17

slide-34
SLIDE 34

References and Acknowledgements

[Candès and Recht, 2009] Emmanuel J. Candès and Benjamin Recht (2009). Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics 9, 717–772.

[Molitor and Needell, 2018] Denali Molitor and Deanna Needell (2018). Matrix Completion for Structured Observations. arXiv preprint arXiv:1801.09657.

[Eldén, 2007] Lars Eldén (2007). Matrix Methods in Data Mining and Pattern Recognition, 69. Society for Industrial and Applied Mathematics, Philadelphia.

Thank you to Professor Andrea Bertozzi, Dr. Anna Ma, Lorraine Johnson (LDo CEO), and the patients who contributed to the MyLymeData database!

18

slide-35
SLIDE 35

Thanks!

Questions?

19