On Inferences from Completed Data
Jamie Haddock
February 14, 2019
Computational and Applied Mathematics, UCLA
joint with 2019 UCLA REU group (D. Molitor, D. Needell, S. Sambandam, J. Song, S. Sun)

Motivation
MyLymeData is a large collection of Lyme disease patient survey data collected by LymeDisease.org (∼12,000 patients, 100s of questions)
• data is highly incomplete due to branching structure of surveys and missing responses
• research questions of interest do not require individual entries
Question: Can we perform statistical inferences on imputed data?

Main Question

Sampling and Imputation Techniques
Uniform Sampling: Sample each entry with uniform probability $p$.
Structured Sampling: Sample zero and nonzero entries with probabilities $p_0$ and $p_1$, respectively.
Nuclear Norm Minimization (NNM): $\min_X \|X\|_*$ s.t. $X_{ij} = M_{ij}$ for all $(i,j) \in \Omega$
$\ell_1$-Regularized Nuclear Norm Minimization ($\ell_1$-NNM): $\min_X \|X\|_* + \alpha \|X_{\Omega^C}\|_1$ s.t. $X_{ij} = M_{ij}$ for all $(i,j) \in \Omega$
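The two sampling schemes above can be sketched as boolean observation masks. This is a minimal illustration using only the Python standard library; the function names `uniform_mask`, `structured_mask`, and `observed_fraction` are hypothetical and not taken from the talk, and the completion step (NNM or $\ell_1$-NNM) is not implemented here because it needs a convex solver.

```python
import random

def uniform_mask(m, n, p, rng):
    """Observe each of the m*n entries independently with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(m)]

def structured_mask(M, p0, p1, rng):
    """Observe zero entries with probability p0 and nonzero entries with p1."""
    return [[rng.random() < (p0 if x == 0 else p1) for x in row] for row in M]

def observed_fraction(mask):
    """Proportion of entries marked as observed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

# Tiny hypothetical example: with p0 = 0 and p1 = 1 the mask is exactly
# the nonzero pattern of M, so no randomness remains.
rng = random.Random(0)
M = [[0.0, 0.7, 0.0],
     [0.2, 0.0, 0.9]]
mask = structured_mask(M, p0=0.0, p1=1.0, rng=rng)
print(mask)
print(observed_fraction(mask))
```

The entries where the mask is true play the role of the observed set $\Omega$ in the two optimization problems.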

Simple Inferences
Entrywise Mean $\lambda(M)$: mean of the entries of $M$
• Entrywise mean error: $E_\lambda = |\lambda(\hat{M}) - \lambda(M)|$
Row Mean $\mu(M)$: average row of $M$
• Normalized row mean error: $E_\mu = \|\mu(\hat{M}) - \mu(M)\|_2 / \|\mu(M)\|_2$
⊲ original matrix, $M$
⊲ recovered matrix, $\hat{M}$
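As a small worked example of the two error measures, the sketch below computes them for a 2 × 2 matrix and a perturbed copy. It assumes "average row" means the vector of column-wise averages over the rows; the function names and the example matrices are illustrative only.

```python
import math

def entrywise_mean(M):
    """lambda(M): mean of all entries of M."""
    return sum(sum(row) for row in M) / (len(M) * len(M[0]))

def row_mean(M):
    """mu(M): the average row of M, a vector with one entry per column."""
    m = len(M)
    return [sum(col) / m for col in zip(*M)]

def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def entrywise_mean_error(M, M_hat):
    """E_lambda = |lambda(M_hat) - lambda(M)|."""
    return abs(entrywise_mean(M_hat) - entrywise_mean(M))

def row_mean_error(M, M_hat):
    """E_mu = ||mu(M_hat) - mu(M)||_2 / ||mu(M)||_2."""
    mu, mu_hat = row_mean(M), row_mean(M_hat)
    diff = [a - b for a, b in zip(mu_hat, mu)]
    return norm2(diff) / norm2(mu)

M = [[1.0, 2.0], [3.0, 4.0]]       # original matrix
M_hat = [[1.0, 2.0], [3.0, 6.0]]   # hypothetical recovered matrix
print(entrywise_mean_error(M, M_hat))  # |3.0 - 2.5| = 0.5
print(row_mean_error(M, M_hat))        # 1 / sqrt(13)
```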

Experimental Design - Synthetic Data
⊲ 30 × 30 rank 5 matrix generated as product of sparse matrices with nonzero entries sampled uniformly from [0, 1]
⊲ each trial consists of sampling, completion, and inference on original and completed matrices
• matrix is sampled via uniform sampling and structured sampling (with listed $p_0$), and completed with NNM and $\ell_1$-NNM respectively
• $\ell_1$ regularization parameter $\alpha$ is chosen in {0.05, 0.1, 0.2, ..., 0.5} to minimize matrix recovery error
⊲ matrix recovery error and inference errors averaged over 10 trials
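One way the synthetic matrix described above might be generated is sketched below. The slides do not state the sparsity level of the factors, so the `density` value here is an assumption, as are the function names and the seed. The product of a 30 × 5 and a 5 × 30 factor has rank at most 5; the sketch does not verify that the rank is exactly 5.

```python
import random

def sparse_factor(rows, cols, density, rng):
    """Sparse matrix: each entry is nonzero with probability `density`,
    and nonzero entries are drawn uniformly from [0, 1)."""
    return [[rng.random() if rng.random() < density else 0.0
             for _ in range(cols)]
            for _ in range(rows)]

def matmul(A, B):
    """Plain-Python matrix product A @ B."""
    Bt = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def synthetic_matrix(n=30, r=5, density=0.3, seed=0):
    """n x n matrix of rank at most r, built as a product of sparse factors.
    `density` and `seed` are assumed values, not taken from the slides."""
    rng = random.Random(seed)
    A = sparse_factor(n, r, density, rng)
    B = sparse_factor(r, n, density, rng)
    return matmul(A, B)

M = synthetic_matrix()
zeros = sum(1 for row in M for x in row if x == 0)
print(len(M), len(M[0]), zeros)
```

Because the factors are sparse and nonnegative, the product contains exact zeros, which is what makes the zero/nonzero distinction in structured sampling meaningful for this test matrix.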

Synthetic Data
⊲ $p_0 = 0$
⊲ $\omega$ is proportion of entries sampled

Synthetic Data
⊲ $p_0 = 0.2$
⊲ $\omega$ is proportion of entries sampled

Synthetic Data
⊲ $p_0 = 0.4$
⊲ $\omega$ is proportion of entries sampled

Experimental Design - MyLymeData
⊲ complete 30 × 16 submatrix of MyLymeData
⊲ each trial consists of sampling, completion, and inference on original and completed matrices
• matrix is sampled via uniform sampling and structured sampling (with listed $p_0$), and completed with NNM and $\ell_1$-NNM respectively
• $\ell_1$ regularization parameter $\alpha$ is chosen in {0.05, 0.1, 0.2, ..., 0.5} to minimize matrix recovery error
⊲ matrix recovery error and inference errors averaged over 10 trials

MyLymeData
⊲ $p_0 = 0$
⊲ $\omega$ is proportion of entries sampled

MyLymeData
⊲ $p_0 = 0.2$
⊲ $\omega$ is proportion of entries sampled

MyLymeData
⊲ $p_0 = 0.4$
⊲ $\omega$ is proportion of entries sampled

Preliminary Error Bounds
Inference: Error Bound
Entrywise Mean: $|\lambda(M) - \lambda(\hat{M})| \le (mn)^{-1/q} \|M - \hat{M}\|_q$
Row Mean: $\|\mu(M) - \mu(\hat{M})\|_q \le \left( \frac{n^{q-1}}{m} \right)^{1/q} \|M - \hat{M}\|_q$
⊲ $M \in \mathbb{R}^{m \times n}$
⊲ recovered matrix, $\hat{M}$
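The entrywise mean bound, read as $|\lambda(M) - \lambda(\hat{M})| \le (mn)^{-1/q} \|M - \hat{M}\|_q$, can be checked numerically on random matrices. The sketch below assumes that $\|\cdot\|_q$ denotes the entrywise $q$-norm (the slide does not say which matrix norm is meant); the function names, matrix sizes, and noise level are illustrative.

```python
import random

def entrywise_mean(M):
    """lambda(M): mean of all entries of M."""
    return sum(sum(row) for row in M) / (len(M) * len(M[0]))

def entrywise_q_norm(X, q):
    """Assumed reading of ||X||_q: (sum over entries of |x|^q)^(1/q)."""
    return sum(abs(x) ** q for row in X for x in row) ** (1.0 / q)

def mean_bound_sides(M, M_hat, q):
    """Return (lhs, rhs) of |lambda(M) - lambda(M_hat)| <= (mn)^(-1/q) ||M - M_hat||_q."""
    m, n = len(M), len(M[0])
    D = [[a - b for a, b in zip(r, r_hat)] for r, r_hat in zip(M, M_hat)]
    lhs = abs(entrywise_mean(M) - entrywise_mean(M_hat))
    rhs = (m * n) ** (-1.0 / q) * entrywise_q_norm(D, q)
    return lhs, rhs

rng = random.Random(0)
M = [[rng.random() for _ in range(6)] for _ in range(4)]
M_hat = [[x + rng.uniform(-0.1, 0.1) for x in row] for row in M]
for q in (1, 2, 3):
    lhs, rhs = mean_bound_sides(M, M_hat, q)
    print(q, lhs <= rhs + 1e-12)
```

Under this reading the bound follows from Hölder's inequality applied to the entries of $M - \hat{M}$, and it is tight when all entries of the difference are equal.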

Entrywise Mean Simulation

Row Mean Simulation

Conclusions and Future Directions
• inference errors can be smaller than the associated matrix recovery errors
• structured sampling and $\ell_1$-NNM often result in better matrix and inference recovery than uniform sampling and NNM
• develop exact recovery guarantees for $\ell_1$-NNM on matrices with observed entries selected via structured sampling

References and Acknowledgements
[Candès and Recht, 2009] Emmanuel J. Candès and Benjamin Recht (2009). Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics 9, 717–772.
[Molitor and Needell, 2018] Denali Molitor and Deanna Needell (2018). Matrix Completion for Structured Observations. arXiv preprint arXiv:1801.09657.
[Eldén, 2007] Lars Eldén (2007). Matrix Methods in Data Mining and Pattern Recognition, 69. Society for Industrial and Applied Mathematics, Philadelphia.
Thank you to Professor Andrea Bertozzi, Dr. Anna Ma, Lorraine Johnson (LDo CEO), and the patients who contributed to the MyLymeData database!

Thanks! Questions?
