Prediction from low-rank missing data
Elad Hazan (Princeton U), Roi Livni (Hebrew U), Yishay Mansour (Tel-Aviv U) & Microsoft Research (all of us)
Recommendation systems
Gender? Annual income? Will buy “Halo4”? Likes cats or dogs?
Setting: an unknown distribution over vectors/rows $x'_i \in \{0,1\}^n$; we observe $x_i \in \{*,0,1\}^n$ with missing entries; the full matrix $X'$ has rank $k$; labels $y_i \in \{0,1\}$; every row has at least $k$ observed entries.
Find: an efficient machine $M : \{*,0,1\}^n \to \mathbb{R}$ such that, with $\mathrm{poly}(1/\delta, 1/\epsilon, k, n)$ samples, with probability $1-\delta$:

$$\mathbb{E}_i\big[(M(x_i) - y_i)^2\big] - \min_{\|w\| \le 1} \mathbb{E}_i\big[(w^\top x'_i - y_i)^2\big] \le \epsilon$$

Kernel version:

$$\mathbb{E}_i\big[(M(x_i) - y_i)^2\big] - \min_{\|w\| \le 1} \mathbb{E}_i\big[(w^\top \phi(x'_i) - y_i)^2\big] \le \epsilon$$
§ Missing data (usually MOST data is missing)
§ Structure in the missing data (low rank)
§ NP-hard (low-rank reconstruction is a special case)
§ Can we use a non-proper approach? (distributional assumptions, convex relaxations for reconstruction)
Statistics books: i.i.d. missing entries, recovery from a (large) constant fraction of observed entries (MCAR, MAR), or a generative model of missingness (MNAR) – very different from what we need…
Method: add the predictions y as another column of X, then use matrix completion to reconstruct & predict (sketch below).
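For concreteness, a minimal sketch of this matrix-completion baseline, using simple iterative hard-imputation (our illustration of the idea; the mc1/mcb baselines in the experiments may differ in details):

```python
import numpy as np

def complete_and_predict(X_obs, y_train, rank, iters=200):
    """Fill missing entries (np.nan) of [X | y] by iterated rank-r SVD
    projection ("hard impute"), then read predictions off the y column."""
    m = X_obs.shape[0]
    col = np.full(m, np.nan)
    col[: len(y_train)] = y_train             # labels known on training rows only
    Z = np.column_stack([X_obs, col])
    mask = np.isnan(Z)                        # entries to complete
    filled = np.where(mask, 0.0, Z)           # initialize by 0-imputation
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, low_rank, Z)  # keep observed entries fixed
    return filled[len(y_train):, -1]          # completed label column = predictions
```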
But reconstruction is neither sufficient nor necessary!
[Figure: a partially observed 0/1 matrix with an unknown entry “??”; two different completions of the same observed entries are shown, disagreeing on that entry. Both are rank-2 completions.]
Gender? Annual income? Will buy “Halo4”?
[Figure: a rating matrix whose rows, despite the missing entries, span a k-dimensional subspace.]
There is a recoverable k-dimensional subspace!!
§ Agnostic learning – compete with the best linear predictor that knows all the data, assuming it is rank k (or close)
§ Provable
§ Efficient (theoretically & practically)
§ Significantly improves prediction over standard datasets (Netflix, Jester, …)
§ Generalizes to kernel (non-linear) prediction
Theorem: unknown distribution over rows $x'_i \in \{0,1\}^n$; observed data $x_i \in \{*,0,1\}^n$ with missing entries; $X'$ has rank $k$; labels $y_i \in \{0,1\}$; every row has at least $k$ observed entries. We build an efficient machine $M : \{*,0,1\}^n \to \mathbb{R}$ such that, with $\mathrm{poly}(\log(1/\delta), k, n^{\log(1/\epsilon)})$ samples, with probability $1-\delta$:

$$\mathbb{E}_i\big[(M(x_i) - y_i)^2\big] - \min_{\|w\| \le 1} \mathbb{E}_i\big[(w^\top x'_i - y_i)^2\big] \le \epsilon$$

Extends to arbitrary kernels; the number of samples increases with the degree (for polynomial kernels).
§ Data matrix X of size m × n (X' is the full matrix; X is X' with hidden entries); rank(X') = k; every row has ≥ k visible entries
§ “Optimal predictor” = subspace + linear predictor (SVM)
§ B = basis, a k × n matrix
§ w = predictor, a vector in R^k
§ Given x = a row of X with unknown label y, predict according to B, w (rule below)
¡ Inefficiently: learn B, w directly – bounded sample complexity/regret (compact sets); in the distributional world, bounded fat-shattering dimension
Learning a hidden subspace is hidden-clique hard [Berthet & Rigollet ’13] – any hope for efficient algorithms? The hardness applies only to proper learning!!
§ Let s be the set of k coordinates visible in a given x. Then predict

$$\hat{y} = w^\top B_s^{-1} x_s,$$

where $B_s$ and $x_s$ are the submatrix (sub-vector) restricted to the coordinates s.
¡ “2 operations” – restriction to the coordinates s & a matrix inverse
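A minimal numpy sketch of this prediction rule (our illustration, not the authors' code), assuming the k chosen visible coordinates make $B_s$ invertible:

```python
import numpy as np

def predict(B, w, x, s):
    """Predict w^T B_s^{-1} x_s from the k observed coordinates s.

    B: k x n basis matrix, w: predictor in R^k,
    x: a row with missing entries, s: indices of k observed coordinates.
    """
    B_s = B[:, s]   # k x k restriction of the basis to the visible coordinates
    x_s = x[s]      # the visible entries themselves
    return w @ np.linalg.inv(B_s) @ x_s
```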
Replace the inverse by a polynomial (this requires a condition on the eigenvalues). Let C = I − B. By the Neumann series,

$$w^\top B_s^{-1} x_s = w^\top \left[\sum_{j=0}^{\infty} (I_s - B_s)^j\right] x_s,$$

and, truncating at degree q, up to a precision independent of k, n:

$$w^\top B_s^{-1} x_s = w^\top \left[\sum_{j=0}^{q-1} C_s^j\right] x_s + O(1/q).$$

Thus, consider the (non-proper) hypothesis class:

$$g_{C,w}(x_s) = w^\top \left[\sum_{j=0}^{q-1} C_s^j\right] x_s$$
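A quick numerical check of this truncation (a sketch under the eigenvalue condition discussed later; all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k, q = 5, 30

# Draw C_s with spectral radius < 1, as the eigenvalue condition requires,
# and set B_s = I - C_s.
C_s = rng.standard_normal((k, k))
C_s *= 0.9 / max(abs(np.linalg.eigvals(C_s)))
B_s = np.eye(k) - C_s

# sum_{j=0}^{q-1} C_s^j should approximate B_s^{-1}.
approx = sum(np.linalg.matrix_power(C_s, j) for j in range(q))
print(np.max(np.abs(approx - np.linalg.inv(B_s))))  # tiny truncation error
```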
Observation: each term of $g_{C,w}$ is a polynomial in C, w multiplied by a coordinate of x. Thus there is a kernel mapping $\phi$ and a vector $v = v(C,w)$ such that $g_{C,w}(x_s) = v \cdot \phi(x_s)$:

$$g_{C,w}(x_s) = \sum_{\substack{\ell \subseteq s \\ |\ell| \le q}} w_{\ell_1} C_{\ell_1,\ell_2} \cdots C_{\ell_{|\ell|-1},\ell_{|\ell|}} \cdot x_{\ell_{|\ell|}},$$

where $\ell$ ranges over sequences of at most q coordinates of s.
Kernel inner products take a closed form; $\phi(x_s) \cdot \phi(x_t)$ is computed in time n · q:

$$\phi(x^{(1)}_s) \cdot \phi(x^{(2)}_t) = \frac{|s \cap t|^q - 1}{|s \cap t| - 1} \sum_{k \in s \cap t} x^{(1)}_k x^{(2)}_k$$
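Why the closed form holds (a one-line check, our derivation from the expansion above): a feature is indexed by a sequence $\ell$ of length $r \le q$ with all coordinates in $s \cap t$, and there are $|s \cap t|^{r-1}$ such sequences for each choice of the last coordinate, so

$$\phi(x^{(1)}_s)\cdot\phi(x^{(2)}_t) = \sum_{r=1}^{q} |s\cap t|^{\,r-1} \sum_{k\in s\cap t} x^{(1)}_k x^{(2)}_k = \frac{|s\cap t|^q - 1}{|s\cap t| - 1} \sum_{k\in s\cap t} x^{(1)}_k x^{(2)}_k.$$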
Algorithm: SVM with this particular kernel.
Guarantee: agnostic, non-proper, as good as the best subspace embedding.
Nearly the same algorithm for every degree q!
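A minimal sketch of the algorithm (our code, not the authors' release): the closed-form kernel plugged into scikit-learn's SVM via a precomputed Gram matrix. Missing entries are encoded as np.nan; `karma_kernel`, `gram`, and the degree q=4 are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def karma_kernel(x1, x2, q):
    """(|s∩t|^q - 1)/(|s∩t| - 1) * sum_{k in s∩t} x1_k x2_k,
    where s, t are the observed (non-NaN) coordinates of x1, x2."""
    common = ~np.isnan(x1) & ~np.isnan(x2)        # s ∩ t
    m = common.sum()
    if m == 0:
        return 0.0
    geom = q if m == 1 else (m**q - 1) / (m - 1)  # geometric series; m=1 limit is q
    return geom * np.dot(x1[common], x2[common])

def gram(X1, X2, q=4):
    return np.array([[karma_kernel(a, b, q) for b in X2] for a in X1])

# Usage: X_train holds np.nan for '*' entries, y_train in {0, 1}.
# clf = SVC(kernel="precomputed").fit(gram(X_train, X_train), y_train)
# y_hat = clf.predict(gram(X_test, X_train))
```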
To apply the Taylor (Neumann) series, the eigenvalues of $C_s$ need to lie inside the unit circle. This reduces to an assumption on the pattern of missing entries.
The regret bound (sample complexity) depends on this parameter, which is provably a constant independent of the rank and the problem dimensions. The running time is independent of this parameter.
[Plots: prediction error curves comparing 0-imputation with matrix-completion baselines (mc1/mcb).]
Dataset           Karma   0-SVM   Mcb0   Mcb1   Geom
mammographic      0.17    0.17    0.17   0.18   0.17
bands             0.24    0.34    0.41   0.40   0.35
hepatitis         0.23    0.17    0.23   0.21   0.22
wisconsin         0.03    0.03    0.03   0.04   0.04
Horses            0.35    0.36    0.55   0.37   0.36
Movielens (age)   0.16    0.22    0.25   0.25   NaN

Prediction from recommendation data: the Movielens (age) row.
§ Reconstruction+relaxation approach is doomed to fail
§ Non-proper agnostic learning gives provable guarantees and an efficient algorithm
§ Benchmarks are promising
§ Non-reconstructive approaches for other types of missing data? A fully-polynomial algorithm?
§ When does reconstruction fail while agnostic/non-proper learning works?