One-shot learning and big data with n = 2

Lee Dicker
Department of Statistics, Rutgers University

Joint work with Dean Foster

DIMACS, May 16, 2013 – 1 / 26
Introduction and overview
One-shot learning

- Humans are able to correctly recognize and understand objects based on very few training examples.
  - e.g. images, words.

[Figure: a single training image labeled "Flamingo" and several test images to be labeled "Flamingo?"]

- Vast literature in cognitive science (Tenenbaum et al., 2006; Kemp et al., 2007), language acquisition (Carey et al., 1978; Xu et al., 2007), and computer vision (Fink, 2005; Fei-Fei et al., 2006).
One-shot learning

- Successful one-shot learning requires the learner to incorporate strong contextual information into the learning algorithm.
  - Image recognition: information on object categories.
    - Objects tend to be categorized by shape, color, etc.
  - Word-learning: common function words are often used in conjunction with a novel word and referent.
    - "This is a KOBA." Since "this", "is", and "a" are function words that often appear with nouns, KOBA is likely the new referent.
- Many recent statistical approaches to one-shot learning are based on hierarchical Bayesian models.
  - Effective in a variety of examples.
One-shot learning

- We propose a simple factor model for one-shot learning with continuous outcomes.
  - Highly idealized, but amenable to theoretical analysis.
- Novel risk approximations for:
  (i) assessing the performance of one-shot learning methods and
  (ii) gaining insight into the significance of various parameters for one-shot learning.
- The methods considered here are variants of principal component regression (PCR).
- One-shot asymptotic regime: fixed n, large d, strong contextual information.
  - See work by Hall, Jung, Marron, and co-authors on "high dimension, low sample size" data (especially work on PCA and classification).
- New insights into PCR:
  - The classical PCR estimator is generally inconsistent in the one-shot regime.
  - Bias-correction via expansion.
Outline

- Statistical setting.
- Principal component regression.
- Weak consistency and big data with n = 2.
- Risk approximations and consistency.
- Numerical results.
- Conclusions and future directions.
Statistical setting
The model

- The observed data consist of (y_1, x_1), ..., (y_n, x_n), where y_i ∈ R is a scalar outcome and x_i ∈ R^d is an associated d-dimensional "context" vector.
- We suppose that y_i and x_i are related via

      y_i = h_i θ + ξ_i,          h_i ∼ N(0, η²), ξ_i ∼ N(0, σ²),
      x_i = h_i γ √d u + ε_i,     ε_i ∼ N(0, τ² I).

- NB:
  - h_i, ξ_i ∈ R and ε_i ∈ R^d, 1 ≤ i ≤ n, are all assumed to be independent.
  - h_i is a latent factor linking y_i and x_i.
  - ξ_i and ε_i are random noise.
  - The unit vector u ∈ R^d and real numbers θ, γ ∈ R are non-random.
  - It is implicit in our normalization that the "x-signal" ||h_i γ √d u||² ≍ d is quite strong.
- To simplify notation, we let y = (y_1, ..., y_n) and X = (x_1, ..., x_n)^T.
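The model above can be simulated directly. A minimal numpy sketch (the function name and all parameter values are illustrative, not from the slides):

```python
# Hypothetical simulation of the factor model:
#   y_i = h_i*theta + xi_i,   x_i = h_i*gamma*sqrt(d)*u + eps_i,
# with h_i ~ N(0, eta^2), xi_i ~ N(0, sigma^2), eps_i ~ N(0, tau^2 I).
import numpy as np

def simulate(n, d, theta=1.0, gamma=1.0, eta=1.0, sigma=0.1, tau=1.0, seed=0):
    rng = np.random.default_rng(seed)
    u = np.zeros(d); u[0] = 1.0           # a fixed unit vector (illustrative)
    h = rng.normal(0.0, eta, size=n)      # latent factors linking y and x
    y = h * theta + rng.normal(0.0, sigma, size=n)
    X = np.outer(h, gamma * np.sqrt(d) * u) + rng.normal(0.0, tau, size=(n, d))
    return y, X, u

y, X, u = simulate(n=2, d=1000)
```

Note that the x-signal per observation has squared norm h_i²γ²d, of order d, matching the normalization remark above.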
Predictive risk

- Observe that (y_i, x_i) ∼ N(0, V) are jointly normal with

      V = ( θ²η² + σ²      θγη² √d u^T       )
          ( θγη² √d u      τ² I + η²γ² d uu^T ).      (†)

- Goal: Given the data (y, X), devise prediction rules ŷ : R^d → R so that the risk

      R_V(ŷ) = E_V {ŷ(x_new) − y_new}² = E_V {ŷ(x_new) − h_new θ}² + σ²

  is small, where (y_new, x_new) = (h_new θ + ξ_new, h_new γ √d u + ε_new) has the same distribution as (y_i, x_i) and is independent of (y, X).
- R_V(ŷ) is a measure of predictive risk, which is completely determined by ŷ and the parameter matrix V, given in (†).
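The risk decomposition can be sanity-checked by Monte Carlo for any fixed linear rule ŷ(x) = x^T b. The sketch below is hypothetical (function name and parameter values are not from the slides) and never forms x_new explicitly, since ε^T b ∼ N(0, τ²||b||²):

```python
# Monte Carlo check of R_V(yhat) = E{yhat(x_new) - h_new*theta}^2 + sigma^2
# for a linear rule yhat(x) = x^T b. All values are illustrative.
import numpy as np

def mc_risk(b, u, theta=1.0, gamma=1.0, eta=1.0, sigma=0.1, tau=1.0,
            n_mc=200_000, seed=1):
    rng = np.random.default_rng(seed)
    d = u.size
    h = rng.normal(0.0, eta, size=n_mc)        # latent factors
    xi = rng.normal(0.0, sigma, size=n_mc)     # y-noise
    # x_new^T b = h*gamma*sqrt(d)*(u^T b) + eps^T b, with eps^T b ~ N(0, tau^2||b||^2)
    pred = (h * gamma * np.sqrt(d) * (u @ b)
            + rng.normal(0.0, tau * np.linalg.norm(b), size=n_mc))
    lhs = np.mean((pred - (h * theta + xi)) ** 2)        # E{yhat - y_new}^2
    rhs = np.mean((pred - h * theta) ** 2) + sigma ** 2  # decomposition
    return lhs, rhs

d = 200
u = np.zeros(d); u[0] = 1.0
b = 0.005 * u                                  # an arbitrary linear rule
lhs, rhs = mc_risk(b, u)
```

The two sides agree up to Monte Carlo error because ξ_new is independent of x_new, so the cross term vanishes in expectation.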
One-shot asymptotic regime

- We are primarily interested in identifying methods ŷ that perform well in the one-shot asymptotic regime.
- Key features of the one-shot asymptotic regime:
  (i) n is fixed (small n),
  (ii) d → ∞ (large d),
  (iii) σ² → 0,
  (iv) inf η²γ²/τ² > 0 (abundant contextual information).
- NB:
  - σ² is the noise level for the "y-data."
  - η²γ²/τ² is the signal-to-noise ratio for the "x-data."
Principal component regression
Linear prediction rules

- By assumption, the data are multivariate normal. Thus,

      E_V(y_i | x_i) = x_i^T β,

  where β = θγη² √d u / (τ² + η²γ² d).
- This suggests studying linear prediction rules of the form

      ŷ(x) = x^T β̂

  for some estimator β̂ of β.
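The closed form for β can be verified numerically against the generic Gaussian regression coefficient Cov(x)^{-1} Cov(x, y), using the blocks of V. All numerical values below are illustrative:

```python
# Check: solving Cov(x) beta = Cov(x, y) with
#   Cov(x) = tau^2 I + eta^2 gamma^2 d uu^T,  Cov(x, y) = theta*gamma*eta^2*sqrt(d) u
# recovers beta = theta*gamma*eta^2*sqrt(d) u / (tau^2 + eta^2 gamma^2 d).
import numpy as np

theta, gamma, eta, tau, d = 1.3, 0.7, 1.1, 0.9, 40   # illustrative values
u = np.zeros(d); u[0] = 1.0

cov_xx = tau**2 * np.eye(d) + eta**2 * gamma**2 * d * np.outer(u, u)
cov_xy = theta * gamma * eta**2 * np.sqrt(d) * u
beta_solve = np.linalg.solve(cov_xx, cov_xy)
beta_closed = theta * gamma * eta**2 * np.sqrt(d) * u / (tau**2 + eta**2 * gamma**2 * d)
```

The agreement follows because Cov(x, y) is parallel to u, and Cov(x) acts on u as multiplication by τ² + η²γ²d.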
Principal component regression

- Let l_1 ≥ ··· ≥ l_{n∧d} ≥ 0 denote the n∧d largest eigenvalues of X^T X, in decreasing order, and let û_1, ..., û_{n∧d} denote corresponding eigenvectors with unit length.
  - û_1, ..., û_{n∧d} are the principal components of X.
- Let U_k = (û_1 ··· û_k) be the d × k matrix with columns given by û_1, ..., û_k, for 1 ≤ k ≤ n∧d. In its most basic form, principal component regression involves regressing y on XU_k for some (typically small) k, and taking β̂ = U_k (U_k^T X^T X U_k)^{-1} U_k^T X^T y.
- In the problem considered here, Cov(x_i) = τ² I + η²γ² d uu^T has a single eigenvalue larger than τ², and the corresponding eigenvector is parallel to β. Thus, it is natural to take k = 1 and consider the principal component regression (PCR) estimator

      β̂_pcr = (û_1^T X^T y / û_1^T X^T X û_1) û_1 = (û_1^T X^T y / l_1) û_1.
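A minimal sketch of the k = 1 PCR estimator on simulated data (the function name and all parameter values are illustrative, not from the slides):

```python
# k = 1 PCR estimator: beta_pcr = (u1^T X^T y / l1) * u1, where (l1, u1)
# is the top eigenpair of X^T X, computed here via the SVD of X.
import numpy as np

def pcr_estimator(y, X):
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    u1, l1 = Vt[0], s[0] ** 2            # top eigenpair of X^T X
    return (u1 @ (X.T @ y) / l1) * u1

rng = np.random.default_rng(0)
n, d = 5, 500
u = np.zeros(d); u[0] = 1.0              # the true direction (illustrative)
h = rng.normal(size=n)                    # latent factors
y = h + rng.normal(0.0, 0.1, size=n)      # theta = 1, sigma = 0.1
X = np.outer(h, np.sqrt(d) * u) + rng.normal(size=(n, d))  # gamma = tau = 1
beta_hat = pcr_estimator(y, X)
```

Equivalently, β̂_pcr is û_1 scaled by the coefficient from regressing y on the single derived predictor z = X û_1, i.e. (z^T y)/(z^T z).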
Weak consistency and big data with n = 2
PCR with n = 2

- As a warm-up for the general n setting, we consider the special case where n = 2.
- When n = 2, the PCR estimator β̂_pcr has an especially simple form because the largest eigenvalue of X^T X and its corresponding eigenvector are given explicitly by

      l_1 = (1/2) { ||x_1||² + ||x_2||² + √[ (||x_1||² − ||x_2||²)² + 4 (x_1^T x_2)² ] },

      û_1 ∝ [(l_1 − ||x_2||²) / (x_1^T x_2)] x_1 + x_2.

- Recall that x_i = h_i γ √d u + ε_i. Using the large-d approximations

      ||x_i||² ≈ h_i² γ² d + τ² d,      x_1^T x_2 ≈ h_1 h_2 γ² d

  leads to...
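The explicit n = 2 eigenpair can be checked against a generic decomposition. A hedged sketch (simulated data and the helper name are illustrative; it assumes x_1^T x_2 ≠ 0):

```python
# Closed-form top eigenpair of X^T X for n = 2, verified numerically below.
import numpy as np

def top_eigenpair_n2(x1, x2):
    n1, n2, c = x1 @ x1, x2 @ x2, x1 @ x2
    l1 = 0.5 * (n1 + n2 + np.sqrt((n1 - n2) ** 2 + 4 * c ** 2))
    v = ((l1 - n2) / c) * x1 + x2        # unnormalized top eigenvector
    return l1, v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d = 50
x1 = rng.normal(size=d)
x2 = 0.5 * x1 + rng.normal(size=d)       # correlated, so x1^T x2 is safely nonzero
l1, u1 = top_eigenpair_n2(x1, x2)
```

The eigenvector formula comes from the 2 x 2 Gram matrix XX^T: its top eigenvector (w_1, w_2) satisfies w_1/w_2 = (l_1 − ||x_2||²)/(x_1^T x_2), and û_1 ∝ w_1 x_1 + w_2 x_2.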