

  1. One-shot learning and big data with n = 2. Lee Dicker, Department of Statistics, Rutgers University. Joint work with Dean Foster. DIMACS, May 16, 2013.

  2. Introduction and overview (section title slide).

  3. One-shot learning
  - Humans are able to correctly recognize and understand objects based on very few training examples, e.g. images or words.
  - [Figure: one training image labeled "Flamingo"; several test images to be labeled "Flamingo?"]
  - There is a vast literature in cognitive science (Tenenbaum et al., 2006; Kemp et al., 2007), language acquisition (Carey et al., 1978; Xu et al., 2007), and computer vision (Fink, 2005; Fei-Fei et al., 2006).

  4. One-shot learning
  - Successful one-shot learning requires the learner to incorporate strong contextual information into the learning algorithm.
  - Image recognition: information on object categories. Objects tend to be categorized by shape, color, etc.
  - Word-learning: common function words are often used in conjunction with a novel word and referent. Example: "This is a KOBA." Since "this", "is", and "a" are function words that often appear with nouns, KOBA is likely the new referent.
  - Many recent statistical approaches to one-shot learning are based on hierarchical Bayesian models, which have been effective in a variety of examples.

  5. One-shot learning
  - We propose a simple factor model for one-shot learning with continuous outcomes. It is highly idealized, but amenable to theoretical analysis.
  - Novel risk approximations for (i) assessing the performance of one-shot learning methods and (ii) gaining insight into the significance of various parameters for one-shot learning.
  - The methods considered here are variants of principal component regression (PCR).
  - One-shot asymptotic regime: fixed n, large d, strong contextual information. See work by Hall, Jung, Marron, and co-authors on "high dimension, low sample size" data (especially work on PCA and classification).
  - New insights into PCR: the classical PCR estimator is generally inconsistent in the one-shot regime; bias-correction via expansion.

  6. Outline
  - Statistical setting.
  - Principal component regression.
  - Weak consistency and big data with n = 2.
  - Risk approximations and consistency.
  - Numerical results.
  - Conclusions and future directions.

  7. Statistical setting (section title slide).

  8. The model
  - The observed data consist of (y_1, x_1), ..., (y_n, x_n), where y_i ∈ R is a scalar outcome and x_i ∈ R^d is an associated d-dimensional "context" vector.
  - We suppose that y_i and x_i are related via

        y_i = h_i θ + ξ_i,    x_i = h_i γ √d u + ε_i,

    where h_i ~ N(0, η²), ξ_i ~ N(0, σ²), and ε_i ~ N(0, τ² I).
  - NB:
    - h_i, ξ_i ∈ R and ε_i ∈ R^d, 1 ≤ i ≤ n, are all assumed to be independent.
    - h_i is a latent factor linking y_i and x_i; ξ_i and ε_i are random noise.
    - The unit vector u ∈ R^d and the real numbers θ, γ ∈ R are non-random.
    - It is implicit in our normalization that the "x-signal" ||h_i γ √d u||² ≍ d is quite strong.
  - To simplify notation, we let y = (y_1, ..., y_n) and X = (x_1, ..., x_n)^T. A simulation sketch of this model follows below.
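The factor model is straightforward to simulate, which is useful for sanity-checking the quantities that appear later. Below is a minimal NumPy sketch; the function name simulate_one_shot and the parameter values in the example call are illustrative assumptions, not from the talk.

```python
import numpy as np

def simulate_one_shot(n, d, theta, gamma, eta2, sigma2, tau2, rng):
    """Draw (y, X) from the one-shot factor model (illustrative sketch)."""
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                         # non-random unit vector u in R^d
    h = rng.normal(0.0, np.sqrt(eta2), size=n)     # latent factors h_i ~ N(0, eta^2)
    xi = rng.normal(0.0, np.sqrt(sigma2), size=n)  # y-noise xi_i ~ N(0, sigma^2)
    eps = rng.normal(0.0, np.sqrt(tau2), size=(n, d))  # x-noise eps_i ~ N(0, tau^2 I)
    y = theta * h + xi                             # y_i = h_i*theta + xi_i
    X = np.outer(h, gamma * np.sqrt(d) * u) + eps  # x_i = h_i*gamma*sqrt(d)*u + eps_i
    return y, X, u

rng = np.random.default_rng(0)
y, X, u = simulate_one_shot(n=2, d=10_000, theta=1.0, gamma=1.0,
                            eta2=1.0, sigma2=0.01, tau2=1.0, rng=rng)
```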

  9. Predictive risk
  - Observe that (y_i, x_i) ~ N(0, V) are jointly normal with

        V = [ θ²η² + σ²      θγη²√d u^T
              θγη²√d u       τ²I + η²γ²d uu^T ].   (†)

  - Goal: given the data (y, X), devise prediction rules ŷ : R^d → R so that the risk

        R_V(ŷ) = E_V{ŷ(x_new) − y_new}² = E_V{ŷ(x_new) − h_new θ}² + σ²

    is small, where (y_new, x_new) = (h_new θ + ξ_new, h_new γ √d u + ε_new) has the same distribution as (y_i, x_i) and is independent of (y, X).
  - R_V(ŷ) is a measure of predictive risk; it is completely determined by ŷ and the parameter matrix V given in (†). A Monte Carlo sketch of this risk follows below.
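Because R_V(ŷ) is an expectation over an independent draw (y_new, x_new), it can be approximated by Monte Carlo using the decomposition above (the σ² term is added exactly rather than simulated). The helper below is a hypothetical sketch under the same model assumptions, not code from the talk.

```python
import numpy as np

def mc_risk(y_hat, u, theta, gamma, eta2, sigma2, tau2, n_mc=5000, seed=1):
    """Monte Carlo approximation of R_V(y_hat) using
    R_V(y_hat) = E{y_hat(x_new) - h_new*theta}^2 + sigma^2 (sketch)."""
    rng = np.random.default_rng(seed)
    d = u.shape[0]
    h_new = rng.normal(0.0, np.sqrt(eta2), size=n_mc)   # fresh latent factors
    X_new = (np.outer(h_new, gamma * np.sqrt(d) * u)
             + rng.normal(0.0, np.sqrt(tau2), size=(n_mc, d)))
    preds = np.apply_along_axis(y_hat, 1, X_new)        # y_hat applied row-wise
    return np.mean((preds - theta * h_new) ** 2) + sigma2
```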

  10. One-shot asymptotic regime
  - We are primarily interested in identifying methods ŷ that perform well in the one-shot asymptotic regime.
  - Key features of the one-shot asymptotic regime:
    (i) n is fixed and (ii) d → ∞ (small n, large d);
    (iii) σ² → 0 and (iv) inf η²γ²/τ² > 0 (abundant contextual information).
  - NB: σ² is the noise level for the "y-data," and η²γ²/τ² is the signal-to-noise ratio for the "x-data."

  11. Principal component regression (section title slide).

  12. Linear prediction rules
  - By assumption, the data are multivariate normal. Thus,

        E_V(y_i | x_i) = x_i^T β,  where  β = θγη²√d u / (τ² + η²γ²d).

  - This suggests studying linear prediction rules of the form ŷ(x) = x^T β̂ for some estimator β̂ of β; a sketch of the resulting "oracle" rule follows below.
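As a baseline, the population coefficient β can be computed directly from its formula and plugged into the linear rule ŷ(x) = x^T β. This is a sketch of that "oracle" rule under the model assumptions above; oracle_beta is a hypothetical helper name, and the commented call reuses the mc_risk sketch from the previous slide.

```python
import numpy as np

def oracle_beta(u, theta, gamma, eta2, tau2):
    """beta = theta*gamma*eta^2*sqrt(d) * u / (tau^2 + eta^2*gamma^2*d) (sketch)."""
    d = u.shape[0]
    return theta * gamma * eta2 * np.sqrt(d) * u / (tau2 + eta2 * gamma**2 * d)

# Example (hypothetical parameters): as d grows, the oracle linear rule's
# risk approaches the irreducible noise level sigma^2.
# beta = oracle_beta(u, theta=1.0, gamma=1.0, eta2=1.0, tau2=1.0)
# risk = mc_risk(lambda x: x @ beta, u, theta=1.0, gamma=1.0,
#                eta2=1.0, sigma2=0.01, tau2=1.0)
```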

  13. Principal component regression
  - Let l_1 ≥ ··· ≥ l_{n∧d} ≥ 0 denote the n∧d largest eigenvalues of X^T X, in decreasing order, and let û_1, ..., û_{n∧d} denote corresponding eigenvectors with unit length. û_1, ..., û_{n∧d} are the principal components of X.
  - Let U_k = (û_1 ··· û_k) be the d × k matrix with columns û_1, ..., û_k, for 1 ≤ k ≤ n∧d. In its most basic form, principal component regression involves regressing y on XU_k for some (typically small) k and taking β̂ = U_k(U_k^T X^T X U_k)^{-1} U_k^T X^T y.
  - In the problem considered here, Cov(x_i) = τ²I + η²γ²d uu^T has a single eigenvalue larger than τ², and the corresponding eigenvector is parallel to β. Thus, it is natural to take k = 1 and consider the principal component regression (PCR) estimator (implemented in the sketch below)

        β̂_pcr = (û_1^T X^T y / û_1^T X^T X û_1) û_1 = (û_1^T X^T y / l_1) û_1.
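A direct implementation of the PCR estimator is below. Since d can be much larger than n, the sketch extracts the top eigenvectors of X^T X from the SVD of X rather than forming the d × d matrix; the function name pcr_estimator is an illustrative choice.

```python
import numpy as np

def pcr_estimator(y, X, k=1):
    """PCR: beta_hat = U_k (U_k^T X^T X U_k)^{-1} U_k^T X^T y (sketch)."""
    # Right singular vectors of X are unit eigenvectors of X^T X,
    # and squared singular values are the eigenvalues l_1 >= l_2 >= ...
    _, svals, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = Vt[:k].T            # d x k matrix of top-k principal components
    l = svals[:k] ** 2        # eigenvalues l_1, ..., l_k of X^T X
    # U_k^T X^T X U_k = diag(l_1, ..., l_k), so its inverse is elementwise
    return U_k @ ((U_k.T @ (X.T @ y)) / l)

# With k = 1 this reduces exactly to beta_pcr = (u1^T X^T y / l_1) * u1.
```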

  14. Weak consistency and big data with n = 2 (section title slide).

  15. PCR with n = 2
  - As a warm-up for the general-n setting, we consider the special case where n = 2.
  - When n = 2, the PCR estimator β̂_pcr has an especially simple form, because the largest eigenvalue of X^T X and its corresponding eigenvector are given explicitly by

        l_1 = (1/2){ ||x_1||² + ||x_2||² + √[ (||x_1||² − ||x_2||²)² + 4(x_1^T x_2)² ] },

        û_1 ∝ [ (l_1 − ||x_2||²) / (x_1^T x_2) ] x_1 + x_2.

    (A numerical check of this eigenpair follows below.)
  - Recall that x_i = h_i γ √d u + ε_i. Using the large-d approximations ||x_i||² ≈ h_i²γ²d + τ²d and x_1^T x_2 ≈ h_1 h_2 γ²d leads to...
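The closed-form eigenpair can be checked numerically against a direct eigendecomposition of the 2 × 2 Gram matrix XX^T, whose nonzero eigenvalues coincide with those of X^T X. In this sketch x_1 and x_2 are arbitrary illustrative vectors, not draws from the factor model.

```python
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(500), rng.standard_normal(500)

a, b, c = x1 @ x1, x2 @ x2, x1 @ x2    # ||x1||^2, ||x2||^2, x1^T x2
l1 = 0.5 * (a + b + np.sqrt((a - b) ** 2 + 4 * c ** 2))
u1 = (l1 - b) / c * x1 + x2            # unnormalized top eigenvector of X^T X
u1 /= np.linalg.norm(u1)

X = np.vstack([x1, x2])                # n = 2 design matrix
evals, evecs = np.linalg.eigh(X @ X.T) # 2x2 Gram matrix, eigenvalues ascending
assert np.isclose(l1, evals[-1])       # closed-form l_1 matches
u1_ref = X.T @ evecs[:, -1]            # lift Gram eigenvector to one of X^T X
u1_ref /= np.linalg.norm(u1_ref)
assert np.isclose(abs(u1 @ u1_ref), 1.0)  # same direction up to sign
```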
