

  1. Computational and Statistical Aspects of Statistical Machine Learning
     John Lafferty
     Department of Statistics Retreat, Gleacher Center

  2. Outline
     • "Modern" nonparametric inference for high dimensional data
       ◮ Nonparametric reduced rank regression
     • Risk-computation tradeoffs
       ◮ Covariance-constrained linear regression
     • Other research and teaching activities

  3. Context for High Dimensional Nonparametrics
     Great progress in recent years on high dimensional linear models. Many problems have important nonlinear structure. We've been studying "purely functional" methods for high dimensional, nonparametric inference:
     • no basis expansions
     • no Mercer kernels

  4. Additive Models
     Fully nonparametric models appear hopeless:
     • logarithmic scaling, p = log n (e.g., "Rodeo", Lafferty and Wasserman (2008))
     Additive models are a useful compromise:
     • exponential scaling, p = exp(n^c) (e.g., "SpAM", Ravikumar, Lafferty, Liu and Wasserman (2009))

  5. Additive Models
     [Figure: fitted additive component functions for four predictors: Age, Bmi, Map, Tc.]

  6. Multivariate Regression
     Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X). Linear model Y = BX + ε where B ∈ R^{q×p}. Reduced rank regression: r = rank(B) ≤ C.
     Recent work has studied the properties and high dimensional scaling of reduced rank regression where the nuclear norm ‖B‖_* is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,
     $$\|\hat{B}_n - B^*\|_F = O_P\!\left(\sqrt{\frac{\mathrm{Var}(\epsilon)\, r\,(p+q)}{n}}\right)$$
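For concreteness, here is a minimal NumPy sketch of classical reduced rank regression with a hard rank constraint, which the nuclear-norm-penalized estimator on the following slides relaxes. The function name and the rows-as-observations convention (so the coefficient matrix is p × q, the transpose of the slide's B) are illustrative assumptions, not from the talk.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Classical reduced rank regression (hard rank constraint), for illustration.
    X is n x p, Y is n x q; the returned coefficient matrix is p x q."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)    # unconstrained least squares, p x q
    fitted = X @ B_ols                               # n x q fitted values
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]                          # projection onto top-r response directions
    return B_ols @ P_r                               # coefficient matrix of rank <= r
```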

  7. Low-Rank Matrices and Convex Relaxation
     [Figure: the set of low rank matrices, rank(X) ≤ t, and its convex relaxation, the nuclear norm ball ‖X‖_* ≤ t.]

  8. Nuclear Norm Regularization
     Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix B onto the nuclear norm ball ‖X‖_* ≤ t:
     • Compute the SVD: B = U diag(σ) V^T
     • Soft threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
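A minimal NumPy sketch of these two operations: the soft-thresholding step (the proximal operator of the nuclear norm) and exact projection onto the nuclear norm ball, which amounts to an ℓ1-ball projection of the singular values. Function names are mine, not from the talk.

```python
import numpy as np

def svd_soft_threshold(B, lam):
    """Soft-threshold the singular values of B: B <- U diag(Soft_lam(sigma)) V^T,
    the proximal operator of lam * ||B||_*."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt

def project_nuclear_ball(B, t):
    """Project B onto {X : ||X||_* <= t} by projecting its singular values onto
    the l1 ball of radius t (a data-dependent soft threshold). Assumes t > 0."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    if sigma.sum() <= t:
        return B
    # Find lam such that sum(max(sigma - lam, 0)) = t.
    s = np.sort(sigma)[::-1]
    cumsum = np.cumsum(s)
    rho = np.nonzero(s - (cumsum - t) / (np.arange(len(s)) + 1) > 0)[0][-1]
    lam = (cumsum[rho] - t) / (rho + 1)
    return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt
```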

  9. Nonparametric Reduced Rank Regression (Foygel, Horrell, Drton and Lafferty, NIPS 2012)
     Nonparametric multivariate regression: m(X) = (m_1(X), ..., m_q(X))^T.
     Each component is an additive model:
     $$m_k(X) = \sum_{j=1}^{p} m_{kj}(X_j)$$
     What is the nonparametric analogue of the ‖B‖_* penalty?

  10. Low Rank Functions
     What does it mean for a set of functions m_1(x), ..., m_q(x) to be low rank?
     Let x_1, ..., x_n be a collection of points. We require that the n × q matrix M(x_{1:n}) = [m_k(x_i)] is low rank. Stochastic setting: M = [m_k(X_i)].
     The natural penalty is
     $$\frac{1}{\sqrt{n}}\,\|M\|_* = \frac{1}{\sqrt{n}}\sum_{s=1}^{q}\sigma_s(M) = \sum_{s=1}^{q}\sqrt{\lambda_s\!\left(\tfrac{1}{n}M^T M\right)}$$
     Population version:
     $$|\!|\!| M |\!|\!|_* := \big\|\mathrm{Cov}(M(X))^{1/2}\big\|_* = \big\|\Sigma(M)^{1/2}\big\|_*$$
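A small sketch of how the empirical penalty could be computed from the matrix of function evaluations M = [m_k(x_i)]. Both forms below reflect the identity on this slide; function names are mine.

```python
import numpy as np

def functional_nuclear_norm(M):
    """(1/sqrt(n)) * ||M||_* for the n x q matrix M = [m_k(x_i)]."""
    n = M.shape[0]
    return np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)

def functional_nuclear_norm_via_cov(M):
    """Equivalent form: sum_s sqrt(lambda_s((1/n) M^T M))."""
    n = M.shape[0]
    eigvals = np.linalg.eigvalsh(M.T @ M / n)
    return np.sqrt(np.clip(eigvals, 0.0, None)).sum()
```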

  11. Constrained Rank Additive Models (CRAM)
     Let Σ_j = Cov(M_j). Two natural penalties:
     $$\big\|\Sigma_1^{1/2}\big\|_* + \big\|\Sigma_2^{1/2}\big\|_* + \cdots + \big\|\Sigma_p^{1/2}\big\|_*
       \qquad\text{and}\qquad
       \big\|\big(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\big)\big\|_*$$
     Population risk (first penalty):
     $$\tfrac{1}{2}\,\mathbb{E}\,\Big\|Y - \sum_j M_j(X_j)\Big\|_2^2 + \lambda \sum_j |\!|\!| M_j |\!|\!|_*$$
     Linear case:
     $$\sum_{j=1}^{p}\big\|\Sigma_j^{1/2}\big\|_* = \sum_{j=1}^{p}\|B_j\|_2
       \qquad\text{and}\qquad
       \big\|\big(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\big)\big\|_* = \|B\|_*$$

  12. CRAM Backfitting Algorithm (Penalty 1)
     Input: data (X_i, Y_i), regularization parameter λ.
     Iterate until convergence. For each j = 1, ..., p:
     • Compute the residual: R_j = Y − Σ_{k≠j} M̂_k(X_k)
     • Estimate the projection P_j = E(R_j | X_j) by smoothing: P̂_j = S_j R_j
     • Compute the SVD: (1/n) P̂_j^T P̂_j = U diag(τ) U^T
     • Soft-threshold: M̂_j = P̂_j U diag([1 − λ/√τ]_+) U^T
     Output: estimator M̂(X_i) = Σ_j M̂_j(X_ij).
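A rough Python sketch of this backfitting loop, assuming a generic linear smoother. The Nadaraya-Watson smoother matrix, the fixed bandwidth, and the fixed iteration count are placeholder choices for illustration, not the talk's implementation.

```python
import numpy as np

def kernel_smoother_matrix(x, bandwidth):
    """Illustrative univariate Nadaraya-Watson smoother matrix S_j (n x n);
    any linear smoother could be substituted."""
    diffs = (x[:, None] - x[None, :]) / bandwidth
    W = np.exp(-0.5 * diffs ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cram_backfit(X, Y, lam, bandwidth=0.5, n_iters=50):
    """Sketch of CRAM backfitting (penalty 1): smooth the partial residuals on
    each coordinate, then soft-threshold the fitted component's singular values
    via the eigendecomposition of (1/n) P_j^T P_j."""
    n, p = X.shape
    q = Y.shape[1]
    M = np.zeros((p, n, q))                               # M[j] holds m_j evaluated at the data
    S = [kernel_smoother_matrix(X[:, j], bandwidth) for j in range(p)]
    for _ in range(n_iters):                              # fixed iteration count stands in for a convergence check
        for j in range(p):
            R_j = Y - (M.sum(axis=0) - M[j])              # partial residual, n x q
            P_j = S[j] @ R_j                              # smoothed projection estimate
            tau, U = np.linalg.eigh(P_j.T @ P_j / n)      # (1/n) P_j^T P_j = U diag(tau) U^T
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = P_j @ U @ np.diag(shrink) @ U.T        # soft-thresholded component
    return M.sum(axis=0), M                               # fitted values and components
```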

  13. Scaling of Estimation Error
     Using a "double covering" technique (1/2-parametric, 1/2-nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:
     $$\sup_V \big\|\Sigma(V) - \hat{\Sigma}_n(V)\big\|_{\mathrm{sp}} = O_P\!\left(\sqrt{\frac{q + \log(pq)}{n}}\right)$$
     This allows us to bound the excess risk of the empirical estimator relative to an oracle.

  14. Summary
     • Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.
     • We're building a toolbox for large scale, high dimensional nonparametric inference.

  15. Computation-Risk Tradeoffs
     • In "traditional" computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time
     • Valiant's PAC model
     • Mostly negative results: it is not possible to efficiently learn in natural settings
     • Claim: distinctions in polynomial time matter most

  16. Analogy: Numerical Optimization
     In numerical optimization, it is well understood how to trade off computation for speed of convergence:
     • First order methods: linear cost, linear convergence
     • Quasi-Newton methods: quadratic cost, superlinear convergence
     • Newton's method: cubic cost, quadratic convergence
     Are similar tradeoffs possible in statistical learning?

  17. Hints of a Computation-Risk Tradeoff
     Graph estimation:
     • Our method for estimating the graph of an Ising model requires n = Ω(d^3 log p) samples and T = O(p^4) time for graphs with p nodes and maximum degree d
     • Information-theoretic lower bound: n = Ω(d log p)

  18. Statistical vs. Computational Efficiency
     Challenge: understand how families of estimators with different computational efficiencies can yield different statistical efficiencies.
     $$\mathrm{Rate}_{H,F}(n) = \inf_{\hat{m}_n \in H}\ \sup_{m \in F}\ \mathrm{Risk}(\hat{m}_n, m)$$
     • H: computationally constrained hypothesis class
     • F: smoothness constraints on the "true" model

  19. Computation-Risk Tradeoffs for Linear Regression
     Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.

  20. Computation-Risk Tradeoffs for Linear Regression
     The standard ridge estimator solves
     $$\hat{\beta}_\lambda = \Big(\tfrac{1}{n}X^T X + \lambda_n I\Big)^{-1}\tfrac{1}{n}X^T Y$$
     Sparsify the sample covariance to get the estimator
     $$\hat{\beta}_{t,\lambda} = \Big(T_t[\hat{\Sigma}] + \lambda_n I\Big)^{-1}\tfrac{1}{n}X^T Y$$
     where T_t[Σ̂] is the hard-thresholded sample covariance:
     $$T_t([m_{ij}]) = \big[m_{ij}\,\mathbf{1}(|m_{ij}| > t)\big]$$
     Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time O(m log^2 p).
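A hedged sketch of the covariance-thresholded estimator. It uses an off-the-shelf conjugate gradient solver as a stand-in for the fast SDD solvers mentioned above; the function name, the CG choice, and the assumption that the thresholded system stays positive definite are mine, not the talk's.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

def thresholded_ridge(X, y, t, lam):
    """Covariance-thresholded ridge sketch: hard-threshold the sample covariance,
    then solve the resulting sparse linear system iteratively."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n                                   # sample covariance (1/n) X^T X
    T_t = np.where(np.abs(Sigma_hat) > t, Sigma_hat, 0.0)     # hard threshold T_t[Sigma_hat]
    A = csr_matrix(T_t + lam * np.eye(p))                     # sparse system matrix T_t[Sigma_hat] + lam I
    b = X.T @ y / n                                           # (1/n) X^T y
    beta, _ = cg(A, b)                                        # CG assumes A is positive definite
    return beta
```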

  21. Computation-Risk Tradeoffs for Linear Regression
     Dinah has recently proved that the statistical error scales as
     $$\big\|\hat{\beta}_{t,\lambda} - \beta^*\big\| = O_P\big(\|T_t(\Sigma) - \Sigma\|_2\big)\cdot\|\beta^*\| = O\big(t^{1-q}\big)\cdot\|\beta^*\|$$
     for the class of covariance matrices with rows in sparse ℓ_q balls (as studied by Bickel and Levina).
     • Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff

  22. Simulation
     [Figure: risk as a function of the regularization parameter lambda.]

  23. Some Other Projects
     • Minhua Chen: Convex optimization for dictionary learning
     • Eric Janofsky: Nonparanormal component analysis
     • Min Xu: High dimensional conditional density and graph estimation

  24. Courses in the Works
     • Winter 2013: Nonparametric Inference (Undergraduate and Masters)
     • Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)
     Charles Cary: developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.
