
SLIDE 1

Sparse Shrunk Additive Models

Guodong Liu (University of Pittsburgh), Hong Chen (Huazhong Agricultural University), Heng Huang (University of Pittsburgh). June 14, 2020

SLIDES 2-5
  • 1. Motivation

Deep models have made great progress in learning from large datasets; however, statistical models can perform better on smaller ones, and they usually offer better interpretability.

◮ Linear model.
  ◮ The linear assumption is too restrictive.
  ◮ Non-linear effects arise in real applications.

◮ Generalized additive model.
  ◮ Nonparametric extension of linear models.
  ◮ Flexible and adaptive to high-dimensional data.
  ◮ Univariate smooth component functions.
  ◮ Pre-defined group structure information.

SLIDES 6-8
  • 2. Contribution

◮ Propose a unified framework that bridges sparse feature selection, sparse sample selection, and feature interaction structure learning.
◮ Provide a generalization bound on the excess risk under mild conditions, which implies that a fast convergence rate can be achieved.
◮ Derive the necessary and sufficient condition that characterizes the sparsity of SSAM.

SLIDES 9-13
  • 3. Algorithm: Sparse Shrunk Additive Models

◮ Let $\mathcal{X} \subset \mathbb{R}^n$ be an explanatory feature space and let $\mathcal{Y} \subset [-1, 1]$ be the response set. Let $\mathbf{z} := \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ be independent copies of a random sample $(x, y)$ following an unknown intrinsic distribution $\rho$ on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$.
◮ For any given $1 \le k \le n$ and the index set $\{1, 2, \ldots, n\}$, we denote by $d = \binom{n}{k}$ the number of index subsets with $k$ elements. Let $x^{(j)} \in \mathbb{R}^k$ be a subset of $x$ with $k$ features and denote its corresponding space by $\mathcal{X}^{(j)}$.
◮ Let $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ be a continuous function satisfying $\|K^{(j)}\|_\infty < +\infty$.
◮ For any given $\mathbf{z}$, we define the data-dependent hypothesis space as $\mathcal{H}_{\mathbf{z}} = \{f : f(x) = \sum_{j=1}^{d} f^{(j)}(x^{(j)}),\ f^{(j)} \in \mathcal{H}_{\mathbf{z}}^{(j)}\}$, where $\mathcal{H}_{\mathbf{z}}^{(j)} = \{f^{(j)} = \sum_{i=1}^{m} \alpha_i^{(j)} K^{(j)}(x_i^{(j)}, \cdot) : \alpha_i^{(j)} \in \mathbb{R}\}$.
◮ Denote $\|f^{(j)}\|_{\ell^1} = \inf\{\sum_{t=1}^{m} |\alpha_t^{(j)}| : f^{(j)} = \sum_{t=1}^{m} \alpha_t^{(j)} K^{(j)}(x_t^{(j)}, \cdot)\}$ and $\|f\|_{\ell^1} := \sum_{j=1}^{d} \|f^{(j)}\|_{\ell^1}$ for $f = \sum_{j=1}^{d} f^{(j)}$.
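
As a concrete instance (the one used later in the synthetic experiments): with $n = 10$ features and pairwise groups $k = 2$,
$$ d = \binom{10}{2} = 45, $$
so the additive model has 45 candidate component functions, indexed by $j \in \{1, \ldots, 45\}$.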

SLIDE 14
  • 3. Algorithm: Sparse Shrunk Additive Models

Predictor of SSAM:
$$ f_{\mathbf{z}} = \sum_{j=1}^{d} f_{\mathbf{z}}^{(j)} = \sum_{j=1}^{d} \sum_{t=1}^{m} \hat{\alpha}_t^{(j)} K^{(j)}(x_t^{(j)}, \cdot), $$
where, for $1 \le t \le m$ and $1 \le j \le d$,
$$ \{\hat{\alpha}_t^{(j)}\} = \arg\min_{\alpha_t^{(j)} \in \mathbb{R},\, t,\, j} \bigg\{ \lambda \sum_{j=1}^{d} \sum_{t=1}^{m} |\alpha_t^{(j)}| + \frac{1}{m} \sum_{i=1}^{m} \Big( y_i - \sum_{j=1}^{d} \sum_{t=1}^{m} \alpha_t^{(j)} K^{(j)}(x_t^{(j)}, x_i^{(j)}) \Big)^2 \bigg\}. \qquad (1) $$
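
A minimal sketch of how estimator (1) could be solved in practice: since the loss is squared and the penalty is an $\ell^1$ norm over all coefficients, (1) is an ordinary Lasso over the stacked kernel columns $K^{(j)}(x_t^{(j)}, x_i^{(j)})$. The Gaussian kernel, the helper names, and the use of scikit-learn are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_ssam(X, y, k=2, lam=0.1, sigma=1.0):
    """Solve (1); returns the d = C(n, k) feature groups and coefficients alpha[j, t]."""
    m, n = X.shape
    groups = [list(c) for c in combinations(range(n), k)]
    # Design matrix with one m x m kernel block per group j; column (j, t) holds
    # K^(j)(x_t^(j), x_i^(j)) for i = 1..m.
    Phi = np.hstack([gaussian_kernel(X[:, g], X[:, g], sigma) for g in groups])
    # scikit-learn's Lasso minimizes (1/(2m))||y - Phi w||^2 + alpha ||w||_1,
    # so alpha = lam / 2 matches the scaling of (1).
    model = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=10000)
    model.fit(Phi, y)
    return groups, model.coef_.reshape(len(groups), m)

def predict_ssam(X_train, X_new, groups, alpha, sigma=1.0):
    """Evaluate f_z(x) = sum_j sum_t alpha_t^(j) K^(j)(x_t^(j), x^(j))."""
    pred = np.zeros(X_new.shape[0])
    for j, g in enumerate(groups):
        pred += gaussian_kernel(X_new[:, g], X_train[:, g], sigma) @ alpha[j]
    return pred
```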

SLIDE 15
  • 3. Algorithm: Sparse Shrunk Additive Models

SSAM from the viewpoint of function approximation:
$$ f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_{\mathbf{z}}} \bigg\{ \frac{1}{m} \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\ell^1} \bigg\}. $$

SLIDE 16
  • 4. Theoretical Analysis: Assumptions

Assumption 1: Assume that $f_\rho = \sum_{j=1}^{d} f_\rho^{(j)}$, where for each $j \in \{1, 2, \ldots, d\}$, $f_\rho^{(j)} : \mathcal{X}^{(j)} \to \mathbb{R}$ is a function of the form $f_\rho^{(j)} = L_{\tilde{K}^{(j)}}^{r}\big(g_\rho^{(j)}\big)$ with some $r > 0$ and $g_\rho^{(j)} \in L^2_{\rho_{\mathcal{X}^{(j)}}}$.

Assumption 2: For each $j \in \{1, 2, \ldots, d\}$, the kernel function $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ is $C^s$ with some $s > 0$, satisfying
$$ \big|K^{(j)}(u, v) - K^{(j)}(u, v')\big| \le c_s \|v - v'\|_2^{s}, \qquad \forall\, u, v, v' \in \mathcal{X}^{(j)}, $$
for some positive constant $c_s$.
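
For reference, $L_{\tilde{K}^{(j)}}$ in Assumption 1 is read here as the integral operator associated with the (possibly normalized) kernel $\tilde{K}^{(j)}$, and $L_{\tilde{K}^{(j)}}^{r}$ as its $r$-th power; this is the standard learning-theory convention and is stated here as an assumption, since the slides do not define $\tilde{K}^{(j)}$ explicitly:
$$ \big(L_{\tilde{K}^{(j)}} g\big)(u) = \int_{\mathcal{X}^{(j)}} \tilde{K}^{(j)}(u, v)\, g(v)\, d\rho_{\mathcal{X}^{(j)}}(v), \qquad g \in L^2_{\rho_{\mathcal{X}^{(j)}}}, $$
with the fractional power defined via the spectral decomposition of $L_{\tilde{K}^{(j)}}$. Larger $r$ then corresponds to a smoother regression component $f_\rho^{(j)}$.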

SLIDE 17
  • 4. Theoretical Analysis: Theorems

Theorem 1: Let Assumptions 1 and 2 hold. For any $0 < \delta < 1$, with confidence $1 - \delta$, there exists a positive constant $\tilde{c}_1$ independent of $m, \delta$ such that:
(1) If $r \in (0, \frac{1}{2})$ in Assumption 1, setting $\lambda = m^{-\theta_1}$ with $\theta_1 \in (0, \frac{2}{2+p})$,
$$ \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_1 \log(8/\delta)\, m^{-\gamma_1}, \quad \text{where } \gamma_1 = \min\Big\{ 2r\theta_1,\ \frac{1 - \theta_1 + 2r\theta_1}{2},\ \frac{2}{2+p} - (2 - 2r)\theta_1,\ \frac{2(1 - p\theta_1)}{2+p} \Big\}. $$
(2) If $r \ge \frac{1}{2}$ in Assumption 1, taking $\lambda = m^{-\theta_2}$ with some $\theta_2 \in (0, \frac{2}{2+p})$,
$$ \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_1 \log(8/\delta)\, m^{-\gamma_2}, \quad \text{where } \gamma_2 = \min\Big\{ \theta_2,\ \frac{1}{2},\ \frac{2}{2+p} - \theta_2 \Big\}. $$
SLIDE 18
  • 4. Theoretical Analysis: Remark

◮ Theorem 1 provides an upper bound on the generalization error of SSAM with Lipschitz continuous kernels.
◮ For $r \in (0, \frac{1}{2})$, as $s \to \infty$ we have $\gamma_1 \to \min\{ 2r\theta_1,\ \frac{1}{2} + (r - \frac{1}{2})\theta_1,\ 1 - 2\theta_1 + 2r\theta_1 \}$.
◮ When $r \to \frac{1}{2}$ and $\theta_1 \to \frac{1}{2}$, the convergence rate $O(m^{-\frac{1}{2}})$ can be reached.
◮ For $r \ge \frac{1}{2}$, taking $\theta_2 = \frac{1}{2+p}$, we get the convergence rate $O(m^{-\frac{1}{2+p}})$.
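
As a quick check of the last point: plugging $\theta_2 = \frac{1}{2+p}$ into $\gamma_2$ from Theorem 1(2) gives
$$ \gamma_2 = \min\Big\{ \frac{1}{2+p},\ \frac{1}{2},\ \frac{2}{2+p} - \frac{1}{2+p} \Big\} = \frac{1}{2+p} \qquad (p \ge 0), $$
so the excess risk decays as $O(m^{-\frac{1}{2+p}})$, as stated.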

SLIDES 19-20
  • 4. Theoretical Analysis: Theorems

Theorem 2: Assume that $f_\rho^{(j)} \in \mathcal{H}^{(j)}$ for each $1 \le j \le d$. Take $\lambda = m^{-\frac{2}{2+3p}}$ in (1). For any $0 < \delta < 1$, with confidence $1 - \delta$ we have
$$ \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_2 \log(1/\delta)\, m^{-\frac{2}{2+3p}}, $$
where $\tilde{c}_2$ is a positive constant independent of $m, \delta$.

◮ The result covers the special case where $f_\rho^{(j)} \in \mathcal{H}^{(j)}$.
◮ Under this stronger condition on $f_\rho$, the convergence rate can be arbitrarily close to $O(m^{-1})$ as $s \to \infty$.
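
To see why the rate approaches $O(m^{-1})$: in this analysis the capacity exponent $p$ appears to shrink toward $0$ as the kernel smoothness $s$ grows (the same limit is taken in the remark to Theorem 1), and in that case
$$ m^{-\frac{2}{2+3p}} \longrightarrow m^{-1} \quad \text{as } p \to 0. $$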

SLIDES 21-25
  • 5. Empirical Evaluation: Synthetic Data Setting

◮ Pairwise interaction setting: $k = 2$, $d = \binom{n}{2}$.
◮ Each kernel on $\mathcal{X}^{(j)}$ is generated from a Gaussian kernel.
◮ Data generation. We generate the $n$-dimensional input $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$ with $x_{ij} = \frac{W_{ij} + \eta U_i}{1 + \eta}$ and $n = 10$, where $W$ and $U$ are sampled from independent uniform distributions on $[-0.5, 0.5]$ (a sketch of this setup follows the list).
◮ Feature selection criterion. Features are selected according to the magnitude of $\sum_{t=1}^{100} \hat{\alpha}_t^{(j)}$ for $j \in \{1, \ldots, 45\}$.
◮ Performance measure. Precision@$\tau$ is the number of truly informative features among the top-$\tau$ selected results.
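
Below is a hedged sketch of this synthetic setup. The design $x_{ij} = (W_{ij} + \eta U_i)/(1 + \eta)$ follows the slides; the value of $\eta$, the sample size, and the ground-truth function are hypothetical placeholders, since they are not specified in the slides shown here. `fit_ssam` refers to the sketch given after equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 100, 10, 0.5                       # eta is a hypothetical choice
W = rng.uniform(-0.5, 0.5, size=(m, n))
U = rng.uniform(-0.5, 0.5, size=m)
X = (W + eta * U[:, None]) / (1.0 + eta)       # correlated features via the shared U_i

# Hypothetical ground truth built from two interacting feature pairs (not from the slides).
y = np.sin(3.0 * X[:, 0] * X[:, 1]) + (X[:, 2] - X[:, 3]) ** 2 + 0.05 * rng.normal(size=m)

groups, alpha = fit_ssam(X, y, k=2, lam=0.05)  # d = C(10, 2) = 45 pairwise groups

# Selection criterion from the slides: rank the 45 groups by |sum_{t=1}^{100} alpha_t^(j)|.
scores = np.abs(alpha.sum(axis=1))
tau = 5
top = np.argsort(scores)[::-1][:tau]
selected = {tuple(groups[j]) for j in top}
truly_informative = {(0, 1), (2, 3)}           # matches the hypothetical ground truth above
print("Precision@%d = %d" % (tau, len(selected & truly_informative)))
```
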
SLIDE 26
  • 5. Empirical Evaluation: Synthetic Data Result

SLIDE 27
  • 5. Empirical Evaluation: Real Data Results

Table: Average MSE on real-world benchmark data.

              SSAM      SALSA     COSSO     SpAM      Lasso
  Insulin     1.0146    1.0206    1.1379    1.2035    1.1103
  Skillcraft  0.5432    0.5470    0.5551    0.90545   0.6650
  Airfoil     0.4866    0.5176    0.5178    0.9623    0.5199
  Forestfire  0.3477    0.3530    0.3753    0.9694    0.5193
  Housing     0.3787    0.2642    1.3097    0.8165    0.4452
  CCPP        0.0694    0.0678    0.9684    0.0647    0.0740
  Music       0.6295    0.6251    0.7982    0.7683    0.6349
  Telemonit   0.0689    0.0347    5.7192    0.8643    0.0863

SLIDES 28-29
  • 6. Discussion

◮ Computational complexity: it could be reduced by introducing a distributed strategy, which we leave as future work.
◮ Proving the feature selection consistency of SSAM is another direction for future work.

SLIDE 30

Thank You