  1. Learning Sparse Polynomials over product measures
     Kiran Vodrahalli, knv2109@columbia.edu
     Columbia University, December 11, 2017

  2. The Problem: "Learning Sparse Polynomial Functions" [Andoni, Panigrahy, Valiant, Zhang '14]
     Consider learning a polynomial $f : \mathbb{R}^n \to \mathbb{R}$ of degree $d$ with $k$ monomials.
     Key features of the setting:
     ◮ real-valued (in contrast to many works considering $f : \{-1, 1\}^n \to \{-1, 1\}$)
     ◮ "sparse" (only $k$ monomials)
     ◮ distribution over the data $x$: Gaussian or uniform
     ◮ only product measures are considered
     ◮ realizable setting: we assume we try to exactly recover the polynomial
     Why this setting?
     ◮ sparsity gives a notion of "low dimension"
     ◮ Boolean settings are hard (parity functions)
     We outline the results of Andoni et al. '14 in this talk.

  3. Background and Motivation: computation and sample complexities
     Goal: learn the polynomial in time and sample complexity $o(n^d)$.
     ◮ many approaches to learning take $O(n^d)$ samples and computation time
     ◮ polynomial kernel regression in the $\binom{n}{d}$-sized basis:
       ◮ sample complexity: same as linear regression (depends linearly on the dimension, in this case $n^d$)
       ◮ computation complexity: worse than $n^d$
     ◮ compressed sensing in $\binom{n}{d}$ dimensions:
       ◮ $f(x) := \langle v, x^{\otimes d} \rangle$, where $v$ is $k$-sparse and $x$ is the data
       ◮ sub-linear complexity results hold only for particular settings of the data (RIP, incoherence, nullspace property)
       ◮ unclear whether these hold for $X^{\otimes d}$ (probably not)
     ◮ dimension reduction + regression (e.g., principal components regression); note this is improper learning
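
To make the $O(n^d)$ baseline above concrete, here is a minimal Python check (standard library only; the specific values of n and d are just an illustration) of how fast the degree-$\le d$ monomial basis grows:

    from math import comb

    # Number of monomials of degree <= d in n variables: C(n + d, d),
    # which grows like n^d for fixed d. This is the basis size the
    # kernel-regression and compressed-sensing baselines pay for.
    n, d = 100, 4
    print(comb(n + d, d))  # 4598126: already millions for modest n and d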

  4. The Results: sub-$O(n^d)$ samples and computation
     Two key results: an oracle setting, and learning from samples.
     Definition. The inner product $\langle h_1, h_2 \rangle$ is defined with respect to a distribution $D$ over the data $X$ as $\mathbb{E}_D[h_1(x) h_2(x)]$. We also write $\|h\|^2 = \langle h, h \rangle$.
     Definition. A correlation oracle pair calculates $\langle f^*, f \rangle$ and $\langle (f^*)^2, f \rangle$, where $f^*$ is the true polynomial.
     ◮ in the oracle setting, one can exactly learn the polynomial $f^*$ in $O(k \cdot nd)$ oracle calls
     ◮ if learning from samples $(x, f^*(x))$, one learns $\hat{f}$ such that $\|\hat{f} - f^*\| \le \epsilon$:
       ◮ sample complexity: $O(\mathrm{poly}(n, k, 1/\epsilon, m))$
       ◮ $m = 2^d$ if $D$ is uniform, $m = 2^{d \log d}$ if $D$ is Gaussian
       ◮ computation complexity: (# samples) $\cdot O(nd)$
     ◮ with noisy samples $(x, f^*(x) + g)$, $g \sim \mathcal{N}(0, \sigma^2)$: the same bounds hold up to a $\mathrm{poly}(1 + \sigma)$ factor
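
The correlation oracles are just expectations under $D$, so an emulated oracle is easy to sketch. Below is a minimal numpy version (the helper name `correlation` and the Monte Carlo emulation are our assumptions; the paper's oracle returns the expectation exactly):

    import numpy as np

    rng = np.random.default_rng(0)

    def correlation(h1, h2, sampler, m=100_000):
        # Monte Carlo estimate of <h1, h2>_D = E_D[h1(x) h2(x)].
        x = sampler(m)
        return np.mean(h1(x) * h2(x))

    # Example: D = Uniform([-1, 1]^3), f*(x) = x_1 x_2, h(x) = x_1.
    sampler = lambda m: rng.uniform(-1.0, 1.0, size=(m, 3))
    f_star = lambda x: x[:, 0] * x[:, 1]
    print(correlation(f_star, lambda x: x[:, 0], sampler))  # ~ E[x_1^2 x_2] = 0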

  5. Methodology: overview of Growing-Basis
     Key idea: greedily build the polynomial in an orthonormal basis, one basis function at a time. First identify the presence of a variable $x_i$ using correlation, and then find its degree within the basis function.
     This strategy works for the following reasons:
     ◮ We can work in an orthonormal basis at the cost of a factor-$2^d$ increase in the sparsity of the representation.
     ◮ We can identify the degree of a variable in a particular basis function by iteratively examining the correlations of several basis functions with $(f^*)^2$. This search procedure takes time $O(nd)$.

  6. Methodology: orthogonal polynomial bases over distributions
     Definition. Consider the inner product space $\langle \cdot, \cdot \rangle_D$ for a distribution $D$, where $D = \mu^{\otimes n}$ is a product measure over $\mathbb{R}^n$. For any coordinate, we can find an orthogonal basis of polynomials depending on the distribution $D$ by Gram-Schmidt. Let $H_t(x_i)$ be the degree-$t$ basis function for variable $x_i$. Then for $T = (t_1, \dots, t_n)$ such that $\sum_i t_i = d$,
     $$H_T(x) = \prod_i H_{t_i}(x_i)$$
     defines the orthogonal basis function parametrized by $T$ in the product basis. Thus we can write
     $$f^*(x) := \sum_T \alpha_T H_T(x)$$
     for any polynomial $f^*$. There are at most $k 2^d$ terms in the sum.
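
As an illustration, here is a small numpy sketch of the one-coordinate Gram-Schmidt construction (the helper name `orthonormal_basis_1d` is ours, and sampled moments stand in for the exact ones a real implementation would use where available):

    import numpy as np

    def orthonormal_basis_1d(mu_sampler, degree, m=200_000, seed=0):
        # Gram-Schmidt on 1, x, x^2, ... under <p, q> = E_mu[p(x) q(x)],
        # with the expectation estimated from m samples of the coordinate
        # measure mu.
        xs = mu_sampler(m, np.random.default_rng(seed))
        ip = lambda p, q: np.mean(p(xs) * q(xs))
        basis = []
        for t in range(degree + 1):
            p = np.polynomial.Polynomial.basis(t)    # start from x^t
            for q in basis:                          # subtract projections
                p = p - ip(p, q) * q
            basis.append(p / np.sqrt(ip(p, p)))      # normalize: <H_t, H_t> = 1
        return basis

    # For mu = Uniform[-1, 1] this recovers the normalized Legendre family;
    # the product basis is then H_T(x) = prod_i H_{t_i}(x_i).
    H = orthonormal_basis_1d(lambda m, rng: rng.uniform(-1, 1, m), 3)
    print(H[2])  # ~ sqrt(5) * (3x^2 - 1) / 2, the normalized Legendre P_2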

  7. Methodology: algorithm
     Algorithm 1 Growing-Basis
     procedure Growing-Basis(degree $d$, oracles $\langle \cdot, f^* \rangle$ and $\langle \cdot, (f^*)^2 \rangle$)
         $\hat{f} := 0$
         while $\langle 1, (f^* - \hat{f})^2 \rangle > 0$ do
             $H := 1$, $B := 1$
             for $r = 1, \dots, n$ do
                 for $t = d, \dots, 0$ do
                     if $\langle H \cdot H_{2t}(x_r), (f^* - \hat{f})^2 \rangle > 0$ then
                         $H := H \cdot H_{2t}(x_r)$, $B := B \cdot H_t(x_r)$
                         break out of the inner loop (move on to the next variable)
                     end if
                 end for
             end for
             $\hat{f} := \hat{f} + \langle B, f^* \rangle \cdot B$
         end while
         return $\hat{f}$
     end procedure
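
Below is a runnable sketch of Growing-Basis for $D = \mathrm{Uniform}([-1,1]^n)$, where the orthonormal basis is the normalized Legendre family. The oracles are emulated by Monte Carlo on a fixed sample, so the exact test "> 0" becomes "> tau" for a small threshold; the helper names (`H`, `H_T`, `growing_basis`), the threshold, and the `max_terms` guard are our assumptions, not the paper's:

    import numpy as np
    from numpy.polynomial import legendre as L

    rng = np.random.default_rng(0)

    def H(t, x):
        # Normalized Legendre polynomial: E_{Uniform[-1,1]}[H_t(x)^2] = 1.
        return np.sqrt(2 * t + 1) * L.legval(x, np.eye(t + 1)[t])

    def H_T(T, X):
        # Product basis function H_T(x) = prod_i H_{T_i}(x_i).
        out = np.ones(X.shape[0])
        for i, t in enumerate(T):
            if t > 0:
                out *= H(t, X[:, i])
        return out

    def growing_basis(f_star, n, d, m=500_000, tau=2e-2, max_terms=50):
        X = rng.uniform(-1, 1, size=(m, n))    # one sample reused by every oracle call
        y = f_star(X)
        coeffs = {}                            # learned representation {T: alpha_T}
        f_hat = lambda Z: sum(a * H_T(T, Z) for T, a in coeffs.items())
        for _ in range(max_terms):             # guard against Monte Carlo noise
            r = y - f_hat(X)                   # residual f* - f_hat on the sample
            if np.mean(r ** 2) <= tau:         # <1, (f* - f_hat)^2> ~ 0: done
                break
            T, h = [], np.ones(m)              # h is the running product H
            for i in range(n):                 # find the degree of x_i in the new term
                for t in range(d, -1, -1):
                    if np.mean(h * H(2 * t, X[:, i]) * r ** 2) > tau:
                        h = h * H(2 * t, X[:, i])
                        T.append(t)
                        break                  # inner loop only: next variable
            b = H_T(tuple(T), X)
            coeffs[tuple(T)] = np.mean(b * y)  # alpha_T = <B, f*>
        return coeffs

    # f*(x) = x_1 x_2 = H_1(x_1) H_1(x_2) / 3, since H_1(x) = sqrt(3) x.
    print(growing_basis(lambda X: X[:, 0] * X[:, 1], n=4, d=2))
    # expect approximately {(1, 1, 0, 0): 1/3}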

  8. Methodology: sparsity in an orthogonal basis
     We give a lemma which allows us to work in an orthogonal basis without blowing up the sparsity too much.
     Lemma. Suppose $f^*$ is $k$-sparse in product basis $\mathcal{H}_1$. Then it is $k 2^d$-sparse in product basis $\mathcal{H}_2$.
     Proof. Write each term $H^{(1)}_{t_i}(x_i)$ of $f^*$ in basis $\mathcal{H}_1$ in terms of basis $\mathcal{H}_2$: each will have at most $t_i + 1$ terms. Since each monomial term in $\mathcal{H}_1$ is a product of such $H_{t_i}(x_i)$, there will be $\prod_i (t_i + 1) \le 2^{\sum_i t_i} \le 2^d$ terms for each monomial. Since there are $k$ monomials, there are at most $k 2^d$ terms when expressed in $\mathcal{H}_2$.
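
A quick numeric check of the lemma when $\mathcal{H}_1$ is the monomial basis and $\mathcal{H}_2$ the (unnormalized) Legendre basis, using numpy's built-in basis conversion:

    from numpy.polynomial import legendre as L

    # x^3 expands into at most 3 + 1 = 4 Legendre terms (only 2 survive,
    # by parity): x^3 = (3 P_1 + 2 P_3) / 5.
    print(L.poly2leg([0, 0, 0, 1]))  # [0.  0.6 0.  0.4]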

  9. Methodology: detecting degrees (1)
     We now give a lemma which establishes the correctness of the search procedure used in Growing-Basis.
     Lemma. Let $d_1$ denote the maximum degree of variable $x_1$ in $f^*$. Then $\langle H_{2t}(x_1), (f^*)^2(x) \rangle > 0$ iff $t \le d_1$.
     Proof. We have
     $$(f^*)^2(x) = \sum_T \alpha_T^2 \prod_{i=1}^n H_{t_i}(x_i)^2 + \sum_{T \ne U} \alpha_T \alpha_U \prod_{i=1}^n H_{t_i}(x_i) H_{u_i}(x_i)$$
     Note that if $t > t_1$, then $H_{t_1}(x_1)^2$ is only supported on the basis functions $H_0, \dots, H_{2t_1}$. This set does not include $H_{2t}$ since $2t > 2t_1$, so $\langle H_{2t}(x_1), H_{t_1}(x_1)^2 \rangle = 0$. Likewise for the second term if $t > u_1$. Thus, if $t > d_1$, the correlation is zero. If $t = d_1$, the correlation is nonzero for the first term, but zero for the second term.
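
A numeric sanity check of the lemma, assuming the uniform product measure and the normalized Legendre basis (the test polynomial $f^* = H_2(x_1) H_1(x_2)$, so that $d_1 = 2$, is our choice for illustration):

    import numpy as np
    from numpy.polynomial import legendre as L

    rng = np.random.default_rng(0)
    H = lambda t, x: np.sqrt(2 * t + 1) * L.legval(x, np.eye(t + 1)[t])

    X = rng.uniform(-1, 1, size=(1_000_000, 2))
    fsq = (H(2, X[:, 0]) * H(1, X[:, 1])) ** 2       # (f*)^2 on the sample
    for t in range(4, -1, -1):
        print(t, np.mean(H(2 * t, X[:, 0]) * fsq))
    # t = 4, 3: ~0 (up to Monte Carlo noise); t = 2, 1, 0: clearly positive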

  10. Methodology: detecting degrees (2)
      Let's get some intuition. Recall
      $$(f^*)^2(x) = \sum_T \alpha_T^2 \prod_{i=1}^n H_{t_i}(x_i)^2 + \sum_{T \ne U} \alpha_T \alpha_U \prod_{i=1}^n H_{t_i}(x_i) H_{u_i}(x_i)$$
      Let's look at
      $$\left\langle H_{2t}(x_1), \prod_{i=1}^n H_{t_i}(x_i)^2 \right\rangle = \left\langle H_{2t}(x_1), \prod_{i=1}^n \left( 1 + \sum_{j=1}^{2 t_i} c_{t_i, j} H_j(x_i) \right) \right\rangle$$
      Since $t_1 = t$ (for $T$ such that $t_1 = d_1$), the coefficient of the term $H_{2t}(x_1) \prod_{i=2}^n H_0(x_i)$ is the only thing that remains, since everything else gets zeroed out. Then we just sum over $T$ such that $t_1 = d_1$. The second term does not contribute, since for each factor either $i \ne 1$ or $t_i + u_i < 2t$ because $u_i \ne t_i$:
      $$\left\langle H_{2t}(x_1), \prod_{i=1}^n H_{t_i}(x_i) H_{u_i}(x_i) \right\rangle = 0$$

  11. Methodology: detecting degrees (3)
      Thus, if we proceed downward from the largest possible degree, we can detect the degree of $x_1$ in one of the basis functions in the representation of $f^*$. With some more analysis of a similar flavor, we extend this to finding a complete product-basis representation.
      ◮ Key idea: lexicographic order
        ◮ example: $1544300 \succ 1544000$ since $3 > 0$
        ◮ we use it to compare degree lists $T$ and $U$, which correspond to basis functions $H_T$, $H_U$
      ◮ We can essentially proceed inductively.
      ◮ Recap: suppose $f^*$ contains basis functions $H_{t_1}(x_1), \dots, H_{t_r}(x_r)$. Then check $\langle H_{2t_1, \dots, 2t_r, t, 0, \dots, 0}(x), f^*(x)^2 \rangle > 0$ for $t = d \to 0$, and assign $t_{r+1} := t^*$ such that $t^*$ is the first value making the correlation positive.

  12. Methodology: sampling version
      In the sampling situation, we only get data points $\{(z_i, f^*(z_i))\}_{i=1}^m$ and no oracle. We run the same algorithm, replacing the oracles with an emulated version.
      ◮ We have to emulate the correlation oracle: $\hat{C}(f) = \frac{1}{m} \sum_{i=1}^m f(z_i) f^*(z_i)^2$.
      ◮ The Chebyshev inequality suffices to show that $m = O\!\left( \frac{\max_f \mathbb{E}[f^2 (f^*)^4]}{\epsilon^2} \right)$ samples give a constant-probability bound.
      ◮ We can repeat $\log(1/\delta)$ times and take the median to boost the probability of success to $1 - \delta$ (sketched below).
      ◮ For the noisy case, compute correlations up to 4th moments instead and apply standard concentration inequalities (subgaussian noise is very standard).
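
A minimal sketch of that median trick, i.e. a median-of-means estimate of the correlation (the helper name `median_of_means` and the heavy-tailed test distribution are illustrative assumptions):

    import numpy as np

    def median_of_means(vals, k):
        # Split the m correlation samples f(z_i) f*(z_i)^2 into k groups,
        # average each group, and take the median: each group mean is
        # accurate with constant probability by Chebyshev, and the median
        # is accurate with probability 1 - delta once k = O(log(1/delta)).
        groups = np.array_split(vals, k)
        return np.median([g.mean() for g in groups])

    rng = np.random.default_rng(1)
    vals = 1.0 + rng.standard_t(df=2.5, size=10_000)  # true correlation 1, heavy tails
    print(median_of_means(vals, k=25), vals.mean())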

  13. Methodology: getting $2^d$ sample complexity
      To actually get a bound on the sample complexity, we bound $\max_f \mathbb{E}[f^2 (f^*)^4]$, assuming the uniform distribution on $[-1, 1]^n$.
      ◮ the Legendre orthogonal polynomials are the orthogonal family for this distribution
      ◮ Fact: $|H_{d_i}(x_i)| \le \sqrt{2 d_i + 1}$.
      ◮ Thus: $|H_S(x)| = \prod_i |H_{S_i}(x_i)| \le \prod_i \sqrt{2 S_i + 1} \le \prod_i 2^{S_i} \le 2^d$.
      ◮ Thus: $|f^*(x)| = |\sum_S \alpha_S H_S(x)| \le 2^d \sum_S |\alpha_S|$.
      ◮ By Parseval (the Pythagorean theorem for inner product spaces), $\sum_S \alpha_S^2 = 1$. Since $f^*$ is $k$-sparse, $\sum_S |\alpha_S| \le \sqrt{k}$.
      ◮ Thus $|f^*(x)| \le 2^d \sqrt{k}$.
      ◮ Thus $f(x)^2 f^*(x)^4 \le 2^{6d} k^2$ if $f^*$ is of degree $d$ and $f$ is represented in a degree-$2d$ basis.
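
The Legendre fact is easy to verify numerically (a small numpy check; here $H_d = \sqrt{2d + 1}\, P_d$ is the normalized Legendre polynomial):

    import numpy as np
    from numpy.polynomial import legendre as L

    # |P_d(x)| <= 1 on [-1, 1], with equality at the endpoints, so the
    # normalized H_d = sqrt(2d + 1) * P_d satisfies |H_d(x)| <= sqrt(2d + 1).
    x = np.linspace(-1, 1, 100_001)
    for d in range(1, 8):
        Hd = np.sqrt(2 * d + 1) * L.legval(x, np.eye(d + 1)[d])
        print(d, np.abs(Hd).max(), np.sqrt(2 * d + 1))  # the two columns agree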

  14. Key Takeaways: proof methodology
      The key methodology in the proof has the following properties:
      ◮ it relies heavily on orthogonality properties of polynomials
      ◮ it is "term-by-term": we examine and find each basis function one at a time
      ◮ it achieves the $2^d$ dependence because
        ◮ transforming to an orthogonal basis only causes a $2^d$ blow-up in sparsity
        ◮ of a fact about Legendre polynomials (for the uniform distribution)
      ◮ weakness: it relies heavily on the product-distribution assumption in order to construct orthogonal polynomial bases over $n$ variables

  15. Thank you for your attention!
