Studying Model Asymptotics with Singular Learning Theory Shaowei - PowerPoint PPT Presentation

Studying Model Asymptotics with Singular Learning Theory Shaowei Lin (UC Berkeley) shaowei@ math.berkeley.edu Joint work with Russell Steele (McGill) 13 July 2012 MMDS 2012, Stanford University Workshop on Algorithms for Modern Massive Data Sets 1 / 27

Sparsity Penalties • Regression • BIC Integral Asymptotics Singular Learning RLCTs Sparsity Penalties 2 / 27

Linear Regression Sparsity Penalties ω, X ∈ R d , Y = ω · X + ε , Y ∈ R , ε ∈ N (0 , 1) Model • Regression ( Y 1 , X 1 ) , . . . , ( Y N , X N ) Data • BIC Integral Asymptotics � N i =1 | Y i − ω · X i | 2 min ω Least squares Singular Learning i =1 | Y i − ω · X i | 2 + π ( ω ) � N min ω Penalized regression RLCTs LASSO Bayesian Info Criterion (BIC) π ( ω ) = | ω | 1 · β π ( ω ) = | ω | 0 · log N Parameter space is partitioned into regions (submodels). 3 / 27

Bayesian Information Criterion Sparsity Penalties Given region Ω of parameters and a prior ϕ ( ω ) dω on Ω , • • Regression the marginal likelihood of the data is proportional to • BIC � Integral Asymptotics e − Nf ( ω ) ϕ ( ω ) dω Z N = Singular Learning Ω RLCTs � N 1 i =1 | Y i − ω · X i | 2 . where f ( ω ) = 2 N Laplace approximation : Asymptotically as sample size N → ∞ , • − log Z N ≈ Nf ( ω ∗ ) + d 2 log N + O (1) where ω ∗ = argmin ω ∈ Ω f ( ω ) and d = dim Ω . • Studying model asymptotics allows us to derive the BIC. But Laplace approx only works when the model is regular. Many models in machine learning are singular , e.g. mixtures, neural networks, hidden variables. 4 / 27

Sparsity Penalties Integral Asymptotics • Estimation • RLCT • Geometry • Desingularization • Algorithm Singular Learning RLCTs Integral Asymptotics 5 / 27

Estimating Integrals Sparsity Penalties Generally, there are three ways to estimate statistical integrals. Integral Asymptotics 1. Exact methods • Estimation • RLCT Compute a closed form formula for the integral, • Geometry e.g. (Lin · Sturmfels · Xu, 2009). • Desingularization • Algorithm 2. Numerical methods Singular Learning RLCTs Approximate using Markov Chain Monte Carlo (MCMC) and other sampling techniques. 3. Asymptotic methods Analyze how the integral behaves for large samples. 6 / 27

Real Log Canonical Threshold Asymptotic theory (Arnol’d · Guse˘ ın-Zade · Varchenko, 1985) Sparsity Penalties Integral Asymptotics states that for a Laplace integral, • Estimation • RLCT e − Nf ( ω ) ϕ ( ω ) dω ≈ e − Nf ∗ · CN − λ (log N ) θ − 1 � • Geometry Z ( N ) = • Desingularization Ω • Algorithm Singular Learning asymptotically as N → ∞ for some positive constants C, λ, θ RLCTs and where f ∗ = min ω ∈ Ω f ( ω ) . The pair ( λ, θ ) is the real log canonical threshold of f ( ω ) with respect to the measure ϕ ( ω ) dω . 7 / 27

Geometry of the Integral e − Nf ( ω ) ϕ ( ω ) dω ≈ e − Nf ∗ · CN − λ (log N ) θ − 1 Sparsity Penalties � Z ( N ) = Integral Asymptotics Ω • Estimation • RLCT Integral asymptotics depend on minimum locus of exponent f ( ω ) . • Geometry • Desingularization • Algorithm f ( x, y ) = x 2 + y 2 Singular Learning RLCTs f ( x, y ) = ( xy ) 2 f ( x, y ) = ( y 2 − x 3 ) 2 Plots of integrand e − Nf ( x,y ) for N = 1 and N = 10 8 / 27

Desingularizations Let Ω ⊂ R d and f : Ω → R real analytic function. Sparsity Penalties Integral Asymptotics We say ρ : U → Ω desingularizes f if • • Estimation • RLCT • Geometry U is a d -dimensional real analytic manifold covered 1. • Desingularization by coordinate patches U 1 , . . . , U s ( ≃ subsets of R d ). • Algorithm Singular Learning ρ is a proper real analytic map that is an isomorphism 2. RLCTs onto the subset { ω ∈ Ω : f ( ω ) � = 0 } . For each restriction ρ : U i → Ω , 3. f ◦ ρ ( µ ) = a ( µ ) µ κ , det ∂ρ ( µ ) = b ( µ ) µ τ where a ( µ ) and b ( µ ) are nonzero on U i . • Hironaka (1964) proved that desingularizations always exist. 9 / 27

Algorithm for Computing RLCTs • Sparsity Penalties We know how to find RLCTs of monomial functions (AGV, 1985). Integral Asymptotics � κd e − Nω κ 1 1 · · · ω τ d 1 ··· ω d ω τ 1 d dω ≈ CN − λ (log N ) θ − 1 • Estimation • RLCT Ω • Geometry where λ = min i τ i +1 κ i , θ = |{ i : τ i +1 = λ }| . • Desingularization κ i • Algorithm To compute the RLCT of any function f ( ω ) : • Singular Learning RLCTs Find minimum f ∗ of f over Ω . 1. Find a desingularization ρ for f − f ∗ . 2. Use AGV Theorem to find ( λ i , θ i ) on each patch U i . 3. λ = min { λ i } , θ = max { θ i : λ i = λ } . 4. • The difficult part is finding a desingularization, e.g (Bravo · Encinas · Villamayor, 2005). 10 / 27

Sparsity Penalties Integral Asymptotics Singular Learning • Sumio Watanabe • Bayesian Statistics • Standard Form • Learning Coefficient • Geometry • AIC and DIC Singular Learning Theory RLCTs 11 / 27

Sumio Watanabe Sparsity Penalties Integral Asymptotics Singular Learning • Sumio Watanabe • Bayesian Statistics • Standard Form • Learning Coefficient • Geometry • AIC and DIC RLCTs Sumio Watanabe Heisuke Hironaka In 1998, Sumio Watanabe discovered how to study the asymptotic behavior of singular models. His insight was to use a deep result in algebraic geometry known as Hironaka’s Resolution of Singularities . Heisuke Hironaka proved this celebrated result in 1964. His accomplishment won him the Field’s Medal in 1970. 12 / 27

Bayesian Statistics Sparsity Penalties random variable with state space X (e.g. { 1 , 2 , . . . , k } , R k ) X Integral Asymptotics ∆ space of probability distributions on X Singular Learning • Sumio Watanabe M ⊂ ∆ statistical model, image of p : Ω → ∆ • Bayesian Statistics • Standard Form Ω parameter space • Learning Coefficient p ( x | ω ) dx distribution at ω ∈ Ω • Geometry • AIC and DIC ϕ ( ω ) dω prior distribution on Ω RLCTs Suppose samples X 1 , . . . , X N drawn from true distribution q ∈ M . N � � p ( X i | ω ) ϕ ( ω ) dω. Z N = Marginal likelihood Ω i =1 � q ( x ) log q ( x ) K ( ω ) = p ( x | ω ) dx. Kullback-Leibler function X 13 / 27

Standard Form of Log Likelihood Ratio Sparsity Penalties Define log likelihood ratio . Note that its expectation is K ( ω ) . Integral Asymptotics K N ( ω ) = 1 i =1 log q ( X i ) � N p ( X i | ω ) . Singular Learning N • Sumio Watanabe • Bayesian Statistics • Standard Form Standard Form of Log Likelihood Ratio (Watanabe) • Learning Coefficient • Geometry If ρ : U → Ω desingularizes K ( ω ) , then on each patch U i , • AIC and DIC 1 K N ◦ ρ ( µ ) = µ 2 κ − RLCTs µ κ ξ N ( µ ) √ N where ξ N ( µ ) converges in law to a Gaussian process on U . For regular models, this is a Central Limit Theorem . 14 / 27

Learning Coefficient � N Sparsity Penalties Define empirical entropy S N = − 1 i =1 log q ( X i ) . N Integral Asymptotics Singular Learning Convergence of stochastic complexity (Watanabe) • Sumio Watanabe • Bayesian Statistics The stochastic complexity has the asymptotic expansion • Standard Form • Learning Coefficient − log Z N = NS N + λ q log N − ( θ q − 1) log log N + O p (1) • Geometry • AIC and DIC where λ q , θ q describe the asymptotics of the deterministic integral RLCTs � e − NK ( ω ) ϕ ( ω ) dω ≈ CN − λ q (log N ) θ q − 1 . Z ( N ) = Ω For regular models, this is the Bayesian Information Criterion. Various names for ( λ q , θ q ) : statistics - learning coefficient of the model M at q algebraic geometry - real log canonical threshold of K ( ω ) 15 / 27

Geometry of Singular Models Sparsity Penalties Integral Asymptotics Singular Learning • Sumio Watanabe • Bayesian Statistics • Standard Form • Learning Coefficient • Geometry • AIC and DIC RLCTs 16 / 27

AIC and DIC Bayes generalization error B N . The Kullback-Leibler distance Sparsity Penalties Integral Asymptotics from the true distribution q ( x ) to the predictive distribution p ( x | D ) . Singular Learning • Sumio Watanabe Asymptotically , B N is equivalent to • Bayesian Statistics • Standard Form • Akaike Information Criterion for regular models • Learning Coefficient AIC = − � N • Geometry i =1 log p ( X i | ω ∗ ) + d • AIC and DIC • Akaike Information Criterion for singular models RLCTs AIC = − � N i =1 log p ( X i | ω ∗ ) + 2( singular fluctuation ) Numerically , B N can be estimated using MCMC methods. • Deviance Information Criterion for regular models DIC = E X [log p ( X | E ω [ ω ])] − 2 E ω [ E X [log p ( X | ω )]] • Widely Applicable Information Criterion for singular models WAIC = E X [log E ω [ p ( X | ω )]] − 2 E ω [ E X [log p ( X | ω )]] 17 / 27

Sparsity Penalties Integral Asymptotics Singular Learning RLCTs • Sparsity Penalty • Newton Polyhedra • Upper Bounds Real Log Canonical Thresholds 18 / 27

Studying Model Asymptotics with Singular Learning Theory Shaowei - PowerPoint PPT Presentation

Studying Model Asymptotics with Singular Learning Theory Shaowei Lin (UC Berkeley) shaowei@ math.berkeley.edu Joint work with Russell Steele (McGill) 13 July 2012 MMDS 2012, Stanford University Workshop on Algorithms for Modern Massive Data

Asymptotics of symmetric functions with applications to Setup Asymptotics of statistical

Descriptive and combinatorial set theory Introduction Singular cardinals, at singular cardinals

Using Geometric Singular Perturbation Theory to Understand Singular Shocks Barbara Lee Keyfitz

Asymptotics Will Perkins January 22, 2013 Asymptotics In many theorems and questions in

SYMBOLIC LOGIC UNIT 10: SINGULAR SENTENCES Singular Sentences (monadic) Paris is beautiful

[11] The Singular Value Decomposition The Singular Value Decomposition Gene Golubs license

/ Link Invariants from Braided Monoidal On the PROB of Singular Braids Categories Singular

Singular Value Decomposition Presented by Matthew Motoki 1 What is a singular value

Statistical mechanics via Answers: GUE asymptotics of symmetric functions Probability via Schur

Foundations of Computer Science Lecture 9 Sums And Asymptotics Computing Sums Asymptotics:

On second order perturbation theory for embedded theory eigenvalues Nelson model Singular

Singular Research Conference Singular Research Conference Jeffrey Glajch, Vice President &

On the product of a singular Wishart matrix and a singular Gaussian vector in high dimension.

Multiplicative relations among singular moduli Jonathan Pila University of Oxford ERC meeting in

Eigenvalue Problems and Singular Value Decomposition Sanzheng Qiao Department of Computing and

1 Singular Value Decomposition The singular vector decomposition allows us to write any matrix A

Lecture 2. Upper and lower bounds for subgaussian matrices The -net method refined 1 Random

On Distinguishability Measures for Quantum States Christopher Granade August 9, 2010 Quantum

distribution function in the CRIS equation of state model A u t h o r s : Benjamin J. Cowen

Waiting for rare entropic fluctuations in stochastic thermodynamics Keiji Saito (Keio University)

Semiparametric Estimation Theory for Discretely Observed L evy Processes Chris A.J. Klaassen

Model-Robust Inference for Clinical Trials that Improve Precision by Stratified Randomization and

Likelihood Based Uncertainty Bounding in Prediction Error Identification using ARX models A

Analysis of low-energy scattering and weakly-bound states through effective-range functions.

Studying Model Asymptotics with Singular Learning Theory Shaowei - PowerPoint PPT Presentation

Studying Model Asymptotics with Singular Learning Theory Shaowei Lin (UC Berkeley) shaowei@ math.berkeley.edu Joint work with Russell Steele (McGill) 13 July 2012 MMDS 2012, Stanford University Workshop on Algorithms for Modern Massive Data

Asymptotics of symmetric functions with applications to Setup Asymptotics of statistical

Descriptive and combinatorial set theory Introduction Singular cardinals, at singular cardinals

Using Geometric Singular Perturbation Theory to Understand Singular Shocks Barbara Lee Keyfitz

Asymptotics Will Perkins January 22, 2013 Asymptotics In many theorems and questions in

SYMBOLIC LOGIC UNIT 10: SINGULAR SENTENCES Singular Sentences (monadic) Paris is beautiful

[11] The Singular Value Decomposition The Singular Value Decomposition Gene Golubs license

/ Link Invariants from Braided Monoidal On the PROB of Singular Braids Categories Singular

Singular Value Decomposition Presented by Matthew Motoki 1 What is a singular value

Statistical mechanics via Answers: GUE asymptotics of symmetric functions Probability via Schur

Foundations of Computer Science Lecture 9 Sums And Asymptotics Computing Sums Asymptotics:

On second order perturbation theory for embedded theory eigenvalues Nelson model Singular

Singular Research Conference Singular Research Conference Jeffrey Glajch, Vice President &amp;

On the product of a singular Wishart matrix and a singular Gaussian vector in high dimension.

Multiplicative relations among singular moduli Jonathan Pila University of Oxford ERC meeting in

Eigenvalue Problems and Singular Value Decomposition Sanzheng Qiao Department of Computing and

1 Singular Value Decomposition The singular vector decomposition allows us to write any matrix A

Lecture 2. Upper and lower bounds for subgaussian matrices The -net method refined 1 Random

On Distinguishability Measures for Quantum States Christopher Granade August 9, 2010 Quantum

distribution function in the CRIS equation of state model A u t h o r s : Benjamin J. Cowen

Waiting for rare entropic fluctuations in stochastic thermodynamics Keiji Saito (Keio University)

Semiparametric Estimation Theory for Discretely Observed L evy Processes Chris A.J. Klaassen

Model-Robust Inference for Clinical Trials that Improve Precision by Stratified Randomization and

Likelihood Based Uncertainty Bounding in Prediction Error Identification using ARX models A

Analysis of low-energy scattering and weakly-bound states through effective-range functions.

Singular Research Conference Singular Research Conference Jeffrey Glajch, Vice President &