Estimating Unnormalised Models by Score Matching
Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018
Program
- 1. Basics of score matching
- 2. Practical objective function for score matching
Michael Gutmann Score Matching 2 / 18
Program
- 1. Basics of score matching
  - Basic ideas of score matching
  - Objective function that captures the basic ideas but cannot be computed
- 2. Practical objective function for score matching
Problem formulation
◮ We want to estimate the parameters θ of a parametric statistical model for a random vector x ∈ Rd.
◮ Given: iid data x1, . . . , xn that are assumed to be observations of x, which has pdf p∗.
◮ Further notation: p(ξ; θ) is the model pdf; ξ ∈ Rd is a dummy variable.
◮ Assumptions:
  ◮ The model p(ξ; θ) is known only up to the partition function Z(θ):
    p(ξ; θ) = p̃(ξ; θ) / Z(θ),   Z(θ) = ∫ p̃(ξ; θ) dξ
  ◮ The functional form of p̃ is known (it can be evaluated easily).
  ◮ The partition function Z(θ) cannot be computed analytically in closed form, and numerical approximation is expensive.
◮ Goal: estimate the model without approximating the partition function Z(θ).
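To make the setting concrete, here is a small numerical sketch. The quartic model below is an illustration of my own, not from the slides: evaluating the unnormalised pdf p̃ at a point is cheap, while the partition function Z(θ) must be obtained by numerical quadrature (whose cost grows exponentially with the dimension on a grid).

```python
import numpy as np

# A hypothetical unnormalised model (illustration only, not from the slides):
# p̃(ξ; θ) = exp(θ1 ξ² + θ2 ξ⁴) with θ2 < 0; its partition function has
# no closed form.
theta = (1.0, -0.5)

def p_tilde(xi, theta):
    t1, t2 = theta
    return np.exp(t1 * xi**2 + t2 * xi**4)

# Evaluating p̃ at any point is cheap ...
value = p_tilde(0.7, theta)

# ... but Z(θ) = ∫ p̃(ξ; θ) dξ requires numerical quadrature (here a simple
# Riemann sum in 1-D); on a grid in d dimensions the cost grows exponentially.
grid = np.linspace(-5.0, 5.0, 20001)
Z = np.sum(p_tilde(grid, theta)) * (grid[1] - grid[0])
```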
Basic ideas of score matching
◮ Maximum likelihood estimation can be viewed as finding parameter values θ̂ such that p(ξ; θ̂) ≈ p∗(ξ), or equivalently
  log p(ξ; θ̂) ≈ log p∗(ξ)
  (with closeness measured by the Kullback-Leibler divergence, see Barber 8.7)
◮ Instead of estimating the parameters θ by matching the (log) densities, score matching identifies parameter values θ̂ for which the derivatives (slopes) of the log densities match:
  ∇ξ log p(ξ; θ̂) ≈ ∇ξ log p∗(ξ)
◮ ∇ξ log p(ξ; θ) does not depend on the partition function:
  ∇ξ log p(ξ; θ) = ∇ξ [log p̃(ξ; θ) − log Z(θ)] = ∇ξ log p̃(ξ; θ)
The score function (in the context of score matching)
◮ Define the model score function ψ(·; θ): Rd → Rd as
  ψ(ξ; θ) = ( ∂ log p(ξ; θ)/∂ξ1, . . . , ∂ log p(ξ; θ)/∂ξd )⊤ = ∇ξ log p(ξ; θ)
  While defined in terms of p(ξ; θ), we also have ψ(ξ; θ) = ∇ξ log p̃(ξ; θ).
◮ Similarly, define the data score function as
  ψ∗(ξ) = ∇ξ log p∗(ξ)
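A quick numerical sketch of the key identity ψ(ξ; θ) = ∇ξ log p̃(ξ; θ): since log Z(θ) does not depend on ξ, differentiating log p̃ and log p gives the same score. The 1-D Gaussian model below matches the example at the end of the slides; the finite-difference helper is illustrative.

```python
import math

# Toy 1-D check: the score of the unnormalised model p̃(ξ; θ) = exp(−θξ²/2)
# equals the score of the normalised model p, because log Z(θ) is constant in ξ.
theta = 2.0  # precision parameter

def log_p_tilde(xi):
    return -theta * xi**2 / 2.0

log_Z = 0.5 * math.log(2.0 * math.pi / theta)  # here Z(θ) = sqrt(2π/θ) is known

def log_p(xi):
    return log_p_tilde(xi) - log_Z

def num_score(f, xi, h=1e-5):
    """Central finite-difference approximation of d/dξ f(ξ)."""
    return (f(xi + h) - f(xi - h)) / (2.0 * h)

xi = 1.3
s_unnorm = num_score(log_p_tilde, xi)
s_norm = num_score(log_p, xi)
# Both approximate the analytic score ψ(ξ; θ) = −θξ.
```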
Definition of the SM objective function
◮ Estimate θ by minimising a distance between the model score function ψ(ξ; θ) and the score function of the observed data, ψ∗(ξ):
  Jsm(θ) = (1/2) ∫_{ξ∈Rd} p∗(ξ) ‖ψ(ξ; θ) − ψ∗(ξ)‖² dξ = (1/2) E∗‖ψ(x; θ) − ψ∗(x)‖²   (x ∼ p∗)
◮ Since ψ(ξ; θ) = ∇ξ log p̃(ξ; θ) does not depend on Z(θ), there is no need to compute the partition function.
◮ Knowing the unnormalised model p̃(ξ; θ) is enough.
◮ The expectation E∗ with respect to p∗ can be approximated by a sample average over the observed data, but what about ψ∗?
Program
- 1. Basics of score matching
- 2. Practical objective function for score matching
  - Integration by parts to obtain a computable objective function
  - Simple example
Reformulation of the SM objective function
◮ The objective function contains the score function of the data distribution, ψ∗. How can it be computed?
◮ In fact, there is no need to compute it, because the score matching objective function Jsm can be expressed as
  Jsm(θ) = E∗[ Σ_{j=1}^d ( ∂jψj(x; θ) + (1/2) ψj(x; θ)² ) ] + const,
  where the constant does not depend on θ, and
  ψj(ξ; θ) = ∂ log p̃(ξ; θ)/∂ξj,   ∂jψj(ξ; θ) = ∂² log p̃(ξ; θ)/∂ξj²
Proof (general idea)
◮ Use the Euclidean distance and expand the objective function Jsm:
  Jsm(θ) = (1/2) E∗‖ψ(x; θ) − ψ∗(x)‖²
         = (1/2) E∗‖ψ(x; θ)‖² − E∗[ψ(x; θ)⊤ψ∗(x)] + (1/2) E∗‖ψ∗(x)‖²
         = (1/2) E∗‖ψ(x; θ)‖² − Σ_{j=1}^d E∗[ψj(x; θ) ψ∗,j(x)] + const
◮ The first term does not depend on ψ∗. The ψj and ψ∗,j are the j-th elements of the vectors ψ and ψ∗, respectively. The constant does not depend on θ.
◮ The trick is to use integration by parts on the second term to obtain an objective function that does not involve ψ∗.
Proof (not examinable)
E∗[ψj(x; θ) ψ∗,j(x)] = ∫ p∗(ξ) ψ∗,j(ξ) ψj(ξ; θ) dξ
  = ∫ p∗(ξ) (∂ log p∗(ξ)/∂ξj) ψj(ξ; θ) dξ
  = ∫_{ξk, k≠j} [ ∫_{ξj} p∗(ξ) (∂ log p∗(ξ)/∂ξj) ψj(ξ; θ) dξj ] dξk
  = ∫_{ξk, k≠j} [ ∫_{ξj} (∂p∗(ξ)/∂ξj) ψj(ξ; θ) dξj ] dξk
Use integration by parts:
  ∫_{ξj} (∂p∗(ξ)/∂ξj) ψj(ξ; θ) dξj = [p∗(ξ) ψj(ξ; θ)]_{aj}^{bj} − ∫_{ξj} p∗(ξ) (∂ψj(ξ; θ)/∂ξj) dξj
  = − ∫_{ξj} p∗(ξ) (∂ψj(ξ; θ)/∂ξj) dξj,
where aj and bj specify the boundaries of the data pdf p∗ along dimension j, and where we assume that [p∗(ξ) ψj(ξ; θ)]_{aj}^{bj} = 0.
Proof (not examinable)
If [p∗(ξ) ψj(ξ; θ)]_{aj}^{bj} = 0, then
  E∗[ψj(x; θ) ψ∗,j(x)] = − ∫_{ξk, k≠j} [ ∫_{ξj} p∗(ξ) (∂ψj(ξ; θ)/∂ξj) dξj ] dξk
  = − ∫ p∗(ξ) (∂ψj(ξ; θ)/∂ξj) dξ = −E∗[∂jψj(x; θ)]
so that
  Jsm(θ) = (1/2) E∗‖ψ(x; θ)‖² − Σ_{j=1}^d ( −E∗[∂jψj(x; θ)] ) + const
         = E∗[ Σ_{j=1}^d ( ∂jψj(x; θ) + (1/2) ψj(x; θ)² ) ] + const
Replacing the expectation over the data density p∗ by a sample average over the observed data gives a computable objective function for score matching.
Final method of score matching
◮ Given iid data x1, . . . , xn, the score matching estimate is
  θ̂ = argmin_θ J(θ),   J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d ( ∂jψj(xi; θ) + (1/2) ψj(xi; θ)² )
◮ ψj is the partial derivative of the log unnormalised model log p̃ with respect to the j-th coordinate (slope), and ∂jψj is its second partial derivative (curvature).
◮ This enables parameter estimation for models with intractable partition functions, without approximating the partition function.
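The objective J(θ) above can be sketched directly in code. This is a minimal illustration, not from the slides: the slope ψj and curvature ∂jψj are approximated by central finite differences of log p̃, and θ is found by a crude grid search; the function names and the Gaussian sanity check are my own choices.

```python
import numpy as np

def score_matching_objective(log_p_tilde, theta, X, h=1e-4):
    """J(θ) = (1/n) Σ_i Σ_j (∂jψj(x_i; θ) + ½ ψj(x_i; θ)²), with the slope ψj
    and curvature ∂jψj of log p̃ approximated by central finite differences."""
    n, d = X.shape
    total = 0.0
    for x in X:
        for j in range(d):
            e = np.zeros(d)
            e[j] = h
            f0 = log_p_tilde(x, theta)
            fp = log_p_tilde(x + e, theta)
            fm = log_p_tilde(x - e, theta)
            psi_j = (fp - fm) / (2.0 * h)           # slope along coordinate j
            dpsi_j = (fp - 2.0 * f0 + fm) / h**2    # curvature along coordinate j
            total += dpsi_j + 0.5 * psi_j**2
    return total / n

# Sanity check with the Gaussian model p̃(x; θ) = exp(−θ‖x‖²/2):
# J(θ) = −θ + (θ²/2)(1/n) Σ x_i², minimised at θ̂ = 1 / mean(x²).
log_p_tilde = lambda x, theta: -theta * np.sum(x**2) / 2.0
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
thetas = np.linspace(0.2, 3.0, 150)
values = [score_matching_objective(log_p_tilde, t, X) for t in thetas]
theta_hat = thetas[int(np.argmin(values))]
```

In practice one would minimise J(θ) with a gradient-based optimiser and use analytic or automatic derivatives of log p̃; the finite-difference grid search here only keeps the sketch self-contained.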
Requirements
J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d ( ∂jψj(xi; θ) + (1/2) ψj(xi; θ)² )
Requirements:
◮ technical (from the proof): [p∗(ξ) ψj(ξ; θ)]_{aj}^{bj} = 0, where aj and bj specify the boundaries of the data pdf p∗ along dimension j
◮ smoothness: the second derivatives of log p̃(ξ; θ) with respect to the ξj need to exist, and should be smooth with respect to θ so that J(θ) can be optimised with gradient-based methods.
Simple example
◮ p̃(ξ; θ) = exp(−θξ²/2); the parameter θ > 0 is the precision.
◮ The slope and curvature of the log unnormalised model are
  ψ(ξ; θ) = ∂ξ log p̃(ξ; θ) = −θξ,   ∂ξψ(ξ; θ) = −θ
◮ If p∗ is Gaussian, lim_{ξ→±∞} p∗(ξ) ψ(ξ; θ) = 0 for all θ.
◮ Score matching objective:
  J(θ) = −θ + (θ²/2) (1/n) Σ_{i=1}^n xi²   ⇒   θ̂ = ( (1/n) Σ_{i=1}^n xi² )^{−1}
◮ For Gaussians, this is the same as the MLE.
[Figure: negative score matching objective as a function of the precision, together with the term with the score function derivative and the term with the squared score function]
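The closed-form estimator for this Gaussian example is easy to check numerically (a quick sketch; the data, seed, and true precision below are illustrative):

```python
import numpy as np

# Check of θ̂ = (mean of x²)⁻¹ on synthetic zero-mean Gaussian data;
# for this model the score matching estimate coincides with the MLE
# of the precision.
rng = np.random.default_rng(1)
sigma = 0.5                      # true std, so true precision = 1/σ² = 4
x = rng.normal(0.0, sigma, size=5000)
theta_hat = 1.0 / np.mean(x**2)  # score matching estimate of the precision
```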
Extensions
◮ Score matching as presented here only works for x ∈ Rd.
◮ There are extensions for discrete and non-negative random variables (not examinable):
  https://www.cs.helsinki.fi/u/ahyvarin/papers/CSDA07.pdf
◮ Score matching can be shown to be part of a general framework for estimating unnormalised models (not examinable):
  https://michaelgutmann.github.io/assets/papers/Gutmann2011b.pdf
◮ Overall message: in some situations, learning criteria other than the likelihood are preferable.
Program recap
- 1. Basics of score matching
  - Basic ideas of score matching
  - Objective function that captures the basic ideas but cannot be computed
- 2. Practical objective function for score matching
  - Integration by parts to obtain a computable objective function
  - Simple example