

SLIDE 1

Binary probit regression with I-priors

Haziq Jamil

Supervisors: Dr. Wicher Bergsma & Prof. Irini Moustaki

Social Statistics (Year 3), London School of Economics and Political Science

8 May 2017, PhD Presentation Event
http://phd3.haziqj.ml

SLIDE 2

Outline

1. Introduction
   - I-priors
   - PhD roadmap
2. Probit models with I-priors
   - The latent variable motivation
   - Using I-priors
   - Estimation (and challenges)
3. Variational inference
   - Introduction
   - Mean-field factorisation
   - Variational I-prior probit
4. Examples
   - Cardiac arrhythmia data set
   - Meta-analysis of smoking cessation
5. Summary

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 0 / 24

SLIDE 3

The regression model

For i = 1, ..., n, consider the regression model

  y_i = f(x_i) + ε_i,   (ε_1, ..., ε_n)⊤ ~ N(0, Ψ⁻¹)   (1)

where f ∈ F, y_i ∈ ℝ, and x_i = (x_{i1}, ..., x_{ip}) ∈ X.

[Figure: sample data y plotted against x]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 1 / 24

SLIDE 4

I-priors

Let F be a reproducing kernel Hilbert space (RKHS) with reproducing kernel h_λ: X × X → ℝ. An I-prior on f is

  (f(x_1), ..., f(x_n))⊤ ~ N(f_0, I(f)),

with f_0 a prior mean, and I the Fisher information for f, given by

  I(f(x), f(x′)) = Σ_{k=1}^n Σ_{l=1}^n ψ_{kl} h_λ(x, x_k) h_λ(x′, x_l).

The I-prior regression model for i = 1, ..., n becomes

  y_i = f_0(x_i) + Σ_{k=1}^n h_λ(x_i, x_k) w_k + ε_i,
  (w_1, ..., w_n) ~ N(0, Ψ),   (ε_1, ..., ε_n) ~ N(0, Ψ⁻¹).   (2)

W. Bergsma (2017). "Regression with I-priors". Manuscript in preparation.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 2 / 24
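To make the structure of equation (2) concrete, here is a minimal Python sketch that draws one sample path f = f_0 + Hw with w ~ N(0, ψI_n). The fBm-type kernel formula and the grid of x values are illustrative assumptions only; the exact kernel used by the iprior package may differ in centring and scaling.

```python
import numpy as np

def fbm_kernel(x, y, gamma=0.5):
    # Standard fractional Brownian motion covariance with Hurst index gamma
    # (an assumed form; not taken verbatim from the slides).
    return 0.5 * (abs(x) ** (2 * gamma) + abs(y) ** (2 * gamma)
                  - abs(x - y) ** (2 * gamma))

def sample_iprior_path(x, psi=1.0, gamma=0.5, f0=0.0, seed=None):
    """Draw one sample (f(x_1), ..., f(x_n)) from the I-prior in equation (2)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Gram matrix H with entries h(x_i, x_k)
    H = np.array([[fbm_kernel(xi, xk, gamma) for xk in x] for xi in x])
    w = rng.normal(scale=np.sqrt(psi), size=n)   # w ~ N(0, psi * I_n)
    return f0 + H @ w                             # f = f0 + H w

x = np.linspace(0.1, 1.0, 50)
f = sample_iprior_path(x, seed=0)
```

Drawing several such paths and plotting them against x reproduces the kind of prior sample paths shown on the FBM RKHS slides.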

SLIDE 5

I-priors (cont.)

Of interest is the posterior regression function, characterised by the distribution

  p(f|y) = p(y|f) p(f) / ∫ p(y|f) p(f) df,

and also the posterior predictive distribution for new data points x_new,

  p(y_new|y) = ∫ p(y_new|y, f_new) p(f_new|y) df_new,

with f_new = f(x_new).

Estimation using the EM algorithm or direct maximisation of the marginal likelihood log p(y).
Complete Bayesian estimation is also possible.

HJ (2017a). iprior: Linear Regression using I-Priors. R package version 0.6.4: CRAN.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 3 / 24

SLIDE 6

Fractional Brownian motion (FBM) RKHS: Prior

[Figure: prior sample paths of f, plotted over x against y]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 4 / 24

SLIDE 7

Fractional Brownian motion (FBM) RKHS: Posterior

[Figure: posterior sample paths of f, plotted over x against y]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 4 / 24

SLIDE 8

Fractional Brownian motion (FBM) RKHS: Truth vs. posterior

[Figure: true regression function overlaid on the posterior, plotted over x against y]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 4 / 24

SLIDE 9

Posterior predictive distribution

[Figure: posterior predictive mean with 95% credible interval, plotted over x against y]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 5 / 24

SLIDE 10

Posterior predictive distribution: posterior predictive check

[Figure: density of y, observed data vs. replications]

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 5 / 24

SLIDE 11

PhD Roadmap

I-priors: a unified methodology for
- additive models
- multilevel models
- models with functional covariates
RKHSs: canonical (linear), FBM, Pearson.
Estimation: direct maximisation, EM algorithm, MCMC (Gibbs/HMC).
Software: R/iprior.

Bayesian variable selection (using I-priors in the canonical RKHS): good performance in cases with multicollinearity.

Binary probit models with I-priors: extension to binary responses, estimated using variational inference. Advantages:
- Minimal assumptions
- Straightforward inference
- Competitive performance in classification and fitted probabilities
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 6 / 24

SLIDE 12

1 Introduction 2 Probit models with I-priors 3 Variational inference 4 Examples 5 Summary

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 6 / 24

SLIDE 13

The latent variable motivation

Consider binary responses y_1, ..., y_n together with their corresponding covariates x_1, ..., x_n.

For i = 1, ..., n, model the responses as

  y_i ~ Bern(p_i).

Assume that there exist continuous underlying latent variables y*_1, ..., y*_n such that

  y_i = 1 if y*_i ≥ 0, and y_i = 0 if y*_i < 0.

Model these continuous latent variables according to

  y*_i = f(x_i) + ε_i,

where (ε_1, ..., ε_n) ~ N(0, Ψ⁻¹) and f ∈ F (some RKHS).
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 7 / 24
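The thresholding mechanism can be checked by simulation. The Python sketch below uses an arbitrary choice of f (nothing here comes from the slides): it draws latent propensities y*_i = f(x_i) + ε_i, thresholds them at zero, and confirms that the implied success probability near a point matches Φ(f(x)).

```python
import math
import numpy as np

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(-2, 2, size=n)
f = 1.5 * x - 0.5              # an illustrative regression function
eps = rng.normal(size=n)       # iid errors with psi = 1
y_star = f + eps               # latent propensities
y = (y_star >= 0).astype(int)  # thresholding at zero gives the binary response

# Empirically, P[y_i = 1 | x_i] should match Phi(f(x_i)) near any point:
mask = np.abs(x - 1.0) < 0.05
empirical = y[mask].mean()
theoretical = Phi(1.5 * 1.0 - 0.5)   # Phi(1.0)
```

The empirical proportion of ones in the narrow window around x = 1 should be close to Φ(1.0) ≈ 0.84, up to Monte Carlo error.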

SLIDE 14

Using I-priors

Assume an I-prior on f. Then

  f(x_i) = α + Σ_{k=1}^n h_λ(x_i, x_k) w_k,   (w_1, ..., w_n) ~ N(0, Ψ),

taking the prior mean to be a constant, f_0(x_i) = α.

For now, consider iid errors, Ψ = ψI_n. In this case,

  p_i = P[y_i = 1] = P[y*_i ≥ 0] = P[ε_i ≤ f(x_i)] = Φ( ψ^{1/2} (α + Σ_{k=1}^n h_λ(x_i, x_k) w_k) ),

where Φ is the CDF of a standard normal.

There is no loss of generality compared with using an arbitrary threshold τ or error precision ψ. Thus, set ψ = 1.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 8 / 24

SLIDE 15

Estimation

Denote f_i = f(x_i) for short.

The marginal density

  p(y) = ∫ p(y|f) p(f) df = ∫ Π_{i=1}^n Φ(f_i)^{y_i} (1 − Φ(f_i))^{1−y_i} · N(f | α1_n, H_λ²) df,

on which p(f|y) depends, cannot be evaluated analytically.

Some strategies:
✗ Naive Monte Carlo integration
✗ EM algorithm with an MCMC E-step
✓ Laplace approximation
✓ MCMC sampling

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 9 / 24

SLIDE 16

1 Introduction 2 Probit models with I-priors 3 Variational inference 4 Examples 5 Summary

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 9 / 24

SLIDE 17

Variational inference

Consider a statistical model where we have observations (y_1, ..., y_n) and also some latent variables (z_1, ..., z_n).
The z_i could be random effects or some auxiliary latent variables.
In a Bayesian setting, these could also include the parameters to be estimated.

GOAL: find approximations for
- the posterior distribution p(z|y); and
- the marginal likelihood (or model evidence) p(y).

Variational inference is a deterministic approach, unlike MCMC.

C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, Ch. 10.
K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press, Ch. 21.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 10 / 24

SLIDE 18

Decomposition of the log marginal

Let q(z) be some density function to approximate p(z|y). Then the log-marginal density can be decomposed as follows:

  log p(y) = ∫ q(z) log{ p(y, z) / q(z) } dz + ∫ q(z) log{ q(z) / p(z|y) } dz
           = L(q) + KL(q‖p)
           ≥ L(q).

L is referred to as the "lower bound", and it serves as a surrogate function for the marginal.
Maximising L(q) is equivalent to minimising KL(q‖p).
Although KL(q‖p) is minimised at q(z) ≡ p(z|y) (c.f. the EM algorithm), we are unable to work with p(z|y).
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 11 / 24
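The identity log p(y) = L(q) + KL(q‖p) holds exactly for any q, and can be verified numerically with a toy discrete latent variable. The joint probabilities and the choice of q below are arbitrary illustrations.

```python
import numpy as np

# Toy model: one discrete latent z in {0, 1, 2} and a fixed observation y.
# p_joint[z] holds p(y, z) for the observed y (unnormalised over z).
p_joint = np.array([0.10, 0.25, 0.05])
p_y = p_joint.sum()          # marginal likelihood p(y)
p_post = p_joint / p_y       # exact posterior p(z | y)

q = np.array([0.5, 0.3, 0.2])                # any approximating density over z

L = np.sum(q * np.log(p_joint / q))          # lower bound L(q)
KL = np.sum(q * np.log(q / p_post))          # KL(q || p(. | y))

# The decomposition holds exactly, and L(q) never exceeds log p(y):
assert np.isclose(np.log(p_y), L + KL)
assert L <= np.log(p_y)
```

Setting q equal to p_post drives KL to zero and makes the bound tight, which is exactly why maximising L(q) is equivalent to minimising KL(q‖p).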

SLIDE 19

Factorised distributions (mean-field theory)

Maximising L over all possible q is not feasible. We need some restrictions, but only to achieve tractability.

Suppose we partition the elements of z into m disjoint groups z = (z^(1), ..., z^(m)), and assume

  q(z) = Π_{j=1}^m q_j(z^(j)).

Under this restriction, the solution to arg max_q L(q) is

  q̃_j(z^(j)) ∝ exp( E_{−j}[log p(y, z)] )   (3)

for j ∈ {1, ..., m}.

In practice, these unnormalised densities are of recognisable form (especially if conjugate priors are used).

D. M. Blei et al. (2016). "Variational Inference: A Review for Statisticians". arXiv: 1601.00670.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 12 / 24

SLIDE 20

Coordinate ascent mean-field variational inference (CAVI)

The optimal distributions are coupled with one another, i.e. each q̃_j(z^(j)) depends on the optimal moments of z^(k), k ∈ {1, ..., m : k ≠ j}.
One way around this is to employ an iterative procedure.
Assess convergence by monitoring the lower bound

  L(q) = E_q[log p(y, z)] − E_q[log q(z)].

Algorithm 1 CAVI
1: initialise variational factors q_j(z^(j))
2: while L(q) not converged do
3:   for j = 1, ..., m do
4:     log q_j(z^(j)) ← E_{−j}[log p(y, z)] + const.
5:   end for
6:   L(q) ← E_q[log p(y, z)] − E_q[log q(z)]
7: end while
8: return q̃(z) = Π_{j=1}^m q̃_j(z^(j))

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 13 / 24

SLIDE 21

Variational I-prior probit

[Graphical model: x_i → f_i → y*_i → y_i, with h and w_i feeding f_i, and parameters λ, α; i = 1, ..., n]

p(y, y*, w, α, λ) = p(y|y*) p(y*|f) p(w) p(λ) p(α)
 = Π_{i=1}^n 1[y*_i ≥ 0]^{y_i} 1[y*_i < 0]^{1−y_i} · Π_{i=1}^n N(y*_i | f_i, 1) · Π_{k=1}^n N(w_k | 0, 1) · N(λ | λ_0, κ_0⁻¹) · N(α | α_0, ν_0⁻¹)

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 14 / 24

SLIDE 22

Posterior distribution

Approximate the posterior by a mean-field variational density

  p(y*, w, α, λ | y) ≈ Π_{i=1}^n q(y*_i) · q(w) q(α) q(λ),

where

  q(y*_i) ≡ 1[y*_i ≥ 0] N(f̃_i, 1) if y_i = 1,  1[y*_i < 0] N(f̃_i, 1) if y_i = 0
  q(w) ≡ N(w̃, Ṽ_w)
  q(λ) ≡ N(λ̃, ṽ_λ)
  q(α) ≡ N(α̃, 1/n)

with

  f̃_i = α̃ + Σ_{k=1}^n h_λ̃(x_i, x_k) w̃_k
  α̃ = (1/n) Σ_{i=1}^n ( E[y*_i] − Σ_{k=1}^n h_λ̃(x_i, x_k) w̃_k )
  w̃ = Ṽ_w H_λ̃ (E[y*] − α̃1_n),   Ṽ_w⁻¹ = H_λ̃² + I_n
  λ̃ = (E[y*] − α̃1_n)⊤ H w̃ / ṽ_λ,   ṽ_λ = tr( H² (Ṽ_w + w̃w̃⊤) )
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 15 / 24
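The coupled updates above can be sketched in code. This Python fragment is a deliberate simplification: the kernel matrix H is held fixed, so the λ and ṽ_λ updates are omitted, and the toy data are invented for illustration. It iterates the α̃ and w̃ updates, using the standard mean of a normal truncated at zero for E[y*_i].

```python
import math
import numpy as np

def Phi(z):
    # Standard normal CDF (vectorised via math.erf)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def phi(z):
    # Standard normal density
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def iprobit_cavi(H, y, n_iter=50):
    """Simplified coupled updates with H (hence lambda) held fixed."""
    n = len(y)
    V_w = np.linalg.inv(H @ H + np.eye(n))   # Vw = (H^2 + I_n)^{-1}
    w, alpha = np.zeros(n), 0.0
    for _ in range(n_iter):
        f = alpha + H @ w
        # E[y*_i] under the truncated normal q(y*_i)
        Ey = np.where(y == 1,
                      f + phi(f) / Phi(f),
                      f - phi(f) / (1.0 - Phi(f)))
        alpha = float(np.mean(Ey - H @ w))   # update for alpha-tilde
        w = V_w @ H @ (Ey - alpha)           # update for w-tilde
    return alpha, w

# Toy usage on a tiny, separable example (illustrative data only)
x = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])
H = np.outer(x, x)                 # canonical (linear) kernel, lambda = 1
y = np.array([0, 0, 1, 1, 1])
alpha, w = iprobit_cavi(H, y)
fitted = Phi(alpha + H @ w)        # fitted probabilities P[y_i = 1]
```

On this toy data the fitted probabilities increase with x, matching the labels; the full algorithm would additionally refresh λ̃ and ṽ_λ (and hence H) at each sweep.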

SLIDE 23

Variational lower bound

Since the solutions are coupled, we implement an iterative scheme (as per Algorithm 1).
Assess convergence by monitoring the lower bound

  L = E_q[log p(y, y*, w, α, λ)] − E_q[log q(y*, w, α, λ)]
    = const. + Σ_{i=1}^n { y_i log Φ(f̃_i) + (1 − y_i) log(1 − Φ(f̃_i)) }
      − ½ { tr(Ṽ_w + w̃w̃⊤) − log|Ṽ_w| + log ṽ_λ }.

(Possible) ISSUE: different initialisations lead to different converged lower-bound values, indicating the presence of many local optima.
From experience, the local optima typically still give good predictive abilities.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 16 / 24

SLIDE 24

Posterior predictive distribution

Given new data points x_new, we are interested in

  p(y_new|y) = ∫ p(y_new|y*_new, y) p(y*_new|y) dy*_new
             ≈ ∫ p(y_new|y*_new) q(y*_new) dy*_new
             = Φ(f̃_new) if y_new = 1,  1 − Φ(f̃_new) if y_new = 0,

where f̃_new = α̃ + Σ_{k=1}^n h_λ̃(x_new, x_k) w̃_k.

f̃_new represents the estimate of the latent propensity for y_new, and its uncertainty is described by q(y*_new).
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 17 / 24

SLIDE 25

1 Introduction 2 Probit models with I-priors 3 Variational inference 4 Examples 5 Summary

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 17 / 24

SLIDE 26

Cardiac arrhythmia data set

Detect the presence of cardiac arrhythmia based on various ECG data and other attributes such as age and weight (n = 451, p = 194).

[Figure: standardised attribute values for the Normal and Arrhythmia groups]

H. A. Guvenir et al. (1998). UCI Machine Learning Repository: Arrhythmia Data Set. URL: https://archive.ics.uci.edu/ml/datasets/Arrhythmia.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 18 / 24

SLIDE 27

Cardiac arrhythmia data set: model fit

Fit an I-prior probit model using the canonical and FBM kernels. Fitting the full data set takes about 35 seconds.

R> mod <- iprobit(y, X, kernel = "FBM")

Compare against popular classifiers: 1) k-nearest neighbours; 2) support vector machine; 3) Gaussian process classification; 4) random forests; 5) nearest shrunken centroids (Tibshirani et al. 2003); and 6) L-1 penalised logistic regression.

Experiment set-up:
- Form a training set by sub-sampling n_sub ∈ {50, 100, 200} data points.
- Use the remaining data as the test set.
- Fit the model on the training set and obtain test error rates.
- Repeat 100 times.

T. I. Cannings and R. J. Samworth (2017). "Random-projection ensemble classification". J. R. Stat. Soc. Ser. B: Stat. Methodol. (with discussion), to appear.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 19 / 24

SLIDE 28

Cardiac arrhythmia data set: results

[Figure: distributions of misclassification rates (%) over 100 repetitions for k-nn, SVM, NSC, GP (radial), I-probit (linear), L-1 logistic, I-probit (FBM-0.5) and random forests. Mean rates shown in the original plot: 34.54, 34.69, 40.64, 36.16, 37.28, 31.65, 34.98, 34.92 (n = 50); 31.43, 27.28, 38.94, 35.64, 33.80, 26.72, 33.00, 30.48 (n = 100); 29.72, 24.51, 35.76, 35.20, 29.31, 22.40, 31.08, 26.12 (n = 200)]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 20 / 24

SLIDE 29

Meta-analysis of smoking cessation

[Diagram: patients nested within studies, e.g. Fagerstrom 1982, Villa 1999, Nakamura 1990, Garvey 2000, ..., Niaura 1999]

Data from 27 separate smoking cessation studies, in which participants were subjected to nicotine gum treatment or placed in a control group.

Some summary statistics:

            Min.   Avg.   Max.   Prop. quit   Odds quit
  Control    20    101    617      0.207        0.261
  Treated    21    117    600      0.320        0.470

Raw odds ratio: 1.801.
A random-effects analysis using a multilevel logistic model estimates this odds ratio as 1.768.

A. Skrondal and S. Rabe-Hesketh (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC, §9.5.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 21 / 24

SLIDE 30

Meta-analysis of smoking cessation: model

Let i = 1, ..., n_j index the patients in study group j ∈ {1, ..., 27}.
Denote by y_ij the binary response variable indicating Quit (1) or Remain (0), and by x_ij patient (i, j)'s treatment group indicator.
Model the binary data using the I-probit model

  Φ⁻¹(p_ij) = f(x_ij, j) = f_1(x_ij) + f_2(j) + f_12(x_ij, j),

with f_1, f_2 in the Pearson RKHS, and f_12 in the ANOVA RKHS.

  Model                   Lower bound   Brier score   No. of RKHS param.
  1  f_1                   −3210.79       0.0311             1
  2  f_1 + f_2             −3097.24       0.0294             2
  3  f_1 + f_2 + f_12      −3091.21       0.0294             2
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 22 / 24

SLIDE 31

Meta-analysis of smoking cessation: results

[Figure: model-predicted odds of quitting (control vs. nicotine gum treatment) for each study, e.g. Fagerstrom 1982, Villa 1999, Hall 1985, Tonnesen 1988, Nakamura 1990, Campbell 1991, Garcia 1989, Gross 1995, Garvey 2000, Niaura 1999; study-level predicted odds ratios range from 1.031 to 2.390]

Average odds ratio = 1.687.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 23 / 24

SLIDE 32

1 Introduction 2 Probit models with I-priors 3 Variational inference 4 Examples 5 Summary

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 23 / 24

SLIDE 33

Summary

An extension of the I-prior methodology to binary responses.
Variational inference is used to approximate the intractable likelihood:
- a deterministic approximation of the posterior density by a "close" (in the KL divergence sense), tractable density;
- it sits somewhere between Laplace's method and MCMC sampling.
Several real-world examples demonstrated the use of I-probit models for classification and inference.
Further work:
- R package iprobit
- Extend to the non-iid errors case
- Extend to multinomial probit models
- Other algorithms (e.g. expectation propagation)

Slides, source code and results are available at: http://phd3.haziqj.ml
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 24 / 24

SLIDE 34

End

Thank you!
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 24 / 24

SLIDE 35

References I

- Bergsma, W. (2017). "Regression with I-priors". Manuscript in preparation.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Blei, D. M., A. Kucukelbir and J. D. McAuliffe (2016). "Variational Inference: A Review for Statisticians". arXiv: 1601.00670.
- Cannings, T. I. and R. J. Samworth (2017). "Random-projection ensemble classification". Journal of the Royal Statistical Society, Series B: Statistical Methodology (with discussion), to appear.
- Guvenir, H. A., M. Burak Acar and H. Muderrisoglu (1998). UCI Machine Learning Repository: Arrhythmia Data Set. URL: https://archive.ics.uci.edu/ml/datasets/Arrhythmia.
- Jamil, H. (2017a). iprior: Linear Regression using I-Priors. R package version 0.6.4: CRAN.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 24 / 24

SLIDE 36

References II

- Jamil, H. (2017b). iprobit: Binary Probit Regression with I-Priors. R package version 0.1.0: GitHub.
- Kass, R. and A. Raftery (1995). "Bayes Factors". Journal of the American Statistical Association 90.430, pp. 773-795.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
- Skrondal, A. and S. Rabe-Hesketh (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC.
- Tibshirani, R., T. Hastie, B. Narasimhan and G. Chu (2003). "Class prediction by nearest shrunken centroids, with applications to DNA microarrays". Statistical Science 18.1, pp. 104-117.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 24 / 24

SLIDE 37

6 Additional material
- The I-prior probit model
- Laplace's method
- Full Bayesian analysis of I-probit models
- Variational inference
- A simple variational inference example
- Fisher's Iris data set
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 24 / 24

SLIDE 38

Additional material: the I-prior probit model

[Graphical model: x_i → f_i → p_i → y_i via Φ, with h and w_i feeding f_i, and parameters λ, α; i = 1, ..., n]

p(y, w, α, λ) = p(y|f) p(w) p(λ) p(α)
 = Π_{i=1}^n Φ(f_i)^{y_i} (1 − Φ(f_i))^{1−y_i} · Π_{k=1}^n N(w_k | 0, 1) · N(λ | λ_0, κ_0⁻¹) · N(α | α_0, τ_0⁻¹)

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 25 / 24

SLIDE 39

Additional material: Laplace's method

We are interested in p(f|y) ∝ p(y|f) p(f) =: e^{Q(f)}, with normalising constant p(y) = ∫ e^{Q(f)} df. The Taylor expansion of Q about its mode f̃,

  Q(f) ≈ Q(f̃) − ½ (f − f̃)⊤ A (f − f̃),

is recognised as the logarithm of an unnormalised Gaussian density, with A = −D²Q(f̃) the negative Hessian of Q evaluated at f̃.

The posterior p(f|y) is then approximated by N(f̃, A⁻¹), and the marginal by

  p(y) ≈ (2π)^{n/2} |A|^{−1/2} p(y|f̃) p(f̃).

This won't scale with large n, and it is difficult to find modes in high dimensions.

R. Kass and A. Raftery (1995). "Bayes Factors". Journal of the American Statistical Association 90.430, pp. 773-795, §4.1, pp. 777-778.

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 26 / 24
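As a sanity check of the approximation quality, the sketch below applies Laplace's method to a one-dimensional case where the normalising constant is known in closed form. The choice Q(t) = a log t − b t and the values of a and b are arbitrary illustrations, not from the slides.

```python
import math

# Unnormalised log-posterior Q(t) = a*log(t) - b*t (a Gamma-shaped example
# standing in for log p(y|f) + log p(f); a and b are illustrative only).
a, b = 10.0, 2.0

t_mode = a / b                      # mode of Q, found analytically here
A = a / t_mode ** 2                 # negative second derivative at the mode

# Laplace estimate of the normalising constant (one dimension):
Q_mode = a * math.log(t_mode) - b * t_mode
laplace = math.sqrt(2 * math.pi / A) * math.exp(Q_mode)

# Exact value: integral of t^a * exp(-b*t) over (0, inf) = Gamma(a+1) / b^(a+1)
exact = math.gamma(a + 1) / b ** (a + 1)

print(laplace / exact)   # ≈ 0.99: accurate to about 1% for this example
```

The approximation improves as the posterior becomes more Gaussian (here, as a grows), which mirrors the large-n behaviour discussed on the slide.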

SLIDE 40

Additional material: full Bayesian analysis using MCMC

Assign hyperpriors to the parameters of the I-prior, e.g.
- λ² ~ Γ⁻¹(a, b)
- α ~ N(c, d²)
for a hierarchical model to be estimated fully Bayes.

There are no closed-form posteriors, so we need to resort to MCMC sampling.
This is computationally slow, and sampling difficulty results in unreliable posterior samples.

[Figure: trace plots of lambda across 8 chains]

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 27 / 24

SLIDE 41

Additional material: variational inference

The name derives from the calculus of variations, which deals with maximising or minimising functionals.

  Functions p: Θ → ℝ (standard calculus)
  Functionals H: p ↦ ℝ (variational calculus)

Using standard calculus, we can solve

  arg max_θ p(θ) =: θ̂,

e.g. p is a likelihood function, and θ̂ is the ML estimate.

Using variational calculus, we can solve

  arg max_p H(p) =: p̃,

e.g. H is the entropy H(p) = −∫ p(x) log p(x) dx, and p̃ is the entropy-maximising distribution.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 28 / 24

SLIDE 42

Additional material: comparison of approximations (density)

[Figure: density over z comparing the truth with the Laplace and variational approximations; mode and mean marked]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 29 / 24 back

SLIDE 43

Additional material: comparison of approximations (deviance)

[Figure: deviance (−2 × log-density) over z comparing the truth with the Laplace and variational approximations]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 30 / 24 back

SLIDE 44

Additional material: estimation of a 1-dim Gaussian mean and variance

GOAL: Bayesian inference for the mean μ and variance ψ⁻¹ of

  y_i ~iid N(μ, ψ⁻¹),  i = 1, ..., n   (data)
  μ | ψ ~ N(μ_0, (κ_0ψ)⁻¹),  ψ ~ Γ(a_0, b_0)   (priors)

Substitute p(μ, ψ|y) with the mean-field approximation q(μ, ψ) = q_μ(μ) q_ψ(ψ).

From (3), we can work out the solutions

  q̃_μ(μ) ≡ N( (κ_0μ_0 + nȳ)/(κ_0 + n), 1/((κ_0 + n) E_q[ψ]) )

and

  q̃_ψ(ψ) ≡ Γ(ã, b̃),  ã = a_0 + (n + 1)/2,
  b̃ = b_0 + ½ E_q[ Σ_{i=1}^n (y_i − μ)² + κ_0(μ − μ_0)² ].

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 31 / 24
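The two updates can be iterated directly. This Python sketch implements them with the expected quadratic form expanded under q_μ = N(m, s²) and the standard conjugate shape ã (cf. Bishop, Ch. 10); the simulated data and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def cavi_gaussian(y, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, n_iter=20):
    """CAVI for the 1-dim Gaussian example: q_mu = N(m, s2), q_psi = Gamma(a, b)."""
    n, ybar = len(y), np.mean(y)
    m = (kappa0 * mu0 + n * ybar) / (kappa0 + n)   # mean of q_mu (fixed)
    a = a0 + (n + 1) / 2                            # shape of q_psi (fixed)
    b = b0                                          # initial guess for b-tilde
    for _ in range(n_iter):
        E_psi = a / b                               # E_q[psi] under Gamma(a, b)
        s2 = 1.0 / ((kappa0 + n) * E_psi)           # variance of q_mu
        # E_q[ sum (y_i - mu)^2 + kappa0 (mu - mu0)^2 ] with mu ~ N(m, s2):
        E_quad = np.sum((y - m) ** 2) + n * s2 + kappa0 * ((m - mu0) ** 2 + s2)
        b = b0 + 0.5 * E_quad                       # rate of q_psi
    return m, s2, a, b

rng = np.random.default_rng(42)
y = rng.normal(loc=2.0, scale=1.0, size=500)       # true mean 2, precision 1
m, s2, a, b = cavi_gaussian(y)
# m should be close to 2, and E_q[psi] = a/b close to 1.
```

Only a handful of sweeps are needed here because m and ã never change; only s² and b̃ are refreshed, which is exactly the alternation the animation frames on the next slides display.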

SLIDES 45-49

Additional material: estimation of a 1-dim Gaussian mean and variance (cont.)

[Animation frames: contours of q(μ, ψ) over the mean μ and precision ψ, with the lower bound L(q) approaching log p(y) across iteration 0 (initialisation), iteration 1 (μ update), iteration 1 (ψ update), iteration 2 (μ update) and iteration 2 (ψ update)]

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 32 / 24

SLIDE 50

Additional material: Fisher's Iris data set

[Figure: scatter plot of Sepal.Width against Sepal.Length, coloured by class (Setosa vs. others)]

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 33 / 24

SLIDE 51

Additional material: Fisher's Iris data set, model fitting

Variational inference for I-prior probit models is implemented in the R package iprobit (still lots of work to do!).

R> system.time(
+   (mod <- iprobit(y, X))
+ )
##
## |================================= | 61%
## Converged after 6141 iterations.
## Training error rate: 0 %
##    user  system elapsed
##  67.857   6.396  74.277

HJ (2017b). iprobit: Binary Probit Regression with I-Priors. R package version 0.1.0: GitHub.
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 34 / 24

SLIDE 52

Additional material: Fisher's Iris data set, model summary

R> summary(mod)
##
## Call:
## iprobit(y = y, X, maxit = 10000)
##
## RKHS used: Canonical
##
##            Mean    S.E.     2.5%    97.5%
## alpha   -4.1730  0.0816  -4.3330  -4.0129
## lambda   1.2896  0.0142   1.2618   1.3175
##
## Converged to within 1e-05 tolerance. No. of iterations: 6141
## Model classification error rate (%): 0
## Variational lower bound: -12.93486
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 35 / 24

SLIDE 53

Additional material: Fisher's Iris data set, lower bound

R> iplot_lb(mod, niter.plot = 10)

[Figure: variational lower bound against iteration (and time in seconds), rising from −23.57 to −12.93 over the first 10 iterations]
Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 36 / 24

SLIDE 54

Additional material: Fisher's Iris data set, decision boundary

R> iplot_decbound(mod)

[Figure: fitted decision boundary over Sepal.Length and Sepal.Width, classes Setosa vs. others]

Haziq Jamil - http://haziqj.ml I-prior probit 8 May 2017 37 / 24