Bias-Adjusted Maximum Likelihood Estimation: Improving Estimation for Exponential-Family Random Graph Models (ERGMs)


SLIDE 1

Bias-Adjusted Maximum Likelihood Estimation

Improving Estimation for Exponential-Family Random Graph Models (ERGMs) Ruth M Hummel David R Hunter

Department of Statistics, Penn State University

MURI meeting, May 25, 2010

MURI meeting May 2010 Estimation for ERGMs

SLIDE 2

Motivation: Why model networks?

A statistical model for observed network data yobs allows us to:

  • Summarize: Give a parsimonious quantitative summary of the data and, ideally, how precisely we know this summary
  • Predict: Describe or simulate other networks that could have arisen from the same process

SLIDE 3

Motivation: The likelihood function and MLE

The ERG model class:

  Pθ(Y = y) = exp{θᵗg(y)} / κ(θ),   where   κ(θ) = Σ over all possible graphs z of exp{θᵗg(z)}.

θ is a parameter vector to be estimated. g(y) is a user-defined vector of graph statistics. The loglikelihood function is

  ℓ(θ) = θᵗg(yobs) − log κ(θ).

The MLE is the maximizer θ̂ of the likelihood.
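For intuition, ℓ(θ) can be computed exactly by brute force whenever the graph is tiny enough to enumerate. The following Python sketch is not from the slides; the 4-node network and the statistic vector g(y) = (edge count, triangle count) are illustrative choices only.

```python
import itertools
import math

n = 4
dyads = list(itertools.combinations(range(n), 2))  # 6 dyads, so 2^6 = 64 possible graphs


def stats(edges):
    """g(y) = (number of edges, number of triangles)."""
    es = set(edges)
    tri = sum(1 for a, b, c in itertools.combinations(range(n), 3)
              if (a, b) in es and (a, c) in es and (b, c) in es)
    return (len(edges), tri)


def log_kappa(theta):
    """log kappa(theta): log of the sum over ALL possible graphs z of exp{theta^t g(z)}."""
    terms = []
    for bits in itertools.product([0, 1], repeat=len(dyads)):
        edges = [d for d, b in zip(dyads, bits) if b]
        g = stats(edges)
        terms.append(theta[0] * g[0] + theta[1] * g[1])
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))


def loglik(theta, y_obs_edges):
    """l(theta) = theta^t g(y_obs) - log kappa(theta)."""
    g = stats(y_obs_edges)
    return theta[0] * g[0] + theta[1] * g[1] - log_kappa(theta)


y_obs = [(0, 1), (1, 2), (0, 2), (2, 3)]  # a triangle plus a pendant edge
print(loglik((-0.5, 0.3), y_obs))
```

At θ = 0 every graph is equally likely, so the loglikelihood is simply −log 64; the next slide shows why this enumeration is hopeless at realistic sizes.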

SLIDE 4

The likelihood is sometimes intractable

[Figure: an undirected, 34-node network; per-node labels omitted.]

For this undirected, 34-node network, computing ℓ(θ) directly requires summation of 7,547,924,849,643,082,704,483,109,161,976,537,781,833,842,440,832,880,856,752,412,600,491,248,324,784,297,704,172,253,450,355,317,535,082,936,750,061,527,689,799,541,169,259,849,585,265,122,868,502,865,392,087,298,790,653,952 terms: one for each of the 2^561 possible graphs, since a 34-node undirected network has (34 choose 2) = 561 dyads.
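The count is just the number of labeled undirected graphs on 34 nodes, 2 raised to the number of dyads. A quick sanity check (not from the slides):

```python
from math import comb

n_nodes = 34
n_dyads = comb(n_nodes, 2)   # 561 possible edges
n_graphs = 2 ** n_dyads      # one term in kappa(theta) per possible graph

print(n_dyads)               # 561
print(len(str(n_graphs)))    # 169 digits: far beyond any feasible enumeration
```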

SLIDE 5

The pseudolikelihood: A tractable alternative

Some algebra based on the ERGM gives, for all i ≠ j,

  log [ P(Yij = 1 | Yᶜij) / P(Yij = 0 | Yᶜij) ] = θᵗ[ g(Y⁺ij) − g(Y⁻ij) ],

where Y⁺ij and Y⁻ij denote the network Y with the (i, j) dyad set to 1 and to 0, respectively. The pseudolikelihood ignores the conditioning, assuming instead

  log [ P(Yij = 1) / P(Yij = 0) ] = θᵗ[ g(Y⁺ij) − g(Y⁻ij) ] ≡ θᵗδ(Y)ij

independently for all i ≠ j. Thus, the pseudolikelihood equals

  ∏(i≠j)  exp{ θᵗδ(yobs)ij (yobs)ij } / ( 1 + exp{ θᵗδ(yobs)ij } ).
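The product above is exactly a logistic-regression likelihood in which each dyad's response is (yobs)ij and its covariate vector is the change statistic δ(yobs)ij. A minimal sketch (not from the slides) for the edges-only model, where δ(Y)ij = 1 for every dyad and the 5-node network is hypothetical data:

```python
import itertools
import math

n = 5
y_obs = {(0, 1), (1, 2), (2, 3), (0, 2)}                  # 4 edges present
dyads = list(itertools.combinations(range(n), 2))         # 10 dyads in total


def log_pseudolik(theta):
    """Log of the pseudolikelihood product; delta(y)_ij = 1 for the edges-only model."""
    total = 0.0
    for d in dyads:
        y_ij = 1 if d in y_obs else 0
        total += theta * 1.0 * y_ij - math.log(1.0 + math.exp(theta * 1.0))
    return total


# Crude grid maximization keeps the sketch dependency-free.
theta_hat = max((t / 1000 for t in range(-5000, 5001)), key=log_pseudolik)

# For the edges-only model the MPLE is the log-odds of the observed density.
density = len(y_obs) / len(dyads)
print(theta_hat, math.log(density / (1 - density)))
```

With dependence terms in g(y) the two estimates no longer coincide with a simple closed form, which is where the MPLE-versus-MLE comparison on the following slides becomes interesting.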

SLIDE 6

Evidence of bias in MPLE compared to MLE

Van Duijn, Gile, and Handcock (2009, Social Networks) compare the MLE to the MPLE. They cite a small but compelling set of explorations of the MPLE suggesting that there may be large differences between the MPLE and the approximate MLE, sometimes even in cases where dependence is not thought to be a concern. They explore the bias in the MLE and MPLE relative to the “truth.” They introduce a bias-corrected version of the MPLE (the “MBLE”). A similar bias-correction is possible for the MLE, though it is a bit less straightforward.

SLIDE 7

Bias-correction via Firth

The bias-correction we employ (perhaps better described as a preemptive bias-mitigation than a correction) follows Firth (1993). The idea is to maximize a penalized likelihood, which induces a bias in the score function in order to reverse some of the anticipated bias in the maximizer. The penalized loglikelihood is

  ℓbc(θ) = ℓ(θ) + (1/2) log |I(θ)|,

where I(θ) is the Fisher information. The resulting maximizer is also the Bayesian maximum a posteriori estimator under a Jeffreys prior on the parameter.
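To see the penalty in action outside the ERGM setting, consider the simplest exponential family where everything is closed-form: y successes in n Bernoulli trials with canonical parameter θ = logit(p), so I(θ) = np(1 − p). The sketch below is not from the slides; the closed-form answer logit((y + 1/2)/(n + 1)) is a known property of Firth's method for this model.

```python
import math


def firth_logit(y, n, iters=100):
    """Newton's method on the Firth-penalized score for the binomial logit model.

    Penalized loglikelihood: l_bc(theta) = y*theta - n*log(1 + e^theta)
                                           + (1/2) * log(n * p * (1 - p)).
    """
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-theta))
        score = (y - n * p) + (0.5 - p)       # l'(theta) plus the penalty's derivative
        info = (n + 1) * p * (1.0 - p)        # -l_bc''(theta)
        theta += score / info
    return theta


y, n = 3, 10
theta_bc = firth_logit(y, n)
theta_closed = math.log((y + 0.5) / (n + 1) / (1 - (y + 0.5) / (n + 1)))
print(theta_bc, theta_closed)
```

Note that the penalized estimate is pulled toward 0 relative to the raw MLE logit(3/10), and it remains finite even when y = 0 or y = n, where the ordinary MLE diverges.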

SLIDE 8

The intuition behind this modification for an exponential-family model is the following: since the score function U(η) can be written U(η) = ℓ′(η) = g(Y) − ∇ log κ(η), the sufficient statistic g(Y) only shifts U(η) vertically and does not affect its shape. For this reason, any anticipated bias in the MLE can be offset by shifting the score function by the amount bias × ∇U. (Here ∇U = −i(η), the negative Fisher information.) This adjustment is illustrated in the following figure, taken from Firth (1993):

Figure: Modification of the unbiased score function
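The shift argument can be written out in a few lines (a sketch; b(η) denotes the anticipated first-order bias of the MLE, with notation as above):

```latex
U^*(\eta) \;=\; U(\eta) + b(\eta)\,\nabla U(\eta) \;=\; U(\eta) - i(\eta)\,b(\eta).
% If \hat\eta solves U(\hat\eta) = 0, a Taylor expansion about \hat\eta gives, for the
% root \eta^* of the modified score U^*,
0 \;=\; U^*(\eta^*) \;\approx\; (\eta^* - \hat\eta)\,U'(\hat\eta) - i(\hat\eta)\,b(\hat\eta)
  \;\approx\; -(\eta^* - \hat\eta)\,i(\hat\eta) - i(\hat\eta)\,b(\hat\eta),
\qquad\text{so}\qquad \eta^* \;\approx\; \hat\eta - b(\hat\eta),
% which cancels the first-order bias E(\hat\eta) - \eta \approx b(\eta).
```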

SLIDE 9

Evidence of bias in MLE (and MPLE) compared to “truth”

Taken from van Duijn, et al. (2009), these boxplots show the bias of the MLE for selected parameters in two networks (“original” and “transitivity”) for the canonical parameter space. (The true parameter is shown as a horizontal line.) Note that the bias is greatest in the MLE.

SLIDE 10

Evidence of bias in MLE (and MPLE) compared to “truth”

Here we see that there is no bias in the MLE for selected parameters in two networks (“original” and “transitivity”) in the mean-value parameter space. (This holds by definition, since the mean-value MLE equals the observed statistic.)

SLIDE 11

Comparison on Lazega collaboration network

To compare our extended results with those reported for the MBLE and the ordinary MPLE and MLE in the van Duijn, et al. paper, we replicate their analysis of the corporate-lawyer collaboration data and add the bias-corrected MLE (pMLE).

SLIDE 12

Lazega collaboration network

The Lazega collaboration data record collaborations in the late 1980s between 36 New England lawyers, determined by their responses to the question “With which members of your firm have you spent time together on at least one case, have you been assigned to the same case, have they read or used your work product, or have you read or used their work product?” Additional member attributes collected include the attorneys’ gender, age, status (the firm comprises 36 partners and 35 associates; the collaboration network studied here is among the 36 partners), seniority, years with the firm, practice (litigation or corporate), office location (Boston, Hartford, or Providence), and law school attended (Yale or Harvard, University of Connecticut, or other).

SLIDE 13

Following van Duijn, et al., we simulate networks based on a “truth” for the following model:

  Model term                       “True” parameter value
  edges                            −6.506
  GWESP                             0.897
  seniority (nodal covariate)       0.853
  practice (nodal covariate)        0.410
  practice (homophily effect)       0.759
  gender (homophily effect)         0.702
  office (homophily effect)         1.145

SLIDE 14

Preliminary results:

Results based on very few simulations show no improvement over the ordinary MLE yet...

Figure: Distribution of the MLE, pMLE, MPLE, and MBLE estimates of the GWESP and nodal-practice canonical parameters; true parameter shown as a horizontal line. [Boxplots omitted.]

SLIDE 15

Preliminary results:

Here you can see that the number of sub-simulations used to calculate the mean-value parameters is clearly insufficient, since the mean of the uncorrected MLE should be unbiased in this parameterization...

Figure: Distribution of the MLE, pMLE, MPLE, and MBLE estimates of the GWESP and nodal-practice mean-value parameters; true parameter shown as a horizontal line. [Boxplots omitted.]

SLIDE 16

Current extensions:

  • increasing the number of simulations for the current network
  • applying the same analysis to the “increased transitivity” version of the collaboration network used in van Duijn, et al.
  • applying the same analysis to a larger biological network
  • applying the same analysis to a friendship network

SLIDE 17

A few words about Contrastive Divergence (CD)

Consider the idea of MCMC MLE: Suppose we fix η0. A bit of algebra shows that

  ℓ(η) − ℓ(η0) = (η − η0)ᵗg(yobs) − log Eη0[ exp{(η − η0)ᵗg(Y)} ].   (1)

The Law of Large Numbers suggests obtaining a sample of Y from the model using η0 as the parameter, then approximating the expectation by a sample mean. Q: How do we sample g(Y) using η0 as the parameter? A: Run MCMC infinitely long.
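The Monte Carlo step can be sketched on a toy model where exact i.i.d. sampling stands in for a long MCMC run and the exact answer is available for comparison. This is not from the slides: the edges-only model on a 10-node graph, and the values of η0, η, and g(yobs) below, are illustrative assumptions.

```python
import math
import random

random.seed(0)
m = 45                      # dyads in a 10-node undirected graph
eta0, eta = -1.0, -0.6      # fixed reference value and evaluation point
g_obs = 12                  # hypothetical observed edge count

# Under the edges-only model, edges are i.i.d. Bernoulli(p0), so g(Y) ~ Binomial(m, p0)
# and we can draw from the model exactly.
p0 = 1.0 / (1.0 + math.exp(-eta0))
draws = [sum(random.random() < p0 for _ in range(m)) for _ in range(50000)]

# LLN step: approximate log E_{eta0}[exp{(eta - eta0) g(Y)}] by a sample mean.
log_E = math.log(sum(math.exp((eta - eta0) * g) for g in draws) / len(draws))
approx = (eta - eta0) * g_obs - log_E

# Exact loglikelihood difference, available because this toy model factorizes.
exact = (eta - eta0) * g_obs - m * math.log(1 - p0 + p0 * math.exp(eta - eta0))
print(approx, exact)
```

For a real ERGM with dependence terms, the exact line is unavailable and the draws must come from an MCMC run; that is the question taken up next.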

SLIDE 18

A few words about Contrastive Divergence (CD)

Consider the idea of MCMC MLE: Suppose we fix η0. A bit of algebra shows that

  ℓ(η) − ℓ(η0) = (η − η0)ᵗg(yobs) − log Eη0[ exp{(η − η0)ᵗg(Y)} ].   (1)

The Law of Large Numbers suggests obtaining a sample of Y from the model using η0 as the parameter, then approximating the expectation by a sample mean. Q: How do we sample g(Y) using η0 as the parameter? A: Run MCMC infinitely long. But what if we only run MCMC for a single step (starting at yobs), for a randomly chosen Yij? For this Yij, we are sampling from the conditional distribution given (yobs)ᶜij.

SLIDE 19

A few words about Contrastive Divergence (CD)

To summarize: Running an infinitely long Markov chain leads to the loglikelihood. Running a 1-step Markov chain leads to the pseudolikelihood. Thus, if we alternately sample and then optimize the resulting “likelihood-like” function, we can view MLE and MPLE as the two ends of a spectrum, the “contrastive divergence” spectrum. (MLE is CD-∞ and MPLE is CD-1.)
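A single CD step can be sketched for the edges-only model (not from the slides; the 6-node network and the value of θ are hypothetical): starting from yobs, resample one randomly chosen dyad from its full conditional distribution given the rest of the graph.

```python
import itertools
import math
import random

random.seed(1)
n = 6
theta = -0.5                                  # edges-only model, hypothetical value
y_obs = {(0, 1), (1, 2), (2, 4), (3, 5)}      # hypothetical observed network
dyads = list(itertools.combinations(range(n), 2))


def cd_sample(y, k):
    """CD-k sampling: starting from y, resample k randomly chosen dyads from their
    full conditionals. k = 1 is the pseudolikelihood end of the spectrum; as
    k -> infinity the draw approaches an exact sample from the model (the MLE end)."""
    y = set(y)
    for _ in range(k):
        i, j = random.choice(dyads)
        # For the edges-only model delta(Y)_ij = 1, so
        # P(Y_ij = 1 | rest) = 1 / (1 + exp(-theta)).
        if random.random() < 1.0 / (1.0 + math.exp(-theta)):
            y.add((i, j))
        else:
            y.discard((i, j))
    return y


y1 = cd_sample(y_obs, k=1)
print(sorted(y1))
```

By construction a CD-1 draw can differ from yobs in at most one dyad, which is exactly why the 1-step chain reproduces the conditional (pseudolikelihood) distribution rather than the joint one.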

SLIDE 20

A few words about Contrastive Divergence (CD)

Considering CD-1. . . Q: Is it better to

  1 Repeatedly pick i ≠ j at random, or
  2 Cycle through all possible i ≠ j in some systematic fashion?

A: The latter. The reason boils down to the following well-known identity for any two random variables Y and Z:

  Var(Y) = Var[E(Y | Z)] + E[Var(Y | Z)].

Here, “Y” is the likelihood-like quantity based on the randomly sampled networks and “Z” is the selected pair i ≠ j.
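The identity, and the variance reduction from removing the randomness in Z, can be checked numerically. This sketch is not from the slides: Z is a uniformly chosen stratum index standing in for the selected dyad, and Y = E(Y | Z) + Gaussian noise with hypothetical stratum means.

```python
import random
import statistics

random.seed(2)
means = [0.0, 2.0, 10.0]        # E(Y | Z = z) for three equally likely strata


def draw_y(z):
    """One draw of Y given Z = z, with within-stratum variance 1."""
    return means[z] + random.gauss(0.0, 1.0)


def mean_random(n):
    """Average of n draws, picking Z at random each time (both variance terms present)."""
    return sum(draw_y(random.randrange(3)) for _ in range(n)) / n


def mean_cycled(n):
    """Average of n draws, cycling through Z systematically: the Var[E(Y|Z)]
    component is removed from the variance of the average."""
    return sum(draw_y(i % 3) for i in range(n)) / n


reps = 2000
var_of_mean_random = statistics.pvariance([mean_random(30) for _ in range(reps)])
var_of_mean_cycled = statistics.pvariance([mean_cycled(30) for _ in range(reps)])
print(var_of_mean_random, var_of_mean_cycled)
```

With these stratum means, Var[E(Y | Z)] is much larger than E[Var(Y | Z)] = 1, so cycling shrinks the variance of the average by roughly a factor of twenty; the same logic favors cycling through all dyads i ≠ j in CD.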
