SLIDE 1
The BLP Method of Demand Curve Estimation in Industrial Organization
27 February 2006 Eric Rasmusen
SLIDE 2 IDEAS USED
1. Instrumental variables. We use instruments to correct for the endogeneity of prices, the classic problem in estimating supply and demand.

2. Product characteristics. We look at the effect of characteristics on demand, and then build up to products that have particular levels of the characteristics. Going from 50 products to 6 characteristics drastically reduces the number of parameters to be estimated.

3. Consumer and product characteristics interact. This is what is going on when consumer marginal utilities are allowed to depend on consumer characteristics. This makes the pattern of consumer purchases substituting from one good to another more sensible.

4. Structural estimation. We do not just look at conditional correlations of relevant variables with a disturbance term tacked on to account for the imperfect fit of the regression equation. Instead, we start with a model in which individuals maximize their payoffs by choice of actions, and the model includes the disturbance term which will later show up in the regression.

5. The contraction mapping. A contraction mapping is used to estimate the parameters that are averaged across consumers, an otherwise difficult optimization problem.

6. The method of moments. The generalized method of moments is used to estimate the other parameters.
SLIDE 3
Suppose we are trying to estimate a market demand curve. We have available t = 1, ..., 20 months of data on a good, data consisting of the quantity sold, q_t, and the price, p_t. Our theory is that demand is linear, with this equation:

q_t(p_t) = α − βp_t + ε_t.   (1)

Let's start with an industry subject to price controls. A mad dictator sets the price each month, changing it for entirely whimsical reasons.

Figure 1: Supply and Demand with Price Controls
SLIDE 4 Next, suppose we do not have price controls. Instead, we have a situation of normal supply and demand.

Figure 2: Supply and Demand without Price Controls

The solution to the paradox is shown in Figure 2c: OLS estimates the supply curve, not the demand curve. This is what Working (1927) pointed out.

It could be that the unobservable variables ε_t are what are shifting the demand curve in Figure 2. Or, it could be that it is some observable variable that we have left out of our theory. So perhaps we should add income, y_t:

q_t(p_t) = α − βp_t + γy_t + ε_t.   (2)

Note, however, that if the supply curve never shifted, we still wouldn't be able to estimate the effect of price on quantity demanded. We need some observed variable that shifts supply.
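This simultaneity problem is easy to see in a small simulation (my own illustration, not from the slides): when only the demand disturbance shifts and supply is stable, regressing quantity on price recovers the supply slope, not the demand slope.

```python
import random

random.seed(0)
# demand: q = a - b*p + eps_d;  supply: q = c + d*p (no supply shocks)
a, b = 100.0, 2.0
c, d = 10.0, 1.0
prices, quantities = [], []
for _ in range(10_000):
    eps_d = random.gauss(0, 5)        # only the demand curve shifts
    p = (a - c + eps_d) / (b + d)     # market-clearing price
    q = c + d * p                     # the equilibrium lies on the supply curve
    prices.append(p)
    quantities.append(q)
pbar = sum(prices) / len(prices)
qbar = sum(quantities) / len(quantities)
slope = (sum((p - pbar) * (q - qbar) for p, q in zip(prices, quantities))
         / sum((p - pbar) ** 2 for p in prices))
# OLS of q on p recovers the supply slope d = +1, not the demand slope -b = -2
```

The demand shocks trace out the supply curve, so the fitted slope is +1, nowhere near the true demand slope of −2.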
SLIDE 5
What if we could get data on individual i's purchases of good j?

q_it(p_t) = α_i − β_i p_t + γ_i y_it + ε_it.   (3)

From the point of view of any one small consumer, the supply curve is flat.

Figure 3: The Individual's Demand Curve

That is enough, if we are willing to simplify our theory to assume that all consumers have identical demand functions:

q_it(p_t) = α − βp_t + γy_it + ε_it.   (4)

Consumers still differ in income y_it and unobserved influences, ε_it, but the effect of an increase in price on quantity is the same for all consumers. If we are willing to accept that, however, we can estimate the demand curve for one consumer, and use that. Or, we can use data on n different consumers, to get more variance in income, and estimate β̂ that way.
SLIDE 6 Two Alternatives to Identical Parameters across Consumers

(1) Use the demand of one or n consumers anyway, arriving at the same estimate β̂ that we would if consumers were identical. The interpretation would be different, though: it would be that we have estimated the average value of β, and interpreting the standard errors would be harder, since they would be affected by the amount of heterogeneity in β_i as well as in ε_it. (The estimates would be unbiased, though, unlike the estimates I criticize in Rasmusen (1998a,b), since p_t is not under the control of the consumer.)

(2) Estimate the β_i for all n consumers and then average the n estimates to get a market β, as opposed to running one big regression. One situation in which this would clearly be the best approach is if we had reason to believe that the n consumers in our sample were not representative of the entire population. In that case, running one regression on all the data would result in a biased estimate, as a simple consequence of starting with a biased sample. Instead, we could run n separate regressions, and then compute a weighted average of the estimates, weighting each type of consumer by how common his type was in the population.
SLIDE 7
Freedom from the endogeneity problem is deceptive. Changes in individuals' demand are unlikely to be statistically independent of each other. When the unobservable variable ε_{900000,t} for consumer i = 900,000 is unusually negative, so too in all likelihood is the unobservable variable ε_{899999,t} for consumer i = 899,999. Thus, they will both reduce their purchases at the same time, which will move the equilibrium to a new point on the supply curve, reducing the market price. Price is endogenous for the purposes of estimation, even though it is exogenous from the point of view of any one consumer.
SLIDE 8 The identification problem remains even with price controls.

We truly need a mad dictator, not just price controls. Suppose we have a political process, either a democratic one, or a sane dictator who is making decisions with an eye to everything in the economy and public opinion too. When is politics going to result in a higher regulated price? Probably when demand is stronger and quantity is greater. If the supply curve would naturally slope up, both buyers and sellers will complain more if the demand curve shifts out and the regulated price does not change. Thus, even the regulated price will have a supply curve.

One thing about structural methods is that they force us to think more about the econometric problems. Your structural model will say what the demand disturbance is: unobserved variables that influence demand. If you must build a maximizing model of where regulated prices come from, you will realize that they might depend on those unobserved variables.
SLIDE 9
III. The Structural Approach
Suppose we are trying to estimate a demand elasticity: how quantity demanded responds to price. We have observations from 20 months of cereal market data, the same 50 brands of cereal for each month, which makes a total of 1,000 data points. We also have data on 6 characteristics of each cereal brand, and we have demographic data on how 4 consumer characteristics are distributed in the population in each month.

Each type of consumer decides which brand of cereal to buy, buying either one or zero units. We do not observe individual decisions, but we will model them so we can aggregate them to obtain the cereal brand market shares that we do observe. Let there be I = 400 consumers.

At this point, we could decide to estimate the elasticity of demand for each product and all the cross-elasticities directly, but with 50 products that would require estimating 2,500 numbers. Instead, we will focus on the 6 product characteristics. There would only be 6 even if there were 500 brands of cereal instead of 50.
SLIDE 10
The Consumer Decision

The utility of consumer i if he were to buy brand j in month t is given by the following equation, denoted equation (1N) because it is equation (1) in Nevo (2001):

u_ijt = α_i(y_i − p_jt) + x_jt β_i + ξ_jt + ε_ijt,  i = 1, ..., 400, j = 1, ..., 50, t = 1, ..., 20,   (1N)

where
y_i is the income of consumer i (which is unobserved and which we will assume does not vary across time),
p_jt is the observed price of product j in month t,
x_jt is a 6-dimensional vector of observed characteristics of product j in month t,
ξ_jt (the letter "xi") is a disturbance scalar summarizing unobserved characteristics of product j in month t, and
ε_ijt is the usual unobserved disturbance with mean zero.

The parameters to be estimated are:
(1) Consumer i's marginal utility of income, α_i, and
(2) Consumer i's marginal utility of brand characteristics, the 6-vector β_i.
SLIDE 11 The Outside Good, and Indirect vs. Direct Utility Functions

Consumer i also has the choice to not buy any product at all. We will model this outside option as buying "product 0" and normalize by setting the j = 0 parameters equal to zero (or, if you like, by assuming that it has a zero price and zero values of the characteristics):

u_i0t ≡ α_i y_i + ε_i0t   (5)

Equation (1N) is an indirect utility function, depending on income y_i and price p_jt as well as on the real variables x_jt, ξ_jt, and ε_ijt. It is easily derived from a quasilinear utility function, however, in which the consumer's utility is the utility from his consumption of one (or zero) of the 50 products, plus utility which is linear in his consumption of the outside good. Quasilinear utility is not concave in income, so it lacks income effects, but if those are important, they can be modelled by indirect utility that is a function not of (y_i − p_jt) but of some concave function such as log(y_i − p_jt), as in BLP (1995). Our utility function was:

u_ijt = α_i(y_i − p_jt) + x_jt β_i + ξ_jt + ε_ijt,  i = 1, ..., 400, j = 1, ..., 50, t = 1, ..., 20.   (1N)
SLIDE 12 We will assume that ε_ijt follows the Type I extreme-value distribution, which with location parameter zero and scale parameter one has the density and cumulative distribution

f(x) = e^{−x} e^{−e^{−x}},  F(x) = e^{−e^{−x}}.   (6)

This is the limiting distribution of the maximum value of a series of draws of independent identically distributed random variables.

Figure 4: The Type I Extreme-Value Distribution (from the Engineering Statistics Handbook)
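The distribution in (6) is easy to sample from by inverting the CDF (a sketch of my own, not from the slides). One caution: the standard Type I extreme-value distribution has location zero but its mean is Euler's constant, about 0.5772, not zero.

```python
import math
import random

random.seed(1)
# inverse-CDF sampling: if u ~ Uniform(0,1), then x = -ln(-ln(u))
# has CDF F(x) = exp(-exp(-x))
draws = [-math.log(-math.log(random.random())) for _ in range(200_000)]
mean = sum(draws) / len(draws)
# the sample mean is close to the Euler-Mascheroni constant, ~0.5772
```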
SLIDE 13
Simple Logit

One way to proceed would be to assume that all consumers are identical in their taste parameters, i.e., that α_i = α and β_i = β, and that the ε_ijt disturbances are uncorrelated across i's:

u_ijt = α(y_i − p_jt) + x_jt β + ξ_jt + ε_ijt,  i = 1, ..., 400, j = 1, ..., 50, t = 1, ..., 20.   (5N)

Since ε_ijt follows the Type I extreme-value distribution by assumption, it turns out that the market share of product j under our utility function is

s_jt = e^{x_jt β − αp_jt + ξ_jt} / (1 + Σ_{k=1}^{50} e^{x_kt β − αp_kt + ξ_kt})   (6N)

Equation (6N) is by no means obvious. The market share of product j is the probability that j has the highest utility, which occurs if ε_jt is high enough relative to the other disturbances. The probability that product 1 has a higher utility than the other 49 products and the outside good (which has a utility normalized to zero) is thus

Prob(α(y − p_1t) + x_1t β + ξ_1t + ε_1t > α(y − p_2t) + x_2t β + ξ_2t + ε_2t) ∗
Prob(α(y − p_1t) + x_1t β + ξ_1t + ε_1t > α(y − p_3t) + x_3t β + ξ_3t + ε_3t) ∗ · · · ∗
Prob(α(y − p_1t) + x_1t β + ξ_1t + ε_1t > α(y − p_50t) + x_50t β + ξ_50t + ε_50t) ∗
Prob(α(y − p_1t) + x_1t β + ξ_1t + ε_1t > αy + ε_0t)   (7)

Substituting the Type I extreme-value distribution into equation (7) and solving this out yields, after much algebra, equation (6N). Since αy appears on both sides of each inequality, it drops out. The 1 in equation (6N) is there because of the outside good, with its 0 utility, since e^0 = 1.
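Equation (6N) is cheap to compute. A minimal sketch with 3 brands and a single characteristic (all numbers hypothetical):

```python
import math

alpha, beta = 1.0, 0.5
# three brands, one characteristic; all numbers hypothetical
x = [1.0, 2.0, 0.5]     # characteristic x_j
p = [1.0, 1.5, 0.8]     # price p_j
xi = [0.1, -0.2, 0.0]   # unobserved quality xi_j
expu = [math.exp(x[j] * beta - alpha * p[j] + xi[j]) for j in range(3)]
denom = 1.0 + sum(expu)             # the 1 is the outside good, e^0
shares = [e / denom for e in expu]
s0 = 1.0 / denom                    # outside-good share
# the inside shares plus the outside share sum to 1
```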
SLIDE 14 Elasticities of Demand in Simple Multinomial Logit

We need ∂s_jt/∂p_kt for brands k = 1, ..., 50. Define M_j as

M_j ≡ e^{x_jt β − αp_jt + ξ_jt},   (8)

so the share of a brand is

s_jt = M_j / (1 + Σ_{k=1}^{50} M_k).   (9)

Then, by the quotient rule,

∂s_jt/∂p_kt = (∂M_j/∂p_kt) / (1 + Σ_{k=1}^{50} M_k) − [M_j / (1 + Σ_{k=1}^{50} M_k)²] ∂M_k/∂p_kt   (10)

First, suppose k ≠ j. Then ∂M_j/∂p_kt = 0 and ∂M_k/∂p_kt = −αM_k, so

∂s_jt/∂p_kt = α [M_j / (1 + Σ_{k=1}^{50} M_k)] [M_k / (1 + Σ_{k=1}^{50} M_k)] = α s_jt s_kt   (11)

Second, suppose k = j. Then

∂s_jt/∂p_jt = −αM_j / (1 + Σ_{k=1}^{50} M_k) + αM_j² / (1 + Σ_{k=1}^{50} M_k)² = −α s_jt + α s_jt² = −α s_jt (1 − s_jt)   (12)
SLIDE 15 We just found:

∂s_jt/∂p_kt = α s_jt s_kt,  ∂s_jt/∂p_jt = −α s_jt (1 − s_jt)

We can now calculate the elasticity of the market share: the percentage change in the market share of good j when the price of good k goes up:

η_jkt ≡ %Δs_jt / %Δp_kt = (∂s_jt/∂p_kt) · (p_kt / s_jt) = −αp_jt(1 − s_jt) if j = k, and αp_kt s_kt otherwise.   (13)
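The formulas in (13) can be checked numerically against finite-difference derivatives of the share function (my own check; the δ values below, standing in for x_jt β + ξ_jt, are hypothetical):

```python
import math

alpha = 1.0

def shares(p, delta):
    # plain-logit shares; delta_j stands in for x_j*beta + xi_j
    e = [math.exp(d - alpha * pj) for d, pj in zip(delta, p)]
    den = 1.0 + sum(e)
    return [ei / den for ei in e]

p = [1.0, 1.5, 0.8]
delta = [0.6, 0.8, 0.25]
s = shares(p, delta)

own = -alpha * p[0] * (1 - s[0])   # own-price elasticity of good 0
cross = alpha * p[1] * s[1]        # elasticity of s_0 w.r.t. p_1

h = 1e-6                           # finite-difference check
p_own = p[:]; p_own[0] += h
own_num = (shares(p_own, delta)[0] - s[0]) / h * p[0] / s[0]
p_cr = p[:]; p_cr[1] += h
cross_num = (shares(p_cr, delta)[0] - s[0]) / h * p[1] / s[0]
```

The analytic and numerical elasticities agree, with the own-price elasticity negative and the cross-price elasticity positive.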
SLIDE 16 The elasticities we just found are:

η_jkt ≡ %Δs_jt / %Δp_kt = (∂s_jt/∂p_kt) · (p_kt / s_jt) = −αp_jt(1 − s_jt) if j = k, and αp_kt s_kt otherwise.

1. If market shares are small, as is frequently the case, then own-price elasticities are close to −αp_jt. This says that if the price is lower, demand is less elastic, less responsive to price, which in turn implies that the seller will charge a higher markup on goods with low marginal cost.

2. The cross-price elasticity of good j with respect to the price of good k only depends on features of good k: its price and market share. If good k raises its price, it loses customers equally to each other brand (the red-bus/blue-bus problem).

Another way to proceed is to use nested logit.
SLIDE 17 The Random Coefficients Logit Model

(α_i, β_i)′ = (α, β)′ + Π D_i + Σ ν_i,  ν_i = (ν_iα, ν_iβ),

where D_i is a 4 × 1 vector of consumer i's observable characteristics; ν_i is a 7 × 1 vector of the effect of consumer i's unobservable characteristics on his α_i and β_i parameters; Π is a 7 × 4 matrix of how the 7 parameters (α_i and the 6 elements of β_i) depend on consumer observables; Σ is a 7 × 7 matrix of how those 7 parameters depend on the unobservables; and (ν_iα, ν_iβ), (Π_α, Π_β) and (Σ_α, Σ_β) just split each vector or matrix into two parts.

We will denote the distributions of D and ν by P*_D(D) and P*_ν(ν). Since we'll be estimating the distribution of the consumer characteristics D, you will see the notation P̂*_D(D) show up too. We will assume that P*_ν(ν) is multivariate normal.
SLIDE 18
Utility in the Random Coefficients Logit Model

Equation (1N) becomes

u_ijt = α_i(y_i − p_jt) + x_jt β_i + ξ_jt + ε_ijt
     = α_i y_i − (α + Π_α D_i + Σ_α ν_iα) p_jt + x_jt (β + Π_β D_i + Σ_β ν_iβ) + ξ_jt + ε_ijt
     = α_i y_i + δ_jt + μ_ijt + ε_ijt,  j = 1, ..., 50, t = 1, ..., 20.   (14)

What I have done above is to reorganize the terms to separate them into four parts. First, there is the utility from income, α_i y_i. This plays no part in the consumer's choice, so it will drop out. Second, there is the "mean utility", δ_jt, which is the component of utility from a consumer's choice of brand j that is the same across all consumers:

δ_jt ≡ −αp_jt + x_jt β + ξ_jt   (15)

Third and fourth, there is a heteroskedastic disturbance, μ_ijt, and a homoskedastic i.i.d. disturbance, ε_ijt.
SLIDE 19 Market Shares and Elasticities in the BLP Model

s_jt = ∫ s_ijt dP̂*_D(D) dP*_ν(ν) = ∫ [e^{δ_jt + μ_ijt} / (1 + Σ_{k=1}^{50} e^{δ_kt + μ_ikt})] dP̂*_D(D) dP*_ν(ν)   (16)

Compare this with the simple-logit share:

s_jt = e^{x_jt β − αp_jt + ξ_jt} / (1 + Σ_{k=1}^{50} e^{x_kt β − αp_kt + ξ_kt})   (6N)
SLIDE 20
One non-structural approach would have been to use ordinary least squares to estimate the following equation, including product dummies to account for the ξ_jt product fixed effects:

s_jt = x_jt β − αp_jt + ξ_jt   (17)

We can incorporate consumer characteristics by creating new variables in a vector d_t that represents the mean value of the 4 consumer characteristics in month t, and then interacting the 4 consumer variables in d_t with the 6 product variables in x_t to create a 1 × 24 variable w_t. Then we could use least squares with product dummies to estimate

s_jt = x_jt β − αp_jt + d_t θ_1 + w_t θ_w + ξ_jt   (18)
SLIDE 21 The Method of Moments

Our assumption on the population is

E[z_m′ ω(θ*)] = 0,  m = 1, . . . , M.   (8N)

The GMM estimator is

θ̂ = argmin_θ ω(θ)′ Z Φ^{−1} Z′ ω(θ),   (9N)

where Φ is a consistent estimator of E[Z′εε′Z].

We minimize the square of the sample analog because we want to minimize the magnitude, rather than generate large negative numbers by our choice, and squares are easier to deal with than absolute values (the same choice as in OLS versus minimizing absolute values of errors).

We weight by Φ because we want to make heavier use of observations that contain more independent information. This includes the serial correlation and heteroskedasticity corrections.

Next, I'll go into more detail on the Method of Moments.
SLIDE 22
The Method of Moments

The two estimation approaches most used in economics are least squares and maximum likelihood. The principle of least squares is to find an equation that fits the data in the sense of minimizing the distance between the estimated equation and the actual data. OLS uses the sum of squared errors as its measure of distance, but if we used the sum of absolute deviations we would be estimating in the same spirit. The principle of maximum likelihood is to assume that the disturbance follows a particular distribution and to choose the equation parameters to maximize the likelihood of observing the data we actually do see.

The method of moments is closer to least squares in spirit. What it does is start with how the disturbance term relates to the exogenous variables.

The name "method of moments" is misleading. The "first moment" of a distribution is the average and the "second moment" is the variance. The typical method of moments in economics uses neither; it uses a covariance condition instead. A better name would be "method of analogy", because the method of moments works by trying to match sums of observed values to theoretical equations that the modeller specifies.
SLIDE 23 Estimating a Mean, using the Method of Moments

Jeffrey Wooldridge has an excellent example of how this works in his 2001 article in The Journal of Economic Perspectives. Suppose you were trying to estimate the mean of a population, μ. The method of moments would be to use the sample analog, the sample mean: μ̂₁ = x̄.

But suppose you had additional information: that the population variance is three times the population mean, so σ² = 3μ. There would be an alternative way to use the method of moments, based on the sample variance s²: making your estimate μ̂₂ = s²/3. But these two estimates would be different, because each is computed differently from the sample data: μ̂₁ ≠ μ̂₂. Both estimates are consistent, that is, in large samples they will give reliable answers.

The generalized method of moments gives a way to produce a superior estimator as a weighted average of μ̂₁ and μ̂₂. The optimal weight depends on the variance of each estimate; the more variable estimator should be weighted less heavily. Similarly, GLS weights individual observations to generate an estimator more efficient than ordinary least squares, even though OLS is consistent: thus, the "generalized" method of moments. We get the extra efficiency because GMM uses the extra information that σ² = 3μ.

In BLP, the extra information will be extra instruments over the minimum needed for IV, the overidentifying restrictions.
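Wooldridge's example is easy to simulate (a sketch of my own, assuming normal data with σ² = 3μ; the weights use approximate sampling variances and ignore the covariance between the two estimates):

```python
import math
import random

random.seed(2)
mu_true = 2.0
sd = math.sqrt(3 * mu_true)               # impose sigma^2 = 3*mu
xs = [random.gauss(mu_true, sd) for _ in range(100_000)]
n = len(xs)
xbar = sum(xs) / n
s2 = sum((v - xbar) ** 2 for v in xs) / (n - 1)
mu1 = xbar                                # from the first-moment condition
mu2 = s2 / 3.0                            # from the condition sigma^2 = 3*mu
# weight each estimate by the inverse of its approximate sampling variance
v1 = s2 / n                               # Var(xbar)
v2 = 2 * s2 ** 2 / ((n - 1) * 9.0)        # Var(s2)/9 under normality
w1 = (1 / v1) / (1 / v1 + 1 / v2)
mu_gmm = w1 * mu1 + (1 - w1) * mu2
```

Both μ̂₁ and μ̂₂ land near the true mean of 2, and the weighted combination does too.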
SLIDE 24 Suppose we have T observations and we are trying to explain the observations on y, a T × 1 vector, in terms of M observed variables and one unobserved variable. The observed explanatory variables are x_1, ..., x_M, where each x_m is a T × 1 vector, and we can line them all up as X, a T × M matrix. The unobserved variable is the disturbance term ε, which is a T × 1 vector.

We need to assume something about the functional form and its parameters. Let's assume linearity, so

y = Xβ + ε,   (19)

where β is an M × 1 vector of parameters. The expression Xβ is (T × M)(M × 1), so it comes out to be T × 1, the same as y.
SLIDE 25 Most commonly (and practically by definition) we assume that the observed and unobserved variables are independent, which implies (but is not equivalent to)

E[x_m′ ε] = 0,  m = 1, . . . , M.   (20)

The sample analog to this is

X′ε̂ = 0,   (21)

where 0 is an M × 1 vector of zeroes. That gives us M equations (one for each x_m) for M unknowns (the M parameters β_m). The value of our estimate of the disturbance, ε̂, is

ε̂ ≡ y − Xβ̂,   (22)

where the M × 1 vector β̂ contains our M estimated parameters. Substituting (22) into (21), we get

X′(y − Xβ̂) = 0,   (23)

so

X′y = X′Xβ̂   (24)

and

β̂ = (X′X)^{−1} X′y   (25)
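The sample moment condition can be confirmed numerically: at the OLS solution, X′ε̂ = 0 holds exactly (up to floating-point error). A sketch with a constant and one regressor, hypothetical data:

```python
import random

random.seed(3)
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.5 + 2.0 * xi + random.gauss(0, 1) for xi in x]
# solve the normal equations X'X b = X'y by hand for X = [1, x]
sx = sum(x); sxx = sum(xi * xi for xi in x)
sy = sum(y); sxy = sum(xi * yi for xi, yi in zip(x, y))
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
m0 = sum(resid)                                # moment for the constant
m1 = sum(r * xi for r, xi in zip(resid, x))    # moment for x
# both sample moments are zero to floating-point precision
```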
SLIDE 26 We solved for β̂ analytically here, but that is not an essential part of the method of moments; fortunately not, because sometimes the moment condition (which was (21) here) has no exact solution. Something we could always do is to turn the problem (using (23)) into

Minimize over β̂:  X′(y − Xβ̂)   (26)

We could, that is, try searching for the M values in β̂ that minimize X′ε̂. We would give the computer an arbitrary starting value, β̂₁, and calculate the size of X′ε̂₁ using β̂₁. But what does "size" mean? X′ε̂ is an M × 1 vector, so it contains M numbers to be minimized, not just 1. What if one set of β̂ parameters yields X′ε̂ = (0, 0, 0, 50) and another yields X′ε̂ = (20, 20, 20, 20)? Which has minimized X′ε̂? We need a definition of the size of a vector. One definition you could use is to add up all the elements of the vector, in which case the two vectors would have sizes 50 and 80. Another is that you could add up the squares of the elements, in which case the sizes are 2,500 and 1,600, the opposite ranking.

Here, we avoided the problem of defining size because we could find an analytic perfect solution to the minimization problem. Any sensible definition of size would say that (0, 0, 0, 0, 0) is the smallest 5 × 1 vector possible. In other method of moments situations, you will have to confront the problem directly.
SLIDE 27 Generalized Least Squares and the Method of Moments

If disturbances are correlated across observations (serial correlation) or different observations have disturbances with different probability distributions (heteroskedasticity), OLS is consistent, but it is inefficient and its standard errors are biased estimates of the standard deviations of the disturbances, so hypothesis testing is unreliable.

There is nothing in the logic of least squares that tells us how to get to the GLS estimator which takes care of these problems. But the GLS estimator is intuitive, both for serial correlation and for heteroskedasticity: it puts less weight on less informative data.

If the disturbances in observations 1, 2, and 3 are highly correlated, then our estimator ought to weight those observations less, because they don't contain as much information as three observations with independent disturbances. If you're trying to estimate the temperature precisely, and you measure it with 100 thermometers that were manufactured to be identical, so they all have the same measurement error, you don't get as precise an estimate as if you used 100 thermometers with independent errors.

If the disturbances in observations 1 to 40 are distributed with a variance of 10 and the disturbances in observations 41 to 80 are distributed with a variance of 500, then our estimator ought to weight observations 1 to 40 more heavily. They contain less noise.

These intuitions are not the intuitions of least squares, nor of the method of moments. They are statistical intuitions, justified by either a frequentist or Bayesian approach. We can nonetheless tack them on top of OLS or the method of moments.
SLIDE 28 Implementing the GLS Idea in the Method of Moments

In the method of moments, we can alter our moment condition by inserting the inverse of the variance-covariance matrix, Φ^{−1}, a T × T matrix since it shows the covariance between any two of the T disturbances ε_t. The theoretical moment condition becomes

E[X′Φ^{−1}ε] = 0.   (27)

The sample analog is

X′Φ̂^{−1}ε̂ = 0,   (28)

or

X′Φ̂^{−1}(y − Xβ̂) = 0,   (29)

which solves to

β̂ = (X′Φ̂^{−1}X)^{−1} X′Φ̂^{−1} y.   (30)

This estimator is identical to the GLS estimator used in the least-squares approach. I didn't say how to calculate the variance-covariance estimate Φ̂^{−1}, but all we need is a consistent estimator for it, and we could get that as in GLS by iterating between calculating β̂ to get estimates for ε̂ and using those estimates to calculate Φ̂.
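A sketch of the heteroskedastic case (my own example, with the variance regimes treated as known): the diagonal-Φ moment condition reduces to weighted least squares, which is far more precise than OLS here.

```python
import random

random.seed(4)
beta = 2.0
n = 2000
x = [random.uniform(1, 5) for _ in range(n)]
sig2 = [10.0 if t < n // 2 else 500.0 for t in range(n)]   # two variance regimes
y = [beta * xi + random.gauss(0, s ** 0.5) for xi, s in zip(x, sig2)]
# the moment condition X' Phi^{-1} (y - X b) = 0 with diagonal Phi
# solves to weighted least squares: each observation weighted by 1/sigma_t^2
num = sum(xi * yi / s for xi, yi, s in zip(x, y, sig2))
den = sum(xi * xi / s for xi, s in zip(x, sig2))
b_gls = num / den
b_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```

Both estimators are consistent for β = 2, but the weighted one leans on the low-noise half of the sample and lands much closer.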
SLIDE 29 Instrumental Variables and the Method of Moments

Now suppose the explanatory variables are endogenous. What this means is that E[X′ε] does not equal zero even in our theoretical model. Some of the x's are caused by the y, or something else jointly causes them. The solution to this problem is instrumental variables.

We need to find some instruments, variables that (a) do not cause y, (b) are correlated with x, and (c) are uncorrelated with ε. For each x that is endogenous, we need to find at least one instrument. Requirement (a) says that our theory must be able to rule out the instruments from being part of the original equation we are estimating, i.e., we can't use one x_m to instrument for another x_m. The other two requirements can be interpreted as requiring us to set up a second theoretical equation, one in which x_m is explained by the instruments, among other things, though it is not a problem if our second equation omits some relevant variables, since we won't care if the parameters we estimate in it are biased.

Let's define a new matrix Z consisting of T observations on all of the x_m variables which are exogenous plus at least one instrument for each x_m that is endogenous. Thus, Z will be T × N, where N ≥ M. For the moment, let's assume that N = M, which means we have one instrument for each endogenous x_m and the system is exactly identified, not overidentified. Our system then consists of

y = Xβ + ε
X = ZΓ + ν,   (31)

where Γ is an N × M matrix of coefficients, so that we have M equations for how the N exogenous variables affect the M explanatory variables in the first equation.
SLIDE 30
Our theoretical moment equation is different now. We are not assuming that the X are all independent of ε. Instead, we are assuming that the variables in Z are all independent of ε. Thus, the theoretical equation is

E[Z′ε] = 0.   (32)

The sample analog is

Z′ε̂ = 0,   (33)

where 0 is an M × 1 vector of zeroes, or

Z′(y − Xβ̂) = 0.   (34)

As before, we can solve this. The first step is

Z′y = Z′Xβ̂   (35)

and the second step is

β̂ = (Z′X)^{−1} Z′y.   (36)

Thus, we get the standard IV estimator, using the logic of the method of moments.
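The IV moment condition can be checked by simulation (hypothetical data-generating process): OLS is inconsistent when x is correlated with ε, while the estimator in (36) recovers β.

```python
import random

random.seed(5)
beta = 2.0
n = 20_000
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)                      # instrument, independent of eps
    eps = random.gauss(0, 1)
    xi = zi + 0.8 * eps + random.gauss(0, 1)     # x endogenous: correlated with eps
    z.append(zi); x.append(xi); y.append(beta * xi + eps)
b_iv = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * xi for zi, xi in zip(z, x))
b_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
# b_iv is consistent for beta = 2; b_ols is biased upward by cov(x, eps)
```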
SLIDE 31 Overidentification, Defining Vector Size, and the Method of Moments

We assumed that N = M. What if N > M? Well, then we can't invert Z′X, which is an (N × T)(T × M) matrix: an N × M matrix. We can invert it if it is a square M × M matrix, but not otherwise, because N ≠ M.

A solution in the least squares approach is to use two-stage least squares. In the first stage, regress X on Z to calculate the fitted values

X̂ ≡ ZΓ̂   (37)

Then our estimator can be

β̂ = (X̂′X)^{−1} X̂′y,   (38)

which is fine since X̂ is a T × M matrix like X, even if Z is T × N.

GMM takes a different approach to get to the same answer. The reader will recall our earlier discussion of how GMM can still apply even if there is no analytic solution to the equations, because we can still minimize the sample analog moment condition, if we define vector size. So that is what we will now do.
SLIDE 32
Let's define the size of a T × 1 vector w as w′w, the sum of squared elements of the vector. Then the size of the M × 1 vector X′ε will be the (1 × T)(T × M)(M × T)(T × 1) matrix (a scalar, actually) ε′XX′ε, and we will choose β̂ so that

β̂ = argmin_β̂ ε̂′XX′ε̂.   (39)

Recall that earlier I said that any sensible size definition would yield the OLS estimator in the simple case, or the IV estimator for an exactly identified system where N = M, since the minimand equals zero at the solution then. We can verify that here, and find the OLS estimator analytically a different way: by using calculus to minimize the function

f(β̂) = ε̂′XX′ε̂ = (y − Xβ̂)′XX′(y − Xβ̂) = y′XX′y − β̂′X′XX′y − y′XX′Xβ̂ + β̂′X′XX′Xβ̂   (40)

We can differentiate this with respect to β̂ to get the first order condition

f′(β̂) = −2X′XX′y + 2X′XX′Xβ̂ = 0,   (41)

so X′XX′Xβ̂ = X′XX′y, in which case

β̂ = (X′X)^{−1}X′y   (42)

and we are back to OLS.
SLIDE 33
If we want to use GLS, we can weight our "size" by the inverse of the variance-covariance matrix, Φ̂^{−1}, like this:

β̂ = argmin_β̂ ε̂′XΦ̂^{−1}X′ε̂.   (43)

If we want to use instrumental variables, then our GMM estimator will also incorporate a weighting matrix. This is analogous to two-stage least squares (2SLS), where we can regress the variables in X on a greater number of variables in Z. Now the weighting matrix, the "size" definition, will start to matter. We will make one final change to it: we will use (Z′Z)^{−1}σ². To understand what this means, imagine that the different z_n vectors are uncorrelated with each other and that N = 3. Then

(Z′Z)^{−1}σ² = σ² · diag( 1/(z_1′z_1), 1/(z_2′z_2), 1/(z_3′z_3) )   (44)

The σ² makes no difference, since it applies equally to all the equations. But if z_m varies a lot, it is going to get less weight.

If we want to use GLS and instrumental variables, our GMM estimator becomes

β̂ = argmin_β̂ ε̂′ZΦ̂^{−1}Z′ε̂.   (45)

We cannot solve this analytically if N > M, so we would use a computer to search for the best solution numerically.
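When N > M there is no exact solution, so the quadratic form is minimized numerically. A sketch of my own with two instruments, one endogenous regressor, and an identity weight matrix (simpler than (Z′Z)^{−1}, but enough to show the idea):

```python
import random

random.seed(6)
beta = 2.0
n = 5000
z1, z2, x, y = [], [], [], []
for _ in range(n):
    a1, a2 = random.gauss(0, 1), random.gauss(0, 1)
    eps = random.gauss(0, 1)
    xi = a1 + 0.5 * a2 + 0.8 * eps + random.gauss(0, 1)   # endogenous regressor
    z1.append(a1); z2.append(a2); x.append(xi); y.append(beta * xi + eps)

# precompute cross-products entering the sample moments g = Z'(y - x*b)/n
S1y = sum(z * v for z, v in zip(z1, y)); S1x = sum(z * v for z, v in zip(z1, x))
S2y = sum(z * v for z, v in zip(z2, y)); S2x = sum(z * v for z, v in zip(z2, x))

def objective(b):
    g1 = (S1y - b * S1x) / n
    g2 = (S2y - b * S2x) / n
    return g1 * g1 + g2 * g2      # identity weight matrix

lo, hi = 0.0, 4.0
for _ in range(100):              # ternary search on the convex objective
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if objective(m1) < objective(m2):
        hi = m2
    else:
        lo = m1
b_gmm = (lo + hi) / 2
```

Because both instruments are valid, the minimizer of the moment objective is consistent for β = 2 even though neither moment can be set exactly to zero.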
SLIDE 34 Thus, if our assumption on the population is that

E[z_m′ ω(θ*)] = 0,  m = 1, . . . , M,   (8N)

the GMM estimator is

θ̂ = argmin_θ ω(θ)′ZΦ^{−1}Z′ω(θ),   (9N)

where Φ is a consistent estimator of E[Z′εε′Z].

We minimize the square of the sample analog because we want to minimize the magnitude, rather than generate large negative numbers by our choice, and squares are easier to deal with than absolute values (the same choice as in OLS versus minimizing absolute values of errors). We weight by Φ because we want to make heavier use of observations that contain more independent information. This includes the serial correlation and heteroskedasticity corrections.

GMM does not rely on linearity. We may not be able to find analytic solutions if the theoretical equation is nonlinear, but we can still minimize the discrepancy between the sample moment condition and the theoretical one. (The least squares approach can handle nonlinearity too, actually. You could specify a nonlinear functional form and minimize the sum of squared errors in estimating it.)
SLIDE 35 BLP PROCEDURE

(-1) Select arbitrary values for (δ, Π, Σ) as a starting point. Recall that δ from (15) is a vector of the mean utility from each of the brands, and that Π and Σ are the matrices showing how consumer characteristics and brand characteristics interact to generate utility.

(0) Draw random values for (ν_i, D_i) for i = 1, ..., ns from the distributions P*_ν(ν) and P̂*_D(D) for a sample of size ns, where the bigger you pick ns the more accurate your estimate will be.

(1) Using the starting values and the random values, and using the assumption that the ε_ijt follow the extreme-value distribution, approximate the integral for market share.

(2) Use the following contraction mapping, which, a bit surprisingly, converges. Keeping (Π, Σ) fixed at their starting points, find values of δ by the following iterative process:

δ^{h+1}_{·t} = δ^h_{·t} + (ln(S_{·t}) − ln(s_{·t})),   (12N)

where S_{·t} is the observed market share and s_{·t} is the predicted market share from step (1) that uses δ^h_{·t} as its starting point. Start with the arbitrary δ⁰ of step (-1). Thus, in step (2) we come out with values for δ.
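Step (2) can be sketched for the plain-logit special case, where μ_ijt = 0 and the predicted share has the closed form (6N). The fixed point is then known analytically (δ_j = ln S_j − ln S_0), so we can check that iteration (12N) converges to it. Observed shares below are hypothetical:

```python
import math

S = [0.2, 0.3, 0.1]        # observed inside shares; the outside good gets 0.4
s0 = 1.0 - sum(S)

def predicted(delta):
    # plain-logit market shares: the closed form (6N)
    e = [math.exp(d) for d in delta]
    den = 1.0 + sum(e)
    return [ei / den for ei in e]

delta = [0.0, 0.0, 0.0]    # arbitrary starting point, as in step (-1)
for _ in range(200):
    s = predicted(delta)
    delta = [d + math.log(Sj) - math.log(sj) for d, Sj, sj in zip(delta, S, s)]

# for plain logit the fixed point is known analytically: delta_j = ln(S_j / S_0)
analytic = [math.log(Sj / s0) for Sj in S]
```

After 200 iterations the iterates match the analytic fixed point, and the predicted shares match the observed ones.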
SLIDE 36
(2.5) Pick some starting values for (α, β).

(3) Figure out the value of the moment expression using the starting values and your δ estimate. First, define

ω_jt = δ_jt − (x_jt β − αp_jt)   (13N)

Second, figure out the value of the moment expression,

ω′ZΦ^{−1}Z′ω   (46)

You need the matrix Φ^{−1} to do this. Until step (5), just use Φ = Z′Z as a starting point.

(4) Do a minimization search, trying nearby values of (α, β, δ, Π, Σ) until the value of the moment expression is close enough to zero.

(5) Take your converged estimates and use them to compute a new ω. Use that to compute a new value for Φ, Z′ωω′Z. Then go back to step (1).
SLIDE 37 References

Berry, Steven (1994) "Estimating Discrete-Choice Models of Product Differentiation," The RAND Journal of Economics, 25: 242-262.

Berry, Steven, James Levinsohn & Ariel Pakes (1995) "Automobile Prices in Market Equilibrium," Econometrica, 63(4): 841-890 (July 1995).

Berry, Steven, James Levinsohn & Ariel Pakes (2004) "Differentiated Products Demand Systems from a Combination of Micro and Macro Data: The New Car Market," The Journal of Political Economy, 112(1): 68-105 (February 2004).

Chamberlain, G. (1987) "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions," Journal of Econometrics, 34: 305-334.

Hall, Bronwyn H. (1996) "Notes on Generalized Method of Moments Estimation," http://emlab.berkeley.edu/users/bhha March 1996 (revised February 1999).

Hall, Bronwyn H. (2005) "Computer Code for Problem Set 3 (Effects of Horizontal Merger)," http://emlab.berkeley.edu/users/bhhall/e220c/rc dc code.htm.

Hansen, L. P. (1982) "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50: 1029-1054.

Nevo, Aviv (2000) "A Practitioner's Guide to Estimation of Random-Coefficients Logit Models of Demand," Journal of Economics & Management Strategy, 9(4): 513-548 (Winter 2000).

Nevo, Aviv "Appendix to 'A Practitioner's Guide to Estimation of Random Coefficients Logit Models of Demand Estimation: The Nitty-Gritty'," http://www.faculty.econ.northwestern.edu/faculty/nevo/supplemen

Nevo, Aviv (2001) "Measuring Market Power in the Ready-to-Eat Cereal Industry," Econometrica, 69(2): 307-342 (March 2001).

Rasmusen, Eric (1998a) "Observed Choice, Estimation, and Optimism about Policy Changes," Public Choice, 97(1-2): 65-91 (October 1998).

Wooldridge, Jeffrey M. (2001) "Applications of Generalized Method of Moments Estimation," Journal of Economic Perspectives, 15(4): (Fall 2001).