Statistics, inference and ordinary least squares Frank Venmans - - PowerPoint PPT Presentation

SLIDE 1

Statistics, inference and ordinary least squares

Frank Venmans

SLIDE 2

Statistics

SLIDE 3

Conditional probability

  • Consider 2 events:
  • A: die shows 1,3 or 5 => P(A)=3/6
  • B: die shows 3 or 6 =>P(B)=2/6
  • A∩B : A and B occur: die shows 3 =>P(A&B)=1/6
  • AUB : A or B occur: die shows 1,3, 5 or 6 =>P(AorB)=4/6
  • Addition rule: P(AorB)=P(A)+P(B)-P(A&B) (~ venn diagram)
  • 𝑄 𝐡 𝐢 =

𝑄 𝐡& 𝐢 𝑄 𝐢

(~ venn diagram)

  • P(A|B): prob of event A given that B occurs=1/2
  • P(B|A): prob of event B given that A occurs=1/3
  • Bayes’ Law: 𝑄 𝐡&𝐢 = 𝑄(𝐡|𝐢)P(B)=P(B|A)P(A)
  • Event can be any set of outcomes. Example
  • A: Random draw from belgian population with income >30,000
  • B: Random draw from Belgian population with education >12 years
  • P(A|B)β‰ P(A)

[Venn diagrams: die faces 1–6 with A = {1,3,5} and B = {3,6}; second diagram: income >30,000 overlapping education >12 years]
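These counting rules can be checked directly. A minimal Python sketch (not from the slides; the sets A and B follow the die example above):

```python
# Minimal sketch of the die example: events are sets of outcomes,
# probabilities are counts over the 6 equally likely faces.
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}    # die shows 1, 3 or 5
B = {3, 6}       # die shows 3 or 6

def P(event):
    return Fraction(len(event & outcomes), len(outcomes))

# addition rule: P(A or B) = P(A) + P(B) - P(A & B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# conditional probabilities from the definition P(A|B) = P(A&B)/P(B)
P_A_given_B = P(A & B) / P(B)   # = 1/2
P_B_given_A = P(A & B) / P(A)   # = 1/3

# Bayes' law: P(A&B) = P(A|B)P(B) = P(B|A)P(A)
assert P_A_given_B * P(B) == P_B_given_A * P(A) == P(A & B)
```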

SLIDE 4

Independence

  • 2 events A and B:

𝑄 𝐡 𝐢 = 𝑄 𝐡 ⇔ 𝑄 𝐢 𝐡 = 𝑄 𝐢 ⇔ 𝐡 π‘π‘œπ‘’ 𝐢 𝑏𝑠𝑓 π‘—π‘œπ‘’π‘“π‘žπ‘“π‘œπ‘’π‘“π‘œπ‘’

  • Two variables X and Y

𝑔 𝑦|𝑧 = 𝑔 𝑦 ⇔ 𝑔 𝑧 𝑦 = 𝑔 𝑧 ⇔ x and y are independent

  • X and Y are independent if the conditional distribution of X given Y is the

same as the unconditional distribution of X.

  • Independent variables do not necessarily have a zero correlation.
  • Example: height of my sun and Indian GDP are correlated (both affected by time)
  • Dependent variables may have a zero correlation in exceptional cases.
  • Example: selection bias may compensate a causal effect (see further)
SLIDE 5

Cumulative Distribution Function (CDF), Probability Density Function (PDF)

  • Notation:
  • Random variables X,Y: ex. Yearly earnings and level of eduction
  • Discrete if earnings are multiples of 100€ and eduction in years
  • ~Continuous if earnings are expressed un eurocent and education in seconds
  • Specific values of random variables:
  • a,b or x,y
  • Cumulative Distrubtion Function:
  • probability that X is smaller than or equal to a
  • 𝐺 𝑏 = 𝑄 π‘Œ ≀ 𝑏
  • Probability Density Function
  • For discrete variables: f(a)=P(X=a)
  • For continuous variables
  • 𝑔 𝑏 = 𝑒𝐺 𝑏

𝑒𝑏

⇔ 𝐺 𝑏 = 𝑔 π‘Œ π‘’π‘Œ

𝑏 βˆ’βˆž

  • Area under the pdf =1 because 𝐺 ∞ = 1
SLIDE 6

Joint Cumulative Distribution Function

  • Assume Y Yearly earnings and X level of education
  • 𝐺 𝑦, 𝑧 = 𝑄 π‘Œ < 𝑦 &𝑍 < 𝑧
SLIDE 7

Density function

  • Joint Density Function
  • For discrete variables:𝑔 𝑦, 𝑧 = 𝑄 π‘Œ = 𝑦&𝑍 = 𝑧
  • Continuous variables: 𝑔 𝑦, 𝑧 = πœ–2𝐺 𝑦,𝑧

πœ–π‘¦πœ–π‘§

  • Marginal Denstity Function
  • Discrete variables 𝑔 𝑦 = 𝑄 π‘Œ = 𝑦 disregarding y
  • Continuous variables 𝑔 𝑦 =

𝑔 𝑦, 𝑧 𝑒𝑧

𝑧=∞ 𝑧=βˆ’βˆž

  • (red and blue line)
  • Conditional Density Function
  • Discrete variables 𝑔 𝑦|𝑧 = 𝑄 π‘Œ = 𝑦 |𝑍 = 𝑧
  • 𝑔 𝑦|𝑧 = 𝑔 𝑦,𝑧

𝑔 𝑧

  • (intersections through the joint density function)
SLIDE 8

Regression as a conditional density function

SLIDE 9

Expected value

  • Unconditional expected value
  • For a discrete random variable : 𝐹 π‘Œ = βˆ‘π‘¦π‘—π‘„ 𝑦𝑗 = 𝜈
  • For a continuous random variable : 𝐹 π‘Œ =

𝑦𝑔 𝑦 𝑒𝑦

∞ βˆ’βˆž

= 𝜈

  • Conditional expected value (in finance many expectations are conditional on the

information set at time t)

  • 𝐹 π‘Œ 𝑍 = 𝐹𝑍[π‘Œ]=βˆ‘π‘¦π‘—π‘„ 𝑦𝑗|𝑍
  • 𝐹 π‘Œ|𝑍 =

𝑦𝑔 𝑦|𝑧 𝑒𝑦

∞ βˆ’βˆž

  • Variance= 𝜏2 = 𝐹[ π‘Œ βˆ’ 𝜈 2]
  • Covariance between X and Y= πœπ‘Œ,𝑍 = 𝐹

π‘Œ βˆ’ πœˆπ‘Œ Y βˆ’ πœˆπ‘

  • Skewness= 𝐹

π‘Œβˆ’πœˆ 𝜏 3

  • Kurtosis= 𝐹

π‘Œβˆ’πœˆ 𝜏 4

SLIDE 10

Normal distribution

  • 𝑔 𝑦 =

1 𝜏 2𝜌 exp βˆ’ 1 2 π‘¦βˆ’πœˆ 𝜏 2

  • Notation π‘Œ~𝑂(𝜈, 𝜏2)
  • Skewness=0
  • Kurtosis=3
  • Jacques-Berra test for normality:

tests if skewness and kurtosis are close to 0 and 3.

  • Any linear combination of normally distributed variables

(correlated or not) is normally distributed

  • Central limit theorem: the probability distribution of a variable that is the sum of an

infinite number of independent random variables with any distribution will be normally distributed.
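A simulation sketch of the central limit theorem, summing uniform variables (the counts 30 and 20,000 are arbitrary choices):

```python
# CLT sketch: sums of independent uniform variables (a clearly non-normal
# distribution) are approximately normally distributed.
import random
from statistics import NormalDist, mean, stdev

random.seed(42)
n_terms, n_samples = 30, 20_000
sums = [sum(random.random() for _ in range(n_terms)) for _ in range(n_samples)]

# theory: the sum of n U(0,1) variables has mean n/2 and variance n/12
m, s = mean(sums), stdev(sums)
assert abs(m - n_terms / 2) < 0.05
assert abs(s - (n_terms / 12) ** 0.5) < 0.05

# compare the empirical CDF at m with the fitted normal CDF (should be ~0.5)
share_below = sum(x <= m for x in sums) / n_samples
assert abs(share_below - NormalDist(m, s).cdf(m)) < 0.02
```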

SLIDE 11

Chi square distribution

  • 𝑍 = βˆ‘

π‘Œπ‘—

2π‘₯π‘—π‘’β„Ž π‘Œπ‘—~𝑂 0,1 π‘π‘œπ‘’ π‘π‘šπ‘š π‘Œπ‘—π‘—π‘œπ‘’π‘“π‘žπ‘“π‘œπ‘’π‘“π‘œπ‘’ π‘œ 𝑗=1

follows a πœ“2distribution with n degrees of freedom.

  • 𝑍~πœ“π‘œ

2

SLIDE 12

Student t distribution

  • π‘Ž =

π‘Œ

𝑍 π‘œ

π‘₯π‘—π‘’β„Ž π‘Œ~𝑂 0,1 π‘π‘œπ‘’ 𝑍~πœ“π‘œ

2 π‘π‘œπ‘’ π‘Œ π‘—π‘œπ‘’π‘“π‘žπ‘“π‘œπ‘’π‘“π‘œπ‘’ 𝑔𝑠𝑝𝑛 𝑍

follows a student or t-distribution with n degrees of freedom

  • π‘Ž~π‘’π‘œ
  • Higher variance and kurtosis than

the standardized normal distribution

  • Converges to the normal distribution

for large n: π‘’βˆž = 𝑂 0,1

SLIDE 13

F distribution

  • Z= X/n

Y/m with X~Ο‡π‘œ

2 π‘π‘œπ‘’ 𝑍~πœ“π‘› 2 π‘π‘œπ‘’ π‘Œ π‘—π‘œπ‘’π‘“π‘žπ‘“π‘œπ‘’π‘“π‘œπ‘’ 𝑔𝑠𝑝𝑛 𝑍 follows

an F distribution with n and m degrees of freedom.

  • π‘Ž~𝐺

π‘œ,𝑛

SLIDE 14

Inference

SLIDE 15

Statistical inference

  • Try to say something about the real distribution of a random variable based
  • n a sample.
  • The real distribution corresponds to an infinitely repeated event (ex dice), the

entire population, entire set of possible β€˜states of the world’ in a future period etc.

SLIDE 16

3 types of inference

  • Point estimator:
  • Ex: sample mean, sample variance, marginal effect in a linear regression (beta), correlation…
  • Concept of repeated sampling: every sample gives another estimator πœ„

=> πœ„ will follow a prob distribution

  • Unbiased: Expected value of estimator corresponds to the real parameter 𝐹 πœ„

= πœ„

  • Consistent: The estimator can get arbitrarily close to the real parameter by increasing the sample size

plim

π‘œβ†’βˆž

πœ„ = πœ„

  • Ex: sample variance estimator 𝑑 Β² =

1 π‘œ βˆ‘ 𝑧𝑗 βˆ’ 𝑧

2

𝑗

is a biased but consistent estimator of the variance

  • Efficient estimator: 𝑀𝑏𝑠(πœ„

) is small

  • Interval estimation:
  • Ex: given the observed sample, the real mean lays between 1 and 3 with 95% probability
  • Hypothesis testing:
  • Ex: if the null hypothesis is true (𝜈 = 2), what is the probability of a random sample to have a more

extreme (less likely) outcome than the observed sample mean of 4 and sample variance of 2.
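The bias and consistency of the 1/n variance estimator mentioned above can be seen in a quick simulation (a Python sketch; the N(0, 2Β²) population is a made-up example):

```python
# The 1/n sample variance is biased (its expectation is (n-1)/n * sigma^2)
# but consistent: the bias vanishes as n grows.
import random

random.seed(0)
sigma2 = 4.0  # true variance of the N(0, 2^2) population

def s2_biased(sample):
    n = len(sample)
    ybar = sum(sample) / n
    return sum((y - ybar) ** 2 for y in sample) / n   # divides by n, not n-1

def avg_estimate(n, reps=4000):
    # average of the estimator over many repeated samples ~ its expectation
    return sum(s2_biased([random.gauss(0, 2) for _ in range(n)])
               for _ in range(reps)) / reps

small, large = avg_estimate(5), avg_estimate(200)
# n=5: expectation is (4/5)*sigma^2 = 3.2, clearly below 4
assert abs(small - 0.8 * sigma2) < 0.2
# n=200: expectation is (199/200)*sigma^2 ~ 3.98, the bias is nearly gone
assert abs(large - sigma2) < 0.1
```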

SLIDE 17

Example: Sample mean

  • Income of Belgian households: a random variable following a distribution

with mean 𝜈 and variance 𝜏² (distribution is skewed, not normal)

  • You have a sample of n individuals. You want to say something about 𝜈 and

𝜏²

  • Estimator of 𝜈: sample mean y

=

𝑧1+𝑧2+𝑧3β€¦π‘§π‘œ π‘œ

  • Estimator will be different each time you draw a different sample=>sample

mean will follow a distribution, which is different from the distribution of y.

  • Central limit theorem =>the sample mean converges to a normal

distribution even if y does not follow a normal distribution.

SLIDE 18

Sample mean: variance known

  • 𝑧

π‘π‘‘π‘‘π‘§π‘›π‘žπ‘’π‘π‘’π‘—π‘‘ ~N 𝜈, 𝜏2 π‘œ

β‡’

𝑧 βˆ’πœˆ 𝜏 π‘œ π‘π‘‘π‘‘π‘§π‘›π‘žπ‘’π‘π‘’π‘—π‘‘ ~N(0,1)

  • This allows to determine a 95%confidence interval

𝑄 βˆ’1,96 <

𝑧 βˆ’πœˆ 𝜏/ π‘œ < 1,96 = 0,95 ⇔ 𝑄 𝑧

βˆ’ 1,96

𝜏 π‘œ < 𝜈 < y

+ 1,96

𝜏 π‘œ =0,95

  • When interval includes zero we say that the sample mean is not

significantly different from zero at the 5% confidence level.
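A sketch of the known-variance interval with made-up numbers (Θ³ = 2.1, Οƒ = 1.5 and n = 100 are hypothetical, chosen only for illustration):

```python
# 95% confidence interval for the mean with known sigma, as on the slide:
# ybar +/- 1.96 * sigma / sqrt(n). The numbers are made up.
import math

ybar = 2.1       # observed sample mean (hypothetical)
sigma = 1.5      # known population standard deviation (hypothetical)
n = 100

half_width = 1.96 * sigma / math.sqrt(n)
ci = (ybar - half_width, ybar + half_width)
print(ci)  # interval that covers the true mean with 95% probability

# zero is outside the interval, so the mean is significantly
# different from zero at the 5% level
significant = not (ci[0] <= 0 <= ci[1])
assert significant
```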

SLIDE 19

Sample mean: variance unknown and y normally distributed

  • Both mean and variance will need to be estimated.
  • Estimator for variance: 𝑑 2 =

1 π‘œβˆ’1 βˆ‘

𝑧𝑗 βˆ’ 𝑧 2

π‘œ

  • If Y follows a normal distribution⇔ (π‘œβˆ’1)𝑑 2

𝜏2

= βˆ‘

π‘§π‘—βˆ’π‘§ 𝜏 2 π‘œ

~πœ“π‘œβˆ’1

2

(no proof but intuitive)

  • 𝑧

βˆ’πœˆ 𝑑 / π‘œ =

𝑧 βˆ’πœˆ 𝜏/ π‘œ 𝑑 𝜏

=

𝑧 βˆ’πœˆ 𝜏/ π‘œ π‘œβˆ’1 𝑑 2 (π‘œβˆ’1)𝜏2

~ 𝑂 0,1

πœ“π‘œβˆ’1 2 π‘œβˆ’1

= π‘’π‘œβˆ’1

  • This allows to determine a 95% confidence interval (ex. n=21)
  • 𝑄 βˆ’2,086 < 𝑧

βˆ’πœˆ

𝑑 π‘œ

< 2,086 = 0,95 ⇔ 𝑄 𝑧 βˆ’ 2,086 𝑑

π‘œ < 𝜈 < y

+ 2,086 𝑑

π‘œ =0,95

  • For large n, the t distribution converges to the normal distribution
SLIDE 20

Hypothesis testing

  • Null hypothesis 𝐼0: πœ„ = πœ„0

ex: 𝐼0: πœ„ = 0

  • One sided test 𝐼𝐡: πœ„ > πœ„0 (𝑝𝑠 πœ„ < πœ„0)

ex: 𝐼𝐡: πœ„ > 0

  • Two sided test 𝐼𝐡: πœ„ β‰  πœ„0

ex: 𝐼

𝐡: πœ„ β‰  0

  • 2 regions:
  • If observed data (test statistic) falls in rejection region =>reject H0
  • If observed data (test statistic) falls in acceptence region =>accept H0
  • Imagine you have 10 months of data and you observe a mean monthly return
  • f the stock of Apple of 0,8% and you want to test if this mean is different

from a zero return.

  • Assume the standard error of the return is observed to be 1,58%, so the standard error
  • f the mean is 1,58%

10 = 0,5%

SLIDE 21

One sided test vs 2-sided test

  • One sided test: if the real mean was zero, what would

be the probability to observe an estimator larger than 0,8%(1,58%)?

  • Standardize your outcome 𝐼0: 𝜈 = 0 β‡’

𝑧 βˆ’0 𝑑 / π‘œ ~π‘’π‘œ

  • β‡’ 𝑄 𝑍 >

𝑧 βˆ’0

𝑑 π‘œ

= 𝑄 𝑍 > 1,6 =0,05 (Pvalue given by Stata)

  • Two sided test: if the real mean was zero, what would

be the probability that the sample mean was outside the interval of [βˆ’

𝑧 𝑑 / π‘œ, 𝑧 𝑑 / π‘œ]

  • 𝐼0: 𝜈 = 0 β‡’ 1 βˆ’ 𝑄(βˆ’

𝑧 𝑑 / π‘œ < π‘Œ < 𝑧 𝑑 / π‘œ)

= 1- P(-1,6<X<1,6)=0,10 (Pvalue given by Stata)

  • Remark: n is small so assumption of normally distributed

returns is needed
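The Apple example in code, using the normal distribution as an approximation because the Python standard library has no Student t (with n = 10 the exact t p-values are somewhat larger than the rounded 0.05 and 0.10 on the slide):

```python
# Apple-return example: test statistic and p-values, with a normal
# approximation standing in for the t distribution.
import math
from statistics import NormalDist

ybar = 0.8                   # mean monthly return, in %
se = 1.58 / math.sqrt(10)    # standard error of the mean, ~0.5%
z = (ybar - 0) / se          # test statistic under H0: mu = 0

Z = NormalDist()
p_one_sided = 1 - Z.cdf(z)              # P(Z > 1.6)
p_two_sided = 2 * (1 - Z.cdf(abs(z)))   # P(|Z| > 1.6)
print(round(z, 2), round(p_one_sided, 3), round(p_two_sided, 3))
```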

SLIDE 22

Type I and type II errors

             | Do not reject Hβ‚€  | Reject Hβ‚€
Hβ‚€ true     | correct           | Type I error, Ξ± (ex 5%)
Hβ‚€ false    | Type II error, Ξ²  | correct; 1βˆ’Ξ² = power of test

  • Level of significance: probability to reject the null if the null is true
  • Power of the test: probability to reject the null if the nulle is false
  • Reduce probability of type I error =>increase probability of Type II error
  • More efficient estimator=>reduce probability of Type II error=increase power of the

test

  • Increase sample size=>reduce probability of Type II error=increase power of the test
  • General rule: go for a large sample, in small samples you may only see phenomenons

big as an elephant, that you knew allready before doing the test, all the rest has an insignificant effect.

SLIDE 23

Ordinary least squares

SLIDE 24

Regression

  • Assume we want to know the relationship between sales and advertising expenditure
  • OLS: minimize squared distance between points and regresssion line

Y=Sales X=advertising expenditure πœ—π‘— 𝑍 = 𝛽 + π›Ύπ‘Œ Slope= Ξ² Ξ± 𝛽 + π›Ύπ‘Œπ‘— 𝑍

𝑗

SLIDE 25

Population regression line vs sampling regression line

  • Estimators and regression line (orange) will be different for each sample.
  • Would you use a one-sided or two-sided t-test for beta?

[Figure: population regression line Y = Ξ± + Ξ²X (slope Ξ²) vs estimated sample line Y = Ξ±Μ‚ + Ξ²Μ‚X (slope Ξ²Μ‚); X = advertising expenditures, Y = sales]

SLIDE 26

Assumptions of OLS

  • 5 Gauss-Markov assumptions:
  • The true model is 𝑍 = 𝛽0 + 𝛾1π‘Œ1 + 𝛾2π‘Œ2 + β‹― + πœ— with E πœ— = 0 (linearity)
  • No perfect collinearity (you cannot write X1 as a linear combination of the other Xj’s)
  • Homoscedastic errors 𝐹 πœ—π‘—

2 = 𝜏²

  • Uncorrelated errors 𝐹 πœ—π‘—πœ—π‘˜ = 0
  • 𝐹 πœ—|π‘Œ1, π‘Œ2 = 0 (exogenous explainatory variables, no endogeneΓ―ty)
  • If 5 assumptions are met, OLS is Best Linear Unbiased Estimator (BLUE)
  • If the errors are normally distributed, OLS is Best Unbiased Estimator (BUE)
  • OLS with non-normal errors is still unbiased and consistent!
  • 𝛾

follows a t-distribution only if errors are normal => be prudent with interpreting confidence intervals in small samples

𝑛𝑏𝑒𝑠𝑗𝑦 π‘œπ‘π‘’π‘π‘’π‘—π‘π‘œ 𝐹 πœ—πœ—β€² = 𝜏𝐽

SLIDE 27

The math behind OLS (optional, only for those who like it)

  • Consider Matrix notation of model: 𝑍 = π‘Œπ›Ύ + πœ—
  • If there is an intercept, X contains a column of one’s
  • Minimum distance estimator

min

𝛾 βˆ‘ πœ—π‘— 2 π‘œ

= min

𝛾 πœ—β€²πœ— = min 𝛾

𝑍 βˆ’ π‘Œπ›Ύ β€² 𝑍 βˆ’ π‘Œπ›Ύ = min

𝛾

𝑍′𝑍 βˆ’ 2𝛾′ π‘Œβ€²π‘ + 𝛾′ π‘Œβ€²π‘Œπ›Ύ 𝑔𝑗𝑠𝑑𝑒 𝑒𝑓𝑠𝑗𝑀𝑏𝑒𝑗𝑀𝑓: βˆ’2π‘Œβ€²π‘ + 2π‘Œβ€²π‘Œπ›Ύ = 0 ⇔ 𝛾 = π‘Œβ€²π‘Œ βˆ’1π‘Œβ€²π‘

  • Method of moments: errors must be uncorrelated with the regressors

Xβ€²πœ— = 0 ⇔ π‘Œβ€² 𝑍 βˆ’ π‘Œπ›Ύ = 0 ⇔ 𝛾 = π‘Œβ€²π‘Œ βˆ’1π‘Œβ€²π‘

  • Maximum likelihood under normal distribution of error term
  • π‘šπ‘—π‘™π‘“π‘šπ‘—β„Žπ‘π‘π‘’ =

1 2𝜌𝜏2 exp βˆ’ πœ—π‘—

2

2𝜏2 𝑗

  • π‘€π‘π‘•π‘šπ‘—π‘™π‘“π‘šπ‘—β„Žπ‘π‘π‘’ = βˆ’

n 2 log 2𝜌𝜏2 βˆ’ 1 2𝜏2 βˆ‘ πœ—π‘—Β² π‘œ

  • Minimising the loglikelihood boils down to minimum distance estimator=>OLS is BUE
SLIDE 28

OLS inference

  • If errors are normally distributed, the estimate 𝛾

follows a student t distribution (only assymptotically the case if errors are not normally distributed)

  • If Errors are correlated or heteroscedastic, the variance of beta 𝜏

𝛾 2 can

be increased to take that into account (option β€˜robust’ in stata)

  • Stata command: regress Y X1 X2, robust
  • Default includes a constant, you can add option noconstant
  • Exogeneity of X’s cannot be tested. The problem of endogeneΓ―ty is

most important condition for causal interpretation of beta’s (see next week)

SLIDE 29

Avoid endogeneity

  • Conditions 1 and 5 imply 𝐹 𝑍 π‘Œ1, π‘Œ2 = 𝛽0 + 𝛾1π‘Œ1 + 𝛾2π‘Œ2 .
  • This allows a causal interpretation of beta’s: an increase of X1 by one

unit, all other relevant factors being equal, will have an effect 𝛾1 on Y.

  • Intuition: β€˜all other factors being equal’ implies that all factors that

drive the error term (and thus Y), are uncorrelated to the variables of interest X.

SLIDE 30

The effect of marketing Β« all else being equal Β»

[Diagram: sales is driven by marketing expenditures and by the error = all other factors: innovative company, competitors, quality of product, delivery time, business cycle]

E[Ξ΅|X] β‰  0 β‡’ cov(Ξ΅, X) β‰  0 β‡’ Ξ΅ and X are driven by common factors.

SLIDE 31

Fixed effect panel regression to avoid endogeneity

[Diagram: sales is driven by marketing expenditures, a fixed effect = all factors that are constant over time (innovative company, competitors, quality of product, delivery time), and an idiosyncratic error = all other factors that change over time (business cycle)]

SLIDE 32

Fixed effects - random effect - pooled panel

  • 3 ways of writing the same fixed effect model:
  • Y= π‘Œπ›Ύ

+ βˆ‘ 𝛿𝑗 𝐸𝑗

𝑗

+ πœ— with Di a dummy variable for company i.

  • 𝑍

𝑗𝑒 = π‘Œπ‘—π‘’π›Ύ

+ 𝛿𝑗 + πœ—it

  • 𝑍

𝑗𝑒 βˆ’ 𝑍 𝑗 = (π‘Œπ‘—π‘’βˆ’π‘Œ

𝑗)𝛾 + πœ—π‘—π‘’ βˆ’ πœ— 𝑗 =β€˜within estimator’ (obtained by subtracting the sum of eq 2 over time periods)

  • Beta measures the effect of a deviation from mean marketing expenditures on the deviation of mean sales β€˜within’ a

company i. =>The difference in mean sales between a company with high and low average marketing expenditures does not drive the estimation of beta => be careful with measurement errors and lagged effects because part of variability is filtered out.

  • If theory indicates that you may avoid a source of endogeneity and you have enough data to find significant effects =>

use fixed effects!

  • Random Effect model and Pooled panel regression assume that none of the factors that drives a company specific

effect drives any of the X’s as well: fixed effects are uncorrelated with X’s

  • Pooled panel regression is an OLS as if there was no panel structure: every observation has equal weight
  • Random effect model is a Generalized Least Squares (GLS) estimator: the observed heterogeneity and serial

correlation is used to make estimator more efficient compared to the pooled panel regression
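A sketch of the within estimator: demean per company, then apply OLS to the demeaned data. The panel below is simulated, with x deliberately correlated with the fixed effects so that pooled OLS would be biased:

```python
# 'Within' (fixed effect) estimator: subtract company means from x and y,
# then slope = Sxy / Sxx on the demeaned data. Made-up 3-company panel.
import random

random.seed(7)
beta_true = 0.5
panel = []  # rows of (company, x, y)
for i, fe in enumerate([10.0, -5.0, 3.0]):     # company fixed effects c_i
    for t in range(200):
        x = random.uniform(0, 10) + 2 * fe     # x correlated with c_i!
        y = fe + beta_true * x + random.gauss(0, 1)
        panel.append((i, x, y))

def within_beta(data):
    comps = {i for i, _, _ in data}
    xd, yd = [], []
    for c in comps:
        rows = [(x, y) for i, x, y in data if i == c]
        mx = sum(x for x, _ in rows) / len(rows)
        my = sum(y for _, y in rows) / len(rows)
        xd += [x - mx for x, _ in rows]
        yd += [y - my for _, y in rows]
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

print(round(within_beta(panel), 2))  # close to the true 0.5
```

Pooled OLS on the same data would give a slope well above 0.5, because the fixed effects drive both x and y.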

SLIDE 33

Alternative functional forms

  • If X increases by 1, Y will increase by Ξ²
  • 𝑍 = 𝛽 + π›Ύπ‘Œ + πœ—
  • 𝑒𝑍

π‘’π‘Œ = 𝛾= marginal effect

  • If X increases by 1%, Y will increase by Ξ²%
  • π‘šπ‘œπ‘ = 𝛽 + π›Ύπ‘šπ‘œπ‘Œ + πœ— ⇔ 𝑍 = π‘“π›½π‘Œπ›Ύπ‘“πœ—
  • π‘’π‘šπ‘œπ‘

π‘’π‘šπ‘œπ‘Œ = 𝛾 = 𝑒𝑍/𝑍 π‘’π‘Œ/π‘Œ = π‘“π‘šπ‘π‘‘π‘’π‘—π‘‘π‘—π‘’π‘§

  • If X increases by 1, Y will increase by Ξ²%
  • π‘šπ‘œπ‘ = 𝛽 + π›Ύπ‘Œ + πœ— ⇔ 𝑍 = π‘“π›½π‘“π›Ύπ‘Œπ‘“πœ—
  • π‘’π‘šπ‘œπ‘

π‘’π‘Œ = 𝛾 = 𝑒𝑍/𝑍 π‘’π‘Œ = 𝑕𝑝π‘₯π‘’β„Ž 𝑠𝑏𝑒𝑓 (think of X as time)

  • If X increases by 1%, Y will increase by Ξ²
  • 𝑍 = 𝛽 + π›Ύπ‘šπ‘œπ‘Œ + πœ— ⇔ 𝑓𝑍 = π‘“π›½π‘Œπ›Ύπ‘“πœ—
  • 𝑒𝑍

π‘šπ‘œπ‘Œ = 𝛾 = 𝑒𝑍

π‘’π‘Œ π‘Œ

  • Any transformation (lnX, 1/X, XΒ², XΒ³, expX) is in principle allowed
  • Transformation can be justified by theory (in most cases) or by the data (see graphs)

[Graphs: three example plots of Y against X with different curvatures, suggesting which transformation fits the data]
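A quick numeric check that Ξ² is the elasticity in the log-log form (Ξ± = 1.0 and Ξ² = 0.7 are made-up values):

```python
# Log-log sketch: if Y = exp(alpha) * X^beta, then beta is the elasticity:
# a 1% increase in X gives roughly a beta% increase in Y.
import math

alpha, beta = 1.0, 0.7          # hypothetical parameters
def Y(x):
    return math.exp(alpha) * x ** beta

x0 = 5.0
x1 = x0 * 1.01                  # X increases by 1%
pct_change_y = (Y(x1) - Y(x0)) / Y(x0) * 100
print(round(pct_change_y, 3))   # close to beta = 0.7 (in %)

# exact version: dlnY/dlnX = beta, for any x0 and x1
dln = (math.log(Y(x1)) - math.log(Y(x0))) / (math.log(x1) - math.log(x0))
assert abs(dln - beta) < 1e-9
```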

SLIDE 34

Some useful commands in Stata

  • Type help … in the command windows for the following commands:
  • summarize: summarize information about a variable or dataset
  • tabulate var1 var2 : tabulation table to explore data
  • destring : define a variable as numerical if it would be imported as β€˜string’ (text)
  • generate var1=var2+var3 : generates a new variable (also for log transformation)
  • replace var1=0 if var2==. & var3>36 :
  • logical expressions with == ;
  • dot= missing (or ∞) => if var3 is missing, it satisfies var3>36 !
  • replace var1=1 if l.var2==25 | var3==var4 : lag operator only if dataset defined as

time series or panel.

  • regress y x1 x2 : regression with many options
  • tset year : define dataset to be a time series with year name of time variable
SLIDE 35

Commands in Stata:

  • xtset i t : define a database to be a panel with i name of person or company
  • xtreg : fixed effect or random effect panel regression
  • Create an id per companyname:
  • egen id= group(companyname)
  • egen stands for β€˜extensions to generate’, operates on groups of observations.
  • generate only applies to one observation at a time
  • Calculate mean (over time) of variable income per company id:
  • by id: egen meanincome = mean(income)
  • Eliminate extreme values beyond 99th percentile of variable income:
  • Summarize income, detail
  • Replace income=. if income>`r(p99)’
  • Most commands create different macro’s (mentioned at the end of help document). You can

use them with `…’ . Since they are local macro’s, they are erased at some point (in this case until the next command that uses r to store results.