STA 103: Probability and Statistical Inference, Paul Marriott

SLIDE 1

STA 103: Probability and Statistical Inference

Paul Marriott

paul@stat.duke.edu

223C, Old Chemistry

January 7, 2004

Syllabus

  • Basic laws of probability - random events, independence and dependence, expectations, Bayes theorem.
  • Discrete and continuous random variables, density, and distribution functions. Binomial and normal models for observational data.

  • Introductions to maximum likelihood estimation and Bayesian inference.
  • One- and two-sample mean problems, simple linear regression, multiple linear regression with two explanatory variables.

  • Applications in economics, the quantitative social sciences, and the natural sciences are emphasized.

This course is recommended for students majoring in economics and the natural or computational sciences. Prerequisites: MTH 31 or equivalent.

Web-page

  • Details of the module can be found at

http://www.stat.duke.edu/~paul/STA103

  • You can download PDF copies of these lecture notes as well as details of the lab sessions.
  • The readings will also be posted.

SLIDE 2

Book, Readings and Labs

  • The course book is

Mathematical Statistics and Data Analysis by John A. Rice

  • All of the mathematical aspects of the course are taken from this book.
  • There will also be readings posted on the course web-page covering applications of the mathematical ideas.
  • There are weekly labs (in 01 Old Chem.); details are posted on the website.

Evaluation and structure

  • The mathematical aspects of the module are evaluated through four quizzes and partly in the midterm and the final. There are weekly assignments (self-evaluated), which are closely related to the quiz questions and which you are strongly advised to do.
  • There will also be shorter quizzes and assignments in class, for which short notice will be given.
  • Computational issues in the course are covered in the weekly lab sessions. Assignments given each week are to be handed in and marked.
  • Ideas and contextual issues. This stream looks at issues such as: what are randomness and probability? What is the relevance of probability to science, economics and social science? What are statistical models and what can they do? The material for this is from the readings and class discussion. It will be evaluated via essay-style questions in the midterm and final and short quizzes in the lecture times.
  • The breakdown of marks is: total of quizzes (45%), midterm (15%), labs (15%), final (25%).

Expectation for students

  • When learning mathematical material it is extremely important that you work through the recommended questions in detail yourselves.
  • “Statistics is not a spectator sport”
  • I expect approximately 6-7 hours of student effort per week outside lectures/labs.

Contact

  • Contact hours for Paul Marriott are 3:00-4:00 Tuesdays, Wednesdays and Thursdays, or by email appointment, at 223C Old Chemistry Building.
  • The Teaching Assistants for STA103 are Janine Wilcox and Tyler McCormick.
  • There is a Statistical Education and Consulting Center in 211 AB Old Chem where the TAs for this course and other TAs can be contacted.

SLIDE 3

Timetable

The timetable for lectures and labs can be found on the course website http://www.stat.duke.edu/~paul/STA103/

Mathematical Resources

This is a calculus-based statistics course. One stumbling block students often have is with purely technical mathematical issues. The following websites might be of help:

  • Rice Virtual Lab in Statistics http://www.ruf.rice.edu/~lane/rvls.html
  • Mathworld (A free service for the mathematical community provided by Wolfram Research, makers of Mathematica, with additional support from the National Science Foundation) http://mathworld.wolfram.com/

Self-diagnostic quiz

The following short quiz does not contribute to your final mark, but is a self-diagnostic to evaluate your mathematical background. Please just give the answer to each of the following.

  • 1. Sketch the curve exp(2x) for x ∈ (−1, 1).
  • 2. Differentiate x^4 with respect to x.
  • 3. Differentiate x exp(x) with respect to x.
  • 4. Integrate the function x^3 over the interval (0, 1).
  • 5. What is the mean value of the set {1, 2, −4, −2, 0}?
  • 6. What is the median value of the set {1, 2, −4, −2, 0}?
  • 7. Solve for x if x^2 − 2x + 1 = 0
  • 8. Sketch the curve for x ∈ (−1, 1)

f(x) = 1/2 if x < 0 if x ≥ 0

SLIDE 4

Some examples

To start thinking about ideas of probability and statistics, consider the following three examples, which we will return to throughout the course.

  • 1. A model for movements in the stock market.
  • 2. Testing the effectiveness of US corporate governance
  • 3. Who wins the Olympic games: Economic resources and medal totals

Modelling the stock market

The following plots show the movement of the Dow Jones index in 1970 and the histogram of daily changes.

[Figure: Dow Jones Index against Trading Day (1970), and a histogram of changes in the Dow Jones Index showing Relative Frequency against Change in price.]

SLIDE 5

Modelling the stock market

The following plots show how you might think of modelling the Dow Jones index. They compare the index with a simple random game: win +1 with probability 0.5, lose −1 with probability 0.5, played many times.

[Figure: Log Dow Jones Index, 1970 (Log Price against Day), and Plot of Coin Toss Game (Winnings in Coin Tossing against Index).]

Issues

The following issues can be discussed:

  • What does it mean to say the Dow Jones index is random?
  • In what sense are the Dow Jones index and the random game the same, in what ways different?
  • Can we use random, or statistical models to predict? If so how can we evaluate how well they predict?
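The coin toss game described on this slide can be simulated in a few lines. This is a minimal sketch, not the code used to draw the plot: the function name `coin_toss_game`, the seed, and the choice of 250 plays (picked only to match the horizontal axis of the plots) are all assumptions made for illustration.

```python
import random

def coin_toss_game(n_plays, seed=None):
    """Return the cumulative winnings after each play of the +1/-1 coin game."""
    rng = random.Random(seed)  # a private generator so runs can be repeated
    winnings = 0
    path = []
    for _ in range(n_plays):
        # Win +1 with probability 0.5, otherwise lose 1.
        winnings += 1 if rng.random() < 0.5 else -1
        path.append(winnings)
    return path

# 250 plays is an assumption, roughly one play per trading day in a year.
path = coin_toss_game(250, seed=103)
print(path[:10])
print("final winnings:", path[-1])
```

Changing the seed gives a different path each time, which is one way to see how much the shape of such a plot can vary by chance alone.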

SLIDE 6

Testing the effectiveness of US corporate governance

  • A paper by Hunter (1997), which is one of the readings, uses empirical data to test whether corporate boards are inefficiently large.

  • He uses publicly available data on firm costs of rural distributors of electricity.
  • He uses a linear regression model and statistical tests to empirically check an economically based hypothesis.

Issues

As with the previous example, the following issues can be discussed:

  • What is a regression model and when is it appropriate for testing hypotheses in Economics and in the Social Sciences?
  • What can be learned from a statistical test?
  • What is the nature of the data that we have in Economics and how does it differ from that in the natural or experimental sciences?

Predicting the winners

  • Prediction is very commonly done in Economics and in the business world, and we look at an example of prediction done using models typically used in Economics.
  • Here it is applied to the problem of predicting which countries will win medals in the Olympics.
  • The study by Bernard and Busse tried to predict, before the recent Sydney 2000 Olympics, how many medals each country would win.

SLIDE 7

Model fit

[Figure: Prediction against Reality: number of medals. Predicted medals (Fitted) plotted against Actual medals (Observed).]

Issues

  • 1. A regression model has been used; what does that mean and how should such a model be selected?
  • 2. How should you assess the predictive power of such a model?
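One possible way to put a number on predictive power is to compare fitted and observed values directly. The sketch below computes two common summaries, the root mean squared error and R squared. The slides do not say which measures the study used, so both the choice of measures and the medal counts are assumptions made up for illustration.

```python
def rmse(predicted, observed):
    """Root mean squared error: the typical size of a prediction miss."""
    n = len(predicted)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5

def r_squared(predicted, observed):
    """Share of the variability in the observed values captured by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up medal counts for five countries; NOT data from the study.
predicted = [10, 25, 40, 60, 90]
observed = [12, 20, 45, 58, 97]

print("RMSE:", round(rmse(predicted, observed), 2))
print("R squared:", round(r_squared(predicted, observed), 3))
```

A small RMSE and an R squared close to 1 correspond to points lying close to the diagonal in a plot of predicted against actual medals.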

SLIDE 8

A brief overview of Inference

In order to be able to start to understand the readings, let us take an informal look at some of the fundamental statistical ideas in this course.

  • 1. What is a statistical model?
  • 2. What is inference?
  • 3. What does statistically significant mean?
  • 4. What is a p-value?

What is a statistical model?

  • 1. It is a mathematical and probability-based tool.
  • 2. By making certain simplifying assumptions, the variability in a set of data is described using probability theory.
  • 3. A model might be used to:

(a) describe the data
(b) predict future values
(c) test whether there is evidence in the data to support or contradict an economic theory

  • 4. In Economics the majority of models are either regression models or time series models. We concentrate on regression in this course.

Example of a model

  • In the paper “Who wins the Olympic games: economic resources and medal totals” Andrew Bernard and Meghan Busse use a regression model both to explain which economic factors are primarily associated with winning Olympic medals and to predict the number of medals won in the future.
  • The study tried to predict, before the recent Sydney 2000 Olympics, how many medals each country would win at Sydney based on a selection of economic indicators.
  • The model takes explanatory variables for each country and gives out a response (i.e. the number of medals to be won).

explanatory variables → response

SLIDE 9

Selection of explanatory variables

  • 1. How large a population the country has.

This cannot be the only feature, as China, India, Indonesia and Bangladesh have more than 43% of the world’s population but have historically won only about 6% of the medals.

  • 2. Economic development.

Adding GNP into the model makes a very big improvement in the amount of variability explained. For example, up to 1996 the countries mentioned above accounted for 5% of the world’s GNP, about the same as their proportion of medals won.

  • 3. Is the country the host of the games?

This always gives rise to a big increase in the number of medals won, typically an extra 3% of the total medals.

  • 4. Is the country in the former Eastern Bloc grouping?

Selection of explanatory variables

  • The explanatory variables are then

Variable        Units                          Symbol
Medals won      Number                         M
Population      Number (millions)              P
GNP             Number of trillion dollars     GNP
Host            1 if Yes, 0 if No              Host
Eastern Bloc    1 if Yes, 0 if No              EB

  • One model might be

M = (1/10)P + (1/30)GNP + 5Host + 6EB + error

Example

  • Suppose Poland had a GNP of 15 trillion dollars and a population of 60 million. How many medals is it predicted to win by the model

M = (1/10)P + (1/30)GNP + 5Host + 6EB

  • How many for Australia, with a population of 20 million and a GNP of 20 trillion dollars?
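The two predictions can be checked by plugging the numbers into the model. The sketch below does this with a hypothetical helper `predicted_medals`. The slide does not state the Host and EB values, so the code assumes Poland is a former Eastern Bloc country that is not hosting (Host = 0, EB = 1) and that Australia is the host of the Sydney games (Host = 1, EB = 0). Different assumptions would give different answers.

```python
def predicted_medals(population_millions, gnp_trillions, host, eastern_bloc):
    """Evaluate M = P/10 + GNP/30 + 5*Host + 6*EB, ignoring the error term."""
    return (population_millions / 10
            + gnp_trillions / 30
            + 5 * host
            + 6 * eastern_bloc)

# Assumed indicator values: Poland is not the host, but is former Eastern Bloc.
poland = predicted_medals(60, 15, host=0, eastern_bloc=1)
# Assumed indicator values: Australia hosts Sydney 2000, and is not Eastern Bloc.
australia = predicted_medals(20, 20, host=1, eastern_bloc=0)

print(poland)               # 6 + 0.5 + 0 + 6 = 12.5
print(round(australia, 2))  # 2 + 0.67 + 5 + 0 = 7.67
```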

SLIDE 10

General form of model

  • More generally we might write the model as

M = β0 + β1P + β2GNP + β3Host + β4EB + error

  • The coefficients β0, . . . , β4 are called the parameters of the model.
  • How do we select their values?
  • One way is to look at past data (i.e. look at the history of the Olympics and historical values of population, GNP etc.) and use a statistically based method to estimate them.
  • Let us denote the estimated values by β̂0, β̂1, . . . , β̂4.

What is a statistical model?

  • In general a model is a way of relating some response to a set of explanatory variables
  • This model will involve some mathematical and probability functions
  • A model has a set of parameters associated with it.
  • The values of these parameters need to be estimated in some statistically sound way.
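To make the idea of estimating parameters from past data concrete, here is a minimal sketch using least squares with a single explanatory variable, so the model is M = β0 + β1·GNP + error. Least squares is only one statistically based method, and the slides do not say which method the study used. The function name `fit_line` and the data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares estimates (b0_hat, b1_hat) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: how x and y vary together, relative to how x varies on its own.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sxy / sxx
    # Intercept: chosen so the fitted line passes through the point of means.
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat

# Made-up past data: GNP (trillions) and medals won; NOT from the study.
gnp = [1.0, 2.0, 4.0, 6.0, 9.0]
medals = [5.0, 8.0, 11.0, 18.0, 25.0]

b0_hat, b1_hat = fit_line(gnp, medals)
print("estimated intercept:", round(b0_hat, 3))
print("estimated slope:", round(b1_hat, 3))
```

With several explanatory variables the same principle applies, but the estimates are usually computed with matrix methods rather than these two formulas.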

What is inference?

  • Inference is the way we learn about parameters by using data which we have observed.
  • The real value of a parameter will have a real world or economic meaning.
  • We never know the real value of the parameter (β) exactly; we can only estimate its value in a statistical way (i.e. calculate β̂).
  • We might use the estimated values of the parameters to:
  • 1. describe the data
  • 2. predict future values
  • 3. test whether there is evidence in the data to support or contradict an economic theory

SLIDE 11

Home effect?

  • For example in the Olympic model we might want to know how important being the host of the Olympics is
  • We have the model

M = β0 + β1P + β2GNP + β3Host + β4EB + error

  • The size of β3 determines the amount of difference being the host makes to the number of medals won.
  • What does it mean if
  • 1. β3 = 0?
  • 2. β3 = 10?
  • 3. β3 > 0?
  • 4. β3 < 0?

How good is the estimate?

  • We never see the exact value of the ‘home effect’ (β3), only its estimated value β̂3.
  • The estimate is not exact; it is a statistical object. It might be very precise or very rough.
  • Inference theory gives us a way of determining if β̂ is ‘close to’ β.
  • Often, rather than giving a single estimate β̂, a range of possible values (β̂−, β̂+) is given, where we are fairly certain the true value lies in this range.
  • Such a range is called a confidence interval.
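A common way to build such a range is to take the estimate plus or minus a multiple of its standard error. The sketch below assumes a normal approximation, under which a multiplier of 1.96 gives roughly 95% coverage. The multiplier, the function name `confidence_interval`, and the numbers are assumptions for illustration and do not come from the slides.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Approximate interval (lower, upper) around an estimate.

    z = 1.96 corresponds to about 95% coverage under a normal
    approximation (an assumption, not stated in the slides).
    """
    margin = z * std_error
    return estimate - margin, estimate + margin

# Hypothetical home-effect estimate and standard error, for illustration only.
beta3_hat = 5.0
std_error = 1.5

lower, upper = confidence_interval(beta3_hat, std_error)
print(round(lower, 2), round(upper, 2))
```

A smaller standard error gives a narrower interval, which is the sense in which an estimate can be "very precise or very rough".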

What does statistically significant mean?

  • Suppose that some economist claims (from his own theory) that there is no home effect in the Olympics.
  • The economist is claiming that the true value of β3 is 0
  • From the data we have an estimated value of the size of the home effect, β̂3, which is not exactly zero.
  • Inference theory will tell us if our estimated value β̂3 is far enough away from 0 that we can be confident this theory is wrong.
  • Statistically significantly different means the estimated value and the hypothesised value are far enough apart even after taking into account the possible statistical variability of β̂.
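One way to make "far enough apart" concrete is to measure the gap between the estimate and the hypothesised value in units of the estimate's standard error. The sketch below is an illustration under a normal approximation, where a gap of more than about 1.96 standard errors is a conventional cut-off. The cut-off, the function names, and the numbers are assumptions, not taken from the slides.

```python
def z_statistic(estimate, hypothesised, std_error):
    """Gap between estimate and hypothesised value, in standard errors."""
    return (estimate - hypothesised) / std_error

def significantly_different(estimate, hypothesised, std_error, threshold=1.96):
    """True if the gap exceeds the threshold.

    The default 1.96 is a conventional cut-off under a normal
    approximation (an assumption, not from the slides).
    """
    return abs(z_statistic(estimate, hypothesised, std_error)) > threshold

# Hypothetical numbers, tested against the economist's claim that beta3 = 0.
print(significantly_different(5.0, 0.0, 1.5))  # gap of about 3.3 standard errors
print(significantly_different(1.0, 0.0, 1.5))  # gap of about 0.7 standard errors
```

The same estimate can be significant or not depending on its standard error, which is why the variability of β̂ has to be taken into account.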

SLIDE 12

What is a p-value?

  • In the papers you will often see that a p-value has been calculated.
  • This is a way of making a statistically valid comparison between what a theory tells us and what the observed data have told us.

  • The p-value gives us a numerical scale to measure the amount of difference between theory and practice
  • It is one way of deciding if the theory is ‘wrong’
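As an illustration of how such a number might be obtained, the sketch below converts a standardised gap z (the distance between estimate and hypothesised value, measured in standard errors) into a two-sided p-value using the standard normal distribution. The normal approximation and the function name `two_sided_p_value` are assumptions. Real analyses often use other reference distributions, and the slides do not specify one.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, computed with math.erfc."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical standardised gaps between theory and data.
for z in (0.5, 2.0, 3.5):
    print(z, round(two_sided_p_value(z), 4))
```

Larger gaps between theory and data give smaller p-values, so small values count as evidence against the theory.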

The p-value scale

  • The p-value is a number between 0 and 1
  • If the p-value is greater than 0.05 then we say there is not strong evidence that the hypothesis is inconsistent with the data.
  • If the p-value is between 0.05 and 0.01 we say that there is reasonable evidence that the hypothesis is inconsistent with the data.
  • If the p-value is less than 0.01 then we say that there is strong evidence that the hypothesis is inconsistent with the data.

Example

An economist says his theory predicts that there is no home effect in the Olympics. He is therefore saying that β3 in our model is 0. We collect data to test his result and calculate the p-value.

  • 1. If we calculated the p-value of 0.54 we would say .....?
  • 2. If we calculated the p-value of 0.04 we would say .....?
  • 3. If we calculated the p-value of 0.0001 we would say .....?
  • 4. If we calculated the p-value of 0.96 we would say .....?
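The p-value scale can be written as a small lookup, which may help when working through questions like these. This is a sketch: the function name `evidence_against` is hypothetical, and placing the boundary values 0.05 and 0.01 in the middle category is an assumption, since the slides do not say which category they belong to. The p-values in the usage lines are made up and deliberately differ from those in the questions.

```python
def evidence_against(p_value):
    """Map a p-value to the wording of the p-value scale."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value > 0.05:
        return "not strong evidence"
    if p_value >= 0.01:
        # Assumption: the boundary values 0.05 and 0.01 fall in this category.
        return "reasonable evidence"
    return "strong evidence"

# Made-up p-values, for illustration only.
for p in (0.30, 0.03, 0.002):
    print(p, "->", evidence_against(p), "that the hypothesis is inconsistent with the data")
```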
