Workshop 4: Statistical modelling intro
Murray Logan 10 Mar 2019
Statistical modelling
What is a statistical model?
Statistical modelling
What is a statistical model? Mathematical model
[Figure: mathematical model — the deterministic line y = 2 + 1.5x plotted against x]
Statistical model
[Figure: statistical model — observations scattered around the line]

y = 2 + 1.5x + ε
Statistical modelling

y = 2 + 1.5x + ε

What is a statistical model? A statistical model relates one or more dependent (response) variables to one or more independent (predictor) variables.
Statistical modelling
A random variable is one whose values depend on a set of random events and are described by a probability distribution
Statistical modelling

y = 2 + 1.5x + ε

What is a statistical model? A statistical model describes how the data are assumed to have been generated, along with the distributional assumptions underlying this generation.
Statistical modelling
y=2+1.5x+ε
What is the purpose of statistical modelling?
Statistical models

[Figure: observations scattered around the fitted line y = 2 + 1.5x + ε]

How do we estimate the model parameters, Y ∼ β0 + β1X? What criterion do we use to assess best fit?
Statistical models

[Figure: observations scattered around the fitted line y = 2 + 1.5x + ε]

If we assume Y is drawn from a normal (Gaussian) distribution, we can estimate the parameters by least squares.
Estimation

Least squares

[Figure: sum of squares plotted against candidate parameter estimates (8–14); the curve is minimised at µ = 10]
Estimation

Least squares estimates

Y     X
3.0   0
2.5   1
6.0   2
5.5   3
9.0   4
8.6   5
12.0  6

3.0  = β0 × 1 + β1 × 0 + ε1
2.5  = β0 × 1 + β1 × 1 + ε2
6.0  = β0 × 1 + β1 × 2 + ε3
5.5  = β0 × 1 + β1 × 3 + ε4
9.0  = β0 × 1 + β1 × 4 + ε5
8.6  = β0 × 1 + β1 × 5 + ε6
12.0 = β0 × 1 + β1 × 6 + ε7
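The system of equations above can be solved by minimising the sum of squared residuals. As an illustrative sketch (Python, not part of the workshop materials), the closed-form least-squares estimates for these data are:

```python
# Least-squares estimates for the Y/X data above (an illustrative sketch).
x = [0, 1, 2, 3, 4, 5, 6]
y = [3.0, 2.5, 6.0, 5.5, 9.0, 8.6, 12.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form simple linear regression: slope = Sxy / Sxx
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = s_xy / s_xx            # slope estimate
beta0 = y_bar - beta1 * x_bar  # intercept estimate

print(beta0, beta1)  # close to the generating values 2 and 1.5
```

The estimates come out near 2.14 and 1.51, recovering the line y = 2 + 1.5x that the earlier figures were built around.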
Estimation

Least squares estimates

Least squares estimates are appropriate provided the data (and residuals) are Gaussian.
[Figure: observations scattered around the fitted line y = 2 + 1.5x + ε]
Gaussian distribution
Probability density function

[Figure: Gaussian densities for µ = 25, σ² = 5; µ = 25, σ² = 2; µ = 10, σ² = 2]

f(x | µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))
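The density above can be coded directly. A minimal sketch (Python, not part of the workshop code) showing that the density peaks at the mean with height 1/√(2πσ²):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian probability density function, as defined on the slide."""
    return (1.0 / math.sqrt(2.0 * math.pi * sigma2)) * math.exp(-(x - mu) ** 2 / (2.0 * sigma2))

# The density is maximised at x = mu and falls off symmetrically
print(gaussian_pdf(25, 25, 5))  # peak height: 1/sqrt(2*pi*5)
print(gaussian_pdf(20, 25, 5))  # lower, 5 units away from the mean
```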
Linear model assumptions
yi = β0 + β1 × xi + εi,  εi ∼ N(0, σ²)

V = cov =
⎡ σ²  0   ⋯   0  ⎤
⎢ 0   σ²  ⋯   0  ⎥
⎢ ⋮   ⋮   ⋱   ⋮  ⎥
⎣ 0   0   ⋯   σ² ⎦

Homogeneity of variance: the same σ² down the diagonal
Zero covariance: zero off-diagonals (= independence)
What do we do, if the data do not satisfy the assumptions?
Scale transformations
[Figure: frequency histograms of leaf length (cm, 10–40) on the linear scale and of log10 leaf length (0.0–2.0) on the logarithmic scale]
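The effect the histograms illustrate can be reproduced with simulated data. A sketch (Python; the lognormal sample is a stand-in for the leaf-length measurements, not the workshop's actual data) showing that a log10 transform removes the right skew:

```python
import math
import random

random.seed(1)
# Simulated right-skewed measurements (lognormal), standing in for leaf lengths
leaf = [random.lognormvariate(1.0, 0.5) for _ in range(1000)]

def skewness(data):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

print(skewness(leaf))                           # clearly positive (right-skewed)
print(skewness([math.log10(v) for v in leaf]))  # near zero after the transform
```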
Linear model
yi = β0 + β1xi + εi
Data types
Type          Example            Distribution        Range
Measurements  length, weight     Gaussian            real, −∞ < x < ∞
                                 logNormal           real, 0 < x < ∞
                                 Gamma               real, 0 < x < ∞
Counts        Abundance          Poisson             discrete, 0 ≤ x < ∞
                                 Negative Binomial   discrete, 0 ≤ x < ∞
Binary        Presence/Absence   Binomial            discrete, x = 0, 1
Proportions   Ratio              Binomial            discrete, 0 ≤ x ≤ n
Percentages   Percent cover      Binomial            real, 0 ≤ x ≤ 1
                                 Beta                real, 0 ≤ x ≤ 1

What about density?
Gamma
Zero-bound variables with large variance

Probability density function

[Figure: Gamma densities for µ = 15, σ² = 15 (a = 15, s = 1); µ = 15, σ² = 30 (a = 7.5, s = 2); µ = 15, σ² = 60 (a = 3.75, s = 4)]

f(x | s, a) = (1 / (s^a Γ(a))) x^(a−1) e^(−x/s)

a = shape, s = scale; µ = as, σ² = as²
Poisson distribution
Count data
Probability mass function

[Figure: Poisson distributions for λ = 25, λ = 15, λ = 3]

f(x | λ) = e^(−λ) λ^x / x!

µ = λ, σ² = λ: the mean and variance are equal, so the Poisson has no free dispersion parameter.
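The equal-mean-and-variance property can be checked numerically. A sketch (Python, illustrative only) that sums the pmf over its support for λ = 3:

```python
import math

def poisson_pmf(x, lam):
    """Poisson pmf from the slide: e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# For lambda = 3 the tail beyond x = 60 is negligible, so a truncated sum suffices
lam = 3
mean = sum(x * poisson_pmf(x, lam) for x in range(60))
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in range(60))
print(mean, var)  # both approximately 3
```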
Negative Binomial
Count data
[Figure: probability mass functions; left panel: µ = 25, 15, 3 with ω = Inf (θ = 0), where the negative binomial reduces to the Poisson; right panel: µ = 15 with ω = 7.5 (θ = 0.133; σ² = 3µ), ω = 3 (θ = 0.333; σ² = 6µ), and ω = 1.667 (θ = 0.6; σ² = 10µ)]

f(x | µ, ω) = (Γ(x + ω) / (Γ(ω) x!)) × (µ^x ω^ω / (µ + ω)^(x + ω))

ω controls the dispersion: σ² = µ + µ²/ω, so smaller ω means more overdispersion, and as ω → ∞ the Poisson is recovered.
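The dispersion relationship in the figure legends (e.g. σ² = 6µ when µ = 15, ω = 3) can be verified from the pmf. A sketch (Python, illustrative only), computing the pmf on the log scale for numerical stability:

```python
import math

def nb_pmf(x, mu, omega):
    """Negative binomial pmf from the slide, evaluated via log-gamma."""
    log_p = (math.lgamma(x + omega) - math.lgamma(omega) - math.lgamma(x + 1)
             + x * math.log(mu / (mu + omega))
             + omega * math.log(omega / (mu + omega)))
    return math.exp(log_p)

# mu = 15, omega = 3: variance should be mu + mu^2/omega = 90, i.e. 6*mu
mu, omega = 15, 3
mean = sum(x * nb_pmf(x, mu, omega) for x in range(500))
var = sum((x - mean) ** 2 * nb_pmf(x, mu, omega) for x in range(500))
print(mean, var)  # approximately 15 and 90
```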
Binomial distribution
Proportions or Presence/absence

f(x | n, p) = C(n, x) p^x (1 − p)^(n−x)

where C(n, x) is the binomial coefficient n! / (x!(n − x)!)
µ = np, σ2 = np(1 − p)
for presence/absence n = 1
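The moment formulas and the n = 1 presence/absence case can be checked directly. A sketch (Python, illustrative only):

```python
import math

def binomial_pmf(x, n, p):
    """Binomial pmf: C(n, x) * p^x * (1-p)^(n-x)"""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Check mu = n*p and sigma^2 = n*p*(1-p) for n = 10, p = 0.3
n, p = 10, 0.3
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
print(mean, var)  # 3.0 and 2.1

# Presence/absence is the n = 1 (Bernoulli) case: P(present) = p
print(binomial_pmf(1, 1, 0.3))  # 0.3
```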
Beta
Continuous between 0 and 1
Probability density function

[Figure: Beta densities for µ = 0.5, σ² = 0.023 (a = 5, b = 5); µ = 0.167, σ² = 0.019 (a = 1, b = 5); µ = 0.833, σ² = 0.019 (a = 5, b = 1); µ = 0.5, σ² = 0.125 (a = 0.5, b = 0.5)]

f(x | a, b) = (Γ(a + b) / (Γ(a)Γ(b))) x^(a−1) (1 − x)^(b−1)

µ = a / (a + b),  σ² = ab / ((a + b)² (a + b + 1))
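The moment formulas can be confirmed by numerical integration of the density. A sketch (Python, illustrative only), using a midpoint rule over (0, 1) for the a = b = 5 case from the figure:

```python
import math

def beta_pdf(x, a, b):
    """Beta pdf from the slide, with the normalising constant via log-gamma."""
    const = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Midpoint-rule integration on a fine grid over (0, 1)
a, b = 5, 5
n = 20000
xs = [(i + 0.5) / n for i in range(n)]
mean = sum(x * beta_pdf(x, a, b) for x in xs) / n
var = sum((x - 0.5) ** 2 * beta_pdf(x, a, b) for x in xs) / n
print(mean, var)  # a/(a+b) = 0.5 and ab/((a+b)^2 (a+b+1)) ≈ 0.023, as in the figure
```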
Generalized linear models
Y = β0 + β1x1 + ... + βpxp + ε

becomes

Y ∼ Dist(µ, ...)
g(µ) = β0 + β1x1 + ... + βpxp
Generalized linear models
Linear model is just a special case:

Y = β0 + β1x1 + ... + βpxp + ε

Y ∼ N(µ, σ²)
I(µ) = β0 + β1x1 + ... + βpxp

where I is the identity link.
Generalized linear models
Response variable        Probability distribution  Canonical link function                      Model name
Continuous measurements  Gaussian                  identity: µ                                  Linear regression
                         Gamma                     inverse: 1/µ                                 Gamma regression
Counts                   Poisson                   log: log(µ)                                  Poisson regression / log-linear model
                         Negative binomial         log: log(µ)                                  Negative binomial regression
                         Quasi-poisson             log: log(µ)                                  Poisson regression
Binary, proportions      Binomial                  logit: log(π / (1 − π))                      Logistic regression
                                                   probit: (1/√(2π)) ∫₋∞^(α+βX) exp(−Z²/2) dZ   Probit regression
                                                   complementary log-log: log(−log(1 − π))      Logistic regression
                         Quasi-binomial            logit: log(π / (1 − π))                      Logistic regression
Percentages              Beta                      logit: log(π / (1 − π))                      Beta regression
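The logit link that dominates the table can be sketched in a few lines (Python, illustrative only): the link maps probabilities onto the whole real line, and its inverse maps the linear predictor back to (0, 1).

```python
import math

def logit(p):
    """Canonical binomial link: maps (0, 1) onto the real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit (logistic function): maps the linear predictor back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Link and inverse link compose to the identity, so fitted values on the
# linear-predictor scale always back-transform to valid probabilities.
print(inv_logit(logit(0.3)))                       # 0.3
print(inv_logit(-2), inv_logit(0), inv_logit(2))   # always within (0, 1)
```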
OLS
[Figure: sum of squares plotted against candidate parameter estimates (8–14); the curve is minimised at µ = 10]
Maximum Likelihood
f(x | µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))

lnL(µ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1 / (2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²

Maximum likelihood estimates:

µ̂ = x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²
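The closed-form estimates above really do maximise the log-likelihood. A sketch (Python, illustrative only, reusing the Y values from the least-squares example) checking that any perturbation of the MLEs lowers the log-likelihood:

```python
import math

# Gaussian maximum likelihood estimates for a small sample
y = [3.0, 2.5, 6.0, 5.5, 9.0, 8.6, 12.0]
n = len(y)

mu_hat = sum(y) / n                                 # MLE of mu = sample mean
sigma2_hat = sum((v - mu_hat) ** 2 for v in y) / n  # MLE of sigma^2 (divisor n, not n-1)

def log_lik(mu, sigma2):
    """Gaussian log-likelihood from the slide."""
    return (-n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2)
            - sum((v - mu) ** 2 for v in y) / (2 * sigma2))

# Perturbing either parameter away from its MLE lowers the log-likelihood
print(log_lik(mu_hat, sigma2_hat) > log_lik(mu_hat + 1, sigma2_hat))  # True
print(log_lik(mu_hat, sigma2_hat) > log_lik(mu_hat, sigma2_hat * 2))  # True
```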
Maximum Likelihood
[Figure: log-likelihood plotted against candidate parameter estimates (8–14); the curve is maximised at µ = 10]