Statistical methods in bioinformatics


SLIDE 1

University of Copenhagen, March 31st, 2020

Faculty of Health Sciences

Statistical methods in bioinformatics

Brief introduction, statistical models, dimension reductions. Claus Thorn Ekstrøm

Biostatistics, University of Copenhagen E-mail: ekstrom@sund.ku.dk

Slide 1/56

SLIDE 2

Today’s programme

Introduction to statistical methods for high-dimensional data, linear models, dimension reduction and regularization methods.

1 Brief overview of molecular data
2 Big-p, small-n problems
3 Multiple testing techniques (inference correction, false discovery rates, q-values)
4 The correlation vs. causation and prediction vs. hypothesis differences
5 Generalized linear models refresher
6 Dimension reduction I: Penalized regression
7 Dimension reduction II: Partial least squares, principal component regression

SLIDE 3

“Classical” statistics analysis

[Diagram: obesity as outcome, with gene, gender and age as explanatory variables]

Could be analyzed with a multiple regression model:

obesityi = α + β1·genei + β2·genderi + β3·agei + εi

SLIDE 4

The omics revolution

SLIDE 5

The “joy” of *omics for an analyst

CACAC GCGTG AAGAT CAACC

SLIDE 6

Examples

Sequence data CACACGCGTGAAGATCAACCGAAA TCACTCATGCGGGCTTGACCATGT CGCCTACATGTCCTTCACACGCGT GAAGATCAACCGAAATCACTCATG CGGGCTTGACCATGTCGCCTACAT GTCCTTCACACGCGTGAAGATCAA CCGAAATCACTCATGCGGGCTTGA CCATGTCGCCTACATGTCCTTCAC ACGCGTGAAGATCAACCGAAATCA CTCATGCGGGCTTGACCATGTCGC CTACATGTCC

SLIDE 7

Examples

Sequence data

CACACGCGTGAAGATCAACCGAAA TCACTCATGCGGGCTTGACCATGT CGCCTACATGTCCTTCACACGCGT GAAGATCAACCGAAATCACTCATG CGGGCTTGACCATGTCGCCTACAT GTCCTTCACACGCGTGAAGATCAA CCGAAATCACTCATGCGGGCTTGA CCATGTCGCCTACATGTCCTTCAC ACGCGTGAAGATCAACCGAAATCA CTCATGCGGGCTTGACCATGTCGC CTACATGTCC

Evaluate P(Yi = “gene” | Y1, ..., Yi−1). Do that for each i and identify the nucleotides that have a high probability of being inside a gene.

SLIDE 8

Examples

Proteomics

SLIDE 9

Examples

Gene expression data

SLIDE 10

Examples

Metabolite data

SLIDE 11

Examples

Sequence data — metabolite data

SLIDE 12

A bit of history

  • 2000: one SNP
  • 2003: 10 SNPs
  • 2006: 500 SNPs
  • 2009: 22k SNPs
  • 2012: 2.5 million SNPs
  • 2013: 25 million SNPs, ∼45 million imputed

SLIDE 13

Pattern recognition

SLIDE 14

Prediction

SLIDE 15

The $1000 genome

SLIDE 16

Data sizes

[Schematic: outcome y with a few predictors x1, x2, x3]

SLIDE 17

Data sizes

[Schematic: outcome y with predictors x1, x2, ..., x99999; far more predictors than observations]

We need dimension reduction constantly:

  • Feature selection
  • Inference?

SLIDE 18

The problem with multiple comparisons

P predictors - let’s do P standard analyses!

P(at least 1 false positive) = 1 − (1 − α)^P
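The growth of the family-wise error rate with P is easy to evaluate; a minimal Python sketch (illustrative only; the examples in this deck use R):

```python
# FWER for P independent tests, each run at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^P.
def fwer(alpha, P):
    return 1 - (1 - alpha) ** P

print(fwer(0.05, 1))    # a single test: 0.05
print(fwer(0.05, 10))   # around 0.40
print(fwer(0.05, 100))  # essentially certain to see a false positive
```

Already at a few dozen tests, a false positive is more likely than not.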

SLIDE 19

Multiple comparison problems

SLIDE 20

Multiple comparison problems

Possible errors committed when testing a single null hypothesis, H0:

               H0 is true    H0 is false
Rejected       α             1 − β
Not rejected   1 − α         β
Total          1             1

α is the significance level; 1 − β is the power.

SLIDE 21

Multiple comparison problems

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

Here R, the number of rejected hypotheses (discoveries), can be seen as a random variable. V, S, U and T are unobserved.

SLIDE 22

Multiple comparison problems

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

Here R, the number of rejected hypotheses (discoveries), can be seen as a random variable. V, S, U and T are unobserved. The family-wise error rate (FWER) is the probability of making at least one type I error (false positive):

FWER = P(V > 0) = 1 − P(V = 0)

SLIDE 23

Multiple comparison problems

The family-wise error rate (FWER) is the probability of making at least one type I error (false positive). For m tests we have

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

where the second equality only holds under independence. (The inequality, however, holds in general by Boole’s inequality.)

SLIDE 24

Bonferroni correction

The most conservative method, but it is free of dependence and distributional assumptions.

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

So instead set the significance level of each individual test to α/m. In other words, we reject the ith hypothesis if

m·pi ≤ α  ⇔  pi ≤ α/m

SLIDE 25

Bonferroni correction

The most conservative method, but it is free of dependence and distributional assumptions.

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

So instead set the significance level of each individual test to α/m. In other words, we reject the ith hypothesis if

m·pi ≤ α  ⇔  pi ≤ α/m

Šidák correction (assumes independence). Want overall significance level α*:

1 − (1 − α)^m = α*  ⇔  α = 1 − (1 − α*)^(1/m)

Slightly less conservative than Bonferroni (but not by much).
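The two per-test levels can be compared directly; a small Python sketch (illustrative; α* here denotes the desired overall level):

```python
def bonferroni_alpha(alpha_star, m):
    # Bonferroni: run each of the m tests at level alpha*/m.
    return alpha_star / m

def sidak_alpha(alpha_star, m):
    # Sidak (independence): solve 1 - (1 - a)^m = alpha* for a.
    return 1 - (1 - alpha_star) ** (1 / m)

m = 100
print(bonferroni_alpha(0.05, m))  # alpha*/m = 0.0005
print(sidak_alpha(0.05, m))       # slightly larger, about 0.000513
```

For realistic m the two thresholds are nearly indistinguishable, which is why Bonferroni is usually preferred for its simplicity.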

SLIDE 26

Holm’s correction

The Holm-Bonferroni correction:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = min{k : p(k) > α/(m + 1 − k)}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂−1).
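The three steps above can be written out directly; a Python sketch (illustrative; the p-values are made up):

```python
def holm_reject(pvals, alpha=0.05):
    # Walk through the ordered p-values; stop at the first k with
    # p_(k) > alpha / (m + 1 - k) and reject everything before it.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for step, i in enumerate(order, start=1):
        if pvals[i] > alpha / (m + 1 - step):
            break
        rejected[i] = True
    return rejected

print(holm_reject([0.001, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Note that the first threshold, α/m, equals the Bonferroni threshold; Holm is never less powerful than Bonferroni.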

SLIDE 27

Holm’s correction

The Holm-Bonferroni correction:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = min{k : p(k) > α/(m + 1 − k)}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂−1).

Controls the FWER: assume that the (ordered) hypothesis k is the first wrongly rejected true hypothesis. Then k ≤ m − (m0 − 1). Hypothesis k was rejected, so

p(k) ≤ α/(m + 1 − k) ≤ α/(m + 1 − (m − (m0 − 1))) = α/m0

Since there are m0 true hypotheses, a Bonferroni argument gives that the probability that one of them is significant at level α/m0 is at most α, so the FWER is controlled.

SLIDE 28

Resampling methods

Computer-intensive methods:

  • Permutation methods. Simulate data under H0, compute the test statistic and compare it to the test statistic from the original data.
  • Bootstrap. “Simulate data under Ha”.

SLIDE 29

Practical problems

  • While the guarantee of FWER control is appealing, the resulting thresholds often suffer from low power. In practice, this tends to wipe out evidence of the most interesting effects.
  • FDR control offers a way to increase power while maintaining some principled bound on error.

SLIDE 30

False discovery rate

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

The proportion of false discoveries is Q = V/R (set to 0 when R = 0). The false discovery rate is

FDR = E(Q) = E(V/R)

This focuses on a different problem than the FWER! How do we estimate E(Q) in practice?
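The false-discovery proportion Q has a one-line implementation once the true nulls are (hypothetically) known, which is useful in simulations. A Python sketch:

```python
def fdp(rejected, is_null):
    # Q = V/R: the share of rejections that are true nulls,
    # defined as 0 when nothing is rejected (R = 0).
    R = sum(rejected)
    V = sum(1 for r, null in zip(rejected, is_null) if r and null)
    return V / R if R > 0 else 0.0

# Hypothetical example: three rejections, one of which is a true null,
# so Q = 1/3.
print(fdp([True, True, True, False, False],
          [True, False, False, True, True]))
```

The FDR is the expectation of this quantity over repeated experiments, which is exactly why it must be estimated rather than read off a single data set.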

SLIDE 31

Estimating FDR

[Histogram of p-values: density (0 to 1.5) against p-value (0 to 1)]

SLIDE 32

Estimating FDR

[Histogram of p-values: density (0 to 1.5) against p-value (0 to 1)]

SLIDE 33

Estimating FDR

[Histogram of p-values with regions labelled: true negatives, false negatives, true positives, false positives]

SLIDE 34

Estimating FDR — BH step-up

Benjamini-Hochberg step-up procedure to control the FDR at level α:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = max{k : (m/k)·p(k) ≤ α}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂).
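The step-up search for k̂ can be written out as follows (Python sketch with made-up p-values):

```python
def bh_reject(pvals, alpha=0.05):
    # Find the largest k with p_(k) <= alpha * k / m, then reject the
    # hypotheses with the k smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_hat = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= alpha * k / m:
            k_hat = k
    rejected = [False] * m
    for i in order[:k_hat]:
        rejected[i] = True
    return rejected

print(bh_reject([0.001, 0.04, 0.03, 0.005]))  # all four rejected
```

At the same level α, BH rejects at least as many hypotheses as Holm, illustrating the extra power bought by controlling the FDR instead of the FWER.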

SLIDE 35

Estimating FDR — BH step-up

Benjamini-Hochberg step-up procedure to control the FDR at level α:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = max{k : (m/k)·p(k) ≤ α}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂).

Equivalently, in terms of adjusted p-values:

p̃(m) = p(m)
p̃(m−1) = min{p̃(m), (m/(m−1))·p(m−1)}
...
p̃(1) = min{p̃(2), m·p(1)}

SLIDE 36

Estimating FDR — BH step-up

Note that each threshold in Holm’s method is smaller than or equal to the corresponding BH threshold, so the FWER-controlling Holm procedure is the more conservative of the two.

Assume the m0 true tests are iid (and all tests independent), ordered so that the m0 true tests come first. To see that the FDR is controlled at level q:

E(V/R) = Σ_{r=1}^{m} E[(V/r)·1{R=r}]
       = Σ_{r=1}^{m} (1/r)·E[V·1{R=r}]
       = Σ_{r=1}^{m} (1/r)·E[Σ_{i=1}^{m0} 1{pi ≤ qr/m}·1{R=r}]
       = Σ_{r=1}^{m} (m0/r)·E[1{p1 ≤ qr/m}·1{R=r}]
       = ···
       = q·(m0/m) ≤ q

SLIDE 37

q-values

The q-value is defined to be the FDR analogue of the p-value:

q-value(pi) = min_{t ≥ pi} FDR(t)

The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.

SLIDE 38

q-values

  • When all m null hypotheses are true, FDR control is equivalent to FWER control.
  • The FDR approach generally gives more power than FWER control and fewer Type I errors than uncorrected testing.
  • The FDR bound holds for certain classes of dependent tests. In practice, it is quite hard to “break”.

SLIDE 39

Multiple regression

Regression with more than one explanatory variable. For the ith individual,

yi = β0 + β1·x1i + β2·x2i + ··· + βp·xpi + εi

  • Each coefficient, βj, expresses the expected change in the response y when xj changes one unit and the remaining explanatory variables are held fixed.
  • Hence βj is a measure of the effect of xj on the response, taking the effect of the other variables into account.
  • It might be that we have a significant effect of x1 in the univariate model while its effect in the multiple linear model is non-significant: its effect is explained by the other explanatory variables.

SLIDE 40

Generalized linear models — a refresher

Expand the multiple regression model by applying a transformation with a suitable link function. This gives a so-called generalized linear model:

g(E[Yi|X1i, ..., Xpi]) = β0 + β1·X1i + ··· + βp·Xpi

or equivalently

E[Yi|X1i, ..., Xpi] = g⁻¹(β0 + β1·X1i + ··· + βp·Xpi)

where g is in principle any function that maps the population mean onto the linear predictor; g is called the link function.
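For non-Gaussian outcomes the classic example is the logit link for binary data. A Python sketch of g and g⁻¹ (illustrative, not from the slides):

```python
import math

def logit(p):
    # Link g: maps a mean (here a probability) to the linear predictor.
    return math.log(p / (1 - p))

def inv_logit(eta):
    # Inverse link g^{-1}: maps the linear predictor back to a mean.
    return 1.0 / (1.0 + math.exp(-eta))

print(inv_logit(0.0))                    # linear predictor 0 -> mean 0.5
print(round(logit(inv_logit(1.7)), 6))   # round trip recovers 1.7
```

The identity link used below is simply g(μ) = μ, which brings us back to ordinary multiple regression.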

SLIDE 41

Generalized linear models — a refresher

Equivalently,

E[Yi|X1i, ..., Xpi] = g⁻¹(Xβ)

We are only going to consider Gaussian data with the identity link:

E[Yi|X1i, ..., Xpi] = Xβ  and  Y = Xβ + ε

SLIDE 42

Generalized linear models — a refresher

From first-year statistics, the ordinary least squares estimator is

β̂ = (XᵗX)⁻¹Xᵗy

Written as least squares: β̂ minimizes (y − Xβ)ᵗ(y − Xβ).
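For one predictor with an intercept, the normal equations (XᵗX)β = Xᵗy reduce to a 2×2 system that can be solved by hand; a Python sketch on made-up data:

```python
# Tiny worked example of beta-hat = (X'X)^{-1} X'y with X = [1, x].
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.1]   # roughly y = 2x

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations [[n, sx], [sx, sxx]] (b0, b1)' = (sy, sxy)',
# solved by Cramer's rule.
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
print(round(b0, 6), round(b1, 6))  # intercept near 0, slope near 2.02
```

Note that the inverse (XᵗX)⁻¹ only exists when X has full column rank, which is exactly what fails when p > n.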

SLIDE 43

Generalized linear models — a refresher

From first-year statistics, the ordinary least squares estimator is

β̂ = (XᵗX)⁻¹Xᵗy

Written as least squares: β̂ minimizes (y − Xβ)ᵗ(y − Xβ).

The “old” problem we have been working towards: inference with many outcomes/predictors.

SLIDE 44

Penalized regression

Let Y = (Y1, ..., Yn) be the vector of outcomes in R, and let X hold a set of M predictors for each observation. Assume a linear mean effect: Y = Xβ + ε. The lasso estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + λn·‖β‖₁

so β̂ = argmin_{β ∈ R^M} Zn(β).
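The effect of the ℓ1 penalty on a single coefficient (under an orthonormal design) is soft-thresholding of the OLS estimate, which is why the lasso can set coefficients exactly to zero. A Python sketch:

```python
def soft_threshold(z, lam):
    # Lasso solution for one coefficient: shrink the OLS estimate z
    # toward zero by lam, and snap it to zero when |z| <= lam.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print(soft_threshold(2.0, 0.5))   # 1.5
print(soft_threshold(0.3, 0.5))   # 0.0  (selected out of the model)
print(soft_threshold(-1.0, 0.5))  # -0.5
```

Coordinate-descent solvers such as glmnet apply this update repeatedly, one coordinate at a time.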

SLIDE 45

Properties of regularized regression

Even for M > n or M ≫ n regularized regression methods can:

  • select a sparse model
  • lead to accurate prediction

Limitations of LASSO type regularization:

  • not consistent in variable selection
  • non-standard limiting distribution
  • no oracle property
  • multiple testing problem

SLIDE 46

Example

  • Outcome: depression (scale 0–40)
  • Input: 384 normalized gene expression values
  • 40 persons

> (cbind(y, x))[1:5, 1:7]
      y
[1,] 22 -0.63 -0.16 -0.57 -0.51  0.43  0.41
[2,] 15  0.18 -0.25 -0.14  1.34 -0.24  1.69
[3,] 21 -0.84  0.70  1.18 -0.21  1.06  1.59
[4,] 27  1.60  0.56 -1.52 -0.18  0.89 -0.33
[5,] 12  0.33 -0.69  0.59 -0.10 -0.62 -2.29

SLIDE 47

Example — 384 linear regressions

> head(sort(
+   pt(-abs(MESS::mfastLmCpp(y, x)$tstat), df=38)
+ ))
[1] 0.0003440182 0.0007709503 0.0013871463 0.0014807050 0
[6] 0.0026072106

Bonferroni, Holm and FDR “kill” all but variable 264 — however, it is still not significant!

SLIDE 48

Example — lasso

> library("glmnet")
> res <- glmnet(x, y)
> plot(res, lwd=2)

[Lasso coefficient paths: coefficients (−1.0 to 1.0) against the L1 norm; the number of nonzero coefficients is shown along the top axis]

SLIDE 49

Results

> coef(res, s=2)
385 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) 16.909578504
V1           .
V2           .
V3           .
V4           .
V5           .
V6           .
V7           .
V8           .
V9           .
V10          .
V11          .
V12          .
V13          .
V14          .

SLIDE 50

Results

> pick <- which(coef(res, s=2) != 0)
> cbind((1:385)[pick], coef(res, s=2)[pick])
     [,1]         [,2]
[1,]    1 16.909578504
[2,]   43  0.006902524
[3,]  112 -0.067933348
[4,]  116 -0.294950702
[5,]  146  0.072430783
[6,]  229  0.024146820
[7,]  265  0.462496856
[8,]  297 -0.092082227

SLIDE 51

Evaluating regularized regression results

How do we evaluate the results from regularized regression?

  • We get a list of parameters: β̂(1), β̂(2), β̂(3), ..., 0, 0, ...
  • They are all shrunk and biased towards 0.
  • How do we know which of them are significant?

SLIDE 52

Evaluating regularized regression results

How do we evaluate the results from regularized regression?

  • We get a list of parameters: β̂(1), β̂(2), β̂(3), ..., 0, 0, ...
  • They are all shrunk and biased towards 0.
  • How do we know which of them are significant?

Classical approach: test the hypothesis that the jth predictor is significant,

H0: βj = 0

Potential problems with multiple testing, the selection algorithm, bias, lack of a small-sample test statistic, ...

SLIDE 53

Debiasing

> pick <- which(coef(res, s=2) != 0)
> select <- pick[-1]-1 ; select
[1]  42 111 115 145 228 264 296
> broom::tidy(lm(y ~ x[,select]))
# A tibble: 8 x 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    16.6       0.401    41.5   2.11e-29
2 x[, select]1    1.36      0.473     2.88  7.06e- 3
3 x[, select]2   -1.29      0.529    -2.43  2.08e- 2
4 x[, select]3   -1.64      0.435    -3.78  6.42e- 4
5 x[, select]4    0.303     0.509     0.594 5.57e- 1
6 x[, select]5    1.71      0.415     4.11  2.55e- 4
7 x[, select]6    1.57      0.433     3.61  1.02e- 3
8 x[, select]7   -1.21      0.361    -3.35  2.07e- 3

These naive post-selection p-values do not give valid inference!

SLIDE 54

Selective Inference

Compute p-values and CIs for the lasso estimate, at a fixed value of the tuning parameter λ.

> library("selectiveInference")
> sigma <- estimateSigma(x, y)
> lambda <- 2
> n <- 40
> beta <- coef(res, s=lambda/n,
+              exact=TRUE, x=x, y=y)[-1]
> fixedLassoInf(x, y, beta, lambda, sigma=sigma)

Caution: quite persnickety.

SLIDE 55

Penalized regression — II

How about using another penalty function? Ridge regression estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + ξn·‖β‖₂²

Ridge regression proceeds by adding a small value to the diagonal elements of the correlation matrix of the parameters. This introduces bias but reduces variation.
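In the simplest case of one standardized predictor the ridge estimate has a closed form, sxy/(sxx + ξ), which makes the shrinkage explicit. A Python sketch on made-up data:

```python
def ridge_1d(xs, ys, lam):
    # One-predictor ridge estimate: beta = sum(x*y) / (sum(x*x) + lam).
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
print(ridge_1d(xs, ys, 0.0))  # lam = 0 gives OLS: 2.0
print(ridge_1d(xs, ys, 2.0))  # shrunk towards zero: 1.0
```

Unlike the lasso, the ridge estimate is never exactly zero for any finite penalty, so ridge does not perform variable selection.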

SLIDE 56

Penalized regression — III

How about using yet another penalty function? The elastic net estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + λn·‖β‖₁ + ξn·‖β‖₂²

This has the lasso and ridge regression as special cases and is estimated using an iterative two-step procedure.

SLIDE 57

Example — ridge

> res <- glmnet(x, y, alpha=0)  # Ridge regression
> plot(res, lwd=2)

[Ridge coefficient paths: all 384 coefficients shrink smoothly towards zero; none are set exactly to zero]

SLIDE 58

Example — elastic net

> res <- glmnet(x, y, alpha=.3)  # Elastic net
> plot(res, lwd=2)

[Elastic net coefficient paths: coefficients against the L1 norm; the number of nonzero coefficients is shown along the top axis]

SLIDE 59

How to choose the penalty?

> res <- cv.glmnet(x, y)  # 10-fold CV
> plot(res, lwd=2)

[Cross-validation curve: mean squared error against log(λ); the number of nonzero coefficients is shown along the top axis]

SLIDE 60

Principal component analysis

Dimension reduction of the covariates.

[Scatter plot of x[,2] against x[,1]]

SLIDE 61

Principal component analysis

General algorithm:

1 Compute the covariance matrix of the predictor data set X.
2 Calculate the eigenvalues and corresponding eigenvectors of this covariance matrix.
3 The eigenvectors correspond to orthogonal “directions”; sort them by eigenvalue.

SLIDE 62

Principal component analysis

General algorithm:

1 Compute the covariance matrix of the predictor data set X.
2 Calculate the eigenvalues and corresponding eigenvectors of this covariance matrix.
3 The eigenvectors correspond to orthogonal “directions”; sort them by eigenvalue.

To reduce dimensionality, pick a unit vector u and replace each data point x with its projection uᵗx. These new data points have variance uᵗΣu if Σ was the variance of x. Find u such that uᵗΣu is maximized: that u is exactly the eigenvector with the largest eigenvalue.
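The leading eigenvector can be found without a linear-algebra library by power iteration; a Python sketch for a 2×2 covariance matrix with known eigenvalues 3 and 1 (illustrative only):

```python
import math

def top_eigen(sigma, iters=200):
    # Power iteration: repeatedly apply Sigma and renormalize; the
    # iterate converges to the eigenvector with the largest eigenvalue.
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [sigma[0][0] * v[0] + sigma[0][1] * v[1],
             sigma[1][0] * v[0] + sigma[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v' Sigma v recovers the corresponding eigenvalue.
    sv = [sigma[0][0] * v[0] + sigma[0][1] * v[1],
          sigma[1][0] * v[0] + sigma[1][1] * v[1]]
    return v, v[0] * sv[0] + v[1] * sv[1]

v, lam = top_eigen([[2.0, 1.0], [1.0, 2.0]])
print(round(lam, 6))  # 3.0
```

In practice one uses an eigendecomposition or SVD routine, but the principle of maximizing uᵗΣu is the same.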

SLIDE 63

Principal component analysis

Dimension reduction of the covariates.

[Scatter plot of x[,2] against x[,1]]

SLIDE 64

Principal component regression

  • Instead of smoothly shrinking the coordinates on the principal components, PCR either does not shrink a coordinate at all or shrinks it to zero.
  • Keep the k largest-eigenvalue components and use the k projections onto them as input to a GLM.
  • Discrete shrinkage effect compared to ridge regression.
  • Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal component regression discards the p − k smallest-eigenvalue components.

SLIDE 65

Example — PCR

5 components

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.9500     1.3270  13.527 3.05e-15 ***
prPC1         0.3544     0.3245   1.092   0.2824
prPC2         0.1961     0.3363   0.583   0.5637
prPC3        -0.1120     0.3397  -0.330   0.7436
prPC4        -0.6515     0.3486  -1.869   0.0702 .
prPC5         0.1130     0.3526   0.320   0.7506

Residual standard error: 8.393 on 34 degrees of freedom
Multiple R-squared: 0.1335, Adjusted R-squared: 0.006065
F-statistic: 1.048 on 5 and 34 DF, p-value: 0.4061
