Statistical Genetics: Matthew Stephens, Statistics Retreat, October 26th 2012

SLIDE 1

Statistical Genetics

Matthew Stephens Statistics Retreat, October 26th 2012

Matthew Stephens Retreat Talk 2012

SLIDE 2

Two stories

◮ The two most influential statistical ideas in analysis of genetic association studies.¹

◮ Sequence, sequence, everywhere.

¹ With apologies to Steve Stigler

SLIDE 3

Story I: Genetic Association Studies

Genetic association studies aim to identify genetic variants that modify risk of common diseases or affect other phenotypes (e.g. type 1 diabetes, height, LDL cholesterol). The idea is absurdly simple: measure genetic variants (usually SNPs) and phenotypes in randomly sampled individuals, and see which SNPs are correlated with phenotypes.

SLIDE 4

Story I: Genetic Association Studies

◮ Typical recent genome-wide studies have typed 500K-1M SNPs in thousands of (unrelated) phenotyped individuals.

◮ Basic Analysis: test each SNP, one-by-one, for statistical association with each phenotype.
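The one-by-one scan can be sketched in a few lines. This is a minimal illustration on simulated data, assuming a quantitative phenotype, a simple linear regression per SNP, and a normal approximation for the p-values; the function name `single_snp_scan`, the sample sizes, and the causal SNP index are all hypothetical and not taken from the talk.

```python
import math
import numpy as np

def single_snp_scan(genotypes, phenotype):
    """Test each SNP, one by one, with a simple linear regression.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    phenotype: (n_individuals,) array of quantitative phenotypes.
    Returns per-SNP z-scores and two-sided p-values (normal approximation).
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = g.shape[0]
    g_c = g - g.mean(axis=0)
    y_c = y - y.mean()
    # Pearson correlation between each SNP and the phenotype.
    denom = np.sqrt((g_c ** 2).sum(axis=0) * (y_c ** 2).sum())
    r = (g_c * y_c[:, None]).sum(axis=0) / denom
    # t-statistic for the regression slope; close to N(0, 1) for large n.
    z = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in z])
    return z, p

# Simulated data (assumption): 2000 individuals, 200 independent SNPs.
rng = np.random.default_rng(0)
n, m = 2000, 200
freqs = rng.uniform(0.1, 0.9, size=m)
G = rng.binomial(2, freqs, size=(n, m))
causal = 17  # hypothetical causal SNP index
y = 0.5 * G[:, causal] + rng.normal(size=n)

z, p = single_snp_scan(G, y)
best = int(np.argmin(p))
print("most associated SNP:", best)
```

A real analysis would typically use logistic regression for case-control phenotypes and would exclude monomorphic SNPs, which this sketch does not handle.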

SLIDE 5

Progress identifying variants underlying common disease

Published Genome-Wide Associations through 09/2011: 1,617 published GWA at p ≤ 5×10⁻⁸ for 249 traits.

NHGRI GWA Catalog, www.genome.gov/GWAStudies

SLIDE 6

The two most influential statistical ideas in GWAS

◮ Correction for unmeasured confounding (population structure).

◮ Imputation to combine studies.

SLIDE 7

Population Structure and Unmeasured Confounding

The Problem in a nutshell: What would happen if you conducted a Genetic Association study for “Chopstick Use” in San Francisco?

SLIDE 8

Population Structure and Unmeasured Confounding

If you know the “genetic background” of the individuals in your study (e.g. which continent they inherited their genes from), then you can correct for it. What if you don’t know it?

SLIDE 9

Principal Components Analysis to the rescue!

Novembre et al, Nature, 2008

SLIDE 10

Principal Components Analysis to the rescue!

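Test for significance of genetic effect β, controlling for effects of genetic background (α):

y = vα + xβ + ε

Here y is the phenotype, x is the genotype at the SNP being tested, and v holds the principal-component scores that stand in for genetic background.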

Price et al, Nature Genetics, 2006
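To make the regression y = vα + xβ + ε concrete, here is a minimal sketch on simulated data, in the spirit of the "chopstick" example: two populations differ in allele frequencies and in mean phenotype, and no SNP has a causal effect. The helper names `top_pcs` and `snp_z_scores`, the simulation settings, and the choice of a single principal component are assumptions made for illustration, not the published method's exact procedure.

```python
import numpy as np

def top_pcs(genotypes, k):
    """Top-k principal-component scores of the standardized genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)
    sd = g.std(axis=0)
    g = g / np.where(sd > 0, sd, 1.0)
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k] * s[:k]

def snp_z_scores(genotypes, phenotype, covariates=None):
    """z-score for beta in y = v*alpha + x*beta + eps, one SNP at a time."""
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = g.shape[0]
    # Nuisance design v: intercept plus optional covariates (e.g. PCs).
    v = np.ones((n, 1))
    if covariates is not None:
        v = np.column_stack([v, covariates])
    # Project v out of the phenotype and of every SNP, then regress residuals.
    q, _ = np.linalg.qr(v)
    y_r = y - q @ (q.T @ y)
    g_r = g - q @ (q.T @ g)
    dof = n - v.shape[1] - 1
    sxx = (g_r ** 2).sum(axis=0)
    beta = (g_r * y_r[:, None]).sum(axis=0) / sxx
    rss = (y_r ** 2).sum() - beta ** 2 * sxx
    return beta / np.sqrt(rss / dof / sxx)

# Simulated data (assumption): two populations, 500 individuals each.
rng = np.random.default_rng(1)
n_per_pop, m = 500, 500
base = rng.uniform(0.2, 0.8, size=m)
shift = rng.normal(0.0, 0.1, size=m)
f1 = np.clip(base + shift, 0.05, 0.95)
f2 = np.clip(base - shift, 0.05, 0.95)
G = np.vstack([rng.binomial(2, f1, size=(n_per_pop, m)),
               rng.binomial(2, f2, size=(n_per_pop, m))])
pop = np.repeat([0.0, 1.0], n_per_pop)
y = 2.0 * pop + rng.normal(size=2 * n_per_pop)  # phenotype depends on population only

z_naive = snp_z_scores(G, y)
z_pca = snp_z_scores(G, y, covariates=top_pcs(G, 1))
inflation_naive = float(np.mean(z_naive ** 2))  # about 1 if well calibrated
inflation_pca = float(np.mean(z_pca ** 2))
print(inflation_naive, inflation_pca)
```

Without the correction the mean squared z-score is far above 1, so many SNPs look associated purely through population membership; including the first principal component as v brings it back close to 1.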

SLIDE 11

The two most influential statistical ideas in GWAS

◮ Correction for unmeasured confounding (population structure).

◮ Imputation to combine studies.

Credit: Bryan Howie

SLIDE 12

Genotype imputation background

[Figure: genotype matrix. Columns: SNPs genotyped on an array. Upper rows: reference haplotypes (0/1 alleles). Lower rows: phenotyped GWAS samples (0/1/2 genotypes, with occasional missing calls shown as ?).]

SLIDE 13

Genotype imputation background

[Figure: genotype matrix with the untyped SNPs added. The reference haplotypes are observed at every SNP; the phenotyped GWAS samples show ? at the untyped SNPs.]

SLIDE 14

Genotype imputation background

[Figure: genotype matrix after imputation. The ? entries in the phenotyped GWAS samples are filled in with 0/1/2 genotypes using the reference haplotypes, and an association signal is marked.]
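The idea of filling in untyped SNPs from reference haplotypes can be illustrated with a deliberately crude toy. The sketch below assumes a single haploid sample and simply copies the untyped alleles from the reference haplotype that matches best at the typed SNPs; the function name `impute_haplotype` and the small arrays are hypothetical. Imputation methods used in practice rely on hidden Markov models over diploid genotypes, which this nearest-haplotype stand-in does not attempt to reproduce.

```python
import numpy as np

def impute_haplotype(reference, sample):
    """Fill untyped alleles (-1) by copying the closest reference haplotype.

    reference: (n_haplotypes, n_snps) array of 0/1 alleles, fully observed.
    sample:    (n_snps,) array of 0/1 alleles, with -1 marking untyped SNPs.
    """
    reference = np.asarray(reference)
    sample = np.asarray(sample)
    typed = sample >= 0
    # Count mismatches to each reference haplotype at the typed SNPs only.
    mismatches = (reference[:, typed] != sample[typed]).sum(axis=1)
    best = int(np.argmin(mismatches))
    imputed = sample.copy()
    imputed[~typed] = reference[best, ~typed]
    return imputed, best

# Toy data (assumption): 3 reference haplotypes, 8 SNPs, 4 of them typed.
reference = np.array([
    [0, 0, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 1],
])
sample = np.array([1, -1, 0, -1, 0, -1, 1, -1])

imputed, best = impute_haplotype(reference, sample)
print(best, imputed.tolist())  # prints: 1 [1, 1, 0, 1, 0, 0, 1, 0]
```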

SLIDE 15

Imputation facilitates meta-analysis

[Figure: reference haplotypes (0/1 alleles) above two studies, GWAS 1 and GWAS 2, each with 0/1/2 genotypes at the SNPs it typed.]

SLIDE 16

Imputation facilitates meta-analysis

[Figure: reference haplotypes above GWAS 1 and GWAS 2 after imputation. Both studies now have 0/1/2 genotypes at every reference SNP, and an association signal is marked.]

SLIDE 17

Imputation facilitates meta-analysis

[Figure: reference haplotypes above the imputed genotypes for GWAS 1 and GWAS 2.]

Type 2 diabetes: Zeggini et al., May 2008 (Nature Genetics)

Crohn’s disease: Barrett et al., Aug 2008 (Nature Genetics)

Type 1 diabetes: Cooper et al., Nov 2008 (Nature Genetics)
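Once both studies have an estimate at the same SNP, the estimates can be combined. The slides do not say which combination rule was used, so the sketch below assumes a standard fixed-effect, inverse-variance-weighted meta-analysis; the function name `fixed_effect_meta` and the effect sizes and standard errors are hypothetical numbers chosen for illustration.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted combination of per-study effect estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p

# Hypothetical estimates for one SNP: typed in GWAS 1, imputed in GWAS 2.
beta, se, z, p = fixed_effect_meta(betas=[0.12, 0.10], ses=[0.03, 0.04])
print(round(beta, 4), round(se, 4), round(z, 2))  # prints: 0.1128 0.024 4.7
```

The individual studies have z-scores of 4.0 and 2.5; the combined z-score of 4.7 is larger than either, which is the gain in power that makes combining studies attractive.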

SLIDE 18

Story II: Sequence, Sequence, Everywhere

SLIDE 19

Sequencing Assays, and Statistical Challenges

Although DNA sequencing is best known for obtaining “genome sequences”, it is now routinely used for measuring cellular processes to try to understand how cells operate. For example:

◮ Gene expression (RNA-seq).

◮ Chromatin openness (DNase-seq).

◮ Transcription Factor Binding (ChIP-seq).

◮ Histone modifications (ChIP-seq).

A key question is how/why cells differ from one another (they share the same DNA!).

SLIDE 20

Chromatin and DNA structure

Figure from Felsenfeld and Groudine, Nature, 2003

SLIDE 21

The Data

The basic structure of these assays is the same:

◮ Do something clever to get bits of the DNA that you want (e.g. the bits that contact a modified histone, or the bits that are bound by a particular transcription factor).

◮ Sequence these bits (producing millions of little sequences).

◮ Work out where in the genome each sequence came from.

◮ The number of sequences coming from each location (usually 0 or 1) is a measure of the “intensity” of the process at that location.

◮ Basic model: an inhomogeneous Poisson process, x_ib ∼ Poi(λ_ib).
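The basic model can be simulated directly. The sketch below assumes that b indexes base positions and that a single sample i is being simulated (the slide does not define the indices); the intensity profile, with a low background and one smooth peak, and the moving-average estimate are hypothetical choices made only to show what such count data look like.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bases = 10_000
positions = np.arange(n_bases)

# Hypothetical intensity lambda_b: low background plus one smooth peak.
lam = 0.01 + 0.3 * np.exp(-0.5 * ((positions - 6000) / 200.0) ** 2)

# x_b ~ Poi(lambda_b), drawn independently at each base b for one sample.
x = rng.poisson(lam)

# A moving average of the counts gives a crude estimate of the intensity.
window = 201
estimate = np.convolve(x, np.ones(window) / window, mode="same")

frac_small = float(np.mean(x <= 1))  # share of bases with a count of 0 or 1
peak = int(np.argmax(estimate))
print(frac_small, peak)
```

Most per-base counts are 0 or 1, as the slide notes, so the signal only becomes visible after pooling information across neighbouring bases.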

SLIDE 22

Example: Histone Modification H3K4me1

Can you spot the difference?

[Figure: two panels of H3K4me1 signal, "Left Ventricle, H3K4me1" and "Right Ventricle, H3K4me1", each plotted over positions 32230000 to 32290000 (x-axis) with a y-axis running from 0.00 to 0.08.]

Data from Scott Smemo, Nobrega lab

SLIDE 23

Advertisement: STAT 45800

We have preliminary ideas and methods for dealing with these data, based on wavelets for count data (work with H. Shim).

In STAT 45800 we will try “crowd-sourcing” these ideas, to see how much further progress we can make.

Aim: to combine expertise in Bioinformatics, Computing, Biology and Statistics, to make more progress together than any of us could alone!

SLIDE 24

Acknowledgements

◮ Bryan Howie, Heejung Shim.

◮ Funding: NHGRI, NIH GTEx project, and NIH ENDGAME consortium.
