
SLIDE 1

Course Overview

02-715 Advanced Topics in Computational Genomics

SLIDE 2

Course Overview

Instructor: Seyoung Kim (Lane Center for Computational Biology, CMU)

Course Website: www.cs.cmu.edu/~sssykim/teaching/f11.html

Location:

– 9115 GHC (first 2 weeks, until Sept 8)
– 6115 GHC (starting from the 3rd week)

Time: Tuesday & Thursday 1:30-2:50pm
Office hours: Tuesday 3:00-4:00pm

SLIDE 3

Grading

Write-ups for required reading (30%)

– Summary of contributions, critique (weaknesses), questions.
– Under 400 words for each paper.
– Submit to Blackboard/Dropbox by midnight the day before the class.

Late submission policy: 70% credit if submitted before the class, 0% afterwards.
Class participation (20%)
Paper presentation (20%)
Final project (30%)

– One-page project proposal: due Nov 1st in class.
– Poster session: the last day of the course.
– Final project report: due Dec 14th.

SLIDE 4

Overview

Next-generation sequencing technology
Genetic polymorphisms
Population genetics review

– Haplotype inference, recombination rate estimation, linkage disequilibrium, tag SNPs

From Human Genome Sequencing Project to HapMap Project to 1000 Genomes Project

SLIDE 5

Decline in Sequencing Costs

Science 331:666-668, 2011

SLIDE 6

Improvement in Sequencing Technologies

SLIDE 7

DNA sequencing – vectors

[Figure: DNA is shaken into DNA fragments; DNA fragment + vector (circular genome: bacterium, plasmid) with a known location (restriction site)]

Adapted from http://www.cs.utoronto.ca/~brudno/csc2431w10/2431_lec1.ppt

SLIDE 8

Method to sequence longer regions

[Figure: a genomic segment is cut many times at random (shotgun); two reads of ~500 bp each are obtained from each segment]

Adapted from http://www.cs.utoronto.ca/~brudno/csc2431w10/2431_lec1.ppt

SLIDE 9

Reconstructing the Sequence (Fragment Assembly)

Cover the region with ~7-fold redundancy (7X).
Overlap reads and extend to reconstruct the original genomic region.

[Figure: overlapping reads]

Adapted from http://www.cs.utoronto.ca/~brudno/csc2431w10/2431_lec1.ppt
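The overlap-and-extend idea can be illustrated with a toy greedy merger. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for illustration and do not come from the slides or from any real assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Toy reads drawn from one short region.
contigs = greedy_assemble(["GATCTTCG", "TTCGTACT", "TACTGAGT"])
print(contigs)
```

With these three toy reads the merger ends with a single contig. Real data needs error tolerance and repeat handling, which this sketch deliberately omits.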

SLIDE 10

Definition of Coverage

Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L

How much coverage is enough?
Lander-Waterman model: assuming a uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides.

Adapted from http://www.cs.utoronto.ca/~brudno/csc2431w10/2431_lec1.ppt
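As a rough numerical check of the coverage definition, the sketch below computes C = n l / L and the expected number of gaps under the simplest form of the Lander-Waterman model, n · exp(-C). It assumes uniformly placed reads and ignores the minimum overlap needed to join two reads. The read length of 500 bp is borrowed from the shotgun slide, and the function names are illustrative.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Expected number of gaps, n * exp(-C), ignoring minimum overlap."""
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)


# A 1,000,000-nucleotide segment with 500 bp reads at C = 10 needs n = 20,000 reads.
L, l, n = 1_000_000, 500, 20_000
print(coverage(n, l, L))       # 10.0
print(expected_gaps(n, l, L))  # a little under 1 gap per 1,000,000 nucleotides
```

Under these assumptions the expected gap count at C = 10 is just under one per million nucleotides, which agrees with the figure quoted on the slide.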

SLIDE 11

Depth of Coverage and Physical Coverage

[Figure panels: single-end sequencing; paired-end sequencing]

SLIDE 12

SLIDE 13

Next Generation Sequencing (NGS) based methods

RNA-Seq: methods for determining mRNA abundance and sequence content

– Rare transcript discovery
– Alternative splicing event detection
– Transcript sequence variation detection

SLIDE 14

Next Generation Sequencing (NGS) based methods

ChIP-Seq: methods for measuring genome-wide profiles of immunoprecipitated DNA-protein complexes

SLIDE 15

Overview

Next-generation sequencing technology
Genetic polymorphisms
From Human Genome Sequencing Project to HapMap Project to 1000 Genomes Project

Population genetics review

– Haplotype inference, recombination rate estimation, linkage disequilibrium, tag SNPs

SLIDE 16

2001: Human Genome Sequencing Project
2011: 1000 Genomes Project


SLIDE 17

Why Genetic Variations?

Genetic variations can

– Be causal variations that influence phenotypes such as disease susceptibility and drug response: finding them can be the first key step toward cures in medicine.
– Be used to find signatures of evolution and positive selection.
– Give insights into population structure.

SLIDE 18

Genetic Variations

Types of genetic variations

– Single nucleotide polymorphisms (SNPs)

    Widely used as genetic markers
    Highly abundant in genomes

– Structural variants: insertions/deletions, duplications, copy number variations

SLIDE 19

Other Genetic Variations

Copy Number Variation

– DNA segment whose copy number differs between genomes

    Kilobases to megabases in size

– Usually two copies of all autosomal regions, one per chromosome
– Variation due to deletion or duplication

SLIDE 20

Variant Frequencies from 1000 Genomes Pilot Project

SLIDE 21

Terminology

Allele: different forms of genetic variation at a given gene or genetic locus
Genotype: specific allelic make-up of an individual's genome
Heterozygous/Homozygous

SLIDE 22

Detecting Genome Alterations

SLIDE 23

Working with SNP Data in Practice

At each locus, SNPs are represented as 0 or 1.

– A/T/C/G letters are converted to 0 or 1 for minor/major alleles
– Genotypes at each locus of each individual are coded as

    0: minor allele homozygous
    1: heterozygous
    2: major allele homozygous

Given genotype data for N individuals,
(Minor allele frequency) = (number of minor allele copies)/(2 × total number of individuals), since each individual carries two alleles per locus
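The 0/1/2 genotype coding lends itself to a short calculation. The sketch below assumes the coding listed above (0 = minor homozygous, 1 = heterozygous, 2 = major homozygous) and counts minor allele copies over the 2N chromosomes in the sample, which is the usual allele-frequency convention. The helper names are illustrative only.

```python
def encode_genotype(allele1: str, allele2: str, minor: str) -> int:
    """Code a genotype as the number of major alleles it carries (0, 1 or 2)."""
    minor_copies = (allele1 == minor) + (allele2 == minor)
    return 2 - minor_copies


def minor_allele_frequency(genotypes) -> float:
    """Minor allele frequency at one locus from 0/1/2-coded genotypes."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")
    # Under this coding, an individual with code g carries 2 - g minor alleles.
    minor_copies = sum(2 - g for g in genotypes)
    return minor_copies / (2 * n)


# Five hypothetical individuals at a locus where G is the minor allele.
calls = [("A", "A"), ("A", "A"), ("A", "G"), ("G", "G"), ("A", "A")]
coded = [encode_genotype(a, b, minor="G") for a, b in calls]
print(coded)                          # [2, 2, 1, 0, 2]
print(minor_allele_frequency(coded))  # 0.3
```

Three of the ten chromosomes carry the minor allele in this made-up sample, hence 0.3.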

SLIDE 24

Sequencing vs. SNP Genotyping

Sequencing a whole genome is much more costly than genotyping a small number of genetic loci for SNPs

SLIDE 25

A Little Bit of History

2001: A draft of the human genome sequence becomes available
2001: The International SNP Map Working Group publishes a SNP map of 1.42 million SNPs that contained all SNPs identified so far

2005: HapMap Phase I

– Genotype at least one common SNP (MAF>5%) every 5kb across 270 individuals
– Geographic diversity

    30 trios from Yoruba in Ibadan, Nigeria (YRI)
    30 trios of European ancestry living in Utah (CEPH)
    45 unrelated Han Chinese in Beijing (CHB)
    45 unrelated Japanese (JPT)

– 1.3 million SNPs

SLIDE 26

A Little Bit of History

2007: HapMap Phase II

– Genotype an additional 2.1 million SNPs for the same individuals
– SNP density about 1 per kb
– Estimated to contain 25-35% of all 9-10 million common SNPs in the assembled human genome.

2010: HapMap Phase III

– 1184 individuals from 11 populations, including HapMap Phase I, II samples
– Rare variants (MAF=0.05-0.5%), low frequency variants (MAF=0.5%-5%)
– Copy number variations, resequencing of selected regions

2010: 1000 Genomes Pilot Project

– A more complete characterization of human genetic variations

SLIDE 27

Linkage Disequilibrium in HapMap Data

[Figure: r² in HapMap data; both axes labeled "genome"]

SLIDE 28

Overview

Next-generation sequencing technology
Genetic polymorphisms
From Human Genome Sequencing Project to HapMap Project to 1000 Genomes Project

Population genetics review

– Haplotype inference, recombination rate estimation, linkage disequilibrium, tag SNPs

SLIDE 29

Haplotype and Genotype

Haplotype: a collection of alleles derived from the same chromosome

[Figure: genotypes (chromosome phase is unknown) versus haplotypes (chromosome phase is known), linked by haplotype reconstruction]

SLIDE 30

Why Haplotypes?

-- a more discriminative state of a chromosomal region

Consider J binary markers in a genomic region
There are 2^J possible haplotypes
but in fact, far fewer are seen in human populations
Good genetic marker for population, evolution and hereditary diseases …

[Figure: aligned chromosome sequences, with chromosomes labeled disease X, healthy, healthy]

GATCTTCGTACTGAGT
GATCTTCGTACTGAGT
GATTTTCGTACGGAAT
GATTTTCGTACTGAGT
GATCTTCGTACTGAAT
GATTTTCGTACGGAAT
GATTTTCGTACGGAAT
GATCTTCGTACTGAAT

Haplotype: CTG 3/8, TGA 3/8, CTA 2/8

SLIDE 31

Haplotype Analyses

Haplotype analyses

– Linkage disequilibrium assessment
– Disease-gene discovery
– Genetic demography
– Chromosomal evolution studies

SLIDE 32

Linkage Disequilibrium (LD)

LD reflects the relationship between alleles at different loci.

– Alleles at locus A: frequencies $p_1, \ldots, p_m$
– Alleles at locus B: frequencies $q_1, \ldots, q_n$
– Haplotype frequency for $A_i B_j$:

    Equilibrium value: $p_i q_j$
    Observed value: $h_{ij}$
    Linkage disequilibrium: $h_{ij} - p_i q_j$

– Linkage disequilibrium is an allelic association measure (difference between the actual haplotype frequency and the equilibrium value)
– More precisely: gametic association
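For two biallelic loci, the disequilibrium value defined above (observed haplotype frequency minus the product of allele frequencies) can be computed directly from phased haplotypes. The sketch below also reports r², the squared correlation between the two loci, which is the LD measure referred to elsewhere in these slides. The input format (a list of 0/1 allele pairs) and the function name are assumptions made for illustration.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0/1 alleles
    at locus A and locus B on the same chromosome.
    """
    n = len(haplotypes)
    p = sum(a for a, _ in haplotypes) / n                # frequency of allele 1 at A
    q = sum(b for _, b in haplotypes) / n                # frequency of allele 1 at B
    h = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # observed h_11
    d = h - p * q                                        # observed minus equilibrium
    denom = p * (1 - p) * q * (1 - q)
    r2 = d * d / denom if denom > 0 else float("nan")    # undefined if a locus is fixed
    return d, r2


# Toy data: perfectly associated loci, then independent loci.
print(ld_stats([(1, 1)] * 4 + [(0, 0)] * 4))     # (0.25, 1.0)
print(ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)]))  # (0.0, 0.0)
```

The first toy sample has every chromosome carrying matching alleles, so r² is 1. The second has all four combinations equally often, so both D and r² are 0.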

SLIDE 33

Haplotype Inference

Given a random sample of multilocus genotypes at a set of SNPs

– Frequency estimation of all possible haplotypes
– Haplotype reconstruction for individuals
– How many out of all possible haplotypes are plausible in a population

Haplotype reconstruction algorithms

– Clark's parsimony algorithm (Clark, Mol. Biol. Evol. 1990)
– Haplotyper (Niu et al., AJHG 2002)
– PHASE (Li and Stephens, Genetics 2003)

SLIDE 34

PHASE

Given genotype data from a population

– Recover haplotypes
– Estimate recombination rate, recombination hotspots
– Impute missing genotypes

SLIDE 35

PHASE

(Stephens et al., AJHG 2001)

Treat unknown haplotypes as unobserved random quantities and estimate the conditional probability of haplotypes given observed genotypes
Gibbs sampling approach

– Start with initial guesses on haplotypes
– Iteratively reconstruct the haplotype of each individual assuming the haplotypes of other individuals have been correctly reconstructed.

SLIDE 36

PHASE

(Stephens et al., AJHG 2001)

Given the genotype data $G$, start with some initial haplotype reconstruction $H^{(0)}$. For $t = 0, 1, 2, \ldots$, obtain $H^{(t+1)}$ from $H^{(t)}$ using the following three steps:

1. Choose an individual, $i$, uniformly and at random from all ambiguous individuals (i.e., individuals with more than one possible haplotype reconstruction).
2. Sample $H_i^{(t+1)}$ from $P(H_i \mid G, H_{-i}^{(t)})$, where $H_{-i}$ is the set of haplotypes excluding individual $i$.
3. Set $H_j^{(t+1)} = H_j^{(t)}$ for $j = 1, \ldots, n$, $j \neq i$.
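The three steps above can be sketched as a small Gibbs sampler. One important caveat: the conditional distribution used here is a deliberately simple stand-in (each candidate haplotype pair is weighted by how often its two haplotypes currently appear in the other individuals, plus a pseudocount), not the conditional that PHASE actually uses. Genotypes are assumed to be tuples of 0/1/2 allele counts per SNP, and all names and parameters are illustrative.

```python
import itertools
import random
from collections import Counter


def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [k for k, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites: 0 -> 0, 2 -> 1
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Fix the first heterozygous site so (h1, h2) and (h2, h1) are not both listed.
    for bits in itertools.product((0, 1), repeat=len(het) - 1):
        h1, h2 = base[:], base[:]
        for k, b in zip(het, (0,) + bits):
            h1[k], h2[k] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def gibbs_phase(genotypes, n_iter=1000, pseudocount=0.1, seed=0):
    """Return one sampled haplotype reconstruction (a pair per individual)."""
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]  # initial reconstruction H(0)
    ambiguous = [i for i, opts in enumerate(options) if len(opts) > 1]
    if not ambiguous:
        return current
    for _ in range(n_iter):
        i = rng.choice(ambiguous)  # step 1: pick an ambiguous individual
        counts = Counter(h for j, pair in enumerate(current) if j != i for h in pair)
        # Step 2: sample individual i's pair given everyone else's haplotypes.
        weights = [(counts[h1] + pseudocount) * (counts[h2] + pseudocount)
                   for h1, h2 in options[i]]
        current[i] = rng.choices(options[i], weights=weights)[0]
        # Step 3: all other individuals keep their current haplotypes.
    return current


genotypes = [(0, 0), (2, 2), (1, 1), (1, 1), (0, 0), (2, 2)]
for g, (h1, h2) in zip(genotypes, gibbs_phase(genotypes, n_iter=200, seed=1)):
    print(g, h1, h2)
```

A single final sample is returned for brevity. In practice one would typically collect many samples after a burn-in period and summarize them, which this sketch leaves out.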

SLIDE 37

PHASE

h1, h2, h3: unobserved ancestral haplotypes
h4A, h4B: unobserved haplotypes for individuals
Circles: alleles, mutations

SLIDE 38

PHASE

What is $P(H_i \mid G, H_{-i}^{(t)})$?

How to model [expression not recovered] (i.e., recombination process)

– $X_l$: random variables for ancestral haplotype labels inherited at locus $l$

SLIDE 39

Haplotype Structure and Recombination Rate Estimates: HapMap I vs. HapMap II

SLIDE 40

Reducing Genotyping Costs with Tag SNPs

Nearby SNPs in the genome are in linkage disequilibrium (LD), and thus contain redundant information.

If we know which SNPs are in LD, we can pre-select the representative SNPs for each LD block of a chromosome, and genotype only those SNPs.

[Figure: axes labeled "genome"]

SLIDE 41

Two-phase Genotyping Using Tag SNPs

A two-phase approach

– Phase 1: Collect a set of SNPs densely distributed across the genome for a small number of individuals. Select tag SNPs based on this dataset.
– Phase 2: Genotype only the tag SNPs for a large number of individuals

SLIDE 42

Tag SNP Selection

Tag SNPs summarize information across multiple SNPs in linkage disequilibrium

Patil et al., Science 2001

SLIDE 43

Algorithm for Finding Tag SNPs

Problem: Find a set of tag SNPs that covers all of the non-tag SNPs, i.e., every non-tag SNP is in LD (r² > α) with a tag SNP

– α: parameter that needs to be specified by the user (e.g., 0.8)

Greedy search (Carlson et al., AJHG 2004)

– Repeatedly select the SNP that covers the largest number of non-tag SNPs given α

Dynamic programming approach (Zhang et al., PNAS 2002)
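The greedy search can be sketched as a set-cover loop over a pairwise r² matrix. This is a minimal illustration of the idea as stated on the slide, not a reimplementation of the published method. The matrix values are invented, a SNP is assumed to cover itself, and ties are broken by lowest index.

```python
def greedy_tag_snps(r2, alpha=0.8):
    """Pick tag SNPs so every SNP is a tag or has r2 > alpha with some tag.

    `r2` is a symmetric list-of-lists of pairwise r-squared values.
    """
    n = len(r2)
    # covers[s]: the SNPs that choosing s as a tag would account for.
    covers = [{t for t in range(n) if t == s or r2[s][t] > alpha}
              for s in range(n)]
    uncovered = set(range(n))
    tags = []
    while uncovered:
        # Choose the SNP that covers the most still-uncovered SNPs.
        best = max(range(n), key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Hypothetical r2 values for five SNPs forming two LD groups.
r2 = [
    [1.00, 0.90, 0.50, 0.00, 0.10],
    [0.90, 1.00, 0.85, 0.10, 0.00],
    [0.50, 0.85, 1.00, 0.20, 0.10],
    [0.00, 0.10, 0.20, 1.00, 0.95],
    [0.10, 0.00, 0.10, 0.95, 1.00],
]
print(greedy_tag_snps(r2, alpha=0.8))  # [1, 3]
```

SNP 1 is picked first because it is in high LD with both SNP 0 and SNP 2. SNP 3 then covers the remaining pair. The loop always terminates because every uncovered SNP covers at least itself.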

SLIDE 44

Using Reference Datasets for Genotype Imputation

Reference data: dense SNP data from HapMap III

New data: SNP data for individuals in a given study

Data after imputation

SLIDE 45

Using Reference Datasets for Genotype Imputation

Reference data: sequence data from 1000 Genomes Project

New data: SNP data for individuals in a given study

Data after imputation

SLIDE 46

Genotype Imputation

PHASE can be used for imputation!

SLIDE 47

Common Variants vs. Rare Variants

First-generation genome-wide association study (GWAS): common variant common disease hypothesis

Common variants with minor allele frequency (MAF)>5%

– dbGap: ~11 million SNPs
– HapMap: 3.5 million SNPs
– A successful GWAS requires a more complete catalogue of genetic variations

Rare variants (MAF<0.5%), low-frequency variants (MAF: 0.5%~5%)

– Captured by sequencing with next-generation sequencing technology
– Possibly significant contributors to the genetic architecture of disease

Causal variants are subject to negative selection
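The frequency classes named on this slide translate into a simple threshold rule. The sketch below uses the cut-offs quoted above (0.5% and 5%). How the exact boundary values are assigned is an assumption here, since the slide does not say.

```python
def frequency_class(maf: float) -> str:
    """Classify a variant by minor allele frequency (given as a fraction)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf < 0.005:
        return "rare"            # MAF < 0.5%
    if maf <= 0.05:
        return "low-frequency"   # MAF between 0.5% and 5%
    return "common"              # MAF > 5%


for maf in (0.001, 0.02, 0.2):
    print(maf, frequency_class(maf))
```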

SLIDE 48

1000 Genomes Project

(The 1000 Genomes Project Consortium, Nature 2010)

The goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas)

Pilot project:

– 179 individuals from four populations (low coverage: 2-6x)
– 6 individuals in two trios (deep sequencing: average 42x)
– 697 individuals from seven populations (exon sequencing of 8,140 exons: average 50x)

Main project: sequence 2500 genomes at 4x coverage

SLIDE 49

Comparison of Different Experimental Designs

SLIDE 50

Catalogue of Genetic Variants from 1000 Genomes Pilot Project

15 million SNPs
1 million short insertions/deletions
20,000 structural variants

SLIDE 51

1000 Genomes Project: Known vs. Novel Variants

SLIDE 52

Potentially Functional Variants

SLIDE 53

Novel Variants and Allele Frequency

Fraction of variants in each allele frequency class that were novel, compared to

– dbSNP for SNPs
– dbVar for deletions
– Results from other studies

SLIDE 54

Summary

Next generation sequencing technology
Genetics study designs evolve as the technology evolves
Genetic polymorphisms: SNPs, structural variants
Haplotype inference, linkage disequilibrium
