Bayesian Haplotype Inference via the Dirichlet Process



SLIDE 1

Bayesian Haplotype Inference via the Dirichlet Process
Eric Xing, Michael Jordan, Roded Sharan
Presented by Amrudin Agovic

SLIDE 2

Motivation

• 99.9% of human DNA is shared
• The remaining 0.1% of DNA accounts for the differences
• Need to determine what that 0.1% is
• Find genes responsible for diseases

SLIDE 3

Background

• Humans have 23 pairs of chromosomes in their cells
• 23 come from the father, 23 from the mother
• Certain parts of the genome are inherited unchanged
• Other genetic information gets mixed up

SLIDE 4

Background

• Allele: the genetic coding that occupies a given position on the chromosome.
• Genotype: unordered pairs of alleles in a region (one from each chromosome).
• Phase: the association of each allele with a chromosome (not given).
• SNP: Single Nucleotide Polymorphism, a difference in one nucleotide (A, C, G, T).
• Haplotype: the set of associated SNP alleles in a region of a chromosome. A haplotype is inherited as a unit.
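
To make the phase ambiguity concrete, here is a minimal sketch that lists every haplotype pair consistent with one unphased genotype. The function name `compatible_haplotype_pairs` and the tuple encoding of a genotype are illustrative assumptions, not part of the presentation.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples of alleles, one tuple per SNP locus,
    e.g. ("A", "G") for a heterozygous locus (hypothetical encoding).
    """
    het = [j for j, (x, y) in enumerate(genotype) if x != y]
    # Fix the orientation of the first heterozygous locus so that
    # (h1, h2) and (h2, h1) are not both listed.
    n_free = max(len(het) - 1, 0)
    pairs = []
    for flips in product([False, True], repeat=n_free):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for j, (x, y) in enumerate(genotype):
            if flip_at.get(j, False):
                x, y = y, x
            h1.append(x)
            h2.append(y)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


genotype = [("A", "G"), ("C", "C"), ("T", "C")]
pairs = compatible_haplotype_pairs(genotype)
print(pairs)  # prints [('ACT', 'GCC'), ('ACC', 'GCT')]
```

A genotype with k heterozygous loci admits 2^(k-1) such pairs, which is why the phase has to be inferred rather than read off the data.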
SLIDE 5

Background

SLIDE 6
SLIDE 7

Dirichlet Process Representation

Let

• $G_0(\phi)$ be a base measure for the Dirichlet process
• $A^{(k)} := [A_1^{(k)}, \dots, A_J^{(k)}]$ be a founding haplotype configuration (ancestral template) at loci $t = [1, \dots, J]$
• $\theta^{(k)}$ be the mutation rate of the ancestor
• $\phi$ be the parameter associated with a mixture component, where $\phi_k = \{A^{(k)}, \theta^{(k)}\}$

SLIDE 8

Dirichlet Process Representation

• Use the Chinese Restaurant Process
• Associate each population haplotype with a table
• For each table, sample $\phi_k = \{A^{(k)}, \theta^{(k)}\}$
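
The seating rule of the Chinese Restaurant Process can be sketched as follows. This is a generic illustration, not the presenters' code: the concentration parameter `alpha` and the function name are assumptions, since the slide does not name them.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time; returns (table index per customer, table sizes).

    alpha is the concentration parameter (hypothetical name, must be > 0).
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of customers at table k
    for _ in range(n_customers):
        # An existing table is chosen in proportion to its occupancy,
        # a new table in proportion to alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new table (a new ancestor)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = chinese_restaurant_process(20, alpha=1.0)
print(assignments)
print(counts)
```

In the haplotype setting each customer would be an individual haplotype and each table an ancestral haplotype, so the number of ancestors does not have to be fixed in advance.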

SLIDE 9

The Model

SLIDE 10

Assumptions

• $G_0(A, \theta) = p(A)\,p(\theta)$
• $p(A)$ is a uniform distribution over all haplotypes
• $p(\theta)$ is $\mathrm{Beta}(\alpha_h, \beta_h)$
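
A draw from a base measure of this product form can be sketched as below. The biallelic coding `"01"` and the function name are assumptions made for illustration. A uniform distribution over all haplotypes is equivalent to choosing each locus independently and uniformly.

```python
import random

ALLELES = "01"  # assumption: biallelic SNPs coded as 0/1


def sample_from_base_measure(n_loci, alpha_h, beta_h, rng):
    """Draw (founder haplotype, mutation rate) from G0(A, theta) = p(A) p(theta)."""
    # p(A): uniform over all haplotypes of length n_loci.
    founder = "".join(rng.choice(ALLELES) for _ in range(n_loci))
    # p(theta): Beta(alpha_h, beta_h).
    theta = rng.betavariate(alpha_h, beta_h)
    return founder, theta


rng = random.Random(1)
founder, theta = sample_from_base_measure(5, alpha_h=1.0, beta_h=50.0, rng=rng)
print(founder, theta)
```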

SLIDE 11

Distributions

Considering mutations for all alleles:

Integrating out $\theta$:
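
The equations for this slide are not available in the transcript. As a hedged illustration only, the sketch below assumes a simple per-locus mutation model: each allele of a haplotype differs from its founder independently with probability theta, landing uniformly on one of the other alleles, with theta drawn from Beta(alpha_h, beta_h). Under that assumption the integral over theta has a closed form in terms of Beta functions. The function names are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_likelihood(haplotype, founder, alpha_h, beta_h, n_alleles=2):
    """log p(haplotype | founder) with theta integrated out.

    Assumed model (not taken from the slides): a locus mutates with
    probability theta and then picks one of the n_alleles - 1 other
    alleles uniformly; theta ~ Beta(alpha_h, beta_h).
    """
    mismatches = sum(h != a for h, a in zip(haplotype, founder))
    matches = len(haplotype) - mismatches
    return (log_beta(alpha_h + mismatches, beta_h + matches)
            - log_beta(alpha_h, beta_h)
            - mismatches * log(n_alleles - 1))


# Haplotypes closer to the founder receive higher probability.
print(exp(log_marginal_likelihood("000", "000", 1.0, 9.0)))
print(exp(log_marginal_likelihood("111", "000", 1.0, 9.0)))
```

As a sanity check on the sketch, the probabilities of all haplotypes of a given length sum to one for any founder.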

SLIDE 12

Noisy Observation Model

• The observed genotype at a locus is determined by the paternal and maternal alleles
• If the genotype disagrees with those alleles, penalize
• $\gamma$ has a Beta prior
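
One simple way to penalize disagreement is sketched below. The exact form used in the presentation is not shown in the transcript, so this is an assumption: at each locus the observed genotype matches the haplotype pair with probability gamma, and otherwise falls uniformly on one of the remaining genotypes. Here gamma is held fixed rather than given a Beta prior, and all names are hypothetical.

```python
from math import log


def log_genotype_likelihood(genotype, hap_paternal, hap_maternal, gamma, n_alleles=2):
    """log p(genotype | two haplotypes) under an assumed per-locus noise model.

    genotype: list of 2-character strings, one unordered allele pair per locus.
    gamma: probability that the observed genotype agrees with the haplotypes.
    """
    # Number of distinct unordered allele pairs at one locus.
    n_genotypes = n_alleles * (n_alleles + 1) // 2
    total = 0.0
    for g, p, m in zip(genotype, hap_paternal, hap_maternal):
        if sorted(g) == sorted((p, m)):
            total += log(gamma)  # observation agrees
        else:
            total += log((1.0 - gamma) / (n_genotypes - 1))  # penalized
    return total


ll_consistent = log_genotype_likelihood(["01", "11"], "01", "11", gamma=0.95)
ll_inconsistent = log_genotype_likelihood(["01", "11"], "00", "11", gamma=0.95)
print(ll_consistent, ll_inconsistent)
```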

SLIDE 13

Pedigree-Haplotyper

SLIDE 14

Inference - Gibbs Sampling

• $\gamma$ and $\theta$ are integrated out
• Sample $C_{i_t}$, $A_j^{(k)}$, $H_{i_t,j}$

1) Given the current hidden values of the haplotypes, sample $c_{i_t}$ and $a_j^{(k)}$
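
The ancestor-assignment part of this step can be sketched with the standard Chinese Restaurant Process conditional: an existing ancestor is weighted by its count times the likelihood of the haplotype under that ancestor, and a new ancestor by the concentration parameter times the marginal likelihood. This is a simplified illustration, not the presenters' sampler: it assumes biallelic loci, a fixed mutation rate `theta` instead of an integrated-out one, and hypothetical names throughout.

```python
import random
from math import exp, log


def log_lik(haplotype, founder, theta):
    """log p(haplotype | founder) for biallelic loci with a fixed mutation rate."""
    mism = sum(h != a for h, a in zip(haplotype, founder))
    return mism * log(theta) + (len(haplotype) - mism) * log(1.0 - theta)


def assignment_weights(haplotype, founders, counts, alpha, theta):
    """Unnormalized conditional weights for the ancestor of one haplotype.

    counts must exclude the haplotype currently being resampled.
    """
    weights = [n * exp(log_lik(haplotype, a, theta))
               for a, n in zip(founders, counts)]
    # New ancestor: with a uniform prior over founders, the marginal
    # probability of any biallelic haplotype of length J is 2**-J.
    weights.append(alpha * 0.5 ** len(haplotype))
    return weights


def sample_assignment(haplotype, founders, counts, alpha, theta, rng):
    """Return an index into founders, or len(founders) for a new ancestor."""
    w = assignment_weights(haplotype, founders, counts, alpha, theta)
    return rng.choices(range(len(w)), weights=w)[0]


weights = assignment_weights("0101", ["0101", "1010"], [3, 2], alpha=1.0, theta=0.05)
print(weights)
print(sample_assignment("0101", ["0101", "1010"], [3, 2], 1.0, 0.05, random.Random(0)))
```

The last weight shrinks as 2^-J with the number of loci J, so under these assumptions a new ancestor is rarely proposed when the list of loci is long.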

SLIDE 15

Gibbs Sampling

2) Given the ancestral assignment and the ancestral pool, sample the haplotype

SLIDE 16

Metropolis-Hastings

• With a long list of loci and a uniform prior $p(a)$, the probability of sampling a new ancestor is very small
• Slow mixing
• Sample the ancestor assignment using a proposal distribution

SLIDE 17

Metropolis-Hastings

• In the acceptance probability, the proposal factor cancels out
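
The cancellation can be illustrated with a generic Metropolis-Hastings step. The sketch assumes the proposal is drawn from the prior independently of the current state, which is one common setting in which the proposal and prior terms cancel and only a likelihood ratio remains. Whether this matches the presenters' proposal exactly is not stated in the transcript, and the function names are hypothetical.

```python
import random
from math import exp


def mh_accept_prob(log_lik_current, log_lik_proposed):
    """Acceptance probability when the proposal equals the prior.

    The prior and proposal terms cancel, leaving min(1, likelihood ratio).
    """
    if log_lik_proposed >= log_lik_current:
        return 1.0
    return exp(log_lik_proposed - log_lik_current)


def mh_step(current, propose_from_prior, log_lik, rng):
    """One Metropolis-Hastings update of a single state."""
    proposed = propose_from_prior(rng)
    accept = mh_accept_prob(log_lik(current), log_lik(proposed))
    return proposed if rng.random() < accept else current


rng = random.Random(0)
new_state = mh_step(
    "a",
    propose_from_prior=lambda r: "b",
    log_lik=lambda s: 0.0 if s == "b" else -1.0,
    rng=rng,
)
print(new_state)
```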

SLIDE 18

Experiments

• Simulated data: haplotypes are randomly paired to form genotypes
• Performance is compared to PHASE

SLIDE 19

Experiments

• Two real data sets: one with 129 individuals, one with 90 individuals from 4 populations

Dataset 1:

SLIDE 20

Experiments

Dataset 2:

• Small sample size, tougher data set
• Haplotyper outperforms PHASE

SLIDE 21

Conclusions

• The algorithm outperforms PHASE on two data sets, by a big margin on one of them
• The strength of the proposed approach lies in its flexibility
• It can be extended to incorporate aspects of evolutionary dynamics and other things
• Illustrated example: pedigree information