A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal - PowerPoint PPT Presentation

A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal Vinterbo 1 Stephan Dreiseitl 2 1 Decision Systems Groups Harvard Medical School, Boston, USA 2 Dept. of Software Engineering Upper Austria University of Applied Sciences Hagenberg, Austria EMSS 2006, Barcelona, Oct. 5 th 2006

Overview • Biomedical background • Haplotype tagging problem • Approximation algorithm • Solution size investigation

Genomic biology primer Genetic information stored in DNA, ∼ 3 billion base pairs ( homo sapiens )

Genomic biology primer Genetic information stored in DNA, ∼ 3 billion base pairs ( homo sapiens ) Compare with amoeba dubia : ∼ 670 billion base pairs

Genomic biology primer Genetic information stored in DNA, ∼ 3 billion base pairs ( homo sapiens ) Compare with amoeba dubia : ∼ 670 billion base pairs Genes are protein-coding regions, comprise about 3–5% of human genome ( ∼ 25 000 genes)

Genomic biology primer Genetic information stored in DNA, ∼ 3 billion base pairs ( homo sapiens ) Compare with amoeba dubia : ∼ 670 billion base pairs Genes are protein-coding regions, comprise about 3–5% of human genome ( ∼ 25 000 genes) Compare with oryza sativa : ∼ 40 000–60 000 genes

Genomic biology primer DNA of individual humans is 99.9% identical

Genomic biology primer DNA of individual humans is 99.9% identical Sequence variations:

Genomic biology primer Single nucleotide polymorphism (SNP): Small genetic variation in a person’s DNA sequence T A A C T A A C T A A A T A A C ↑ SNP

Genomic biology primer Single nucleotide polymorphism (SNP): Small genetic variation in a person’s DNA sequence T A A C T A A C T A A A T A A C ↑ SNP Consider only variations occuring in > 1% of population About 10 million SNPs currently known

Effects of sequence variations

Genomic biology primer SNPs in close proximity are not independent ( linkage disequilibrium )

Genomic biology primer SNPs in close proximity are not independent ( linkage disequilibrium ) Haplotype: Set of associated SNPs in a region of a chromosome T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A

Genomic biology primer SNPs in close proximity are not independent ( linkage disequilibrium ) Haplotype: Set of associated SNPs in a region of a chromosome T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A Haplotype-tagging SNPs (htSNPs): Those SNPs that allow the unique identification of a haplotype

Genomic biology primer SNPs in close proximity are not independent ( linkage disequilibrium ) Haplotype: Set of associated SNPs in a region of a chromosome T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A ↑ ↑ Haplotype-tagging SNPs (htSNPs): Those SNPs that allow the unique identification of a haplotype

Medical significance Diseases caused by variations in single genes are rare

Medical significance Diseases caused by variations in single genes are rare Common disorders are caused by combined effect of different variants and environmental influences

Medical significance Diseases caused by variations in single genes are rare Common disorders are caused by combined effect of different variants and environmental influences Association studies: Compare haplotype of diseased with that of healthy individuals

Medical significance Diseases caused by variations in single genes are rare Common disorders are caused by combined effect of different variants and environmental influences Association studies: Compare haplotype of diseased with that of healthy individuals Advantages of htSNPs: Only need to test few locations to identify complete haplotype

Medical significance

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m }

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m } Find: Minimum cardinality set S s.t. ∀ i , j h i [ S ] = h j [ S ] ⇒ i = j

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A ↑ Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m } Find: Minimum cardinality set S s.t. ∀ i , j h i [ S ] = h j [ S ] ⇒ i = j

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A ↑ ↑ Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m } Find: Minimum cardinality set S s.t. ∀ i , j h i [ S ] = h j [ S ] ⇒ i = j

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A ↑ ↑ ↑ Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m } Find: Minimum cardinality set S s.t. ∀ i , j h i [ S ] = h j [ S ] ⇒ i = j

Problem statement Given: n haplotypes h 1 , . . . , h n with m SNPs each, e.g. h 1 = T A A C T A A C h 2 = C T A A C A A T h 3 = T T A A A C A A h 4 = A T A C A C A A ↑ ↑ Let h i [ S ] denote values of h i at positions in index set S ⊆ { 1 , . . . , m } Find: Minimum cardinality set S s.t. ∀ i , j h i [ S ] = h j [ S ] ⇒ i = j

Problem analysis Minimum haplotype tagging problem is NP-hard via reduction to minimum set cover (MSC) problem MSC problem: Given X = { x 1 , . . . , x n } and set F of subsets of X , find minimum cardinality C ⊆ F s.t. � c ∈ C c = X

Problem analysis Minimum haplotype tagging problem is NP-hard via reduction to minimum set cover (MSC) problem MSC problem: Given X = { x 1 , . . . , x n } and set F of subsets of X , find minimum cardinality C ⊆ F s.t. � c ∈ C c = X Simple greedy approximation algorithm

Greedy approximation algorithm For data matrix H , let d be discernability function � � d ( i ) := ( j , k ) | H ji � = H ki , j < k Example:   T A A C T A A C C T A A C A A T   H =   T T A A A C A A   A T A C A C A A

Greedy approximation algorithm For data matrix H , let d be discernability function � � d ( i ) := ( j , k ) | H ji � = H ki , j < k Example:   T A A C T A A C C T A A C A A T   H =   T T A A A C A A   A T A C A C A A d (1) = { (1 , 2) , (1 , 4) , (2 , 3) , (2 , 4) , (3 , 4) } d (2) = { (1 , 2) , (1 , 3) , (1 , 4) } d (3) = {}

Greedy approximation algorithm S ← ∅ cols ← { 1 , . . . , m } D ← d (cols) while D � = ∅ select c ∈ cols that maximizes | D ∩ d ( c ) | D ← D \ d ( c ) S ← S ∪ { c } return S

Algorithm example T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A

Algorithm example T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A (1 , 2) (1 , 2) (1 , 2) (1 , 2) (1 , 3) (1 , 2) (1 , 4) (1 , 3) (1 , 3) (1 , 3) (1 , 4) (1 , 3) (2 , 3) (1 , 4) (2 , 4) (1 , 4) (2 , 3) (1 , 4) (2 , 4) (3 , 4) (2 , 3) (2 , 4) (2 , 3) (3 , 4) (2 , 4) (2 , 4)

Algorithm example T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A (1 , 2) (1 , 2) (1 , 2) (1 , 2) (1 , 3) (1 , 2) (1 , 4) (1 , 3) (1 , 3) (1 , 3) (1 , 4) (1 , 3) (2 , 3) (1 , 4) (2 , 4) (1 , 4) (2 , 3) (1 , 4) (2 , 4) (3 , 4) (2 , 3) (2 , 4) (2 , 3) (3 , 4) (2 , 4) (2 , 4) D ← { (1 , 2) , (1 , 3) , (1 , 4) , (2 , 3) , (2 , 4) , (3 , 4) } ; S ← ∅

Algorithm example T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A (1 , 2) (1 , 2) (1 , 2) (1 , 2) (1 , 3) (1 , 2) (1 , 4) (1 , 3) (1 , 3) (1 , 3) (1 , 4) (1 , 3) (2 , 3) (1 , 4) (2 , 4) (1 , 4) (2 , 3) (1 , 4) (2 , 4) (3 , 4) (2 , 3) (2 , 4) (2 , 3) (3 , 4) (2 , 4) (2 , 4) D ← { (1 , 2) , (1 , 3) , (1 , 4) , (2 , 3) , (2 , 4) , (3 , 4) } ; S ← ∅ Algorithm step 1: Pick c = 1, 5 or 8 (say 1) D ← D \ d (1) = { 1 , 3 } S ← S ∪ { 1 } = { 1 }

Algorithm example T A A C T A A C C T A A C A A T T T A A A C A A A T A C A C A A (1 , 2) (1 , 2) (1 , 2) (1 , 2) (1 , 3) (1 , 2) (1 , 4) (1 , 3) (1 , 3) (1 , 3) (1 , 4) (1 , 3) (2 , 3) (1 , 4) (2 , 4) (1 , 4) (2 , 3) (1 , 4) (2 , 4) (3 , 4) (2 , 3) (2 , 4) (2 , 3) (3 , 4) (2 , 4) (2 , 4) D = { (1 , 3) } ; S = { 1 }

A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal - PowerPoint PPT Presentation

A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal Vinterbo 1 Stephan Dreiseitl 2 1 Decision Systems Groups Harvard Medical School, Boston, USA 2 Dept. of Software Engineering Upper Austria University of Applied Sciences

Read-based phasing for dense and accurate haplotyping of individual genomes Outline 1. Haplotype

What are polymorphisms? Genetic differences between individuals in a population.

Note: All prices, figures, sizes, specifications and information are subject to change without

NCBI2R - To navigate and annotate genes and SNPs. The Problem Genome Wide Analysis provides

Genome Wide Haplotype analyses Genome Wide Haplotype analyses of human complex diseases with the

PREDICTING SNPS AND HAPLOTYPES FROM PUBLIC EST DATA Jifeng Tang & Jack Leunissen Background

Forewords Tagging in a nutshell Sources Slides inspired by M. Rajman and J.-C. Chappelier,

Traffic UTM Tagging AdWords WebMaster Tools UTM TAGGING Where does my traffic come from? UTM

POS Tagging HMMs L645 / B659 Dept. of Linguistics, Indiana University Fall 2015 1 / 17 POS

Lecture Overview Web 2.0, Tagging, Multimedia, Introduction to Web 2.0 Overview of

Part Of Speech (POS) Tagging Based on Foundations of Statistical NLP by C. Manning & H.

Towards Gapless, Chromosome Scale, Haplotype Assemblies Matt Settles, PhD UC Davis

Feature-Based Tagging The Task, Again Recall: tagging ~ morphological disambiguation

IN4080 2020 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lnning 2 Tagging and sequence

1 2 Note approXimate! FXAA does not attempt anything close to the correct solution. Rather

Web-based Y-STR database for haplotype frequency estimation and kinship index calculation I S

Course Overview 02-715 Advanced Topics in Computa8onal

1 Consider two loci and 1 generation of random mating: Random association in gametes Alleles at

ROBOT CONTROL USING LIVING ROBOT CONTROL USING LIVING NEURONAL CELLS: PROGRESS AND CHALLENGES Vi

Selecting Features Note! First Work on core mechanics (movement, shooting, etc.)

The potential for CO measurements to estimate North American fossil fuel CO

Classicality, Complexity, Observership, and Fault Tolerance Charles H. Bennett (joint work with

10/29/2020 1 A False Dichotomy: Health or Education 2 Key Challenges 1. Relational 2.

p p p Locus B Totals AB A B B b p p p p ( 1 p ) Locus A A

A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal - PowerPoint PPT Presentation

A Note on Solution Sizes in the Haplotype Tagging SNPs Problem Staal Vinterbo 1 Stephan Dreiseitl 2 1 Decision Systems Groups Harvard Medical School, Boston, USA 2 Dept. of Software Engineering Upper Austria University of Applied Sciences

Read-based phasing for dense and accurate haplotyping of individual genomes Outline 1. Haplotype

What are polymorphisms? Genetic differences between individuals in a population.

Note: All prices, figures, sizes, specifications and information are subject to change without

NCBI2R - To navigate and annotate genes and SNPs. The Problem Genome Wide Analysis provides

Genome Wide Haplotype analyses Genome Wide Haplotype analyses of human complex diseases with the

PREDICTING SNPS AND HAPLOTYPES FROM PUBLIC EST DATA Jifeng Tang &amp; Jack Leunissen Background

Forewords Tagging in a nutshell Sources Slides inspired by M. Rajman and J.-C. Chappelier,

Traffic UTM Tagging AdWords WebMaster Tools UTM TAGGING Where does my traffic come from? UTM

POS Tagging HMMs L645 / B659 Dept. of Linguistics, Indiana University Fall 2015 1 / 17 POS

Lecture Overview Web 2.0, Tagging, Multimedia, Introduction to Web 2.0 Overview of

Part Of Speech (POS) Tagging Based on Foundations of Statistical NLP by C. Manning &amp; H.

Towards Gapless, Chromosome Scale, Haplotype Assemblies Matt Settles, PhD UC Davis

Feature-Based Tagging The Task, Again Recall: tagging ~ morphological disambiguation

IN4080 2020 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lnning 2 Tagging and sequence

1 2 Note approXimate! FXAA does not attempt anything close to the correct solution. Rather

Web-based Y-STR database for haplotype frequency estimation and kinship index calculation I S

Course Overview 02-715 Advanced Topics in Computa8onal

1 Consider two loci and 1 generation of random mating: Random association in gametes Alleles at

ROBOT CONTROL USING LIVING ROBOT CONTROL USING LIVING NEURONAL CELLS: PROGRESS AND CHALLENGES Vi

Selecting Features Note! First Work on core mechanics (movement, shooting, etc.)

The potential for CO measurements to estimate North American fossil fuel CO

Classicality, Complexity, Observership, and Fault Tolerance Charles H. Bennett (joint work with

10/29/2020 1 A False Dichotomy: Health or Education 2 Key Challenges 1. Relational 2.

p p p Locus B Totals AB A B B b p p p p ( 1 p ) Locus A A

PREDICTING SNPS AND HAPLOTYPES FROM PUBLIC EST DATA Jifeng Tang & Jack Leunissen Background

Part Of Speech (POS) Tagging Based on Foundations of Statistical NLP by C. Manning & H.