A Note on Solution Sizes in the Haplotype Tagging SNPs Problem

Staal Vinterbo¹, Stephan Dreiseitl². ¹Decision Systems Groups, Harvard Medical School, Boston, USA; ²Dept. of Software Engineering, Upper Austria University of Applied Sciences


  1. A Note on Solution Sizes in the Haplotype Tagging SNPs Problem. Staal Vinterbo¹, Stephan Dreiseitl². ¹Decision Systems Groups, Harvard Medical School, Boston, USA; ²Dept. of Software Engineering, Upper Austria University of Applied Sciences, Hagenberg, Austria. EMSS 2006, Barcelona, Oct. 5th, 2006

  2. Overview • Biomedical background • Haplotype tagging problem • Approximation algorithm • Solution size investigation

  6. Genomic biology primer: Genetic information is stored in DNA, ∼3 billion base pairs (Homo sapiens). Compare with Amoeba dubia: ∼670 billion base pairs. Genes are protein-coding regions and comprise about 3–5% of the human genome (∼25 000 genes). Compare with Oryza sativa: ∼40 000–60 000 genes.

  8. Genomic biology primer: DNA of individual humans is 99.9% identical. Sequence variations:

  10. Genomic biology primer: Single nucleotide polymorphism (SNP): small genetic variation in a person’s DNA sequence, e.g. T A A C T A A C vs. T A A A T A A C (SNP at position 4). Consider only variations occurring in > 1% of the population. About 10 million SNPs currently known.

  11. Effects of sequence variations

  18. Genomic biology primer: SNPs in close proximity are not independent (linkage disequilibrium). Haplotype: set of associated SNPs in a region of a chromosome, e.g. the four haplotypes T A A C T A A C, C T A A C A A T, T T A A A C A A and A T A C A C A A. Haplotype-tagging SNPs (htSNPs): those SNPs that allow the unique identification of a haplotype (two positions are marked with arrows on the slide).

  22. Medical significance: Diseases caused by variations in single genes are rare. Common disorders are caused by the combined effect of different variants and environmental influences. Association studies: compare haplotypes of diseased individuals with those of healthy individuals. Advantages of htSNPs: only a few locations need to be tested to identify the complete haplotype.

  30. Problem statement: Given: $n$ haplotypes $h_1, \dots, h_n$ with $m$ SNPs each, e.g.
      $h_1$ = T A A C T A A C
      $h_2$ = C T A A C A A T
      $h_3$ = T T A A A C A A
      $h_4$ = A T A C A C A A
      (two positions are marked with arrows on the slide). Let $h_i[S]$ denote the values of $h_i$ at the positions in index set $S \subseteq \{1, \dots, m\}$. Find: minimum-cardinality set $S$ s.t. $\forall i, j:\ h_i[S] = h_j[S] \Rightarrow i = j$.
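
A minimal Python sketch of the problem statement above, assuming haplotypes are stored as equal-length strings and positions are 1-based as on the slide. The helper names (`project`, `is_tagging_set`, `minimum_tagging_sets`) are illustrative and not taken from the presentation, and the exhaustive search is only feasible for tiny instances.

```python
from itertools import combinations

# Example haplotypes from the slide, one string per haplotype.
HAPLOTYPES = ["TAACTAAC", "CTAACAAT", "TTAAACAA", "ATACACAA"]


def project(h, S):
    """Return h[S]: the values of haplotype h at the 1-based positions in S."""
    return tuple(h[i - 1] for i in sorted(S))


def is_tagging_set(haplotypes, S):
    """True if restricting to the positions in S keeps all haplotypes distinct."""
    projections = [project(h, S) for h in haplotypes]
    return len(set(projections)) == len(projections)


def minimum_tagging_sets(haplotypes):
    """Exhaustively collect all minimum-cardinality tagging sets (exponential time)."""
    m = len(haplotypes[0])
    for size in range(m + 1):
        found = [set(S) for S in combinations(range(1, m + 1), size)
                 if is_tagging_set(haplotypes, S)]
        if found:
            return found
    return []  # reached only if two haplotypes are identical
```

On the four example haplotypes no single position suffices, because no column carries four distinct values, while `{1, 2}` is one of the two-position solutions.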

  32. Problem analysis: Minimum haplotype tagging problem is NP-hard; it can be reduced to the minimum set cover (MSC) problem. MSC problem: given $X = \{x_1, \dots, x_n\}$ and a set $F$ of subsets of $X$, find a minimum-cardinality $C \subseteq F$ s.t. $\bigcup_{c \in C} c = X$. Simple greedy approximation algorithm.
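
To make the MSC definition concrete, here is a small Python sketch with an exhaustive (exponential-time) search. The function names and the toy instance are assumptions made for illustration; they do not come from the presentation.

```python
from itertools import combinations


def is_cover(X, C):
    """True if the union of the subsets in C equals X."""
    return set().union(*C) == set(X)


def minimum_set_cover(X, F):
    """Return a minimum-cardinality sub-collection of F covering X, or None.

    Tries all sub-collections in order of increasing size, so this is
    only usable for very small F.
    """
    for size in range(len(F) + 1):
        for C in combinations(F, size):
            if is_cover(X, C):
                return list(C)
    return None


# Hypothetical toy instance.
X = {1, 2, 3, 4, 5}
F = [frozenset({1, 2, 3}), frozenset({2, 4}),
     frozenset({3, 4}), frozenset({4, 5})]
cover = minimum_set_cover(X, F)
```

In this toy instance no single subset covers `X`, so the search returns a cover of size two.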

  34. Greedy approximation algorithm: For data matrix $H$, let $d$ be the discernibility function $d(i) := \{ (j, k) \mid H_{ji} \neq H_{ki},\ j < k \}$. Example:
      H = [ T A A C T A A C
            C T A A C A A T
            T T A A A C A A
            A T A C A C A A ]
      $d(1) = \{(1,2), (1,4), (2,3), (2,4), (3,4)\}$
      $d(2) = \{(1,2), (1,3), (1,4)\}$
      $d(3) = \{\}$
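
The discernibility function can be sketched in a few lines of Python. This assumes the matrix `H` is stored as a list of equal-length strings and that rows and columns are numbered from 1, as on the slide; the signature `d(H, i)` is an illustrative choice.

```python
# Data matrix H from the slide, one row per haplotype.
H = ["TAACTAAC", "CTAACAAT", "TTAAACAA", "ATACACAA"]


def d(H, i):
    """Pairs (j, k) with j < k whose rows differ in column i (all 1-based)."""
    n = len(H)
    return {(j, k)
            for j in range(1, n + 1)
            for k in range(j + 1, n + 1)
            if H[j - 1][i - 1] != H[k - 1][i - 1]}
```

Calling `d(H, 1)`, `d(H, 2)` and `d(H, 3)` reproduces the three sets listed on the slide; column 3 is constant, so it discerns no pair.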

  35. Greedy approximation algorithm:
      S ← ∅
      cols ← {1, …, m}
      D ← d(cols)
      while D ≠ ∅
          select c ∈ cols that maximizes |D ∩ d(c)|
          D ← D \ d(c)
          S ← S ∪ {c}
      return S
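
The pseudocode above translates fairly directly into Python. The sketch below assumes `H` is a list of equal-length strings, reads `d(cols)` as the union of `d(c)` over all columns, and breaks ties by taking the lowest column index; the slide leaves tie-breaking open, so that rule is an assumption, as are the function names.

```python
def discernibility(H, c):
    """Pairs (j, k) with j < k whose rows differ in column c (all 1-based)."""
    n = len(H)
    return {(j, k)
            for j in range(1, n + 1)
            for k in range(j + 1, n + 1)
            if H[j - 1][c - 1] != H[k - 1][c - 1]}


def greedy_tagging(H):
    """Greedy set-cover heuristic: returns the chosen columns in pick order."""
    cols = range(1, len(H[0]) + 1)
    d = {c: discernibility(H, c) for c in cols}
    D = set().union(*d.values())  # pairs that some column can discern
    S = []
    while D:
        # max() returns the first maximal element, i.e. the lowest index on ties.
        c = max(cols, key=lambda col: len(D & d[col]))
        D -= d[c]
        S.append(c)
    return S


H = ["TAACTAAC", "CTAACAAT", "TTAAACAA", "ATACACAA"]
selected = greedy_tagging(H)
```

With this tie-breaking rule the example matrix yields columns 1 and 2: column 1 removes five of the six pairs, and column 2 is the first column that discerns the remaining pair (1, 3). Because `D` only contains pairs that at least one column discerns, the loop always terminates, but identical rows would remain indistinguishable.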

  39. Algorithm example:
      column: 1 2 3 4 5 6 7 8
      h_1:    T A A C T A A C
      h_2:    C T A A C A A T
      h_3:    T T A A A C A A
      h_4:    A T A C A C A A
      Pairs discerned by each column:
      d(1) = {(1,2), (1,4), (2,3), (2,4), (3,4)}
      d(2) = {(1,2), (1,3), (1,4)}
      d(3) = {}
      d(4) = {(1,2), (1,3), (2,4), (3,4)}
      d(5) = {(1,2), (1,3), (1,4), (2,3), (2,4)}
      d(6) = {(1,3), (1,4), (2,3), (2,4)}
      d(7) = {}
      d(8) = {(1,2), (1,3), (1,4), (2,3), (2,4)}
      D ← {(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)}; S ← ∅
      Algorithm step 1: Pick c = 1, 5 or 8 (say 1)
      D ← D \ d(1) = {(1,3)}
      S ← S ∪ {1} = {1}

  40. Algorithm example (haplotype matrix and per-column pairs as on slide 39). State after step 1: D = {(1,3)}; S = {1}
