Whole Genome Comparison: Project Presentations

SLIDE 1

Whole Genome Comparison: Project Presentations

Felix Heeger, Max Homilius, Ivan Kel, Sabrina Krakau, Svenja Specovius, John Wiedenhoeft July 19, 2010

Whole Genome Comparison: Project Presentations

  • F. Heeger, M. Homilius, I. Kel, S. Krakau, S. Specovius, J. Wiedenhoeft
SLIDE 2

Outline

1 Evolutionary Events

2 A-Bruijn Alignment
  • Construction of the A-Bruijn graph
  • Simulation study
  • Chromatin Remodeling Complex
  • Carsonella

3 S-LAGAN

4 OSLay

SLIDE 3

Evolutionary events

Nucleotide deletion, insertion and point mutation

SLIDE 4

Collinear alignment

Columns of aligned sequences

SLIDE 5

More evolutionary events

Genome rearrangements: duplication, reversal and deletion of segments

SLIDE 6

Multidomain proteins

  • Diverged by rearrangements of modular units, e.g. domains
  • Multidomain proteins (MDPs) are difficult to align collinearly

SLIDE 7

Multidomain protein toy example

[Figure: four sequences (1 to 4) composed of domains A, B and C]

SLIDE 8

Collinear alignment

[Figure: sequences 1 to 4 with domains A, B and C]

It’s not possible to align all similar domains without reordering

SLIDE 9

Graph representation of alignments

[Figure: alignment graph with domains A, B and C; sequences 2 and 3]

  • Arcs: input sequences
  • Edges: matches
  • Some edges may be inconsistent: mixed cycles

SLIDE 10

Non-collinear alignment

[Figure: sequences 1 to 4 with domains A, B and C, and the resulting graph over C, A and B]

Allow large cycles of similar segments

SLIDE 11

Construction of the A-Bruijn Graph

SLIDE 12

Whirls and inconsistencies

[Figure: A-graph, vertices labeled A T A A T T C C A A T A A T T C A C]

[Figure: A-Bruijn graph, vertices labeled A C T A A C T]

SLIDE 13

Evaluating ABA

  • J. Wiedenhoeft

  • simulate sequence evolution using PAM (point accepted mutation)
  • two models of sequence evolution
      • geometric duplication/deletion model
      • rearrangement according to fragility model
  • true homology can be tracked to provide a gold standard

SLIDE 14

PAM sequence evolution

  • amino acid substitution modeled as a Markov process
  • PAM = transition matrix
  • using ABA’s BLAST subroutine with PAM30 provides a null model of character homology
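A minimal sketch of substitution as a Markov process, assuming independent sites. The three-letter alphabet and the transition probabilities are invented for illustration and are not PAM values; applying the one-step matrix 30 times only mirrors, in spirit, how a PAM30 matrix relates to the one-step PAM matrix.

```python
import random

# Toy alphabet and transition matrix (each row sums to 1). The numbers are
# made up for illustration and are NOT real PAM probabilities.
ALPHABET = "ABC"
TRANSITION = {
    "A": [0.98, 0.01, 0.01],
    "B": [0.02, 0.97, 0.01],
    "C": [0.01, 0.02, 0.97],
}


def evolve(sequence, steps, rng):
    """Apply `steps` rounds of independent per-site substitution."""
    current = list(sequence)
    for _ in range(steps):
        current = [
            rng.choices(ALPHABET, weights=TRANSITION[ch])[0] for ch in current
        ]
    return "".join(current)


rng = random.Random(0)          # fixed seed so the run is repeatable
ancestor = "ABCABCABCA"
descendant = evolve(ancestor, steps=30, rng=rng)
```

Because every site of `descendant` descends from the site at the same index in `ancestor`, the true character homology is known by construction.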

SLIDE 15

Geometric duplication/deletion model

  • pick position by uniform distribution
  • determine deletion or duplication by binomial distribution
  • determine direction by binomial distribution
  • determine length by geometric distribution
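The four sampling steps above can be sketched as a single event. The parameter names (`p_dup`, `p_right`, `p_stop`) and the choice to place a duplicate directly after the copied segment are assumptions made for this sketch; the slide does not specify them. Each binomial choice is drawn as a single Bernoulli trial.

```python
import random


def geometric_length(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    length = 1
    while rng.random() >= p:
        length += 1
    return length


def mutate_once(seq, p_dup, p_right, p_stop, rng):
    """Apply one duplication or deletion event to the string `seq`."""
    if not seq:
        return seq
    pos = rng.randrange(len(seq))            # position: uniform
    duplicate = rng.random() < p_dup         # duplication vs. deletion
    rightwards = rng.random() < p_right      # direction
    length = geometric_length(p_stop, rng)   # segment length: geometric
    if rightwards:
        start, end = pos, min(len(seq), pos + length)
    else:
        start, end = max(0, pos - length + 1), pos + 1
    segment = seq[start:end]
    if duplicate:
        # Assumption: the copy is inserted right after the original segment.
        return seq[:end] + segment + seq[end:]
    return seq[:start] + seq[end:]
```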

SLIDE 16

Fragility model

  • models only translocations
  • successful translocation increases the chance of a segment being translocated again ⇒ models conservation of substructures
  • boundaries weighted by length of substructure
  • borders of substructures are preferred as insertion spots ⇒ prevents disruption of other substructures

SLIDE 17

Score

  • true negatives are vast due to the low number of paralogs and the alignment bias (BLAST)
  • hence precision and accuracy are not suitable measures

$$\text{score} = \frac{FP + FN}{FP + FN + TP}$$
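A small sketch of how this score could be computed from a gold standard. Representing homology as sets of aligned position pairs is an assumption of this example. The score ignores true negatives entirely and equals one minus the Jaccard index of the two sets.

```python
def confusion_from_pairs(true_pairs, predicted_pairs):
    """Count TP, FP and FN between two sets of homologous position pairs."""
    tp = len(true_pairs & predicted_pairs)
    fp = len(predicted_pairs - true_pairs)
    fn = len(true_pairs - predicted_pairs)
    return tp, fp, fn


def error_score(tp, fp, fn):
    """(FP + FN) / (FP + FN + TP): 0.0 is perfect, 1.0 is the worst case."""
    denominator = fp + fn + tp
    if denominator == 0:
        raise ValueError("score is undefined when TP, FP and FN are all zero")
    return (fp + fn) / denominator


# Hypothetical pairs (position in sequence 1, position in sequence 2).
truth = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 5)}
score = error_score(*confusion_from_pairs(truth, predicted))
```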

SLIDE 18

Results

SLIDE 19

Analysing Multidomain Proteins with ABA

  • M. Homilius

Non-collinear alignment applied to multidomain proteins (MDPs). Histone Deacetylation / Chromatin Remodeling Complexes.

SLIDE 20

HATs / CRCs

Regulation of gene expression.

SLIDE 21

Dataset

  • 262 proteins found in the literature and manually annotated. Thanks to Sebastian, Ivan and Christoph!
  • From S. cerevisiae, S. pombe, D. melanogaster and H. sapiens

SLIDE 22

Questions

  • Can ABA recognize domain-like structures?
  • Do domains move around in the complexes?
  • What structures occur often?

SLIDE 23

Output of ABA

[Figure: rendering of the ABA output graph with numeric node and edge labels]

Applied to only 2 species. Rendering takes a long time. Hard to interpret (manually).

SLIDE 24

Parsing output of ABA

Applied to 4 species. Reconstructed A-Bruijn Graph from ABA-Output.

SLIDE 25

Distribution of edge multiplicity

High-weight edges point to conserved and repeated elements, within and across proteins. (Girth parameter did not seem to work.)

SLIDE 26

Distribution of edge multiplicity (filtered)

Filtered distribution of the multiplicity of edges (length > 40).
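A sketch of how such a filtered distribution could be tallied. The `(length, multiplicity)` tuples are a hypothetical edge representation with invented values; they are not taken from the ABA output.

```python
from collections import Counter


def multiplicity_histogram(edges, min_length=0):
    """Count edges per multiplicity, keeping only edges longer than min_length."""
    return Counter(mult for length, mult in edges if length > min_length)


# Hypothetical (length, multiplicity) pairs, invented for this example.
edges = [(12, 1), (55, 1), (80, 16), (41, 16), (40, 3), (200, 2)]
unfiltered = multiplicity_histogram(edges)
filtered = multiplicity_histogram(edges, min_length=40)
```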

SLIDE 27

Comparison with PFAM-Annotation

Hidden Markov models learned from multiple sequence alignments.

SLIDE 28

Comparison with PFAM-Annotation

Annotated all proteins with PFAM/HMMER. Detected 561 domains (not unique).

SLIDE 29

Distribution of edges with domains

≈ 210 edges of multiplicity 1. ≈ 150 edges of multiplicity 16.

SLIDE 30

Repeated domains

Domains seem to share edges in ABA-graph.

SLIDE 31

Repeated domains

Domain             Average Multiplicity
DUF1679            21.0
Elf1               21.0
DUF1825            21.0
Fib alpha          21.0
ZZ                 17.7
Otopetrin          17.0
CDK5 activator     17.0
...                ...
RFX DNA binding    1.0
zfC5HC2            1.0
DUF1542            1.0
Rep N              1.0
DUF3619            1.0
TIP49              1.0
HTH Mga            1.0

SLIDE 32

What’s next?

  • Do ABA-edges correlate with found domains?
  • Apply real null model. Significance tests.
  • Can ABA be used to complement the domains found with HMMER?

SLIDE 33

Non-Collinear Alignment: Reannotation of genomes.

Carsonella ruddii: an interesting thing

  • Unclassified γ-proteobacterium (like, e.g., E. coli)
  • Sequenced in 2006

SLIDE 34

Carsonella ruddii

what is it?

Smallest bacterial genome known → 160 kb (!). E. coli has 4.5 Mb

Smallest genome before Carsonella

362 protein-coding genes in Buchnera aphidicola BCc

SLIDE 35

Carsonella ruddii

what is it?

GC content: very low (16%). E. coli: 50%

GC content

GC content (guanine-cytosine content) is the percentage of bases in a DNA molecule that are either guanine or cytosine.
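The definition translates directly into a few lines of code. Ignoring characters other than A, C, G and T (for example N) is a choice made for this sketch.

```python
def gc_content(sequence):
    """Percentage of G and C among the A, C, G and T bases of `sequence`."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


example = gc_content("ATGCATTA")   # 2 of 8 bases are G or C
```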

SLIDE 36

Carsonella ruddii

what is it?

  • GC content: very low (16%). E. coli: 50%
  • First annotation: 213 genes. E. coli: 4400 genes

Minimal set of genes for life

Moya A. et al. proposed in 2003 that the minimal gene set for endosymbiotic life is close to 313.

SLIDE 40

Interesting question

  • DNA replication and repair system is strongly degraded.
  • Transcription machinery is reduced to core subunits of RNA polymerase (no promoter recognition).
  • Translation machinery is highly reduced (three essential rRNAs are present).
  • No Shine-Dalgarno sequence present (the way it is defined).

16S rRNA and Shine-Dalgarno Sequence

Shine-Dalgarno (SD) is a regulatory sequence strongly involved in translation of bacterial polycistronic mRNAs.

SLIDE 42

Interesting question

Is Carsonella ruddii a living cell?

9 aminoacyl-tRNA synthetases and 15 of the 50 essential ribosomal proteins are missing or degraded.

Two different theories

  • C. ruddii is a bacterium that is undergoing the change to an endosymbiont.
  • C. ruddii is a former primary endosymbiont that is being driven towards extinction and replacement by a new symbiont.

SLIDE 45

Current Annotation

What has been done until now

  • 2006: First annotation (213 genes)
  • 2007: Second annotation
  • Both teams used well-known gene-prediction algorithms + collinear alignment
  • Problem: over-annotation of gene function. Many genes that are believed to be orthologous are much shorter and therefore differ in their function.

My goal

Use a non-collinear alignment algorithm to reannotate the whole genome of C. ruddii

SLIDE 46

Reannotation

Algorithms

  • SuperMap + S-LAGAN
  • A-Bruijn Alignment (ABA)

SLIDE 47

S-LAGAN

Species used

  • Carsonella ruddii PV: 160 kb (213 genes)
  • Buchnera aphidicola BCc (Cc) (+ a plasmid): 450 kb (397 genes)
  • Candidatus Blochmannia floridanus: 705 kb (631 genes)
  • Wigglesworthia glossinidia (+ a plasmid): 698 kb (651 genes)
  • Baumannia cicadellinicola str. Hc: 686 kb (651 genes)

SLIDE 49

S-LAGAN

plus SuperMap

A guide tree (evolutionary tree) was built from the 16S rRNAs of the species.

  • Neighbor-joining tree
  • Maximum-likelihood tree

SLIDE 50

Trees of 16S rRNA sequence

[Figure: neighbor-joining tree (outtree_phylip_nj, Sun Jul 18 17:54:23 2010); leaves Baumannia, Blochmannia, Buchnera, Wigglesworthia, Carsonella; scale bar 0.02]

[Figure: PhyML tree (all_16S_rRNA_alignment_phylip_format.faa_phyml_tree.txt, Sun Jul 18 17:56:23 2010); leaves Wigglesworthia, Buchnera, Blochmannia, Baumannia, Carsonella; scale bar 0.05]

SLIDE 52

ABA

Using “my” 5 Species

Species

  • 0 and 5: Wigglesworthia
  • 1 and 6: Buchnera aphidicola
  • 2 and 7: Carsonella ruddii
  • 3 and 8: Blochmannia
  • 4 and 9: Baumannia

SLIDE 54

ABA

Using “Moya’s” Species

Species

  • 0 and 6: Buchnera aphidicola str. Cc
  • 1 and 7: Buchnera aphidicola str. Bp
  • 2 and 8: Buchnera aphidicola str. Sg
  • 3 and 9: Buchnera aphidicola str. APS
  • 4 and 10: Carsonella ruddii
  • 5 and 11: E. coli

SLIDE 55

ABA

Only using Carsonella and E. coli

2 species (Carsonella and E. coli) produce the same alignment as the 6 species from the Moya paper

SLIDE 57

ABA

Gene prediction

  • 7 genes of 213 were cut by the prediction in C. ruddii.
  • 22 genes of 4494 were cut by the prediction in E. coli.

Example

region     0 - 46219 : 56 genes
region 46219 - 47795 :  0 genes
region 47795 - 53155 : 10 genes
region 53155 - 53218 :  0 genes
region 53218 - 54412 :  4 genes
region 54412 - 56011 :  0 genes
region 56011 - 58258 :  4 genes
region 58258 - 59412 :  0 genes
region 59412 - 65459 :  8 genes
region 65459 - 67041 :  1 gene
region 67041 - 70177 :  4 genes
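A sketch of how genes per region and cut genes might be counted. The first four region borders are taken from the example above; the gene coordinates are hypothetical, and treating genes as half-open intervals `[start, end)` is an assumption of this sketch, not the procedure used in the project.

```python
from bisect import bisect_right


def count_genes(boundaries, genes):
    """Count genes per region and genes cut by a region border.

    boundaries: sorted list of region borders.
    genes: list of (start, end) half-open intervals with start < end.
    Returns (list of genes fully inside each region, number of cut genes).
    """
    inside = [0] * (len(boundaries) - 1)
    cut = 0
    for start, end in genes:
        first = bisect_right(boundaries, start) - 1    # region of first base
        last = bisect_right(boundaries, end - 1) - 1   # region of last base
        if first != last:
            cut += 1
        elif 0 <= first < len(inside):
            inside[first] += 1
    return inside, cut


boundaries = [0, 46219, 47795, 53155]   # borders from the example above
# Hypothetical gene coordinates, invented for illustration.
genes = [(100, 900), (46000, 46500), (48000, 49000), (50000, 53155)]
per_region, cut_genes = count_genes(boundaries, genes)
```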

SLIDE 60

A possible future

  • There are still at least 29 genes with no assigned function.
  • Insights into the possibility of creating symbiotic life.

SLIDE 61

Project: Reimplementation of S-LAGAN Using SeqAn

  • F. Heeger, S. Specovius

1 Introduction to S-LAGAN
2 Implementation and Problems
3 Results
SLIDE 62

S-LAGAN

Shuffle-Limited Area Global Alignment of Nucleotides

  • S-LAGAN computes glocal alignments of 2 sequences → set of local alignments which cover the whole sequence
  • S-LAGAN is able to handle rearrangements

SLIDE 63

S-LAGAN

Rearrangements

[Figure panels: No rearrangements, Translocation, Inversion, Duplication]

SLIDE 64

S-LAGAN

Overview

1 Computation of local alignments
2 Chaining
3 Realignment of consistent subchains
SLIDE 65

S-LAGAN

1. Computation of local alignments

  • S-LAGAN uses CHAOS for this step
  • Applies CHAOS twice
      → Sequence 1 with sequence 2
      → Sequence 1 with reverse complement of sequence 2
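The second pass needs the reverse complement of sequence 2, which can be sketched as below. The helper `to_forward_coordinate` is a hypothetical addition for this example that maps a position on the reverse complement back to the forward strand.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coordinate(pos, length):
    """Map a 0-based position on the reverse complement to the forward strand."""
    return length - 1 - pos


seq2 = "ATGCCG"
rc = reverse_complement(seq2)
```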

SLIDE 66

S-LAGAN

2. Chaining

1-monotonic

SLIDE 67

S-LAGAN

3. Realignment of consistent subchains

  • Consistent (collinear) subchains are globally aligned
  • S-LAGAN uses LAGAN for this step

SLIDE 68

Implementation and Problems

Goal

  • Implementation in SeqAn
  • Extract CHAOS from SeqAn implementation of LAGAN
  • Implement 1-monotonic chaining
  • Use existing SeqAn implementation of LAGAN

SLIDE 69

Implementation and Problems

Local Alignments

  • Find seeds with q-gram index
  • Merge overlapping seeds
  • Chain seeds with CHAOS algorithm
      → Segmentation fault on certain data
      → Only gap-free local matches
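The first two steps can be illustrated in plain Python. This is a sketch of the general q-gram idea, not the SeqAn implementation: the dictionary-based index and the rule of merging hits that overlap or touch on the same diagonal are assumptions made here.

```python
from collections import defaultdict


def qgram_index(seq, q):
    """Map every q-gram of `seq` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - q + 1):
        index[seq[i:i + q]].append(i)
    return index


def find_seeds(seq1, seq2, q):
    """Return gap-free matches as (pos1, pos2, length) tuples.

    q-gram hits on the same diagonal (pos2 - pos1) that overlap or touch
    are merged into one longer seed.
    """
    index = qgram_index(seq1, q)
    hits = sorted(
        (j - i, i)
        for j in range(len(seq2) - q + 1)
        for i in index.get(seq2[j:j + q], [])
    )
    seeds = []  # entries are (diagonal, start in seq1, length)
    for diagonal, i in hits:
        if seeds:
            last_diag, start, length = seeds[-1]
            if last_diag == diagonal and i <= start + length:
                seeds[-1] = (diagonal, start, max(length, i - start + q))
                continue
        seeds.append((diagonal, i, q))
    return [(start, start + diag, length) for diag, start, length in seeds]


seeds = find_seeds("AAACGTTT", "CCACGTGG", q=3)
```

Every returned seed is an exact, gap-free match, which mirrors the limitation noted in the last bullet.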

SLIDE 70

Implementation and Problems

Chaining

  • Graph with nodes representing local matches
  • Edges to all matches which can be chained 1-monotonically
      → Heaviest path (Bellman-Ford algorithm), $O(n^3)$
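A sketch of heaviest-chain selection under a simplified reading of 1-monotonic chaining: consecutive matches must be ordered and non-overlapping in sequence 1, while positions in sequence 2 are unrestricted. Since the resulting graph is acyclic, this example swaps in a quadratic dynamic program over matches sorted by start position in place of the Bellman-Ford search named above. Rearrangement penalties are omitted, and match weights are assumed to be positive.

```python
def heaviest_1_monotonic_chain(matches):
    """Pick the heaviest chain of matches that is increasing in sequence 1.

    matches: list of (start1, end1, start2, end2, weight), half-open in
    sequence 1 with start1 < end1. Sequence 2 coordinates are not constrained.
    Returns (total weight, chain as a list of matches).
    """
    if not matches:
        return 0, []
    order = sorted(matches, key=lambda m: (m[0], m[1]))
    best = [m[4] for m in order]      # best chain weight ending at each match
    prev = [None] * len(order)        # predecessor index for traceback
    for j, cur in enumerate(order):
        for i in range(j):
            if order[i][1] <= cur[0] and best[i] + cur[4] > best[j]:
                best[j] = best[i] + cur[4]
                prev[j] = i
    j = max(range(len(order)), key=best.__getitem__)
    total, chain = best[j], []
    while j is not None:
        chain.append(order[j])
        j = prev[j]
    return total, chain[::-1]


# Hypothetical matches; the chosen chain jumps backwards in sequence 2,
# which is what a translocation looks like.
matches = [
    (0, 10, 50, 60, 10),
    (5, 20, 0, 15, 15),
    (20, 30, 5, 15, 10),
    (12, 18, 70, 76, 6),
]
total, chain = heaviest_1_monotonic_chain(matches)
```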

SLIDE 71

Implementation and Problems

Realign Consistent Subchains

  • Find consistent subchains
  • Align them with global alignment algorithm
  • LAGAN runs into an endless loop on certain data
      → Use Needleman-Wunsch algorithm
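For reference, a compact Needleman-Wunsch sketch with a linear gap cost. The scoring values (match 1, mismatch -1, gap -1) are placeholders chosen for this example, not the parameters used in the project.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),   # (mis)match
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Traceback from the bottom-right corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("ACGT", "AGT")
```

The table needs time and memory proportional to the product of the two lengths, which fits the later observation that the implementation is only practical on small inputs.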

SLIDE 72

Results

Our implementation...

  • is very slow
  • can be used on small data, like virus genomes (∼ 5000 bp)
  • finds manually inserted rearrangements

SLIDE 73

Introduction OSL

Motivation

Assume there are two assemblies obtained from different assemblers:

SLIDE 74

Introduction OSL

Whole Genome Shotgun Approach (WGS)

Aim: Assemble a genome sequence from given reads.

Reads
  → Collection of short sequences
  → Obtained from an automated sequencer
  → Orientation is not known

SLIDE 75

Introduction OSL

Whole Genome Shotgun Approach (WGS)

Assemble overlapping reads together to obtain contigs.

Contigs
  → Large, contiguous fragments of assembled reads

SLIDE 76

Introduction OSL

Assembly Layout

Problem: order and orientation of contigs are unknown

↓

Search for a good assembly layout!

SLIDE 77

OSL

Optimal Syntenic Layout of unfinished assemblies

SLIDE 82

OSL Idea

Maximize no. of extended local diagonals

  • permute and flip contigs of assembly A
  • switch roles of A and B
  • Independence in constructing the layouts of A and B!

SLIDE 84

The OSL Problem

Basics

Assemblies

$A = (a_1, \dots, a_p)$, $B = (b_1, \dots, b_q)$

Set of Matches

$M = (m_1, \dots, m_r)$

SLIDE 86

The OSL Problem

Layout

Local diagonal extension

$c$ and $c'$ form a local diagonal extension iff $y \sim y'$ and $o = o'$

Weight of extension

$w + w' - |y - y'|$
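The condition and weight above can be sketched in a few lines. The slides do not define the symbols, so this example assumes that y is a position in the other assembly, o is an orientation, and w is the weight of the underlying match. It also reads y ∼ y′ as "within a tolerance", using a hypothetical `max_gap` parameter. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Connector:
    y: int    # assumed meaning: position in the other assembly
    o: int    # assumed meaning: orientation, +1 or -1
    w: float  # assumed meaning: weight of the underlying match


def extension_weight(c: Connector, c2: Connector, max_gap: int) -> Optional[float]:
    """Return w + w' - |y - y'| if c and c2 form a local diagonal extension.

    'y ~ y'' is interpreted here as |y - y'| <= max_gap, which is an
    assumption of this sketch. Returns None if the pair does not qualify.
    """
    gap = abs(c.y - c2.y)
    if c.o != c2.o or gap > max_gap:
        return None
    return c.w + c2.w - gap


weight = extension_weight(Connector(1000, 1, 500.0), Connector(1040, 1, 300.0), max_gap=100)
```

The subtraction of |y − y′| means that two matches landing far apart in the other assembly contribute less than two matches that line up closely.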

slide-87
SLIDE 87

Project: Assembly Comparison

Goal

  1. Assemble a set of reads with two different assemblers
  2. Compare the results using layout software → OSLay

slide-89
SLIDE 89

Project: Assembly Comparison

1. Assemble a set of reads with two different assemblers

  • Reads of chromosome 21
  • Assemblers: MIRA and Celera (WGS)

slide-90
SLIDE 90

Project: Assembly Comparison

Problem: the WGS assembler doesn't work with the given reads
↓
Plan B:

  • Take the given sequence of chr. 21
  • Create artificial contigs

slide-92
SLIDE 92

Project: Assembly Comparison

Create artificial contigs:

slide-93
SLIDE 93

Project: Assembly Comparison

BLAST

Assemblies are from the same sequence
↓
Megablast (the BLAST mode optimized for highly similar sequences)
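One way to run such a comparison today is sketched below. The slides do not say which BLAST version or options were used, so this example assumes the BLAST+ `blastn` program with `-task megablast`. The file names and the choice of tabular output are hypothetical.

```python
def megablast_command(query_fasta, subject_fasta, out_path):
    """Build a BLAST+ command line comparing two FASTA files with megablast.

    Assumes the BLAST+ 'blastn' executable; the tabular output format (6)
    is chosen for illustration and may not be what a layout tool expects.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-subject", subject_fasta,
        "-outfmt", "6",
        "-out", out_path,
    ]


# Hypothetical file names for the two artificial assemblies.
cmd = megablast_command("assembly_A.fasta", "assembly_B.fasta", "matches.tsv")
# To execute (requires BLAST+ on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```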

slide-94
SLIDE 94

Project: Assembly Comparison

OSLay

OSLay is the implementation of the OSL algorithm.

Input:

  • target assembly
  • reference assembly
  • matches (e.g. BLAST)

Output:

  • original layout
  • new layout

slide-95
SLIDE 95

Project: Assembly Comparison

OSLay

Problem: Input too large for OSLay

  • Chr. 21 ∼ 34 MB

↓
Plan B: segment of 210 KB

slide-96
SLIDE 96

Project: Assembly Comparison

OSLay

Assembly A: sequence divided into 100 parts
Assembly B: sequence divided into 19 parts
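A minimal sketch of this step, assuming the parts are consecutive and of (nearly) equal length. The function name and the toy sequence are made up for the example. Because 19 shares no factor with 100, the two sets of contig borders rarely coincide, which may be why these numbers were chosen.

```python
def split_evenly(seq, n_parts):
    """Cut seq into n_parts consecutive pieces whose lengths differ by at most 1."""
    if not 1 <= n_parts <= len(seq):
        raise ValueError("n_parts must be between 1 and len(seq)")
    base, extra = divmod(len(seq), n_parts)
    pieces, start = [], 0
    for i in range(n_parts):
        # The first 'extra' pieces get one additional base.
        end = start + base + (1 if i < extra else 0)
        pieces.append(seq[start:end])
        start = end
    return pieces


# Toy 2100 bp sequence standing in for the 210 KB segment.
segment = "ACGT" * 525
assembly_a = split_evenly(segment, 100)
assembly_b = split_evenly(segment, 19)
```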

slide-100
SLIDE 100

Project: Assembly Comparison

OSLay

False connections:

slide-104
SLIDE 104

Project: Assembly Comparison

OSLay

Create contigs with random length:

  • Assembly A: lengths between 500 and 5000 bp (∼ 100 contigs)
  • Assembly B: lengths between 1000 and 200000 bp (∼ 20 contigs)
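The random variant could look roughly like the sketch below. It assumes lengths are drawn uniformly from the given range and that contigs are cut consecutively without gaps or overlaps; the slides do not specify either. The function name, seed, and toy sequence are illustrative.

```python
import random


def split_randomly(seq, min_len, max_len, seed=None):
    """Cut seq into consecutive pieces with lengths drawn uniformly
    from [min_len, max_len].

    The final piece is whatever remains, so it may be shorter than min_len.
    """
    rng = random.Random(seed)
    pieces, start = [], 0
    while start < len(seq):
        length = rng.randint(min_len, max_len)
        pieces.append(seq[start:start + length])
        start += length
    return pieces


# Toy 210,000 bp sequence standing in for the real segment.
segment = "ACGT" * 52500
random_a = split_randomly(segment, 500, 5000, seed=1)
```

Fixing the seed makes a run reproducible; using different seeds or length ranges for the two assemblies gives them different contig borders.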

slide-107
SLIDE 107

Project: Assembly Comparison

OSLay

Discussion

  • Works only with similar sequences
  • But: contig borders of the assemblies should be different
  • Just for small genomes
