Miklós Csűrös, Department of Computer Science, Yale University - PowerPoint PPT Presentation


Feb 06, 2023


SLIDE 1

FAST RECOVERY OF EVOLUTIONARY TREES

WITH THOUSANDS OF LEAVES

Miklós Csűrös

Department of Computer Science Yale University

SLIDE 2

Molecular evolution

evolutionary tree (Noro et al. 1998)

Woolly mammoth African elephant Asian elephant Dugong Manatee

homologous gene sequences

Woolly mammoth    ...CTAAATCATCACTGATC--AAAGAGAGC...
African elephant  ...CTAAATCATCACCGATC--AAAGAGAGC...
Asian elephant    ...CTAAATCATCGCTGATC--AAAGAGAGC...
Dugong            ...TTAAATCACTCCCGATCATAAAG-GAGC...
Manatee           ...TCAAATCATTACTGACCATAAAG-GAGC...

differences between sequences grow with time

SLIDE 3

Markov model

each character evolves independently
root sequence characters are i.i.d.
character transitions on edges

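[Figure: two-state transition diagram with mutation probabilities p and q (and 1-p, 1-q for no change); example parent sequence 10010..., child sequence 11010...]

character at node u: $\xi(u)$, random variables forming a Markov chain on each path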
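The edge transitions above can be sketched in a few lines of Python. This is only an illustration: the function name `evolve` is hypothetical, and reading p as the 0-to-1 and q as the 1-to-0 mutation probability is an assumption, since the slide's diagram did not survive extraction.

```python
import random

def evolve(parent, p, q, rng):
    """Mutate each character independently along one edge.

    Assumption: a 0 becomes 1 with probability p, a 1 becomes 0 with probability q.
    """
    child = []
    for c in parent:
        flip_prob = p if c == 0 else q
        child.append(1 - c if rng.random() < flip_prob else c)
    return child

rng = random.Random(0)
# Root characters are i.i.d. (uniform here, as an assumption).
root = [rng.randint(0, 1) for _ in range(10000)]
child = evolve(root, 0.1, 0.1, rng)

# Fraction of positions where parent and child differ.
mismatch = sum(a != b for a, b in zip(root, child)) / len(root)
```

With p = q the mismatch fraction should be close to the mutation probability on the edge.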

SLIDE 4

Distance based algorithms

Distance [coin-toss model: symmetric mutations]

$D(u,v) = -\ln\bigl(\mathbb{P}\{\xi(u) = \xi(v)\} - \mathbb{P}\{\xi(u) \neq \xi(v)\}\bigr)$

symmetric
additive along paths

Distance-based algorithm:

1. distance estimation between leaves: $\hat{D}$

2. algorithm using pairwise distance matrix
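A small numeric check of additivity, as a sketch for the symmetric coin-toss model only. The helper names `edge_distance` and `compose` are hypothetical. With mutation probability p on an edge, the probability of agreement minus the probability of disagreement is 1 - 2p, so the distance is -ln(1 - 2p).

```python
import math

def edge_distance(p):
    """Distance of an edge with symmetric mutation probability p < 1/2."""
    return -math.log(1.0 - 2.0 * p)

def compose(p1, p2):
    """Mutation probability across two consecutive edges (exactly one flip)."""
    return p1 * (1.0 - p2) + (1.0 - p1) * p2

p1, p2 = 0.1, 0.2
d_path = edge_distance(compose(p1, p2))          # distance over the two-edge path
d_sum = edge_distance(p1) + edge_distance(p2)    # sum of the edge distances
```

The two values agree up to rounding because 1 - 2 * compose(p1, p2) factors as (1 - 2 * p1)(1 - 2 * p2).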

SLIDE 5

Additive tree problem

build edge-weighted tree from sum-of-edge-weights on paths between leaves

use triplets (e.g., Waterman, Smith, Singh, Beyer 1977)

[Figure: leaves u, v, w joined at the triplet center o]

$D(u,o) = \frac{D(u,v) + D(u,w) - D(v,w)}{2}$
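The triplet formula can be checked on a star tree with known edge lengths. This is a minimal sketch; the function name `center_distance` and the example edge lengths are made up for illustration.

```python
def center_distance(d_uv, d_uw, d_vw):
    """Distance from leaf u to the center o of the triplet (u, v, w)."""
    return (d_uv + d_uw - d_vw) / 2.0

# Hypothetical star tree: edges o-u, o-v, o-w with lengths 0.3, 0.5, 0.7.
a, b, c = 0.3, 0.5, 0.7
d_uo = center_distance(a + b, a + c, b + c)  # recovers the length of edge o-u
```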

SLIDE 6

Estimated distances

Use relative frequencies in sample

$\hat{D}(u,v) = -\ln\bigl(\hat{\mathbb{P}}\{\xi(u) = \xi(v)\} - \hat{\mathbb{P}}\{\xi(u) \neq \xi(v)\}\bigr)$

estimation error
harder to recognize separate triplet centers
estimation error grows with distance
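A sketch of the plug-in estimate from two aligned binary sequences. The function name `estimated_distance` is hypothetical, and returning infinity when the estimated similarity is not positive is an assumption of this sketch, not something the slides specify.

```python
import math

def estimated_distance(seq_u, seq_v):
    """Estimate D(u, v) from the relative frequencies of agreement and disagreement."""
    if len(seq_u) != len(seq_v) or len(seq_u) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    agree = sum(a == b for a, b in zip(seq_u, seq_v)) / len(seq_u)
    s_hat = agree - (1.0 - agree)  # estimated P{equal} - P{different}
    if s_hat <= 0.0:
        return math.inf  # assumption: the logarithm is undefined, treat as "too far"
    return -math.log(s_hat)

# One mismatch in ten positions: estimated similarity 0.8.
d_hat = estimated_distance("1001011010", "1101011010")
```

As the true distance grows, the estimated similarity approaches zero and small sampling fluctuations change the logarithm a lot, which is one way to see why the error grows with distance.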

SLIDE 7

Triplet center estimation

Similarity:

$S(u,v) = \exp\bigl(-D(u,v)\bigr) = \mathbb{P}\{\xi(u) = \xi(v)\} - \mathbb{P}\{\xi(u) \neq \xi(v)\}$

Distance estimation error: for $0 < \varepsilon < 1$,

$\mathbb{P}\left\{ \bigl|\hat{D}(u,o) - D(u,o)\bigr| \ge \frac{-\ln(1-\varepsilon)}{2} \right\} \le a \exp\bigl(-b\,\ell\,\varepsilon^2 S^2(u,v,w)\bigr)$

(with $a, b > 0$ constants)

Average similarity:

$S(u,v,w) = \frac{3}{\frac{1}{S(u,v)} + \frac{1}{S(u,w)} + \frac{1}{S(v,w)}}$
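The average similarity is the harmonic mean of the three pairwise similarities. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def average_similarity(s_uv, s_uw, s_vw):
    """Harmonic mean of the three pairwise similarities of a triplet."""
    return 3.0 / (1.0 / s_uv + 1.0 / s_uw + 1.0 / s_vw)

balanced = average_similarity(0.5, 0.5, 0.5)   # all pairs equally similar
skewed = average_similarity(0.9, 0.9, 0.05)    # one very distant pair
```

The harmonic mean is pulled toward the smallest similarity, so a triplet with one distant pair scores low even if the other two pairs are close.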

SLIDE 8

Harmonic Greedy Triplets

Add one internal node and leaf at a time

greedy selection of triplet by average similarity
recognize separate inner nodes (four-point condition)
restrict pool of triplets considered (relevant triplets)
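The greedy selection step alone can be sketched as follows. This is not the full algorithm: it scans all triplets of a small distance matrix instead of a restricted pool of relevant triplets, and it omits the four-point test. The function names and the example matrix are hypothetical.

```python
import math
from itertools import combinations

def triplet_score(dist, u, v, w):
    """Average (harmonic mean) similarity of a triplet, using 1/S = exp(D)."""
    inverse_sum = sum(math.exp(dist[a][b]) for a, b in ((u, v), (u, w), (v, w)))
    return 3.0 / inverse_sum

def best_triplet(dist):
    """Return the triplet of leaf indices with the largest average similarity."""
    leaves = range(len(dist))
    return max(combinations(leaves, 3), key=lambda t: triplet_score(dist, *t))

# Hypothetical additive distances for the tree ((0,1),(2,3)) with a long edge to leaf 3.
dist = [
    [0.0, 0.2, 0.5, 2.0],
    [0.2, 0.0, 0.5, 2.0],
    [0.5, 0.5, 0.0, 1.9],
    [2.0, 2.0, 1.9, 0.0],
]
chosen = best_triplet(dist)
```

Here the selection avoids the distant leaf 3, whose pairwise distances would be the hardest to estimate reliably.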

SLIDE 9

Sample length

Bounded mutation probabilities on edges: $0 < f \le p_e \le g < \frac{1}{2}$

There exists

$\ell = O\!\left(\frac{\log\frac{1}{\delta} + \log n}{(1-2g)^{O(d)}\, f^2}\right)$

s.t. with probability $1 - \delta$, topology is recovered correctly

tree depth: $d \le 1 + \log_2(n-1)$
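The depth bound is easy to evaluate for the tree sizes used later in the talk. A small sketch with a hypothetical helper name; the constants hidden in the O-notation of the sample length bound are not given, so only the depth bound is computed.

```python
import math

def depth_bound(n):
    """Upper bound 1 + log2(n - 1) on the tree depth d for n >= 2 leaves."""
    return 1.0 + math.log2(n - 1)

for n in (500, 1895, 3135):
    print(n, round(depth_bound(n), 2))
```

Since the depth is at most logarithmic in n, the factor (1 - 2g) raised to the power O(d) is polynomial in n, which is why the sample length bound is polynomial.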

SLIDE 10

Simulated experiments

compare to Neighbor Joining (Saitou and Nei 1987) and other algorithms

simulate DNA sequence evolution (Jukes-Cantor & K2P+Γ)

500 leaf tree (Chase et al. 1993)

tree of 500 seed plants from rbcL gene

1895 leaf tree (RDP 1999)

tree of 1895 eukaryotes from ribosomal SSU

3135 leaf tree (RDP 1999)

tree of 3135 Proteobacteria from ribosomal SSU

evaluate by Robinson-Foulds distance (1981): percentage of misplaced internal edges
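One way to compute such a percentage is to describe each internal edge by the split of the leaf set it induces and count the true splits missing from the estimated tree. This is a sketch under that assumption: the talk does not give its exact normalization, and the function names and the five-leaf example are hypothetical.

```python
def normalize(split, leaves):
    """Canonical form of a split: the side that does not contain the smallest leaf."""
    side = frozenset(split)
    other = frozenset(leaves) - side
    return other if min(leaves) in side else side

def rf_percent(true_splits, est_splits, leaves):
    """Percentage of internal edges of the true tree absent from the estimated tree."""
    true_set = {normalize(s, leaves) for s in true_splits}
    est_set = {normalize(s, leaves) for s in est_splits}
    return 100.0 * len(true_set - est_set) / len(true_set)

leaves = [1, 2, 3, 4, 5]
true_tree = [{1, 2}, {4, 5}]   # internal edges of ((1,2),3,(4,5))
est_tree = [{1, 3}, {4, 5}]    # internal edges of ((1,3),2,(4,5))
error = rf_percent(true_tree, est_tree, leaves)
```

In this example one of the two internal edges of the true tree is missing from the estimate.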

SLIDE 11

500-leaf tree

[Figure: the 500-leaf tree, with leaves labeled 1 to 500]

SLIDE 12

Experimental sample length: 500-leaf tree, varying sample length

[Plot: RF% (axis ticks 1, 10) versus sample length (axis ticks 200, 1000, 5000, 10000) for Neighbor-Joining and HGT/FP. Panel title: 500-leaf tree, high mutation probabilities]

SLIDE 13

Experimental sample length: 1895-leaf tree

[Plot: RF% (axis ticks 0.1, 1, 10) versus sample length (axis ticks 200, 1000, 5000, 10000) for Neighbor-Joining and HGT/FP. Panel title: 1895-leaf tree, high mutation probabilities]

SLIDE 14

Experimental success: 1895-leaf tree, varying mutation probabilities

[Plot: RF% (axis ticks 0.1, 1, 10) versus maximum edge length (axis ticks 0.1, 0.5, 1, 2) for Neighbor-Joining and HGT/FP. Panel title: 1895-leaf tree, high mutation probabilities]

SLIDE 15

Experimental success: 3135-leaf tree

[Plot: RF% (axis ticks 0.1, 1, 10) versus maximum edge length (axis ticks 0.1, 0.5, 1, 2) for Neighbor-Joining and HGT/FP. Panel title: 3135-leaf tree, high mutation probabilities]

SLIDE 16

Summary

distance-based algorithm with

polynomial sample size (Jukes-Cantor, Kimura's, paralinear, LogDet)

$n^2$ running time

$O(n)$ work space

good experimental performance on large divergent trees


fastest algorithm with polynomial sample size

http://www.cs.yale.edu/~csuros-miklos/papers/