

  1. Modern Computational Statistics Lecture 20: Applications in Computational Biology Cheng Zhang School of Mathematical Sciences, Peking University December 09, 2019

  2. Introduction
     ◮ While modern statistical approaches have been quite successful in many application areas, there remain challenging areas where complex model structures make these methods difficult to apply.
     ◮ In this lecture, we will discuss some recent advances in statistical approaches for computational biology, with an emphasis on evolutionary models.

  3. Challenges in Computational Biology
     [Figure adapted from Narges Razavian, 2013]

  4. Phylogenetic Inference
     The goal of phylogenetic inference is to reconstruct the evolutionary history (e.g., phylogenetic trees) from molecular sequence data (e.g., DNA, RNA, or protein sequences).

     Taxa       Characters
     Species A  ATGAACAT
     Species B  ATGCACAC
     Species C  ATGCATAT
     Species D  ATGCATGC

     Molecular sequence data → phylogenetic tree. There are many modern biological and medical applications: predicting the evolution of influenza viruses to aid vaccine design, etc.
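As a concrete illustration of working with an alignment like the one above, the sketch below computes pairwise Jukes-Cantor distances, a standard first step toward tree reconstruction. This example is not from the lecture; the function names are illustrative, and the Jukes-Cantor correction assumes equal base frequencies and substitution rates.

```python
import math

seqs = {"Species A": "ATGAACAT",
        "Species B": "ATGCACAC",
        "Species C": "ATGCATAT",
        "Species D": "ATGCATGC"}

def p_distance(s1, s2):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jc_distance(s1, s2):
    """Jukes-Cantor corrected evolutionary distance: -3/4 * ln(1 - 4p/3)."""
    p = p_distance(s1, s2)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

names = sorted(seqs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(jc_distance(seqs[a], seqs[b]), 3))
```

Distance matrices like this feed classic reconstruction methods such as neighbor joining; the Bayesian methods discussed later work from the sequences directly.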

  5. Example: B Cell Evolution
     This happens inside of you! These inferences guide rational vaccine design.

  9. Bayesian Phylogenetics
     Observed molecular sequences (ATGAAC···, ATGCAC···, ATGCAT···, ATGCAT···, ...) sit at the leaves y_1, ..., y_6 of a tree (τ, q) with topology τ and branch lengths q.
     Evolution model: p(ch | pa, q_e), where q_e is the amount of evolution on edge e.

     Likelihood (for M sites, summing over the internal-node states a^i at each site i):

         p(Y \mid \tau, q) = \prod_{i=1}^{M} \sum_{a^i} \eta(a^i_\rho) \prod_{(u,v) \in E(\tau)} P_{a^i_u a^i_v}(q_{uv})

     where η is the state distribution at the root ρ, E(τ) is the edge set of τ, and P_{jk}(t) is the transition probability after amount of evolution t.

     Given a proper prior distribution p(τ, q), the posterior is

         p(\tau, q \mid Y) \propto p(Y \mid \tau, q)\, p(\tau, q).
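The likelihood above can be computed without enumerating all internal-state assignments via Felsenstein's pruning algorithm, a sum-product recursion over the tree. Below is a minimal sketch under a Jukes-Cantor model with a uniform root distribution η; the toy tree, branch lengths, and function names are illustrative assumptions, not from the lecture.

```python
import numpy as np

STATES = "ACGT"
IDX = {c: i for i, c in enumerate(STATES)}

def jc_P(t):
    """Jukes-Cantor transition matrix P(t) for branch length t."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def partial(node, site):
    """Felsenstein pruning: partial likelihood vector at a node.
    A node is a leaf name (str) or a pair ((child, branch_len), (child, branch_len))."""
    if isinstance(node, str):
        vec = np.zeros(4)
        vec[IDX[site[node]]] = 1.0  # one-hot at the observed state
        return vec
    (left, tl), (right, tr) = node
    # Sum over each child's states: (P(t) @ L_child)[x] = sum_y P_xy(t) L_child[y]
    return (jc_P(tl) @ partial(left, site)) * (jc_P(tr) @ partial(right, site))

# Toy rooted tree ((A,B),(C,D)) with assumed branch lengths
tree = (((("A", 0.1), ("B", 0.1)), 0.2),
        ((("C", 0.1), ("D", 0.1)), 0.2))
seqs = {"A": "ATGAAC", "B": "ATGCAC", "C": "ATGCAT", "D": "ATGCAT"}
eta = np.full(4, 0.25)  # uniform root distribution

loglik = sum(np.log(eta @ partial(tree, {k: v[i] for k, v in seqs.items()}))
             for i in range(len(seqs["A"])))
print(loglik)
```

Each site then costs a constant amount of work per edge, instead of the 4^(#internal nodes) terms in the naive sum over a^i.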

  20. Markov chain Monte Carlo
     Random-walk MCMC (MrBayes, BEAST):
     ◮ Simple random perturbations (e.g., Nearest Neighbor Interchange, NNI) generate new states.
     Challenges for MCMC:
     ◮ Large search space: (2n − 5)!! unrooted trees for n taxa.
     ◮ Intertwined parameter space and low acceptance rates make it hard to scale to data sets with many sequences.
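The (2n − 5)!! growth of the topology space can be checked directly; a quick sketch (the function name is illustrative):

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # product 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

Already at n = 50 the count is roughly 10^74 topologies, which is why random walks over tree space mix slowly.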

  21. Variational Inference

         q^*(\theta) = \arg\min_{q \in Q} KL(q(\theta) \,\|\, p(\theta \mid x))

     ◮ VI turns inference into optimization.
     ◮ Specify a variational family of distributions over the model parameters, Q = { q_φ(θ) : φ ∈ Φ }.
     ◮ Fit the variational parameters φ to minimize the distance (often the KL divergence) to the exact posterior.

  22. Evidence Lower Bound

         L(q) = E_{q(\theta)}[\log p(x, \theta)] - E_{q(\theta)}[\log q(\theta)] \le \log p(x)

     ◮ The KL divergence itself is intractable; we instead maximize the evidence lower bound (ELBO), which only requires the joint probability p(x, θ).
     ◮ The ELBO is a lower bound on log p(x).
     ◮ Maximizing the ELBO is equivalent to minimizing the KL divergence.
     ◮ The ELBO strikes a balance between two terms:
       ◮ The first term encourages q to place probability mass where the model puts high probability.
       ◮ The second term encourages q to be diffuse.
     ◮ As an optimization approach, VI tends to be faster than MCMC and is easier to scale to large data sets (via stochastic gradient ascent).
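As a toy illustration of maximizing the ELBO by stochastic gradient ascent, the sketch below fits a Gaussian q(θ) = N(μ, σ²) to the posterior of a conjugate normal-mean model, where the exact posterior is known in closed form. This example is not from the lecture; the model, step sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: theta ~ N(0, 1), x_j | theta ~ N(theta, 1), j = 1..n.
x = rng.normal(2.0, 1.0, size=20)
n = len(x)

# Exact posterior for reference: N(sum(x) / (n + 1), 1 / (n + 1))
post_mean = x.sum() / (n + 1)
post_var = 1.0 / (n + 1)

def dlogjoint(theta):
    # d/dtheta log p(x, theta) = -theta + sum_j (x_j - theta)
    return x.sum() - (n + 1) * theta

# Variational family q = N(mu, sigma^2); stochastic gradient ascent on the
# ELBO using the reparameterization trick theta = mu + sigma * eps.
mu, log_sig = 0.0, 0.0
lr = 0.005
for step in range(2000):
    eps = rng.normal(size=32)            # base noise
    theta = mu + np.exp(log_sig) * eps   # reparameterized samples
    g = dlogjoint(theta)
    grad_mu = g.mean()                                   # dELBO/dmu
    grad_ls = (g * eps).mean() * np.exp(log_sig) + 1.0   # + entropy gradient
    mu += lr * grad_mu
    log_sig += lr * grad_ls
print(mu, np.exp(2 * log_sig))   # approaches (post_mean, post_var)
```

Because the Gaussian family contains the exact posterior here, the KL can be driven to zero; with a less expressive family, a gap between the ELBO and log p(x) would remain.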

  23. Subsplit Bayesian Networks
     Inspired by previous work (Höhna and Drummond 2012; Larget 2013), we can decompose trees into local structures (subsplits) and encode the tree topology space via Bayesian networks!
     [Figure: example trees on taxa A, B, C, D decomposed into subsplits (e.g., ABC|D, AB|CD, B|C), arranged as nodes S_1, ..., S_7 of a Bayesian network.]

  28. Probability Estimation Over Tree Topologies
     Rooted trees:

         p_{sbn}(T = \tau) = p(S_1 = s_1) \prod_{i>1} p(S_i = s_i \mid S_{\pi_i} = s_{\pi_i})

  29. Probability Estimation Over Tree Topologies
     Unrooted trees (summing over all root subsplits s_1 compatible with τ, written s_1 ∼ τ):

         p_{sbn}(T^u = \tau) = \sum_{s_1 \sim \tau} p(S_1 = s_1) \prod_{i>1} p(S_i = s_i \mid S_{\pi_i} = s_{\pi_i})

  30. Tree Probability Estimation via SBNs
     SBNs can be used to learn a probability distribution from a collection of trees T = {T_1, ..., T_K}, with T_k = {S_i = s_{i,k}, i ≥ 1}, k = 1, ..., K.
     Rooted trees
     ◮ Maximum likelihood estimates are relative frequencies:

         \hat{p}_{MLE}(S_1 = s_1) = \frac{m_{s_1}}{K}, \quad \hat{p}_{MLE}(S_i = s_i \mid S_{\pi_i} = t_i) = \frac{m_{s_i, t_i}}{\sum_{s \in C_i} m_{s, t_i}}

     Unrooted trees
     ◮ Expectation maximization:

         \hat{p}^{EM,(n+1)} = \arg\max_p \; E_{p(S_1 \mid T, \hat{p}^{EM,(n)})} \Big[ \log p(S_1) + \sum_{i>1} \log p(S_i \mid S_{\pi_i}) \Big]
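A minimal sketch of the rooted-tree relative-frequency estimates, counting subsplits in a small collection of rooted trees. The tree encoding and names are illustrative, and for brevity both child slots of a parent are pooled when normalizing, a simplification of the per-slot normalization over C_i in the formula above.

```python
from collections import Counter

# A rooted tree is a leaf name (str) or a pair (left, right).
def clade(t):
    return frozenset([t]) if isinstance(t, str) else clade(t[0]) | clade(t[1])

def subsplit(t):
    """Subsplit at an internal node: the unordered pair of its child clades."""
    return frozenset([clade(t[0]), clade(t[1])])

def pairs(t, parent=None, out=None):
    """Collect (parent subsplit, child subsplit) for every internal node."""
    if out is None:
        out = []
    if not isinstance(t, str):
        s = subsplit(t)
        out.append((parent, s))
        pairs(t[0], s, out)
        pairs(t[1], s, out)
    return out

trees = [(("A", "B"), ("C", "D")),
         (("A", "B"), ("C", "D")),
         ((("A", "C"), "B"), "D")]
K = len(trees)

root, cond = Counter(), Counter()
for t in trees:
    for parent, s in pairs(t):
        if parent is None:
            root[s] += 1            # m_{s1}
        else:
            cond[(parent, s)] += 1  # m_{s, t}

p_root = {s: m / K for s, m in root.items()}  # hat{p}(S1 = s1) = m_{s1} / K
parent_tot = Counter()
for (parent, s), m in cond.items():
    parent_tot[parent] += m
p_cond = {k: m / parent_tot[k[0]] for k, m in cond.items()}
print(p_root)
```

For unrooted trees the root subsplit is unobserved, which is what motivates the EM iteration above.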
