

SLIDE 1

1/22/09 1

CSCI1950-Z Computational Methods for Biology* (*Working Title) Lecture 1

Ben Raphael January 21, 2009

Course Particulars

  • Three major topics
    1. Phylogeny: ~50% lectures
    2. Functional Genomics: ~25% lectures
    3. Network/Systems Biology: ~25% lectures
  • Tools
    – Computer Science: Algorithms and discrete math (e.g. graph theory), Programming
    – Mathematics: Discrete Probability, Linear algebra (vectors and matrices)
    – Biology: Basics. (What is DNA?)

SLIDE 2

Course Particulars

  • Webpage: http://cs.brown.edu/courses/csci1950-z/ (readings, including some background material)
  • Textbook: None
  • Assignments: mens et manus
    1. 4 written assignments: ~40% of grade
    2. 3 programming assignments: ~40% of grade
    3. Take home final: ~20% of grade
  • Graduate credit
    – Extra assignment/project
    – Talk to me before March 1
  • Survey

Topic 1: Phylogeny

SLIDE 3

Early Evolutionary Studies

200th Anniversary of birth of Charles Darwin

From On the Origin of Species (1859)

Darwin → 1960s

  • Anatomical features were the dominant criteria used to derive evolutionary relationships between species.
  • Imprecise, often subjective, observations often led to inconclusive, contradictory, or incorrect evolutionary relationships between species.
  • Molecular data (DNA and protein sequences) dramatically improved the situation.

SLIDE 4

Species Trees

Is a panda more closely related to a bear or a raccoon?

Features compared between Bear and Raccoon: Looks, Hibernation Pattern, …

~100 years of arguments

Tree derived from DNA sequence data. Steven O’Brien et al. (1985)

Human Evolutionary History

From: Molecular Evolution: A Phylogenetic Approach, R. Page & E. Holmes

SLIDE 5

More Recent Human History

Out of Africa Hypothesis: Most ancient ancestor lived in Africa roughly 200,000 years ago

http://www.becominghuman.org

The Origin of Humans: “Out of Africa” vs Multiregional Hypothesis

Out of Africa:

  • Humans evolved in Africa ~200,000 years ago
  • Humans migrated out of Africa, replacing other humanoids around the globe

Multiregional:

  • Humans evolved in the last two million years as a single species. Independent appearance of modern traits in different areas
  • Humans migrated out of Africa, mixing with other humanoids on the way

SLIDE 6

Human Evolutionary Tree

http://www.mun.ca/biology/scarr/Out_of_Africa2.htm

DNA-based reconstruction of the human evolutionary tree

Evolutionary Tree of Humans (mtDNA)

Vigilant, Stoneking, Harpending, Hawkes, and Wilson (1991)

  • African population is the most diverse (sub-populations had more time to diverge)
  • Evolutionary tree separates one group of Africans from a group containing all five populations.
  • Tree rooted on branch between groups of greatest difference.

SLIDE 7

Evolutionary Tree of Humans (microsatellites)

  • Neighbor joining tree for 14 human populations genotyped with 30 microsatellite loci.

Lineage of Genghis Khan?

In humans, the Y-chromosome is passed from the father only. It can be used to identify paternal lineages.

~8% of males in parts of Asia and 0.5% world-wide are estimated to be descendants of a resident of Mongolia ~1000 years ago (Zerjal et al. AJHG 2003).
SLIDE 8

Lafayette, Louisiana, 1994:

  • A woman claimed her ex-lover (who was a physician) injected her with HIV+ blood
  • Records show the physician had drawn blood from an HIV+ patient that day
  • Is there a way to show that blood from that HIV+ patient ended up in the woman?

HIV Transmission

  • HIV has a high mutation rate, which can be used to trace paths of transmission
  • Two people who were infected from different sources will have very different HIV sequences

Alignment of fourteen amino acid sequences from V3 region of HIV‐1 gp120 genes

Azizi et al. BMC Immunology 2006 7:25
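The claim that infections from different sources yield very different sequences can be made concrete with a pairwise distance. The sketch below is only an illustration: it assumes the sequences are already aligned and of equal length, it uses the fraction of mismatched positions as the distance, and the three short sequences are invented placeholders, not data from the case.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


# Hypothetical aligned fragments (made up for illustration only).
patient = "ACGTACGTAC"
victim = "ACGTACGAAC"    # differs from the patient at one position
control = "TCGAACCTAG"   # differs from the patient at four positions

print(p_distance(patient, victim))   # small distance: plausibly the same source
print(p_distance(patient, control))  # larger distance: unrelated infection
```

A table of such distances for all pairs of samples is one of the two kinds of input (pairwise distances) that tree-building methods can start from.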

SLIDE 9

To the Lab!

Wet lab

  • Take multiple samples from the patient, the woman, and controls (non-related HIV+ people)
  • Obtain DNA sequence from two HIV genes (gp120 and RT).

Computer lab

  • Build phylogenetic tree from the DNA sequences.

Phylogenetic Tree → Conviction

  • Three different tree reconstruction techniques used.
  • In every reconstruction, victim’s sequences were related to patient’s sequences.
  • Nesting of the victim’s sequences within the patient sequence indicated the direction of transmission was from patient to victim
  • First time phylogenetic analysis was used in a court case as evidence (Metzker et al., 2002)

SLIDE 10

Phylogenetic Trees

How to build a phylogenetic tree from data?

Data

  1. Characters/Features
  2. Pairwise distances

Algorithm

Phylogenetic Trees

What is a phylogenetic tree?

Biology definition:

  • None (picture)
  • A “branching diagram…”

Intuition:

  • Leaves represent existing species
  • Branch points represent most recent common ancestor.
  • Lengths of branches represent evolutionary time.
  • Root represents the oldest evolutionary ancestor.

SLIDE 11

Phylogenetic Trees

What is a phylogenetic tree?

Computer science definition

tree: A connected acyclic graph G = (V, E).
graph: A set V of vertices and a set E of edges, where each edge connects a pair of vertices.

Tree Definitions

tree: A connected acyclic graph G = (V, E).
graph: A set V of vertices and a set E of edges, where each edge (v_i, v_j) connects a pair of vertices.
A path in G is a sequence (v_1, v_2, …, v_n) of vertices in V such that (v_i, v_{i+1}) are edges in E.
A graph is connected provided for every pair v_i, v_j of vertices, there is a path between v_i and v_j.
A cycle is a path with the same starting and ending vertices.
A graph is acyclic provided it has no cycles.
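The definitions above translate directly into a check. This is a minimal sketch, not course code: the function name `is_tree` and the edge-list representation are assumptions made for illustration. It tests connectivity with a breadth-first search and then uses the standard fact that a connected graph is acyclic exactly when |E| = |V| - 1.

```python
from collections import deque


def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected and acyclic."""
    vertices = list(vertices)
    if not vertices:
        return False  # convention chosen for this sketch: the empty graph is not a tree
    adjacency = {v: [] for v in vertices}
    for u, w in edges:
        adjacency[u].append(w)
        adjacency[w].append(u)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    connected = len(seen) == len(vertices)
    # A connected graph has no cycles exactly when it has |V| - 1 edges.
    return connected and len(edges) == len(vertices) - 1


print(is_tree([1, 2, 3, 4], [(1, 2), (2, 3), (2, 4)]))  # connected, no cycle
print(is_tree([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))     # contains a cycle
```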

SLIDE 12

Tree Definitions

tree: A connected acyclic graph G = (V, E).
The degree of a vertex v is the number of edges incident to v.
A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).
A binary phylogenetic tree is a phylogenetic tree where every interior (non-leaf) vertex has degree 3; i.e. two children.
A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated root vertex r (*of degree 2).
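As a small illustration of these definitions, the sketch below computes vertex degrees from an edge list, extracts the leaves, and checks the degree condition for an unrooted binary tree. The function names and the four-leaf example are hypothetical, and the code assumes the edges already form a tree.

```python
from collections import Counter


def degrees(edges):
    """Map each vertex to the number of edges incident to it."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def leaves(edges):
    """Leaves are the vertices of degree one."""
    return sorted(v for v, d in degrees(edges).items() if d == 1)


def is_binary_unrooted(edges):
    """Assuming the edges form a tree: every vertex is a leaf or has degree 3."""
    return all(d in (1, 3) for d in degrees(edges).values())


# Unrooted binary tree on four labeled leaves A, B, C, D (x and y are interior).
quartet = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
# Star tree: one interior vertex of degree 4, so it is not binary.
star = [("A", "c"), ("B", "c"), ("C", "c"), ("D", "c")]

print(leaves(quartet), is_binary_unrooted(quartet), is_binary_unrooted(star))
```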

Rooted and Unrooted Trees

In the unrooted tree the position of the root (“oldest ancestor”) is unknown. Otherwise, they are like rooted trees.

SLIDE 13

Evaluating Different Phylogenies

Character   Value1   Value2
Mouth       Smile    Frown
Eyebrows    Normal   Pointed

Character-Based Tree Reconstruction

Which tree is better?

SLIDE 14

Character-Based Tree Reconstruction

Count changes on tree

Character-Based Tree Reconstruction

Parsimony: minimize number of changes on edges of tree
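One standard way to count the minimum number of changes for a fixed tree is Fitch's algorithm, which the slide does not name; it is swapped in here as an illustration. The sketch assumes a rooted binary tree written as nested pairs, and the four faces `a` to `d` with their character values are invented stand-ins for the pictures on the slides.

```python
def fitch_changes(tree, states):
    """Fitch's algorithm for one character.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping each leaf name to its observed character value.
    Returns (candidate values at this vertex, minimum changes in the subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    common = left_set & right_set
    if common:  # children can agree: no change needed at this vertex
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, characters):
    """Total number of changes over all characters."""
    return sum(fitch_changes(tree, states)[1] for states in characters.values())


# Hypothetical data in the spirit of the Mouth/Eyebrows table.
characters = {
    "mouth": {"a": "smile", "b": "smile", "c": "frown", "d": "frown"},
    "eyebrows": {"a": "normal", "b": "normal", "c": "pointed", "d": "pointed"},
}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))

print(parsimony_score(tree_1, characters))  # fewer changes: the better tree here
print(parsimony_score(tree_2, characters))
```

Under the parsimony criterion the tree with the lower score is preferred; finding the best tree still requires searching over tree shapes.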

SLIDE 15

Character-Based Tree Reconstruction

Maximum Likelihood: Given Pr[change], what is the tree with maximum probability?

Identifying Highest Scoring Tree

  • Naïve, exhaustive algorithm: check all trees.
  • How many possibilities?
    – Restrict to binary trees.
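To get a feel for "how many possibilities", the sketch below uses a standard counting result that is not stated on the slide: the number of unrooted binary trees on n labeled leaves is (2n-5)!! = 3 · 5 · … · (2n-5) for n ≥ 3. The function name is an assumption for illustration.

```python
def count_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary trees with n labeled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        raise ValueError("formula applies for at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_binary_trees(n))
```

The count grows faster than exponentially in the number of leaves, which is why checking all trees is only feasible for a handful of species.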

SLIDE 16

Phylogenetic Trees

How to efficiently build trees from data?

[Figure: trees with leaves labeled 1 4 3 2 5 and 1 4 2 3 5]

Data

  1. Characters/Features
  2. Pairwise distances

Phylogenetic Trees

How to efficiently build trees from data?

[Figure: trees with leaves labeled 1 4 3 2 5 and 1 4 2 3 5]

Methods

  1. Characters/Features
    • Parsimony: Minimum number of changes
    • Probabilistic Model
  2. Pairwise distances
    • Clustering (UPGMA, Neighbor joining, …)
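As a rough sketch of the distance-based route, the code below implements the merging rule of UPGMA (repeatedly join the two closest clusters and average their distances to the rest, weighted by cluster size). Branch lengths are omitted, the function name and input format are assumptions, and the distance table is made up for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the UPGMA tree topology as nested pairs.

    names: list of leaf labels.
    dist: dict mapping a pair (a, b) of labels to their distance.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the merged cluster to every other cluster:
        # size-weighted average of the two old distances.
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Hypothetical pairwise distances between four species.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

Neighbor joining follows the same merge-and-update pattern but chooses the pair to join with a different criterion.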

SLIDE 17

Additional Models and Extensions

  • Comparing trees
    – Distances between trees.
    – Statistical tests: bootstrap, permutation tests, etc.
  • Supertrees and consensus
  • Gene trees vs. species trees.
  • Whole-genome phylogeny.

Topic 2: Functional Genomics

SLIDE 18

Biology 101

Central Dogma

SLIDE 19

What can we measure?

Sequencing (expensive) Hybridization (noisy) Sequencing (expensive) Hybridization (noisy) Mass spectrometry (noisy) Hybridization (very noisy!)

DNA Basepairing

SLIDE 20

DNA Microarrays

Clustering of Gene Expression

[Figure axes: Gene expression, Samples]

Each microarray experiment: expression vector u = (u_1, …, u_n), where u_i = expression value for gene i. Group similar vectors.

BMC Genomics 2006, 7:279
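"Group similar vectors" needs a notion of similarity. The sketch below assumes Euclidean distance between expression vectors, which is only one possible choice, and finds the most similar pair of experiments; the sample names and values are invented.

```python
import math
from itertools import combinations


def closest_pair(vectors):
    """Return the two sample names whose expression vectors are nearest."""
    return min(
        combinations(sorted(vectors), 2),
        key=lambda pair: math.dist(vectors[pair[0]], vectors[pair[1]]),
    )


# Hypothetical expression vectors: one value per gene, one vector per experiment.
expression = {
    "s1": (1.0, 2.0, 0.5),
    "s2": (1.1, 2.1, 0.4),
    "s3": (5.0, 0.2, 3.0),
}
print(closest_pair(expression))
```

Repeatedly merging the closest pair in this way is the first step of hierarchical clustering.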

SLIDE 21

Clustering

  • Clustering algorithms are related to distance-based phylogenetic algorithms.
  • Phylogeny gives grouping of related data points.

[Figure: trees with leaves labeled 1 4 3 2 5 and 1 4 2 3 5]

Classification

Binary classification: Given a set of examples (x_i, y_i), where y_i = ±1, from unknown distribution D. Design function f: R^n → {-1,+1} that assigns additional samples x_i to one of two classes optimally.
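One simple instance of such a function f is the nearest-neighbor rule: label a new sample with the label of the closest training example. The sketch below is an illustration under assumptions (Euclidean distance, made-up training points); it is not claimed to be optimal in the sense of the definition.

```python
import math


def nearest_neighbor_classifier(examples):
    """Build f: R^n -> {-1, +1} from labeled examples [(x, y), ...]."""
    def f(x_new):
        # Find the training example closest to x_new and return its label.
        _, label = min(examples, key=lambda ex: math.dist(ex[0], x_new))
        return label
    return f


# Hypothetical training set in R^2.
training = [((0, 0), -1), ((0, 1), -1), ((5, 5), +1), ((6, 5), +1)]
f = nearest_neighbor_classifier(training)
print(f((1, 0)), f((5, 6)))
```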

SLIDE 22

Topics

  • Methods for Clustering
    – Hierarchical, Matrix-based (PCA), Graph based (Clique-finding)
  • Methods for Classification
    – Nearest neighbors, support vector machines
  • Data Integration: Bayesian Networks

Topic 3: Network and Systems Biology

SLIDE 23

Biological Interaction Networks

Many types:

  • Protein-DNA (regulatory)
  • Protein-metabolite (metabolic)
  • Protein-protein (signaling)
  • RNA-RNA (regulatory)
  • Genetic interactions (gene knockouts)

Regulatory Networks

SLIDE 24

Cis-regulatory Network

Metabolic Networks

Nodes = reactants
Edges = reactions labeled by enzyme (protein) that catalyzes reaction

SLIDE 25

Protein-Protein Interaction (PPI) Network

Protein-Protein Interaction Network?

  • Proteins are nodes
  • Interactions are edges
  • Edges may have weights

Yeast PPI network: H. Jeong et al. Nature 411, 41 (2001)
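A PPI network with proteins as nodes and weighted interactions as edges can be held in a nested dictionary. The sketch below is one possible representation, not a prescribed one; the protein names `P1` to `P4` and the weights are placeholders.

```python
from collections import defaultdict


def build_ppi_network(interactions):
    """Build an undirected weighted graph from (protein_a, protein_b, weight) triples."""
    network = defaultdict(dict)
    for protein_a, protein_b, weight in interactions:
        network[protein_a][protein_b] = weight
        network[protein_b][protein_a] = weight
    return network


def degree(network, protein):
    """Number of interaction partners of a protein."""
    return len(network.get(protein, {}))


# Hypothetical interactions with confidence weights.
interactions = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.4),
    ("P2", "P3", 0.7),
    ("P3", "P4", 0.2),
]
network = build_ppi_network(interactions)
hub = max(network, key=lambda p: degree(network, p))
print(hub, degree(network, hub))
```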
SLIDE 26

Computational Problems

  • 1. Classifying Network Topology
    – Finding paths, cliques, dense subnetworks, etc.
  • 2. Comparing Networks Across Species
  • 3. Using networks to explain data
    – Dependencies revealed by network topology
  • 4. Modeling dynamics of networks

Network Motifs

Subnetworks with more occurrences than expected by chance.

  • How to find?
  • How to assess statistical significance?

Shen‐Orr et al. 2002
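Both questions can be sketched for one small pattern, the feed-forward loop (edges a→b, b→c and a→c). The code below counts occurrences by brute force and estimates an empirical p-value against random directed graphs with the same numbers of nodes and edges. That null model is a simplifying assumption made here for brevity; it is not taken from the slide or the cited paper, and the example network is invented.

```python
import random
from itertools import permutations


def count_feed_forward_loops(edges):
    """Count ordered triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {v for edge in edge_set for v in edge}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )


def empirical_p_value(edges, n_random=1000, seed=0):
    """Fraction of random graphs with at least as many loops as observed."""
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    observed = count_feed_forward_loops(edges)
    n_edges = len(set(edges))
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(possible, n_edges)  # random graph, same edge count
        if count_feed_forward_loops(sample) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)


# Hypothetical regulatory network containing one feed-forward loop.
regulation = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
print(count_feed_forward_loops(regulation))
print(empirical_p_value(regulation, n_random=200))
```

Brute-force enumeration is cubic in the number of nodes, so it only suits small networks; larger ones need smarter enumeration and a more careful null model.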

SLIDE 27

Network Alignment

Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp. 427-433, 2006

The Network Alignment Problem

Given: k different interaction networks belonging to different species.
Find: Conserved sub-networks within these networks.
Conserved is defined by protein sequence similarity (node similarity) and interaction similarity (network topology similarity).

SLIDE 28

Protein Signaling Networks

Use machine learning methods (Bayesian networks, etc.) to derive network structure.

Art Salomon, Biology Department

Course Themes

  • Topics: Phylogeny, Functional Genomics, Systems & Network Biology
  • Mixture of theory and practice (real data)
  • Graph algorithms: Path and clique finding, isomorphism, heavy subgraphs, matching, vertex cover, spanning and Steiner problems, etc.
  • Statistics: Hypothesis testing, permutation tests, bootstrap and resampling, enrichment (hypergeometric), etc.
  • Data Mining and Machine Learning: Clustering and Classification

SLIDE 29

Sources

  • http://bioalgorithms.info (portions of “Out of Africa” and character phylogeny slides)