A Statistical Framework for Spatial Comparative Genomics Thesis - - PowerPoint PPT Presentation




slide-1
SLIDE 1

A Statistical Framework for Spatial Comparative Genomics

Thesis Proposal

Rose Hoberman
Carnegie Mellon University, August 2005

Thesis Committee

Dannie Durand (chair)
Andrew Moore
Russell Schwartz
Jeffrey Lawrence (Univ. of Pittsburgh, Dept. of Biological Sciences)
David Sankoff (Univ. of Ottawa, Dept. of Math & Statistics)

slide-2
SLIDE 2

Genome: the complete set of genetic material of an organism or species

Regulatory regions: regions of DNA where regulatory proteins bind
Genes: DNA sequences that code for a specific functional product, most commonly proteins
Noncoding DNA: large stretches of DNA with unknown function

CCCCTGTGAAGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGGGGGG
CCGACACTTCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCCCCCC

slide-3
SLIDE 3

Genome Evolution

speciation
Sequence Mutation + Chromosomal Rearrangements

species 1   species 2

slide-4
SLIDE 4

Chromosomal Rearrangements

[Figure: gene orders of Species 1 and Species 2, diverging through inversions, duplications, and loss]

slide-5
SLIDE 5

My focus: Spatial Comparative Genomics

Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves.

slide-6
SLIDE 6

Terminology

Homologous: related through common ancestry

Orthologous: related through speciation
Paralogous: related through duplication

[Figure: gene orders of Species 1 and Species 2, with orthologs and paralogs labeled]

slide-7
SLIDE 7

An Essential Task for Spatial Comparative Genomics

Identify homologous blocks, chromosomal regions that correspond to the same chromosomal region in an ancestral genome

My thesis: how to find and statistically validate homologous blocks


slide-8
SLIDE 8

More distantly related segments:

Gene Clusters: similar gene content, but neither gene content nor order is strictly conserved

slide-9
SLIDE 9

Gene Clusters are Used in Many Types of Genomic Analysis

Inferring functional coupling of genes in bacteria (Overbeek et al 1999)
Recent polyploidy in Arabidopsis (Blanc et al 2003)
Sequence of the human genome (Venter et al 2001)
Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002)
Duplications in Eukaryotes (Vision et al 2000)
Identification of horizontal transfers (Lawrence and Roth 1996)
Evolution of gene order conservation in prokaryotes (Tamames 2001)
Ancient yeast duplication (Wolfe and Shields 1997)
Genomic duplication during early chordate evolution (McLysaght et al 2002)
Comparing rates of rearrangements (Coghlan and Wolfe 2002)
Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998)
Operon prediction in newly sequenced bacteria (Chen et al 2004)
Breakpoints as phylogenetic features (Blanchette et al 1999)
...

slide-10
SLIDE 10

reconstruct the history of chromosomal rearrangements
infer an ancestral genetic map
build phylogenies
transfer knowledge

Spatial Comparative Genomics

Guillaume Bourque et al. Genome Res. 2004; 14: 507-516

slide-11
SLIDE 11

Consider evolution as an enormous experiment
Unimportant structure is randomized or lost
Exploit evolutionary patterns to infer functional associations

Snel, Bork, Huynen. PNAS 2002

Spatial Comparative Genomics

Function

slide-12
SLIDE 12

Outline

Introduction and Applications
Formal framework for gene clusters
  Genome representation
  Gene homology mapping
  Cluster definition
Introduction to Statistical Issues
Preliminary work: Testing cluster significance
Proposed work

slide-13
SLIDE 13

Basic Genome Model

a sequence of unique genes
distance between genes is equal to the number of intervening genes
gene orientation unknown
a single, linear chromosome
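
A minimal sketch of this genome model, assuming genes are represented by hashable labels in a Python list. The function name `gene_distance` and the toy gene order are illustrative assumptions, not taken from the slides.

```python
def gene_distance(genome, a, b):
    """Distance between two distinct genes = number of genes lying between them."""
    if len(set(genome)) != len(genome):
        raise ValueError("the basic model assumes a sequence of unique genes")
    position = {gene: i for i, gene in enumerate(genome)}
    return abs(position[a] - position[b]) - 1


# Hypothetical single, linear chromosome; orientation is not modelled.
toy_genome = [4, 5, 3, 7, 1, 6, 2]
adjacent = gene_distance(toy_genome, 4, 5)    # no intervening genes
separated = gene_distance(toy_genome, 4, 7)   # genes 5 and 3 lie in between
```

Under this convention adjacent genes are at distance 0, which matches the way gaps are counted in the cluster definitions that follow.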

slide-14
SLIDE 14

Gene Homology

Identification of homologous gene pairs

generally based on sequence similarity
still an imprecise science
preprocessing step

Assumptions

matches are binary (similarity scores are discarded)
each gene is homologous to at most one other gene in the other genome

slide-15
SLIDE 15

Where are the gene clusters?

Intuitive notions of what clusters look like

Enriched for homologous gene pairs
Neither gene content nor order is perfectly preserved

Need a more rigorous definition

slide-16
SLIDE 16

Cluster Definitions

Descriptive:
  common intervals
  r-window
  max-gap
  …

Constructive:
  LineUp
  CloseUp
  FISH
  …

Cluster properties
  order
  size
  length
  density
  gaps

gap = 3   length = 10   size = 4

slide-17
SLIDE 17

Max-Gap: a common cluster definition

A set of genes forms a max-gap cluster if the gap between adjacent genes is never greater than g on either genome

gap ≤ 2 gap ≤ 4
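
The definition can be sketched as a direct check, assuming each genome is a list of unique gene labels and the gap between two cluster genes is the number of intervening genes. The names `max_gap` and `is_max_gap_cluster` and the toy genomes are assumptions made for illustration.

```python
def max_gap(genome, genes):
    """Largest number of intervening genes between adjacent members of `genes`."""
    wanted = set(genes)
    positions = [i for i, gene in enumerate(genome) if gene in wanted]
    if len(positions) != len(wanted):
        raise ValueError("each gene of the set must occur exactly once in the genome")
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def is_max_gap_cluster(genome1, genome2, genes, g):
    """True if no gap between adjacent cluster genes exceeds g on either genome."""
    return max_gap(genome1, genes) <= g and max_gap(genome2, genes) <= g


# Hypothetical toy genomes; gene order inside the cluster is free to differ.
genome1 = [1, 2, 3, 4, 5, 6, 7, 8]
genome2 = [3, 9, 1, 10, 11, 2, 12, 4]
with_g2 = is_max_gap_cluster(genome1, genome2, {1, 2, 3}, g=2)
with_g1 = is_max_gap_cluster(genome1, genome2, {1, 2, 3}, g=1)
```

In this toy case the set {1, 2, 3} is contiguous in the first genome but spread out with gaps of 1 and 2 in the second, so it qualifies for g = 2 and fails for g = 1.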

slide-18
SLIDE 18

Why Max-Gap?

Allows extensive rearrangement of gene order
Allows limited gene insertion and deletions
Allows the cluster to grow to its natural size

It’s the most widely used in genomic analyses

no formal statistical model for max-gap clusters

slide-19
SLIDE 19

Outline

Introduction and Applications
Formal framework for gene clusters
Introduction to statistical issues
Preliminary work: Testing cluster significance
Proposed work

slide-20
SLIDE 20

Detecting Homologous Chromosomal Segments

1. Formally define a “gene cluster” ...modeling
2. Devise an algorithm to identify clusters ...algorithms
3. Verify that clusters indicate common ancestry ...statistics

slide-21
SLIDE 21

Statistical Testing Provides Additional Evidence for Common Ancestry

How can we verify that a gene cluster indicates common ancestry?

True histories are rarely known
Experimental verification is often not possible
Rates and patterns of large-scale rearrangement processes are not well understood

slide-22
SLIDE 22

Statistical Testing

Goal: distinguish ancient homologies from chance similarities

Hypothesis testing

Alternate hypothesis: shared ancestry
Null hypothesis: random gene order

Determine the probability of seeing a cluster by chance under the null hypothesis

An example…

slide-23
SLIDE 23

Whole Genome Self-Comparison

Compared all human chromosomes to all other chromosomes to find gene clusters
Identified 96 clusters of size 6 or greater

McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.

29 genes
10 genes duplicated out of ~100

Could two regions display this degree of similarity simply by chance?

Chromosome 3   Chromosome 17

slide-24
SLIDE 24

1. Are larger clusters more likely to occur by chance?
2. Are there other duplicated segments that their method did not detect?

McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.

Chromosome 17

Clusters with similarity to human chromosome 17

slide-25
SLIDE 25

Cluster Significance: Related Work

Randomization tests

most common approach
generally compare clusters by size

Very simple models

Excessively strict simplifying assumptions
Overly conservative cluster definitions

Citations in proposal

slide-26
SLIDE 26

Cluster Significance: Related Work

Calabrese et al, 2003

statistics introduced in the context of developing a heuristic search for clusters

Durand and Sankoff, 2003

definition: m homologs in a window of size r

My thesis

max-gap definition

slide-27
SLIDE 27

Outline

Introduction and Applications
Formal framework for gene clusters
Introduction to statistical issues
Preliminary work: max-gap cluster statistics
  reference set
  whole-genome comparison

Proposed work

slide-28
SLIDE 28

Cluster statistics depend on how the cluster was found

Whole genome comparison: find all (maximal) sets of genes that are clustered together in both genomes.


slide-29
SLIDE 29

Cluster statistics depend on how the cluster was found

Reference set: does a particular set of genes cluster together in one genome?

complete cluster: contains all genes in the set
incomplete cluster: contains only a subset
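
A small sketch of the reference-set scenario in a single genome, assuming the max-gap definition with gap threshold g. The helper names `reference_set_clusters` and `is_complete_cluster` and the toy data are hypothetical.

```python
def reference_set_clusters(genome, reference_set, g):
    """Split the reference genes, in genome order, into runs whose adjacent gaps are all <= g."""
    wanted = set(reference_set)
    runs, current, last_index = [], [], None
    for i, gene in enumerate(genome):
        if gene not in wanted:
            continue
        if current and i - last_index - 1 > g:
            runs.append(current)      # gap too large: close the current run
            current = []
        current.append(gene)
        last_index = i
    if current:
        runs.append(current)
    return runs


def is_complete_cluster(genome, reference_set, g):
    """Complete cluster: a single run that contains every gene in the reference set."""
    runs = reference_set_clusters(genome, reference_set, g)
    return len(runs) == 1 and set(runs[0]) == set(reference_set)


toy_genome = list("abcdefghij")          # hypothetical gene labels
reference = {"b", "c", "e", "i"}
runs_g1 = reference_set_clusters(toy_genome, reference, g=1)
```

With g = 1 the reference genes split into an incomplete cluster (b, c, e) and a singleton (i); raising g to 3 joins them into one complete cluster.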

slide-30
SLIDE 30

Preliminary results: Max-Gap Cluster Statistics

Reference set

complete clusters
complete clusters with length restriction
incomplete clusters

Whole genome comparison

upper bound
lower bound

Hoberman, Sankoff, and Durand. Journal of Computational Biology 2005.
Hoberman and Durand. RECOMB Comparative Genomics 2005.
Hoberman, Sankoff, and Durand. RECOMB Comparative Genomics 2004.

slide-31
SLIDE 31

Do all m blue genes form a significant cluster?

Reference set, complete clusters

m = 5

Given: a genome: G = 1, …, n unique genes

a set of m genes of interest (in blue)

slide-32
SLIDE 32

Reference set, complete clusters

Test statistic: the maximum gap observed between adjacent blue genes

P-value: the probability of observing a maximum gap ≤ g, under the null hypothesis

g = 2

m = 5

slide-33
SLIDE 33

Compute probabilities by counting

All possible unlabeled permutations
Permutations where the maximum gap ≤ g

The problem is how to count this
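
For very small genomes the two counts can be obtained by exhaustive enumeration, which is useful as a reference when checking any closed-form answer. This is only a sketch: it assumes an "unlabeled permutation" is a choice of m positions among n slots, and the function name is an assumption.

```python
from itertools import combinations
from math import comb


def count_complete_clusters(n, m, g):
    """Return (placements with maximum gap <= g, all placements) for m genes in n slots."""
    favourable = 0
    for positions in combinations(range(n), m):
        # positions are sorted, so consecutive entries are adjacent genes of interest
        if all(b - a - 1 <= g for a, b in zip(positions, positions[1:])):
            favourable += 1
    return favourable, comb(n, m)


favourable, total = count_complete_clusters(n=6, m=3, g=1)
p_value = favourable / total
```

The enumeration visits C(n, m) placements, so it is only practical for toy sizes; that cost is what motivates counting analytically.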

slide-34
SLIDE 34

number of ways to start a cluster, i.e. ways to place the first gene and still have w-1 slots left

w = (m-1)g + m

slide-35
SLIDE 35

ways to place the remaining m-1 blue genes, so that no gap exceeds g

g

number of ways to start a cluster, i.e. ways to place the first gene and still have w-1 slots left

slide-36
SLIDE 36

edge effects

w = (m-1)g + m

ways to place the remaining m-1 blue genes, so that no gap exceeds g
number of ways to start a cluster, i.e. ways to place the first gene and still have w-1 slots left

slide-37
SLIDE 37

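Counting clusters at the end of the genome

Gaps are constrained: 0 ≤ g_i ≤ g, for i = 1, …, m-1
And sum of gaps is constrained: g_1 + … + g_(m-1) ≤ l - m

(l = number of slots left, counting the first gene's slot)

l = w-1   …   l = m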
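
A brute-force sketch of the edge count, under the assumption that l is the number of slots available from the first blue gene to the end of the genome (first gene included), so the m-1 gaps must each lie between 0 and g and sum to at most l - m. The function name is hypothetical.

```python
from itertools import product


def placements_in_l_slots(m, g, l):
    """Count gap vectors (g_1, ..., g_(m-1)) with 0 <= g_i <= g and sum <= l - m."""
    return sum(
        1
        for gaps in product(range(g + 1), repeat=m - 1)
        if sum(gaps) <= l - m
    )


m, g = 3, 1
w = (m - 1) * g + m                       # largest possible cluster length
counts = {l: placements_in_l_slots(m, g, l) for l in range(m, w + 1)}
```

When l reaches w the sum constraint no longer binds and the count becomes (g+1)^(m-1); below w the count shrinks, which is the edge effect.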

slide-38
SLIDE 38

g_1   g_2   g_3   …   g_(m-1)

l < w

A known solution:

slide-39
SLIDE 39

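Counting clusters at the end of the genome

Gaps are constrained: 0 ≤ g_i ≤ g, for i = 1, …, m-1
And sum of gaps is constrained: g_1 + … + g_(m-1) ≤ l - m

(l = number of slots left, counting the first gene's slot)

l = w-1   …   l = m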

slide-40
SLIDE 40

Cluster Length: m, m+1, …, w-2, w-1, w

l = m:     d(m,g,m)
l = m+1:   d(m,g,m) + d(m,g,m+1)
…
l = w-2:   d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)
l = w-1:   d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

Line of Symmetry

slide-41
SLIDE 41

Exploiting Symmetry

[Figure: gap configurations showing that the number of clusters of length l = m, m+1, m+2, … equals the number of clusters of length l = w, w-1, w-2, …]

slide-42
SLIDE 42

Cluster Length: m, m+1, …, w-2, w-1, w

l = m:     d(m,g,m)
l = m+1:   d(m,g,m) + d(m,g,m+1)
…
l = w-2:   d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)
l = w-1:   d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

(g+1)^(m-1)   (g+1)^(m-1)   (g+1)^(m-1)

slide-43
SLIDE 43

Adding edge effects…

Starting positions
Ways to place remaining m-1
Starting positions near end
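
One way to read the counting argument is sketched below; it is a reconstruction under stated assumptions, not a formula quoted from the slides, and should be checked against the cited papers. It assumes n ≥ w, takes (n - w + 1) interior starting positions with (g+1)^(m-1) placements each, and uses the symmetry d(m,g,m+k) = d(m,g,w-k) to sum the (m-1)g edge positions to (m-1)g(g+1)^(m-1)/2. All function names are hypothetical, and a brute-force counter is included as a check.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def complete_cluster_count(n, m, g):
    """Placements of m genes in n slots with maximum gap <= g, assuming n >= w."""
    w = (m - 1) * g + m
    if n < w:
        raise ValueError("this sketch assumes n >= w = (m-1)g + m")
    ways = (g + 1) ** (m - 1)
    interior = (n - w + 1) * ways          # first gene leaves at least w-1 slots
    edge = (m - 1) * g * ways // 2         # first gene near the end (always an integer)
    return interior + edge


def complete_cluster_probability(n, m, g):
    """Probability of a complete cluster under random gene order."""
    return Fraction(complete_cluster_count(n, m, g), comb(n, m))


def brute_force_count(n, m, g):
    """Exhaustive reference count, feasible only for tiny n."""
    return sum(
        1
        for pos in combinations(range(n), m)
        if all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))
    )
```

The two terms simplify to (g+1)^(m-1) · (n - (m-1)(g/2 + 1)), which makes it cheap to evaluate probabilities across many values of m and g.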

slide-44
SLIDE 44

Probability of a complete cluster

[Plot: probability (log scale) versus number of genes of interest (m), for g = 2, 3, 5, 10, 15, 25, 50; n = 500]

slide-45
SLIDE 45

Using statistics to choose parameter values

[Plot: significant parameter values (α = 0.001) for n = 500; axes: number of genes of interest (m) and maximum allowed gap size (g)]

slide-46
SLIDE 46

Preliminary Results: Max-Gap Cluster Statistics

Reference set

complete clusters
complete clusters with length restriction
incomplete clusters

Whole genome comparison

upper and lower bounds

Hoberman, Sankoff, Durand. Journal of Computational Biology 2005.
Hoberman and Durand. RECOMB Comparative Genomics 2005.
Hoberman, Sankoff, Durand. RECOMB Comparative Genomics 2004.

slide-47
SLIDE 47

Whole genome comparison

A surprising result

If gene content is identical, the probability of a max-gap cluster is 1 (regardless of the allowed gap size)
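
A quick sanity check of this statement, assuming "identical gene content" means both genomes contain exactly the same n genes. The set of all genes then has no intervening non-cluster genes in either genome, so it forms a max-gap cluster even for g = 0, whatever the two gene orders are. The helper names are hypothetical.

```python
import random


def max_gap(genome, genes):
    """Largest number of intervening genes between adjacent members of `genes`."""
    wanted = set(genes)
    positions = [i for i, gene in enumerate(genome) if gene in wanted]
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def full_set_is_cluster(n, g, rng):
    """Shuffle two genomes with identical content and test the set of all genes."""
    genome1, genome2 = list(range(n)), list(range(n))
    rng.shuffle(genome1)
    rng.shuffle(genome2)
    everything = set(range(n))
    return max_gap(genome1, everything) <= g and max_gap(genome2, everything) <= g


always = all(full_set_is_cluster(30, 0, random.Random(seed)) for seed in range(20))
```

The random orders are irrelevant to the outcome, which is the point: with m = n, finding a cluster carries no evidence of shared ancestry.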

slide-48
SLIDE 48

Whole Genome Comparison: m ≤ n

What is the probability of observing a maximal max-gap cluster of size exactly h, if the genes in both genomes are randomly ordered?

A cluster is maximal if it is not a subset of a larger cluster

g ≤ 3   g ≤ 3

Two genomes of n genes with m homologous gene pairs

slide-49
SLIDE 49

Configurations that contain a cluster of exactly size h   ??

All configurations of two genomes

A constructive approach

slide-50
SLIDE 50

Constructive Approach

number of ways to place h genes so they form a cluster in both genomes
number of ways to place m-h remaining genes so they do not extend the cluster

Number of configurations that contain a cluster of exactly size h

slide-51
SLIDE 51

Where can we place the pink and green genes so that they do not extend this cluster of size three?

A tricky case…

g = 1 h = 3

gap > 1 gap > 1

With this placement, the cluster cannot be extended

slide-52
SLIDE 52

Moving genes further away from the cluster may make them more likely to extend the cluster

A tricky case…

g = 1 h = 3

gap = 1 gap = 1

slide-53
SLIDE 53

My whole-genome comparison results

I derived upper and lower bounds on the probability of observing a cluster containing h homologs, via whole genome comparison

Lower bound: guarantees no tricky cases
Upper bound: a few tricky cases sneak in

Hoberman, Sankoff, Durand. Journal of Computational Biology 2005.

slide-54
SLIDE 54

Whole-genome comparison cluster statistics

[Plots: cluster probability (log scale) versus cluster size h, comparing simulation with the upper and lower bounds, for g = 10 and g = 20; n = 1000, m = 250]

slide-55
SLIDE 55
E. coli vs B. subtilis

Under the null hypothesis, by g=25 all genes should form a single cluster
Complete cluster doesn’t form until g=110
Typical operon sizes
Clusters above the orange line are significant at the .001 level

Algorithm: Bergeron et al, 02
Statistics: Hoberman et al, 05

slide-56
SLIDE 56

Summary of preliminary work

Developed statistical tests using a combinatoric approach
  reference region
  whole genome comparison
Some surprising results
Results raise concerns about current methods used in comparative genomics studies

slide-57
SLIDE 57

Larger clusters do not always imply greater significance

A max-gap cluster containing many genes may be more likely to occur by chance than one containing few genes

[Plots: whole-genome comparison probability versus cluster size h (simulation, upper bound, lower bound), and reference-set probability versus number of genes of interest (m) for g = 2, 3, 5, 10, 15, 25, 50]

slide-58
SLIDE 58

Algorithms and Definition Mismatch

Greedy, bottom-up algorithms will not find all max-gap clusters
There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)

g = 2
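
A toy illustration of the mismatch, using a hypothetical four-gene example with g = 0 rather than the g = 2 figure on the slide. The whole gene set is a max-gap cluster, yet no subset of size 2 or 3 is one, so a method that only grows clusters from smaller clusters has nothing to start from. The exhaustive search below is for demonstration only and is not the divide-and-conquer algorithm cited above.

```python
from itertools import combinations


def max_gap(genome, genes):
    """Largest number of intervening genes between adjacent members of `genes`."""
    wanted = set(genes)
    positions = [i for i, gene in enumerate(genome) if gene in wanted]
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def clusters_of_size(genome1, genome2, size, g):
    """All gene sets of the given size that are max-gap clusters in both genomes."""
    return [
        set(genes)
        for genes in combinations(sorted(genome1), size)
        if max_gap(genome1, genes) <= g and max_gap(genome2, genes) <= g
    ]


genome1 = [1, 2, 3, 4]
genome2 = [3, 1, 4, 2]
by_size = {k: clusters_of_size(genome1, genome2, k, g=0) for k in (2, 3, 4)}
```

Here `by_size[2]` and `by_size[3]` are empty while `by_size[4]` contains the full set, so a bottom-up search seeded with conserved pairs would miss the only cluster that exists.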

slide-59
SLIDE 59

Extending the Model

Directions for generalization
Circular chromosomes
Multiple chromosomes
Genome self-comparison
Gene order and orientation
Gene families

slide-60
SLIDE 60

Outline

Introduction and Applications
Formal framework for gene clusters
An introduction to statistical issues
Preliminary work: Testing cluster significance
Proposed work

slide-61
SLIDE 61

Proposed Work Outline

Generalizing the model
At least one of the following:
1. Joint detection of orthologous genes and chromosomal regions
2. Finding and assessing clusters in multiple genomes
3. Detecting selection for spatial organization
Validation

slide-62
SLIDE 62

Joint Identification of Orthologous Genes and Chromosomal Regions

The identification of orthologous genes is a prerequisite for a marker-based approach

Orthology identification
  is often difficult to determine from gene sequence alone
  is an important unsolved research problem
  can be improved by incorporating genomic context

slide-63
SLIDE 63

An example: Which gene is the true ortholog?

[Figure: a query gene and its candidate orthologs in Species 1 and Species 2, ranked from most similar to least similar (1st of 4 through 4th of 4), alongside neighboring genes whose matches are unambiguous (1st of 1)]

slide-64
SLIDE 64

Problem: for more diverged genomes, unambiguous orthologs will be sparse and clusters will be more rearranged
Solution: Identify orthologs and gene clusters simultaneously

Identify homologous genes
Find gene clusters
Similar genomic context

slide-65
SLIDE 65

Work that combines sequence similarity and genomic context

Bansal, Bioinformatics 99
Kellis et al, J Comp Biol 04
Bourque et al, RECOMB Comp Genomics 05
Chen et al, ACM/IEEE Trans Comput Biol and Bioinf 05

Limitations

No flexible cluster definitions
No statistical approaches
Little real evaluation

slide-66
SLIDE 66

Possible computational approaches:

Expectation Maximization (EM)
  treat ortholog assignment as a hidden variable

Maximal bipartite matching
  use an objective function that incorporates both sequence similarity and spatial clustering

slide-67
SLIDE 67

Proposed Work

Generalizing the model
At least one of the following:
1. Joint detection of orthologous genes and chromosomal regions
2. Finding and assessing clusters in multiple genomes
3. Detecting selection for spatial organization
Validation

slide-68
SLIDE 68

Comparing Multiple Genomes Simultaneously

Comparison of multiple genomes offers significantly more power to detect highly diverged homologous segments

[Figure: Arabidopsis thaliana segments At 1 and At 2 compared directly and through rice. Vandepoele et al, 2002]

slide-69
SLIDE 69

Current Approaches

1. Identify clusters based on conserved pairs of genes, using heuristics

Limitation: A highly rearranged cluster may have no pairs in proximity

slide-70
SLIDE 70

Current Approaches

2. Identify clusters with conserved gene order

Limitation: rearranged clusters will not be detected

slide-71
SLIDE 71

Current Approaches

3. Search for max-gap clusters, but require the cluster to be found in its entirety in all genomes

Will lead to a reduction in power as more genomes are added

…No formal statistics

slide-72
SLIDE 72

Initial Investigations

Statistics: choice of test statistic; i.e., how to weight genes that occur in only a subset of the regions

Modeling: Maximum gap between genes with a match in any of the regions must be small

Algorithms: how to find such clusters
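
One possible reading of the modeling idea, sketched under assumptions: each region is a list of gene-family labels, a gene counts as matched if its label appears in at least one other region, and the multi-region cluster condition bounds the gap between consecutive matched genes in every region by g. The function names and toy regions are hypothetical, and other formalizations are equally plausible.

```python
def matched_max_gap(region, other_regions):
    """Max gap in `region` between consecutive genes matched in any other region."""
    elsewhere = set().union(*(set(r) for r in other_regions))
    positions = [i for i, gene in enumerate(region) if gene in elsewhere]
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def is_multi_region_cluster(regions, g):
    """True if every region keeps its matched genes within gaps of at most g."""
    return all(
        matched_max_gap(region, regions[:k] + regions[k + 1:]) <= g
        for k, region in enumerate(regions)
    )


# Hypothetical regions from three genomes; letters are gene-family labels.
regions = [list("abxcd"), list("acyyb"), list("dzb")]
gaps = [matched_max_gap(r, regions[:k] + regions[k + 1:]) for k, r in enumerate(regions)]
```

Note that gene d is shared only by the first and third regions, yet it still counts as matched, so the cluster does not have to appear in its entirety in every genome.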

slide-73
SLIDE 73

Proposed Work

Generalizing the model
At least one of the following:
1. Joint detection of orthologous genes and chromosomal regions
2. Finding and assessing clusters in multiple genomes
3. Detecting selection for spatial organization
Validation

slide-74
SLIDE 74

Tests for Selective Pressure on Spatial Organization

Probability of finding a cluster under the null hypothesis now depends on the phylogenetic distance between the species

Preliminary work:
  Null hypothesis: random gene order
  Alternate hypothesis: common ancestry

Proposed work:
  Null hypothesis: common ancestry
  Alternate hypothesis: functional selection

slide-75
SLIDE 75

Tests for selective pressure must consider phylogenetic distance

E. coli
Salmonella
Haemophilus influenzae
B. subtilis

Quite likely to occur by chance.
Less likely to occur by chance.

slide-76
SLIDE 76

Current Approaches

1. Discard closely related genomes, and test against random gene order

E. coli
Salmonella
Haemophilus influenzae
B. subtilis
slide-77
SLIDE 77

Current Approaches

2. Some formal statistical tests, but based on gene pairs only.

Limitation: considering only pairs of genes could result in a loss of power

slide-78
SLIDE 78

Detecting Selective Pressure on Spatial Organization

Initial Explorations

Searching for evidence of selective pressure to maintain non-operon structure in bacteria

Locations of clusters with respect to
  origin and terminus
  left and right arm of chromosome
  functional classification

slide-79
SLIDE 79

Proposed Work

Generalizing the model
At least one of the following:
1. Joint detection of orthologous genes and chromosomal regions
2. Finding and assessing clusters in multiple genomes
3. Detecting selection for spatial organization
Validation

slide-80
SLIDE 80

How Should Gene Cluster Statistics be Validated?

No established benchmarks

True evolutionary histories are rarely known
Rearrangement processes are not yet understood

We’d like to evaluate

Discriminatory power
Parameter selection strategies

Possible strategies depend on specific problem

Synthetic data
Hand-curated ortholog databases
Databases of experimentally verified operons

slide-81
SLIDE 81

timeline

[Timeline, September 2005 through December 2006: Loose ends; Model Extensions; Initial Investigations; Selected Problem(s) and Validation; Writing]

slide-82
SLIDE 82

Acknowledgements

My Thesis Committee
Barbara Lazarus Women@IT Fellowship
The Sloan Foundation
The Durand Lab

slide-83
SLIDE 83

Advantages of an analytical approach

Analyzing incomplete datasets
Principled parameter selection
Efficiency
Understanding statistical trends
Insight into tradeoffs between definitions

slide-84
SLIDE 84

The Max-Gap Definition is the Most Widely Used in Genomic Analyses

Blanc et al 2003, recent polyploidy in Arabidopsis
Venter et al 2001, sequence of the human genome
Overbeek et al 1999, inferring functional coupling of genes in bacteria
Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice
Vision et al 2000, duplications in Eukaryotes
Lawrence and Roth 1996, identification of horizontal transfers
Tamames 2001, evolution of gene order conservation in prokaryotes
Wolfe and Shields 1997, ancient yeast duplication
McLysaght et al 2002, genomic duplication during early chordate evolution
Coghlan and Wolfe 2002, comparing rates of rearrangements
Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast
Chen et al 2004, operon prediction in newly sequenced bacteria
Blanchette et al 1999, breakpoints as phylogenetic features
...