Principles and Applications of Modern DNA Sequencing
EEEB GU4055
Session 14: Phylogenomics
Today's topics
1. Phylogenomics
Characterize evolutionary history from a subset of sampled genomes (individuals).
Few genes across many taxa vs. many genes across few taxa.
Characterize whole genomes from a subset of sequenced markers.
Full genome, shotgun reads, assembly
Full genome, RADseq reads, assembly
It is important to examine evolutionary history across the entire genome.
A model that describes the expected waiting time until two or more samples share a most recent common ancestor. The distribution of coalescent times within a population, or between populations, provides information about their history. There are many genealogical histories that could possibly explain the genetic relatedness of a set of samples. We cannot observe the genealogies directly, only the sequence data that evolved on those genealogies. Coalescent simulations provide a means to ask: "can the genetic variation that I
The effective population size (Ne) of a population determines the probability that two samples share a common ancestor in the previous generation. This parameter does not translate directly to the actual (census) population size, though the two are likely correlated. Other factors, such as non-random mating and population structure, also affect Ne.
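The relationship between Ne and the coalescence probability can be written out explicitly. The sketch below assumes a diploid Wright-Fisher population with 2Ne gene copies (an assumption not stated on the slide), and the symbol T_2 for the waiting time of two samples is introduced here only for illustration.

```latex
% Probability that two samples share a parent in the previous generation
P(\text{coalesce in previous generation}) = \frac{1}{2N_e}

% Waiting time T_2 (in generations) until two samples coalesce is geometric
P(T_2 = t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \frac{1}{2N_e}

% Expected waiting time
E[T_2] = 2N_e
```

Under these assumptions, a larger Ne means a smaller per-generation chance of coalescence and therefore a longer expected wait.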
If we assume that a population is randomly mating (panmictic) and neutrally evolving, then the expected waiting time until n samples coalesce can be modeled entirely by Ne. Because n samples can share many possible genealogical histories (remember how big tree space is), and their genealogical relationships are expected to vary across their genomes (recombination makes different regions independent of others), we expect to observe a large variation in genealogical histories when examining many loci for n samples. The coalescent model treats genealogies as a random variable. We are interested in the expected distribution of variation when integrating over many genealogies.
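The idea that the waiting time until n samples coalesce depends only on Ne can be illustrated with a small simulation. This is a minimal sketch, assuming a diploid population and the standard continuous-time approximation in which k lineages coalesce at rate k(k-1)/2 per 2Ne generations. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_tmrca(n, ne, rng):
    """Simulate one time (in generations) until n samples reach their MRCA."""
    total = 0.0
    k = n
    while k > 1:
        # k lineages form k*(k-1)/2 pairs; each pair coalesces at
        # rate 1/(2*Ne) per generation (diploid assumption).
        rate = (k * (k - 1) / 2) / (2 * ne)
        total += rng.expovariate(rate)
        k -= 1
    return total


def expected_tmrca(n, ne):
    """Analytical expectation under the same assumptions: 4*Ne*(1 - 1/n)."""
    return 4 * ne * (1 - 1 / n)


rng = random.Random(42)
n, ne = 10, 1000
times = [simulate_tmrca(n, ne, rng) for _ in range(20000)]
mean_time = sum(times) / len(times)

print("simulated mean TMRCA:", round(mean_time))
print("expected TMRCA:", expected_tmrca(n, ne))
print("shortest and longest simulated TMRCA:", round(min(times)), round(max(times)))
```

The wide spread between the shortest and longest simulated times reflects the point above: each locus samples one genealogy from a distribution, so individual genealogies vary widely around the expectation.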
When modeling multiple populations, a "species tree" topology (e.g., "Species Tree") defines when different samples or their ancestors are able to share a parent in a previous generation. To predict the expected genetic similarity of samples in a structured coalescent model requires estimating Ne for each lineage as well as T, the divergence time of the populations. Modern phylogenetic inference methods are based on the multispecies coalescent model, which calculates the likelihood of observed genetic data given a set of parameters: Ne, T, and a topology. Searching over many topologies and many parameters can identify a best species tree model that explains variation among genealogies.
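How Ne and T jointly shape genealogical variation can be seen in the simplest multispecies coalescent case. The sketch below uses the standard three-taxon result, assuming one sample per species, a diploid population, and a constant Ne on the internal branch: the probability that a gene tree matches the species tree is 1 - (2/3)exp(-T/(2Ne)), with T measured in generations. The function name and the example values are hypothetical.

```python
import math


def concordance_probability(t_generations, ne):
    """Probability that a 3-taxon gene tree matches the species tree.

    t_generations: length of the internal species-tree branch in generations.
    ne: effective population size on that branch (diploid assumption).
    """
    # Convert the branch length to coalescent units (2*Ne generations each).
    t_coal = t_generations / (2 * ne)
    # The two sister lineages fail to coalesce on the internal branch with
    # probability exp(-t_coal); in that case only 1 of the 3 possible
    # topologies is concordant, so 2/3 of those outcomes are discordant.
    return 1 - (2 / 3) * math.exp(-t_coal)


for t in (0, 1000, 10000, 100000):
    p = concordance_probability(t, ne=5000)
    print(f"T = {t:>6} generations, Ne = 5000: P(concordant) = {p:.3f}")
```

Short internal branches or large Ne values push the probability toward 1/3, which is why many loci are needed to recover the species tree when divergences are rapid.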
Figure: Locus 1, Locus 2, Locus 3, Locus 4, Locus 5 (complete species-level sampling).
Wet lab techniques for taking extracted DNA and ligating synthesized nucleotides to it to prepare it for a sequencing machine. Adapter sequences are oligonucleotides with a sequence that binds to some feature
Barcodes/Indices are unique molecular identifiers that can be ligated (attached) to DNA fragments so that they can be pooled for sequencing and later assigned to different samples based on the barcode (demultiplexed).
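Demultiplexing can be sketched in a few lines. This is a simplified illustration, assuming inline barcodes of equal length at the start of each read and exact matching only. The barcode sequences, sample names, and reads are made up for the example, and real demultiplexing tools typically also handle sequencing errors and separate index reads.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by the barcode at the start of each read."""
    # Assumes every barcode has the same length (hypothetical simplification).
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode matches no sample are kept separately.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
pooled_reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAAAAA", "NNNNGGGGGG"]

result = demultiplex(pooled_reads, barcodes)
for sample, inserts in result.items():
    print(sample, inserts)
```

The barcode is trimmed from each read before it is stored, so downstream assembly sees only the sample's own DNA.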
Methods for subsampling the genome to select particular regions for sequencing. Requires a priori knowledge about the sequence at the regions of interest. Design and order synthesized RNA baits that will bind to the target DNA regions. These baits are attached to magnetic beads (typically through biotin-streptavidin binding), which allows them to be pulled out of solution with powerful magnets. This will enrich the DNA sample for the targeted regions. Shotgun sequence the enriched library and assemble reads into contigs overlapping the targeted region.
The exome is composed of all of the exons within the genome. It is different from the transcriptome, which contains all RNA transcribed in a cell. The transcriptome will vary among different cell types whereas the exome does not. Targeted exome sequencing uses hybrid target capture to enrich a DNA extraction for coding regions before shotgun sequencing. It requires a priori knowledge of the gene sequences. Whole Exome Sequencing is mostly used in human biomedical research, and model
high-quality reference genome and is costly (i.e., needs to be used many times to recoup costs).
For phylogenomic analyses we typically do not need the whole exome, and instead design baits for just a subset of exons: in particular, exons that are highly conserved and occur as a single copy (not duplicated). RNA baits can be designed for many closely related taxa based on one or more closely related genomes. If the samples differ too much from the taxon used for bait design, you end up with missing data.
Some genomic regions, known as ultraconserved elements (UCEs), have been identified that are extremely highly conserved even among very divergent taxa (e.g., all birds or all mammals). Some of these regions have unknown functions; others are related to important developmental genes. Baits have been designed that target these UCE regions and extend away from them for several hundred base pairs. The center has almost no variation, but more variation is detected toward the ends of the contigs. Whereas it is often very hard to align orthologous regions among very distantly related species, UCEs seem to work well for obtaining many hundreds or thousands
Subsample many thousands of regions across the genome without need to design
Initially used for association mapping and genetic maps, where sparsely spaced markers are sufficient to identify ancestry relative to parents. But because it is easy to generate thousands of markers, it also became popular for population genetic and phylogenetic analyses.
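RADseq (restriction-site associated DNA sequencing) samples the regions next to restriction enzyme recognition sites, which is why no bait design is needed. The toy in silico digest below is only a sketch: it assumes the EcoRI recognition sequence GAATTC, treats each read as simply starting at the recognition site (real protocols cut within the site and read outward from it), and uses a made-up genome string and a hypothetical function name.

```python
def find_rad_loci(genome, site="GAATTC", read_len=10):
    """Return (position, read) pairs for every occurrence of the site."""
    loci = []
    pos = genome.find(site)
    while pos != -1:
        # Simplification: the read starts at the recognition site itself.
        loci.append((pos, genome[pos:pos + read_len]))
        pos = genome.find(site, pos + 1)
    return loci


# Toy genome with two recognition sites, at positions 2 and 24.
genome = "TT" + "GAATTC" + "AAAACCCCGGGG" + "TTTT" + "GAATTC" + "ACGT"
loci = find_rad_loci(genome)
print("loci recovered:", loci)

# A single substitution inside the second site removes that locus entirely.
mutated = genome[:24] + "GAGTTC" + genome[30:]
loci_after_mutation = find_rad_loci(mutated)
print("loci recovered after mutation:", loci_after_mutation)
```

The second call shows that a locus is only sampled when its recognition site is intact, so a sample carrying a mutation in the site contributes no data for that locus.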
they accumulate mutations) and so it is characterized by a lot of missing data
questions to simply shotgun sequence the whole genome.