SLIDE 1

Introduction Contribution Materials and Methods Results Conclusions

A base composition analysis of natural patterns for the preprocessing of metagenome sequences

Oliver Bonham-Carter, Dhundy Bastola, Hesham Ali College of Information Science & Technology School of Interdisciplinary Informatics Peter Kiewit Institute University of Nebraska at Omaha Omaha, NE USA 26 April 2013

SLIDE 2

1. Introduction: Problem, Motivation
2. Contribution: Our Study
3. Materials and Methods: Spectrum Sets, Examples of Spectrum Sets, Proportions, Data, Experiment and Flow Chart, Association
4. Results: Phylogeny
5. Conclusions: References

SLIDE 3

A Preprocessing Step to de novo Sequencing

A genetic sequence is reconstructed by merging smaller pieces (reads) into the whole. Contigs are made of combined reads.

SLIDE 4

Sequence Assembly: Similar to a Jigsaw Puzzle

Smaller pieces come together to build the whole.

SLIDE 5

Mixing Pieces Makes a Harder Jigsaw Puzzle

Puzzle building is frustrated by the addition of foreign pieces in the mix.

SLIDE 6

Assembly of multiple sequences by de novo technologies

Often there are multiple sequences present in the sequencing pool.

SLIDE 7

Sequencing Alignment

End-regions of reads are analyzed for adjacency properties. Increased analysis is now necessary due to the added foreign reads.

SLIDE 8

Contribution of this Study: Base Composition Analysis

We propose a statistical method to cluster related sequence data into groups. This step reduces the search space when aligning the individual reads of the pool. The method was verified with synthetic and biological data.
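
To make the search-space claim concrete, here is a rough sketch that counts read pairs under a simplified all-vs-all overlap model. The model, the function names, and the pool sizes are assumptions for illustration only; the slides do not say how the aligner enumerates candidate pairs.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Unordered read pairs in a naive all-vs-all overlap search (assumed model)."""
    return n_reads * (n_reads - 1) // 2


def binned_comparisons(bin_sizes) -> int:
    """Read pairs when comparisons are made only within each bin."""
    return sum(pairwise_comparisons(n) for n in bin_sizes)


# Hypothetical pool of 1000 reads, split by preprocessing into two bins of 500.
unbinned = pairwise_comparisons(1000)
binned = binned_comparisons([500, 500])
print(unbinned, binned)
```

Under this simplified model, splitting the pool into two equal bins roughly halves the number of candidate pairs, and more bins reduce it further.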

SLIDE 9

Contribution of this Study: Base Composition Analysis

SLIDE 10

Spectrum Sets from Restriction Sites for Statistical Analysis

Restriction sites are regions in foreign DNA (e.g., viral DNA) where bacterial enzymes cut to destroy the DNA of an invading threat.

SLIDE 11

Spectrum Sets from Restriction Sites for Statistical Analysis

Restriction sites are used to create lists of DNA words (spectrum sets). The proportional content of all these words (motifs) is used to determine sequence relatedness.

SLIDE 12

Four Spectrum Sets From All Known Restriction Sites (Length 6)

SLIDE 13

Examples of “Home Grown” Spectrum Sets

The restriction sites used by Clostridium and Staphylococcus are different.

SLIDE 14

Collecting Proportions of Motifs Over Sequence Data: Length 6 Motifs

[Equation: proportion of motif m_i over read sequence S_L]

where m_i is a motif, S_L is a read sequence, count(m_i) is the number of occurrences of m_i found in S_L, and |m_i| and |S_L| are the lengths of the motif and the sequence, respectively. Since we are not using the entire sample space (all possible length-n motifs), proportions were appropriate.
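
The slide's equation did not survive extraction, so the sketch below is only one plausible reading of the definitions above: the proportion of a read covered by a motif, count(m_i) * |m_i| / |S_L|. That formula, the choice to count overlapping occurrences, and all function names are assumptions, not the authors' stated method.

```python
def count_occurrences(motif: str, sequence: str) -> int:
    """Count occurrences of motif in sequence, overlapping matches included (assumed)."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


def motif_proportion(motif: str, sequence: str) -> float:
    """ASSUMED formula: count(m_i) * |m_i| / |S_L|; the original equation is not in the text."""
    if not sequence:
        return 0.0
    return count_occurrences(motif, sequence) * len(motif) / len(sequence)


def proportion_profile(motifs, sequence: str):
    """One proportion per motif of a spectrum set, in the order given."""
    return [motif_proportion(m, sequence) for m in motifs]


# Hypothetical 15-base read containing the motif AAATTT twice.
read = "AAATTTGGGAAATTT"
print(proportion_profile(["AAATTT", "AATTTA"], read))
```

A profile like this, computed per read or contig over one spectrum set, is the kind of vector that the later association and phylogeny slides compare.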

SLIDE 15

Organisms

Organism                    Contig Originator   Division
Bifidobacterium longum      NC_004307           Actinobacteria
Mycobacterium bovis         NC_002945           Actinobacteria
Clostridium tetani          NC_004557           Firmicutes
Staphylococcus aureus       NC_007622           Firmicutes
Burkholderia pseudomallei   NC_012695           Proteobacteria
Campylobacter jejuni        NC_008787           Proteobacteria

Ten trials of (6 choose 2) = 15 combinations: 10 * 15 = 150 experiments, each with fresh sequence reads. The Contig Originator column displays the fully assembled sequences processed via MetaSim [1] to make synthetic reads.

[1] MetaSim: http://ab.inf.uni-tuebingen.de/software/metasim/
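
The experiment count can be checked with a short enumeration. This sketch assumes that the 15 comes from choosing 2 of the 6 listed organisms per experiment; the slide gives only the arithmetic, and the variable names are illustrative.

```python
from itertools import combinations

ORGANISMS = [
    "Bifidobacterium longum",
    "Mycobacterium bovis",
    "Clostridium tetani",
    "Staphylococcus aureus",
    "Burkholderia pseudomallei",
    "Campylobacter jejuni",
]
TRIALS = 10

# Assumed design: every unordered pair of organisms, repeated for each trial.
pairs = list(combinations(ORGANISMS, 2))
experiments = [(trial, a, b) for trial in range(1, TRIALS + 1) for a, b in pairs]

print(len(pairs), len(experiments))
```

Each tuple stands for one run in which fresh synthetic reads would be generated for the two organisms before the analysis.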

SLIDE 16

Flowchart of Algorithm

SLIDE 17

Association

DNA sequences appear to naturally have a unique base composition. Related sequences cluster (associate) together.

SLIDE 18

Clostridium and Staphylococcus Genomes, CCCGGG-Spectrum Set

[Figure legend: Clostridium, Staphylococcus]

SLIDE 19

Clostridium and Staphylococcus Genomes, AAATTT-Spectrum Set

[Figure legend: Clostridium, Staphylococcus]

SLIDE 20

An Unknown Sequence Joins the Pool Party

SLIDE 21

An Unknown Sequence Joins the Pool Party

Addition of Clostridium Sequence

Note: Spectrum set CCCGGG. [Figure legend: Clostridium, Staphylococcus]

SLIDE 22

An Unknown Sequence Joins the Pool Party

Addition of Clostridium Sequence

Note: Spectrum set AAATTT. [Figure legend: Clostridium, Staphylococcus]

SLIDE 23

Clostridium and Staphylococcus, CCCGGG-Spectrum Set, with Clostridium Contigs

[Figure legend: Clostridium, Staphylococcus]

SLIDE 24

Mixed Contigs: Clostridium, Staphylococcus and Burkholderia (Bacterial Genomes)

AAATTT spectrum set: there is a high contrast between one of the three genomes and the other two. Remove this set and rerun the test.

[Figure legend: Clostridium, Staphylococcus, Burkholderia]

SLIDE 25

Phylogeny

Remark: We successfully used our method to assign the first chromosomes of the following organisms to their rightful phylogenetic groupings.

Organism                     Common Name
Caenorhabditis elegans       Worm
Canis lupus familiaris       Dog
Drosophila melanogaster      Fruit fly
Mus musculus                 Mouse
Mycoplasma hyorhinis GDL-1   Bacteria
Oryctolagus cuniculus        Rabbit
Rattus norvegicus            Rat

SLIDE 26

Taxonomy Tree from Genbank

http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi

SLIDE 27

CCGGAT - Best Tree

Note: The heatmap graphic is removed to show only the tree.

SLIDE 28

AAATTT - Second Best Tree

Note: The heatmap graphic is removed to show only the tree.

SLIDE 29

Limitations

Information Based: Successful phylogeny grouping requires ample sequence data (> 700 bp). Next-generation sequencing trends indicate that improving technology yields longer reads each year. Contigs (longer sequences made from combined reads) are suitable.

Spectrum Set Behavior: Spectrum sets do not perform similarly on each data set. Contrast-based analysis requires knowledge of organisms' natural uses of restriction sites.

SLIDE 30

Conclusions: Preprocessing for Multi-Sequence Assembly

The Separation of Mixed Reads

We proposed a binning preprocessing method that separates and partitions related sequence data. This method can reduce the search space when aligning reads, which expedites sequence assembly. The structural properties of sequence material can be used to infer phylogenetic properties.
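
As a rough illustration of binning on motif-proportion profiles, the sketch below assigns an unknown profile to the group whose centroid is nearest in Euclidean distance. Nearest-centroid assignment is a stand-in chosen for brevity; the slides do not name the clustering procedure or distance measure used, and the profile values here are made-up toy numbers, not results.

```python
import math


def euclidean(p, q) -> float:
    """Euclidean distance between two equal-length proportion profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(profiles):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def assign_to_bin(profile, bins) -> str:
    """Return the name of the bin whose centroid is closest to the profile."""
    return min(bins, key=lambda name: euclidean(profile, centroid(bins[name])))


# Toy, hypothetical profiles over a two-motif spectrum set.
bins = {
    "group_A": [[0.10, 0.02], [0.12, 0.03]],
    "group_B": [[0.01, 0.09], [0.02, 0.11]],
}
unknown = [0.11, 0.02]
print(assign_to_bin(unknown, bins))
```

Only reads that land in the same bin would then need to be aligned against each other.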

SLIDE 31

References

Bonham-Carter O, Ali H, Bastola D, “A base composition analysis of natural patterns for the pre-processing of metagenome sequences,” BMC Bioinformatics. (accepted, 2013)

Bonham-Carter O, Ali H, Bastola D, “A Meta-genome Sequencing and Assembly Preprocessing Algorithm Inspired by Restriction Site Base Composition,” 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW).

Wang Y, Leung HC, Yiu SM, Chin FY, “MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample,” Bioinformatics. 2012 Sep 15;28(18):i356-i362.

SLIDE 32

We would like to thank the students, faculty and staff of the UNO Bioinformatics Core Facility for their support. This project has been funded by grants from the National Center for Research Resources (5P20RR016469) and the National Institute of General Medical Sciences (NIGMS) (8P20GM103427).

Thank You! Questions?

  • bonhamcarter@unomaha.edu

IS&T Bioinformatics http://bioinformatics.ist.unomaha.edu/

SLIDE 33

Motifs

Set Seed   Available Motifs
AAATTT     12
CCCGGG     12
AATTCG     156
CCGGAT     156

The numbers of available motifs belonging to each spectrum set. The motifs in a spectrum set are non-palindromic and are permutations of the set seed.
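
The motif counts can be reproduced with a short sketch that builds a spectrum set from its seed. It assumes "palindromic" means a sequence equal to its own reverse complement, the usual sense for restriction sites; the function names are illustrative.

```python
from itertools import permutations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def spectrum_set(seed: str):
    """Distinct permutations of the seed that are not reverse-complement palindromes."""
    distinct = {"".join(p) for p in permutations(seed)}
    return sorted(m for m in distinct if m != reverse_complement(m))


for seed in ("AAATTT", "CCCGGG", "AATTCG", "CCGGAT"):
    print(seed, len(spectrum_set(seed)))
```

Under this reading the set sizes come out to 12, 12, 156 and 156, matching the table. Note that the seeds AAATTT and CCCGGG are themselves reverse-complement palindromes, so they are excluded from their own sets.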