Genomic sequence analysis: gene prediction

We want to know how this…

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG
TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG
CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG
CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA
GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC
TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT
AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT
GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA
AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT
AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC
CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT
AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

…becomes this

What are we looking for?
[Figure: a genomic region containing repetitive elements, promoters, and Genes A to D. Genes A, C, and D give mRNA A, mRNA C, and mRNA D, which give Protein A, Protein C, and Protein D; Gene B gives RNA B. The products correspond to Function A to Function D.]

Getting all genes
• Genome sequencing
– Access to the entire genome allows us to learn more about genome organization
– Regulatory elements
– Only a small percentage of the genome codes for genes
– Hard to identify less typical genes
– High rate of false positives
• EST sequencing
– Requires less sequencing, since it is focused on coding sequence only
– Low rate of false positives, although even 10% of EST sequences could be artifacts
– Genes with very restricted expression may never be discovered
– In most cases gives only partial sequences

Gene identification methods
• Molecular techniques
– Very laborious
– Time consuming
– Expensive
– Low rate of false positives
• Computational methods
– Fast
– Relatively low cost
– High rate of false positives
– Poor performance on less typical genes

Genome sequencing
Shotgun sequencing is the method of choice for small genomes.
[Figure: shotgun approach (clone into vector, sequencing, assembly) and clone-by-clone approach]
Repetitive sequences make correct assembly difficult.

Before we start analysis…
• We have to:
– Check sequence quality
– Remove contamination
– Assemble sequence reads into longer contigs
– Close gaps (in an ideal situation)

Gene finding methods classification
Similarity based predictors: make use of similarity to already known genes and the proteins coded by these genes, as well as expression data, including sequences from cDNAs and data from hybridization experiments (tiling arrays, for example).
Dual- and multi-genome predictors: rely on the fact that functional regions of a genome sequence are more conserved during evolution.
Model based predictors: use a single genome sequence; the exon/intron structure is predicted based on absolute and bulk properties of the sequence.
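As a rough illustration of the model based idea, the simplest possible "model" of a gene is an open reading frame: an ATG followed, in the same frame, by a stop codon. The sketch below scans the three forward reading frames for such stretches. The function name find_orfs, the min_codons threshold, and the toy sequence are assumptions made for this example only; real predictors also handle introns, splice sites, the reverse strand, and codon usage, and are trained on known genes.

```python
# Minimal open reading frame (ORF) scan on the forward strand.
# Hypothetical helper for illustration; not a real gene predictor.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for ATG...stop stretches.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. min_codons counts the stop codon too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


toy = "CCATGAAATTTTAGGG"  # assumed toy sequence: ATG AAA TTT TAG in frame 2
print(find_orfs(toy))
```

Raising min_codons is the crudest way to trade sensitivity for fewer false positives: short ORFs occur by chance very often in random sequence.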

Comparative genomics - MultiPipMaker
http://pipmaker.bx.psu.edu/pipmaker/

Similarity search
• We can check whether any fragment of our sequence shows similarity to an already known protein. We can also check whether any mRNA sequences or ESTs align well with the genomic sequence. Based on similarity we can deduce the gene structure and protein function.

Model based methods
• We take advantage of what we have already learned about gene structures and the features of coding sequences. Based on this knowledge we can build a theoretical model, develop an algorithm to search for important features, train it on known data, and use it to search for coding sequences in anonymous genomic fragments.

All information is in the DNA. We just have to learn how to read the code, the program for life.

Genetic code

Program for life
• DNA in our cells stores information in a way that is very similar to the way computers do.
• Instead of being a binary memory, where everything is either 0 or 1, DNA uses a 4-letter alphabet: A, C, G, T
• Using the computer metaphor we can say that:
– A plant cell does not look like a mouse cell because their "programs" are different
– Liver cells work differently from lung cells because of different input to the program
– Children look like their parents because their program is a "revision" of their parents' program
– Many diseases are caused by "bugs" in the program:
• Familial dysautonomia: a simple mistake in one line of code
• Huntington's disease: a "line" of code gets repeated a bunch of times by accident
• Different ways to solve the same problem:
– Plants: photosynthesis = turn light into sugar
– Animals: eat plants or other animals
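To make the binary versus 4-letter comparison concrete: a 4-letter alphabet carries 2 bits per letter, so a DNA string can be packed into an ordinary integer. The sketch below shows one way to do this. The particular mapping (A=00, C=01, G=10, T=11) and the names pack and unpack are arbitrary choices for this example, not a standard taken from the slides.

```python
# Pack DNA into 2 bits per base and back again.
# The bit assignment below is an assumed, arbitrary convention.
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def pack(seq):
    """Encode a DNA string as an integer, first base in the highest bits."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Decode `length` bases from an integer produced by pack()."""
    return "".join(
        BASES[(value >> (2 * i)) & 0b11] for i in reversed(range(length))
    )


packed = pack("ACGT")
print(bin(packed), unpack(packed, 4))
```

The length has to be stored separately because leading A bases are encoded as zero bits and would otherwise be lost.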

Genetic code

From DNA to protein
ATGGTCCTACACACGATCGATCGATGCATGTGA
ATG GTC CTA CAC ACG ATC GAT CGA TGC ATG TGA
M   V   L   H   T   I   D   R   C   M   STOP
M L V C I D R H M T

Pseudogenes and repetitive elements
Gene structure
Genes may overlap
Complicated gene structures
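The codon-by-codon reading in "From DNA to protein" can be reproduced with a small lookup table. The sketch below builds the standard genetic code from the conventional TCAG ordering and translates the codons ATG GTC CTA CAC ACG ATC GAT CGA TGC ATG TGA, stopping at the first stop codon. Under the standard code GAT encodes aspartate (D). The names CODON_TABLE and translate are chosen for this example; organisms or organelles with alternative genetic codes would need a different table.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order
# (first base changes slowest, third base fastest). "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0 until the first stop codon or the end."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break  # stop codon: translation ends here
        protein.append(aa)
    return "".join(protein)


codons = "ATG GTC CTA CAC ACG ATC GAT CGA TGC ATG TGA"
print(translate(codons.replace(" ", "")))  # MVLHTIDRCM
```

Shifting the start position by one or two bases changes every codon, which is why finding the correct reading frame matters for gene prediction.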
