EMBL Database Growth (growth statistics as of 05/30/02)

EMBL Divisions (05/30/02)

Computational Genomics

Sequence determination
- Signal processing
- Assembly of shotgun sequence fragments and determination of a consensus sequence

Sequence analysis
- Search for genes, regulatory patterns, repeats, etc., in the inferred genomic sequence
- Conceptual translation of the predicted genes into protein sequences
- Inferences about the structural, functional, evolutionary, etc., properties of the putative proteins

Looking for similarities in biological databases

Biological databases:
- Primary data:
  - sequences: nucleic acids or proteins; curated or not; complete or not; redundant or not; general or specialized (organisms...)
  - structures
- Derived data: patterns, motifs, profiles, etc.

Query:
- Sequence: nucleic acid or protein, "finished" or not, etc.
- Structural data
- Motif, profile...

Similarity:
- Global vs local, and other variations
- How to measure it?
- What is its biological interpretation? Its significance?

These notions need to be better defined and formalized.

Interpretation of biological similarity
- HOMOLOGY: similarity resulting from a divergent evolutionary process (mutations and selection)
- ANALOGY: similarity resulting from convergent evolution

From W. M. Fitch, 2000: "Homology: a personal view on some of the problems", Trends in Genetics, 16, p. 227-231

HOMOLOGIES
- Speciation: ORTHOLOGY
- Gene duplication: PARALOGY
- Lateral transfer: XENOLOGY

Introduction to Computational sequence analysis

1. Pairwise sequence comparison
   1.1 Algorithms for sequence alignments
   1.2 Scoring schemes
2. Similarity searching in sequence databases
   2.1 Sequence databases
   2.2 The Fasta and Blast programs
   2.3 Scores and statistics
3. Multiple alignments and motifs
   3.1 Algorithms: how to compute global or local multiple alignments?
   3.2 Representation of the information contained in a multiple alignment
   3.3 Examples of resources and applications of these concepts in the context of similarity searches
   3.4 Back to the algorithms: motif inference

Other important aspects of computational sequence analysis:
- Gene prediction
- Protein structure prediction and homology modeling
- Phylogenetic inference

DOT-PLOT
[Figure: dot-plot comparing the sequence ACRSKEHYDDEKEH (horizontal axis) with the sequence ACRTKEHYEKEH (vertical axis); each dot marks a pair of matching residues.]

Word matches
[Figure: dot-plot of ACRSKEHYDDEKEH (horizontal axis) against ACRTKEHYEKEH (vertical axis) in which only matching words are marked.]

Approximate word matches
[Figure: dot-plot of ACRSKEHYDDEKEH (horizontal axis) against ACRTKEHYEKEH (vertical axis) in which approximately matching words are marked.]

Parameters:
- w = size of the sliding words or windows
- t = threshold applied to the score of the comparison between two windows (or words)

The score itself may be computed as the number of identities between the two words, or by using a scoring matrix for amino acids.
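The roles of w and t can be illustrated with a minimal sketch. It assumes the score of a window comparison is the number of identities (the first option mentioned on the slide) and that a dot is placed at the start position of each window pair whose score reaches t. The function names `dot_plot` and `render` are hypothetical, and the second sequence is read off the extracted figure, so it may not match the original slide exactly.

```python
def dot_plot(seq_a, seq_b, w=1, t=1):
    """Return the set of (i, j) start positions where the windows of length w
    beginning at seq_a[i] and seq_b[j] share at least t identical positions."""
    dots = set()
    for i in range(len(seq_a) - w + 1):
        for j in range(len(seq_b) - w + 1):
            identities = sum(
                1 for k in range(w) if seq_a[i + k] == seq_b[j + k]
            )
            if identities >= t:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a along the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for j, symbol in enumerate(seq_b):
        row = ["." if (i, j) in dots else " " for i in range(len(seq_a))]
        lines.append(symbol + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    a = "ACRSKEHYDDEKEH"   # horizontal sequence from the slide
    b = "ACRTKEHYEKEH"     # vertical sequence (assumed, read off the figure)
    print(render(a, b, dot_plot(a, b, w=1, t=1)))  # plain dot-plot
    print(render(a, b, dot_plot(a, b, w=3, t=3)))  # exact word matches
    print(render(a, b, dot_plot(a, b, w=3, t=2)))  # approximate word matches
```

With w = 1 and t = 1 this reduces to the plain dot-plot; setting t = w keeps only exact word matches, and lowering t below w admits approximate matches, so the exact matches are always a subset of the approximate ones.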

Global alignment

seq A  MAKSKSSQGASGARRKPAPSLYQHISSFKPQFSTRVDDVLHFSKTLTWRSEIIPDKSKGT
seq B  MPKK--------VWKSSTPSTYEHISSLRPKFVSRVDNVLHQRKSLTFSNVVVPDKKNNT

seq A  LTTSLLYSQGSDIYEIDTTLPLKTFYDDDDDDDNDDDDEEGNGKTKSAATPNPEYGDAFQ
seq B  LTSSVIYSQGSDIYEIDFAVPLQ----------------------EAASEPVKDYGDAFE

seq A  DVEGKPLRPKWIYQGETVAKMQYLESSDDSTAIAMSKNGSLAWFRDEIKVPVHIVQEMMG
seq B  GIENTSLSPKFVYQGETVSKMAYLDKTGETTLLSMSKNGSLAWFKEGIKVPIHIVQELMG

seq A  PATRYSSIHSLTRPG-------SLAVSDFDVSTNMDTVVKSQSNGYEEDSILKIIDNSDR
seq B  PATSYASIHSLTRPGDLPEKDFSLAISDFGISNDTETIVKSQSNGDEEDSILKIIDNAGK

seq A  PGDILRTVHVPGTNVAHSVRFFNNHLFASCSDDNILRFWDTRTADKPLWTLSEPKNGRLT
seq B  PGEILRTVHVPGTTVTHTVRFFDNHIFASCSDDNILRFWDTRTSDKPIWVLGEPKNGKLT

seq A  SFDSSQVTENLFVTGFSTGVIKLWDARAVQLATTDLTHRQNGEEPIQNEIAKLFHSGGDS
seq B  SFDCSQVSNNLFVTGFSTGIIKLWDARAAEAATTDLTYRQNGEDPIQNEIANFYHAGGDS

seq A  VVDILFSQTSATEFVTVGGTGNVYHWDMEYSFSRNDDDNEDEVRVAAPEELQGQCLKFFH
seq B  VVDVQFSATSSSEFFTVGGTGNIYHWNTDYSLSKYNPDDTIAPPQDATEESQTKSLRFLH

seq A  TGGTRRSSNQFGKRNTVALHPVINDFVGTVDSDSLVTAYKPFLASDFIGRGYDD
seq B  KGGSRRSPKQIGRRNTAAWHPVIENLVGTVDDDSLVSIYKPYTEES-------E

Alignment as a path in a graph
[Figure: alignment graph for the sequences AATCTAC (horizontal axis) and ATCATC (vertical axis); one path through the graph is colored.]
Alignment corresponding to the colored path: _ _ A T C A T C _ A A T C T A C

Scoring Schemes

Similarity measures:
- the scores may be either positive or negative
- the greater the similarity between two compared symbols, the greater the score of their comparison
- when using a similarity scoring scheme, we want to maximize the score of the alignment

Distance measures:
- scores are never negative and increase when the similarity decreases
- a simple distance scoring scheme: if x = y, then d(x,y) = 0, else d(x,y) = 1; d(x,-) = d(-,y) = 1
- using a distance measure, we want to minimize the score of the alignment

Alignments

An alignment is defined as a series of paired symbols that are either letters from the alphabet of the sequences or the symbol for a gap.

Given two sequences $A = a_1 a_2 \dots a_m$ and $B = b_1 b_2 \dots b_n$, one alignment between the two sequences is represented as:

$$\begin{pmatrix} x_1 & x_2 & \dots & x_L \\ y_1 & y_2 & \dots & y_L \end{pmatrix}$$

where each $x_k$ is a letter of $A$ or a gap, and each $y_k$ is a letter of $B$ or a gap.

The score of an alignment is defined as the sum of the scores of all the paired symbols:

$$S_L = \sum_{k=1}^{L} s(x_k, y_k)$$

which can be written as:

$$S_L = S_{L-1} + s(x_L, y_L)$$

with:

$$S_{L-1} = \sum_{k=1}^{L-1} s(x_k, y_k)$$

The optimal alignment is defined as the one having the best score ("best" meaning "lowest" or "highest", depending on the kind of scoring scheme that is used).
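The definition of the alignment score as a sum over paired symbols can be sketched as follows. The unit distance is the simple scheme given on the Scoring Schemes slide. The +1/-1 similarity function, the names `unit_distance`, `unit_similarity` and `alignment_score`, and the example alignment are assumptions made for illustration only.

```python
GAP = "-"


def unit_distance(x, y):
    """Simple distance scheme: 0 for identical symbols, 1 for a mismatch
    or for a symbol paired with a gap."""
    return 0 if x == y else 1


def unit_similarity(x, y):
    """Hypothetical similarity scheme: +1 for identical symbols, -1 for a
    mismatch or for a symbol paired with a gap."""
    return 1 if x == y else -1


def alignment_score(row_a, row_b, pair_score=unit_distance):
    """Sum the scores of all paired symbols of an alignment given as two
    rows of equal length (letters or the gap symbol)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            raise ValueError("a column cannot pair a gap with a gap")
        total += pair_score(x, y)
    return total


if __name__ == "__main__":
    # One possible alignment of ATCATC with AATCTAC (chosen for illustration).
    print(alignment_score("A-TCATC", "AATCTAC"))                   # distance: 3
    print(alignment_score("A-TCATC", "AATCTAC", unit_similarity))  # similarity: 1
```

The same alignment gets a low score under the distance scheme and a positive score under the similarity scheme, which is why "best" means "lowest" in one case and "highest" in the other.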

Dynamic Programming
[Figure: dynamic programming matrix; the first row and first column correspond to the gap symbol "_", and the top-left cell is initialized to 0.]
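For reference, the recurrence commonly used to fill such a matrix under a distance scheme is given below. This is the standard textbook form written with the d(x,y) notation of the Scoring Schemes slide; the symbols D, a_i and b_j are assumed here and may differ from the notation on the original slide.

```latex
D(0,0) = 0, \qquad
D(i,0) = D(i-1,0) + d(a_i,-), \qquad
D(0,j) = D(0,j-1) + d(-,b_j)

D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(a_i,b_j) \\
D(i-1,j) + d(a_i,-) \\
D(i,j-1) + d(-,b_j)
\end{cases}
```

Each cell therefore depends only on its three neighbors (diagonal, above, left), so the matrix can be filled row by row, and D(m,n) is the score of an optimal global alignment.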

Dynamic Programming - Global alignment

        A  A  T  C  T  A  C
     0  1  2  3  4  5  6  7
  A  1  0  1  2  3  4  5  6
  T  2  1  1  1  2  3  4  5
  C  3  2  2  2  1  2  3  4
  A  4  3  2  3  2  2  2  3
  T  5  4  3  2  3  2  3  3
  C  6  5  4  3  2  3  3  3

Scoring parameters (distance scheme):
- Matches = 0
- Mismatches and insertions/deletions = 1
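The matrix on this slide can be reproduced with a short sketch of global alignment by dynamic programming under the stated distance scheme (matches = 0, mismatches and insertions/deletions = 1). The function names and the traceback tie-breaking order (diagonal first, then up, then left) are assumptions; several optimal alignments may exist and this sketch returns only one of them.

```python
def global_distance_matrix(rows_seq, cols_seq, mismatch=1, indel=1):
    """Fill the dynamic programming matrix for a global alignment under a
    distance scheme (match = 0); smaller values are better."""
    n, m = len(rows_seq), len(cols_seq)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            D[i][j] = min(
                D[i - 1][j - 1] + sub,   # pair the two symbols
                D[i - 1][j] + indel,     # gap in the column sequence
                D[i][j - 1] + indel,     # gap in the row sequence
            )
    return D


def trace_alignment(D, rows_seq, cols_seq, mismatch=1, indel=1):
    """Walk back from the bottom-right cell to recover one optimal alignment.
    Returns (aligned column sequence, aligned row sequence)."""
    i, j = len(rows_seq), len(cols_seq)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                top.append(cols_seq[j - 1])
                bottom.append(rows_seq[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + indel:
            top.append("-")
            bottom.append(rows_seq[i - 1])
            i -= 1
        else:
            top.append(cols_seq[j - 1])
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    D = global_distance_matrix("ATCATC", "AATCTAC")
    for row in D:
        print(" ".join(f"{value:2d}" for value in row))
    print(*trace_alignment(D, "ATCATC", "AATCTAC"), sep="\n")
```

The bottom-right cell holds the distance of the optimal global alignment, which is 3 for these two sequences.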

Dynamic Programming - Local alignment

        A  A  T  C  T  A  C
     0  0  0  0  0  0  0  0
  A  0  1  1  0  0  0  1  0
  T  0  0  0  2  1  1  0  0
  C  0  0  0  1  3  2  1  1
  A  0  1  1  0  2  2  3  2
  T  0  0  0  2  1  3  2  2
  C  0  0  0  1  3  2  2  3

Scoring parameters (similarity scheme):
- Matches = +1
- Mismatches and insertions/deletions = -1
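A minimal sketch of the local variant follows. It differs from the global case in two ways: scores are maximized, and any cell whose score would be negative is reset to 0, so an alignment can start anywhere. The sketch assumes a match score of +1 together with -1 for mismatches and insertions/deletions, which are the values that reproduce the matrix on this slide. The function names are hypothetical.

```python
def local_similarity_matrix(rows_seq, cols_seq, match=1, mismatch=-1, indel=-1):
    """Fill the dynamic programming matrix for a local alignment under a
    similarity scheme; larger values are better and no cell goes below 0."""
    n, m = len(rows_seq), len(cols_seq)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + s,   # pair the two symbols
                H[i - 1][j] + indel,   # gap in the column sequence
                H[i][j - 1] + indel,   # gap in the row sequence
            )
    return H


def best_cells(H):
    """Return the best local score and every (row, column) cell reaching it."""
    best = max(max(row) for row in H)
    cells = [
        (i, j)
        for i, row in enumerate(H)
        for j, value in enumerate(row)
        if value == best
    ]
    return best, cells


if __name__ == "__main__":
    H = local_similarity_matrix("ATCATC", "AATCTAC")
    for row in H:
        print(" ".join(f"{value:2d}" for value in row))
    print(best_cells(H))
```

The best local score is the maximum over the whole matrix rather than the bottom-right cell; here it is 3, reached at several cells, each of which ends an equally good local alignment.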
