 
Databases indexation
Laurent Falquet, Basel, October 2006
Swiss Institute of Bioinformatics, Swiss EMBnet node

Overview
• Data access concept
  • sequential
  • direct
• Indexing
  • EMBOSS
  • Fetch
  • Other
• BLAST
  • Why indexing?
  • formatdb
  • Parsing output
• Excel import/export
  • Tab delimited
  • Comma delimited

LF, Basel October 2006
Why indexing?
• Human tendency to classify and group
• Examples:
  • Dictionary
  • Book
  • Library
  • DVD chapters
  • iPod playlists
• Advantages:
  • Fast access
  • Easy data finding
• Disadvantages:
  • Time to prepare indices

Data access: sequential vs direct
• Sequential access: access times vary from very short to very long
• Direct access: very small variations in access time
[Figure: disk with labels "track", "sector" and "head"]
Similar concept for databases
• Flat files = sequential
• Indexing = simulated direct

Flat file:
  >seq1
  cgatgtcatgtg
  >seq2
  cgatcgtagctgtagctgtag
  >seq3
  catgtgcatgcgacgt

Index:
  ID     Position (byte)   Length (byte)
  seq1   0                 19
  seq2   19                28
  seq3   47                23

Tools
• EMBOSS
  • dbxflat
  • dbxfasta
  • dbiblast
  • seqret
  • seqretsplit
  • entret
• Other examples
  • SRS (Icarus language)
    • http://srs.ebi.ac.uk
    • http://www.lionbioscience.com/
  • indexer & fetch (warning: local SIB tool)
  • Relational (MySQL, Oracle…)
  • Web (Google!!)
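The index table on the "Similar concept for databases" slide can be sketched in a few lines of Python: one sequential pass records the byte position and length of every entry, after which any entry is read with a single seek. This is only an illustration of the idea. The names `build_index` and `read_entry` are made up here and are not part of EMBOSS or of any SIB tool. The three sequences are the ones on the slide, assuming Unix line endings.

```python
import io

def build_index(handle):
    """Scan a binary FASTA stream once; return {id: (byte offset, byte length)}."""
    index = {}
    current_id = None
    start = 0
    pos = 0
    for line in handle:
        if line.startswith(b">"):
            # A new header closes the previous entry.
            if current_id is not None:
                index[current_id] = (start, pos - start)
            current_id = line[1:].split()[0].decode("ascii")
            start = pos
        pos += len(line)
    if current_id is not None:
        index[current_id] = (start, pos - start)
    return index

def read_entry(handle, index, seq_id):
    """Simulated direct access: jump to the entry instead of scanning the file."""
    offset, length = index[seq_id]
    handle.seek(offset)
    return handle.read(length).decode("ascii")

# The flat file from the slide, held in memory for the demonstration.
FASTA = (
    b">seq1\ncgatgtcatgtg\n"
    b">seq2\ncgatcgtagctgtagctgtag\n"
    b">seq3\ncatgtgcatgcgacgt\n"
)
flat_file = io.BytesIO(FASTA)
index = build_index(flat_file)
print(index)   # {'seq1': (0, 19), 'seq2': (19, 28), 'seq3': (47, 23)}
print(read_entry(flat_file, index, "seq3"))
```

The printed offsets and lengths are the same as in the slide's table; preparing them costs one full read of the file, which is the "time to prepare indices" disadvantage mentioned earlier.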
EMBOSS: how to index?
• Where is your file?
• What is the format?
• Where should the indices be?
• Where is the emboss.default file? (.embossrc)
• Other EMBOSS tools
  • textsearch
  • whichdb
• More details
  • www.emboss.org

EMBOSS example
• Input file and directory
  • ~/embossidx/ECOLI.dat
  • cd embossidx
• Index creation
  • dbxflat -idformat swiss -dbname ecoli -filenames '*.dat' -dbresource swiss -directory . -release 1.0 -date 26/09/06 -fields id,acc
• Generates 5 files (default)
  • ECOLI.ent
  • ECOLI.pxac
  • ECOLI.pxid
  • ECOLI.xac
  • ECOLI.xid
• Don't forget to modify ~/.embossrc
.embossrc
  set emboss_filter 1
  # Ecoli
  DB ecoli [
     type: P
     comment: "E. coli proteome"
     method: emboss
     format: swiss
     dir: "{path}/embossidx"
     file: "ECOLI.dat"
     release: "1.0"
     indexdir: "{path}/embossidx"
  ]
  Where {path} is the path to your home directory
• Example of queries
  • seqret ecoli:thio_ecoli
  • seqret ecoli:P00274
  • entret ecoli:thio_ecoli
  • and even: seqret 'ecoli:*_ECOLI'

Indexer & fetch
• Warning: this is a local SIB tool!!
• Input file and directory
  • ~/embossidx/ECOLI.dat
  • cd embossidx
• Index creation
  • indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
• Generates 1 file
  • ecoli.idx
• Don't forget to modify the config file
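The indexer command on the "Indexer & fetch" slide is driven by three regular expressions. A rough Python sketch of the same idea follows, assuming that `^ID` marks the start of an entry, `^//` marks its end and `^ID\s+(\S+)` captures the entry name. This is not the SIB indexer's implementation, and the layout of its ecoli.idx file is not reproduced here. The two records are made-up placeholders, and lower-casing the keys is an assumption made so that a query written as `toyb_ecoli` matches.

```python
import io
import re

# The three patterns from the slide's command line (interpretation assumed).
ENTRY_START = re.compile(rb"^ID")
ENTRY_END = re.compile(rb"^//")
ENTRY_NAME = re.compile(rb"^ID\s+(\S+)")

def index_entries(handle):
    """Return {entry name: (byte offset, byte length)} for a flat file."""
    index = {}
    name, start, pos = None, 0, 0
    for line in handle:
        if name is None and ENTRY_START.match(line):
            match = ENTRY_NAME.match(line)
            if match:
                # Assumption: keys are stored in lower case.
                name = match.group(1).decode("ascii").lower()
                start = pos
        pos += len(line)
        if name is not None and ENTRY_END.match(line):
            index[name] = (start, pos - start)
            name = None
    return index

# Placeholder records in a Swiss-Prot-like layout (invented, not real entries).
FLAT_FILE = (
    b"ID   TOYA_ECOLI   STANDARD;   PRT;   5 AA.\n"
    b"SQ   SEQUENCE   5 AA;\n"
    b"     MKVLA\n"
    b"//\n"
    b"ID   TOYB_ECOLI   STANDARD;   PRT;   4 AA.\n"
    b"SQ   SEQUENCE   4 AA;\n"
    b"     MSTR\n"
    b"//\n"
)

flat = io.BytesIO(FLAT_FILE)
index = index_entries(flat)

# Direct access to the second entry without reading the first one.
offset, length = index["toyb_ecoli"]
flat.seek(offset)
entry = flat.read(length)
print(entry.decode("ascii"))
```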
Config file: fetch.conf
• fetch.conf
  #dbkey   format   indexfile                datafile
  ecoli    sp       ~/embossidx/ecoli.idx    ~/embossidx/ECOLI.dat
• Example of queries
  • fetch -c fetch.conf ecoli:thio_ecoli
  • fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

BLAST
• Maintained at NCBI
• Source distributed freely with several accessory tools
  • ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
• May require compilation to install on your local computer
• blastall contains
  • blastp
  • blastn
  • blastx
  • tblastn
  • tblastx
• Other tools
  • blastpgp
  • megablast
  • formatdb
Available Blast programs
  Program   Query                                Database
  blastp    protein                              protein
  blastn    nucleotide                           nucleotide
  blastx    nucleotide (translated to protein)   protein
  tblastn   protein                              nucleotide (translated to protein)
  tblastx   nucleotide (translated to protein)   nucleotide (translated to protein)

What makes BLAST so fast?
• Index all words of 3 aa or 11 bp in the sequence database
• Search the query for all words with a score > T
• Search the indexed database for all perfect matches
• Try to align matches that are on the same diagonal
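The first step on the "What makes BLAST so fast?" slide, indexing every word of the database, can be illustrated with a small Python sketch. It is a toy model of the idea only, not how formatdb or blastall store their indices. The function name `build_word_index` and the two short sequences are invented for the example.

```python
from collections import defaultdict

def build_word_index(sequences, k=3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

# Invented toy database; k=3 corresponds to the protein word size on the slide.
database = {"s1": "MRSLKP", "s2": "AARSLT"}
word_index = build_word_index(database)

# A perfect match for a word is now a dictionary lookup, not a scan.
print(word_index["RSL"])   # [('s1', 1), ('s2', 2)]
```

Once the index exists, the cost of finding all perfect matches for a word no longer depends on the size of the database, which is why the preparation time is worth paying.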
Indexing for Blast (1)
• A substitution matrix is used to compute the word scores
• List of all possible words with 3 amino acid residues (8000): AAA, AAC, AAD, ..., ACT, ..., RSL, ..., TVF, ..., YYY
• Each word is scored against the query word (example query word: REL)
  • RSL: score > T
  • LKP: score < T
• Result: list of words matching the query with a score > T (ACT, ..., RSL, ..., TVF)

Indexing for Blast (2)
• Search the database sequences for exact matches to the words in the list (ACT, ..., RSL, ..., TVF)
• Result: list of sequences containing words similar to the query (hits)
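The two slides above can be sketched as follows: enumerate all words, keep those scoring above T against a query word, then look for exact matches of the kept words in the database. The scoring scheme below is a deliberately tiny stand-in (identical = 4, a listed "similar" pair = 1, anything else = -2) over a 7-letter alphabet; it is not BLOSUM62 or any real substitution matrix, and the threshold, function names and database sequences are likewise invented for illustration.

```python
from itertools import product

ALPHABET = "DEKLPRS"                            # toy alphabet, real proteins use 20 residues
SIMILAR = {frozenset("RK"), frozenset("ED")}    # toy "similar" residue pairs

def residue_score(a, b):
    """Toy substitution score, NOT a real matrix."""
    if a == b:
        return 4
    if frozenset((a, b)) in SIMILAR:
        return 1
    return -2

def word_score(w1, w2):
    return sum(residue_score(a, b) for a, b in zip(w1, w2))

def neighbourhood(query_word, threshold):
    """All words over ALPHABET scoring more than threshold against query_word."""
    words = ("".join(p) for p in product(ALPHABET, repeat=len(query_word)))
    return {w for w in words if word_score(query_word, w) > threshold}

T = 5
near = neighbourhood("REL", T)
print(word_score("REL", "RSL"), "RSL" in near)   # 6 True
print(word_score("REL", "LKP"), "LKP" in near)   # -6 False

# Step (2): exact matches of neighbourhood words in a toy database.
database = {"s1": "PRSLKP", "s2": "DKELSS"}
hits = [
    (name, i, seq[i:i + 3])
    for name, seq in database.items()
    for i in range(len(seq) - 2)
    if seq[i:i + 3] in near
]
print(hits)   # [('s1', 1, 'RSL'), ('s2', 1, 'KEL')]
```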
Indexing for Blast (3)
• Ungapped extension if 2 "hits" are on the same diagonal but at a distance less than A
• Extension using dynamic programming, limited to a restricted region through a score drop-off threshold
[Figure: query vs database sequence plots showing two hits on one diagonal within distance A, and the restricted region explored by the extension]

BLAST indexing with formatdb
• formatdb
  • mydb.seq must contain sequences in FASTA format
  • formatdb -i mydb.seq -p T -n mydb
• Generates 3 files
  • mydb.psq
  • mydb.pin
  • mydb.phr
• Then start a Blast:
  • blastall -p blastp -d mydb -i myseq (-optional parameters)
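The two-hit condition on the "Indexing for Blast (3)" slide can be made concrete with a short Python sketch. A hit is a pair (query position, database position); two hits lie on the same diagonal when the difference between the two positions is equal. The function name, the hit list and the value used for A are invented for the example, and the extra requirement that the two words do not overlap is an assumption added here, not something stated on the slide.

```python
from collections import defaultdict

def two_hit_seeds(hits, max_distance, word_size=3):
    """Return (diagonal, first query pos, second query pos) for hit pairs that
    share a diagonal and are less than max_distance apart."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for left, right in zip(positions, positions[1:]):
            gap = right - left
            # Assumption: the two words must not overlap (gap >= word_size).
            if word_size <= gap < max_distance:
                seeds.append((diagonal, left, right))
    return seeds

# Invented hits: three on diagonal 8, one isolated hit on diagonal 26.
hits = [(2, 10), (9, 17), (40, 48), (4, 30)]
seeds = two_hit_seeds(hits, max_distance=20)
print(seeds)   # [(8, 2, 9)]
```

Only the pair at query positions 2 and 9 would trigger an ungapped extension; the hit at position 40 is on the same diagonal but too far away, and the hit on diagonal 26 has no partner.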
Blast local vs remote
• blastall
  • Executed locally
  • Slow
  • No need to transfer the db
• blastall.remote
  • Executed remotely
  • Fast
  • Requires special privileges and db transfer
• Using BioPerl (remoteblast.pm)
  • Blast at NCBI
  • No user db
  • See www.bioperl.org

Multiple Blasts?
• 1 seq vs db seq
  • 1 FASTA seq as input
• db seq vs db seq
  • Several single FASTA seq files as input, or
  • 1 multiple FASTA seq file as input
• Use Perl to automate the queries and parse the output
• Possibility to export results as XML
Parsing Blast output

BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
         (325 letters)

Database: ecoli_blast
           4339 sequences; 1,373,039 total letters

Searching.........done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

Parsing Blast output (2)

>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
            AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
             +RY++  + G
Sbjct: 305 KNRRYQRLMSYG 316
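Parsing this kind of report mostly means picking the numbers out of the two summary lines of each alignment. The deck suggests Perl for this; the following is a minimal Python sketch of the same idea using regular expressions, applied to the header of the ACCA_ECOLI alignment shown above. The function name `parse_hsp` and the dictionary keys are invented for the example, and the patterns only cover the layout shown here, not every variant of BLAST text output.

```python
import re

# Header of the alignment shown on the slide.
HSP_HEADER = """\
>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""

SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s+bits\s+\((\d+)\),\s*Expect\s*=\s*(\S+)"
)
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")

def parse_hsp(text):
    """Extract subject name, scores and identity counts from one alignment header."""
    subject = text.split(None, 1)[0].lstrip(">")
    score = SCORE_RE.search(text)
    ident = IDENT_RE.search(text)
    evalue = score.group(3)
    # Some reports print very small E-values as "e-100"; make that parseable.
    if evalue.startswith("e"):
        evalue = "1" + evalue
    identities, align_len = int(ident.group(1)), int(ident.group(2))
    return {
        "subject": subject,
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "evalue": float(evalue),
        "identities": identities,
        "align_len": align_len,
        "pct_identity": 100.0 * identities / align_len,
    }

result = parse_hsp(HSP_HEADER)
print(result["subject"], result["bits"], result["evalue"])
print(round(result["pct_identity"], 1))   # 45.8
```

The percentage recomputed from the counts (45.8) is more precise than the rounded 45% printed in the report, which is one practical reason to parse the raw fractions.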