
Databases indexation

Laurent Falquet, Basel October, 2006 Swiss Institute of Bioinformatics Swiss EMBnet node

LF, Basel October 2006

Overview

Data access concept: sequential, direct

Indexing: EMBOSS, Fetch, Other

BLAST: Why indexing?, formatdb, Parsing output

Excel import/export: Tab delimited, Comma delimited


Why indexing?

Human tendency to classify and group

Examples: dictionary, book, library, DVD chapters, iPod playlists

Advantages: fast access, easy data finding

Disadvantages: time to prepare indices


Data access: sequential vs direct

Sequential access: access times vary from very short to very long

Direct access: access times show very small variations

[Figure: disk with track, sector and head labelled]


Similar concept for databases

Flat files = sequential

Indexing = simulated direct

>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt

ID    Position (byte)  Length (byte)
seq1  0                19
seq2  19               28
seq3  47               23
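The ID / position / length table can be reproduced with a short script. What follows is a minimal Python sketch, not the method used by any tool named in this talk. The function names build_index and fetch are hypothetical, and the three records from the slide are held in memory rather than read from a file on disk.

```python
import io

def build_index(handle):
    """Scan a binary FASTA stream once; return {id: (position, length)} in bytes."""
    index = {}
    current_id, start, offset = None, 0, 0
    for line in handle:
        if line.startswith(b">"):
            # A new header closes the previous record.
            if current_id is not None:
                index[current_id] = (start, offset - start)
            current_id = line[1:].split()[0].decode("ascii")
            start = offset
        offset += len(line)
    if current_id is not None:
        index[current_id] = (start, offset - start)
    return index

def fetch(handle, index, seq_id):
    """Jump straight to one record using the index (simulated direct access)."""
    position, length = index[seq_id]
    handle.seek(position)
    return handle.read(length).decode("ascii")

# The flat file from the slide, one record after another.
flat_file = io.BytesIO(
    b">seq1\ncgatgtcatgtg\n"
    b">seq2\ncgatcgtagctgtagctgtag\n"
    b">seq3\ncatgtgcatgcgacgt\n"
)
index = build_index(flat_file)
print(index)
print(fetch(flat_file, index, "seq3"))
```

Building the index costs one sequential pass. After that, every lookup is a single seek and read, whatever the size of the flat file.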


Tools

EMBOSS: dbxflat, dbxfasta, dbiblast, seqret, seqretsplit, entret

Other examples

SRS (Icarus language): http://srs.ebi.ac.uk, http://www.lionbioscience.com/

indexer & fetch (warning: local SIB tool)

Relational (MySQL, Oracle…)

Web (Google!!)


EMBOSS how to index?

Where is your file?

What is the format?

Where should the indices be?

Where is the emboss.default file? (.embossrc)

Other EMBOSS tools: textsearch, whichdb

More details: www.emboss.org


EMBOSS example

Input file and directory

~/embossidx/ECOLI.dat

cd embossidx

Index creation

dbxflat -idformat swiss -dbname ecoli -filenames '*.dat' -dbresource swiss -directory . -release 1.0 -date 26/09/06 -fields id,acc

Generates 5 files (default)

ECOLI.ent ECOLI.pxac ECOLI.pxid ECOLI.xac ECOLI.xid

Don’t forget to modify ~/.embossrc


.embossrc

Example of queries

seqret ecoli:thio_ecoli

seqret ecoli:P00274

entret ecoli:thio_ecoli

and even

seqret 'ecoli:*_ECOLI'

set emboss_filter 1
# Ecoli
DB ecoli [
  type: P
  comment: "E. coli proteome"
  method: emboss
  format: swiss
  dir: "{path}/embossidx"
  file: "ECOLI.dat"
  release: "1.0"
  indexdir: "{path}/embossidx"
]

Where {path} is the path to your home directory


Indexer & fetch

Warning: this is a local SIB tool!!

Input file and directory

~/embossidx/ECOLI.dat

cd embossidx

Index creation

indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx

Generates 1 file

ecoli.idx

Don’t forget to modify the config file
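Since indexer is a local tool, its exact behaviour is not documented here. The Python sketch below only illustrates how the three regular expressions on the command line could drive an index. It assumes that -h marks the first line of an entry, -t the last line, and -p captures the key; this is a guess from the command line. Lower-casing the keys is a further assumption, made so that a query such as ecoli:thio_ecoli would match. The two entries are heavily truncated toy records.

```python
import io
import re

HEADER = re.compile(rb"^ID")        # assumed role of -h '^ID': first line of an entry
TERMINATOR = re.compile(rb"^//")    # assumed role of -t '^//': last line of an entry
KEY = re.compile(rb"^ID\s+(\S+)")   # assumed role of -p: captures the entry name

def index_entries(handle, ignore_case=True):
    """Return {key: (position, length)} in bytes for each ID ... // entry."""
    index = {}
    offset, start, key = 0, None, None
    for line in handle:
        if start is None and HEADER.match(line):
            start = offset
            match = KEY.match(line)
            key = match.group(1).decode("ascii") if match else None
        offset += len(line)
        if start is not None and TERMINATOR.match(line):
            if key is not None:
                index[key.lower() if ignore_case else key] = (start, offset - start)
            start, key = None, None
    return index

# Toy flat file: two truncated Swiss-Prot style entries.
data = io.BytesIO(
    b"ID   THIO_ECOLI\nAC   P00274;\n//\n"
    b"ID   ACCA_ECOLI\nAC   P30867;\n//\n"
)
index = index_entries(data)
position, length = index["acca_ecoli"]
data.seek(position)
entry = data.read(length)
print(index)
print(entry.decode("ascii"))
```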


Config file: fetch.conf

Example of queries

fetch -c fetch.conf ecoli:thio_ecoli

fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

fetch.conf

#dbkey  format  indexfile              datafile
ecoli   sp      ~/embossidx/ecoli.idx  ~/embossidx/ECOLI.dat


BLAST

Maintained at NCBI

Source distributed freely with several accessory tools

ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz

May require compilation to install on your local computer

blastall contains: blastp, blastn, blastx, tblastn, tblastx

Other tools: blastpgp, megablast, formatdb


Available Blast programs

Program  Query                                   Database
blastp   protein                             vs  protein
blastn   nucleotide                          vs  nucleotide
blastx   nucleotide (translated to protein)  vs  protein
tblastn  protein                             vs  nucleotide (translated to protein)
tblastx  nucleotide (translated to protein)  vs  nucleotide (translated to protein)


What makes BLAST so fast?

Indexing all words of 3 aa or 11 bp in the sequence database

Searching the query for all words of a score > T

Search the indexed database for all perfect matches

Try to align matches that are on the same diagonal
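The first, third and fourth steps can be sketched in a few lines of Python. This is a simplified illustration and not BLAST's actual implementation: it skips the score > T step and looks up only the exact words of the query. The function names and the two toy database sequences are hypothetical.

```python
from collections import defaultdict

def index_words(database, k=3):
    """Map every k-letter word to the (sequence id, position) pairs where it occurs."""
    words = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            words[seq[pos:pos + k]].append((seq_id, pos))
    return words

def hits_by_diagonal(query, words, k=3):
    """Look up each query word; group exact matches by (sequence, diagonal)."""
    diagonals = defaultdict(list)
    for q_pos in range(len(query) - k + 1):
        for seq_id, db_pos in words.get(query[q_pos:q_pos + k], []):
            # Hits with the same db_pos - q_pos lie on the same diagonal.
            diagonals[(seq_id, db_pos - q_pos)].append(q_pos)
    return diagonals

database = {"s1": "MKRSLTVFACT", "s2": "GGRSLPP"}  # toy sequences
words = index_words(database)                      # done once, before any query
diagonals = hits_by_diagonal("ARSLTVF", words)
print(dict(diagonals))
```

In this toy case sequence s1 collects four hits on one diagonal while s2 has a single hit, which makes s1 the better candidate for alignment.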

Indexing for Blast (1)

[Figure: the query word REL is compared with the list of all possible words of 3 amino acid residues (8000): AAA, AAC, AAD, ..., YYY. Words with a score > T (ACT, RSL, TVF, ...) form the list of words matching the query; words with a score < T (e.g. LKP) do not.]

A substitution matrix is used to compute the word scores
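The selection of words with a score > T can be sketched as follows. This is a sketch under a strong assumption: instead of a real substitution matrix such as BLOSUM62, it uses a toy score of +4 for identical residues and -1 otherwise. The names toy_score, word_score and neighbourhood are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    """Toy substitution score (assumption): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1

def word_score(word, query_word, score=toy_score):
    """Sum the substitution scores position by position."""
    return sum(score(a, b) for a, b in zip(word, query_word))

def neighbourhood(query_word, threshold, score=toy_score):
    """All words of the same length whose score against query_word exceeds T."""
    k = len(query_word)
    return [
        "".join(letters)
        for letters in product(AMINO_ACIDS, repeat=k)
        if word_score(letters, query_word, score) > threshold
    ]

all_words = len(AMINO_ACIDS) ** 3            # 8000 possible 3-letter words
matching = neighbourhood("RSL", threshold=6)
print(all_words, len(matching))
```

With this toy score only the word itself and its single-residue variants pass the threshold. A real matrix would instead keep words built from similar residues.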


Indexing for Blast (2)

[Figure: the list of words matching the query with a score > T (ACT, RSL, TVF, ...) is searched for exact matches in the database sequences. The result is the list of sequences containing words similar to the query (hits).]


Indexing for Blast (3)

[Figure: two plots of database sequence against query, with the distance A marked]

Ungapped extension if 2 "hits" are on the same diagonal but at a distance less than A

Extension using dynamic programming, limited to a restricted region through a score drop-off threshold
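The two-hit test and the drop-off idea can be illustrated in Python. This is a simplified sketch and not BLAST's code. It applies the drop-off threshold to a plain ungapped extension to the right and does not implement the dynamic-programming step. The +1/-1 scoring and all function names are assumptions made for the example.

```python
def two_hit_pairs(q_positions, window_a):
    """Neighbouring hits on one diagonal that are closer than window_a."""
    positions = sorted(q_positions)
    return [
        (first, second)
        for first, second in zip(positions, positions[1:])
        if 0 < second - first < window_a
    ]

def match_score(a, b):
    """Toy score (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def extend_right(query, subject, q_start, s_start, x_drop, score=match_score):
    """Extend without gaps; stop once the score falls more than x_drop below the best."""
    best = total = best_len = 0
    length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        total += score(query[q_start + length], subject[s_start + length])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break
    return best, best_len

pairs = two_hit_pairs([5, 20, 90], window_a=40)
extension = extend_right("RSLTVFAAAA", "RSLTVFGGGG", 0, 0, x_drop=2)
print(pairs, extension)
```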


BLAST indexing with formatdb

Formatdb

mydb.seq must contain sequences in FASTA format

formatdb -i mydb.seq -p T -n mydb

Generates 3 files

mydb.psq mydb.pin mydb.phr

Then start a Blast:

blastall -p blastp -d mydb -i myseq (-optional parameters)


Blast local vs remote

blastall: executed locally, slow, no need to transfer the db

blastall.remote: executed remotely, fast, requires special privileges and db transfer

Using BioPerl (remoteblast.pm): Blast at NCBI, no user db, see www.bioperl.org


Multiple Blasts?

1 seq vs db seq: 1 FASTA seq as input

db seq vs db seq: several single FASTA seq files as input, or 1 multiple FASTA seq file as input

Possibility to export results as XML

Use Perl to automate the queries and parse the output
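The slide suggests Perl for this automation. The sketch below uses Python purely as an illustration of the same idea: split a multiple FASTA file into single records and prepare one blastall call per query. The function names, the file naming scheme and the database name mydb are assumptions. The -m 7 (XML output) and -o (output file) options belong to blastall. Nothing is executed here; the commands are only built.

```python
def split_fasta(text):
    """Split multi-FASTA text into (identifier, record) pairs."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            records.append([line[1:].split()[0], line + "\n"])
        elif records:
            records[-1][1] += line + "\n"
    return [tuple(record) for record in records]

def blastall_command(query_file, database="mydb", program="blastp", xml=True):
    """Argument list for one blastall run; -m 7 asks for XML output."""
    command = ["blastall", "-p", program, "-d", database, "-i", query_file]
    if xml:
        command += ["-m", "7", "-o", query_file + ".xml"]
    return command

multi_fasta = ">q1 first query\nMKV\n>q2\nRSL\nTVF\n"  # toy input
records = split_fasta(multi_fasta)
commands = [blastall_command(name + ".fasta") for name, _ in records]
for command in commands:
    print(" ".join(command))
```

Each record would be written to its own file and each command passed to subprocess.run(command, check=True) on a machine where blastall is installed.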

Parsing Blast output

BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
         (325 letters)

Database: ecoli_blast
           4339 sequences; 1,373,039 total letters

Searching.........done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72


Parsing Blast output (2)

>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
            AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

Query: 302 VQQRYEKYKAIG 313
             +RY++  + G
Sbjct: 305 KNRRYQRLMSYG 316
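The summary lines of a hit follow a fixed textual pattern, so they can be pulled out with regular expressions. The Python sketch below is illustrative only and is no substitute for a full parser such as the BioPerl one. The function name parse_hit and the dictionary keys are hypothetical, and the patterns are tuned to the layout of the report shown on this slide.

```python
import re

HSP_TEXT = """>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""

def parse_hit(text):
    """Extract name, length, scores and counts from one hit block."""
    hit = {"name": re.search(r"^>(\S+)", text, re.M).group(1)}
    hit["length"] = int(re.search(r"Length\s*=\s*(\d+)", text).group(1))
    match = re.search(
        r"Score\s*=\s*([\d.]+) bits \((\d+)\),\s*Expect\s*=\s*([\d.eE+-]+)", text
    )
    hit["bits"] = float(match.group(1))
    hit["raw_score"] = int(match.group(2))
    hit["evalue"] = float(match.group(3))
    # Identities, Positives and Gaps are all written as count/alignment length.
    for key in ("Identities", "Positives", "Gaps"):
        match = re.search(key + r"\s*=\s*(\d+)/(\d+)", text)
        hit[key.lower()] = (int(match.group(1)), int(match.group(2)))
    return hit

hit = parse_hit(HSP_TEXT)
print(hit)
```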


Parsing Blast output (3)

With BioPerl:

#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Open the BLAST report given as first command-line argument
my $blast_report = Bio::SearchIO->new('-format' => 'blast', '-file' => $ARGV[0]);

print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query, one hit per database sequence, one HSP per aligned segment
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;


MS-Excel import/export

Excel can import: tab delimited, comma delimited

Excel can export: tab delimited, space delimited

AC/ID       desc                          score  e-value
THIO_ECOLI  thioredoxin Escherichia coli  234    2.1e-5
THIO_HUMAN  thioredoxin Homo sapiens      120    0.001


MS-Excel import/export

Tab delimited file:

\t delimits the columns

\n delimits the lines

Optional first line contains the column titles

Example:

AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
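A tab delimited file of this kind can be written and read back with Python's standard csv module. This is a minimal sketch using the rows from the slide, with an in-memory buffer standing in for a real file.

```python
import csv
import io

rows = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

# Write: \t between columns, \n at the end of each line.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tab_text = buffer.getvalue()
print(tab_text)

# Read the same text back into a list of rows.
parsed = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
```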


MS-Excel import/export

Comma delimited file:

, delimits the columns; each value is surrounded by ' '

\n delimits the lines

Optional first line contains the column titles

Example:

'AC/ID','desc','score','e-value'\n
'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
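The comma delimited layout can be produced with Python's csv module as well. This is a sketch that sets the quote character to the single quote used on the slide. Many spreadsheet tools assume double quotes by default, so the import settings may need adjusting.

```python
import csv
import io

rows = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

# Write: comma between columns, every value wrapped in single quotes.
buffer = io.StringIO()
writer = csv.writer(
    buffer, delimiter=",", quotechar="'", quoting=csv.QUOTE_ALL, lineterminator="\n"
)
writer.writerows(rows)
comma_text = buffer.getvalue()
print(comma_text)

# Read it back with the same delimiter and quote character.
parsed = list(csv.reader(io.StringIO(comma_text), delimiter=",", quotechar="'"))
```

Quoting every value means a description that itself contains a comma still stays in one column.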