Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold

Text example Id matrix, wordsize=3, threshold=1 \ S A A A S S A A T T T T A * * * A * * * * * A * * * T * T T

matplotlib example Id matrix, wordsize=3, threshold=1

Building an internal description of the dotplot def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range (len(s1)-ws+1): for j in range (len(s2)-ws+1): sc=0.0 for k in range (ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

Printing in in text format def printDP(s1,s2,dp,shift=0): ''' s1,s2=sequences, dp=dotplot, shif=shifting print positions''' print "\\", for j in range (len(s2)): print s2[j], print for i in range (len(s1)): print s1[i], for j in range (len(s2)): pc=" " if (i-shift,j-shift) in dp : pc="*" print pc, print

Printing using matplotlib def printMDP(s1,s2,dp,shift=0): import matplotlib.pyplot as plt L1= len (s1) L2= len (s2) plt.axis([0, L1+1, 0, L2+1]) x,y=[],[] for i,j in dp: x.append(i+shift+1) y.append(j+shift+1) plt.plot(x, y, 'ro') plt.show()

Computational complexity is quadratic: O (|s1|*|s2|*ws) def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range (len(s1)-ws+1): for j in range (len(s2)-ws+1): sc=0.0 for k in range (ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

Computational complexity is quadratic: O (|s1|*|s2|*ws) Is it possible to improve the computational complexity?

Idea: pre-process one sequence in order to know in advance the possible matches, then scan the other sequence.

Example: assume you have a function getSimpos >>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> al,sm=readScoreMatrix('id.mat') >>> getSimpos(s1,al,sm, ws=1 , th=0 ) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> getSimpos (s1,al,sm, ws=3 , th=3 ) {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>>

With the preprocess given by getSimpos, it is possible to know for each word in the second sequence if there is match with the first and directly in which positions! >>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> ws,th = 3,2 >>> simpos=getSimpos(s1,al,sm,ws,th) >>> simpos {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>> for j in range(len(s2)-ws+1): ... word=s2[j:j+ws] ... for i in simpos.get(word,[]): ... print i,j,s1[i:i+ws],s2[j:j+ws] ... 3 1 AAA AAA 4 6 AAT AAT >>>

getSimpos : simplified version def getSimpos_simple(s1,sm,ws,th): ''' s1=sequence, ws=score matrix, ws=word size, th=threshold ''' simpos={} for i in range ( len (query)-ws+1): word=query[i:i+ws] simpos[word]=simpos.get(word,[]) simpos[word].append(i) return simpos It works only for identity matrix! Why??

Blosum62 has more possible matches! >>> al,sm=readScoreMatrix('id.mat') >>> s1 'TTTAAATTT' >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> al,sm=readScoreMatrix('blosum62.mat') >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], ' S ': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'T': [0, 1, 2, 6, 7, 8]} >>>

Blosum62 A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Idea: add a dictionary that takes care of all the words that we can predict will have a score higher than a threshold for a given matrix and a given word in our sequence.

Given a word: find all high scoring similar words: itertools.product >>> import itertools >>> alphabet='AB' >>> ws=3 >>> for e in itertools.product (alphabet, repeat =ws): ... print "".join(e), e ... AAA ('A', 'A', 'A') AAB ('A', 'A', 'B') ABA ('A', 'B', 'A') ABB ('A', 'B', 'B') BAA ('B', 'A', 'A') BAB ('B', 'A', 'B') BBA ('B', 'B', 'A') BBB ('B', 'B', 'B')

Given a word: find all high scoring similar words: def findSumWords(word,alphabet,sm,th): ''' findSumWords(word,alphabet,sm,th) ''' Similar=[] # list of similar words for w in itertools.product(alphabet,repeat=len(word)): cscore=0.0 for i in range(len(word)): cscore+=sm[word[i],w[i]] if cscore >= th: similar.append("".join(w)) return similar

GetSimpos def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos

Finally: find all the dotplot matches! def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): word=s2[j:j+ws] for i in simpos.get(word,[]): dp.add((i+hw,j+hw)) return dp

Computational complexity ??? Blue + Red def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp

Computational complexity: Blue def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos O(|s1|*dictionary) = O(c* n ) ~ O( n ) with a big constant c!

Computational complexity: Red for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) O(|s2|*simpos) => worst case O( n 2 ) !!! But In the average case with reasonable threshold end score The number of positions that matches can be considered constant on average, so for average case: O(n) !

Computational complexity ??? Blue + Red def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp O(n)+O(n) = O(n) !

Experimental time performace of the two implementations * dotplot ^ lindotplot

Exercise: Use the lin_dotplot idea to find local ungapped alignments: 1. use the dotplot pairs and link them together in a single alignment if the pairs are on diagonals ( eg. (i-1,j-1), (i,j), (i+1,j+1), …) 2. extend the idea to process a sequence and search similar local alignments in a set of sequences.

Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold Text example Id matrix,

Dot Dot Dot COLUMBIA COOPER Environmentally friendly manufacture iGEM 2011 of quantum dots

Python for Data Science Overview of Python Why Python Installing Python Installing Python Modules

DOT CUSHION DESIGN BY HAY 19/11/19 DOT CUSHION DOT CUSHION Characterised by the

Novel Plot Structure Hello! 2 what is plot is all the events that happen in a story

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Contour plot: NO Old plot New plot Likelihood space: NO Outlook: 2 more plots almost done.

Python Tidbits Python created by that guy ---> Python is named after Monty Pythons

fuzzing & exploiting wireless device drivers Vienna, 23 November 2007 Sylvester Keil

Nine Dot Solutions Consulting Mechanical Engineers Pieter van Zyl About Nine Dot Nine Dot is a

Stateful Fuzzing of Wireless Device Drivers in an Emulated Environment Tokyo 25 October 2007

The DOT Calculus ( D ependent O bject T ypes) Nada Amin Scala Days June 18, 2014 1 DOT:

NYC DOT Commissioner Polly Trottenberg NACTO Designing Cities 1 NEW YORK CITY IS GROWING

DOT: Dependent Object Types Semester Project, Spring 2012 Nada Amin EPFL Nada Amin (EPFL) DOT:

DOT ( D ependent O bject T ypes) Nada Amin ECOOP PC Workshop February 28, 2016 1 DOT:

The Dot Product and Orthogonal Vectors The Dot Product Defn. The dot product (or inner product )

Looping through Python data structures Justin Kiggins Product Manager DataCamp Python for

STAT 113: EXAM 2 STUDY GUIDE COLIN REIMER DAWSON, FALL 2015 (1) Chapter 3: Inference Foundations

Chapter 8 Section 1 MA1020 Quantitative Literacy Sidney Butler Michigan Technological University

Statistical Graphics using lattice Deepayan Sarkar Fred Hutchinson Cancer Research Center 29

Analyzing developers network in a community Community Devroom - FOSDEM Brussels, Feb. 03, 2018

Statistics 370 Probability and Statistics for Engineers Instructor: Peter Bloomfield Course

STAT 113 Inference Using Normal Approximations Colin Reimer Dawson Oberlin College November 4,

Motivation S T A T I S T I K A U S T R I A D i e I n f o r m a t i o n s m a n a g e r EU-SILC

QCD AMPLITUDES WITH THE GLUON EXCHANGE AT HIGH ENERGIES (AND GLUON REGGEIZATION PROOF)

Sambuz

Useful Links

Newsletter

Mail Us

Dot Plot in Python We want to write a python program that create a - PowerPoint PPT Presentation

Dot Plot in Python We want to write a python program that create a dotoplot. The program should take from command line: 1. a fasta file with two sequences, 2. a scoring matrix 3. a word size 4. a display threshold Text example Id matrix,

Dot Dot Dot COLUMBIA COOPER Environmentally friendly manufacture iGEM 2011 of quantum dots

Python for Data Science Overview of Python Why Python Installing Python Installing Python Modules

DOT CUSHION DESIGN BY HAY 19/11/19 DOT CUSHION DOT CUSHION Characterised by the

Novel Plot Structure Hello! 2 what is plot is all the events that happen in a story

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Contour plot: NO Old plot New plot Likelihood space: NO Outlook: 2 more plots almost done.

Python Tidbits Python created by that guy ---&gt; Python is named after Monty Pythons

fuzzing &amp; exploiting wireless device drivers Vienna, 23 November 2007 Sylvester Keil

Nine Dot Solutions Consulting Mechanical Engineers Pieter van Zyl About Nine Dot Nine Dot is a

Stateful Fuzzing of Wireless Device Drivers in an Emulated Environment Tokyo 25 October 2007

The DOT Calculus ( D ependent O bject T ypes) Nada Amin Scala Days June 18, 2014 1 DOT:

NYC DOT Commissioner Polly Trottenberg NACTO Designing Cities 1 NEW YORK CITY IS GROWING

DOT: Dependent Object Types Semester Project, Spring 2012 Nada Amin EPFL Nada Amin (EPFL) DOT:

DOT ( D ependent O bject T ypes) Nada Amin ECOOP PC Workshop February 28, 2016 1 DOT:

The Dot Product and Orthogonal Vectors The Dot Product Defn. The dot product (or inner product )

Looping through Python data structures Justin Kiggins Product Manager DataCamp Python for

STAT 113: EXAM 2 STUDY GUIDE COLIN REIMER DAWSON, FALL 2015 (1) Chapter 3: Inference Foundations

Chapter 8 Section 1 MA1020 Quantitative Literacy Sidney Butler Michigan Technological University

Statistical Graphics using lattice Deepayan Sarkar Fred Hutchinson Cancer Research Center 29

Analyzing developers network in a community Community Devroom - FOSDEM Brussels, Feb. 03, 2018

Statistics 370 Probability and Statistics for Engineers Instructor: Peter Bloomfield Course

STAT 113 Inference Using Normal Approximations Colin Reimer Dawson Oberlin College November 4,

Motivation S T A T I S T I K A U S T R I A D i e I n f o r m a t i o n s m a n a g e r EU-SILC

QCD AMPLITUDES WITH THE GLUON EXCHANGE AT HIGH ENERGIES (AND GLUON REGGEIZATION PROOF)

Sambuz

Useful Links

Newsletter

Mail Us

Python Tidbits Python created by that guy ---> Python is named after Monty Pythons

fuzzing & exploiting wireless device drivers Vienna, 23 November 2007 Sylvester Keil