November 09 CSE 182
CSE182-L8
Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding
QUIZ! Question: your friend likes to gamble. He tosses a coin: HEADS, he gives you a dollar. TAILS, you give
while’, he uses a loaded coin.
the coin?
database, find all sequences that match the pattern.
the sequence.
Fa 07 CSE182
– R = {ε} (base case)
– R = {σ}, σ ∈ Σ
– R = R1 + R2 (union of strings)
– R = R1 ⋅ R2 (concatenation)
– R = R1* (0 or more repetitions)
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
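The membership questions for R = (A+C)*EEC* can be checked mechanically. The following is a minimal sketch, assuming Python's `re` module, in which the union written `+` on the slide becomes `|`; the helper name `in_R` is hypothetical.

```python
import re

# (A+C)*EEC* in the slide notation; '+' (union) becomes '|' in Python syntax.
R = re.compile(r"(A|C)*EEC*")

def in_R(s: str) -> bool:
    """Return True if the whole string s belongs to the language of R."""
    return R.fullmatch(s) is not None

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_R(s))
```

CEEC and ACEE are in R; AEC is not, because R requires two consecutive E's.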
Regular Expression & Automata
graph) with the following properties:
– The automaton has a start and end node
– Each edge is labeled with a symbol from Σ, or ε
labeled with s.
Examples: Regular Expression & Automata
[Figure: example automaton with start and end nodes and edges labeled A, C, and E]
Constructing automata from R.E
is a path from START to END in RA, labeled by s.
RA), and a database D, is there a string D[b..c] that matches RA (D[b..c] ∈ R)
automaton of R?
– There is a path labeled D[1]…D[c] that goes from START to END in RA
[Figure: a path from START to END with edges labeled D[1], D[2], …, D[c]]
– There is a path labeled D[1]…D[c] that goes from START to END in RA
– There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END
[Figure: a path labeled D[1] .. D[c-1] from START to node u, followed by an edge labeled D[c]]
D.P. to match regular expression
– A[u,σ] = Automaton node reached from u after reading σ
– Eps(u): set of all nodes reachable from node u using epsilon transitions.
– N[c] = subset of nodes reachable from START node after reading D[1..c]
– Q: when is v ∈ N[c]?
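The definitions of A[u,σ], Eps(u), and N[c] suggest a simple dynamic program. The sketch below assumes the update rule "v ∈ N[c] if some u ∈ N[c-1] has an edge labeled D[c] to a node from which v is reachable by epsilon transitions"; that rule, the function names, and the hand-built automaton for (A+C)*EEC* are illustrative assumptions, not taken from the slides.

```python
def eps_closure(nodes, eps):
    """All nodes reachable from `nodes` using only epsilon edges (Eps in the text)."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reachable_sets(D, start, A, eps):
    """N[c] = nodes reachable from START after reading D[1..c]; N[0] is the empty prefix."""
    N = [eps_closure({start}, eps)]
    for symbol in D:
        step = set()
        for u in N[-1]:
            step |= A.get((u, symbol), set())
        N.append(eps_closure(step, eps))
    return N

# Hand-built automaton for (A+C)*EEC* (an illustrative encoding, not from the slides).
A = {
    ("START", "A"): {"START"}, ("START", "C"): {"START"},
    ("START", "E"): {"q1"}, ("q1", "E"): {"q2"}, ("q2", "C"): {"q2"},
}
eps = {"q2": {"END"}}

def accepts(D):
    """D is accepted if END is in the last reachable set."""
    return "END" in reachable_sets(D, "START", A, eps)[-1]

print([accepts(s) for s in ["CEEC", "AEC", "ACEE"]])  # [True, False, True]
```

Each step touches every active node once, so the running time is roughly proportional to the length of D times the size of the automaton.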
– Is D[1..c] accepted by R? – Yes, if END ∈ N[c]
– Is D[l..c] (for some l, and some c) accepted by R?
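One way to answer the substring question is to let a match begin at any position by re-seeding the active node set with START after every symbol. This is a sketch under that assumption; the function name and the toy automaton for (A+C)*EEC* are hypothetical, and empty matches are ignored.

```python
def find_match_ends(D, start, end, A, eps):
    """Return the 1-based positions c such that some substring D[l..c] is accepted."""
    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for v in eps.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    start_set = closure({start})
    active = set(start_set)
    ends = []
    for c, symbol in enumerate(D, start=1):
        step = set()
        for u in active:
            step |= A.get((u, symbol), set())
        reached = closure(step)
        if end in reached:
            ends.append(c)
        # Re-seed with START so that a match may begin at position c + 1.
        active = reached | start_set
    return ends

# Illustrative automaton for (A+C)*EEC*, built by hand.
A = {
    ("START", "A"): {"START"}, ("START", "C"): {"START"},
    ("START", "E"): {"q1"}, ("q1", "E"): {"q2"}, ("q2", "C"): {"q2"},
}
eps = {"q2": {"END"}}

print(find_match_ends("GGACEECGG", "START", "END", A, eps))  # [6, 7]
```

Here position 6 ends the match ACEE and position 7 ends EEC.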
A Fam(B)
C-X-[DE]-X{10,12}-C-X-C--[STYLV] C E F
sequence is not accepted.
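A pattern written in this dash-separated style maps almost directly onto a regular expression. The sketch below follows the notation as it appears on the slide (X for any residue, {10,12} for a repeat range), and uses only the portion of the pattern up to C-X-C because the remainder is garbled in the source. The function name `prosite_to_regex` is hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a dash-separated pattern (slide notation) into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(X|[A-Z]|\[[A-Z]+\])(\{\d+(,\d+)?\})?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, repeat = m.group(1), m.group(2) or ""
        # X means "any residue", which is '.' in regex syntax.
        parts.append(("." if residue == "X" else residue) + repeat)
    return "".join(parts)

regex = prosite_to_regex("C-X-[DE]-X{10,12}-C-X-C")
print(regex)  # C.[DE].{10,12}C.C

accepted = "CAD" + "G" * 10 + "CAC"
rejected = "CAD" + "G" * 5 + "CAC"   # spacer too short
print(bool(re.search(regex, accepted)), bool(re.search(regex, rejected)))
```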
– Regular expressions are intolerant to an
– The Union operation (I+V+L) does not quantify the relative importance of I, V, L. It could be that V occurs in 80% of the family members.
– Profiles capture some of these ideas.
alignment of strings
alphabet A,
matrix F=(fki)
represents the frequency of symbol k in position i
0.71 0.14 0.14 0.28
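The frequency matrix F = (fki) can be computed column by column from an alignment. This is a minimal sketch, assuming an ungapped alignment of equal-length strings; the toy alignment and the function name are invented for illustration and are not the alignment shown on the slide.

```python
from collections import Counter

def profile_from_alignment(rows, alphabet):
    """Return F with F[k][i] = frequency of symbol k in column i of the alignment."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned strings must have the same length")
    F = {k: [0.0] * length for k in alphabet}
    for i in range(length):
        counts = Counter(row[i] for row in rows)
        for k in alphabet:
            F[k][i] = counts[k] / len(rows)
    return F

# Toy alignment, invented for illustration.
rows = ["FKL", "FKV", "YKL", "FRL"]
F = profile_from_alignment(rows, "FKLRVY")
print(F["F"][0], F["Y"][0])  # 0.75 0.25
```

A gap character could be added to the alphabet if gapped columns need to be counted as well.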
belong to the family described by a profile?
the profile, and score it
aligning position i of the profile to residue sj
is the sum of column scores.
score(i, sj) = Σ_k fki ⋅ M[k, sj], where M is the scoring matrix
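Scoring a sequence against a profile can be sketched as follows, assuming the column score is the frequency-weighted sum Σ_k fki ⋅ M[k, sj] and that the profile is aligned to the sequence without gaps. The scoring matrix, the two-column profile, and the function names are hypothetical.

```python
def column_score(F, M, i, residue):
    """Score of aligning profile column i to `residue`: sum over k of F[k][i] * M[k][residue]."""
    return sum(F[k][i] * M[k][residue] for k in F)

def profile_score(F, M, s, j=0):
    """Sum of column scores when the profile is aligned, without gaps, to s[j:j+m]."""
    m = len(next(iter(F.values())))
    return sum(column_score(F, M, i, s[j + i]) for i in range(m))

alphabet = "FYKL"
# Hypothetical scoring matrix: +1 for identical residues, -1 otherwise.
M = {a: {b: (1.0 if a == b else -1.0) for b in alphabet} for a in alphabet}
# Hypothetical 2-column profile.
F = {"F": [0.75, 0.0], "Y": [0.25, 0.0], "K": [0.0, 1.0], "L": [0.0, 0.0]}

print(profile_score(F, M, "FK"))  # 1.5
print(profile_score(F, M, "YK"))  # 0.5
```

The common residue F scores higher in column 0 than the rare residue Y, which is the kind of weighting a plain union in a regular expression cannot express.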
domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences.
sequences weakly (using BLAST), but does not match any known profile?
– Find homologs using Blast on query
– Discard very similar homologs
– Align, make a profile, search with profile.
– Why is this more sensitive?
Seq Db
the red sequence will be thrown out.
in non-essential residues
– Use ungapped multiple alignments only
– If a profile of length m must score >= T
– Then, at least one sub-profile of length l must score >= lT/m
– Generate all l-mers that score at least lT/m
– Search using an automaton
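The l-mer generation step can be sketched by brute-force enumeration. The example assumes the frequency-weighted column score used for profiles, a hypothetical two-letter alphabet, and a hypothetical +1/-1 scoring matrix. Enumeration is exponential in l, so it is only practical for short l.

```python
from itertools import product

def high_scoring_lmers(F, M, start, l, threshold):
    """All l-mers whose score against profile columns start..start+l-1 is >= threshold."""
    alphabet = sorted(F)
    # Precompute the score of every residue against each column of the sub-profile.
    col = [{r: sum(F[k][start + i] * M[k][r] for k in F) for r in alphabet}
           for i in range(l)]
    return ["".join(w) for w in product(alphabet, repeat=l)
            if sum(col[i][r] for i, r in enumerate(w)) >= threshold]

# Hypothetical profile of length m = 4 over the alphabet {A, C}.
F = {"A": [1.0, 0.5, 0.0, 1.0], "C": [0.0, 0.5, 1.0, 0.0]}
M = {a: {b: (1.0 if a == b else -1.0) for b in "AC"} for a in "AC"}

m, T, l = 4, 2.0, 2
threshold = l * T / m  # 1.0
print(high_scoring_lmers(F, M, 0, l, threshold))  # ['AA', 'AC']
print(high_scoring_lmers(F, M, 2, l, threshold))  # ['CA']
```

The resulting word list is what would be compiled into an automaton for the database scan.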
alignments.
– Difficult if there are gaps in the alignment.
– Psi-BLAST/BLOCKS etc. work with gapless alignments.
helps put the alignment construction/ membership query in a uniform framework.
scoring.
V
while’, he uses a loaded coin.
loads the coin?
the alignment as generating a distribution.
node that outputs a residue with the appropriate distribution
0.71 0.14 Pr[F]=0.71 Pr[Y]=0.14
generates random sequences.
– Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S]
page.
paths.
– M1I1M2M3
– M1M2I2M3
probabilities of transition between states
A L - L
A I V L
A I - L
– Nodes emit symbols according to ‘emission probabilities’
– Transition from node to node is guided by ‘transition probabilities’
P
– Pr[S,P|M] = Pr[S|P,M] Pr[P|M]
– Pr[ALIL AND M1I1M2M3 | M] = Pr[ALIL | M1I1M2M3, M] Pr[M1I1M2M3 | M]
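The factorization Pr[S,P|M] = Pr[S|P,M] Pr[P|M] can be evaluated directly for a given path. The sketch below assumes every state on the path emits exactly one symbol and ignores any transition into an end state. All emission and transition probabilities are invented for illustration.

```python
def joint_probability(S, P, emit, trans, start="START"):
    """Pr[S, P | M] = Pr[S | P, M] * Pr[P | M] for a path P with one emitting state per symbol."""
    if len(S) != len(P):
        raise ValueError("path must have one emitting state per symbol")
    pr_path, pr_emit, prev = 1.0, 1.0, start
    for symbol, state in zip(S, P):
        pr_path *= trans.get((prev, state), 0.0)   # contributes to Pr[P | M]
        pr_emit *= emit[state].get(symbol, 0.0)    # contributes to Pr[S | P, M]
        prev = state
    return pr_emit * pr_path

# Hypothetical parameters for a three-column profile HMM with insert states.
emit = {
    "M1": {"A": 1.0},
    "I1": {"L": 0.25, "I": 0.25, "V": 0.5},
    "M2": {"L": 0.5, "I": 0.5},
    "I2": {"I": 0.5, "V": 0.5},
    "M3": {"L": 1.0},
}
trans = {
    ("START", "M1"): 1.0, ("M1", "I1"): 0.25, ("M1", "M2"): 0.75,
    ("I1", "M2"): 1.0, ("M2", "I2"): 0.5, ("M2", "M3"): 0.5, ("I2", "M3"): 1.0,
}

print(joint_probability("ALIL", ["M1", "I1", "M2", "M3"], emit, trans))  # 0.015625
print(joint_probability("ALIL", ["M1", "M2", "I2", "M3"], emit, trans))  # 0.09375
```

Under these made-up parameters the second path explains ALIL better than the first, which is why different paths through the same model need to be compared or summed.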
transitioning from state j to state k.
(emit) the sequence S.
– The sum over the joint probabilities over all paths.
– OR, it is the probability of the most likely path
similar algorithms to solve them.
likely solution that emits S1…Si, and ends in state j (is it sufficient to compute this?)
A L - L
A I V L
A I - L
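Both readings of "the probability that the HMM emits S" (the sum over all paths, or the most likely path) can be computed with the same table-filling scheme, differing only in whether the recurrence sums or maximizes. The sketch below uses the loaded-coin setting from the quiz, with a fair state F and a loaded state L. It assumes every state emits a symbol, and all probabilities are invented for illustration.

```python
def forward(S, states, start, trans, emit):
    """Pr[S | M]: sum of the joint probabilities over all state paths."""
    f = {j: start[j] * emit[j][S[0]] for j in states}
    for symbol in S[1:]:
        f = {k: emit[k][symbol] * sum(f[j] * trans[j][k] for j in states)
             for k in states}
    return sum(f.values())

def viterbi(S, states, start, trans, emit):
    """Most likely state path for S, and the joint probability of S with that path."""
    # v[j] = probability of the most likely path that emits S1..Si and ends in state j.
    v = {j: start[j] * emit[j][S[0]] for j in states}
    back = []
    for symbol in S[1:]:
        ptr, nxt = {}, {}
        for k in states:
            best = max(states, key=lambda j: v[j] * trans[j][k])
            ptr[k] = best
            nxt[k] = v[best] * trans[best][k] * emit[k][symbol]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda j: v[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

# Hypothetical fair (F) / loaded (L) coin model.
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

path, best = viterbi("HHHHHHHH", states, start, trans, emit)
print("".join(path), best <= forward("HHHHHHHH", states, start, trans, emit))
```

Keeping only the best value per state is sufficient because the best path ending in state k at position i+1 must extend a best path ending in some state j at position i.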
that the sequence belongs to the family.
alignment
A L - L
A I V L
A I - L
Path: M1 M2 I2 M3
Sequence: A L I L
penalties, and allow for automated training to get a good alignment.
families and focus on key residues
needs special algorithms to query efficiently.
capture proteins (domains) using various representations
associated with structure/ function information, parsed from the literature.
query mechanisms that allow us to compare our sequences against them, and assign function
3D HMM
What is a Gene?
location on the genome that codes for proteins.
used to manufacture proteins through transcription, and translation.
mapping from triplets to amino-acids
Eukaryotic gene structure
reads mRNA.
translated into a unique amino-acid until the STOP codon is encountered.
signal where translation starts, usually at the ATG (M) codon.
many ways can you translate it?
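Assuming the question refers to reading frames, a sequence can be read in three frames per strand, so double-stranded DNA has six possible translations. The sketch below builds the standard genetic code table and translates all six frames; the function names are hypothetical, and the input is assumed to contain only the letters A, C, G, T.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon (as a DNA triplet) -> one-letter amino acid, '*' = STOP.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base and reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate consecutive triplets from the first base; trailing 1-2 bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three reading frames on the given strand, then three on the reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]

print(translate("ATGGCCTAA"))   # MA*
print(six_frames("ATGGCCTAA"))
```

This sketch does not stop at the first STOP codon; it marks it with `*` so that open reading frames can be located afterwards.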
[Figure: gene structure diagram labeling transcription start, 5’ UTR, translation start (ATG), exon, intron, donor and acceptor splice sites, and 3’ UTR]
– Location that codes for a protein
– The transcript sequence(s) that encodes the protein
– The protein sequence(s)
spent isolating a single gene sequence.
development of high throughput methods like EST sequencing
mRNA from a cell.
transcriptase is used to make a DNA copy of the RNA.
complementary DNA strand.
both ends.
transcripts/expressed sequences (ESTs).
[Figure: poly-A stretches (AAAA) paired with complementary poly-T stretches (TTTT)]
(mRNA) has a poly-A tail at the end, which can be used as a template for Reverse Transcriptase.
the spliced message!
sequenced from one (3’/5’) or both ends.
times.
sequences is called an EST database
good thing?
sequence