SLIDE 1
TEXT ENCODING FOR PROTEIN STRUCTURE REPRESENTATION
Jun Tan, Donald Adjeroh
- Biological background
- The problem and our goal
- Related research
- General idea
- Implementations
- Summary
SLIDE 2
SLIDE 3
BIOLOGICAL BACKGROUND [1]
SLIDE 4
PROTEIN
SLIDE 5
PROTEIN
- One of the four basic building blocks of life
- DNA -> RNA -> Protein
- Peptide bond: the link between two amino acids
- Polypeptide: a chain of amino acids
- Once the chain of amino acids is in its final shape, it is called a protein
- Twenty types of amino acids
- Each amino acid has three groups: COOH, NH2 and R
SLIDE 6
PROTEIN
- Proteins can have very complex shapes, and the final form is essential to the intended function
- Primary structure: the linear sequence of amino acids in the chain
- Secondary structure: describes common local folding patterns
- Tertiary structure: describes the overall three-dimensional structure of a single folded amino acid chain
- Quaternary structure: for proteins with multiple chains, describes how the subunits that make up the protein fit together
SLIDE 7
PROTEIN
- Primary structure determines all other structures
- However...
SLIDE 8
PROTEIN
- The 3D shape of a protein has a direct impact on its function
- Secondary structure is much more conserved than sequence (primary structure) over the course of evolution [2]
SLIDE 9
THE PROBLEM AND OUR GOAL
SLIDE 10
THE ORIGINAL FORMAT
- Saved as x, y, z coordinates for each atom along the chain
- Complicated operations are needed for even simple tasks
SLIDE 11
OUR GOAL
- Simplify the representation while keeping the information that carries biological meaning
- Demonstrate the performance by searching for similar structures in a protein domain database
- 80 query domains; database size: 23,500
SLIDE 12
STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP) DATABASE
- Protein domain: part of a protein that can evolve, function and exist independently of the rest of the protein chain
- Manually classified
- Hierarchical structure: Class, Fold, Superfamily, Family
SLIDE 13
RELATED RESEARCH
SLIDE 14
DIRECTLY ALIGN 3D SHAPES
- High accuracy
- Involves complex operations; time-consuming
- Example: the DALI (distance alignment matrix method) algorithm [3, 4]
SLIDE 15
CONVERT 3D SHAPE INTO 2D TEXTURES
SLIDE 16
CONVERT 3D SHAPE INTO STRING
- Not as accurate as 3D methods, but close
- Much faster
- Example: Ramachandran codes [5]
SLIDE 17
FRAGMENT APPROACH [6]
- Library of fragments/short structural motifs (hand-picked?)
- Represents a protein structure as the frequency of the fragments
- Bag-of-words method
SLIDE 18
GENERAL IDEA
- Decompose a shape into a sequence of segments
- Represent the segments with basic primitives: segment type, segment length and transition angle between segments
- Encode the primitives into a shape string
- Answer biological questions by applying string/text algorithms to the shape strings
- N-grams, TF/IDF and cosine similarity are used to compare shape strings
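The comparison step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the n-gram length of 3, the smoothed IDF formula and the example shape strings are all assumptions made for the sketch.

```python
import math
from collections import Counter


def ngrams(shape_string, n=3):
    """Return all overlapping character n-grams of a shape string."""
    return [shape_string[i:i + n] for i in range(len(shape_string) - n + 1)]


def tfidf_vectors(shape_strings, n=3):
    """Build one sparse TF-IDF vector (dict: n-gram -> weight) per string."""
    counts = [Counter(ngrams(s, n)) for s in shape_strings]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    total = len(shape_strings)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        vec = {}
        for gram, k in c.items():
            # Smoothed IDF (an assumption) keeps every weight positive,
            # even for n-grams that occur in every string.
            idf = math.log((1 + total) / (1 + doc_freq[gram])) + 1.0
            vec[gram] = (k / length) * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical shape strings over a made-up alphabet of segment labels.
database = ["AAABBBAAACCC", "AAABBBAAACCA", "CCCCDDDDCCCC"]
vecs = tfidf_vectors(database, n=3)
similar = cosine(vecs[0], vecs[1])
dissimilar = cosine(vecs[0], vecs[2])
```

Here the first two strings share most of their 3-grams, so `similar` comes out larger than `dissimilar`. No alignment is computed at any point.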
SLIDE 19
IMPLEMENTATIONS
SLIDE 20
DIHEDRAL ANGLES (RAMACHANDRAN ANGLES)
- Among the most important local parameters that control protein folding
- Three angles:
- 1. φ involves atoms C'-N-Cα-C'
- 2. ψ involves atoms N-Cα-C'-N
- 3. ω involves atoms Cα-C'-N-Cα (usually 0° or 180° because the peptide bond is planar)
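Each of the three angles is a dihedral angle defined by four consecutive backbone atoms. A minimal sketch of that computation is shown below; it uses the standard atan2 formula, and the function names and test coordinates are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points.

    For phi the points would be C'(i-1), N(i), CA(i), C'(i);
    for psi they would be N(i), CA(i), C'(i), N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates, chosen only to exercise the function.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Applying `dihedral` along the chain gives one (φ, ψ) pair per residue, which is what a Ramachandran plot shows.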
SLIDE 21
RAMACHANDRAN PLOT
SLIDE 22
CLUSTERING DIHEDRAL ANGLES
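One way to turn clustered (φ, ψ) pairs into a shape string is sketched below. The slides do not say which clustering algorithm was used, so plain k-means with wrap-around angle distances is an assumption here, as are the function names, the initialisation from the first k points and the letter alphabet.

```python
import math


def angle_diff(a, b):
    """Difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torus_dist2(p, q):
    """Squared distance between two (phi, psi) pairs with wrap-around."""
    return angle_diff(p[0], q[0]) ** 2 + angle_diff(p[1], q[1]) ** 2


def circular_mean(angles):
    """Mean direction of a list of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: torus_dist2(point, centers[j]))


def kmeans_angles(points, k, iterations=20):
    """Plain k-means on (phi, psi) pairs; centers start at the first k points."""
    centers = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            (circular_mean([p[0] for p in g]), circular_mean([p[1] for p in g]))
            if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def encode(points, centers, alphabet="ABCDEF"):
    """Map each (phi, psi) pair to the letter of its nearest cluster."""
    return "".join(alphabet[nearest(p, centers)] for p in points)


# Hypothetical angles for six residues, alternating between two regions.
residues = [(-60, -45), (-120, 130), (-65, -40), (-115, 135), (-58, -47), (-125, 128)]
centers = kmeans_angles(residues, k=2)
shape_string = encode(residues, centers)
```

With 4 or 6 clusters the alphabet simply has 4 or 6 letters, and each protein chain becomes a string over that alphabet.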
SLIDE 23
PRECISION VS RECALL
Two plots: 4 clusters and 6 clusters
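For reference, precision and recall for one query can be computed from its ranked result list as sketched below. How relevance was defined for these plots is not stated on the slide, so the boolean labels, the function name and the example values are assumptions.

```python
def precision_recall_curve(ranked_labels, total_relevant):
    """Precision and recall after each position of a ranked result list.

    ranked_labels: booleans, True where the retrieved item is relevant.
    total_relevant: number of relevant items in the whole database.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        hits += bool(is_relevant)
        # precision = hits / items retrieved, recall = hits / all relevant
        points.append((hits / rank, hits / total_relevant))
    return points


# Hypothetical ranking: relevant, not relevant, relevant.
curve = precision_recall_curve([True, False, True], total_relevant=2)
```

Averaging such curves over all query domains would give one curve per encoding.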
SLIDE 24
FOLD VS RECALL
Two plots: 4 clusters and 6 clusters
SLIDE 25
CLASS VS RECALL
Two plots: 4 clusters and 6 clusters
SLIDE 26
TRIPLES
- Dihedral angles involve three consecutive residues only
- Pick three residues/points that best represent a segment of a given length
- The three residues are selected as follows:
- 1. Select the first and last residues, A and B
- 2. Select residue C such that the distance d from C to the straight line segment AB is maximized
- 3. Use three distances to represent the triple: d, |AB|, max(|AC|, |BC|)
- Another predefined parameter determines how much two adjacent fragments overlap
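The three steps can be sketched as follows. This is an illustration under stated assumptions: each residue is reduced to one 3D point, the overlap is counted in residues, and the function names, parameter names and coordinates are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def point_segment_distance(c, a, b):
    """Distance from point c to the straight line segment ab."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    ab_len2 = sum(v * v for v in ab)
    if ab_len2 == 0.0:
        return dist(c, a)
    # Projection parameter of c onto ab, clamped to stay on the segment.
    t = max(0.0, min(1.0, sum(u * v for u, v in zip(ac, ab)) / ab_len2))
    closest = [x + t * v for x, v in zip(a, ab)]
    return dist(c, closest)


def triple(segment):
    """Return (d, |AB|, max(|AC|, |BC|)) for one segment of points."""
    a, b = segment[0], segment[-1]
    c = max(segment[1:-1], key=lambda p: point_segment_distance(p, a, b))
    d = point_segment_distance(c, a, b)
    return (d, dist(a, b), max(dist(a, c), dist(b, c)))


def triples(points, size=5, overlap=2):
    """Slide a window of `size` points; adjacent windows share `overlap` points."""
    if size < 3 or not 0 <= overlap < size:
        raise ValueError("need size >= 3 and 0 <= overlap < size")
    step = size - overlap
    return [triple(points[i:i + size])
            for i in range(0, len(points) - size + 1, step)]


# Hypothetical coordinates, one point per residue.
backbone = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0), (3.0, 0.0, 0.0),
            (4.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 1.0, 0.0), (7.0, 0.0, 0.0)]
features = triples(backbone, size=5, overlap=2)
```

Each resulting triple is a point in a 3D feature space, so the triples could be clustered and mapped to letters in the same way as the dihedral angles.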
SLIDE 27
ILLUSTRATION
SLIDE 28
DISTRIBUTION (SEGMENT SIZE = 5)
SLIDE 29
DISTRIBUTION (SEGMENT SIZE = 10)
SLIDE 30
PRECISION VS RECALL
Two plots: triples with 6 clusters and dihedral angles with 6 clusters
SLIDE 31
FOLD VS RECALL
Two plots: triples with 6 clusters and dihedral angles with 6 clusters
SLIDE 32
CLASS VS RECALL
Two plots: triples with 6 clusters and dihedral angles with 6 clusters
SLIDE 33
SUMMARY
SLIDE 34
SIGNIFICANCE
- Avoids alignment
- Runs fast: O(n) complexity
- Automatically learns important patterns
- No predefined fragment libraries are needed
SLIDE 35
WEAKNESS AND FUTURE WORK
- Performance is not as good as that of alignment-based methods
- Possible improvement one: using multiple strings
SLIDE 36
QUESTIONS?
SLIDE 37
REFERENCE
- 1. An Introduction to Proteins
- 2. Whitford D, Proteins: Structure and Function, John Wiley & Sons, West Sussex, 2005.
- 3. Holm L and Sander C, “Touring protein fold space with Dali/FSSP”, Nucleic Acids Res., 26, 316-319, 1998.
- 4. Holm L, Kaariainen S, Rosenstrom P and Schenkel A, “Searching protein structure databases with DaliLite v.3”, Bioinformatics, 24, 2780-2781, 2008.
SLIDE 38
REFERENCE (CONT.)
- 5. Lo WC, Huang PJ, Chang CH and Lyu PC, “Protein structural similarity search by Ramachandran codes”, BMC Bioinformatics, 8, 307, 2007.
- 6. Budowski-Tal, Inbal, Yuval Nov, and Rachel Kolodny.