TEXT ENCODING FOR PROTEIN STRUCTURE REPRESENTATION Jun Tan Donald - - PowerPoint PPT Presentation



slide-1
SLIDE 1

TEXT ENCODING FOR PROTEIN STRUCTURE REPRESENTATION

Jun Tan, Donald Adjeroh

slide-2
SLIDE 2

  • Biological background
  • The problem and our goal
  • Related research
  • General idea
  • Implementations
  • Summary

slide-3
SLIDE 3

BIOLOGICAL BACKGROUND [1]

slide-4
SLIDE 4

PROTEIN

slide-5
SLIDE 5

PROTEIN

  • One of the four basic building blocks of life
  • DNA -> RNA -> Protein
  • Peptide bond: the link between two amino acids
  • Polypeptide: a chain of amino acids
  • Once the chain of amino acids is in its final shape, it is called a protein
  • Twenty types of amino acids
  • Each amino acid has three groups: COOH, NH2 and R

slide-6
SLIDE 6

PROTEIN

  • Proteins can have very complex shapes, and the final form is essential to the intended function
  • Primary structure: the linear sequence of amino acids in the chain
  • Secondary structure describes common local folding patterns
  • Tertiary structure describes the overall three-dimensional structure of a single folded amino acid chain
  • Quaternary structure, for proteins with multiple chains, describes how the subunits are arranged to form the protein

slide-7
SLIDE 7

PROTEIN

  • Primary structure determines all other structures
  • However...

slide-8
SLIDE 8

PROTEIN

  • The shape of the 3D protein structure has a direct impact on its function
  • Secondary structure is much more conserved than sequence (primary structure) over evolution [2]

slide-9
SLIDE 9

THE PROBLEM AND OUR GOAL

slide-10
SLIDE 10

THE ORIGINAL FORMAT

  • Saved as x, y, z coordinates for each atom along the chain
  • Even simple tasks require complicated operations

slide-11
SLIDE 11

OUR GOAL

  • Simplify the representation while keeping the important information, so that it retains its biological meaning
  • Demonstrate the performance: search for similar structures in a protein domain database
  • 80 query domains; database size: 23500

slide-12
SLIDE 12

STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP) DATABASE

  • Protein domain: a part of a protein that can evolve, function and exist independently of the rest of the protein chain
  • Manually classified
  • Hierarchical structure: Class, Fold, Superfamily, Family

slide-13
SLIDE 13

RELATED RESEARCH

slide-14
SLIDE 14

DIRECTLY ALIGN 3D SHAPES

  • High accuracy
  • Involves complex operations
  • Time consuming
  • Example: the DALI (distance alignment matrix method) algorithm [3, 4]

slide-15
SLIDE 15

CONVERT 3D SHAPE INTO 2D TEXTURES

slide-16
SLIDE 16

CONVERT 3D SHAPE INTO STRING

  • Not as accurate as the 3D methods, but close
  • Much faster
  • Example: Ramachandran codes [5]

slide-17
SLIDE 17

FRAGMENT APPROACH [6]

  • Library of fragments/short structural motifs (hand picked?)
  • Represent a protein structure by the frequencies of the fragments
  • Bag-of-words method

slide-18
SLIDE 18

GENERAL IDEA

  • Decompose a shape into a sequence of segments
  • Represent the segments with basic primitives: segment type, segment length and transition angle between segments
  • Encode the primitives into a shape string
  • Answer biological questions by applying string/text algorithms to the shape strings
  • N-grams, TF-IDF and cosine similarity are used to compare shape strings
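The comparison step can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the function names, the n-gram length of 3, the smoothed IDF formula and the example shape strings are all hypothetical choices.

```python
import math
from collections import Counter


def ngrams(shape_string, n=3):
    """Return all overlapping character n-grams of a shape string."""
    return [shape_string[i:i + n] for i in range(len(shape_string) - n + 1)]


def tfidf_vectors(shape_strings, n=3):
    """Build one sparse TF-IDF vector (a dict) per shape string."""
    counts = [Counter(ngrams(s, n)) for s in shape_strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    total = len(shape_strings)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({
            gram: (k / length) * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical shape strings over a small alphabet.
strings = ["AAABBBCCC", "AAABBBCCA", "DDDDEEEE"]
vecs = tfidf_vectors(strings)
similar = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

Ranking a database then amounts to sorting all database vectors by their cosine similarity to the query vector, with no alignment step.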

slide-19
SLIDE 19

IMPLEMENTATIONS

slide-20
SLIDE 20

DIHEDRAL ANGLES (RAMACHANDRAN ANGLES)

One of the most important local parameters that control protein folding

Three angles:

  • 1. φ involves atoms C'-N-Cα-C'
  • 2. ψ involves atoms N-Cα-C'-N
  • 3. ω involves atoms Cα-C'-N-Cα (usually 180°, occasionally 0°, because the peptide bond is planar)
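Each of these angles is the torsion defined by four consecutive atoms. The sketch below computes such an angle from four 3D coordinates using the standard atan2 formulation. The function name `dihedral` and the test coordinates are hypothetical; real input would be the backbone atom positions read from a structure file.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], defined by four points.

    For phi, pass C'(i-1), N(i), CA(i), C'(i); for psi, pass
    N(i), CA(i), C'(i), N(i+1).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates with the central bond along the z axis.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

In this sketch a cis arrangement gives 0° and a trans arrangement gives 180°, matching the two values listed for ω.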

slide-21
SLIDE 21

RAMACHANDRAN PLOT

slide-22
SLIDE 22

CLUSTERING DIHEDRAL ANGLES
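
Once cluster centres are available, each residue's (φ, ψ) pair can be mapped to the letter of its nearest centre to produce a string. The sketch below assumes four centres with made-up coordinates and a simple wrap-around distance. The slides do not state the clustering algorithm or the centre values, so everything named here is hypothetical.

```python
import math

# Hypothetical cluster centres (phi, psi) in degrees; real centres would be
# learned from the dihedral angles observed in the database.
CENTERS = {
    "A": (-60.0, -45.0),
    "B": (-120.0, 130.0),
    "C": (60.0, 45.0),
    "D": (-75.0, 150.0),
}


def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_letter(phi, psi, centers=CENTERS):
    """Letter of the centre closest to (phi, psi), respecting wrap-around."""
    return min(
        centers,
        key=lambda k: math.hypot(angular_diff(phi, centers[k][0]),
                                 angular_diff(psi, centers[k][1])),
    )


def encode(angles, centers=CENTERS):
    """Encode a list of (phi, psi) pairs as a string, one letter per pair."""
    return "".join(nearest_letter(phi, psi, centers) for phi, psi in angles)


code = encode([(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)])
```

The wrap-around distance matters because angles such as 179° and -179° are neighbours on the Ramachandran plot even though their numeric values are far apart.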

slide-23
SLIDE 23

PRECISION VS RECALL

Two plots: 4 clusters and 6 clusters

slide-24
SLIDE 24

FOLD VS RECALL

Two plots: 4 clusters and 6 clusters

slide-25
SLIDE 25

CLASS VS RECALL

Two plots: 4 clusters and 6 clusters

slide-26
SLIDE 26

TRIPLES

  • Dihedral angles involve three consecutive residues only
  • Pick three residues/points that best represent a segment of a given length

The three residues are selected as follows:

  • 1. Select the first and last residues, A and B
  • 2. Select residue C such that the distance d from C to the straight line segment AB is maximized
  • 3. Use three distances to represent the triple: d, |AB| and max(|AC|, |BC|)

Another predefined parameter determines how much two adjacent segments overlap.
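
The selection steps can be sketched as follows. The sketch assumes one 3D point per residue (for example the Cα position), measures d as the distance to the bounded segment AB, and uses hypothetical names and defaults (`triple`, `triples`, `size`, `overlap`). It needs Python 3.8 or later for `math.dist`.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the straight line segment from a to b."""
    ab = tuple(b[i] - a[i] for i in range(3))
    ap = tuple(p[i] - a[i] for i in range(3))
    denom = sum(c * c for c in ab)
    if denom == 0:
        return math.dist(p, a)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = max(0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom))
    closest = tuple(a[i] + t * ab[i] for i in range(3))
    return math.dist(p, closest)


def triple(segment):
    """Return (d, |AB|, max(|AC|, |BC|)) for one segment of 3D points."""
    if len(segment) < 3:
        raise ValueError("a segment needs at least three points")
    a, b = segment[0], segment[-1]
    # C is the interior point farthest from the segment AB.
    c = max(segment[1:-1], key=lambda p: point_segment_distance(p, a, b))
    d = point_segment_distance(c, a, b)
    return (d, math.dist(a, b), max(math.dist(a, c), math.dist(b, c)))


def triples(points, size=5, overlap=2):
    """Slide a window of `size` points, sharing `overlap` points per step."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the segment size")
    return [triple(points[i:i + size])
            for i in range(0, len(points) - size + 1, step)]


# Hypothetical segment of five points.
segment = [(0, 0, 0), (1, 0, 0), (2, 3, 0), (3, 0, 0), (4, 0, 0)]
d, ab_len, longest = triple(segment)
```

Each resulting triple is a point in a three-dimensional feature space, so the same clustering and letter-assignment idea used for dihedral angles can turn a chain of triples into a string.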

slide-27
SLIDE 27

ILLUSTRATION

slide-28
SLIDE 28

DISTRIBUTION (SEGMENT SIZE = 5)

slide-29
SLIDE 29

DISTRIBUTION (SEGMENT SIZE = 10)

slide-30
SLIDE 30

PRECISION VS RECALL

Two plots: triples with 6 clusters, and dihedral angles with 6 clusters

slide-31
SLIDE 31

FOLD VS RECALL

Two plots: triples with 6 clusters, and dihedral angles with 6 clusters

slide-32
SLIDE 32

CLASS VS RECALL

Two plots: triples with 6 clusters, and dihedral angles with 6 clusters

slide-33
SLIDE 33

SUMMARY

slide-34
SLIDE 34

SIGNIFICANCE

  • Avoids alignment
  • Runs fast: O(n) complexity
  • Automatically learns important patterns
  • No predefined fragment libraries are needed

slide-35
SLIDE 35

WEAKNESS AND FUTURE WORK

  • Performance is not as good as that of alignment-based methods
  • Possible improvement one: using multiple strings

slide-36
SLIDE 36

QUESTIONS?

slide-37
SLIDE 37

REFERENCE

  • 1. An Introduction to Proteins
  • 2. Whitford D, Proteins: Structure and Function, John Wiley & Sons, West Sussex, 2005.
  • 3. Holm L and Sander C, “Touring protein fold space with Dali/FSSP”, Nucleic Acids Res., 26, 316-319, 1998.
  • 4. Holm L, Kaariainen S, Rosenstrom P and Schenkel A, “Searching protein structure databases with DaliLite v.3”, Bioinformatics, 24, 2780-2781, 2008.

slide-38
SLIDE 38

REFERENCE (CONT.)

  • 5. Lo WC, Huang PJ, Chang CH and Lyu PC, “Protein structural similarity search by Ramachandran codes”, BMC Bioinformatics, 8, 307, 2007.
  • 6. Budowski-Tal I, Nov Y and Kolodny R, “FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately”, Proceedings of the National Academy of Sciences, 107(8), 3481-3486, 2010.