SAM-T04: what’s new for CASP6

SLIDE 1

SAM-T04: what’s new for CASP6

Kevin Karplus, Richard Hughey, Jenny Draper, Sol Katzman, Martina Koeva, George Shackelford, Bret Barnes, Marcia Soriano

karplus@soe.ucsc.edu

Biomolecular Engineering University of California, Santa Cruz

CASP6, SAM-T04 – p.1/43

SLIDE 2

Steps of SAM-Txx Methods

Iterative search and alignment [rewritten, minor improvements]
Local structure prediction [new alphabets, minor tweaks]
Multi-track HMMs [minor tweaks]
Finding medium-length fragments (fragfinder) [multi-track HMMs, filter implausible]
Contact prediction [all new]
Conformation generation (undertaker) [major changes]

SLIDE 3

Contact prediction: new in 2004!

Use mutual information between columns.
Thin alignments aggressively (30%, 35%, 40%, 50%, 62%).
Compute e-value for mutual info (correcting for small-sample effects).
Compute z-score of log(e-value) within protein.
Feed e-values, z-scores, conservation, amino-acid profile, separation along chain into neural net.
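The first and fourth steps above can be illustrated with a minimal Python sketch. It assumes alignment columns are given as equal-length strings and that sequences with a gap in either column are skipped. The function names are hypothetical, and the e-value calculation with its small-sample correction is not described on the slide, so it is not attempted here; `z_scores` is shown on arbitrary values rather than on log(e-value).

```python
import math
from collections import Counter

def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b hold one character per sequence. Sequences with a
    gap ('-' or '.') in either column are skipped (an assumption).
    """
    pairs = [(a, b) for a, b in zip(col_a, col_b)
             if a not in "-." and b not in "-."]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi

def z_scores(values):
    """Standardize values within one protein: (v - mean) / std."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

# Perfectly covarying columns carry 1 bit of mutual information.
print(column_mutual_information("AAVV", "LLII"))  # 1.0
```

Raw mutual information is biased upward when few sequences remain after thinning, which is presumably why the slide converts it to an e-value before standardizing.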

SLIDE 4

Evaluating contact prediction

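Two measures of contact prediction, where χ(i, j) is 1 if the predicted pair (i, j) is a true contact and 0 otherwise, and the sums run over the predicted pairs:

Accuracy:

\[ \frac{\sum \chi(i,j)}{\sum 1} \]

(favors short-range predictions, where contact probability is higher)

Weighted accuracy:

\[ \frac{\sum \dfrac{\chi(i,j)}{\mathrm{Prob}(\mathrm{contact} \mid \mathrm{separation} = |i-j|)}}{\sum 1} \]

(1 if predictions no better than chance based on separation).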
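The two measures above can be sketched in Python as follows. This is an illustration under assumptions: predictions are a list of (i, j) residue pairs, true contacts are a set of such pairs, and `prob_contact_given_sep` is a hypothetical lookup table of background contact probability by chain separation. None of these names come from the SAM-T04 software.

```python
def accuracy(predicted_pairs, true_contacts):
    """Fraction of predicted pairs that are true contacts."""
    hits = sum(1 for pair in predicted_pairs if pair in true_contacts)
    return hits / len(predicted_pairs)

def weighted_accuracy(predicted_pairs, true_contacts, prob_contact_given_sep):
    """Average of is_contact / P(contact | separation) over predictions.

    A predictor that does no better than chance at each separation
    scores about 1; correct long-range predictions count for more
    because their background probability is lower.
    """
    total = 0.0
    for i, j in predicted_pairs:
        if (i, j) in true_contacts:
            total += 1.0 / prob_contact_given_sep[abs(i - j)]
    return total / len(predicted_pairs)

# Toy example with made-up numbers.
predicted = [(1, 5), (2, 20), (3, 9)]
true_contacts = {(1, 5), (2, 20)}
background = {4: 0.5, 18: 0.1, 6: 0.25}  # separation -> P(contact)
print(accuracy(predicted, true_contacts))
print(weighted_accuracy(predicted, true_contacts, background))
```

In the toy data the correct prediction at separation 18 contributes five times as much to the weighted score as the one at separation 4.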

SLIDE 5

Contact prediction results

[Figure: two plots against predictions/residue (log scale, 0.01 to 1). First plot: accuracy of contact prediction, by protein (true positives/predicted). Second plot: weighted accuracy of contact prediction, by protein (avg is_contact/prob(contact|sep)). Curves in both: neural net; thin62, e-value; thin62, raw mi.]

SLIDE 6

Undertaker

Undertaker is UCSC’s attempt at a fragment-packing program (named because it optimizes burial).
New cost functions (especially H-bonds).
Improved clash detection.
New conformation change operators (tweaking torsion angles, rigid body movements of chunks).
New ways to specify constraints (Hbond, SSbond, HelixConstraint, StrandConstraint, SheetConstraint).
Improved adaptation of genetic algorithm.

SLIDE 7

Model 1 vs. Robetta 1

[Figure: scatter plot of smooth GDT scores, SAM-T04 model1 vs. Robetta model1. Labeled targets: 248, 270, 197, 214, 230, 281, 201, 243, 280_1, 249_1, 199_1, 207, 222_2, 231, 229_1, 262, 202, 206, 262_2, 202_1, 264, 247_1.]

SLIDE 8

Good stuff from Murzin

We won’t discuss the following:
T0270: 1t0tA became available after servers ran.
T0213: Murzin suggested using 1t62A for T0213, T0214, and T0227. T04 scored 1t62A best; we messed up the good alignment.
T0214: We used 1t62A, but we never got a good alignment.
T0227: T04 scored 1t62A best, but 2° prediction was poor, so we had bad alignments.
T0240: We submitted both dimer and monomer, but mistakenly put the dimer first.
T0245: 1tljA became available, but we don’t have the true structure yet.

SLIDE 9

Best vs. Robetta best (NF and FR/A)

[Figure: scatter plot of smooth GDT scores, SAM-T04 best vs. Robetta best, NF and FR/A targets. Labeled targets: 201, 199_3, 248_1, 281, 209_2, 215, 235_2, 230, 248_2, 248_3, 272_1, 212, 280_2.]

SLIDE 10

Good stuff from Robetta

We won’t discuss the following, because the good stuff in them seems to have come from better Robetta models:
T0209_2: sheet constraints from Robetta-model1
T0248 (all 3 domains): borrows heavily from Robetta-model2

SLIDE 11

Model 1 vs. alignment (NF and FR/A)

[Figure: scatter plot of smooth GDT scores, SAM-T04 model1 vs. SAM-T04 first alignment, NF and FR/A targets. Labeled targets: 216_2, 272_2, 235_2, 209_2, 248_1, 238, 212, 199_3, 248_2, 201, 248_3, 230, 215, 281.]

SLIDE 12

Auto vs. align (NF and FR/A)

[Figure: scatter plot of smooth GDT scores, SAM-T04 automatic vs. SAM-T04 first alignment, NF and FR/A targets. Labeled targets: 216_2, 272_2, 209_2, 248_1, 235_2, 241_2, 238, 212, 272_1, 242, 280_2, 199_3, 248_2, 201, 248_3, 230, 215, 281, 273, 209_1.]

SLIDE 13

Target T0201 (NF)

We tried forcing various sheet topologies and selected 4 by hand.
Model 1 has right topology (5.9117 all-atom RMSD).
Unconstrained cost function not good at choosing topology.
Contact prediction didn’t help, though first prediction right.
Helices were too short.
Highest GDT and lowest RMSD model (try41-opt2.repack-nonPC 5.4912 all-atom) has wrong topology.

SLIDE 14

Target T0201 (NF)

SLIDE 15

Target T0201 (NF)

Wrong topology, but best scoring decoy.

SLIDE 16

Target T0230 (FR/A)

Good except for C-terminal loop and helix flopped wrong way.
We have secondary structure right, including phase of beta strands.
Contact prediction helped, but we put too much weight on it: decoys fit predictions better than real structure does.

SLIDE 17

Target T0230 (FR/A)

SLIDE 18

Target T0230 (FR/A)

Real structure with contact predictions:

SLIDE 19

Target T0281 (FR/A)

Third strand has off-by-one error.
Top T04 hit (1gefA) is good, T2K put it 3rd.
We submitted the best model we had (in GDT score; try7-opt1 had better RMSD).
Sol’s hand work helped, but my attempts to force M1-P4 as a first strand and to remove the bulge at R22 were misguided.

SLIDE 20

Target T0281 (FR/A)

Red is real structure.

SLIDE 21

Target T0215 (FR/A)

Secondary structure good, but helix packing angles wrong.
Need helix packing info in undertaker; hand-added constraints were wrong.
Too few homologs for contact prediction.

SLIDE 22

Target T0215 (FR/A)

Red is real structure.

SLIDE 23

Target T0212 (FR/A)

We tried to force a jelly-roll structure with the N-terminal strand omitted.
Swapping the N- and C-terminal strands of our model would make it almost right.
Strand T60-A66 is off by one.

SLIDE 24

Target T0212 (FR/A)

SLIDE 25

Web sites

UCSC bioinformatics degrees: http://www.soe.ucsc.edu/programs/bionformatics/
SAM tool suite info: http://www.soe.ucsc.edu/research/compbio/sam.html
HMM servers: http://www.soe.ucsc.edu/research/compbio/HMM-apps/
These slides: http://www.soe.ucsc.edu/~karplus/papers/casp6-slides.pdf
CASP6 all working files: http://www.soe.ucsc.edu/~karplus/casp6

SLIDE 26

Iterative search using HMMs

SAM-T98, T99, T2K, and T04 methods all use similar method for building a target HMM, given a single sequence (or a seed alignment).
The target04 script
  uses perl modules to encapsulate programs, for greater flexibility.
  uses fastacmd instead of grep for counting and retrieving sequences.
  uses blastpgp on each iteration to prefetch sequences for hmmscore.
  uses cheap_gaps transition regularizer throughout.

SLIDE 27

Local Structure Alphabets

Use more backbone alphabets:
  DSSP & DSSP-ehl2
  Str2
  Stride
  Bystroff
  alpha
Use burial alphabets:
  CB-14-7
  near-backbone-11

SLIDE 28

Neural Net

We use neural nets to predict local properties.
Input is profile with probabilities of amino acids at each position of target chain, plus insertion and deletion probabilities.
New in 2004 is additional 20 inputs with one-hot encoding of amino acid in the target sequence.
Neural nets were retrained using T04 alignments and better training set.
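The per-position input described above can be sketched as a feature vector. This is an assumption-laden illustration: the slide does not give the ordering of inputs, the number of insertion and deletion values, or any windowing over neighboring positions, so the layout below (20 profile values, one insertion probability, one deletion probability, 20 one-hot values) and the name `position_features` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues

def position_features(profile_row, p_insert, p_delete, residue):
    """Build the input vector for one position of the target chain.

    profile_row: 20 amino-acid probabilities from the alignment profile.
    p_insert, p_delete: insertion and deletion probabilities.
    residue: the amino acid actually present in the target sequence.
    """
    if len(profile_row) != len(AMINO_ACIDS):
        raise ValueError("profile_row must have 20 entries")
    # The 20 extra inputs: 1.0 for the target's own residue, 0.0 elsewhere.
    one_hot = [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
    return list(profile_row) + [p_insert, p_delete] + one_hot

features = position_features([0.05] * 20, 0.01, 0.02, "C")
print(len(features))  # 42
```

The one-hot block lets the network distinguish the target's own residue from the alignment's average at that position, which the profile alone does not convey.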

SLIDE 29

Multi-track HMMs

Using more 2-track HMMs: amino acid plus each local structure alphabet.
Using 3-track HMMs: amino acid, backbone (str2), burial (CB-14-7).
Generate many alignments for each potential template:
  use different HMMs.
  use both local and global.
  use both Viterbi and posterior decoding.
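One common way to score a position in a multi-track HMM is to treat the tracks as independent given the state, so the emission log-probability is a (possibly weighted) sum over tracks. The sketch below illustrates that idea only; whether SAM-T04 weights its tracks this way is an assumption here, and the function name, track names, and symbols are made up for the example.

```python
import math

def multitrack_log_emission(state_emissions, observed, track_weights=None):
    """Log-probability of one state emitting a symbol on every track.

    state_emissions: {track_name: {symbol: probability}} for this state.
    observed: {track_name: symbol} at the position being scored.
    track_weights: optional {track_name: weight}; missing tracks get 1.0.
    Tracks are assumed independent given the state.
    """
    total = 0.0
    for track, symbol in observed.items():
        weight = 1.0 if track_weights is None else track_weights.get(track, 1.0)
        total += weight * math.log(state_emissions[track][symbol])
    return total

# Hypothetical 3-track state: amino acid, backbone, burial.
state = {
    "amino_acid": {"A": 0.5, "G": 0.5},
    "backbone": {"H": 0.25, "E": 0.75},
    "burial": {"buried": 0.5, "exposed": 0.5},
}
position = {"amino_acid": "A", "backbone": "H", "burial": "buried"}
print(multitrack_log_emission(state, position))
```

With unit weights this is the log of the product of the three per-track probabilities; down-weighting a predicted track would reduce its influence relative to the amino-acid track.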

SLIDE 30

Fragfinder

Medium-length fragments (9 long) for every position.
Generated from 3-track HMMs.
Residues filtered to remove improbable φ-ψ pairs (creating smaller fragments).
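The filtering step above, where removing a residue breaks a fragment into smaller pieces, can be sketched as follows. The plausibility test is left as a caller-supplied function because the slide does not say how improbable φ-ψ pairs are identified; the toy criterion in the example (reject positive φ) and the `min_length` parameter are purely illustrative assumptions.

```python
def split_fragment(residues, is_plausible, min_length=2):
    """Drop implausible residues and return the surviving sub-fragments.

    residues: list of (phi, psi) tuples for one fragment.
    is_plausible: function taking a (phi, psi) tuple, returning bool.
    min_length: pieces shorter than this are discarded (an assumption).
    """
    pieces, current = [], []
    for residue in residues:
        if is_plausible(residue):
            current.append(residue)
        else:
            # An implausible residue ends the current piece.
            if len(current) >= min_length:
                pieces.append(current)
            current = []
    if len(current) >= min_length:
        pieces.append(current)
    return pieces

# A 9-residue fragment whose middle residue fails the toy test.
fragment = [(-60, -45)] * 4 + [(60, -120)] + [(-120, 130)] * 4
pieces = split_fragment(fragment, lambda r: r[0] < 0)
print([len(p) for p in pieces])  # [4, 4]
```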

SLIDE 31

Best vs. Robetta best

[Figure: scatter plot of smooth GDT scores, SAM-T04 best vs. Robetta best. Labeled targets: 270, 240, 243, 201, 213, 197, 231, 207, 249_1, 268_2, 202, 262, 206, 227, 202_1, 264, 262_2.]

SLIDE 32

SAM-T04 auto vs. Robetta 1

[Figure: scatter plot of smooth GDT scores, SAM-T04 automatic vs. Robetta model1. Labeled targets: 270, 243, 280_1, 254, 249_1, 247_3, 231, 229_1, 222_2, 234, 209_2, 206, 251, 202_1, 262_2, 199_2, 228_2.]

SLIDE 33

Model 1 vs. SAM-T04 auto

[Figure: scatter plot of smooth GDT scores, SAM-T04 model1 vs. SAM-T04 automatic. Labeled targets: 228_2, 197, 199_2, 222_2, 209_2, 248_1, 234, 281, 243, 199_1, 280_1, 229, 207, 262_1, 208, 205, 268_2.]

SLIDE 34

Model 1 vs. alignment

[Figure: scatter plot of smooth GDT scores, SAM-T04 model1 vs. SAM-T04 first alignment. Labeled targets: 272_2, 262_2, 235_2, 209_2, 248_1, 249_1, 222_2, 264_2, 212, 201, 248_2, 207, 215, 281, 232, 251, 275, 277, 231, 233_1, 223, 223_1, 247_1, 268_2.]

SLIDE 35

Undertaker sidechains vs. Rosetta

[Figure: increase in all-atom rmsd from running Rosetta repacking. (Rosetta repack - undertaker) rmsd, ranging from -0.3 to 0.4, plotted against all-atom rmsd for undertaker (5 to 35). Average increase = 0.019.]

SLIDE 36

Undertaker sidechains vs. SCWRL

[Figure: increase in all-atom rmsd from running SCWRL repacking. (SCWRL - undertaker) rmsd, ranging from -0.15 to 0.3, plotted against all-atom rmsd for undertaker (5 to 35). Average increase = 0.008.]

SLIDE 37

Target T0197 (FR/H)

Robetta did surprisingly poorly for an FR/H model.
Our scores indicated more distant relationship, and meta-servers got wrong family.
SAM-T04’s secondary prediction better than SAM-T02’s.
We tried assembling sheets into various barrels, based on top few fold-recognition hits.
We used conserved residues, but not contact predictions.

SLIDE 38

Target T0197 (FR/H)

Real structure is red.

SLIDE 39

Target T0209_2 (NF)

Our best model was try15-opt2 (model3) (5.7115 Å all-atom RMSD).
Good, but final strand misregistered (off by 2).
Model is more complete than crystal.
Sheet constraints came from robetta-model1, which outperformed it.

SLIDE 40

Target T0209_2 (NF)

Real structure is red.

SLIDE 41

Target T0235_2 (FR/A)

43-residue inserted domain, not fully resolved in crystal.
We had made separate predictions for P347-P426, and had a good alignment to 1occJ, which we then messed up. We ended up not using the separate domain prediction.
Good score only because first and last helix constrained by surrounding domain.
We made last helix of domain too short, despite prediction that it was longer.

SLIDE 42

Target T0235_2 (FR/A)

Real structure is red.

SLIDE 43

Target T0248

Borrows heavily from robetta model2, which beats it.
