Supervised Convolutional GSN for Protein Secondary Structure

SLIDE 1

Supervised Convolutional GSN for Protein Secondary Structure Prediction

Jian Zhou Olga Troyanskaya

Princeton University

SLIDE 2

What's in this talk

  • Problem: predict protein secondary structure
  • Iterative prediction with a multi-layer hierarchical representation
    – Supervised GSN
    – Convolutional architecture for GSN
    – A trick for improving convergence and performance
  • Performance evaluations

SLIDE 3

Previous approaches: neural networks from 1988 (Qian & Sejnowski); bidirectional recurrent neural networks (Baldi et al., 1999); conditional neural fields (Peng et al., 2009); many more…

Protein secondary structure prediction

Protein sequence (20 types of amino acids):

MDLSALRVEEVQNVINAMQKILECP ICLELIKEPVSTKCDHIFCKFCMLKL LNQKKGPSQCPLCKNDITKRSLQE STRFSQLVEELLKIICAFQLDTGLEY ANSYNFAKKGK

Secondary structure (8 classes), predicted from the sequence:

CCGGGSSHHHHHHHHHHHHHHTS CSSSCCCCSSCCBCTTSCCCCSH HHHHHHHSSSSSCCCTTTSCCCC TTTCBCCCSSSHHHHHHHHHHHH HHHHTCCCCCC

[Figure: 3D structure. Image credit: Wikimedia Commons]

SLIDE 4

Protein Sequence -> Secondary Structure

[Figure labels: protein sequence (20 types of amino acids); evolutionary neighborhood; predict; secondary structure label sequence (8 classes); 3D structure]
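
The mapping from a sequence over 20 amino acids to a label sequence over 8 classes can be illustrated with a one-hot encoding. This is only a minimal sketch: the helper name `one_hot`, the ordering of both alphabets, and the use of `C` for coil are assumptions made for illustration, and the two short strings are just the first ten characters of the example strings shown in the talk.

```python
# Hypothetical helper and alphabet orderings, chosen only for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
SS8_CLASSES = "HGIEBTSC"               # 8 DSSP-style classes, C used for coil

def one_hot(sequence, alphabet):
    """Return one row per residue, with a 1 in the column of its symbol."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1
        rows.append(row)
    return rows

x = one_hot("MDLSALRVEE", AMINO_ACIDS)   # first 10 residues of the example sequence
y = one_hot("CCGGGSSHHH", SS8_CLASSES)   # first 10 labels of the example structure
print(len(x), len(x[0]), len(y), len(y[0]))   # 10 20 10 8
```

Each residue becomes a 20-channel input column and each label an 8-channel target column, which is the layout a convolution along the sequence axis would consume.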

SLIDE 5

Motivation

  • Challenge: prediction with both local and long-range dependencies
  • Plan:
    • Multi-layer hierarchical representation
    • Both 'upward' and 'downward' connections
    • Supervised GSN formulation

SLIDE 6

Model

  • Generative Stochastic Network (GSN)

$H_{t+1} \sim P_{\theta_1}(H \mid H_t, X_t)$
$X_{t+1} \sim P_{\theta_2}(X \mid H_{t+1})$

[Figure: Markov chain alternating between visible states $X_0, X_1, X_2$ and hidden states $H_0, H_1, H_2, H_3$]

A GSN learns the transition operators of a Markov chain whose stationary distribution estimates the data distribution $P(X)$.

By design, learning $P(X \mid H)$ can be much easier than learning $P(X)$. The model is trainable using back-propagation.

Bengio, Y., Thibodeau-Laufer, É., Alain, G., and Yosinski, J. Deep Generative Stochastic Networks Trainable by Backprop.
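
The alternating structure of the GSN chain can be sketched in a few lines of NumPy. This is a toy illustration rather than the model from the talk: the weights are random and untrained, and the layer sizes, the Gaussian noise, the tanh hidden layer, and the Bernoulli visible units are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
# Untrained, randomly initialized parameters (hypothetical shapes).
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))   # X -> H
U = rng.normal(scale=0.5, size=(n_hidden, n_hidden))    # H -> H
V = rng.normal(scale=0.5, size=(n_hidden, n_visible))   # H -> X

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h(h, x, noise_std=0.1):
    """Draw H_{t+1} ~ P(H | H_t, X_t): a tanh layer with injected noise."""
    pre = x @ W + h @ U + rng.normal(scale=noise_std, size=n_hidden)
    return np.tanh(pre)

def sample_x(h):
    """Draw X_{t+1} ~ P(X | H_{t+1}): independent Bernoulli units."""
    p = sigmoid(h @ V)
    return (rng.random(n_visible) < p).astype(float)

x = rng.integers(0, 2, size=n_visible).astype(float)   # X_0
h = np.zeros(n_hidden)                                  # H_0
chain = [x]
for t in range(5):
    h = sample_h(h, x)      # H_{t+1} given H_t and X_t
    x = sample_x(h)         # X_{t+1} given H_{t+1}
    chain.append(x)
print(np.array(chain).shape)   # (6, 6): X_0 plus five sampled states
```

In a trained GSN the parameters would be fitted by back-propagation so that the visible states of the chain follow the data distribution; here the loop only shows the order of the two sampling steps.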

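SLIDE 7

Model

GSN, modeling $P(X)$:

$H_{t+1} \sim P_{\theta_1}(H \mid H_t, X_t)$
$X_{t+1} \sim P_{\theta_2}(X \mid H_{t+1})$

Supervised GSN, modeling $P(Y \mid X)$:

$H_{t+1} \sim P_{\theta_1}(H \mid H_t, Y_t, X_0)$
$Y_{t+1} \sim P_{\theta_2}(Y \mid H_{t+1})$

[Figure: the GSN chain over $X_0, X_1, X_2$ and $H_0, \dots, H_3$, next to the supervised GSN chain over $Y, Y_1, Y_2$ and $H_0, \dots, H_3$ with the input $X_0$ held fixed]

Learning $P(Y \mid H)$ can be much easier than learning $P(Y \mid X)$, because it utilizes the previous state of the chain.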

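SLIDE 8

Model

Supervised GSN, modeling $P(Y \mid X)$:

$H_{t+1} \sim P_{\theta_1}(H \mid H_t, Y_t, X_0)$
$Y_{t+1} \sim P_{\theta_2}(Y \mid H_{t+1})$

Training maximizes the log-likelihoods $\log P_\theta(Y \mid H_1)$, $\log P_\theta(Y \mid H_2)$, … of the true label $Y$, which is drawn from the true $P(Y \mid X_0)$.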
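
The objective on this slide can be sketched as a sum of per-step log-likelihoods along the supervised chain. This is a minimal sketch for a single residue under stated assumptions: the weight names, layer sizes, Gaussian noise, tanh hidden layer, and softmax output are illustrative choices, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes, n_steps = 20, 16, 8, 3
# Hypothetical, untrained parameters.
W_x = rng.normal(scale=0.1, size=(n_in, n_hidden))         # X_0 -> H
W_y = rng.normal(scale=0.1, size=(n_classes, n_hidden))    # Y_t -> H
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))     # H_t -> H
W_out = rng.normal(scale=0.1, size=(n_hidden, n_classes))  # H -> Y

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def chain_log_likelihood(x0, y_true, y_init):
    """Sum of log P(Y = y_true | H_t) over the steps of the chain."""
    h = np.zeros(n_hidden)
    y_t = y_init
    total = 0.0
    for _ in range(n_steps):
        noise = rng.normal(scale=0.1, size=n_hidden)
        h = np.tanh(x0 @ W_x + y_t @ W_y + h @ W_h + noise)   # H_{t+1}
        p_y = softmax(h @ W_out)                               # P(Y | H_{t+1})
        total += np.log(p_y[y_true])                           # true-label term
        y_t = np.eye(n_classes)[rng.choice(n_classes, p=p_y)]  # sample Y_{t+1}
    return total

x0 = np.eye(n_in)[5]                        # one-hot input, fixed along the chain
y_init = np.full(n_classes, 1.0 / n_classes)
ll = chain_log_likelihood(x0, y_true=3, y_init=y_init)
print(ll < 0)   # True: each term is the log of a probability below 1
```

Training would adjust the parameters to increase this quantity; the input $X_0$ stays fixed while only $H$ and $Y$ evolve.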

SLIDE 9

Architecture for protein secondary structure prediction

Multi-scale representation – multi-layer convolutional architecture
Local information sensitive – output unit at bottom layer

[Figure: model unrolled over iterations $Y_1, Y_2, Y_3$ with hidden states $H_1, H_2, H_3$ and weights W1, W1', W2, W2' repeated at each step. Layer diagram: $X$ and $Y$ connect to $H_0$ by convolution (tanh), followed by mean pooling and a second convolution (tanh) to $H_1$]
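
The Conv, Pool, Conv stack named on this slide can be sketched as a single upward pass in NumPy. Filter widths, channel counts, pooling size, and the helper names `conv1d_same` and `mean_pool` are assumptions for illustration; the downward connections (W1', W2') and the noise of the actual model are omitted.

```python
import numpy as np

def conv1d_same(x, w):
    """Zero-padded 1-D convolution along the sequence axis.
    x: (length, in_ch); w: (width, in_ch, out_ch) with odd width."""
    width = w.shape[0]
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for i in range(x.shape[0]):
        window = xp[i:i + width]                      # (width, in_ch)
        out[i] = np.tensordot(window, w, axes=([0, 1], [0, 1]))
    return out

def mean_pool(h, size):
    """Non-overlapping mean pooling along the sequence axis."""
    length = (h.shape[0] // size) * size
    return h[:length].reshape(-1, size, h.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
x = rng.random((12, 20))                              # 12 residues, 20 input channels
y = rng.random((12, 8))                               # current label estimate, 8 classes
W1_x = rng.normal(scale=0.1, size=(5, 20, 16))        # X -> H0 filters
W1_y = rng.normal(scale=0.1, size=(5, 8, 16))         # Y -> H0 filters
W2 = rng.normal(scale=0.1, size=(3, 16, 32))          # pooled H0 -> H1 filters

h0 = np.tanh(conv1d_same(x, W1_x) + conv1d_same(y, W1_y))   # Conv + tanh
pooled = mean_pool(h0, 2)                                   # mean pooling
h1 = np.tanh(conv1d_same(pooled, W2))                       # Conv + tanh
print(h0.shape, pooled.shape, h1.shape)   # (12, 16) (6, 16) (6, 32)
```

Pooling halves the sequence resolution, so units in the upper layer see a wider stretch of the protein, while the output stays attached to the full-resolution bottom layer.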

SLIDE 10

Training

Experiments on initialization of the chain during training

For a subset of training batches, the chain is initialized at a specified test initialization value $Y_0^{test}$ instead of the true label $Y_0^{true}$.

[Figure: accuracy vs. number of iterations, with curves for 0%, 20%, 50%, 80% and 100% of training batches using test initialization]

  • Optimal performance at 50% test initialization
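
One way to read the initialization trick is as a per-batch random choice of the chain's starting label. The sketch below assumes exactly that; the function name `choose_initial_labels`, the string placeholders, and the per-batch coin flip are hypothetical, and the talk does not specify what the test initialization value is.

```python
import random

def choose_initial_labels(true_labels, test_init_labels, test_init_fraction, rng):
    """Pick Y_0 for one training batch.

    With probability test_init_fraction the chain starts from the value
    used at test time; otherwise it starts from the true labels.
    """
    if rng.random() < test_init_fraction:
        return test_init_labels
    return true_labels

rng = random.Random(0)
picks = [choose_initial_labels("true", "test", 0.5, rng) for _ in range(10000)]
share = picks.count("test") / len(picks)
print(round(share, 1))   # about 0.5
```

Setting the fraction to 0.0 or 1.0 recovers the two extremes from the experiment: always starting from the true labels, or always from the test-time value.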

SLIDE 11

Performance

CullPDB-30 test set    Overall accuracy (8-class)
1 layer                0.714 ± 0.006
2 layers               0.720 ± 0.006
3 layers               0.721 ± 0.006

CB513 dataset          Overall accuracy (8-class)
RaptorSS8/CNF          0.649 ± 0.003
Our method             0.664 ± 0.005

Cull PDB dataset (6133 proteins with <30% identity between any protein pairs); available at www.princeton.edu/~jzthree/datasets

[Figure: single-protein prediction example, showing performance through averaging iterative predictions at $Y_1$, $Y_2$, $Y_4$, $Y_8$, $Y_{16}$, $Y_{32}$, alongside the label]
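
Averaging iterative predictions and scoring per-residue accuracy can be sketched as follows. The probabilities are hand-made toy values with 3 residues and 2 classes, not results from the talk, and the function names are assumptions.

```python
import numpy as np

def average_predictions(step_probs):
    """step_probs: (n_steps, length, n_classes) -> (length, n_classes)."""
    return np.mean(step_probs, axis=0)

def per_residue_accuracy(probs, labels):
    """Fraction of residues whose most probable class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

# Toy class probabilities from two chain iterations.
step_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],   # iteration 1
    [[0.7, 0.3], [0.8, 0.2], [0.4, 0.6]],   # iteration 2
])
labels = np.array([0, 0, 1])

single = per_residue_accuracy(step_probs[0], labels)
averaged = per_residue_accuracy(average_predictions(step_probs), labels)
print(round(single, 3), round(averaged, 3))   # 0.667 1.0
```

In this toy case the second residue is wrong in the first iteration alone, and averaging with the second iteration corrects it.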

SLIDE 12

Summary

  • We developed a supervised convolutional GSN model for protein secondary structure prediction.
  • Supervised GSN
    – Stochastic iterative prediction through a Markov chain
    – The initialization trick improves both performance and convergence rate empirically
  • Convolutional architecture for supervised GSN
    – Combines high-level representation and local prediction
    – Improved over the previous best performance

SLIDE 13
  • Filters: Layer 1, $X, Y \leftrightarrow H_0$

[Figure: filter weights plotted by position and channel for $W_{X \to H_0}$ (amino acids), $W_{Y \to H_0}$ (secondary structure) and $W_{H_0 \to Y}$ (secondary structure)]