Nonparametric combinatorial sequence models
Fabian L. Wauthier, UC Berkeley with Nebojsa Jojic (MSR) and Michael I. Jordan (UCB) 30th March, 2011
Fabian L. Wauthier: Nonparametric combinatorial sequence models, 1
Biological motivation:
◮ Suppose we are given aligned sequences.
◮ Interest in understanding sequence variability:
◮ Many simplifying assumptions in previous work:
Our interest: sequences where these assumptions do not hold
◮ Partial, long-range site dependencies
Freeman and Company, 2007
◮ MHC I proteins present peptide chains to T-cell receptors.
◮ Peptides originating from virus protein ⇒ destruction of cell.
◮ Variability: duplication + mutation + fitness pressure.
Our Interest: model sequence variability, not its origins.
Freeman and Company, 2007
◮ Binding site decomposes into pockets (Sidney et al., 2008)
Expect partial site linkage.
◮ Variability due to evolutionary pressure on 3D binding site.
Variable sites are discontiguous ⇒ long-range dependencies.
Main idea: Each sequence is composed of smaller components.
Cf. Probabilistic index map (Jojic and Caspi, CVPR 2004; Jojic et al., UAI 2004)
Do not know how many site groups/PSSMs there are!
◮ Our approach: put a prior distribution on these unknowns
◮ Our model: A Chinese Restaurant Franchise (CRF)
conditioned on a Chinese Restaurant Process (CRP)
among sequences.
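◮ First customer sits at the first table
◮ Subsequent customers
  ◮ sit at an occupied table with probability proportional to the number of customers sitting at it,
  ◮ or open a new table with probability proportional to a concentration parameter.
◮ Key point: The number of tables is random and inferred.
◮ For us: Infer the number of site groups.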
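The seating rule above can be sketched in a few lines of Python. This is a minimal illustration of a standard Chinese Restaurant Process, not the authors' code: the function name `sample_crp` and the concentration parameter `alpha` (the weight given to opening a new table) are assumptions introduced here.

```python
import random

def sample_crp(num_customers, alpha, rng):
    """Seat customers one at a time; return the table index of each customer."""
    table_sizes = []  # table_sizes[t] = number of customers already at table t
    seating = []
    for _ in range(num_customers):
        # An occupied table t has weight table_sizes[t];
        # the (hypothetical) parameter alpha is the weight of a new table.
        weights = table_sizes + [alpha]
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(table_sizes):
            table_sizes.append(0)  # open a new table
        table_sizes[t] += 1
        seating.append(t)
    return seating

rng = random.Random(0)
seating = sample_crp(7, alpha=1.0, rng=rng)  # e.g. 7 sites, as in a short alignment
num_tables = len(set(seating))               # random, not fixed in advance
```

Reading customers as sites and tables as site groups, `num_tables` plays the role of the number of site groups: it is an outcome of the process rather than an input.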
◮ CRF = “multiple coupled CRPs”
◮ One restaurant per dataset. Customers seated by CRP rules.
◮ Global menu of “dishes”
◮ Each newly opened table assigned a dish
  ◮ assigned a dish with probability proportional to the number of past tables that were assigned that dish,
  ◮ or with small probability assigned a new dish at random.
◮ Key point: The number of distinct dishes and sharing pattern is random and inferred.
◮ For us: Infer the number of PSSMs and sharing pattern.
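The table and dish rules above can be sketched as follows. This is a minimal illustration of a generic Chinese Restaurant Franchise, not the authors' implementation: the function name `sample_crf`, the parameters `alpha` (weight of a new table) and `gamma` (weight of a new dish), and the restaurant sizes are assumptions made for the example.

```python
import random

def sample_crf(customers_per_restaurant, alpha, gamma, rng):
    """Return the dish eaten by each customer in each restaurant,
    plus the number of tables serving each dish across all restaurants."""
    dish_table_counts = []  # shared global menu: tables per dish
    all_dishes = []
    for num_customers in customers_per_restaurant:
        table_sizes = []    # customers per table in this restaurant
        table_dish = []     # dish served at each table in this restaurant
        dishes = []
        for _ in range(num_customers):
            # Seat the customer by the CRP rule within this restaurant.
            weights = table_sizes + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_sizes):
                # Newly opened table: choose a dish in proportion to the number
                # of past tables with that dish, or a new dish with weight gamma.
                dish_weights = dish_table_counts + [gamma]
                d = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if d == len(dish_table_counts):
                    dish_table_counts.append(0)
                dish_table_counts[d] += 1
                table_sizes.append(0)
                table_dish.append(d)
            table_sizes[t] += 1
            dishes.append(table_dish[t])
        all_dishes.append(dishes)
    return all_dishes, dish_table_counts

rng = random.Random(1)
dishes, dish_table_counts = sample_crf([5, 5, 5], alpha=1.0, gamma=1.0, rng=rng)
num_dishes = len(dish_table_counts)  # random number of distinct dishes
```

Because the menu is shared, the same dish index can appear in several restaurants. That sharing is what allows a parameter set (a PSSM, in the talk's reading) to be reused.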
[Figure, shown in incremental builds: sequences s1, s2, s3 over sites 1 2 3 4 5 6 7, with the sites also shown in the order 1 4 3 5 7 2 6. Labels: "CRP: site groups = “linkage”", "CRF: secondary site grouping", "Parameters = PSSMs". In the final builds the letters A C A T A C C are filled in one at a time next to s3.]
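One way to read the figure is as a generative recipe: sites are partitioned into groups, each sequence picks one PSSM per group, and letters are drawn site by site from the chosen PSSMs. The sketch below illustrates that reading under loud assumptions. The split of sites into groups `[1, 4, 3]`, `[5, 7]`, `[2, 6]` (written 0-based in the code), the toy alphabet `ACGT`, the helper names, and the degenerate one-hot PSSM columns are all hypothetical and chosen only so the result is easy to check by hand.

```python
import random

ALPHABET = "ACGT"  # toy alphabet, assumed for illustration only

def one_hot(letter):
    """A degenerate PSSM column that always emits `letter`."""
    return [1.0 if a == letter else 0.0 for a in ALPHABET]

def sample_sequence(site_groups, pssms, choice, rng):
    """site_groups[g]: site indices in group g (0-based).
    pssms[g][k]: k-th candidate PSSM for group g, one column per site.
    choice[g]: which candidate PSSM this sequence uses for group g."""
    length = sum(len(sites) for sites in site_groups)
    seq = [None] * length
    for g, sites in enumerate(site_groups):
        pssm = pssms[g][choice[g]]
        for site, column in zip(sites, pssm):
            seq[site] = rng.choices(ALPHABET, weights=column)[0]
    return "".join(seq)

# Hypothetical grouping of the seven sites (0-based indices).
site_groups = [[0, 3, 2], [4, 6], [1, 5]]
# Two hypothetical candidate PSSMs per group.
pssms = [
    [[one_hot("A"), one_hot("T"), one_hot("A")],
     [one_hot("G"), one_hot("G"), one_hot("G")]],
    [[one_hot("A"), one_hot("C")],
     [one_hot("T"), one_hot("T")]],
    [[one_hot("C"), one_hot("C")],
     [one_hot("G"), one_hot("A")]],
]
rng = random.Random(0)
s3 = sample_sequence(site_groups, pssms, choice=[0, 0, 0], rng=rng)
variant = sample_sequence(site_groups, pssms, choice=[1, 0, 0], rng=rng)
```

With these one-hot columns, `s3` reproduces the letters in the figure, and `variant` differs only at the sites of the first group. This shows how swapping one component changes a discontiguous set of positions.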
◮ Inference algorithm: collapsed Gibbs sampler.
◮ Varying hyperparameters varies posterior model “complexity.”
◮ Given posterior complexity, compare average model likelihood with that of a mixture model with similar complexity.
◮ Look at three datasets: blue = our model, red = mixture.
[Figure: three panels (MHC I, Flu, KIR), each plotting log-likelihood (Loglik) against model complexity (Complexity).]
◮ Let binary vector $m_{ik}$ encode latent variables of sequence $s_i$ for posterior sample $k$.
◮ Similar sequences have similar encodings; can use $m_{i\cdot}$ to share phenotype information.
◮ Example: binding affinities of MHC I proteins
$$y_{ij} = p_j^\top \Theta_k m_{ik} = \operatorname{trace}(\Theta_k m_{ik} p_j^\top).$$
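The two forms of the bilinear predictor above are equal because the trace is invariant under cyclic permutation. The trace form also makes explicit that the prediction is linear in the entries of Θ, with the outer product of m and p acting as the feature matrix. A small numerical check, using made-up integer values for Θ, m and p (none of them from the talk) and hypothetical helper names:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Made-up example: Theta maps a 3-dim binary encoding m to a 2-dim space,
# where it is compared with a 2-dim vector p.
Theta = [[1, 2, 0],
         [0, -1, 3]]
m = [1, 0, 1]   # binary latent encoding of one sequence
p = [2, 1]      # vector for one peptide (illustrative values)

bilinear = dot(p, matvec(Theta, m))                # p^T Theta m
trace_form = trace(matmul(Theta, outer(m, p)))     # trace(Theta m p^T)
```

Both expressions evaluate to the same scalar, here 5.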
◮ Information transfer

Method          AUC
Independent     0.8290
Transfer only   0.7285
Random          0.5

◮ Compare with state of the art (SOA) (Peters et al., 2006):

Method          AUC
MAP             0.8378
Averaging       0.8911
SOA             0.85–0.91

◮ Similar performance, but use only limited information: no spatial proximity, chemical properties, interaction features, nonlinearities.
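For readers unfamiliar with the metric in the tables above, AUC can be computed as the fraction of (positive, negative) pairs that the predictor ranks correctly, counting ties as one half. The sketch below uses made-up scores and labels. How the talk turns binding affinities into binary labels is not stated here, so the 0/1 labels are an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up predictions for four examples (two binders, two non-binders).
example_auc = auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
# A predictor that gives every example the same score carries no ranking
# information and lands at the 0.5 baseline.
constant_auc = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

In this toy case three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75, while `constant_auc` is 0.5.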