


  1. Selective Integration of Multiple Biological Data for Supervised Network Inference. Koji Tsuda, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. Joint work with Tsuyoshi Kato and Kiyoshi Asai. DIMACS, August 2005.

  2. Biological Networks
  • Physical interaction network
    – Edge ⇔ two proteins physically interact (e.g., docking)
  • Metabolic networks of enzymes
    – Edge ⇔ two enzymes catalyzing successive reactions
  • Gene regulatory networks
  • Large graphs with sparse connections
    – 1,000–10,000 nodes
    – 10,000–100,000 edges

  3. Physical Interaction Network (figure)

  4. Metabolic Network (figure)

  5. Statistical Inference of Networks
  • Infer the network from data about proteins (gene expression, phylogenetic profiles, etc.)
  • We propose a kernel-based inference method
    – 1. Supervised inference: learning from data and a training network
    – 2. Weighted combination of multiple data: identify unnecessary data that do not contribute to network inference

  6. Unsupervised vs. Supervised Inference
  • Unsupervised network inference
    – Bayesian networks (Friedman et al., 2000)
    – Infer every edge from scratch (no known edges)
  • Supervised network inference
    – A part of the network is known (the training network)
    – Infer the rest of the network from the data and the training network
    – Kernel CCA (Yamanishi et al., ISMB, 2004)

  7. Supervised Network Inference (figure: a training network plus extra nodes whose edges are to be inferred)

  8. Single Data vs. Multiple Data
  • Multiple data for inferring networks
    – Gene expression profiles
    – Subcellular locations
    – Phylogenetic profiles
  • Identify relevant data for inference
  • Weighted integration of multiple data: from feature selection to data selection
  • Kernel CCA has no mechanism for data selection

  9. Inferring a Network from Multiple Data (figure: data sources such as gene expression, functional category, phylogenetic profile, subcellular localization, and 3D structure feeding into networks such as physical interaction, metabolic, gene interaction, and gene regulatory networks)

  10. Outline
  • Network inference from a kernel matrix (unsupervised, single data)
    – Thresholding: nearest-neighbor connection
  • Incorporating the training network (supervised, single data)
    – Kernel matrix completion (Tsuda et al., 2003)
  • Weighted integration of multiple data (supervised, multiple data)
    – Weights determined by the EM algorithm

  11. Unsupervised, Single Data
  • Convert the data to a kernel matrix (similarity among proteins)
    – Gene expression: Pearson correlation
    – Phylogenetic profile: tree kernel (Vert, 2002)
    – 3D structure: graph kernel (Borgwardt et al., 2005)
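As a sketch of the first conversion above (not from the slides), a Pearson-correlation kernel can be computed directly from an expression matrix with one row per gene; the toy data are an illustrative assumption:

```python
import numpy as np

# Toy expression matrix: 4 genes x 5 conditions (illustrative values).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with gene 0
    [1.0, 0.0, 2.0, 0.0, 1.0],
])

# Pearson correlation between every pair of rows (genes) gives a
# symmetric similarity ("kernel") matrix with 1s on the diagonal.
K = np.corrcoef(expr)
print(K[0, 1])  # 1.0: genes 0 and 1 are perfectly correlated
```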

  12. Construct the Network by Thresholding
  • Establish an edge wherever the kernel value exceeds a threshold (figure: networks obtained at t = 0.1, 0.2, 0.4)
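A minimal sketch of this thresholding step; the 3×3 kernel is an illustrative assumption, and the thresholds match the slide's t values:

```python
import numpy as np

def kernel_to_network(K, t):
    """Connect nodes i, j (i != j) wherever the kernel value exceeds t."""
    A = (K > t).astype(int)
    np.fill_diagonal(A, 0)   # no self-loops
    return A

K = np.array([[1.00, 0.30, 0.05],
              [0.30, 1.00, 0.15],
              [0.05, 0.15, 1.00]])
print(kernel_to_network(K, 0.1))  # edges (0,1) and (1,2) survive; (0,2) does not
```

Raising t prunes the network: at t = 0.4 no off-diagonal value survives and the graph is empty.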

  13. Supervised, Single Data
  • Known training network (only for the first n nodes)
  • Data about all proteins (kernel matrix)

  14. Incomplete Kernel Matrix from the Training Network
  • Convert the training graph to a kernel matrix, synchronizing the representations
  • Diffusion kernel (Kondor and Lafferty, 2002): measures closeness of nodes by random walks
  • Thresholding approximately recovers the original network

  15. Computation of the Diffusion Kernel
  • A: adjacency matrix; D: diagonal matrix of degrees
  • L = D − A: graph Laplacian matrix
  • Diffusion kernel matrix K = exp(−βL), where β is the diffusion parameter
  • Characterizes closeness among nodes
  • Often used with SVMs (Lanckriet et al., PSB 2004)
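A minimal numpy sketch of K = exp(−βL): since L is symmetric, the matrix exponential can be taken through its eigendecomposition. The 3-node path graph is an illustrative assumption, not from the slides:

```python
import numpy as np

def diffusion_kernel(A, beta):
    """K = exp(-beta * L) for the Laplacian L = D - A, via eigendecomposition."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    w, U = np.linalg.eigh(L)          # L is symmetric, so eigh applies
    return U @ np.diag(np.exp(-beta * w)) @ U.T

# Path graph 0 - 1 - 2: the one-hop kernel value should exceed the
# two-hop value, reflecting closeness on the graph.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
K = diffusion_kernel(A, beta=0.3)
print(K[0, 1] > K[0, 2])  # True: one hop beats two hops
```

At β = 0 the kernel degenerates to the identity matrix, matching the values shown on slide 18.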

  16. Adjacency Matrix and Degree Matrix (figure)

  17. Graph Laplacian Matrix L (figure)
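A small sketch of how A, D, and L fit together, using an illustrative toy graph (not the one in the figure):

```python
import numpy as np

def laplacian(edges, n):
    """Build adjacency A, degree D, and Laplacian L = D - A for an
    undirected graph on n nodes given as a list of (i, j) edges."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    D = np.diag(A.sum(axis=1))
    return A, D, D - A

# Toy 4-node graph (illustrative): a triangle 0-1-2 plus pendant node 3.
A, D, L = laplacian([(0, 1), (1, 2), (0, 2), (2, 3)], n=4)
print(L.sum())  # 0: each Laplacian row sums to zero
```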

  18. Actual Values of Diffusion Kernels (figure: kernel values on a small graph at β = 0, 0.15, and 0.3, illustrating closeness from the "central node"; at β = 0 the kernel is the identity)

  19. Kernel Matrix Completion
  • P: kernel matrix of the data
  • Q: incomplete kernel matrix,
      Q = [ K_I     Q_vh
            Q_vh^T  Q_hh ]
  • Missing values estimated by minimizing the KL divergence w.r.t. Q_vh, Q_hh:
      KL(Q, P) = (1/2) tr(P^{-1} Q) − (1/2) log det(P^{-1} Q) − l/2
  • Closed-form solution Q*
  • Threshold Q* to obtain the network
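The KL divergence above (between zero-mean Gaussians with covariances Q and P) can be evaluated directly; the 2×2 matrices below are an illustrative assumption:

```python
import numpy as np

def kl_kernel(Q, P):
    """KL(Q, P) = 1/2 tr(P^-1 Q) - 1/2 logdet(P^-1 Q) - l/2, the KL
    divergence between zero-mean Gaussians with covariances Q and P."""
    l = Q.shape[0]
    M = np.linalg.solve(P, Q)              # P^-1 Q without an explicit inverse
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * np.trace(M) - 0.5 * logdet - 0.5 * l

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(kl_kernel(P, P))   # ~0.0: the divergence vanishes when Q == P
```

The divergence is nonnegative and zero only at Q = P, which is what makes it a sensible completion objective.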

  20. Supervised, Multiple Data
  • Known training network (diffusion kernel matrix)
  • Multiple data about all proteins (kernel matrices)

  21. Overview of Our Approach (figure: the training network's adjacency matrix is converted into a diffusion kernel, giving the incomplete matrix Q; the data kernel matrices are combined with weights into P(b); Q is completed against P(b) and thresholded to yield the result)

  22. Notations
  • Incomplete kernel matrix:
      Q = [ K_I     Q_vh
            Q_vh^T  Q_hh ]
  • Weighted combination of kernel matrices:
      P(b) = b_1 K_1 + b_2 K_2 + b_3 K_3 + b_4 K_4, or in general
      P(b) = Σ_{i=1}^{n_k} b_i K_i + σ^2 I
  • Unknowns: submatrices Q_vh, Q_hh and weights b
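The weighted combination P(b) = Σ_i b_i K_i + σ²I is straightforward to compute; the kernels, weights, and σ² below are illustrative assumptions:

```python
import numpy as np

def combine(kernels, b, sigma2):
    """P(b) = sum_i b_i K_i + sigma^2 I."""
    n = kernels[0].shape[0]
    return sum(bi * Ki for bi, Ki in zip(b, kernels)) + sigma2 * np.eye(n)

K1 = np.eye(3)            # illustrative kernel matrices
K2 = np.ones((3, 3))
P = combine([K1, K2], b=[0.5, 0.25], sigma2=0.1)
# P[0, 0] = 0.5 + 0.25 + 0.1, P[0, 1] = 0.25: the weights b scale each
# kernel's contribution, and sigma^2 adds a ridge on the diagonal.
```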

  23. Objective Function
  • KL divergence:
      KL(Q, P(b)) = (1/2) tr(P(b)^{-1} Q) − (1/2) log det(P(b)^{-1} Q) − l/2
  • Minimize w.r.t. the submatrices Q_vh, Q_hh and the weights b
  • Solved by the EM algorithm

  24. EM Algorithm
  • Repeat the following two steps:
    1. E-step: minimize KL(Q, P(b)) w.r.t. Q_vh, Q_hh
    2. M-step: minimize KL(Q, P(b)) w.r.t. b
  • The E-step is the same as in the single-kernel case
  • The M-step cannot be solved in closed form

  25. EM Algorithm for Extended Matrices
  • Extended kernel matrices:
      Q~ = [ Q       Q_xz        R(b) = [ P(b)  Λ
             Q_xz^T  Q_zz ],              Λ^T   σ^2 I ]
      where Q_xz ∈ R^{l×nl}, Q_zz ∈ R^{nl×nl}, and Λ = [Λ_1, …, Λ_{n_k}] with K_i = Λ_i Λ_i^T
  • The solution of the following problem is also optimal in the original problem:
      min_{Q_vh, Q_hh, Q_xz, Q_zz, b} KL(Q~, R(b))

  26. Solutions of the Steps
  • E-step (closed form):
      Q_vh = K_I P_vv^{-1} P_vh
      Q_hh = P_hh − P_vh^T P_vv^{-1} P_vh + P_vh^T P_vv^{-1} K_I P_vv^{-1} P_vh
      with Q_xz and Q_zz obtained analogously from the posterior moments of the latent variables z (in terms of Λ, σ^2, and Q)
  • M-step:
      b_k = (1/N) Σ_{j=(k−1)N+1}^{kN} [Q_zz]_{jj}
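A sketch of the closed-form blocks Q_vh and Q_hh above; the random, well-conditioned P is an illustrative assumption, used only for the sanity check that when K_I equals P's visible block, the completion reproduces P's own blocks:

```python
import numpy as np

def complete_kernel(K_I, P, n_v):
    """Closed-form E-step blocks: Q_vh = K_I P_vv^{-1} P_vh and
    Q_hh = P_hh - P_vh^T P_vv^{-1} P_vh + P_vh^T P_vv^{-1} K_I P_vv^{-1} P_vh,
    where P is split at n_v visible nodes."""
    P_vv, P_vh, P_hh = P[:n_v, :n_v], P[:n_v, n_v:], P[n_v:, n_v:]
    W = np.linalg.solve(P_vv, P_vh)          # P_vv^{-1} P_vh
    Q_vh = K_I @ W
    Q_hh = P_hh - P_vh.T @ W + W.T @ K_I @ W
    return Q_vh, Q_hh

# Sanity check: if K_I == P_vv, the completion must return P's own blocks.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 5))
P = X @ X.T + 5 * np.eye(5)                  # a well-conditioned kernel matrix
Q_vh, Q_hh = complete_kernel(P[:3, :3], P, n_v=3)
print(np.allclose(Q_vh, P[:3, 3:]), np.allclose(Q_hh, P[3:, 3:]))  # True True
```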

  27. Edge Prediction Experiments
  • Networks: metabolic network (KEGG); protein interaction network (von Mering et al., 2002)
  • Data: exp (gene expression); y2h (interaction net by yeast two-hybrid); loc (subcellular location); phy (phylogenetic profile); rnd1, …, rnd4 (random noise)
  • Methods: Q (proposed method); P (simple combination of kernel matrices); cca (kernel CCA, without the noise kernels)
  • Evaluation: ROC score of edge prediction accuracy (10-fold cross-validation)
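The ROC score used for evaluation can be computed with the rank-sum (Mann-Whitney) formula; this sketch ignores tied scores, and the edge scores and labels are illustrative assumptions:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC score (AUC): the probability that a random true edge outscores
    a random non-edge, via the rank-sum formula (ties not handled)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # ascending ranks, 1-based
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Completed-kernel values for candidate edges vs. whether the edge exists.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect ranking
```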

  28. Metabolic Network
  • Made from the LIGAND database (KEGG) (Vert and Kanehisa, NIPS, 2003)
  • Connect enzymes of two successive reactions
  • 769 nodes, 3,702 edges

  29. Interaction Network (von Mering et al., Nature, 2002)
  • Medium confidence: interactions validated by multiple experiments
    – High-throughput yeast two-hybrid
    – Correlated mRNA expression
    – Genetic interaction
    – Tandem affinity purification
    – High-throughput mass-spectrometric protein complex identification
  • 984 nodes, 2,438 edges

  30. Dataset Details
  • Metabolic net: http://www.genome.jp/kegg/
  • Interaction: von Mering et al., Nature, 417, 399–403, 2002
  • Expression: Spellman et al., MBC, 9, 3273–3297, 1998; Eisen et al., PNAS, 95, 14863–8, 1998
  • Y2H: Ito et al., PNAS, 98, 4569–74, 2001; Uetz et al., Nature, 403, 623–7, 2000
  • Subcellular location: Huh et al., Nature, 425, 686–91, 2003
  • Phylogenetic profile: http://www.genome.jp/kegg/

  31. Metabolic Network (results figure)

  32. Physical Interaction Network (results figure)

  33. Introduce More Random Matrices (metabolic network; figure: sensitivity at 95% specificity)

  34. Summary of Experiments
  • Simple combination (P) < completed matrix (Q): the training network is essential
  • Data selection did not improve accuracy; accuracy comparable to kernel CCA
  • Automatic selection of datasets: the 4 noise kernel matrices were removed

  35. Conclusion
  • Supervised inference of networks
    – Part of the network is known
    – Selection from multiple data
    – Formulation as a kernel completion problem
    – Validation experiments on metabolic and interaction networks
  • Future work
    – Biological interpretation of the selection results
    – Applications to non-biological data
  T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488–2495, 2005.

  36. Experiments (figure)
