SLIDE 1

2005/8 DIMACS

Selective Integration of Multiple Biological Data for Supervised Network Inference

Koji Tsuda

National Institute for Advanced Industrial Science and Technology (AIST), Tokyo, Japan

Joint work with Tsuyoshi Kato and Kiyoshi Asai

SLIDE 2

Biological Networks

  • Physical Interaction network

– Edge ⇔ Two proteins physically interact (e.g. docking)

  • Metabolic networks of enzymes

– Edge ⇔ Two enzymes catalyzing successive reactions

  • Gene regulatory networks
  • Large graphs with sparse connections

– 1,000–10,000 nodes
– 10,000–100,000 edges

SLIDE 3

Physical Interaction Network

SLIDE 4

Metabolic Network

SLIDE 5

Statistical Inference of Networks

  • Infer the network from data about proteins

– Gene expressions, phylogenetic profiles, etc.

  • We propose a kernel-based inference method

– 1. Supervised inference

  • Learning from data and a training network

– 2. Weighted combination of multiple data

  • Identifies unnecessary data that do not contribute to
network inference

SLIDE 6

Unsupervised vs. Supervised Inference

  • Unsupervised network inference

– Bayesian network (Friedman et al., 2000)
– Infer every edge from scratch (no known edges)

  • Supervised network inference

– A part of the network is known (training network)
– Infer the rest of the network from data and the training net
– Kernel CCA (Yamanishi et al., ISMB, 2004)

SLIDE 7

Supervised Network Inference

[Figure: the known training network, plus extra nodes whose edges are to be inferred]

SLIDE 8

Single Data vs. Multiple Data

  • Multiple data for inferring networks

– Gene expression profiles
– Subcellular locations
– Phylogenetic profiles

  • Identify relevant data for inference
  • Weighted integration of multiple data!

– From feature selection to data selection

  • Kernel CCA: no mechanism for data selection
SLIDE 9

Inferring a Network from Multiple Data

[Diagram: data sources (functional category, gene expression, phylogenetic profile, subcellular localization, 3D structure) used to infer networks (metabolic network, physical interaction, gene interaction, gene regulatory net)]

SLIDE 10

Outline

  • Network Inference from a kernel matrix

– Unsupervised, Single Data
– Thresholding: nearest-neighbor connection

  • Incorporating the training network

– Supervised, Single Data
– Kernel Matrix Completion (Tsuda et al., 2003)

  • Weighted integration of multiple data

– Supervised, Multiple Data
– Weights determined by the EM algorithm

SLIDE 11

Unsupervised, Single Data

  • Convert the data to a kernel matrix

– Similarity among proteins
– Gene expression: Pearson correlation
– Phylogenetic profile: tree kernel (Vert, 2002)
– 3D structure: graph kernel (Borgwardt et al., 2005)

SLIDE 12

Construct the network by thresholding

  • Establish an edge wherever the kernel value exceeds a threshold t

[Figure: networks obtained at t = 0.1, t = 0.2, t = 0.4]
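As a concrete illustration, the thresholding step can be sketched in a few lines of Python (the similarity matrix `K` and the threshold value below are made-up toy data, not taken from the experiments):

```python
import numpy as np

def threshold_network(K, t):
    """Connect every pair of nodes whose kernel value exceeds threshold t.

    K: symmetric (n x n) kernel/similarity matrix.
    Returns a binary adjacency matrix with an empty diagonal.
    """
    A = (K > t).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

# toy similarity matrix for three proteins
K = np.array([[1.00, 0.30, 0.05],
              [0.30, 1.00, 0.15],
              [0.05, 0.15, 1.00]])
print(threshold_network(K, 0.1))  # edges where similarity > 0.1
```

Raising t from 0.1 to 0.2 would drop the weakest edge, which is exactly how the three example networks on the slide differ.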

SLIDE 13

Supervised, Single Data

  • Data about all proteins (kernel matrix)
  • Known training network (only for the first n nodes)

SLIDE 14

Incomplete kernel matrix from training network

  • Convert the training graph to a kernel matrix
  • Synchronizing the representation
  • Diffusion kernel (Kondor and Lafferty, 2002)

– Measures closeness of nodes by random walks

*Thresholding approximately recovers the original network

SLIDE 15

Computation of Diffusion Kernel

  • A: Adjacency matrix
  • D: Diagonal matrix of degrees
  • L = D − A: Graph Laplacian matrix
  • Diffusion kernel matrix: K = exp(−βL)

– β: diffusion parameter

  • Characterizes closeness among nodes
  • Often used with SVMs (Lanckriet et al., PSB 2004)
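The construction on this slide can be sketched directly with `scipy.linalg.expm` for the matrix exponential (the three-node path graph is an assumed toy example):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta):
    """Diffusion kernel K = exp(-beta * L), with L = D - A the graph
    Laplacian (Kondor and Lafferty, 2002)."""
    D = np.diag(A.sum(axis=1))  # degree matrix
    L = D - A                   # graph Laplacian
    return expm(-beta * L)

# toy path graph 1 - 2 - 3
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = diffusion_kernel(A, beta=0.3)
# Each row sums to 1 (L has zero row sums), and entries decay with
# graph distance: K[0,0] > K[0,1] > K[0,2] > 0.
```

At β = 0 the kernel is the identity; as β grows, similarity diffuses to ever more distant nodes, matching the behaviour shown on the "Actual Values" slide.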
SLIDE 16

Adjacency Matrix and Degree Matrix

SLIDE 17

Graph Laplacian Matrix L

SLIDE 18

Actual Values of Diffusion Kernels

β = 0:    1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
β = 0.15: 0.66 0.10 0.01 0.01 0.10 0.10 0.01 0.01
β = 0.3:  0.47 0.15 0.03 0.03 0.15 0.13 0.02 0.02

Closeness from the “central node”

SLIDE 19

Kernel Matrix Completion

  • P: Kernel matrix of the data
  • Q: Incomplete kernel matrix
  • Missing values estimated by minimizing the KL divergence
  • Closed-form solution Q*
  • Threshold Q* to obtain the network

$$Q = \begin{bmatrix} K_{\ell} & Q_{vh} \\ Q_{vh}^{\top} & Q_{hh} \end{bmatrix} \qquad (K_{\ell}: \text{the } \ell \times \ell \text{ diffusion kernel of the training network})$$

$$\mathrm{KL}(Q, P) = \tfrac{1}{2}\operatorname{tr}\!\left(Q P^{-1}\right) - \tfrac{1}{2}\log\det\!\left(Q P^{-1}\right) - \tfrac{n}{2}$$

Minimize KL(Q, P) with respect to the missing blocks of Q.
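The divergence being minimized is the KL divergence between zero-mean Gaussians with covariances Q and P; a minimal sketch, assuming both matrices are strictly positive definite (the additive constant makes the divergence vanish at Q = P and does not affect the minimizer):

```python
import numpy as np

def kl_divergence(Q, P):
    """KL(Q, P) = 1/2 tr(Q P^-1) - 1/2 logdet(Q P^-1) - n/2,
    the KL divergence between N(0, Q) and N(0, P)."""
    n = Q.shape[0]
    M = Q @ np.linalg.inv(P)
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * np.trace(M) - 0.5 * logdet - 0.5 * n

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(kl_divergence(P, P))  # ~0: the divergence vanishes when Q == P
```

Any mismatch between Q and P produces a strictly positive value, which is what drives the completion toward matrices consistent with the data kernel.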

SLIDE 20

Supervised, Multiple Data

  • Multiple data about all proteins (kernel matrices)
  • Known training network (diffusion kernel matrix)

SLIDE 21

Overview of Our Approach

[Flow diagram: the kernel matrices are combined with weights into P(b); the adjacency matrix of the training network is converted into a diffusion kernel; completion then yields Q, and thresholding gives the result]

SLIDE 22

Notations

$$Q = \begin{bmatrix} K_{\ell} & Q_{vh} \\ Q_{vh}^{\top} & Q_{hh} \end{bmatrix}, \qquad P(\mathbf{b}) = \sum_{i=1}^{n_k} b_i K_i + \sigma^2 I$$

e.g. with four data sources: $P(\mathbf{b}) = b_1 K_1 + b_2 K_2 + b_3 K_3 + b_4 K_4 + \sigma^2 I$

Unknowns: submatrices $Q_{vh}, Q_{hh}$ and weights $\mathbf{b}$
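A minimal sketch of the weighted combination of kernel matrices (the toy kernels, weights, and the ridge value σ² are assumptions for illustration):

```python
import numpy as np

def combined_kernel(kernels, b, sigma2):
    """P(b) = sum_i b_i * K_i + sigma^2 * I: a weighted combination of
    kernel matrices plus a small ridge that keeps P(b) invertible."""
    n = kernels[0].shape[0]
    P = sigma2 * np.eye(n)
    for b_i, K_i in zip(b, kernels):
        P += b_i * K_i
    return P

# two toy 2x2 kernels with equal weights
kernels = [np.eye(2), np.ones((2, 2))]
b = [0.5, 0.5]
P = combined_kernel(kernels, b, sigma2=0.1)
print(P)  # [[1.1 0.5] [0.5 1.1]]
```

Setting a weight b_i to zero removes the corresponding data source entirely, which is how the method expresses data selection.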

SLIDE 23

Objective Function

  • KL divergence

$$\mathrm{KL}(Q, P(\mathbf{b})) = \tfrac{1}{2}\operatorname{tr}\!\bigl(Q P(\mathbf{b})^{-1}\bigr) - \tfrac{1}{2}\log\det\!\bigl(Q P(\mathbf{b})^{-1}\bigr) - \tfrac{n}{2}$$

Minimize w.r.t. the submatrices $Q_{vh}, Q_{hh}$ and the weights $\mathbf{b}$; solved by the EM algorithm.

SLIDE 24

EM Algorithm

  • Repeat the following two steps
  • 1. E-step: minimize KL(Q, P(b)) w.r.t. Q_vh, Q_hh
  • 2. M-step: minimize KL(Q, P(b)) w.r.t. b
  • E-step: same as the single-kernel case
  • M-step: cannot be solved in closed form

SLIDE 25

EM Algorithm for Extended Matrices

  • Extended kernel matrices

$$\tilde{Q} = \begin{bmatrix} Q & Q_{xz} \\ Q_{xz}^{\top} & Q_{zz} \end{bmatrix}, \qquad R(\mathbf{b}) = \begin{bmatrix} P(\mathbf{b}) & \Lambda \\ \Lambda^{\top} & \sigma^{2} I \end{bmatrix}$$

where $K_i = \Lambda_i \Lambda_i^{\top}$, $\Lambda = [\Lambda_1, \ldots, \Lambda_{n_k}]$ with $\Lambda_i \in \mathbb{R}^{n \times N}$, $Q_{xz} \in \mathbb{R}^{n \times n_k N}$, $Q_{zz} \in \mathbb{R}^{n_k N \times n_k N}$

  • The solution of the following problem is also optimal in the original problem:

$$\min_{Q_{vh},\,Q_{hh},\,Q_{xz},\,Q_{zz},\,\mathbf{b}} \mathrm{KL}\bigl(\tilde{Q},\, R(\mathbf{b})\bigr)$$

SLIDE 26

Solutions of the steps

  • E-step

$$Q_{vh} = K_{\ell}\, P_{vv}^{-1} P_{vh}, \qquad Q_{hh} = P_{hh} - P_{vh}^{\top} P_{vv}^{-1} P_{vh} + P_{vh}^{\top} P_{vv}^{-1} K_{\ell}\, P_{vv}^{-1} P_{vh}$$

$$Q_{xz} = \sigma^{-2}\, Q \Lambda V, \qquad Q_{zz} = V + \sigma^{-4}\, V \Lambda^{\top} Q \Lambda V, \qquad V = \bigl(I + \sigma^{-2} \Lambda^{\top} \Lambda\bigr)^{-1}$$

  • M-step

$$b_k = \frac{1}{N} \sum_{j=(k-1)N+1}^{kN} \bigl[Q_{zz}\bigr]_{jj}$$
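The closed-form updates can be transcribed into Python. This is a sketch of the single-kernel E-step (a Gaussian conditional completion) and the M-step weight update (averaging the diagonal blocks of Q_zz); the function names and toy shapes are my own:

```python
import numpy as np

def e_step(P, K_tr, l):
    """Fill the hidden blocks of Q given the data kernel P and the
    l x l training-network diffusion kernel K_tr:
      Q_vh = K_tr Pvv^-1 Pvh
      Q_hh = Phh - Pvh' Pvv^-1 Pvh + Pvh' Pvv^-1 K_tr Pvv^-1 Pvh
    """
    Pvv, Pvh, Phh = P[:l, :l], P[:l, l:], P[l:, l:]
    Pvv_inv = np.linalg.inv(Pvv)
    Q_vh = K_tr @ Pvv_inv @ Pvh
    Q_hh = (Phh - Pvh.T @ Pvv_inv @ Pvh
            + Pvh.T @ Pvv_inv @ K_tr @ Pvv_inv @ Pvh)
    return Q_vh, Q_hh

def m_step_weights(Q_zz, N, n_k):
    """b_k = (1/N) * sum of the k-th N x N diagonal block of Q_zz."""
    d = np.diag(Q_zz)
    return np.array([d[k * N:(k + 1) * N].sum() / N for k in range(n_k)])
```

A useful sanity check: if the training kernel already agrees with the visible block of P, the E-step reproduces P's own blocks, so the completion leaves a consistent kernel unchanged.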

SLIDE 27

Edge prediction experiments

Networks
・Metabolic network (KEGG)
・Protein interaction network (von Mering, 2002)

Data
・exp: gene expression
・y2h: interaction net by yeast two-hybrid
・loc: subcellular location
・phy: phylogenetic profile
・rnd1, …, rnd4: random noise

Methods
・Q: proposed method
・P: simple combination of kernel matrices
・cca: kernel CCA (without noise kernels)

Evaluation
ROC score of edge prediction accuracy (10-fold cross-validation)
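The ROC score used for evaluation ranks held-out candidate pairs by their completed-matrix entries; a minimal sketch via the Mann-Whitney statistic (the labels and scores below are made-up, not the paper's data):

```python
import numpy as np

def roc_score(y_true, scores):
    """ROC score (AUC) as the fraction of (true-edge, non-edge) pairs
    that the scores rank in the correct order."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

# held-out candidate pairs: 1 = edge in the true network, 0 = no edge
y = np.array([1, 1, 0, 0])
q = np.array([0.9, 0.4, 0.6, 0.1])  # hypothetical completed-matrix scores
print(roc_score(y, q))              # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect edge prediction, which is the scale on which the methods Q, P, and cca are compared.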

SLIDE 28

Metabolic network

  • Made from LIGAND Database (KEGG)

(Vert and Kanehisa, NIPS, 2003)

  • Connect enzymes of two successive reactions
  • 769 nodes, 3702 edges
SLIDE 29

Interaction network

  • Middle confidence
  • Interactions validated by multiple experiments

– High-throughput yeast two-hybrid
– Correlated mRNA expression
– Genetic interaction
– Tandem affinity purification
– High-throughput mass-spectrometric protein complex identification

  • 984 nodes, 2438 edges

(von Mering et al., Nature, 2002)

SLIDE 30

Dataset Details

Metabolic net: http://www.genome.jp/kegg/
Interaction: Von Mering et al., Nature, 417, 399–403, 2002
Expression: Spellman et al., MBC, 9, 3273–3297, 1998; Eisen et al., PNAS, 95, 14863–8, 1998
Y2H: Ito et al., PNAS, 98, 4569–74, 2001; Uetz et al., Nature, 10, 601–3, 2000
Subcellular location: Huh et al., Nature, 425, 686–91, 2003
Phylogenetic profile: http://www.genome.jp/kegg/

SLIDE 31

Metabolic Network

SLIDE 32

Physical Interaction Network

SLIDE 33

Introduce More Random Matrices (Metabolic network)

Sensitivity at 95% specificity

SLIDE 34

Summary of Experiments

  • Simple combination (P) < completed matrix (Q)

– Training network is essential

  • Selection did not improve accuracy
  • Accuracy comparable to kernel CCA
  • Automatic selection of datasets
  • 4 noise kernel matrices removed
SLIDE 35

Conclusion

  • Supervised Inference of Network

– Part of network known – Selection from multiple data – Formulation as kernel completion problem – Validation experiments on metabolic and interaction networks

  • Future work

– Biological interpretation of selection results – Applications to non-bio data

  • T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488–2495, 2005.
SLIDE 36

Experiments

SLIDE 37

Data

  • The functional catalogue provided by the MIPS Comprehensive Yeast Genome Database (CYGD, mips.gsf.de/proj/yeast).
  • Of 6355 yeast proteins in total, only 3588 have class labels.

13 CYGD functional classes:

  • 1. metabolism
  • 2. energy
  • 3. cell cycle and DNA processing
  • 4. transcription
  • 5. protein synthesis
  • 6. protein fate
  • 7. cellular transportation and transportation mechanism
  • 8. cell rescue, defense and virulence
  • 9. interaction with cell environment
  • 10. cell fate
  • 11. control of cell organization
  • 12. transport facilitation
  • 13. others

SLIDE 38

Data

  • Network created from Pfam domain structure. A protein is represented by a 4950-dimensional binary vector, in which each bit represents the presence or absence of one Pfam domain. An edge is created if the inner product between two vectors exceeds 0.06; the edge weight is the inner product.
  • Co-participation in a protein complex (determined by tandem affinity purification, TAP). An edge is created if there is a bait-prey relationship between two proteins.
  • Protein-protein interactions (MIPS physical interactions).
  • Genetic interactions (MIPS genetic interactions).
  • Network created from the cell cycle gene expression measurements [Spellman et al., 1998]. An edge is created if the Pearson coefficient of two profiles exceeds 0.8; the edge weight is set to 1. This is identical to the network used in [Deng et al., 2003].
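Two of these network constructions can be sketched directly in Python. The sketch assumes the domain vectors are L2-normalised before taking inner products (which the 0.06 threshold suggests); the toy matrices are illustrative only:

```python
import numpy as np

def domain_network(X, t=0.06):
    """Weighted network from binary domain vectors: edge (i, j) iff the
    inner product of the normalised vectors exceeds t; the edge weight
    is the inner product itself."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)  # guard all-zero rows
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)
    W[W <= t] = 0.0
    return W

def expression_network(E, r=0.8):
    """Unweighted edges where the Pearson coefficient of two expression
    profiles exceeds r (edge weight 1)."""
    C = np.corrcoef(E)
    A = (C > r).astype(int)
    np.fill_diagonal(A, 0)
    return A

# toy data: 3 proteins x 3 domains, and 3 proteins x 3 time points
X = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 0., 0.]])
E = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [3., 2., 1.]])
W = domain_network(X)        # shared-domain weight 0.5 between rows 0, 1
A_expr = expression_network(E)  # rows 0, 1 perfectly correlated -> edge
```

The same thresholding idiom covers both cases; only the similarity measure (inner product vs. Pearson correlation) and whether the weight is kept differ.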

SLIDE 39

Design

The performance comparison between:

  • L_fix: Laplacian of the combined graph with fixed (equal) weights
  • L_opt: Laplacian of the combined graph with optimized weights
  • L_k: Laplacian of an individual graph
  • MRF: Markov random field, proposed by Deng et al. [2003]
  • SDP/SVM: semidefinite-programming-based support vector machines, proposed by Lanckriet et al. [2004]

SLIDE 40

Results: ROC score with weights, classwise, Lfix vs. Lopt

Optimizing the weights did not always lead to better ROC scores (it helped only for classes 10, 11, and 13). However, the advantage of Lopt is that redundant networks are identified automatically.

SLIDE 41

Obtained Weight Parameters

[Chart: obtained weights of the five networks (Pfam network, protein complex, protein interaction, genetic interaction, gene expression) for each class: metabolism, energy, cell cycle, transcription, protein synthesis, protein fate, transportation, cell rescue, interaction with environment, cell fate, cell organization, transport facilitation, others]

SLIDE 42

Results: ROC scores of Lopt, Lfix, MRF, and SDP/SVM

White: MRF; Green: SDP/SVM; Blue: Lfix; Black: Lopt

For most classes, the proposed method achieves high scores, similar to the SDP/SVM method.

SLIDE 43

Results: Computational Time

Average computation time (measured on a standard 2.2 GHz PC with 1 GByte memory):

  • Combining graphs with fixed weights: 1.41 seconds (std. 0.013)
  • Combining graphs with optimized weights: 49.3 seconds (std. 14.8)

– Nearly linearly proportional to the number of non-zero entries of the sparse matrices

  • SDP/SVM: approx. 60 min (G. Lanckriet, personal communication); complexity O(n³) + O((m+n)²n^2.5)

SLIDE 44

Conclusion

  • Extended label propagation to multiple networks
  • Good prediction accuracy in yeast protein function experiments
  • Fast and scalable
  • Redundant / irrelevant networks excluded
  • Biological implications??