The Missing Models: A Data-driven Approach to Learning How Networks Grow
Carl Kingsford, Professor, Computational Biology Department, School of Computer Science, Carnegie Mellon University
Robert Patro, Geet Duggal, Emre Sefer, Hao Wang, Darya Filippova, Carl Kingsford (2012). The missing models: A data-driven approach for learning how networks grow. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 42-50.

Networks are everywhere
Biological, social, technological
How did these networks grow?
[Stelzl et al. 2005] [Bulik-Sullivan & Sullivan 2012] [Peer 1 2011]

Enter Network Growth Models
Biological, social, technological
Candidate models: DMC? Forest Fire? Kronecker?
[Stelzl et al. 2005] [Bulik-Sullivan & Sullivan 2012] [Peer 1 2011]

Example: DMC Model
- Plausible model of protein interaction network growth introduced by Vazquez et al. in 2001
- Based on gene duplication & divergence
[Figure: network at time t]

Example of DMC Model
[Duplication, Mutation, Complementarity]
[Figure: network at time t+1, showing the parent node and the new node (its duplicate)]

Example of DMC Model
[Duplication, Mutation, Complementarity]
The duplication step is repeated for many steps. [Stelzl et al. 2005]
In addition to having a biologically plausible mechanism, DMC can produce networks whose degree distribution and clustering coefficient are similar to those of real PPI networks.
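The three DMC stages named on these slides (duplication, mutation, complementarity) can be sketched as follows. This is a minimal illustration of the commonly described form of the model, not the presenters' code: the parameter names `q_mod` and `q_con`, their default values, the two-node seed graph, and the helper names `dmc_step` and `grow_dmc` are all assumptions made for the example.

```python
import random


def dmc_step(adj, q_mod, q_con, rng):
    """Apply one duplication / mutation / complementarity step in place."""
    parent = rng.choice(sorted(adj))
    child = max(adj) + 1
    # Duplication: the new node copies every edge of its parent.
    adj[child] = set(adj[parent])
    for w in adj[child]:
        adj[w].add(child)
    # Mutation: for each shared neighbour, with probability q_mod
    # remove its edge to either the parent or the child (fair coin).
    for w in sorted(adj[child]):
        if rng.random() < q_mod:
            loser = parent if rng.random() < 0.5 else child
            adj[loser].discard(w)
            adj[w].discard(loser)
    # Complementarity: with probability q_con, link parent and child.
    if rng.random() < q_con:
        adj[parent].add(child)
        adj[child].add(parent)


def grow_dmc(n, q_mod=0.4, q_con=0.7, seed=0):
    """Grow an n-node graph from a single edge (assumed seed graph)."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n:
        dmc_step(adj, q_mod, q_con, rng)
    return adj
```

With `q_mod=0.0` and `q_con=1.0` the first step always turns the seed edge into a triangle, which is a convenient sanity check for the duplication and complementarity stages.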

Network Growth Models (NGMs)
“What I Cannot Create, I Do Not Understand” -- Richard Feynman
- Bottom-up generative model of network growth process
- Creates “random” graphs with similar topological characteristics to the target
Theoretical:
- Discover reasons for observed structure
- How does topology change over time?
- How did the network look in the past?
- How will it look in the future?
Practical:
- Evaluate statistical significance of observed features
- Test algorithms in different contexts (varying topological characteristics, varying scales)

(Navlakha & Kingsford, PLoS Comp. Biol., 2011)

Core vs. Peripheral Complex Members
Coreness of a protein = percentage of like-annotated neighbors
[Figure: example nodes x and u with the annotations "½, newer (ignore)" and "¾, older"]
Are core members of a protein complex older than peripheral members?
Yes, somewhat: R = 0.37, P < 0.01
Agrees with 3D protein structure analysis (Kim & Marcotte, 2008) looking at the age distribution of domains among eukaryotic species.
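The coreness definition on this slide can be computed directly from an adjacency list. The sketch below assumes that "like-annotated" means the neighbor carries the same single annotation as the protein itself, and that a protein with no neighbors gets coreness 0; both choices, and the function and variable names, are assumptions for illustration only.

```python
def coreness(node, adj, annotation):
    """Fraction of a node's neighbours that share its annotation."""
    neighbours = adj[node]
    if not neighbours:
        return 0.0  # assumed convention for isolated nodes
    like = sum(1 for w in neighbours if annotation[w] == annotation[node])
    return like / len(neighbours)


# Hypothetical toy data: u has four neighbours, three in the same complex.
toy_adj = {"u": {"a", "b", "c", "d"},
           "a": {"u"}, "b": {"u"}, "c": {"u"}, "d": {"u"}}
toy_annotation = {"u": "C1", "a": "C1", "b": "C1", "c": "C1", "d": "C2"}
u_coreness = coreness("u", toy_adj, toy_annotation)
```

Here `u_coreness` is 3/4, since three of the four neighbors of `u` share its annotation.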

Supervised Learning → Predict Network Models
[Figure: a network passes through feature extraction and a classifier, which predicts a growth model (DMC in the example); candidate model classes: SMW, AGV, RDG, RDS, LPA, DMR, DMC]
Inferring network mechanisms: The Drosophila melanogaster protein interaction network. Manuel Middendorf, Etay Ziv, and Chris H. Wiggins

Many Existing Growth Models
Varying complexity / accuracy
Previous work focused on either manually designed growth models or a parameterized family of models (possibly with parameter learning).
The list runs from repeated application of a simple growth rule (top) to more complex but highly flexible models (bottom):
- Erdős-Rényi [1960]
- Barabási-Albert [1999]
- DMC [Vazquez et al. 2001]
- Duplication-divergence [Ispolatov et al. 2005]
- RTG [Akoglu & Faloutsos 2009]
- Forest Fire Model [Leskovec et al. 2010]
- Kronecker Model [Leskovec et al. 2010]
- Multifractal Network Generator [Palla et al. 2010]

So What’s New?
Method to automatically learn growth models which is nonparametric & data-driven
- Growth model is a program in the GrowCode language
- Instructions represent basic topological operations
- Pose finding NGMs as optimization over the space of programs
- General similarity measure to capture desired target characteristics
This novel representation of NGMs allows us to effectively search a large space of potential growth models.
[Figure: target graph → GrowCode Optimization → set of network growth models optimized to produce graphs similar to the target graph; GrowCode program (= network growth model) → GrowCode Virtual Machine → random graphs]

GrowCode Virtual Machine
- Register-based virtual machine
- Node label memory L : V → V
- Runs program iteratively to grow a graph
[Figure: program with program counter (PC); registers r0, r1, r2; current graph whose node labels (shown as u and v) act as memory]

Machine Instructions
Every sequence of instructions is a semantically valid GrowCode program.
Instruction groups:
- Modify graph topology
- Modify label memory
- Program control flow
- Manipulate machine registers

Influence Instructions

Example GrowCode Program
A new node duplicates an existing node u, where u is selected proportional to its degree.
Program: Set(1); Random edge; New node; Swap; Influence neighbors(1.0); Swap; Attach to influenced
Animation steps (current graph; registers r0 r1 r2):
- Step 1: graph 1 2; registers empty
- Step 2: graph 1 2; registers show a single value, 1
- Step 3: graph 1 2; registers 1 2 1
- Step 4: graph 3 1 2; registers 3 2 1
- Step 5: graph 3 1 2; registers 2 3 1
- Step 6: graph 3 2 1 2; registers 2 3 1
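To make the example program concrete, here is a toy register machine that runs the same seven instructions. The slides do not spell out what each instruction does, so every behavior below is an assumption inferred from the register contents in the animation: `Set(k)` writes to r2, `Random edge` loads an edge's endpoints into r0 and r1, `New node` places the new node in r0, `Influence neighbors(p)` labels each neighbor of r0 with r0's id with probability p, and `Attach to influenced` links r0 to every node labeled with r1. The class and method names are hypothetical and this is not the authors' GrowCode implementation.

```python
import random


class GrowVM:
    """Toy register machine; semantics are inferred, not authoritative."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.adj = {1: {2}, 2: {1}}   # seed graph: a single edge 1-2
        self.label = {}               # node label memory L : V -> V
        self.r = [None, None, None]   # registers r0, r1, r2

    def set_(self, k):
        self.r[2] = k

    def random_edge(self):
        # A uniformly random ordered pair (u, v) is a random edge with a
        # random orientation, so v is drawn proportionally to its degree.
        pairs = sorted((u, v) for u in self.adj for v in self.adj[u])
        self.r[0], self.r[1] = self.rng.choice(pairs)

    def new_node(self):
        node = max(self.adj) + 1
        self.adj[node] = set()
        self.r[0] = node

    def swap(self):
        self.r[0], self.r[1] = self.r[1], self.r[0]

    def influence_neighbors(self, p):
        src = self.r[0]
        for w in sorted(self.adj[src]):
            if self.rng.random() < p:
                self.label[w] = src

    def attach_to_influenced(self):
        node, src = self.r[0], self.r[1]
        for w, lab in list(self.label.items()):
            if lab == src and w != node:
                self.adj[node].add(w)
                self.adj[w].add(node)

    def run_once(self, program):
        self.label.clear()            # assumed: labels reset each pass
        for op, args in program:
            getattr(self, op)(*args)

    def grow(self, program, n_nodes):
        while len(self.adj) < n_nodes:
            self.run_once(program)
        return self.adj


PROGRAM = [("set_", (1,)), ("random_edge", ()), ("new_node", ()),
           ("swap", ()), ("influence_neighbors", (1.0,)),
           ("swap", ()), ("attach_to_influenced", ())]
```

Under these assumptions, one pass starting from the edge 1-2 with the random edge drawn as (1, 2) yields registers 1 2 1, then 3 2 1, then 2 3 1, matching the register contents listed in the animation steps. Calling `GrowVM(seed=1).grow(PROGRAM, 30)` repeats the program until the graph has 30 nodes.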
