Exploration, testing, and prediction: the many roles of statistics in Network Science

Aaron Clauset
Assistant Professor of Computer Science & BioFrontiers, University of Colorado Boulder
External Faculty, Santa Fe Institute

"Those who ignore Statistics are condemned to reinvent it." (Bradley Efron)
"The first principle is that you must not fool yourself, and you are the easiest person to fool." (Richard Feynman)
"There are three kinds of lies: lies, damned lies, and statistics." (unknown)
"It’s easy to lie with statistics, but it’s easier to lie without them." (Fred Mosteller)
"If your experiment needs statistics, you ought to have done a better experiment." (E. Rutherford)
"Far better an approximate answer to the right question… than an exact answer to the wrong question." (John W. Tukey)
"In God we trust. All others must bring data." (W. Edwards Deming)
"God must bring data, too." (unknown)

three roles of statistics
• data exploration
• model testing
• prediction

data exploration : community detection
• given a graph G
• divide its vertices into coherent groups z(G)
• consummate data exploration!
• a common task in network analysis
• helped yield insight into real social, biological, technological systems
• scores of methods, many extremely powerful, some with guarantees (stochastic block model, belief propagation, etc.)
• nearly all methods: estimate z by maximizing a score function f(z(G)) [WARNING: typically NP-hard]
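A minimal sketch of what "maximize f(z(G))" means in practice. It assumes that f is Newman-Girvan modularity, which is only one common choice of score function; the toy graph and the names `modularity`, `z_good`, and `z_one` are illustrative and not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, z):
    """Modularity Q of the partition z (node -> group label) for an
    undirected, unweighted graph given as a list of edges."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in one group
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if z[u] == z[v]:
            internal[z[u]] += 1
    group_degree = defaultdict(int)  # total degree of each group
    for node, k in degree.items():
        group_degree[z[node]] += k
    return sum(internal[g] / m - (group_degree[g] / (2 * m)) ** 2
               for g in group_degree)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
z_good = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
z_one = {node: "a" for node in range(6)}

print("two triangles as two groups:", modularity(edges, z_good))
print("everything in one group:    ", modularity(edges, z_one))
```

A community detection method of this kind searches over z for a high score; the sketch only evaluates the score for two hand-picked divisions.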

the trouble with community detection
• this is a pretty good division (under nearly any f)
B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).

data exploration : community detection
• so are all of these (and many more)

data exploration : community detection
• there are exponentially many good-looking local maxima; each algorithm chooses one
• this is okay for data exploration!
• anything else requires caution
• risks : 'wrong' optima
• opportunities : community structure is genuinely interesting!
• difficulties : how do we select among all these good divisions?
B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
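A rough illustration of the "many good divisions" problem, under stated assumptions: the score is modularity, the graph is a hypothetical ring of four triangles, and only two-group divisions are enumerated by brute force. It counts near-optimal divisions rather than true local maxima, so it is a simplification of the point made above, not a reproduction of the cited analysis.

```python
from itertools import product

def ring_of_triangles(k):
    """k triangles arranged in a ring, each linked to the next by one edge."""
    edges = []
    for t in range(k):
        a, b, c = 3 * t, 3 * t + 1, 3 * t + 2
        edges += [(a, b), (b, c), (a, c)]
        edges.append((c, (3 * (t + 1)) % (3 * k)))  # link to next triangle
    return edges

def modularity(edges, degree, z):
    """Modularity of the division z, where z[node] is a group label."""
    m = len(edges)
    internal = sum(1 for u, v in edges if z[u] == z[v])
    group_degree = {}
    for node, k in enumerate(degree):
        group_degree[z[node]] = group_degree.get(z[node], 0) + k
    return internal / m - sum((d / (2 * m)) ** 2 for d in group_degree.values())

n = 12
edges = ring_of_triangles(4)
degree = [0] * n
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Fix node 0 in group 0 so that each two-group division is counted once.
scores = [modularity(edges, degree, (0,) + rest)
          for rest in product((0, 1), repeat=n - 1)]

best = max(scores)
n_best = sum(1 for q in scores if q >= best - 1e-12)
n_near = sum(1 for q in scores if q >= 0.9 * best)
print("best score:", best)
print("divisions tied for best:", n_best)
print("divisions within 10% of best:", n_near)
```

Even on this tiny graph the best score is not unique, and an optimizer has no principled reason to prefer one of the tied divisions over another.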

model testing : scale-free networks
Inferring network mechanisms: The Drosophila melanogaster protein interaction network (Manuel Middendorf, Etay Ziv, and Chris H. Wiggins)
• observation : many protein interaction networks have heavy-tailed (power-law?) degree distributions
• claims : as of 2005, FIVE different models proposed as generative mechanisms
• duplication mutation complementation (DMC), duplication mutation-random (DMR), linear preferential attachment (LPA), random growing networks (RDG), aging vertex networks (AGV)
M. Middendorf, E. Ziv and C. H. Wiggins, Proc. Natl. Acad. Sci. USA 102 (9), 3192-3197 (2005).

model testing : scale-free networks
• the problem: all models fit the observed degree distribution
[Figure: Aaron likes honey. A bear likes honey. Is Aaron a bear?]
M. Middendorf, E. Ziv and C. H. Wiggins, Proc. Natl. Acad. Sci. USA 102 (9), 3192-3197 (2005).

model testing : scale-free networks
• the solution: build a classifier that can distinguish networks generated by the 5 models + 2 controls based on their motif frequencies
• use decision trees + AdaBoost (very powerful) to learn which motifs distinguish the models
• validated on synthetic graphs with known structure (rows: truth, columns: prediction, entries in %):

Truth    DMR    DMC    AGV    LPA    SMW    RDS    RDG
DMR     99.3    0.0    0.0    0.0    0.0    0.1    0.6
DMC      0.0   99.7    0.0    0.0    0.3    0.0    0.0
AGV      0.0    0.1   84.7   13.5    1.2    0.5    0.0
LPA      0.0    0.0   10.3   89.6    0.0    0.0    0.1
SMW      0.0    0.0    0.6    0.0   99.0    0.4    0.0
RDS      0.0    0.0    0.2    0.0    0.8   99.0    0.0
RDG      0.9    0.0    0.0    0.1    0.0    0.0   99.0

M. Middendorf, E. Ziv and C. H. Wiggins, Proc. Natl. Acad. Sci. USA 102 (9), 3192-3197 (2005).
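To make "motif frequencies" concrete, here is a heavily reduced sketch that counts only the two connected three-node subgraphs (triangles and open wedges). The study cited above uses a much larger census of subgraphs and a boosted decision-tree classifier; neither is reproduced here, and the function name `three_node_motifs` is hypothetical.

```python
from itertools import combinations

def three_node_motifs(edges):
    """Count triangles and open wedges (paths on three nodes) in an
    undirected simple graph given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = 0  # neighbor pairs that are themselves connected
    open_wedges = 0
    for center, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1
            else:
                open_wedges += 1
    # Each triangle is seen once from each of its three corners.
    return {"triangles": closed // 3, "open_wedges": open_wedges}

# Two hypothetical toy graphs with very different motif profiles.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]

print(three_node_motifs(two_triangles))
print(three_node_motifs(star))
```

Counts like these, computed for many synthetic graphs from each candidate model, would form the feature vectors that a classifier learns from.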

model testing : scale-free networks
• then pass the classifier the real PPIN

        Subgraphs with up to seven edges (p* = 0.65)    Eight-step subgraphs (p* = 0.65)
Rank    Class    Score                                   Class    Score
1       DMC        8.2 ± 1.0                             DMC        8.6 ± 1.1
2       DMR       -6.8 ± 0.9                             DMR       -6.1 ± 1.7
3       RDG       -9.5 ± 2.3                             RDG       -9.3 ± 1.6
4       AGV      -10.6 ± 4.2                             AGV      -11.5 ± 4.1
5       LPA      -16.5 ± 3.4                             LPA      -14.3 ± 3.2
6       SMW      -18.9 ± 0.7                             SMW      -18.3 ± 1.9
7       RDS      -19.1 ± 2.3                             RDS      -19.9 ± 1.5

• risks : we sometimes fall in love with our models
• opportunities : statistics offers powerful tools for model testing
• difficulties : requires learning new tools, and bravery
M. Middendorf, E. Ziv and C. H. Wiggins, Proc. Natl. Acad. Sci. USA 102 (9), 3192-3197 (2005).

prediction : link prediction
• how can we evaluate how good a model is?
• cross-validation:
  - hold out some data
  - fit the model to what remains
  - quantify model’s ability to predict held-out data
• for networks, this usually means link prediction
• to do this well, we use probabilistic generative models
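The cross-validation recipe above can be sketched as follows. This is an assumption-laden illustration: the predictor is a simple common-neighbors score rather than a probabilistic generative model, the graph is a hypothetical pair of 5-cliques joined by one edge, and the function name and parameters are invented for the example.

```python
import random
from itertools import combinations

def common_neighbors_auc(nodes, edges, holdout_fraction=0.2, seed=0):
    """Hold out a fraction of edges, score pairs on the remaining graph,
    and return the AUC: the probability that a held-out edge scores higher
    than a true non-edge (ties count one half)."""
    rng = random.Random(seed)
    all_edges = [tuple(sorted(e)) for e in edges]
    shuffled = all_edges[:]
    rng.shuffle(shuffled)
    n_held = max(1, int(holdout_fraction * len(shuffled)))
    held_out = shuffled[:n_held]   # step 1: hold out some data
    observed = shuffled[n_held:]   # step 2: "fit" on what remains

    adj = {v: set() for v in nodes}
    for u, v in observed:
        adj[u].add(v)
        adj[v].add(u)

    def score(u, v):
        return len(adj[u] & adj[v])

    true_edges = set(all_edges)
    non_edges = [p for p in combinations(sorted(nodes), 2)
                 if p not in true_edges]

    # Step 3: quantify the ability to rank held-out edges above non-edges.
    wins = 0.0
    for e in held_out:
        for ne in non_edges:
            s_edge, s_non = score(*e), score(*ne)
            if s_edge > s_non:
                wins += 1.0
            elif s_edge == s_non:
                wins += 0.5
    return wins / (len(held_out) * len(non_edges))

# Hypothetical toy network: two 5-cliques joined by a single edge.
nodes = list(range(10))
edges = (list(combinations(range(5), 2))
         + list(combinations(range(5, 10), 2))
         + [(4, 5)])

print("AUC:", common_neighbors_auc(nodes, edges, seed=1))
```

An AUC of 0.5 corresponds to pure chance; repeating the procedure over many random hold-out sets gives a more stable estimate than a single split.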

hierarchical random graph (HRG)
[Figure: a model dendrogram and one network instance drawn from it, with vertices i and j marked]
Pr(i, j connected) = p_r, where r is the lowest common ancestor of i and j
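A small sketch of the connection rule above, assuming a dendrogram encoded as nested tuples `(p, left, right)` with vertex labels at the leaves. The encoding, the probabilities, and the function names are hypothetical choices for illustration; vertex labels must not themselves be tuples.

```python
import random
from itertools import combinations

def leaves(tree):
    """Set of vertex labels below a dendrogram node."""
    if not isinstance(tree, tuple):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)

def connection_probability(tree, i, j):
    """Return p_r for r, the lowest common ancestor of leaves i and j."""
    members = leaves(tree)
    if i == j or i not in members or j not in members:
        raise ValueError("i and j must be distinct leaves of the tree")
    p, left, right = tree
    for child in (left, right):
        below = leaves(child)
        if i in below and j in below:
            # Both vertices sit under the same child, so the lowest
            # common ancestor is deeper in the tree.
            return connection_probability(child, i, j)
    return p  # i and j split here, so this node is their lowest ancestor

def sample_instance(tree, seed=0):
    """Draw one network instance: each pair is linked independently."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(sorted(leaves(tree)), 2)
            if rng.random() < connection_probability(tree, i, j)]

# Hypothetical dendrogram over five vertices.
dendrogram = (0.1, (0.9, "a", "b"), (0.8, "c", (0.5, "d", "e")))

print(connection_probability(dendrogram, "a", "b"))
print(connection_probability(dendrogram, "a", "d"))
print(sample_instance(dendrogram, seed=42))
```

Fitting an HRG to data means searching over dendrograms and probabilities; the sketch covers only the forward direction, from a given model to an instance.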

prediction : link prediction
[Figure: AUC versus fraction of edges observed (k/m) for three networks: terrorist association network, grassland species network, and T. pallidum metabolic network. Methods compared: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure.]
A. Clauset, C. Moore and M. E. J. Newman, "Hierarchical structure and the prediction of missing links in networks." Nature 453, 98-101 (2008).
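For reference, three of the simple baselines named in the figure can be written in a few lines each. This sketch uses the standard textbook definitions and a hypothetical four-node graph stored as an adjacency dictionary; the exact variants used in the cited paper may differ in detail.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def degree_product(adj, u, v):
    """Product of the two degrees."""
    return len(adj[u]) * len(adj[v])

# Hypothetical observed graph: a triangle 0-1-2 with a pendant vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
candidates = [(0, 3), (1, 3)]  # unconnected pairs to be scored

for name, f in [("common neighbors", common_neighbors),
                ("Jaccard coefficient", jaccard),
                ("degree product", degree_product)]:
    print(name, {pair: f(adj, *pair) for pair in candidates})
```

Each function assigns a score to an unconnected pair; ranking pairs by score and comparing against held-out edges yields the kind of AUC curve shown in the figure.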
