Exploration, testing, and prediction: the many roles of statistics


Exploration, testing, and prediction: the many roles of statistics in Network Science. Aaron Clauset, Assistant Professor of Computer Science & BioFrontiers, University of Colorado Boulder; External Faculty, Santa Fe Institute


slide-1
SLIDE 1

Aaron Clauset
Assistant Professor of Computer Science & BioFrontiers
University of Colorado Boulder
External Faculty, Santa Fe Institute

Exploration, testing, and prediction: the many roles of statistics in Network Science

slide-2
SLIDE 2

"Those who ignore Statistics are condemned to reinvent it." — Bradley Efron

slide-3
SLIDE 3

"The first principle is that you must not fool yourself, and you are the easiest person to fool." — Richard Feynman

slide-4
SLIDE 4

"There are three kinds of lies: lies, damned lies, and statistics." — unknown

slide-5
SLIDE 5

"It’s easy to lie with statistics, but it’s easier to lie without them." — Fred Mosteller

slide-6
SLIDE 6

"If your experiment needs statistics, you ought to have done a better experiment." — E. Rutherford

slide-7
SLIDE 7

"Far better an approximate answer to the right question… than an exact answer to the wrong question." — John W. Tukey

slide-8
SLIDE 8

"In God we trust. All others must bring data." — W. Edwards Deming

"God must bring data, too." — unknown

slide-9
SLIDE 9

three roles of statistics

  • data exploration
  • model testing
  • prediction
slide-10
SLIDE 10

data exploration : community detection

  • given a graph
  • divide its vertices into coherent groups
  • consummate data exploration!
  • a common task in network analysis
  • helped yield insight into real social, biological, technological systems
  • scores of methods, many extremely powerful, some with guarantees (stochastic block model, Belief Propagation, etc.)

[figure: a graph G and a division z(G) of its vertices into groups]

slide-11
SLIDE 11

data exploration : community detection

  • given a graph
  • divide its vertices into coherent groups
  • nearly all methods: estimate $\max_z f(z(G))$ [WARNING: typically NP-hard]

[figure: a graph G and a division z(G) of its vertices into groups]
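
One common choice of the score f is modularity, the quantity studied in the Good, de Montjoye and Clauset paper cited in these slides. The following is a minimal sketch of evaluating f(z(G)) for a given division, assuming an undirected simple graph stored as an edge list. The function name `modularity` and the toy graph are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, groups):
    """Modularity Q of a division z(G) of an undirected simple graph.

    edges  -- list of (u, v) pairs, each undirected edge listed once
    groups -- dict mapping every vertex to its group label
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)      # edges with both ends in the same group
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if groups[u] == groups[v]:
            internal[groups[u]] += 1
    group_degree = defaultdict(int)  # summed degree of each group
    for v, k in degree.items():
        group_degree[groups[v]] += k
    # Q = sum over groups of (fraction of edges inside) - (expected fraction)
    return sum(internal[g] / m - (group_degree[g] / (2 * m)) ** 2
               for g in group_degree)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_triangles = modularity(edges, by_triangle)  # 5/14, about 0.357
q_single = modularity(edges, one_group)       # 0.0
```

Evaluating f for one division is cheap. The hard part flagged on the slide is the search over all possible divisions z.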

slide-12
SLIDE 12

the trouble with community detection

this is a pretty good division (under nearly any f)

  • B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
slide-13
SLIDE 13

data exploration : community detection

so are all of these (and many more)

slide-14
SLIDE 14

data exploration : community detection

  • there are an exponential number of good-looking local maxima; each algorithm chooses one
  • this is okay for data exploration!
  • anything else requires caution
  • risks: 'wrong' optima
  • opportunities: community structure is genuinely interesting!
  • difficulties: how do we select among all these good divisions?

  • B. H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
slide-15
SLIDE 15

model testing : scale-free networks

  • observation: many protein interaction networks have heavy-tailed (power-law?) degree distributions
  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).
slide-16
SLIDE 16

model testing : scale-free networks

  • observation: many protein interaction networks have heavy-tailed (power-law?) degree distributions
  • claims: as of 2005, FIVE different models proposed as generative mechanisms
  • duplication-mutation-complementation (DMC), duplication-mutation-random (DMR), linear preferential attachment (LPA), random growing networks (RDG), aging vertex networks (AGV)
  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).
slide-17
SLIDE 17

model testing : scale-free networks

  • the problem: all models fit the observed degree distribution
  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).
slide-18
SLIDE 18

model testing : scale-free networks

  • the problem: all models fit the observed degree distribution
  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).

Aaron likes honey
Bear likes honey
?

slide-19
SLIDE 19

model testing : scale-free networks

  • the solution: build a classifier that can distinguish networks generated by the 5 models + 2 controls based on their motif frequencies
  • use decision trees + AdaBoost (very powerful) to learn which motifs distinguish the models
  • validated on synthetic graphs with known structure (rows: truth; columns: prediction; entries in percent):

Truth \ Prediction    DMR    DMC    AGV    LPA    SMW    RDS    RDG
DMR                  99.3    0.0    0.0    0.0    0.0    0.1    0.6
DMC                   0.0   99.7    0.0    0.0    0.3    0.0    0.0
AGV                   0.0    0.1   84.7   13.5    1.2    0.5    0.0
LPA                   0.0    0.0   10.3   89.6    0.0    0.0    0.1
SMW                   0.0    0.0    0.6    0.0   99.0    0.4    0.0
RDS                   0.0    0.0    0.2    0.0    0.8   99.0    0.0
RDG                   0.9    0.0    0.0    0.1    0.0    0.0   99.0

  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).
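
The boosting idea on this slide can be illustrated with a much smaller stand-in. The sketch below is not the classifier from the cited paper: it assumes only two classes (labels +1 and -1), uses depth-one decision trees (stumps) as the weak learners, and trains on made-up two-dimensional "motif frequency" vectors. All function names and the data are hypothetical.

```python
import math

def stump_predict(row, j, t, sign):
    """Depth-one tree: predict `sign` if feature j exceeds threshold t."""
    return sign if row[j] > t else -sign

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # candidate thresholds: midpoints between consecutive distinct values
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=5):
    """Binary AdaBoost; returns a list of (alpha, feature, threshold, sign)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, sign))
        # up-weight the examples this stump got wrong, then renormalise
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(row, j, t, sign)
                for alpha, j, t, sign in ensemble)
    return 1 if score > 0 else -1

# Made-up motif-frequency vectors for graphs from two hypothetical models.
X = [[0.30, 0.10], [0.28, 0.12], [0.33, 0.09],
     [0.05, 0.40], [0.08, 0.35], [0.04, 0.42]]
y = [1, 1, 1, -1, -1, -1]
ensemble = adaboost(X, y)
```

The features chosen by the stumps (the second entry of each tuple in `ensemble`) indicate which motifs separate the classes, which is the role the slide assigns to the learned trees.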

slide-20
SLIDE 20

model testing : scale-free networks

  • then pass the classifier the real PPIN
  • risks: we sometimes fall in love with our models
  • opportunities: statistics offers powerful tools for model testing
  • difficulties: requires learning new tools, and bravery
  • M. Middendorf, E. Ziv and C. H. Wiggins, "Inferring network mechanisms: The Drosophila melanogaster protein interaction network." Proc. Natl. Acad. Sci. USA 102(9), 3192-3197 (2005).

Rank   Eight-step subgraphs (p* = 0.65)   Subgraphs with up to seven edges (p* = 0.65)
       Class   Score                      Class   Score
1      DMC       8.2 ± 1.0                DMC       8.6 ± 1.1
2      DMR       6.8 ± 0.9                DMR       6.1 ± 1.7
3      RDG      -9.5 ± 2.3                RDG      -9.3 ± 1.6
4      AGV     -10.6 ± 4.2                AGV     -11.5 ± 4.1
5      LPA     -16.5 ± 3.4                LPA     -14.3 ± 3.2
6      SMW     -18.9 ± 0.7                SMW     -18.3 ± 1.9
7      RDS     -19.1 ± 2.3                RDS     -19.9 ± 1.5

slide-21
SLIDE 21

prediction : link prediction

  • how can we evaluate how good a model is?
  • cross-validation: hold out some data; fit the model to what remains; quantify the model’s ability to predict the held-out data

  • for networks, this usually means link prediction
  • to do this well, we use probabilistic generative models
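
The hold-out procedure above can be sketched for networks as follows. This is a minimal illustration, not the evaluation code behind the cited results: it assumes an undirected simple graph, uses the common-neighbors count as a stand-in scoring rule instead of a fitted generative model, and measures quality by AUC. The names `auc` and `holdout_auc` and the toy graph are hypothetical.

```python
import random
from itertools import combinations

def auc(pos_scores, neg_scores):
    """Probability that a held-out edge outscores a non-edge (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def holdout_auc(nodes, edges, frac_observed=0.8, seed=0):
    """Hide a random subset of edges, score pairs using only the rest."""
    rng = random.Random(seed)
    edges = sorted(tuple(sorted(e)) for e in edges)
    rng.shuffle(edges)
    k = round(frac_observed * len(edges))
    observed, held_out = edges[:k], edges[k:]

    # adjacency built from the observed edges only
    adj = {v: set() for v in nodes}
    for u, v in observed:
        adj[u].add(v)
        adj[v].add(u)

    def score(i, j):
        return len(adj[i] & adj[j])   # number of common neighbors

    all_edges = set(edges)
    non_edges = [p for p in combinations(sorted(nodes), 2)
                 if p not in all_edges]
    return auc([score(u, v) for u, v in held_out],
               [score(u, v) for u, v in non_edges])

# Hypothetical toy graph: two 5-cliques joined by one edge.
toy_edges = (list(combinations(range(5), 2))
             + list(combinations(range(5, 10), 2))
             + [(4, 5)])
toy_auc = holdout_auc(range(10), toy_edges, frac_observed=0.8)
```

An AUC of 0.5 corresponds to pure chance. Repeating the call with different values of `seed` and `frac_observed` gives curves of AUC against the fraction of edges observed.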
slide-22
SLIDE 22

hierarchical random graph (HRG)

$\Pr(i, j \text{ connected}) = p_r = p_{(\text{lowest common ancestor of } i, j)}$

[figure panels: model, instance]
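
The rule above can be sketched in a few lines: store a probability at each internal node of a binary tree, and look up the lowest common ancestor of two leaves. This is an illustrative sketch only. The class `Internal`, the helper names and the four-leaf tree are assumptions, and fitting the tree to data is not shown.

```python
import random
from itertools import combinations

class Internal:
    """Internal tree node r carrying its connection probability p_r."""
    def __init__(self, p, left, right):
        self.p, self.left, self.right = p, left, right

def leaves(node):
    """Set of leaf labels below a node (a bare label is itself a leaf)."""
    if isinstance(node, Internal):
        return leaves(node.left) | leaves(node.right)
    return {node}

def connection_probability(root, i, j):
    """Return p_r for r = lowest common ancestor of leaves i and j."""
    if i == j or not {i, j} <= leaves(root):
        raise ValueError("i and j must be distinct leaves of the tree")
    node = root
    while True:
        in_left = leaves(node.left)
        if i in in_left and j in in_left:
            node = node.left          # both below the left child
        elif i not in in_left and j not in in_left:
            node = node.right         # both below the right child
        else:
            return node.p             # they split here: this is the ancestor

def sample_graph(root, seed=0):
    """Draw one network instance: each pair is linked independently."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(sorted(leaves(root)), 2)
            if rng.random() < connection_probability(root, i, j)]

# Hypothetical model: two tight pairs, loosely connected to each other.
root = Internal(0.1, Internal(0.9, "a", "b"), Internal(0.8, "c", "d"))
p_ab = connection_probability(root, "a", "b")   # 0.9
p_ac = connection_probability(root, "a", "c")   # 0.1
instance = sample_graph(root)
```

Recomputing the leaf sets at every step keeps the sketch short. A real implementation would cache them or store parent pointers.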

slide-23
SLIDE 23

prediction : link prediction

[figure: AUC versus fraction of edges observed, k/m, for three networks (terrorist association network, T. pallidum metabolic network, grassland species network); methods compared: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure]

  • A. Clauset, C. Moore and M. E. J. Newman, "Hierarchical structure and the prediction of missing links in networks." Nature 453, 98-101 (2008).
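
Three of the simple baselines named in the figure legend score a candidate pair directly from the observed neighborhoods. A minimal sketch, assuming the observed graph is stored as a dict of neighbor sets; the function names and the four-vertex example are illustrative, and the shortest-paths and hierarchical-structure scores are not shown.

```python
def common_neighbors(adj, i, j):
    """Number of vertices adjacent to both i and j."""
    return len(adj[i] & adj[j])

def jaccard(adj, i, j):
    """Shared neighbors as a fraction of all neighbors of i or j."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def degree_product(adj, i, j):
    """Product of the two degrees."""
    return len(adj[i]) * len(adj[j])

# Hypothetical observed graph: a triangle 0-1-2 with a pendant vertex 3 on 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
scores_0_3 = (common_neighbors(adj, 0, 3),
              jaccard(adj, 0, 3),
              degree_product(adj, 0, 3))   # (1, 0.5, 2)
```

Each function assigns a higher score to pairs it considers more likely to be connected, so any of them can be plugged into a hold-out evaluation and compared by AUC.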
slide-24
SLIDE 24

prediction : link prediction

[figure: AUC versus fraction of edges observed, k/m, for three networks (terrorist association network, T. pallidum metabolic network, grassland species network); methods compared: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure]

and reproduces motifs and other patterns: degrees, triangles, path lengths

[figure: (a) fraction of vertices with degree k versus degree, k; fraction of graphs with clustering coefficient c versus clustering coefficient, c; (b) fraction of vertex-pairs at distance d versus distance, d]

slide-25
SLIDE 25

prediction : link prediction

  • link prediction is a hard form of validation
  • simple and clear evaluation measure
  • risks: overfitting; cross-validation not well-defined for networks; we care about more than missing links
  • opportunities: data driven with up-front assumptions; generative models quantify uncertainty, predict missing data
  • difficulties: usually non-mechanistic (predictive but not explanatory); how do we test more complicated predictions?

slide-26
SLIDE 26

"It’s easy to lie with statistics, but it’s easier to lie without them." — Fred Mosteller

  • statistics are the foundation of a data-driven Network Science.
  • exploration — what patterns need to be explained?
  • model testing — how well can I capture those patterns?
  • prediction — how well can I predict missing / future patterns?

  • the BIG risk: we’ll reinvent statistics, slowly, haltingly
  • the BIG opportunity: we’ll use modern Statistics to be better scientists, to find truth more quickly, accurately

  • the BIG difficulty: Statistics is hard
slide-27
SLIDE 27

fin

"The first principle is that you must not fool yourself, and you are the easiest person to fool." — Richard Feynman