Scalable Hashing-Based Network Discovery (Tara Safavi, Chandra Sripada, Danai Koutra) - PowerPoint PPT Presentation



SLIDE 1

Scalable Hashing-Based Network Discovery

Tara Safavi, Chandra Sripada, Danai Koutra (University of Michigan, Ann Arbor)

SLIDE 2

Networks are everywhere…

  • Airport connections
  • Internet routing
  • Paper citations

SLIDE 3

…but are not always directly observed

  • 1. fMRI scans
  • 2. Time series
  • 3. Brain network

See "Network Structure Inference: A Survey" (Brugere, Gallagher, Berger-Wolf)

SLIDE 4
  • 1. fMRI scans
  • 2. Time series
  • 3. Brain network

How to build this network?

SLIDE 5

Network discovery: reconstructing networks from indirect, possibly noisy measurements with unobserved interactions

  • Brain scans
  • Gene sequences
  • Stock patterns

SLIDE 6

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)

[Figure: series A, B, C; correlation edge weights .8, .4, .3]

SLIDE 7

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)
  • 3. Sparse graph: drop edges below threshold θ

[Figure: series A, B, C; weights .8, .4, .3; only the .8 edge survives thresholding]
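The traditional pipeline above is simple enough to state in code. The sketch below is a minimal pure-Python illustration of the baseline (plain Pearson correlation plus thresholding), not the authors' implementation; the function names `pearson` and `discover_network` are made up for this example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discover_network(series, theta):
    """All-pairs correlation (O(N^2) comparisons), then drop edges below theta."""
    edges = {}
    for i, j in combinations(range(len(series)), 2):
        r = pearson(series[i], series[j])
        if r >= theta:
            edges[(i, j)] = r
    return edges
```

For instance, with three series where only the first two move together, θ = 0.8 keeps only that one edge; the quadratic loop over pairs is exactly the cost the hashing approach later avoids.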

SLIDE 8

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)
  • 3. Sparse graph: drop edges below threshold θ

Widely used in many domains, interpretable, but…

SLIDE 9

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network via all-pairs correlation: O(N²) comparisons
  • 3. Sparse graph: drop edges below threshold θ (how to set θ?)

SLIDE 10

New hashing-based method

  • 1. N time series
  • 2. Hash series: binarize, then hash into buckets
  • 3. Sparse graph: bucket pairwise similarity, drop edges below threshold θ

[Figure: traditional pipeline above (O(N²) comparisons; how to set θ?), new hashing-based pipeline below]

SLIDE 11

[Figure: traditional pipeline (N time series → all-pairs correlation → fully-connected network → threshold θ; O(N²) comparisons, θ arbitrary?) vs. proposed pipeline (binarize → hash series into buckets → bucket pairwise similarity → sparse graph)]

Contributions

  • Network discovery via new locality-sensitive hashing (LSH) family
  • Quickly find similar pairs — circumvent wasteful extra computation
  • Novel similarity measure on sequences for LSH
  • Quantify time-consecutive similarity
  • Complementary distance measure is a metric: suitable for LSH!
  • Evaluation on real data in the neuroscience domain
SLIDE 12

Method

[Pipeline figure: 1. Time series → binarize → 2. Hash series → bucket pairwise similarity → 3. Sparse graph]
SLIDE 13

Approximate time series representation

  • Binarize w.r.t. series mean ("clipped" representation1)
  • Why?
  • Capture approximate fluctuation trend
  • Preprocess for hashing


1Ratanamahatana et al., 2005
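The binarization step above can be stated exactly; a minimal sketch (the function name `clip` is hypothetical):

```python
def clip(series):
    """'Clipped' binary representation: 1 where the value exceeds
    the series mean, 0 otherwise (after Ratanamahatana et al., 2005)."""
    mu = sum(series) / len(series)
    return [1 if v > mu else 0 for v in series]
```

Each series becomes a 0/1 sequence that records only whether the signal is above or below its own mean, capturing the approximate fluctuation trend while discarding amplitude.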

SLIDE 14

Approximate time series representation

  • Binarize w.r.t. series mean ("clipped" representation1)
  • Why?
  • Capture approximate fluctuation trend
  • Preprocess for hashing
  • But — binary sequences only have two possible values
  • Emphasize consecutive similarity between sequences over pointwise comparison

SLIDE 15

ABC: Approximate Binary Correlation

  • Capture variable-length consecutive runs between binary sequences

x: 1 1 0 1 0 0 0
y: 1 1 1 1 0 0 1

s = [(1 + α)^0 + (1 + α)^1] + [(1 + α)^0 + (1 + α)^1 + (1 + α)^2]
SLIDE 16

ABC: Approximate Binary Correlation

  • Capture variable-length consecutive runs between binary sequences
  • Similarity score s is a sum of p geometric series, each of length kᵢ
  • Common ratio (1 + α): 0 < α ≪ 1 is a consecutiveness weighting factor

x: 1 1 0 1 0 0 0
y: 1 1 1 1 0 0 1

s = [(1 + α)^0 + (1 + α)^1] + [(1 + α)^0 + (1 + α)^1 + (1 + α)^2]

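Reading the run/geometric-series description literally, ABC can be sketched as below. The similarity function follows the slide's definition (one geometric series per maximal run of agreements); the exact form of the "complementary distance measure" is an assumption here, taken as the maximum attainable similarity minus the observed one, not something the slides spell out.

```python
def abc_similarity(x, y, alpha=0.001):
    """ABC similarity: one geometric series with common ratio (1 + alpha)
    per maximal run of consecutive agreements between binary sequences x, y."""
    s, run = 0.0, 0
    for xi, yi in zip(x, y):
        if xi == yi:
            s += (1 + alpha) ** run   # next term of the current run's geometric series
            run += 1
        else:
            run = 0                   # a disagreement ends the run
    return s

def abc_distance(x, y, alpha=0.001):
    """Assumed complementary distance: maximum attainable similarity for
    length n, ((1 + alpha)**n - 1) / alpha, minus the observed similarity."""
    n = len(x)
    s_max = ((1 + alpha) ** n - 1) / alpha
    return s_max - abc_similarity(x, y, alpha)
```

On the slide's example (x = 1101000, y = 1111001) the agreement runs have lengths 2 and 3, so s = [1 + (1+α)] + [1 + (1+α) + (1+α)²], matching the formula above; identical sequences attain s_max and hence distance 0.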

SLIDE 17

ABC: Approximate Binary Correlation

  • Empirically, a good estimator of the correlation coefficient r
  • Similarity scores s correlate well with r

[Scatter plot: s vs. r]

SLIDE 18

ABC: Approximate Binary Correlation

  • Empirically, a good estimator of the correlation coefficient r
  • Similarity scores s correlate well with r
  • Added benefit of time-aware hashing
  • LSH requires a metric: satisfies triangle inequality
  • We can show ABC distance is a metric (critical)

SLIDE 19

ABC distance triangle inequality: sketch of proof

  • Induction on n, the sequence length
  • Induction step: identify feasible cases between sequence pairs when extending from length n to n + 1
  • Disagreement (d)
  • New run (n)
  • Append (a) to existing run
  • Compute all deltas
  • Triangle inequality holds!

[Figure: example sequences x, y, z extended from length n to n + 1, e.g. by an append (a) to an existing run]

SLIDE 20

Background: locality-sensitive hashing

  • Hash data s.t. similar items are likely to collide
  • Family of hash fns F: (d1, d2, p1, p2)-sensitive
  • Control false negative/positive rates
  • Parameters
  • b: number of hash tables, increases p1
  • r: number of hash functions to concatenate, lowers p2

[Figure: hash function g = h2 & h4 maps x and y to signature [1, 0] and z to [0, 0], so x and y share a bucket]
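The b/r mechanics above can be sketched with standard LSH banding. The sketch below is a generic illustration, not the paper's exact construction: each hash function is taken to sample one length-k window of the binary sequence (in the spirit of the window sampling on the next slide), and `lsh_candidate_pairs` is a hypothetical name.

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(sequences, k=2, r=2, b=4, seed=0):
    """Standard LSH banding: b hash tables, each keyed by the concatenation
    of r hash functions; here each hash function samples one length-k window
    of the binary sequence. Returns pairs that collide in at least one table."""
    rng = random.Random(seed)
    names = list(sequences)
    n = len(sequences[names[0]])
    candidates = set()
    for _ in range(b):                      # more tables -> higher p1 (fewer false negatives)
        starts = [rng.randrange(n - k + 1) for _ in range(r)]
        table = defaultdict(list)
        for name in names:
            seq = sequences[name]
            # concatenating r windows lengthens the signature, lowering p2
            sig = tuple(tuple(seq[s:s + k]) for s in starts)
            table[sig].append(name)
        for bucket in table.values():       # all pairs within a bucket collide
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    candidates.add((bucket[i], bucket[j]))
    return candidates
```

Identical sequences collide in every table, while a sequence that disagrees everywhere never shares a bucket, which is the false-negative/false-positive trade-off the (d1, d2, p1, p2) parameters control.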

SLIDE 21

Proposed: ABC-LSH window sampling

[Figure: with window length k = 2, g = h2 & h4 concatenates two length-2 windows, mapping x and y to signature [11, 00] and z to [01, 00]]

ABC-LSH is (d1, d2, 1 − αd1 / ((1 + α)^n − 1), 1 − αd2 / ((1 + α)^n − 1))-sensitive

SLIDE 22

[Figure: traditional pipeline (N time series → all-pairs correlation → fully-connected network → drop edges below threshold θ) vs. proposed pipeline (binarize → hash series into buckets → bucket pairwise similarity → sparse graph)]

Summary

  • Time-consecutive locality-sensitive hashing (LSH) family
  • Novel similarity measure + distance metric on sequences
SLIDE 23

Evaluation

SLIDE 24

Evaluation questions

  • 1. How efficient is our approach compared to baselines?
  • Baseline: pairwise correlation
  • Proposed: pairwise ABC, ABC-LSH
  • 2. How predictive are the output graphs in real applications?
  • Can we predict brain health using graphs discovered with ABC-LSH?
  • 3. How robust is our method to parameter choices?


Data

  • Two publicly available fMRI datasets
  • Synthetic data
SLIDE 25

Question 1: scalability

  • 2–15× speedup with 2K–20K nodes
SLIDE 26

Question 2: task-based evaluation

  • Brain networks: identify biomarkers of mental disease
  • Extract commonly used features from generated brain networks
  • Avg weighted degree
  • Avg clustering coefficient
  • Avg path length
  • Modularity
  • Density

[Figure: feature selection; example feature vector F1–F5 = 6.5, .3, 1.4, .7, .03, labeled Healthy]
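Two of the listed features can be stated exactly on the edge-weight map the pipeline produces; the sketch below is a minimal illustration with a hypothetical function name (the remaining features, e.g. clustering coefficient and modularity, are better computed with a proper graph library).

```python
def graph_features(edges, n):
    """Average weighted degree and density of an undirected graph
    given as an edge-weight map {(i, j): weight} over n nodes."""
    # each edge contributes its weight to both endpoints' degrees
    avg_weighted_degree = 2 * sum(edges.values()) / n
    # fraction of the n*(n-1)/2 possible edges that are present
    density = 2 * len(edges) / (n * (n - 1))
    return {"avg_weighted_degree": avg_weighted_degree, "density": density}
```

These per-graph scalars become the feature vector (F1, F2, …) fed to the classifier on the next slide.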

SLIDE 27

Question 2: task-based evaluation

  • Logistic regression classifier, 10-fold stratified CV

[Figure: train on labeled graph-feature vectors; predict health label on held-out test folds]

SLIDE 28

Question 2: task-based evaluation

Average accuracy is the same — runtime is not!

  • ABC-LSH total time: 5 min
  • Baseline (all-pairs correlation) total time: >1 hr

SLIDE 29

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
SLIDE 30

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate
  • Brain networks
  • More experiments on robustness, scalability, parameter sensitivity
SLIDE 31

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate
  • Brain networks
  • More experiments on robustness, scalability, parameter sensitivity
  • Impact: integrated into production systems
SLIDE 32

Thank you + questions

Supported by [sponsor logos]

SLIDE 33

Additional slides

SLIDE 34

Related work

[Figure: related approaches (k-NN networks, similarity self-join, time series hashing, graphical model inference) and their limitations (user-set thresholds; lacking consecutiveness and/or metrics; distributional assumptions)]

SLIDE 35

Generated graph structure

  • Characteristics of generated graphs
  • Approximates correlation-based approach well
SLIDE 36

Parameter sensitivity

  • How does scalability change varying k, b, and r?
  • Results fairly intuitive
  • Increase b: more hash tables, slower
  • Increase k: longer windows, series less likely to collide, faster
  • Increase r: longer signatures, series less likely to collide, faster

SLIDE 37

Parameter sensitivity

  • How do graph properties change varying k, b, and r?
  • Not much

SLIDE 38

Image credits

  • Slide 2, airline routes
  • Slide 2, internet
  • Slide 2, paper citations
  • Slide 3, fMRI
  • Slide 3, brain network
  • Slide 5, gene sequences
  • Slide 5, stocks