Scalable Hashing-Based Network Discovery (Tara Safavi, Chandra Sripada, Danai Koutra) - PowerPoint PPT Presentation



SLIDE 1

Scalable Hashing-Based Network Discovery

Tara Safavi, Chandra Sripada, Danai Koutra (University of Michigan, Ann Arbor)

SLIDE 2

Networks are everywhere…

  • Airport connections
  • Internet routing
  • Paper citations

SLIDE 3

…but are not always directly observed

  • 1. fMRI scans
  • 2. Time series
  • 3. Brain network

See "Network Structure Inference: A Survey" (Brugere, Gallagher, Berger-Wolf)

SLIDE 4
  • 1. fMRI scans
  • 2. Time series
  • 3. Brain network

How to build this network?

SLIDE 5

Network discovery: reconstructing networks from indirect, possibly noisy measurements with unobserved interactions

  • Brain scans
  • Gene sequences
  • Stock patterns

SLIDE 6

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)

[Figure: series A, B, C; correlation edge weights .8, .4, .3]

SLIDE 7

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)
  • 3. Sparse graph: drop edges below threshold θ

[Figure: series A, B, C; weights .8, .4, .3; only the .8 edge survives thresholding]
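The traditional pipeline above is simple enough to state in code. The sketch below is a minimal pure-Python illustration of the baseline (plain Pearson correlation plus thresholding), not the authors' implementation; the function names `pearson` and `discover_network` are made up for this example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discover_network(series, theta):
    """All-pairs correlation (O(N^2) comparisons), then drop edges below theta."""
    edges = {}
    for i, j in combinations(range(len(series)), 2):
        r = pearson(series[i], series[j])
        if r >= theta:
            edges[(i, j)] = r
    return edges
```

For instance, with three series where only the first two move together, θ = 0.8 keeps only that one edge; the quadratic loop over pairs is exactly the cost the hashing approach later avoids.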

SLIDE 8

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network (all-pairs correlation)
  • 3. Sparse graph: drop edges below threshold θ

Widely used in many domains, interpretable, but…

SLIDE 9

Traditional method

  • 1. N time series
  • 2. Fully-connected weighted network via all-pairs correlation: O(N²) comparisons
  • 3. Sparse graph: drop edges below threshold θ (how to set θ?)

SLIDE 10

New hashing-based method

  • 1. N time series
  • 2. Hash series: binarize, then hash into buckets
  • 3. Sparse graph: bucket pairwise similarity, drop edges below threshold θ

[Figure: traditional pipeline above (O(N²) comparisons; how to set θ?), new hashing-based pipeline below]

SLIDE 11

[Figure: traditional pipeline (N time series → all-pairs correlation → fully-connected network → threshold θ; O(N²) comparisons, θ arbitrary?) vs. proposed pipeline (binarize → hash series into buckets → bucket pairwise similarity → sparse graph)]

Contributions

  • Network discovery via new locality-sensitive hashing (LSH) family
  • Quickly find similar pairs — circumvent wasteful extra computation
  • Novel similarity measure on sequences for LSH
  • Quantify time-consecutive similarity
  • Complementary distance measure is a metric: suitable for LSH!
  • Evaluation on real data in the neuroscience domain
SLIDE 12

Method

[Pipeline figure: 1. Time series → binarize → 2. Hash series → bucket pairwise similarity → 3. Sparse graph]
SLIDE 13

Approximate time series representation

  • Binarize w.r.t. series mean ("clipped" representation1)
  • Why?
  • Capture approximate fluctuation trend
  • Preprocess for hashing


1Ratanamahatana et al., 2005
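The binarization step above can be stated exactly; a minimal sketch (the function name `clip` is hypothetical):

```python
def clip(series):
    """'Clipped' binary representation: 1 where the value exceeds
    the series mean, 0 otherwise (after Ratanamahatana et al., 2005)."""
    mu = sum(series) / len(series)
    return [1 if v > mu else 0 for v in series]
```

Each series becomes a 0/1 sequence that records only whether the signal is above or below its own mean, capturing the approximate fluctuation trend while discarding amplitude.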

SLIDE 14

Approximate time series representation

  • Binarize w.r.t. series mean ("clipped" representation1)
  • Why?
  • Capture approximate fluctuation trend
  • Preprocess for hashing
  • But — binary sequences only have two possible values
  • Emphasize consecutive similarity between sequences over pointwise comparison

SLIDE 15

ABC: Approximate Binary Correlation

  • Capture variable-length consecutive runs between binary sequences

x: 1 1 0 1 0 0 0
y: 1 1 1 1 0 0 1

s = [(1 + α)^0 + (1 + α)^1] + [(1 + α)^0 + (1 + α)^1 + (1 + α)^2]
SLIDE 16

ABC: Approximate Binary Correlation

  • Capture variable-length consecutive runs between binary sequences
  • Similarity score s is a sum of p geometric series, each of length kᵢ
  • Common ratio (1 + α): 0 < α ≪ 1 is a consecutiveness weighting factor

x: 1 1 0 1 0 0 0
y: 1 1 1 1 0 0 1

s = [(1 + α)^0 + (1 + α)^1] + [(1 + α)^0 + (1 + α)^1 + (1 + α)^2]

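Reading the run/geometric-series description literally, ABC can be sketched as below. The similarity function follows the slide's definition (one geometric series per maximal run of agreements); the exact form of the "complementary distance measure" is an assumption here, taken as the maximum attainable similarity minus the observed one, not something the slides spell out.

```python
def abc_similarity(x, y, alpha=0.001):
    """ABC similarity: one geometric series with common ratio (1 + alpha)
    per maximal run of consecutive agreements between binary sequences x, y."""
    s, run = 0.0, 0
    for xi, yi in zip(x, y):
        if xi == yi:
            s += (1 + alpha) ** run   # next term of the current run's geometric series
            run += 1
        else:
            run = 0                   # a disagreement ends the run
    return s

def abc_distance(x, y, alpha=0.001):
    """Assumed complementary distance: maximum attainable similarity for
    length n, ((1 + alpha)**n - 1) / alpha, minus the observed similarity."""
    n = len(x)
    s_max = ((1 + alpha) ** n - 1) / alpha
    return s_max - abc_similarity(x, y, alpha)
```

On the slide's example (x = 1101000, y = 1111001) the agreement runs have lengths 2 and 3, so s = [1 + (1+α)] + [1 + (1+α) + (1+α)²], matching the formula above; identical sequences attain s_max and hence distance 0.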

SLIDE 17

ABC: Approximate Binary Correlation

  • Empirically, a good estimator of the correlation coefficient r
  • Similarity scores s correlate well with r

[Scatter plot: s vs. r]

SLIDE 18

ABC: Approximate Binary Correlation

  • Empirically, a good estimator of the correlation coefficient r
  • Similarity scores s correlate well with r
  • Added benefit of time-aware hashing
  • LSH requires a metric: satisfies triangle inequality
  • We can show ABC distance is a metric (critical)

SLIDE 19

ABC distance triangle inequality: sketch of proof

  • Induction on n, the sequence length
  • Induction step: identify feasible cases between sequence pairs when extending from length n to n + 1
  • Disagreement (d)
  • New run (n)
  • Append (a) to existing run
  • Compute all deltas
  • Triangle inequality holds!

[Figure: example sequences x, y, z extended from length n to n + 1, e.g. by an append (a) to an existing run]

SLIDE 20

Background: locality-sensitive hashing

  • Hash data s.t. similar items are likely to collide
  • Family of hash fns F: (d1, d2, p1, p2)-sensitive
  • Control false negative/positive rates
  • Parameters
  • b: number of hash tables, increases p1
  • r: number of hash functions to concatenate, lowers p2

[Figure: hash function g = h2 & h4 maps x and y to signature [1, 0] and z to [0, 0], so x and y share a bucket]
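The b/r mechanics above can be sketched with standard LSH banding. The sketch below is a generic illustration, not the paper's exact construction: each hash function is taken to sample one length-k window of the binary sequence (in the spirit of the window sampling on the next slide), and `lsh_candidate_pairs` is a hypothetical name.

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(sequences, k=2, r=2, b=4, seed=0):
    """Standard LSH banding: b hash tables, each keyed by the concatenation
    of r hash functions; here each hash function samples one length-k window
    of the binary sequence. Returns pairs that collide in at least one table."""
    rng = random.Random(seed)
    names = list(sequences)
    n = len(sequences[names[0]])
    candidates = set()
    for _ in range(b):                      # more tables -> higher p1 (fewer false negatives)
        starts = [rng.randrange(n - k + 1) for _ in range(r)]
        table = defaultdict(list)
        for name in names:
            seq = sequences[name]
            # concatenating r windows lengthens the signature, lowering p2
            sig = tuple(tuple(seq[s:s + k]) for s in starts)
            table[sig].append(name)
        for bucket in table.values():       # all pairs within a bucket collide
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    candidates.add((bucket[i], bucket[j]))
    return candidates
```

Identical sequences collide in every table, while a sequence that disagrees everywhere never shares a bucket, which is the false-negative/false-positive trade-off the (d1, d2, p1, p2) parameters control.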

SLIDE 21

Proposed: ABC-LSH window sampling

[Figure: with window length k = 2, g = h2 & h4 concatenates two length-2 windows, mapping x and y to signature [11, 00] and z to [01, 00]]

ABC-LSH is (d1, d2, 1 − αd1 / ((1 + α)^n − 1), 1 − αd2 / ((1 + α)^n − 1))-sensitive

SLIDE 22

[Figure: traditional pipeline (N time series → all-pairs correlation → fully-connected network → drop edges below threshold θ) vs. proposed pipeline (binarize → hash series into buckets → bucket pairwise similarity → sparse graph)]

Summary

  • Time-consecutive locality-sensitive hashing (LSH) family
  • Novel similarity measure + distance metric on sequences
SLIDE 23

Evaluation

SLIDE 24

Evaluation questions

  • 1. How efficient is our approach compared to baselines?
  • Baseline: pairwise correlation
  • Proposed: pairwise ABC, ABC-LSH
  • 2. How predictive are the output graphs in real applications?
  • Can we predict brain health using graphs discovered with ABC-LSH?
  • 3. How robust is our method to parameter choices?


Data

  • Two publicly available fMRI datasets
  • Synthetic data
SLIDE 25

Question 1: scalability

  • 2–15× speedup with 2K–20K nodes
SLIDE 26

Question 2: task-based evaluation

  • Brain networks: identify biomarkers of mental disease
  • Extract commonly used features from generated brain networks
  • Avg weighted degree
  • Avg clustering coefficient
  • Avg path length
  • Modularity
  • Density

[Figure: feature selection; example feature vector F1–F5 = 6.5, .3, 1.4, .7, .03, labeled Healthy]
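Two of the listed features can be stated exactly on the edge-weight map the pipeline produces; the sketch below is a minimal illustration with a hypothetical function name (the remaining features, e.g. clustering coefficient and modularity, are better computed with a proper graph library).

```python
def graph_features(edges, n):
    """Average weighted degree and density of an undirected graph
    given as an edge-weight map {(i, j): weight} over n nodes."""
    # each edge contributes its weight to both endpoints' degrees
    avg_weighted_degree = 2 * sum(edges.values()) / n
    # fraction of the n*(n-1)/2 possible edges that are present
    density = 2 * len(edges) / (n * (n - 1))
    return {"avg_weighted_degree": avg_weighted_degree, "density": density}
```

These per-graph scalars become the feature vector (F1, F2, …) fed to the classifier on the next slide.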

SLIDE 27

Question 2: task-based evaluation

  • Logistic regression classifier, 10-fold stratified CV

[Figure: train on labeled graph-feature vectors; predict health label on held-out test folds]

SLIDE 28

Question 2: task-based evaluation

Average accuracy is the same — runtime is not!

  • ABC-LSH total time: 5 min
  • Baseline (all-pairs correlation) total time: >1 hr

SLIDE 29

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
SLIDE 30

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate
  • Brain networks
  • More experiments on robustness, scalability, parameter sensitivity
SLIDE 31

Conclusion

  • Pipeline for network discovery on time series
  • ABC: time-consecutive similarity measure + distance metric on binary sequences
  • Associated LSH family
  • Modular + applicable in other settings
  • Experiments: shown to be fast + accurate
  • Brain networks
  • More experiments on robustness, scalability, parameter sensitivity
  • Impact: integrated into production systems
SLIDE 32

Thank you + questions

Supported by [sponsor logos]

SLIDE 33

Additional slides

SLIDE 34

Related work

[Figure: related approaches (k-NN networks, similarity self-join, time series hashing, graphical model inference) and their limitations (user-set thresholds; lacking consecutiveness and/or metrics; distributional assumptions)]

SLIDE 35

Generated graph structure

  • Characteristics of generated graphs
  • Approximates correlation-based approach well
SLIDE 36

Parameter sensitivity

  • How does scalability change varying k, b, and r?
  • Results fairly intuitive
  • Increase b: more hash tables, slower
  • Increase k: longer windows, series less likely to collide, faster
  • Increase r: longer signatures, series less likely to collide, faster

SLIDE 37

Parameter sensitivity

  • How do graph properties change varying k, b, and r?
  • Not much

SLIDE 38

Image credits

  • Slide 2, airline routes
  • Slide 2, internet
  • Slide 2, paper citations
  • Slide 3, fMRI
  • Slide 3, brain network
  • Slide 5, gene sequences
  • Slide 5, stocks