community recovery in graphs with locality

  1. Community Recovery in Graphs with Locality. Yuxin Chen†, Govinda Kamath†, Changho Suh∗, David Tse† (†Stanford, ∗KAIST)

  3. Community recovery / graph clustering
     • Community structures are common in many social networks (image credits: The Future Buzz; S. Papadopoulos)
     • Community recovery: partition users into several clusters based on their friendships / similarities

  7. Community recovery in computational biology: a genome phasing problem
     • Phase info for each SNP: (1) maternally inherited, (2) paternally inherited
     • Linking reads: relative phase relation of 2 (or more) SNPs
     • Haplotype phasing: retrieve phase info of all SNPs from linking reads

  8. Stochastic block model / censored block model. Pairwise measurements for any pair $(i, j)$ of nodes: $y_{i,j} \overset{\text{ind.}}{\sim} \begin{cases} P_0, & \text{if } i \text{ and } j \text{ are from the same community} \\ P_1, & \text{else} \end{cases}$

  11. Problem: nodes often have locality
     • Most prior work: (almost) equally likely to sample between any pair of nodes – Condon et al., Jalali et al., Chen et al., Abbe et al., Mossel et al., Hajek et al., Chin et al...
     • More realistically: samples come mainly (or exclusively) from nearby nodes
     • In new technologies like 10x-Genomics: (1) $n \sim 10^5$ SNPs; (2) linking range $\sim 100$ SNPs

  12. This work: how to deal with measurement locality in community recovery?

  13. A two-community model
     • $n$ variables we seek: $x_1, \cdots, x_n \in \{0, 1\}$ – encode community membership (figure labels: $x_i = 0$, $x_i = 1$)

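  16. Measurement model: random sampling
     • Constraint graph $G$ (figure: locality radius $r$ marked on the graph)
     • Random sampling: pick $m$ randomly chosen edges of $G$
     • Noise model: on each of these $m$ edges $(i, j)$, take an independent sample $y_{i,j} \overset{\text{ind.}}{=} \begin{cases} x_i \oplus x_j, & \text{with prob. } 1 - \theta \\ x_i \oplus x_j \oplus 1, & \text{else} \end{cases}$ where $\theta$ is the measurement error rate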
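
The measurement model above can be simulated in a few lines. This is only an illustrative sketch: the ring-shaped constraint graph, sampling edges with replacement, and the names `ring_edges` and `sample_measurements` are assumptions made here, not details taken from the slides.

```python
import random

def ring_edges(n, r):
    """Edges (i, j), i < j, of a ring in which each node links to nodes within distance r.

    The ring shape is an assumption for this sketch; any constraint graph G works.
    """
    edges = set()
    for i in range(n):
        for d in range(1, r + 1):
            j = (i + d) % n
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def sample_measurements(x, edges, m, theta, rng):
    """Pick m edges uniformly at random (with replacement, an assumption)
    and return noisy parities y = x_i XOR x_j, flipped with probability theta."""
    samples = []
    for _ in range(m):
        i, j = rng.choice(edges)
        parity = x[i] ^ x[j]
        if rng.random() < theta:
            parity ^= 1  # measurement error
        samples.append((i, j, parity))
    return samples

rng = random.Random(0)
n, r, theta = 200, 5, 0.2
x = [rng.randint(0, 1) for _ in range(n)]      # hidden community labels
edges = ring_edges(n, r)
samples = sample_measurements(x, edges, m=2000, theta=theta, rng=rng)
```

Each sample only ever touches two nodes that are within distance `r` on the ring, which is the locality constraint the talk is concerned with.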

  18. Modeling locality via constraint graph
     • Global / long-range measurements (figure: constraint graph, randomly picked edges)
     • Local measurements with locality radius $r$ (figure: constraint graph, randomly picked edges), e.g. $r \sim n^{0.4}$ for 10x

  22. Information and computation limits
     1. How many samples are needed to recover $\{x_i\}$ reliably (up to global offset)?
     2. How to recover efficiently?
     [Figure labels: global samples, local samples, prior works]
     Encouraging news: one can obtain efficient recovery within linear time

  23. Proposed algorithm: a 3-stage linear-time paradigm

  24. Spectral-Stitching: Stage 1. Start by running spectral method on core complete subgraphs: $L = \underbrace{\mathbb{E}[L]}_{\text{rank-1}} + (L - \mathbb{E}[L])$
     • Compute rank-1 approximation of $L$ (sample matrix restricted to the subgraph)

  27. Spectral-Stitching: Stage 1. Split all nodes into overlapping subsets and run spectral methods separately
     • Approximate solution within each subgraph – Key observation: approx. recovery needs only $O(1)$ samples per node
     • Inconsistent global phases across subgraphs
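
To make Stage 1 concrete, here is a minimal sketch of a spectral estimate on one subset of nodes. It is not the authors' implementation: the ±1 encoding of the sample matrix, the shifted power iteration used to find the leading eigenvector, and the fully sampled toy subgraph are all assumptions chosen to keep the example dependency-free.

```python
import random

def spectral_estimate(k, samples, iters=200):
    """Estimate labels of k nodes from samples (i, j, y), y a noisy x_i XOR x_j."""
    # Sample matrix: +1 votes "same community", -1 votes "different community".
    L = [[0.0] * k for _ in range(k)]
    for i, j, y in samples:
        w = 1.0 if y == 0 else -1.0
        L[i][j] += w
        L[j][i] += w
    # Shift by a Gershgorin bound so power iteration targets the largest eigenvalue.
    shift = max(sum(abs(a) for a in row) for row in L)
    rng = random.Random(1)
    v = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    for _ in range(iters):
        w = [sum(L[i][j] * v[j] for j in range(k)) + shift * v[i] for i in range(k)]
        norm = sum(a * a for a in w) ** 0.5 or 1.0
        v = [a / norm for a in w]
    # Round the leading eigenvector by sign; the global flip stays undetermined.
    return [0 if a >= 0 else 1 for a in v]

# Toy subgraph: every pair measured once, error rate 0.1 (illustrative values).
rng = random.Random(0)
k, theta = 60, 0.1
x = [rng.randint(0, 1) for _ in range(k)]
samples = [(i, j, (x[i] ^ x[j]) ^ (rng.random() < theta))
           for i in range(k) for j in range(i + 1, k)]
est = spectral_estimate(k, samples)
errors = sum(a != b for a, b in zip(est, x))
errors = min(errors, k - errors)  # compare up to the global flip
```

Because the sign of an eigenvector is arbitrary, two subsets processed this way may come back with opposite labelings, which is the "inconsistent global phases" issue that Stage 2 addresses.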

  30. Spectral-Stitching: Stage 2. Calibrate phases across subgraphs by checking their correlations. Purpose of Stages 1-2: obtain approximate solution of all nodes
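
One simple way to picture the calibration step is to flip each subset's labels so that they agree with the previous subset on the shared nodes. The sketch below does exactly that; the dictionary representation, the left-to-right processing order, and the function name `stitch` are assumptions for illustration, not details given in the slides.

```python
def stitch(blocks):
    """Merge per-subset estimates into one labeling.

    blocks: list of dicts {node: label in {0, 1}}, ordered so that each
    block overlaps the nodes already merged (an assumption of this sketch).
    """
    merged = dict(blocks[0])
    for block in blocks[1:]:
        shared = [v for v in block if v in merged]
        agree = sum(block[v] == merged[v] for v in shared)
        # Negative correlation on the overlap means the block has the opposite phase.
        flip = 1 if 2 * agree < len(shared) else 0
        for v, label in block.items():
            merged.setdefault(v, label ^ flip)
    return merged

truth = [0, 1, 1, 0, 1, 0, 0, 1]
block_a = {0: 0, 1: 1, 2: 1, 3: 0, 4: 1}   # correct phase
block_b = {3: 1, 4: 0, 5: 1, 6: 1, 7: 0}   # globally flipped phase
merged = stitch([block_a, block_b])
labels = [merged[v] for v in range(len(truth))]
```

In this toy case the two blocks disagree on both shared nodes (3 and 4), so the second block is flipped before its new nodes are added.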

  32. Spectral-Stitching: Stage 3. Clean up all remaining errors by iterative refinement
     • local majority vote using all samples
     • Key observation: exact recovery needs at least $\Theta(\log n)$ samples per node
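
The local majority vote can be sketched as follows. Each sample $(i, j, y)$ lets node $j$'s current estimate vote for the label of node $i$ (and vice versa). The number of rounds, the tie-breaking rule (keep the old label), and the function name `refine` are assumptions made for this illustration.

```python
def refine(x_hat, samples, rounds=3):
    """Iteratively re-estimate each node by majority vote over all samples touching it."""
    x_hat = list(x_hat)
    for _ in range(rounds):
        votes = [0] * len(x_hat)  # positive favors label 1, negative favors label 0
        for i, j, y in samples:
            votes[i] += 1 if (x_hat[j] ^ y) else -1
            votes[j] += 1 if (x_hat[i] ^ y) else -1
        # Ties keep the previous label.
        x_hat = [1 if v > 0 else 0 if v < 0 else old
                 for v, old in zip(votes, x_hat)]
    return x_hat

# Toy check: noiseless samples on a ring with radius 3, two wrong initial labels.
n, r = 20, 3
truth = [(i * 7) % 3 % 2 for i in range(n)]
samples = [(i, (i + d) % n, truth[i] ^ truth[(i + d) % n])
           for i in range(n) for d in range(1, r + 1)]
start = list(truth)
start[0] ^= 1
start[10] ^= 1
refined = refine(start, samples)
```

In the toy check every node has six neighbors, so a wrongly labeled node is outvoted by its correct neighbors in the first round, and each correct node sees at most one wrong vote.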

  35. Main results: rings. Theorem: minimum sample complexity $= \dfrac{0.5\, n \log n}{1 - \exp\{-\mathsf{KL}(0.5 \,\|\, \theta)\}}$. Info and comput. limits meet!
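
To get a feel for the theorem, the threshold can be evaluated numerically. The sketch below assumes natural logarithms and reads $\mathsf{KL}(0.5 \,\|\, \theta)$ as the divergence between Bernoulli(0.5) and Bernoulli($\theta$); the function names are made up for this example.

```python
import math

def kl_half(theta):
    """KL divergence between Bernoulli(0.5) and Bernoulli(theta), in nats."""
    return 0.5 * math.log(0.5 / theta) + 0.5 * math.log(0.5 / (1.0 - theta))

def min_samples(n, theta):
    """Evaluate 0.5 * n * log(n) / (1 - exp(-KL(0.5 || theta)))."""
    return 0.5 * n * math.log(n) / (1.0 - math.exp(-kl_half(theta)))

n, theta = 100_000, 0.2
ratio = min_samples(n, theta) / (n * math.log(n))  # threshold in units of n log n
```

Under this reading, $\exp\{-\mathsf{KL}(0.5 \,\|\, \theta)\} = 2\sqrt{\theta(1-\theta)}$, so for $\theta = 0.2$ the denominator is $1 - 0.8 = 0.2$ and the threshold works out to $2.5\, n \log n$.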

  39. An insensitivity phenomenon (figure panels: complete graph, ring, small-world). Info and comput. limits are identical for many spatially invariant graphs

  40. Empirical success rate vs. sample size. $n = 100{,}000$, input error rate $= 0.2$; 10 Monte Carlo runs to get each point; each run takes ∼6.4 sec on a Mac Pro

  41. Extension: beyond spatially invariant graphs (figure panels: lines and grids, each with locality radius $r$)

  42. Extension: beyond spatially invariant graphs. [Plot: information limit (sample complexity) vs. locality radius $r$ for rings, lines, and grids; $r$-axis ticks at $n^{0.25}$, $n^{0.5}$, $n^{0.75}$, $n$.] Information and computation limits are achievable by the same algorithm

  44. Extension: beyond pairwise measurements
     • New technologies (e.g. 10x) provide multi-linked reads from same chromosome, not just two
     • Algorithm and theory can be easily extended to see performance gain
     [Plot: total # SNPs touched (ticks at $5 n \log n$, $10 n \log n$, $15 n \log n$) vs. error rate per read (ticks at 0.1, 0.2), for paired reads, triple-linked reads, and infinite-linked reads]

  45. Initial results on real data (haplotype phasing). NA12878 dataset from 10x Genomics; # SNPs $n$: 34240 ∼ 191829, sample size $m$: 102633 ∼ 574189

  46. Concluding remarks
     • Studied community recovery when measurements are highly local – motivated by genome phasing and social networks
     • Information limits can be achieved in linear time for a broad family of models
     Full version of paper available at http://arxiv.org/abs/1602.03828
