Community Recovery in Graphs with Locality, by Yuxin Chen, Govinda Kamath, Changho Suh, and David Tse

slide-1
SLIDE 1

Community Recovery in Graphs with Locality

Yuxin Chen†, Govinda Kamath†, Changho Suh∗, David Tse† Stanford† KAIST∗

slide-3
SLIDE 3

Community recovery / graph clustering

Community structures are common in many social networks

Credit: The Future Buzz Credit: S. Papadopoulos

Community recovery: partition users into several clusters based on their friendships / similarities

slide-7
SLIDE 7

Community recovery in computational biology

A genome phasing problem

  • phase info for each SNP: (1) maternally inherited (2) paternally inherited
  • linking reads: relative phase relation of 2 (or more) SNPs

Haplotype phasing: retrieve phase info of all SNPs from linking reads

slide-8
SLIDE 8

Stochastic block model / censored block model

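Pairwise measurements for any pair (i, j) of nodes:

$$ y_{i,j} \overset{\text{ind.}}{\sim} \begin{cases} P_0, & \text{if } i \text{ and } j \text{ are from the same community} \\ P_1, & \text{else} \end{cases} $$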

slide-11
SLIDE 11

Problem: nodes often have locality

Most prior work: (almost) equally likely to sample between any pair of nodes
– Condon et al., Jalali et al., Chen et al., Abbe et al., Mossel et al., Hajek et al., Chin et al., ...

More realistically: samples come mainly (or exclusively) from nearby nodes

In new technologies like 10x Genomics: (1) n ∼ 10^5 SNPs; (2) linking range ∼ 100 SNPs

slide-12
SLIDE 12

This work: how to deal with measurement locality in community recovery?

slide-13
SLIDE 13

A two-community model

  • n variables we seek: x1, ..., xn ∈ {0, 1}

– encode community membership

[Figure: two communities, labeled xi = 0 and xi = 1]

slide-16
SLIDE 16

Measurement model: random sampling

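  • Constraint graph G

[Figure: constraint graph G, with underbraces labeled r]

  • Random sampling: pick m randomly chosen edges of G
  • Noise model: on each of these m edges (i, j), take an independent sample

$$ y_{i,j} \overset{\text{ind.}}{=} \begin{cases} x_i \oplus x_j, & \text{with prob. } 1 - \theta \\ x_i \oplus x_j \oplus 1, & \text{else} \end{cases} $$

where θ is the measurement error rate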
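The measurement model above can be simulated in a few lines. This is only a sketch under stated assumptions: the constraint graph is taken to be a ring in which each node is linked to its r nearest successors, the m edges are drawn without replacement, and the names `ring_edges` and `sample_measurements` are hypothetical helpers, not taken from the talk.

```python
import random

def ring_edges(n, r):
    """Assumed constraint graph: node i is linked to i+1, ..., i+r (mod n).

    Requires 2*r < n so that every undirected edge is listed exactly once.
    """
    return [(i, (i + d) % n) for i in range(n) for d in range(1, r + 1)]

def sample_measurements(x, edges, m, theta, rng):
    """Pick m distinct edges at random and observe a noisy parity on each.

    Returns a list of (i, j, y) with y = x[i] XOR x[j], flipped with prob. theta.
    Raises ValueError if m exceeds the number of edges.
    """
    samples = []
    for i, j in rng.sample(edges, m):
        y = x[i] ^ x[j]
        if rng.random() < theta:  # measurement error
            y ^= 1
        samples.append((i, j, y))
    return samples

if __name__ == "__main__":
    rng = random.Random(0)
    n, r, theta = 12, 3, 0.2
    x = [rng.randint(0, 1) for _ in range(n)]
    print(sample_measurements(x, ring_edges(n, r), 10, theta, rng))
```

Sampling without replacement is one reading of "pick m randomly chosen edges"; drawing each edge independently would be an equally plausible variant.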

slide-18
SLIDE 18

Modeling locality via constraint graph

Global / long-range measurements

[Figure: constraint graph; randomly picked edges]

Local measurements

[Figure: constraint graph with underbraces labeled r (e.g. r ∼ n^0.4 for 10x); randomly picked edges]

slide-22
SLIDE 22

Information and computation limits

  1. How many samples are needed to recover {xi} reliably (up to global offset)?
  2. How to recover efficiently?

[Figure labels: global samples; local samples; prior works]

Encouraging news: one can obtain efficient recovery within linear time

slide-23
SLIDE 23

Proposed algorithm: a 3-stage linear-time paradigm

slide-24
SLIDE 24

Spectral-Stitching: Stage 1

Start by running spectral method on core complete subgraphs

$$ L = \underbrace{\mathbb{E}[L]}_{\text{rank-1}} + (L - \mathbb{E}[L]) $$

  • Compute rank-1 approximation of L (sample matrix restricted to the subgraph)
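One way to picture the rank-1 step is the sketch below. It is an illustration under assumptions, not the authors' implementation: samples are (i, j, y) triples with y the observed parity, the sample matrix holds +1 for "same" and -1 for "different", and the leading eigenvector is found by plain power iteration. The function name `spectral_labels` and its parameters are hypothetical.

```python
import random

def spectral_labels(nodes, samples, iters=200, seed=0):
    """Estimate 0/1 labels on one subgraph from the sign of the top eigenvector."""
    idx = {v: k for k, v in enumerate(nodes)}
    k = len(nodes)
    # Sample matrix restricted to the subgraph: +1 votes "same", -1 "different".
    L = [[0.0] * k for _ in range(k)]
    for i, j, y in samples:
        if i in idx and j in idx:
            s = 1.0 - 2.0 * y
            L[idx[i]][idx[j]] += s
            L[idx[j]][idx[i]] += s
    # Gershgorin bound: L + shift*I is positive semidefinite, so power
    # iteration converges to the eigenvector of the largest eigenvalue of L.
    shift = max(sum(abs(a) for a in row) for row in L)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    for _ in range(iters):
        w = [sum(L[a][b] * v[b] for b in range(k)) + shift * v[a] for a in range(k)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:  # no samples inside this subgraph
            break
        v = [c / norm for c in w]
    # The labels are only defined up to a global flip within the subgraph.
    return {node: (0 if v[idx[node]] >= 0 else 1) for node in nodes}

if __name__ == "__main__":
    truth = {0: 0, 1: 1, 2: 1, 3: 0}
    obs = [(a, b, truth[a] ^ truth[b]) for a in truth for b in truth if a < b]
    print(spectral_labels(list(truth), obs))
```

The dense matrix and fixed iteration count keep the sketch short; a sparse representation would be the natural choice at scale.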
SLIDE 27

Spectral-Stitching: Stage 1

Split all nodes into overlapping subsets and run spectral methods separately

  • Approximate solution within each subgraph

– Key observation: approx. recovery needs only O(1) samples per node

  • Inconsistent global phases across subgraphs
slide-30
SLIDE 30

Spectral-Stitching: Stage 2

Calibrate phases across subgraphs by checking their correlations

Purpose of Stages 1-2: obtain approximate solution of all nodes
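The calibration step can be sketched as follows, assuming each subgraph estimate is a dict from node to 0/1 label and that consecutive subgraphs share some nodes. A block is flipped when it disagrees with the already-merged labels on more than half of the shared nodes. The name `stitch` and the data layout are hypothetical choices for illustration.

```python
def stitch(blocks):
    """Align the global phase of consecutive overlapping blocks.

    blocks: list of dicts {node: label in {0, 1}}, ordered so that each block
    overlaps the nodes merged so far.
    """
    merged = dict(blocks[0])
    for block in blocks[1:]:
        overlap = [v for v in block if v in merged]
        disagree = sum(block[v] != merged[v] for v in overlap)
        # Negative correlation on the overlap means the block is phase-flipped.
        flip = 1 if 2 * disagree > len(overlap) else 0
        for v, label in block.items():
            merged.setdefault(v, label ^ flip)
    return merged

if __name__ == "__main__":
    first = {0: 0, 1: 1, 2: 1, 3: 0, 4: 1}
    second = {3: 1, 4: 0, 5: 1, 6: 1, 7: 0}  # same structure, opposite phase
    print(stitch([first, second]))
```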

slide-32
SLIDE 32

Spectral-Stitching: Stage 3

Clean up all remaining errors by iterative refinement

  • local majority vote using all samples
  • Key observation: exact recovery needs at least Θ(log n) samples per node
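A minimal sketch of the local majority vote, assuming samples are (i, j, y) triples with y a noisy copy of xi ⊕ xj, so that each sample votes for xi = xj ⊕ y using the current estimate of xj. The synchronous update, the tie rule (keep the current label) and the name `refine` are assumptions made for illustration.

```python
def refine(x_hat, samples, rounds=3):
    """Iteratively re-estimate every label by majority vote over its samples."""
    n = len(x_hat)
    x = list(x_hat)
    for _ in range(rounds):
        votes = [0] * n  # positive total favours label 1, negative favours 0
        for i, j, y in samples:
            votes[i] += 1 if (x[j] ^ y) == 1 else -1
            votes[j] += 1 if (x[i] ^ y) == 1 else -1
        # Synchronous update; a tie keeps the current label.
        x = [x[v] if votes[v] == 0 else int(votes[v] > 0) for v in range(n)]
    return x

if __name__ == "__main__":
    truth = [0, 1, 1, 0, 1, 0]
    edges = [(i, (i + d) % 6) for i in range(6) for d in (1, 2)]
    obs = [(i, j, truth[i] ^ truth[j]) for i, j in edges]
    start = [0, 1, 0, 0, 1, 0]  # one wrong label
    print(refine(start, obs))
```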

slide-35
SLIDE 35

Main results: rings

Theorem:

$$ \text{minimum sample complexity} = \frac{0.5\, n \log n}{1 - \exp\{-\mathrm{KL}(0.5 \,\|\, \theta)\}} $$

Info and comput. limits meet!
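To get a feel for the numbers, the expression can be evaluated directly. The sketch assumes that KL(0.5 ‖ θ) denotes the divergence between Bernoulli(0.5) and Bernoulli(θ) and that logarithms are natural; both are readings of the slide rather than statements from it, and the function names are hypothetical.

```python
import math

def kl_half_vs(theta):
    """KL divergence between Bernoulli(0.5) and Bernoulli(theta), in nats (assumed)."""
    return 0.5 * math.log(0.5 / theta) + 0.5 * math.log(0.5 / (1.0 - theta))

def min_samples(n, theta):
    """Evaluate 0.5 * n * log(n) / (1 - exp(-KL(0.5 || theta)))."""
    return 0.5 * n * math.log(n) / (1.0 - math.exp(-kl_half_vs(theta)))

if __name__ == "__main__":
    n = 100_000
    for theta in (0.05, 0.1, 0.2):
        print(theta, min_samples(n, theta) / (n * math.log(n)))
```

Under these assumptions exp{−KL(0.5 ‖ θ)} simplifies to 2·sqrt(θ(1 − θ)), so θ = 0.2 gives a denominator of 0.2 and a threshold of 2.5 n log n.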

slide-39
SLIDE 39

An insensitivity phenomenon

[Figure: complete graph; ring; small-world]

Info and comput. limits are identical for many spatially invariant graphs

slide-40
SLIDE 40

Empirical success rate vs. sample size

n = 100,000, input error rate = 0.2
10 Monte Carlo runs to get each point
Each run takes ∼6.4 sec on a Mac Pro

slide-42
SLIDE 42

Extension: beyond spatially invariant graphs

[Figure: lines and grids, each with underbraces labeled r]

[Plot: info limit vs. r; axes "locality radius r" (ticks n^0.25, n^0.5, n^0.75, n) and "sample complexity"; curves for rings, lines, grids]

Information and comput. limits achievable by same algorithm

slide-44
SLIDE 44

Extension: beyond pairwise measurements

New technologies (e.g. 10x) provide multi-linked reads from same chromosome, not just two

Algorithm and theory can be easily extended to see performance gain

[Plot: axes "error rate per read" and "total # SNPs touched"; ticks 0.1, 0.2 and 5n log n, 10n log n, 15n log n; curves for paired reads, triple-linked reads, infinite-linked reads]

slide-45
SLIDE 45

Initial results on real data (haplotype phasing)

NA12878 dataset from 10x Genomics
# SNPs n: 34240 ∼ 191829, sample size m: 102633 ∼ 574189

slide-46
SLIDE 46

Concluding remarks

  • Studied community recovery when measurements are highly local

– motivated by genome phasing and social networks

  • Information limits can be achieved in linear time for a broad family of models

[Figure: constraint graphs with underbraces labeled r]

Full version of paper available at http://arxiv.org/abs/1602.03828