SLIDE 1

A sub-linear method for computing columns of functions of sparse matrices

Kyle Kloster and David F. Gleich

Purdue University

March 3, 2014 Supported by NSF CAREER 1149756-CCF

SLIDE 2

Overview

  • 1. f (A): problem description and applications
  • 2. Description of “sub-linear” results
  • 3. The Algorithm for f (A)b
  • 4. Intuition for proof
  • 5. Experiments on real-world social networks

SLIDE 3

The Problem

Functions of Matrices: background

We can apply most functions, e.g. f(x) = cos(x), to any square matrix A, provided f is defined on the eigenvalues of A. One definition: Taylor series!

\[
\cos(x) = \frac{1}{0!} - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots
\qquad
\cos(A) = \frac{I}{0!} - \frac{A^2}{2!} + \frac{A^4}{4!} - \cdots
\]

Then we can think of f(A)b as the action of the operator f(A) on b, or as a diffusion on a graph underlying the matrix A.
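For intuition, here is a minimal Python sketch (not the talk's method) that evaluates a matrix function by truncating its Taylor series, exactly as in the cos(A) example above; the toy matrix and the truncation length are made-up illustrative values.

```python
import math
import numpy as np

def taylor_matrix_function(A, coeffs):
    """Return sum_k coeffs[k] * A^k for a small dense square matrix A."""
    term = np.eye(A.shape[0])        # A^0 = I
    total = coeffs[0] * term
    for c in coeffs[1:]:
        term = term @ A              # advance to the next power A^k
        total += c * term
    return total

# cos(A) = I/0! - A^2/2! + A^4/4! - ...
cos_coeffs = [((-1) ** (k // 2)) / math.factorial(k) if k % 2 == 0 else 0.0
              for k in range(20)]
A = np.array([[0.0, 0.5],
              [0.5, 0.0]])
print(taylor_matrix_function(A, cos_coeffs))   # approximately the matrix cosine of A
```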

SLIDE 4

The Problem

Functions of Matrices: applications

Action: f(x) = e^x: the ODE

\[
\frac{dx}{dt} = Ax, \qquad x(0) = x_0
\]

has solution x(t) = exp{tA} x_0.

f(x) = x^{1/p}: if P(t) is the transition matrix for a Markov process and P(1) describes the process over a year, then P^{1/12} describes a month.

SLIDE 5

The Problem

Functions of Matrices: applications

Action: f(x) = e^x: the ODE

\[
\frac{dx}{dt} = Ax, \qquad x(0) = x_0
\]

has solution x(t) = exp{tA} x_0.

f(x) = x^{1/p}: if P(t) is the transition matrix for a Markov process and P(1) describes the process over a year, then P^{1/12} describes a month.

Diffusion: f(x) = (1 − αx)^{-1}: the resolvent yields the PageRank diffusion; f(P)e_i is interpreted as the nodes' importance to node i. f(x) = e^{tx}: e^{tP}e_i, the heat kernel diffusion, offers an alternative ranking of the nodes' importance.
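A toy sketch of the two diffusions just named, computed with dense linear algebra on a made-up 3-node graph (the seed node, α, and t are illustrative values; the talk's whole point is to avoid this dense computation on large graphs):

```python
import numpy as np
from scipy.linalg import expm

# Column-stochastic random-walk matrix P of a 3-node cycle (toy example).
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
e0 = np.array([1.0, 0.0, 0.0])          # seed the diffusion at node 0

alpha, t = 0.85, 5.0
# PageRank-style diffusion: (I - alpha P)^{-1} e_0
pagerank_diffusion = np.linalg.solve(np.eye(3) - alpha * P, e0)
# Heat kernel diffusion: exp{t P} e_0
heat_kernel_diffusion = expm(t * P) @ e0
print(pagerank_diffusion, heat_kernel_diffusion)
```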

SLIDE 6

The Problem

Parameters of f (A)b

A: Original motivation: A is a normalized version of an adjacency matrix from a social network, e.g. the Laplacian or random-walk matrix: sparse, small diameter, stochastic, degree distribution follows a power law. Generalized: any nonnegative A with ‖A‖_1 ≤ 1.

b: Originally b = e_i, i.e. compute a column f(A)e_i. Generalized: b can be any sparse, stochastic vector.

f(·): Originally f(x) = e^x or (1 − αx)^{-1}. Generalized: can be any function decaying "fast enough".

SLIDE 7

The Problem

Columns of the Matrix Exponential

exp{A} is used for link-prediction, node centrality, and clustering. Why?

\[
\exp\{A\} = \sum_{k=0}^{\infty} \frac{1}{k!} A^k
\]

(A^k)_{ij} gives the number of length-k walks from i to j, so... large entries of exp{A} denote "important" nodes / links. Used for link-prediction, node ranking, clustering.

SLIDE 8

The Problem

Columns of the Matrix Exponential

exp{A} is used for link-prediction, node centrality, and clustering. Why?

\[
\exp\{A\} = \sum_{k=0}^{\infty} \frac{1}{k!} A^k
\]

(A^k)_{ij} gives the number of length-k walks from i to j, so... large entries of exp{A} denote "important" nodes / links. Used for link-prediction, node ranking, clustering.

exp{A} is common, but other f(A) can be used. PageRank can be defined from the resolvent:

\[
(I - \alpha A)^{-1} = \sum_{k=0}^{\infty} \alpha^k A^k
\]

→ replace 1/k! with other coefficients?

SLIDE 9

The Problem

f (A) as weighted sum of walks

For f(A) = e^{tA} and f(A) = (1 − αA)^{-1}, how are walks weighted?

\[
f(A)b = \bigl( f_0 I + f_1 A + f_2 A^2 + f_3 A^3 + \cdots \bigr)\, b
\]

[Figure: weight given to walks of each length (y-axis: weight, log scale; x-axis: walk length up to 100), for t = 1, 5, 15 and α = 0.85, 0.99.]
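The curves in that figure are just the Taylor coefficients of the two functions; a few lines of Python (with illustrative parameter values taken from the legend) reproduce them:

```python
import math

def exp_weight(t, k):            # coefficient of A^k in e^{tA}
    return t ** k / math.factorial(k)

def resolvent_weight(alpha, k):  # coefficient of A^k in (1 - alpha*A)^{-1}
    return alpha ** k

for k in (1, 5, 20, 60, 100):
    print(k, exp_weight(5.0, k), resolvent_weight(0.85, k))
```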

SLIDE 10

The Problem

Big Graphs from Social Networks

We’ve seen the computation (f); what does the domain of inputs look like?

  • Social networks like Twitter, YouTube, Friendster, LiveJournal
  • Large: n = 10^6, 10^7, 10^9+
  • Sparse: |E| = O(n), often ≤ 50n
  • Difficulty: "small world" property: diameter ≈ 4 (!)
  • Helpful: power-law degree distribution (picture)

SLIDE 11

The Problem

Power-law degree distribution

[Figure: log-log plot of frequency vs. outdegree, showing the power-law degree distribution.]

[Laboratory for Web Algorithms, http://law.di.unimi.it/index.php]

SLIDE 12

The Problem

Difficulties with current methods: Sidje, TOMS 1998; Al-Mohy and Higham, SISC 2011

Leading methods for f(A)b use Krylov or Taylor approaches: "basically" repeated mat-vecs. The "small world" property (graph diameter ≤ 4) ⇒ repeated mat-vecs fill in rapidly (see picture). These methods are not designed specifically for sparse networks.

SLIDE 13

The Problem

Fill-in from repeated matvecs

Vectors P^k e_i for k = 1, 2, 3, 4; n = 1133.

SLIDE 14

The Problem

f (P)ei is a localized vector

[Figure: the column of exp{P} produced by the previous slide's matvecs; x-axis: vector index, y-axis: magnitude of entry.]

SLIDE 15

The Problem

Local Method

New method: avoid mat-vecs! → use a local method. Local algorithms run in time proportional to the size of the output: a sparse solution vector means a small runtime. Instead of matvecs, we do specially-selected vector adds using a relaxation method.

SLIDE 16

Main results

Main Result 1

Theorem 1 [action of f on b]: Given nonnegative A satisfying ‖A‖_1 ≤ 1, with power-law degree distribution and max degree d, and sparse stochastic b, our method computes x ≈ f(A)b such that ‖f(A)b − x‖_1 < ε in work

\[
\text{work}(\varepsilon) = O\!\left( (1/\varepsilon)^{C_f} \log(1/\varepsilon)\, d^2 \log(d)^2 \right),
\]

i.e. the "work" "scales as" d^2 log(d)^2 in the graph size, for any function f that decays "fast enough". The constant C_f depends on how quickly the Taylor coefficients of f decay.

SLIDE 17

Main results

Main Result 1

Theorem 1 [action of f on b]: Given nonnegative A satisfying ‖A‖_1 ≤ 1, with power-law degree distribution and max degree d, and sparse stochastic b, our method computes x ≈ f(A)b such that ‖f(A)b − x‖_1 < ε in work

\[
\text{work}(\varepsilon) = O\!\left( (1/\varepsilon)^{C_f} \log(1/\varepsilon)\, d^2 \log(d)^2 \right),
\]

i.e. the "work" "scales as" d^2 log(d)^2 in the graph size, for any function f that decays "fast enough". The constant C_f depends on how quickly the Taylor coefficients of f decay.

  • For f(x) = (1 − αx)^{-1}, C_f = 1/(1 − α) (note: α ∈ (0, 1)).
  • For f(x) = e^x, C_f = 3/2.
  • For f(x) = x^{1/p}, C_f = 3p/(5p − 1) (note: p ∈ (0, 1)).

SLIDE 18

Main results

Main Result 2

Theorem 2 [diffusion of f across a graph]: Given column-stochastic A and b, x̃ ≈ f̃(tA)b can be computed such that ‖f̃(P)b − x̃‖_∞ < ε in work

\[
\text{work}(\varepsilon) = O\!\left( \frac{2 f(t)}{\varepsilon} \right).
\]

(Remark: the 'tilde' denotes a degree-normalized version of the diffusion: D^{-1} exp{tP} b, for example. We normalize by degrees to adjust for the influence of the stationary distribution of P.)

Corollary: f(A)b is a local vector. Proof: because sublinear work is done, f(A)b cannot have O(n) nonzeros.

SLIDE 19

Our method: Nexpokit

Overview

Outline of Nexpokit method (our second method, hk-relax, is related)

  • 1. Express f (A)b via a Taylor polynomial
  • 2. Form large linear system out of Taylor terms
  • 3. Use sparse solver to approximate each term’s largest entries
  • 4. Combine approximated terms into a solution

SLIDE 20

Our method: Nexpokit

In terms of Taylor terms

Taylor polynomial:

\[
f(A)b \approx \bigl( f_0 I + f_1 A + f_2 A^2 + f_3 A^3 + \cdots + f_N A^N \bigr)\, b
\]

Compute terms recursively:

\[
v_k = f_k A^k e_i = \frac{f_k}{f_{k-1}} A \bigl( f_{k-1} A^{k-1} e_i \bigr)
\quad\Longrightarrow\quad
v_k = \frac{f_k}{f_{k-1}} A v_{k-1}
\]

Then f(A)b ≈ v_0 + v_1 + · · · + v_{N−1} + v_N. (But we want to avoid computing the v_j in full...)
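A dense-vector sketch of that recursion, for intuition only (the helper name and the toy matrix are made up, and the actual method never forms the v_k in full):

```python
import math
import numpy as np

def taylor_terms_fAb(A, b, coeffs):
    """Approximate f(A) b as sum_k v_k, with v_k = (f_k / f_{k-1}) A v_{k-1}."""
    v = coeffs[0] * b                                # v_0 = f_0 b
    x = v.copy()
    for k in range(1, len(coeffs)):
        v = (coeffs[k] / coeffs[k - 1]) * (A @ v)    # v_k from v_{k-1}
        x += v
    return x

A = np.array([[0.0, 0.5],
              [0.5, 0.0]])
b = np.array([1.0, 0.0])
coeffs = [1.0 / math.factorial(k) for k in range(15)]  # f_k = 1/k!
print(taylor_terms_fAb(A, b, coeffs))                  # approximately exp(A) b
```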

SLIDE 21

Our method: Nexpokit

Forming a linear system

So we convert the Taylor polynomial into a linear system. For simplicity’s sake, we use the example of exp{A}ei here.

SLIDE 22

Our method: Nexpokit

Forming a linear system

So we convert the Taylor polynomial into a linear system. For simplicity's sake, we use the example of exp{A}e_i here.

\[
\begin{bmatrix}
I      &        &        &        &   \\
-A/1   & I      &        &        &   \\
       & -A/2   & \ddots &        &   \\
       &        & \ddots & I      &   \\
       &        &        & -A/N   & I
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}
=
\begin{bmatrix} e_i \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]

where we use the identity v_k = (1/k) A v_{k−1} (which comes from v_k = (f_k/f_{k−1}) A v_{k−1}, since f_k = 1/k!, so f_k/f_{k−1} = (k−1)!/k! = 1/k).

Then exp{A}e_i ≈ v_0 + v_1 + · · · + v_{N−1} + v_N
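To make the construction concrete, here is a small scipy sketch that builds this block system for a toy matrix and solves it directly; the helper name and the toy data are illustrative, and the talk's method never forms M explicitly (that is the point of the next slides):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def build_block_system(A, i, N):
    """Build M and the right-hand side [e_i; 0; ...; 0] for exp{A} e_i."""
    n = A.shape[0]
    I = sp.identity(n, format="csc")
    blocks = [[None] * (N + 1) for _ in range(N + 1)]
    for j in range(N + 1):
        blocks[j][j] = I
        if j >= 1:
            blocks[j][j - 1] = -A / j       # sub-diagonal block -A/j
    M = sp.bmat(blocks, format="csc")
    rhs = np.zeros(n * (N + 1))
    rhs[i] = 1.0                            # e_i in the first block
    return M, rhs

A = sp.csc_matrix([[0.0, 0.5],
                   [0.5, 0.0]])
M, rhs = build_block_system(A, i=0, N=10)
v = spsolve(M, rhs)                          # stacked [v_0; v_1; ...; v_N]
print(v.reshape(11, 2).sum(axis=0))          # approximately column 0 of exp{A}
```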

SLIDE 23

Our method: Nexpokit

Sparse solver: Gauss Southwell

Basic idea of Gauss-Southwell (GS): solving Mx = b when x is "effectively sparse" (i.e. a localized vector).

  • 1. Set x^(0) = 0, r^(0) = b, then iterate:
  • 2. At step k, relax the maximal entry of r^(k) (denote its value m^(k) and its index i), adding it to x^(k):

x^(k+1) = x^(k) + m^(k) · e_i

  • 3. Add the corresponding column of M to the residual:

r^(k+1) = r^(k) − m^(k) · M(:, i)

(A code sketch of this iteration follows below.)
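A minimal dense sketch of that iteration, assuming M has ones on its diagonal (as the block system on the next slides does); this is a generic illustration, not the authors' released code:

```python
import numpy as np

def gauss_southwell(M, b, tol=1e-8, max_steps=100_000):
    """Relax the largest residual entry of M x = b until ||r||_inf < tol."""
    x = np.zeros(len(b))
    r = b.astype(float).copy()          # r^(0) = b, since x^(0) = 0
    for _ in range(max_steps):
        i = int(np.argmax(np.abs(r)))   # index of the maximal residual entry
        m = r[i]
        if abs(m) < tol:
            break
        x[i] += m                       # relax entry i of the solution
        r -= m * M[:, i]                # subtract m times column i of M
    return x, r
```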

SLIDE 24

Our method: Nexpokit

NEXPOKIT

Apply GS to our linear system, M v̄ = ē_i:

\[
\begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ \vdots \\ r_N \end{bmatrix}
=
\begin{bmatrix} e_i \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
-
\begin{bmatrix}
I      &        &        &        &   \\
-A/1   & I      &        &        &   \\
       & -A/2   & \ddots &        &   \\
       &        & \ddots & I      &   \\
       &        &        & -A/N   & I
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}
\]

The update can be simplified to a block-wise update:

\[
r^{(k+1)} = \bigl( r^{(k)} - m^{(k)} \, e_j \otimes e_i \bigr) + \frac{m^{(k)}}{j+1} \, A(:, i)
\quad (1)
\]

where the scaled column A(:, i) is added to block j + 1 of the residual. No component of the large linear system is formed explicitly:

  • residual vector stored in a heap (alternative: queue with threshold)
  • matrix M not formed at all
  • blocks v_j not stored separately, but accumulated as one solution vector x = Σ_j v_j

(A code sketch of this relaxation follows below.)
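Here is a sketch of that relaxation in Python for exp{A}e_seed, keeping the residual in a lazily-updated max-heap keyed by (node, block), following update (1) above; the function name, data layout (A_cols[i] lists the nonzeros of column i), and stopping rule are illustrative choices, not the released NEXPOKIT implementation:

```python
import heapq
from collections import defaultdict

def relax_exp_column(A_cols, seed, N=10, tol=1e-4):
    """A_cols[i] = list of (row, value) nonzeros of column i of A, with ||A||_1 <= 1."""
    x = defaultdict(float)               # accumulated solution, x = sum_j v_j
    r = defaultdict(float)               # residual entries, keyed by (node, block)
    r[(seed, 0)] = 1.0
    heap = [(-1.0, (seed, 0))]           # max-heap via negated values
    sum_r = 1.0                          # ||r||_1 (entries stay nonnegative)
    while heap and sum_r > tol:
        neg_m, key = heapq.heappop(heap)
        m = -neg_m
        if r.get(key, 0.0) != m:         # stale heap entry; skip it
            continue
        node, j = key
        del r[key]                       # relax: move this mass into the solution
        x[node] += m
        sum_r -= m
        if j < N:                        # push the scaled column into block j + 1
            for u, a_ui in A_cols[node]:
                add = m * a_ui / (j + 1)
                new_key = (u, j + 1)
                r[new_key] += add
                sum_r += add
                heapq.heappush(heap, (-r[new_key], new_key))
    return x                             # x[i] is approximately (exp{A} e_seed)[i]
```

On a real graph, A_cols would come from the column structure of the normalized adjacency matrix; the queue-with-threshold alternative mentioned above (GSQ in the experiments) would replace the heap with a simple queue.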

SLIDE 25

Proof

Outline of proof

Initial residual is r = e_i, which has ‖r^(0)‖_1 = 1, and it decreases at each step. We show that

  • 1. the decay of ‖r^(k)‖_1 depends on its max value m^(k)
  • 2. the max value m^(k) is bounded below by the average value of r
  • 3. the average value of r depends on the number of nonzeros in r
  • 4. the growth of nnz(r) depends on the degree distribution
  • 5. a power-law degree distribution implies nnz(r) grows slowly, so
  • 6. ‖r‖_1 → 0 at a certain minimum speed!

SLIDE 26

Proof

Decay of r1

Residual r = [r_0; r_1; · · · ; r_N] is indexed by node and block section: r(i, j). For our special linear system, the GS update reduces to: during step k, (1) delete r(i, j)^(k) from r and add it to x_i, our approximation; (2) add the scaled column (m^(k)/j) · A(:, i) to section j of the residual.

Taking the 1-norm of (1) shows

\[
\| r^{(k+1)} \|_1 \le \| r^{(k)} \|_1 - m^{(k)} \Bigl( 1 - \tfrac{1}{j} \Bigr)
\]

Note the (1 − 1/j) factor appears because we're looking specifically at e^x. For the resolvent, f(x) = (1 − αx)^{-1}, this factor would be (1 − α) instead.

SLIDE 27

Proof

Number of nonzeros

The largest entry, m^(k) = r(i, j), is bounded below by the average value of the residual: m^(k) = r(i, j) > ‖r‖_1 / (# nonzeros in r). But we can bound nnz(r) := (# of nonzeros in r) based on the degree of the column of A that we add to the residual each step.

SLIDE 28

Proof

Number of nonzeros

The largest entry, m^(k) = r(i, j), is bounded below by the average value of the residual: m^(k) = r(i, j) > ‖r‖_1 / (# nonzeros in r). But we can bound nnz(r) := (# of nonzeros in r) based on the degree of the column of A that we add to the residual each step. Each iteration we can add no more nonzeros to r than the largest degree among all unvisited nodes. Usually the best we can say is that this is upper bounded by d · (# iterations), where d := d_max, because it's possible every node has max degree. But with the power-law assumption ...

SLIDE 30

Proof

Power-law degree distribution

With the power-law assumption, we know that the t-th largest degree, d_t, is bounded by d_t ≤ C d · t^{−β} for some β near 1 and some constant C. After k iterations, nnz(r) is bounded by the sum of the degrees of the new vertices visited in those k iterations. By step k, this is at most

\[
\mathrm{nnz}(r) \;\le\; \sum_{t=1}^{k} d_t \;\le\; \sum_{t=1}^{k} C d \cdot t^{-1}
\]

In fact, after the first d iterations, d_t is just a small constant, c. Then the sum grows no faster than d log(d) + c · k, so nnz(r) grows like c · k for c ≈ 1 instead of d · k (!).

SLIDE 31

Proof

Convergence

We had ‖r^(k+1)‖_1 ≤ ‖r^(k)‖_1 − m^(k)(1 − 1/j).

The power-law assumption allows the bound −m^(k) ≤ −‖r^(k)‖_1 / (C² + c·k). Then

\[
\begin{aligned}
\| r^{(k+1)} \|_1
&\le \| r^{(k)} \|_1 \left( 1 - \frac{2/3}{C^2 + c k} \right) \\
&\le \| r^{(k)} \|_1 \exp\!\left\{ -\frac{2}{3} \cdot \frac{1}{C^2 + c k} \right\} \\
&\le \| r^{(0)} \|_1 \exp\!\left\{ -\frac{2}{3} \sum_{t=0}^{k} \frac{1}{C^2 + c t} \right\} \\
&\le \exp\!\left\{ -\frac{2}{3} \log(k + C) \right\}
\end{aligned}
\]

so ‖r^(k+1)‖_1 ≤ (k + C)^{−2/3}. (See the paper cited at the end for a precise completion of the proof.)

SLIDE 32

Experimental Results

Runtime v. Graph Size

[Figure: runtime (secs) vs. |E| + |V| on log-log axes, comparing TSGS, TSGSQ, EXPV, MEXPV, and TAYLOR.]

“GSQ” is a version of our Gauss-Southwell method that stores the residual vector in a queue instead of a heap.

SLIDE 33

Experimental Results

Runtime on larger networks

[Figure: runtime (secs) per trial, comparing EXPMV, GSQ, and GS.]

For ljournal-2008, n = 5,363,260, average degree = 14.7.

SLIDE 34

Experimental Results

Runtime on larger networks

[Figure: runtime (secs) per trial, comparing EXPMV, GSQ, and GS.]

For webbase-2001, n = 118,142,155, average degree = 8.6.

SLIDE 35

Experimental Results

Code and Further Details

Code available at http://www.cs.purdue.edu/homes/dgleich/codes/nexpokit

For details and references, see our paper at http://arxiv.org/abs/1310.3423
