SLIDE 1

Adaptive Diffusions for Scalable and Robust Learning over Graphs

ICASSP 2017

Georgios B. Giannakis, A. N. Nikolakopoulos, and D. K. Berberidis

  • Dept. of ECE and Digital Tech. Center, University of Minnesota

Acknowledgments: NSF 1500713, 1711471, NIH 1R01GM104975-01

Shanghai, P. R. China, July 2, 2018
SLIDE 2

Motivation


Graph representations

Objective: Learn values or labels of graph nodes, as in, e.g., citation networks

[Figure: graphs arising as real networks and from data similarities]

Challenges: Graphs can be huge and sparsely labeled

  • Due to privacy, battery cost, (un)reliable human annotators, …
SLIDE 3

Problem statement


 Graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$

  • Weighted adjacency matrix $\mathbf{W}$
  • Label per node

 Topology given or identifiable

  • Given in, e.g., WSNs and social networks
  • Identifiable via, e.g., nodal similarities

Goal: Given labels on a subset of nodes, learn the labels of the unlabeled nodes

SLIDE 4

Work in context


 Non-parametric semi-supervised learning (SSL) on graphs

  • Graph partitioning [Joachims et al. '03]
  • Manifold regularization [Belkin et al. '06]
  • Label propagation [Zhu et al. '03, Bengio et al. '06]
  • Bootstrapped label propagation [Cohen '17]
  • Competitive infection models [Rosenfeld '17]

 Node embedding + classification of vectors

  • Node2vec [Grover et al. '16]
  • Planetoid [Yang et al. '16]
  • DeepWalk [Perozzi et al. '14]

 Graph convolutional networks (GCNs)

  • [Atwood et al. '16], [Kipf et al. '16]
SLIDE 5

Random walks on graphs


 Position of random walker at step $k$: $X_k \in \mathcal{V}$

  • Transition probabilities $\Pr\{X_k = i \mid X_{k-1} = j\} = W_{ij}/d_j$, collected in the column-stochastic matrix $\mathbf{H} = \mathbf{W}\mathbf{D}^{-1}$

 Steady-state probabilities $\pi_i \propto d_i$

  • Presume an undirected, connected, and non-bipartite graph
  • Not informative for SSL (independent of where the walk started)

 Step-$k$ landing probabilities $\mathbf{p}^{(k)}$, where $p_i^{(k)} = \Pr\{X_k = i\}$ (sketched below)

  • Measure the influence of the root nodes on every node in the graph - informative for SSL!
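To make the recursion concrete, here is a minimal sketch (not from the slides) that computes the step-$k$ landing probabilities with sparse matrix-vector products; the names `landing_probabilities`, `v_root`, and `n_steps` are illustrative:

```python
# Minimal sketch: step-k landing probabilities via sparse matvecs.
# Assumes an undirected graph with scipy.sparse adjacency matrix W.
import numpy as np
import scipy.sparse as sp

def landing_probabilities(W, v_root, n_steps):
    """Return [p^(1), ..., p^(K)] with p^(k) = H p^(k-1), H = W D^{-1}, K = n_steps."""
    d = np.asarray(W.sum(axis=0)).ravel()      # node degrees
    H = W @ sp.diags(1.0 / d)                  # column-stochastic transition matrix
    probs, p = [], v_root.astype(float)
    for _ in range(n_steps):
        p = H @ p                              # one sparse matvec: O(|E|) per step
        probs.append(p)
    return probs
```

Running $K$ steps therefore costs on the order of $K|\mathcal{E}|$ operations, which is what keeps the diffusions below scalable.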
SLIDE 6

Landing probabilities for SSL


 Random walk per class $c$, rooted at that class's labeled nodes

 Family of per-class diffusions $\mathbf{f}_c = \sum_{k=1}^{K} \theta_k \, \mathbf{p}_c^{(k)}$

  • Valid pmf when $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_K]^{\top}$ lies in the $K$-dim probability simplex

 Max-likelihood per-node classifier $\hat{y}_i = \arg\max_c \, [\mathbf{f}_c]_i$ (see the sketch below)

  • Per-step landing probabilities found recursively, $\mathbf{p}_c^{(k)} = \mathbf{H}\,\mathbf{p}_c^{(k-1)}$, by multiplying with the sparse $\mathbf{H}$
  • Initial ("root") probability distribution $\mathbf{p}_c^{(0)} = \mathbf{v}_c$
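A hedged sketch of the resulting classifier, reusing `landing_probabilities` from the previous snippet; the uniform root distribution per class follows the slides' description, while `class_seeds` is an illustrative name:

```python
# Sketch: diffusion-based SSL classifier, f_c = sum_k theta_k p_c^(k),
# with per-node ML decision argmax_c [f_c]_i.
import numpy as np

def diffusion_classify(W, class_seeds, theta):
    """class_seeds: list (one entry per class) of labeled-node index arrays."""
    N = W.shape[0]
    F = []
    for seeds in class_seeds:
        v = np.zeros(N)
        v[seeds] = 1.0 / len(seeds)            # uniform "root" distribution p^(0) = v_c
        P = landing_probabilities(W, v, len(theta))
        F.append(sum(t * p for t, p in zip(theta, P)))
    return np.argmax(np.vstack(F), axis=0)     # predicted class per node
```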
SLIDE 7

Special case 1: Personalized page rank (PPR) diffusion [Lin‘10]

  • Pmf of random walk with restart probability 1-α ; in steady-state

Unifying diffusion-based SSL

7

Special case 2: Heat kernel (HK) diffusion [Chung’07]  HK and PPR have fixed parameters Our key contribution: Graph- and label-adaptive selection of

  • “Heat’’ flowing from roots after time t ; in steady-state
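For reference, the fixed coefficient profiles that recover PPR and HK as special cases of $\mathbf{f}_c = \sum_k \theta_k \mathbf{p}_c^{(k)}$, truncated to $K$ steps and renormalized onto the simplex (`alpha` and `t` are the usual hyperparameters; default values here are illustrative):

```python
# Sketch: PPR and HK as fixed theta profiles over K landing probabilities.
import numpy as np
from math import exp, factorial

def theta_ppr(K, alpha=0.9):
    theta = np.array([(1 - alpha) * alpha**k for k in range(1, K + 1)])
    return theta / theta.sum()                 # renormalize after truncation at K

def theta_hk(K, t=5.0):
    theta = np.array([exp(-t) * t**k / factorial(k) for k in range(1, K + 1)])
    return theta / theta.sum()
```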
SLIDE 8

Adaptive diffusions

 AdaDIF scalable to large-scale graphs ($K \ll N$)

 Per class $c$, learn $\boldsymbol{\theta}_c$ by solving a linear-quadratic program over the simplex (a solver sketch follows below):

  • Linear fitting term built from the "differential" landing probabilities and the normalized label-indicator vector
  • Quadratic smoothness regularizer over the graph, weighted by $\lambda$
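One way to solve such a simplex-constrained linear-quadratic program is projected gradient descent. The sketch below assumes the quadratic and linear terms have already been assembled into a PSD matrix `A` and vector `b`; their exact construction follows the paper, not this snippet:

```python
# Sketch: min_theta  theta^T A theta - b^T theta  s.t.  theta in the K-dim simplex.
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def solve_simplex_qp(A, b, n_iters=500):
    K = len(b)
    step = 0.5 / np.linalg.norm(A, 2)          # 1/L with L = 2||A||_2 (grad Lipschitz)
    theta = np.full(K, 1.0 / K)                # start at the simplex center
    for _ in range(n_iters):
        grad = 2.0 * A @ theta - b
        theta = project_simplex(theta - step * grad)
    return theta
```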

SLIDE 9

AdaDIF in a nutshell


SLIDE 10

Interpretation and complexity


 For $\lambda \to \infty$ (smoothness-only):

  • Weight concentrates on the last landing probability $\mathbf{p}^{(K)}$

 For $\lambda = 0$ (fit-only):

  • Weight concentrates on the first few landing probabilities
  • Intuition: very short walks visit similarly labeled nodes

 AdaDIF targets a "sweet spot" between the two

  • The simplex constraint promotes sparsity on $\boldsymbol{\theta}$

 For $K \ll N$, per-class complexity is $\mathcal{O}(K|\mathcal{E}|)$ thanks to the sparsity of $\mathbf{H}$

  • Same as non-adaptive HK and PPR; also parallelizable across classes
  • Reflect on PPR and Google … just avoid $K \gg$
SLIDE 11

Boosting AdaDIF


 Dictionary of $D \ll K$ diffusions

  • Dictionary may include PPR, HK, and more

 Unconstrained diffusions (relax the simplex constraint)

  • Retain the hyperplane constraint $\mathbf{1}^{\top}\boldsymbol{\theta} = 1$ to avoid the all-zero solution
  • Closed-form solution at reduced complexity (sketched below)
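With the simplex relaxed to the hyperplane $\mathbf{1}^{\top}\boldsymbol{\theta} = 1$, the same linear-quadratic objective admits a closed-form KKT solution; a sketch, again with placeholder `A` and `b`:

```python
# Sketch: min_theta theta^T A theta - b^T theta  s.t.  1^T theta = 1,
# solved exactly via the KKT linear system (assumed nonsingular).
import numpy as np

def solve_hyperplane_qp(A, b):
    K = len(b)
    ones = np.ones((K, 1))
    KKT = np.block([[2.0 * A, ones],
                    [ones.T, np.zeros((1, 1))]])
    sol = np.linalg.solve(KKT, np.concatenate([b, [1.0]]))
    return sol[:K]                             # drop the Lagrange multiplier
```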
SLIDE 12


On the choice of K

 Message: Increasing $K$ does not help in distinguishing between classes

  • Large $K$ may even degrade performance due to over-parametrization

  • Definition. Let $\mathbf{v}_+$ and $\mathbf{v}_-$ denote the seed vectors for nodes of class "+" and "−", initializing the landing-probability vectors collected in matrices $\mathbf{P}_+$ and $\mathbf{P}_-$. The $\delta$-distinguishability threshold of the diffusion-based classifier is the smallest integer $K_\delta$ beyond which the step-$k$ landing probabilities of the two classes differ by at most $\delta$.

  • Theorem. For any diffusion-based classifier with coefficients constrained to a probability simplex of appropriate dimensions, $K_\delta$ is upper-bounded in terms of $\delta$ and the eigenvalues of the normalized graph Laplacian, taken in ascending order.

SLIDE 13


In practice

SLIDE 14

Contributions and links with GSP

 Different losses and regularizers, including those for outlier resilience
 Multiclass case readily addressed
 AdaDIF's simplex constraint can afford sparse coefficient profiles
 Rigorous analysis using basic graph properties

 AdaDIF vis-à-vis graph filters [Sandryhaila-Moura '13, Chen et al. '14]

  • Random-walk interpretation
  • Search-space reduction

 AdaDIF vis-à-vis GCNs

  • No feature inputs: operates naturally in graph-only settings
  • Small number of constrained parameters: reduced overfitting
  • Simpler and easily parallelizable training: no backpropagation
SLIDE 15

Real data tests


 Real graphs

  • Citation networks
  • Blog networks
  • Protein interaction network

 HK and PPR run with K = 30 for convergence

  • AdaDIF relies on just K = 15
  • Micro-F1: node-centric accuracy measure
  • Macro-F1: class-centric accuracy measure
SLIDE 16

Multiclass graphs


 State-of-the-art performance

  • Large-margin improvement on Citeseer
SLIDE 17

Multilabel graphs


 AdaDIF approaches Node2vec's Micro-F1 accuracy on PPI and BlogCatalog

  • Significant improvement over non-adaptive PPR and HK on all graphs

 AdaDIF achieves state-of-the-art Macro-F1 performance

 Number of labels per node assumed known (typical)

  • Evaluate accuracy of the top-ranking classes

SLIDE 18

Runtime comparison


 AdaDIF can afford much lower runtimes

  • Even without parallelization!
SLIDE 19

Leave-one-out fitting loss


 Quantifies how well each labeled node is predicted by the rest (a direct rendering follows below)

 Compact form amenable to efficient evaluation

 Diffusion parameters $\boldsymbol{\theta}_c$ fit by minimizing this loss

  • Per-node predictions obtained via different random walks, each rooted at all labeled nodes except the one held out
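A direct, unoptimized rendering of this leave-one-out loss for one class, reusing `landing_probabilities` from above; the slides' compact form evaluates the same quantity more efficiently, and the 0/1 label-indicator entry here is a simplification:

```python
# Sketch: leave-one-out fitting loss for class c. Each labeled node is predicted
# by a diffusion rooted at the *other* labeled nodes of that class.
import numpy as np

def loo_loss(W, labeled, labels, c, theta):
    N = W.shape[0]
    seeds = [i for i in labeled if labels[i] == c]
    loss = 0.0
    for i in labeled:
        roots = [j for j in seeds if j != i]   # hold node i out of the roots
        v = np.zeros(N)
        v[roots] = 1.0 / len(roots)
        P = landing_probabilities(W, v, len(theta))
        f_i = sum(t * p[i] for t, p in zip(theta, P))
        r_i = float(labels[i] == c)            # (simplified) label-indicator entry
        loss += (r_i - f_i) ** 2
    return loss
```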

SLIDE 20

Anomaly identification - removal


 Joint optimization over the diffusion coefficients and the outliers

  • Model outliers as large residuals, captured by the nonzero entries of sparse outlier vectors
  • Group sparsity across classes, i.e., force consensus among classes regarding which nodes are outliers

 Alternating minimization converges to a stationary point

  • While not converged, iterate: compute residuals; update the outliers via row-wise soft-thresholding (sketched below)
  • Finally, remove the identified outliers from the labeled set and predict using the cleansed diffusions
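A sketch of the row-wise soft-thresholding step (the group-lasso proximal operator) used inside the alternating minimization; `R` holds per-class residuals with one row per labeled node, and `lam` is the threshold (both names illustrative):

```python
# Sketch: rows of R whose norm survives the threshold mark outlier nodes,
# enforcing consensus across classes (columns) on which nodes are anomalous.
import numpy as np

def rowwise_soft_threshold(R, lam):
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * R                           # zero rows: inliers; nonzero rows: outliers
```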

SLIDE 21

Testing classification performance


 Anomalies injected into the Cora graph

  • Go through each entry of the label vector
  • With some fixed probability, draw a random label
  • Replace the true label with the random one

 For a fixed threshold, accuracy improves as falsely labeled samples are removed

  • Accuracy slightly lower when no anomalies are present, since only useful samples get removed (false alarms)
SLIDE 22

Testing anomaly detection performance


 ROC curve: Probability of detection vs probability of false alarms

  • As expected, performance improves as the label-corruption probability decreases
SLIDE 23

Research outlook

 Investigate different losses and diverse regularizers
 Further boost accuracy with nonlinear diffusion models
 Effect reduced complexity and memory requirements via approximations
 Online AdaDIF for dynamic graphs