Statistical Inference for Large Directed Graphs with Communities of - - PowerPoint PPT Presentation

▶

Feb 14, 2023 293 likes •652 views

Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal Outline Communities of Interest : overview Why a probabilistic model? Bayesian Stochastic Blockmodels Example Ongoing work

SLIDE 1

Statistical Inference for Large Directed Graphs with Communities of Interest

Deepak Agarwal

SLIDE 2

Outline

Communities of Interest : overview
Why a probabilistic model?
Bayesian Stochastic Blockmodels
Example
Ongoing work

SLIDE 3

Communites of interest

Goal: understand calling behavior of every TN on ATT

LD network: massive graph

Corinna, Daryl and Chris invented COI’s to scale

computation using Hancock (Anne Rogers and Kathleen Fisher)

Definition: COI of TN X is a subgraph centered around X

– Top k called by X + other – Top k calling X + other

SLIDE 4

COI signature

X

Other

utbound

Other inbound

SLIDE 5

Entire graph union of COI’s
Extend a COI by recursively growing the spider

– Captures calling behavior more accurately

Definition for this work:

– Grow the spider till depth 3. Only retain depth 3 edges that are between depth 2 nodes.

SLIDE 6

Extended COI

me

ther
ther

X x

SLIDE 7

Enhancing a COI !!

Missed calls:

– Local calls where TN’s not ATT local – Outbound OCC calls – Calls to/from the bin “other”

Big outbound and inbound TNs

– Dominate the COI, lot of clutter. – Need to down weight their calls.

Other issues

Want to quantify things like tendency to call, tendency

f being called, tendency of returning calls for every

TN.

SLIDE 8

Our approach so far

COI -> social network
Want a statistical model that estimates

missing edges, add desired ones and remove (or down weight) undesired ones.

SLIDE 9

me COI from top probability edges of a statistical model. The model adds new

edges. (brown

arrows) Removes undesired

nes.

SLIDE 10

Getting a sense of data

Some descriptive statistics based on a random sample

f 500 residential COI’s.

SLIDE 11

density = 100*ne/(g(g-1)) ne = number of edges g = number of nodes

SLIDE 12

SLIDE 13

Under random Average conditional

n out -degrees

SLIDE 14

Under random: Conditional on outdegrees

SLIDE 15

Under random: Conditional on indegrees

SLIDE 16

SLIDE 17

Distribution of “Other"

SLIDE 18

Representing the Data

Collection of all edges with activity
Matrix with no diagonal entries
Collection of several 2x2 contingency

tables

SLIDE 19

COI: gxg matrix without diagonal entries

SLIDE 20

COI: collection of 2x2 tables.

Data matrix a collection of g(g-1)/2 2x2 tables

(called dyads).

mij aij aji nij pij pji 1 i->j j-> i

present absent present absent

1-pij 1-pji

Row total Column total

SLIDE 21

More probabilities than edges. Need to express them in terms of fewer parameters which could be learned from data.

SLIDE 22

∑ ∑ ∑ ∑ ∑

+ + + + + + =

+ + + + + + i j j i ij ij j j r i i s j j j i i i

w z w r w s w w w M C likelihood

) exp( γ λ λ β α θ ρ

All Greek letters to be estimated from data Computation: 2 minutes for a typical COI on fry Likelihood, gradient and Hessian computed using C, optimizer in R. Optimizer goes crazy due to presence of so many zero degrees Do regularization, known as “shrinkage estimation” in statistics. Incur bias for small degree nodes but get reduction in variance.

SLIDE 23

Meaning of parameters

Node i:

– ai: expansiveness (tendency to call)

– ßi: attractiveness (tendency of being called)

Global parameters:

– ?: density of COI (reduces with increasing sparseness) – ?: reciprocity of COI (tendency to return calls) – ?s: “caller” specific effect – ?r: “cal lee” specific effect – ?: “call” specific effect

SLIDE 24

Differential reciprocity

Different reciprocity for each node:

– Add another parameter ?i to node i – Replace ?M by ?M + S i?i Mi in the likelihood – Called “differential reciprocity” model – Computationally challenging, have implemented it.

SLIDE 25

Missing edges?

Can estimate all parameters as long as we

have some observed edges in data matrix

– for each row (to estimate expansiveness) – for each column (to estimate attractiveness)

Missing local calls -> o.k.
OCC -> problem, entire row missing.

– Impute data using reasonable assumptions m times (typically m=3 o.k.) and combine

results. Working on it.

SLIDE 26

Incorporating edge weights

Edge weights binned into k bins using a random sample
f 500 COI’s. Weights in ith bin assigned a score i.

tij unknown, w’s weights on dyad (i,j). tij imputed using Hyper geometric wij wji k tij