Social Networks and Large Data Sets


SLIDE 1

Background Methodology Results Summary

Social Networks and Large Data Sets

Ryan de Vera, Qui Pham, and Juhyun Kim (Social Networks)
Brian de Silva, Jerry Luo, and Jason Bello (Document Declassifications)
John Wu, Mindy Case, Paul Chuavy-Waddy (Medical Data Mining)
Advisors: Dr. Hunter, Dr. Kolokolnikov

University of California, Los Angeles

Ryan de Vera Qui Pham Juhyun Kim Social Networks and Large Data Sets

SLIDE 2

Community Detection Using Meaningful Geosocial Data

Ryan de Vera Qui Pham Juhyun Kim

University of California, Los Angeles

August 9, 2013

SLIDE 3

Overview

1. Background: Setting, Data, Goals
2. Methodology: Clustering Methods, Spectral Clustering, Measure of Similarity
3. Results
4. Summary

SLIDE 4

Setting

Figure: Map of Hollenbeck with 31 Gang Territories

SLIDE 5

Map of Hollenbeck with hills and railroad

SLIDE 6

Data

The data generated from non-criminal stops made by the LAPD in the Hollenbeck area from 2000 to 2011 includes:
  • Geographical coordinates
  • Social connections
  • Gang affiliation
  • Gang territory
  • Time of stop
Each person is represented by the geographical coordinates of where they were stopped and by who they were stopped with.

SLIDE 7

Ground Truth of Hollenbeck

SLIDE 8

Goals

  • Predict gang affiliations
  • Incorporate native geographical and social information in clustering
  • Compare different methods of clustering and community detection

SLIDE 9

K-means Clustering

K-Means
Input: objects represented by vectors, number k of clusters
  • Assign each data point to the cluster with the closest mean
  • Recompute each cluster mean and repeat until the assignments no longer change
Output: clusters $B_1, \dots, B_k$
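The K-means loop can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the results: the function name `kmeans`, the random initialisation from data points, the iteration cap, and the handling of empty clusters are all assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate nearest-mean assignment and mean update."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Assumption: initialise the means with k distinct data points.
    means = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every mean.
        dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the means are too
        labels = new_labels
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old mean if a cluster empties
                means[c] = members.mean(axis=0)
    return labels, means

labels, means = kmeans([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], k=2)
```

On two well-separated groups like these, the three points near the origin end up in one cluster and the three points near (10, 10) in the other.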

SLIDE 10

Alternative Methods

Other Clustering Methods
  • K-Medoids
  • Gaussian Mixture Model
  • Thresholding
But there are limitations to these methods...

SLIDE 11

Spectral Clustering

Algorithm [Ng, Jordan, and Weiss (2001)]
Notation: $v_i^j$ is the $j$-th component of vector $v_i$
Input: Similarity matrix $A \in \mathbb{R}^{n \times n}$, number $k$ of clusters
  1. Compute the diagonal matrix $D = (d_{ij})$ where $d_{ii} = \sum_{k=1}^{n} a_{ik}$
  2. Compute $L = I - D^{-1/2} A D^{-1/2}$
  3. Compute the eigenvectors $v_1, \dots, v_k$ of $L$ belonging to the $k$ smallest eigenvalues
  4. Cluster vectors $(u_{ij})_{j=1,\dots,k}$, $i = 1, \dots, n$, into clusters $C_1, \dots, C_k$ using simple clustering methods
Output: Clusters $B_1, \dots, B_k$ with $B_i = \{ j \mid y_j \in C_i \}$
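Steps 1 to 3 can be sketched with NumPy as follows. The function name `spectral_embedding` is hypothetical, and the row normalisation follows the Ng, Jordan, and Weiss paper rather than anything stated on the slide, so treat it as an assumption. The rows of the returned matrix are the vectors that step 4 clusters with a simple method such as K-means.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return an n-by-k matrix whose i-th row represents data point i."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                  # d_ii = sum_k a_ik
    d_inv_sqrt = 1.0 / np.sqrt(d)      # assumes every row sum is positive
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so the first k columns belong to the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    # Assumption: normalise each row to unit length, as in the cited paper.
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# Two disconnected groups of two points each.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
Y = spectral_embedding(A, k=2)
```

For this block-structured similarity matrix, points in the same group map to the same row of `Y` and the two groups map to orthogonal rows, which is why a simple clustering method suffices in step 4.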

SLIDE 12

Measure of Similarity

Matrices
  • $A = (a_{ij}) = \alpha S + (1 - \alpha) G$: similarity matrix
  • $S = (s_{ij})$: social matrix
  • $G = (e^{-d_{ij}^2 / \sigma_i \sigma_j})$: geographical matrix
Distances
  • $d_{L_p}(x_i, x_j)$: $L_p$ distance of vector $x_i$ and vector $x_j$
  • $d_G(x_i, x_j)$: geographical boundary distance
  • $d_H(A, B)$: Hausdorff distance of set $A$ and set $B$

SLIDE 13

Social Matrix

Previous: Binary model

$$s_{ij} = \begin{cases} 1 & \text{if } O_i \cap O_j \neq \emptyset \\ 0 & \text{if } O_i \cap O_j = \emptyset \end{cases}$$

Disadvantages:
  • Does not reflect the frequency of people being stopped together

SLIDE 14

Social Matrix

Motivation:
  • Keep the values in [0, 1]
  • Utilize the frequency of people being stopped together

New Idea: Logarithmic model

$$s_{ij} = \frac{\ln\left(|O_i \cap O_j| + 1\right)}{\ln\left(\max_{O_x, O_y \in \Omega} |O_x \cap O_y| + 1\right)}$$
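The binary and logarithmic models can be compared side by side in a short sketch. It assumes each $O_i$ is encoded as a Python set of stop identifiers, which is only one possible reading of the notation; the function name `social_matrices` is hypothetical. The maximum is taken over all pairs exactly as the formula is written, which keeps every entry in [0, 1].

```python
import math

def social_matrices(stops):
    """stops[i] is the set of stop IDs involving person i (assumed encoding of O_i)."""
    n = len(stops)
    shared = [[len(stops[i] & stops[j]) for j in range(n)] for i in range(n)]
    # Largest overlap over all pairs; assumes at least one person has a stop.
    max_shared = max(max(row) for row in shared)
    binary = [[1 if shared[i][j] > 0 else 0 for j in range(n)] for i in range(n)]
    log = [[math.log(shared[i][j] + 1) / math.log(max_shared + 1)
            for j in range(n)] for i in range(n)]
    return binary, log

binary, log = social_matrices([{1, 2, 3}, {1, 2}, {3}, {4}])
```

Persons 0 and 1 share two stops and persons 0 and 2 share one; the binary model scores both pairs as 1, while the logarithmic model gives the first pair the larger value.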

SLIDE 15

Geographical Matrix

Previous: L2 Distance Between the Averages of Coordinates

$$d(O_i, O_j) = d_{L_2}\left( \frac{\sum_{(x_i, x_j) \in O_i} (x_i, x_j)}{|O_i|},\ \frac{\sum_{(x_i, x_j) \in O_j} (x_i, x_j)}{|O_j|} \right)$$

Disadvantages:
  • Lacks differentiating power, e.g. O1 = {−20, 20}; O2 = {−3, 1, 2}; O3 = {0}
  • Is vulnerable to outliers, e.g. O1 = {−50, −3, 0, 1, 2}; O2 = {−10}; O3 = {0}
  • Ignores native geographical information: boundaries, railroads and freeways, impassable terrain
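The two numeric examples can be checked directly. The sketch below is one reading of them, treating each set as one-dimensional coordinates; the helper name `mean_distance` is hypothetical.

```python
def mean_distance(P, Q):
    """Distance between the averages of two 1-D point sets."""
    return abs(sum(P) / len(P) - sum(Q) / len(Q))

# Lack of differentiating power: all three sets average to 0,
# so every pairwise distance collapses to 0.
a1, a2, a3 = [-20, 20], [-3, 1, 2], [0]
no_separation = (mean_distance(a1, a2), mean_distance(a1, a3), mean_distance(a2, a3))

# Vulnerability to outliers: the single point -50 drags the average of
# the first set to -10, so it looks identical to {-10} and far from {0},
# although four of its five points lie within 3 units of 0.
b1, b2, b3 = [-50, -3, 0, 1, 2], [-10], [0]
outlier_effect = (mean_distance(b1, b2), mean_distance(b1, b3))
```

Here `no_separation` is (0.0, 0.0, 0.0) and `outlier_effect` is (0.0, 10.0).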

SLIDE 16

Geographical Matrix

Present: Point-Set Distances
Motivation: new distances that
  • Possess good differentiating power
  • Are resilient to outliers

Directed distances:
  1. $d_1(A, B) = \min_{a \in A} d(a, B)$
  2. $d_2(A, B) = {}^{50}K^{\text{th}}_{a \in A}\, d(a, B)$
  3. $d_3(A, B) = {}^{75}K^{\text{th}}_{a \in A}\, d(a, B)$
  4. $d_4(A, B) = {}^{90}K^{\text{th}}_{a \in A}\, d(a, B)$
  5. $d_5(A, B) = \max_{a \in A} d(a, B)$
  6. $d_6(A, B) = \frac{1}{|A|} \sum_{a \in A} d(a, B)$

Note: ${}^{x}K^{\text{th}}_{a \in A}$ is the $K$-th ranked distance such that $K/|A| = x\%$
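The six directed distances differ only in how they summarise the list of point-to-set distances $d(a, B)$. A minimal sketch, assuming one-dimensional points with absolute difference as the base distance, and rounding $K$ up when $x\% \cdot |A|$ is not an integer (the slide does not specify the rounding):

```python
import math

def point_to_set(a, B, d):
    """d(a, B): distance from point a to the nearest point of B."""
    return min(d(a, b) for b in B)

def directed_distances(A, B, d=lambda a, b: abs(a - b)):
    dists = sorted(point_to_set(a, B, d) for a in A)

    def ranked(x):
        # K-th ranked distance with K/|A| = x%; rounding K up is an assumption.
        K = max(1, math.ceil(x / 100 * len(dists)))
        return dists[K - 1]

    return {
        "d1": dists[0],                  # minimum
        "d2": ranked(50),
        "d3": ranked(75),
        "d4": ranked(90),
        "d5": dists[-1],                 # maximum
        "d6": sum(dists) / len(dists),   # mean
    }

result = directed_distances([0, 1, 2, 10], [0])
```

With the outlier 10 in the first set, `d5` jumps to 10 while the mean-based `d6` is 3.25, which illustrates why the choice of summary matters for outlier resilience.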

SLIDE 17

Geographical Matrix

Present: Point-Set Distances
Symmetrizing functions:
  1. $f_1(d(A, B), d(B, A)) = \min(d(A, B), d(B, A))$
  2. $f_2(d(A, B), d(B, A)) = \max(d(A, B), d(B, A))$
  3. $f_3(d(A, B), d(B, A)) = \frac{d(A, B) + d(B, A)}{2}$
  4. $f_4(d(A, B), d(B, A)) = \frac{|A|\, d(A, B) + |B|\, d(B, A)}{|A| + |B|}$

Point-set distances: $h_{ij}(A, B) = f_i(d_j(A, B), d_j(B, A))$

Note: The only point-set distances that are metrics are:
  • Normal Hausdorff: $h_{25} = \max\left( \max_{a \in A} d(a, B),\ \max_{b \in B} d(b, A) \right)$
  • Modified Hausdorff: $h_{26} = \max\left( \frac{\sum_{a \in A} d(a, B)}{|A|},\ \frac{\sum_{b \in B} d(b, A)}{|B|} \right)$
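The two named distances, $h_{25}$ and $h_{26}$, both apply the symmetrizing function $f_2$ (the maximum) and differ only in the directed distance. A small sketch, again assuming one-dimensional points and absolute difference as the base distance; the helper names are hypothetical:

```python
def directed(A, B, reduce, d=lambda a, b: abs(a - b)):
    """Summarise the point-to-set distances d(a, B) over a in A."""
    return reduce([min(d(a, b) for b in B) for a in A])

def mean(values):
    return sum(values) / len(values)

def hausdorff(A, B):
    # h_25: f2 (max) applied to the directed distance d5 (max).
    return max(directed(A, B, max), directed(B, A, max))

def modified_hausdorff(A, B):
    # h_26: f2 (max) applied to the directed distance d6 (mean).
    return max(directed(A, B, mean), directed(B, A, mean))

A, B = [0, 1, 2, 10], [0]
normal, modified = hausdorff(A, B), modified_hausdorff(A, B)
```

For these sets the normal Hausdorff distance is 10, set entirely by the single far point, while the modified version averages it down to 3.25.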

SLIDE 18

Geographical Matrix

Present: Geographical Distance
Motivation: Incorporate native geographical information
Optimal solution: $d_G(x_i, x_j)$ is the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup I, E)$, where $I$ is the set of coordinates of all street intersections in the Hollenbeck area
Approximate solution: $d_G(x_i, x_j)$ is the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup P, E)$, where $P$ is the set of coordinates of all passages from one region of the Hollenbeck area to another
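A shortest-path distance of this kind can be computed with Dijkstra's algorithm. The sketch below is illustrative only: the adjacency-dictionary encoding, the node names, and the edge lengths are made up, and the slides do not say which shortest-path routine was used.

```python
import heapq

def shortest_path_length(graph, source, target):
    """Dijkstra on an undirected graph given as {node: {neighbour: length}}."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, length in graph[node].items():
            cand = dist + length
            if cand < best.get(nbr, float("inf")):
                best[nbr] = cand
                heapq.heappush(queue, (cand, nbr))
    return float("inf")  # target not reachable

# Hypothetical example: x1 and x2 lie on opposite sides of a barrier
# and can only reach each other through the passage p.
graph = {
    "x1": {"p": 3.0},
    "p": {"x1": 3.0, "x2": 4.0},
    "x2": {"p": 4.0},
}
d_G = shortest_path_length(graph, "x1", "x2")
```

Here `d_G` is 7.0, the length of the detour through the passage, regardless of how close the two points are in a straight line.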

SLIDE 19

Geographical Matrix

Present: Geographical Similarity Measure
  • Use different $p$ to calculate $L_p$ distances in computing the geographical distance $d_G$
  • Use the geographical distance to calculate $d(a, B) = \min_{b \in B} d(a, b)$ in computing point-set distances
  • Geographical matrix: $g_{ij} = \exp\left( -\frac{h_{kl}^2(O_i, O_j)}{\sigma_i \sigma_j} \right)$
  • $\sigma_i = h_{kl}(O_i, O_K)$ where $O_K$ is the $K$-th nearest neighbor of the $i$-th person $O_i$
  • $\sigma_i$ controls the width of the similarity neighborhood of the $i$-th person
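Given a precomputed matrix of point-set distances, the geographical matrix and its combination with a social matrix can be sketched as below. The function names are hypothetical, and the sketch assumes a symmetric distance matrix with zero diagonal and strictly positive distances between distinct persons, so that every $\sigma_i$ is positive.

```python
import numpy as np

def geographic_matrix(H, K):
    """g_ij = exp(-h_ij^2 / (sigma_i * sigma_j)) with local scales sigma_i."""
    H = np.asarray(H, dtype=float)
    # Column 0 of each sorted row is the zero self-distance, so column K
    # holds the distance to the K-th nearest neighbour.
    sigma = np.sort(H, axis=1)[:, K]
    return np.exp(-H ** 2 / np.outer(sigma, sigma))

def combine(S, G, alpha):
    """A = alpha * S + (1 - alpha) * G."""
    return alpha * np.asarray(S, dtype=float) + (1 - alpha) * np.asarray(G, dtype=float)

H = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
G = geographic_matrix(H, K=1)
A = combine(np.eye(3), G, alpha=0.9)
```

The identity matrix stands in for a social matrix here purely to make the example runnable.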

SLIDE 20

Comparison of Point-Set Distances

directed     symmetrizing functions
distances    f1        f2        f3        f4
d1           0.6024    0.6066    0.6083    0.5926
d2           0.6036    0.5477    0.5646    0.5524
d3           0.5905    0.5396    0.5625    0.5574
d4           0.5867    0.5345    0.5430    0.5286
d5           0.5897    0.5163    0.5630    0.5392
d6           0.6032    0.5651    0.6019    0.5702

Table: Purity scores for L1 distance and α = 0

SLIDE 21

Comparison of Point-Set Distances

directed     symmetrizing functions
distances    f1        f2        f3        f4
d1           0.6181    0.6142    0.6181    0.6172
d2           0.6206    0.5803    0.5875    0.5825
d3           0.6121    0.5774    0.5880    0.5829
d4           0.6189    0.5774    0.5930    0.5816
d5           0.6151    0.5795    0.5854    0.5812
d6           0.6189    0.6032    0.6168    0.6104

Table: Maximum purity scores for L2 distance and binary model

SLIDE 22

Comparison of Point-Set Distances

directed     symmetrizing functions
distances    f1        f2        f3        f4
d1           0.6210    0.6193    0.6176    0.6206
d2           0.6117    0.5757    0.5837    0.5727
d3           0.6189    0.5651    0.5795    0.5774
d4           0.6155    0.5740    0.5850    0.5884
d5           0.6168    0.5791    0.5901    0.5812
d6           0.6202    0.5922    0.6219    0.6087

Table: Maximum purity scores for L2 distance and logarithmic model

SLIDE 23

Comparison of Point-Set Distances

Observations
  • Modified Hausdorff distance yields higher purity scores than Normal Hausdorff distance
  • Social information complements geographical information regardless of social and geographical similarity functions
  • For the same directed distance $d_j$, the descending order of purity scores is f1 > f3 > f4 > f2
  • For the same symmetrizing function $f_i$, the descending order of purity scores is d1, d6 > d2, d3, d4, d5

SLIDE 24

Comparison of Lp Distances

Figure: Graph of purity scores for multiple p, binary model, and Modified Hausdorff

SLIDE 25

Comparison of Lp Distances

Figure: Graph of purity scores for multiple p, logarithmic model, and Modified Hausdorff

SLIDE 26

Comparison of Lp Distances

Observations
  • Purity scores for p < 1 are significantly lower than those for p ≥ 1
  • Purity scores for p ∈ {1, 2, 3} are almost the same
  • Purity scores for α = 1 are the lowest. This implies the dominance of the geographical information over the social information, due to the sparsity of the latter
  • For the binary model, the purity score increases with α, reaching the maximum around 0.92
  • For the logarithmic model, the purity score reaches the maximum around 0.63
  • The logarithmic model does not yield a higher maximum purity score than the binary model

SLIDE 27

Eigenvectors Based on Geographic Distances

Eigenvector 2, 3, 4, 5, 6, 7

SLIDE 28

Spectral Clustering with Geographic Distance (Avg Locs)

α = 0.9, Purity = 0.4763, z-Rand = 181.79

SLIDE 29

Eigenvectors Based on Hausdorff Distance

Eigenvector 2, 3, 4, 5, 6, 7

SLIDE 30

Spectral Clustering with Hausdorff Distance

α = 0.9, Purity = 0.6304, z-Rand = 402.65

SLIDE 31

Eigenvectors Based on Geographic Hausdorff Distance

SLIDE 32

Spectral Clustering with Geographic Hausdorff Distance

α = 0.9, Purity = 0.6265, z-Rand = 373.58

SLIDE 33

Eigenvectors Based on Modified Geographic Hausdorff Distance

SLIDE 34

Spectral Clustering with Modified Geographic Hausdorff Distance

α = 0.9, Purity = 0.75, z-Rand = 519.87

SLIDE 35

Cluster Evaluation

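Purity Score

$$\text{Purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|$$

where $\Omega = \{\omega_1, \dots, \omega_k\}$ is the set of clusters, $C = \{c_1, \dots, c_j\}$ is the set of classes, and $N$ is the number of data points.

Z-Rand Score
The Z-Rand score is the number of standard deviations by which $w_{11}$ lies away from its mean under random assignment, where $w_{11}$ is the number of pairs that belong both to the same cluster in spectral clustering and to the same gang.

Normalized Mutual Information (NMI)

$$\text{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}$$

where $I(\Omega; C)$ is mutual information, and $H(\Omega)$ is entropy.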
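Purity and NMI are straightforward to compute from two label lists. The sketch below uses natural logarithms (the base cancels in the NMI ratio) and hypothetical function names; it assumes at least one of the two labelings has more than one distinct label, otherwise the NMI denominator is zero. The Z-Rand score is left out because it needs the variance of the pair count under the null model.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of points that fall in the majority class of their cluster."""
    best = Counter()
    for (k, _), n in Counter(zip(clusters, classes)).items():
        best[k] = max(best[k], n)
    return sum(best.values()) / len(clusters)

def entropy(labels):
    N = len(labels)
    return -sum(n / N * math.log(n / N) for n in Counter(labels).values())

def nmi(clusters, classes):
    N = len(clusters)
    size_k, size_j = Counter(clusters), Counter(classes)
    mutual_info = sum(
        n / N * math.log(n * N / (size_k[k] * size_j[j]))
        for (k, j), n in Counter(zip(clusters, classes)).items()
    )
    return mutual_info / ((entropy(clusters) + entropy(classes)) / 2)

clusters = [0, 0, 0, 1, 1, 1]
gangs = ["a", "a", "b", "b", "b", "b"]
score = purity(clusters, gangs)
```

In this toy example cluster 0 has majority class "a" (2 of 3 points) and cluster 1 has majority class "b" (3 of 3 points), so the purity is 5/6.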

SLIDE 36

Finding the optimal α

Figure: Z-Rand Score for α
Figure: Purity Score for α

SLIDE 37

Parameter(α, σ) Optimization

$G_{ij} = e^{-d_{ij}^2 / \sigma_i \sigma_j}$

Figure: Z-Rand score for Social Information (α) vs. σ
Figure: Purity score for Social Information (α) vs. σ

SLIDE 38

Best Result: Modified Hausdorff Distance

α      NMI              Purity Score     Z-Rand Score
0      0.5503±0.0076    0.6275±0.0121    372.8460±25.7794
0.1    0.5517±0.0083    0.6269±0.0121    388.6803±24.0807
0.2    0.5557±0.0075    0.6330±0.0112    388.9322±23.4806
0.3    0.5583±0.0084    0.6462±0.0139    412.9919±24.2205
0.4    0.5609±0.0077    0.6495±0.0111    416.2250±28.8774
0.5    0.5684±0.0076    0.6604±0.0111    424.8999±25.7996
0.6    0.5903±0.0097    0.6780±0.0134    441.8239±24.3245
0.7    0.6184±0.0094    0.6857±0.0113    445.1817±19.0138
0.8    0.6477±0.0078    0.7159±0.0108    460.9598±17.6098
0.9    0.7018±0.0128    0.7549±0.0168    490.9406±26.2097
1      0.4233±0.0153    0.4681±0.0176    47.2456±4.9858

SLIDE 39

Summary

Progress
  • Geographic boundary distance (Improved results, but)
  • Hausdorff distance (Major improvement)
  • Geographic Hausdorff distance (Slight improvement, but)
Future Direction
  • Keep exploring the temporal component
  • Compare clusters before and after gang injunction
  • Semi-supervised clustering: take advantage of gang territory
  • Semi-supervised clustering: partially labeled points
