Social Networks and Large Data Sets

Ryan de Vera, Qui Pham, and Juhyun Kim (Social Networks)
Brian de Silva, Jerry Luo, and Jason Bello (Document Declassifications)
John Wu, Mindy Case, Paul Chuavy-Waddy (Medical Data Mining)

Advisors: Dr. Hunter, Dr. Kolokolnikov
University of California, Los Angeles

Community Detection Using Meaningful Geosocial Data

Ryan de Vera, Qui Pham, Juhyun Kim
University of California, Los Angeles
August 9, 2013

Overview

1. Background
   - Setting
   - Data
   - Goals
2. Methodology
   - Clustering Methods
   - Spectral Clustering
   - Measure of Similarity
3. Results
4. Summary

Setting

Figure: Map of Hollenbeck with 31 Gang Territories

Map of Hollenbeck with hills and railroad

Data

The data generated from non-criminal stops made by the LAPD in the Hollenbeck area from 2000 to 2011 includes:

- Geographical coordinates
- Social connection
- Gang affiliation
- Gang territory
- Time of stop

People are represented by the geographical coordinates of where they were stopped and by who they were stopped with.

Ground Truth of Hollenbeck

Goals

- Predict gang affiliations
- Incorporate native geographical and social information in clustering
- Compare different methods of clustering and community detection

K-means Clustering

K-Means

Input: objects represented by vectors, number $k$ of clusters

- K-means assigns each data point to the cluster with the closest mean
- Repeat: the means are recomputed and the points reassigned until the assignments stop changing

Output: clusters $B_1, \ldots, B_k$
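
The assign-and-repeat loop can be sketched in a few lines of Python. This is a minimal illustration and not the presenters' implementation: the function name `kmeans`, the choice of initial means, and the toy points are all assumptions made for the example.

```python
import math

def kmeans(points, init_means, max_iter=100):
    """Lloyd-style iteration: assign each point to the closest mean, then update the means."""
    means = [tuple(m) for m in init_means]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest mean for every point.
        new_labels = [
            min(range(len(means)), key=lambda c: math.dist(p, means[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each mean becomes the average of its members.
        for c in range(len(means)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, means

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, means = kmeans(points, init_means=[points[0], points[3]])
print(labels)
```

The result depends on the initial means, which is one reason k-means is usually restarted several times in practice.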

Alternative Methods

Other Clustering Methods

- K-Medoids
- Gaussian Mixture Model
- Thresholding

But these methods have limitations.

Spectral Clustering

Algorithm [Ng, Jordan, and Weiss (2001)]

Notation: $v_i^j$ is the $j$-th component of vector $v_i$

Input: similarity matrix $A \in \mathbb{R}^{n \times n}$, number $k$ of clusters

1. Compute the diagonal matrix $D = (d_{ij})$ where $d_{ii} = \sum_{m=1}^{n} a_{im}$
2. Compute $L = I - D^{-1/2} A D^{-1/2}$
3. Compute the eigenvectors $v_1, \ldots, v_k$ of $L$ belonging to its $k$ smallest eigenvalues
4. Cluster the vectors $y_i = (v_j^i)_{j=1,\ldots,k}$, $i = 1, \ldots, n$, into clusters $C_1, \ldots, C_k$ using simple clustering methods

Output: clusters $B_1, \ldots, B_k$ with $B_i = \{ j \mid y_j \in C_i \}$
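
Steps 1 to 3 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: `spectral_embedding` is a hypothetical helper name, every row of the similarity matrix is assumed to have a positive sum, and the row normalization of the embedding is taken from the Ng, Jordan, and Weiss paper rather than from the slide. Step 4 is left to any simple clustering method applied to the rows of the returned matrix.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return the n x k matrix whose i-th row is the embedded vector y_i."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # d_ii: row sums of A
    d_inv_sqrt = 1.0 / np.sqrt(d)          # assumes every row sum is positive
    # L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigvecs[:, :k]                     # eigenvectors of the k smallest eigenvalues
    # Row normalization (from the Ng-Jordan-Weiss paper, not stated on the slide).
    return V / np.linalg.norm(V, axis=1, keepdims=True)

# Hypothetical similarity matrix with two disconnected groups {0, 1} and {2, 3}.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
Y = spectral_embedding(A, k=2)
# Rows of Y from the same group coincide; rows from different groups are orthogonal.
print(np.round(Y @ Y.T, 6))
```

In this idealized case the embedded points collapse onto one point per group, so even a very simple method such as k-means separates them.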

Measure of Similarity

Matrices

- $A = (a_{ij}) = \alpha S + (1 - \alpha) G$: similarity matrix
- $S = (s_{ij})$: social matrix
- $G = \left( e^{-d_{ij}^2 / (\sigma_i \sigma_j)} \right)$: geographical matrix

Distances

- $d_{L_p}(x_i, x_j)$: $L_p$ distance of vector $x_i$ and vector $x_j$
- $d_G(x_i, x_j)$: geographical boundary distance
- $d_H(A, B)$: Hausdorff distance of set $A$ and set $B$
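
As a small illustration of how $\alpha$ trades the social term against the geographical term, the sketch below forms $A = \alpha S + (1 - \alpha) G$ entry by entry. The helper name `combine` and the matrix values are made up for the example.

```python
def combine(S, G, alpha):
    """Entry-wise convex combination A = alpha * S + (1 - alpha) * G."""
    n = len(S)
    return [
        [alpha * S[i][j] + (1 - alpha) * G[i][j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 2 x 2 social and geographical matrices.
S = [[1.0, 0.5], [0.5, 1.0]]
G = [[1.0, 0.2], [0.2, 1.0]]

A_mixed = combine(S, G, alpha=0.25)    # mostly geographical
A_geo_only = combine(S, G, alpha=0.0)  # alpha = 0 ignores the social matrix
print(A_mixed[0][1], A_geo_only[0][1])
```

Setting $\alpha = 0$ reduces $A$ to $G$, so a purely geographical comparison isolates the effect of the distance choice.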

Social Matrix

Previous

Binary model:

$$ s_{ij} = \begin{cases} 1 & \text{if } O_i \cap O_j \neq \emptyset \\ 0 & \text{if } O_i \cap O_j = \emptyset \end{cases} $$

Disadvantage: does not reflect the frequency of people being stopped together

Social Matrix

Motivation:

- Keep the values in $[0, 1]$
- Utilize the frequency of people being stopped together

New Idea

Logarithmic model:

$$ s_{ij} = \frac{\ln\left( |O_i \cap O_j| + 1 \right)}{\ln\left( \max_{O_x, O_y \in \Omega} |O_x \cap O_y| + 1 \right)} $$
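
The difference between the binary and logarithmic models can be seen on a toy example. The sketch below assumes that each $O_i$ is a set of stop identifiers and that the maximum in the denominator runs over all pairs, including $x = y$; neither detail is spelled out on the slides. The function name `social_matrix` and the data are hypothetical.

```python
import math

def social_matrix(stops, model="log"):
    """stops[i] is the set O_i of (hypothetical) stop identifiers for person i."""
    n = len(stops)
    shared = [[len(stops[i] & stops[j]) for j in range(n)] for i in range(n)]
    if model == "binary":
        return [[1.0 if shared[i][j] > 0 else 0.0 for j in range(n)] for i in range(n)]
    # Assumption: the maximum is taken over all pairs, including x = y.
    max_shared = max(max(row) for row in shared)
    if max_shared == 0:  # nobody was ever stopped: avoid dividing by ln(1) = 0
        return [[0.0] * n for _ in range(n)]
    denom = math.log(max_shared + 1)
    return [[math.log(shared[i][j] + 1) / denom for j in range(n)] for i in range(n)]

# Persons 0 and 1 share two stops, persons 0 and 2 share one, person 3 shares none.
stops = [{1, 2, 3}, {2, 3, 4}, {3}, {9}]
S_bin = social_matrix(stops, model="binary")
S_log = social_matrix(stops, model="log")
print(S_bin[0][1], S_bin[0][2])  # the binary model cannot tell these pairs apart
print(S_log[0][1], S_log[0][2])  # the logarithmic model ranks the first pair higher
```

The logarithm keeps every entry in $[0, 1]$ while damping the influence of pairs that were stopped together very often.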

Geographical Matrix

Previous

$L_2$ Distance Between the Averages of Coordinates:

$$ d(O_i, O_j) = d_{L_2}\left( \frac{\sum_{(x, y) \in O_i} (x, y)}{|O_i|}, \frac{\sum_{(x, y) \in O_j} (x, y)}{|O_j|} \right) $$

Disadvantages:

- Lacks differentiating power: $O_1 = \{-20, 20\}$; $O_2 = \{-3, 1, 2\}$; $O_3 = \{0\}$ (all three sets have average 0, so every pairwise distance is 0)
- Is vulnerable to outliers: $O_1 = \{-50, -3, 0, 1, 2\}$; $O_2 = \{-10\}$; $O_3 = \{0\}$ (the single outlier $-50$ moves the average of $O_1$ to $-10$, so $O_1$ appears identical to $O_2$ and far from $O_3$)
- Ignores native geographical information:
  - Boundaries
  - Railroads and freeways
  - Impassable terrains
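
The two one-dimensional examples from the slide can be checked directly. The helper `centroid_distance` below is a hypothetical name for the distance between averages, restricted to one coordinate to match the slide's examples.

```python
def centroid_distance(A, B):
    """Distance between the averages of two sets of 1-D coordinates."""
    return abs(sum(A) / len(A) - sum(B) / len(B))

# Lack of differentiating power: very different sets, identical averages.
O1, O2, O3 = [-20, 20], [-3, 1, 2], [0]
print(centroid_distance(O1, O2), centroid_distance(O1, O3), centroid_distance(O2, O3))

# Vulnerability to outliers: one stop at -50 drags the average of P1 to -10.
P1, P2, P3 = [-50, -3, 0, 1, 2], [-10], [0]
print(centroid_distance(P1, P2), centroid_distance(P1, P3))
```

In the second case four of the five stops in the first set lie within 3 units of 0, yet the average places the set on top of $-10$.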

Geographical Matrix

Present: Point-Set Distances

Motivation: new distances satisfying:

- Possess good differentiating power
- Be resilient to outliers

Directed distances:

1. $d_1(A, B) = \min_{a \in A} d(a, B)$
2. $d_2(A, B) = {}^{50}K^{\text{th}}_{a \in A}\, d(a, B)$
3. $d_3(A, B) = {}^{75}K^{\text{th}}_{a \in A}\, d(a, B)$
4. $d_4(A, B) = {}^{90}K^{\text{th}}_{a \in A}\, d(a, B)$
5. $d_5(A, B) = \max_{a \in A} d(a, B)$
6. $d_6(A, B) = \frac{1}{|A|} \sum_{a \in A} d(a, B)$

Note: ${}^{x}K^{\text{th}}_{a \in A}$ is the $K$-th ranked distance such that $K / |A| = x\%$
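
A sketch of the six directed distances follows. It assumes $d(a, B)$ is the Euclidean distance from $a$ to its nearest point in $B$ and that $K$ is rounded up when $x\% \cdot |A|$ is not an integer; the rounding rule is an assumption, since the slide does not specify it. All function names are hypothetical.

```python
import math

def point_to_set(a, B):
    """d(a, B): distance from point a to its nearest point in B (Euclidean here)."""
    return min(math.dist(a, b) for b in B)

def directed_distances(A, B):
    """Return d_1 .. d_6 from A to B as a dictionary."""
    dists = sorted(point_to_set(a, B) for a in A)

    def kth(percent):
        # K-th ranked distance with K / |A| = percent%; rounding K up is an assumption.
        K = max(1, math.ceil(percent / 100 * len(dists)))
        return dists[K - 1]

    return {
        "d1": dists[0],                  # minimum
        "d2": kth(50),                   # 50th percentile rank
        "d3": kth(75),                   # 75th percentile rank
        "d4": kth(90),                   # 90th percentile rank
        "d5": dists[-1],                 # maximum
        "d6": sum(dists) / len(dists),   # mean
    }

# Hypothetical sets: A has one far-away outlier at (10, 0).
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
B = [(0.0, 0.0)]
d = directed_distances(A, B)
print(d)
```

The ranked variants $d_2$ and $d_3$ ignore the outlier entirely, while $d_5$ is determined by it alone.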

Geographical Matrix

Present: Point-Set Distances

Symmetrizing functions:

1. $f_1(d(A, B), d(B, A)) = \min(d(A, B), d(B, A))$
2. $f_2(d(A, B), d(B, A)) = \max(d(A, B), d(B, A))$
3. $f_3(d(A, B), d(B, A)) = \frac{d(A, B) + d(B, A)}{2}$
4. $f_4(d(A, B), d(B, A)) = \frac{|A|\, d(A, B) + |B|\, d(B, A)}{|A| + |B|}$

Point-set distances: $h_{ij}(A, B) = f_i(d_j(A, B), d_j(B, A))$

Note: the only point-set distances that are metrics are:

- Normal Hausdorff: $h_{25} = \max\left( \max_{a \in A} d(a, B), \max_{b \in B} d(b, A) \right)$
- Modified Hausdorff: $h_{26} = \max\left( \frac{\sum_{a \in A} d(a, B)}{|A|}, \frac{\sum_{b \in B} d(b, A)}{|B|} \right)$
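
The construction $h_{ij} = f_i(d_j(A, B), d_j(B, A))$ can be sketched by passing a directed distance into a symmetrizing function. Only $d_5$ (maximum) and $d_6$ (mean) are implemented here to keep the example short; Euclidean point distances and all function names are assumptions of the sketch.

```python
import math

def point_to_set(a, B):
    """d(a, B): Euclidean distance from a to its nearest point in B."""
    return min(math.dist(a, b) for b in B)

def d_max(A, B):   # d_5
    return max(point_to_set(a, B) for a in A)

def d_mean(A, B):  # d_6
    return sum(point_to_set(a, B) for a in A) / len(A)

def symmetrize(i, dab, dba, size_a, size_b):
    """Apply the symmetrizing function f_i to the two directed distances."""
    if i == 1:
        return min(dab, dba)
    if i == 2:
        return max(dab, dba)
    if i == 3:
        return (dab + dba) / 2
    if i == 4:
        return (size_a * dab + size_b * dba) / (size_a + size_b)
    raise ValueError("i must be 1, 2, 3 or 4")

def point_set_distance(i, directed, A, B):
    """h_ij(A, B) = f_i(d_j(A, B), d_j(B, A)) for a given directed distance d_j."""
    return symmetrize(i, directed(A, B), directed(B, A), len(A), len(B))

# Hypothetical point sets on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]
hausdorff = point_set_distance(2, d_max, A, B)    # h_25
modified = point_set_distance(2, d_mean, A, B)    # h_26
print(hausdorff, modified)
```

Swapping the first argument between 1 and 4 reproduces the other columns of the comparison for a fixed directed distance.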

Geographical Matrix

Present: Geographical Distance

Motivation: incorporate native geographical information

Optimal solution: $d_G(x_i, x_j)$ is the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup I, E)$, where $I$ is the set of coordinates of all street intersections in the Hollenbeck area

Approximate solution: $d_G(x_i, x_j)$ is the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup P, E)$, where $P$ is the set of coordinates of all passages from one region of the Hollenbeck area to another
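
A shortest-path distance of this kind could be computed with Dijkstra's algorithm. The sketch below is illustrative only: the slides do not say which algorithm was used, and the graph, node names, and edge lengths are invented. Two stop locations lie in different regions, so every path must go through one of two hypothetical passages.

```python
import heapq

def add_edge(graph, u, v, length):
    """Insert an undirected edge of the given length."""
    graph.setdefault(u, []).append((v, length))
    graph.setdefault(v, []).append((u, length))

def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; returns infinity when target is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, u = heapq.heappop(heap)
        if u == target:
            return dist
        if dist > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, length in graph.get(u, []):
            candidate = dist + length
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return float("inf")

# Hypothetical layout: x_i and x_j are separated by a barrier with two passages.
graph = {}
add_edge(graph, "x_i", "passage_1", 3.0)
add_edge(graph, "passage_1", "x_j", 4.0)
add_edge(graph, "x_i", "passage_2", 1.0)
add_edge(graph, "passage_2", "x_j", 10.0)

d_G = shortest_path_length(graph, "x_i", "x_j")
print(d_G)
```

Restricting the graph to passages rather than all street intersections keeps the node count small, which is the point of the approximate solution.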

Geographical Matrix

Present: Geographical Similarity Measure

- Use different $p$ to calculate $L_p$ distances in computing the geographical distance $d_G$
- Use the geographical distance to calculate $d(a, B) = \min_{b \in B} d(a, b)$ in computing point-set distances
- Geographical matrix: $g_{ij} = \exp\left( -\frac{h_{kl}^2(O_i, O_j)}{\sigma_i \sigma_j} \right)$
- $\sigma_i = h_{kl}(O_i, O_K)$, where $O_K$ is the $K$-th nearest neighbor of the $i$-th person $O_i$
- $\sigma_i$ controls the width of the similarity neighborhood of the $i$-th person
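
Given a precomputed matrix of point-set distances, the locally scaled similarity can be sketched as below. The function name `geographical_matrix` and the distance values are hypothetical, and the sketch assumes that every $\sigma_i$ is strictly positive, meaning no person sits at distance 0 from their $K$-th nearest neighbor.

```python
import math

def geographical_matrix(H, K):
    """H[i][j] holds h_kl(O_i, O_j); returns (G, sigma) with local scaling."""
    n = len(H)
    sigma = []
    for i in range(n):
        # Distances from person i to everyone else, nearest first.
        others = sorted(H[i][j] for j in range(n) if j != i)
        sigma.append(others[K - 1])  # distance to the K-th nearest neighbor
    G = [
        [math.exp(-H[i][j] ** 2 / (sigma[i] * sigma[j])) for j in range(n)]
        for i in range(n)
    ]
    return G, sigma

# Hypothetical distances between three people located at 0, 1 and 3 on a line.
H = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
G, sigma = geographical_matrix(H, K=1)
print(sigma)
print(G[0][1], G[1][2])
```

Because $\sigma_i$ grows in sparse areas, an isolated person still gets non-negligible similarity to their nearest neighbors.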

Comparison of Point-Set Distances

directed     symmetrizing functions
distances    f_1       f_2       f_3       f_4
d_1          0.6024    0.6066    0.6083    0.5926
d_2          0.6036    0.5477    0.5646    0.5524
d_3          0.5905    0.5396    0.5625    0.5574
d_4          0.5867    0.5345    0.5430    0.5286
d_5          0.5897    0.5163    0.5630    0.5392
d_6          0.6032    0.5651    0.6019    0.5702

Table: Purity scores for $L_1$ distance and $\alpha = 0$

Comparison of Point-Set Distances

directed     symmetrizing functions
distances    f_1       f_2       f_3       f_4
d_1          0.6181    0.6142    0.6181    0.6172
d_2          0.6206    0.5803    0.5875    0.5825
d_3          0.6121    0.5774    0.5880    0.5829
d_4          0.6189    0.5774    0.5930    0.5816
d_5          0.6151    0.5795    0.5854    0.5812
d_6          0.6189    0.6032    0.6168    0.6104

Table: Maximum purity scores for $L_2$ distance and binary model
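
The slides report purity scores without defining them. Under the usual definition, each cluster is credited with its most common true label and the credited counts are divided by the number of points. The sketch below assumes that this standard definition is the one used; the labels are invented for illustration.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of points that carry the majority true label of their cluster."""
    counts = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        counts.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(true_labels)

# Hypothetical example: six people, two clusters, three true gang labels.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "c"]
score = purity(clusters, truth)
print(score)
```

Purity rises trivially as the number of clusters grows, so scores are only comparable across runs that use the same number of clusters.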
