Unsupervised Scalable Statistical Method for Identifying Influential - PowerPoint PPT Presentation

Unsupervised Scalable Statistical Method for Identifying Influential Users in Online Social Networks Antonio Fernández Anta

Team
• Universidad Carlos III de Madrid
• IMDEA Networks Institute
◦ Rubén Cuevas
◦ Arturo Azcorra
◦ Henry Laniado
◦ Luis F. Chiroque
◦ Rosa E. Lillo
◦ A.F.A.
◦ Juan Romo
◦ Carlos Sguera

Motivation
• Online Social Networks (OSNs) are used every day by billions of people
• They are invaluable for extracting information and for acting on it in advertising, marketing, politics, etc.
• A recurring problem in OSN analysis is to identify “interesting” or “influential” users
• Usually the characterization of influential users is given a priori, and algorithms to find users with these characteristics are proposed

Characterizing Influential Users
• Several characterizations have been used for influential OSN users:
  ◦ Large number of followers [Cha HBG 2010] [Pastor-Satorras Vespignani 2001] [Cohen EbAH 2001]
  ◦ Capacity of engagement [Domingos Richardson 2001] [D’Agostino ANT 2015]
  ◦ High infection capacity in an epidemic model [Kitsak GHLMSM 2010] [Morone Makse 2015] [Kempe Kleinberg Tardos 2015]
• Each of these characterizations may miss important interesting users
• They disregard many available attributes of the users

Contributions
• We propose a new unsupervised method to identify “interesting” users: Massive Unsupervised Outlier Detection (MUOD)
• MUOD finds outliers in the multidimensional data available from the users
• These outliers can later be explored further to identify their nature; MUOD identifies multiple types of outliers to make this easier
• MUOD scales to millions of users, so it is usable in large OSNs
• We successfully tested MUOD on Google+ data with 170M users over 2 years

Problem Statement
• We have a set of n OSN users
• For every user we have d attributes:
  ◦ Connectivity: number of friends, followers, centrality metrics, etc.
  ◦ Activity: number of posts, likes, reposts, etc.
  ◦ Profile: user’s name, location (e.g., city of residence), job, education, gender, and related data

Outliers
• The objective is to find the outliers in the set of OSN users

Multidimensional Data
• Detecting outliers in multidimensional data is not easy

Multidimensional Data
• With more than three dimensions, it is practically impossible to visualize the observations graphically using Cartesian coordinates
• Convenient alternative: parallel coordinates [Wegman 1990]
• An observation $x \in \mathbb{R}^d$ can be seen as a real function defined on an arbitrary set of equally spaced domain points, e.g., $\{1, \ldots, d\}$, and $x$ can be expressed as $x = \{x(1), \ldots, x(d)\}$ [López-Pintado Romo 2009]

Functional Data Analysis
• Each observation/user is expressed as a curve, and the outliers are curves that differ from “the mass” [Hubert Rousseeuw Segaert 2015] in
  ◦ Magnitude
  ◦ Amplitude
  ◦ Shape

The Method
• In MUOD we assign to each user three indices, each giving the outlier intensity of one type:
  ◦ The shape index $I_S$ is based on the correlation coefficient between the functions
  ◦ The amplitude index $I_A$ is based on the slope of the linear regressions between the functions
  ◦ The magnitude index $I_M$ is based on the constant term of the linear regressions between the functions
• The higher the corresponding index, the more likely the user is an outlier

Shape Index
Let us consider the set of users $X = \{x_1, x_2, \ldots, x_n\}$, where each user is a vector of d values. The shape index of a user $x$ is computed as

$$ I_S(x, X) = \left| \frac{1}{n} \sum_{j=1}^{n} \rho(x, x_j) - 1 \right| $$

where $\rho(x, x_j)$ is the Pearson correlation coefficient.
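As an illustration of the shape index formula, here is a minimal Python sketch. It is not the authors' R/C++ implementation; the function name `shape_index` and the toy data are assumptions made for this example, and it builds the full n × n correlation matrix, so it only suits small n.

```python
import numpy as np

def shape_index(X):
    """Shape index I_S for every row of X (n users x d attributes).

    Hypothetical helper for illustration; rows with zero variance would
    yield NaN correlations and need separate handling.
    """
    X = np.asarray(X, dtype=float)
    rho = np.corrcoef(X)                 # rho[i, j] = Pearson correlation of rows i and j
    return np.abs(rho.mean(axis=1) - 1.0)

# Toy data: three increasing curves and one decreasing curve.
t = np.arange(1, 9, dtype=float)
X = np.vstack([t, 2 * t + 1, 0.5 * t + 3, t[::-1]])
indices = shape_index(X)
print(indices)                           # the decreasing curve gets the largest index
```

In the toy data the first three curves are perfectly correlated with each other, so only the reversed curve stands out in shape.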

Shape Index Example
[Figure omitted: two plots; only axis tick values survived extraction]

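Magnitude and Amplitude Indices
We use linear regression:

$$ \hat{\beta}_j = \frac{\mathrm{Cov}(x, x_j)}{\mathrm{Var}(x_j)}, \qquad \hat{\alpha}_j = \bar{x} - \hat{\beta}_j \, \bar{x}_j $$

to obtain the magnitude and amplitude indices:

$$ I_M(x, X) = \left| \frac{1}{n} \sum_{j=1}^{n} \hat{\alpha}_j \right|, \qquad I_A(x, X) = \left| \frac{1}{n} \sum_{j=1}^{n} \hat{\beta}_j - 1 \right| $$

Here $\bar{x}$ and $\bar{x}_j$ denote the means of $x$ and $x_j$ over their d values, as in the usual least-squares intercept.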
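The regression-based indices can be sketched in Python as follows. This is an illustrative reading of the formulas, not the authors' code: the function name `magnitude_amplitude_indices` and the toy curves are assumptions, and the intercept uses the standard least-squares form (mean of x minus slope times mean of x_j).

```python
import numpy as np

def magnitude_amplitude_indices(X):
    """Return (I_M, I_A) for every row of X (n users x d attributes).

    Hypothetical helper for illustration; it stores n x n matrices,
    so it is only meant for small n.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    means = X.mean(axis=1)                        # mean of each curve
    centered = X - means[:, None]
    cov = centered @ centered.T / d               # cov[i, j] = Cov(x_i, x_j)
    var = np.diag(cov)                            # var[j] = Var(x_j)
    beta = cov / var[None, :]                     # slope of regressing x_i on x_j
    alpha = means[:, None] - beta * means[None, :]  # matching intercepts
    i_m = np.abs(alpha.mean(axis=1))
    i_a = np.abs(beta.mean(axis=1) - 1.0)
    return i_m, i_a

t = np.arange(8, dtype=float)
# One curve shifted upwards: a magnitude outlier.
m_shift, a_shift = magnitude_amplitude_indices(np.vstack([t, t, t, t + 10]))
# One curve stretched by a factor of 3: an amplitude outlier.
m_scale, a_scale = magnitude_amplitude_indices(np.vstack([t, t, t, 3 * t]))
print(m_shift, a_shift)
print(m_scale, a_scale)
```

The shifted curve leaves every slope at 1 and shows up only in I_M, while the stretched curve gets the largest I_A.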

Magnitude and Amplitude Indices
[Figure omitted: four plots; only axis tick values survived extraction]

Which are Outliers?
• Given the index $I_S$ of each user, we can obtain the set of outliers:
  ◦ Sort the users by $I_S$
  ◦ Cut at the point given by the tangent method [Louail 2014]

Sets of Outliers
• Given the sets of shape, magnitude, and amplitude outliers, we have up to 7 different outlier subsets to consider, given their possible intersections
[Figure: outlier groups in a simulation, with sets labeled magnitude, amplitude, shape, and fbplot]
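The count of 7 comes from the non-empty combinations of three sets (2^3 − 1). A small Python sketch of the split into disjoint groups follows; the function name `outlier_groups`, the label scheme (letters M, A, S concatenated), and the user ids are assumptions made for this example.

```python
def outlier_groups(shape, magnitude, amplitude):
    """Split outliers into disjoint groups by the sets they belong to.

    Labels concatenate M (magnitude), A (amplitude), S (shape), giving
    at most 7 groups: M, A, S, MA, MS, AS, MAS.
    """
    sets = {"M": set(magnitude), "A": set(amplitude), "S": set(shape)}
    groups = {}
    for user in set().union(*sets.values()):
        label = "".join(key for key in "MAS" if user in sets[key])
        groups.setdefault(label, set()).add(user)
    return groups

# Hypothetical user ids.
groups = outlier_groups(shape={1, 2, 3, 4}, magnitude={3, 4, 5}, amplitude={4, 6})
for label in sorted(groups):
    print(label, sorted(groups[label]))
```

Each user lands in exactly one group, so the groups can be explored separately.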

Performance Evaluation

Performance Results

Mixed Outliers

Decomposed Results

Implementation
• We have implemented the outlier detection algorithm MUOD as an R package
• The core had to be written in C++ and added to the R system, since R functions did not allow the required memory control
• The implementation allows parallel execution on p cores, with time complexity $O(n^2 d / p)$
• It has been made available in a public repository: https://github.com/luisfo/muod.outliers
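To make the n²d/p cost concrete, the following Python sketch splits the n users into p blocks and lets each worker compare its block against all n users over d attributes. It is only a guess at the general pattern, not the structure of the R/C++ package; `shape_index_blocked` and `_block_index` are hypothetical names, and threads are used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def _block_index(Z, rows):
    """Shape index for one block of users against all n users."""
    d = Z.shape[1]
    rho = Z[rows] @ Z.T / d              # (block size) x n correlations, cost ~ (n/p) * n * d
    return np.abs(rho.mean(axis=1) - 1.0)

def shape_index_blocked(X, p=4):
    """Shape index computed in p row blocks (hypothetical sketch)."""
    X = np.asarray(X, dtype=float)
    # Standardize each row so that a dot product divided by d is a Pearson correlation.
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    blocks = np.array_split(np.arange(len(Z)), p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = list(pool.map(lambda rows: _block_index(Z, rows), blocks))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 21))            # 50 synthetic users, 21 attributes
blocked = shape_index_blocked(X, p=4)
print(blocked[:5])
```

Only one block of the correlation matrix is held by each worker at a time, which is one way to keep memory bounded when n is large.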

Performance

MUOD in Google+
• We have data of n=170M Google+ users and 2 years of activity (2011-2013), with d=21 features for each user (profile, activity, and connectivity)
• We use the 5.6M active users
• We find:
  ◦ 4K outliers of MAS (magnitude, amplitude, and shape)
  ◦ 2K outliers of MS (magnitude and shape)
  ◦ 2K outliers of AS (amplitude and shape)
  ◦ 294K outliers of only SHA (shape)

Exploration of the Outlier Sets
[Figure: medians (log scale) of each feature for the groups mass (sample), MAS, AS, MS, SHA, and FBPLOT. Features: NumActivities, NumAtts, NumPlusOnes, NumReplies, NumReshares, NumFriends, NumFollowers, NumFields, PerBidir, accountAge, accountRec, gender, job, numVideos, numPhotos, numAlbums, numArticles, numHangouts, numEvents, numWithGeo, pageRank]

Epidemic Behavior
• We run 10 SI (susceptible-infected) simulations in the connected component (170M users)
[Figure: infection process; x-axis: steps (1 to 5); y-axis: millions of users (0 to 60); curves: FBPLOT, SHA, MS, AS, MAS, mass]
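For readers unfamiliar with the SI model, here is a minimal Python sketch of a discrete-time susceptible-infected process on a graph. The slides do not state the infection probability, the seeding, or the update rule, so `beta`, `seeds`, and the synchronous update used here are assumptions, as is the function name `si_simulation`.

```python
import random

def si_simulation(adjacency, seeds, steps, beta=1.0, rng=None):
    """Return the number of infected nodes after each step of an SI process.

    adjacency: dict mapping node -> iterable of neighbours.
    beta: assumed probability that an infected node infects a given
          susceptible neighbour in one step (not taken from the slides).
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in adjacency.get(node, ()):
                if neighbour not in infected and rng.random() < beta:
                    newly_infected.add(neighbour)
        infected |= newly_infected       # infected nodes never recover in SI
        history.append(len(infected))
    return history

# Toy path graph 0-1-2-3-4, seeded at one end.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
history = si_simulation(path, seeds=[0], steps=5)
print(history)
```

Seeding the process from users of each outlier group and comparing the resulting curves is the kind of experiment the slide's plot summarizes.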

Examples of Outlier Users

Conclusions and Future Work
• We propose to use an unsupervised outlier detection method to identify “interesting” users in OSNs
• The outliers found are then explored to determine their nature
• We propose a new method that scales to millions of users and test it with a real data set
• In the future we plan to use the method in other contexts where identifying outliers in multidimensional data is useful (fraud detection, faulty images, etc.)

Ongoing Work
• Data from Twitter (MAG 2, AMP 226, SHA 6871, MA 5, MS 165, MAS 25, rest 138280)

Thank you!!
Azcorra, A., Chiroque, L. F., Cuevas, R., Fernández Anta, A., Laniado, H., Lillo, R. E., Romo, J., and Sguera, C. (2018). “Unsupervised Scalable Statistical Method for Identifying Influential Users in Online Social Networks.” Scientific Reports.
https://github.com/luisfo/muod.outliers
