Statistical Learning with Networks and Texts - Charles BOUVEYRON

Statistical Learning with Networks and Texts
Charles BOUVEYRON
Professor of Statistics, Chair of Excellence Inria on "Data Science"
Laboratoire LJAD, UMR CNRS 7351
Equipe Asclepios, Inria Sophia-Antipolis
Université Côte d'Azur
charles.bouveyron@unice.fr
@cbouveyron

Preamble
"Essentially, all models are wrong but some are useful" (George E.P. Box)

Outline
1. Introduction
2. The Stochastic Topic Block Model
3. Numerical application: The Enron case
4. The Linkage project
5. Conclusion

Introduction
In statistical learning, the challenge nowadays is to learn from data which are:
- high-dimensional (p large),
- big or arriving as a stream (n large),
- evolutive (evolving phenomenon),
- heterogeneous (categorical, functional, networks, texts, ...).
In any case, understanding the results is essential:
- practitioners are interested in visualizing or clustering their data,
- in selecting the relevant original variables for interpretation,
- and in having a probabilistic model assumed to have generated the data.

Introduction
Statistical analysis of (social) networks has become a strong discipline:
- description and comparison of networks,
- network visualization,
- clustering of network nodes,
with applications in domains ranging from biology to historical sciences:
- biology: analysis of gene regulation processes,
- social sciences: analysis of political blogs,
- historical sciences: clustering and comparison of medieval social networks.
Reference: Bouveyron, Lamassé et al., The random subgraph model for the analysis of an ecclesiastical network in Merovingian Gaul, The Annals of Applied Statistics, vol. 8(1), pp. 377-405, 2014.

Introduction
Networks can be observed directly or indirectly from a variety of sources:
- social websites (Facebook, Twitter, ...),
- personal emails (from your Gmail, Clinton's emails, ...),
- emails of a company (Enron Email data),
- digital/numeric documents (Panama papers, co-authorships, ...),
- and even archived documents in libraries (digital humanities).
⇒ most of these sources involve text!

An introductory example
Figure: A (hypothetical) email network between a few individuals (vertices numbered 1 to 9).

An introductory example
Figure: A typical clustering result for the (directed) binary network (vertices numbered 1 to 9).

An introductory example
Figure: The (directed) network with textual edges. Edge texts shown in the figure: "What is the game result?", "Basketball is great!", "I love watching basketball!", "I love fishing!", "Fishing is so relaxing!".

An introductory example
Figure: Expected clustering result for the (directed) network with textual edges (same vertices and edge texts as in the previous figure).

STBM: Context and notations
We are interested in clustering the nodes of a (directed) network of M vertices into Q groups:
- the network is represented by its M × M adjacency matrix A:
  $A_{ij} = \begin{cases} 1 & \text{if there is an edge between } i \text{ and } j, \\ 0 & \text{otherwise,} \end{cases}$
- if $A_{ij} = 1$, the textual edge is characterized by a set of $D_{ij}$ documents: $W_{ij} = (W_{ij}^{1}, \dots, W_{ij}^{d}, \dots, W_{ij}^{D_{ij}})$,
- each document $W_{ij}^{d}$ is made of $N_{ij}^{d}$ words: $W_{ij}^{d} = (W_{ij}^{d1}, \dots, W_{ij}^{dn}, \dots, W_{ij}^{dN_{ij}^{d}})$.
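As an illustration of this notation, here is a minimal sketch of how a network with textual edges might be stored. The toy documents, the 0-based vertex indices, and the variable names `W`, `A`, `D` and `N` are assumptions chosen to mirror the symbols above; they do not come from the original slides.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: M = 3 vertices (0-based indices, an assumption).
M = 3

# W[(i, j)] holds the D_ij documents sent from i to j;
# each document is a list of words.
W: Dict[Tuple[int, int], List[List[str]]] = {
    (0, 1): [["basketball", "is", "great"],
             ["what", "is", "the", "game", "result"]],
    (1, 0): [["i", "love", "watching", "basketball"]],
    (2, 1): [["fishing", "is", "so", "relaxing"]],
}

# Adjacency matrix: A[i][j] = 1 if there is an edge from i to j, 0 otherwise.
A = [[1 if (i, j) in W else 0 for j in range(M)] for i in range(M)]

# D[i][j] = number of documents on edge (i, j).
D = [[len(W.get((i, j), [])) for j in range(M)] for i in range(M)]

# N[(i, j)][d] = number of words in document d of edge (i, j).
N = {edge: [len(doc) for doc in docs] for edge, docs in W.items()}
```

Storing the documents in a dictionary keyed by edge keeps the representation sparse: only pairs with $A_{ij} = 1$ carry text.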

STBM: Modeling of the edges
Let us assume that edges are generated according to an SBM model:
- each node i is associated with an (unobserved) group among Q according to: $Y_i \sim \mathcal{M}(\rho)$, where $\rho \in [0,1]^Q$ is the vector of group proportions,
- the presence of an edge $A_{ij}$ between i and j is drawn according to: $A_{ij} \mid Y_{iq} Y_{jr} = 1 \sim \mathcal{B}(\pi_{qr})$, where $\pi_{qr} \in [0,1]$ is the connection probability between clusters q and r.
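The two sampling steps above can be sketched as a small simulation. This is only an illustrative sketch: the function name `sample_sbm`, the parameter values, and the exclusion of self-loops are assumptions, not part of the original slides.

```python
import random

def sample_sbm(M, rho, pi, seed=0):
    """Draw group labels and a directed adjacency matrix from an SBM.

    rho: list of Q group proportions; pi: Q x Q connection probabilities.
    Self-loops are excluded here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    Q = len(rho)
    # Y_i ~ M(rho): one group label per vertex.
    Y = rng.choices(range(Q), weights=rho, k=M)
    # A_ij | Y_i = q, Y_j = r ~ B(pi_qr).
    A = [[0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            if i != j and rng.random() < pi[Y[i]][Y[j]]:
                A[i][j] = 1
    return Y, A

# Hypothetical parameters: two groups, dense within, sparse between.
Y, A = sample_sbm(M=10, rho=[0.5, 0.5], pi=[[0.8, 0.1], [0.1, 0.8]])
```

Labels are stored as integers rather than one-hot vectors, so $Y_{iq} = 1$ in the slides corresponds to `Y[i] == q` here.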

STBM: Modeling of the documents
The generative model for the documents is as follows:
- each pair of clusters (q, r) is first associated with a vector of topic proportions $\theta_{qr} = (\theta_{qrk})_k$ sampled from a Dirichlet distribution: $\theta_{qr} \sim \mathrm{Dir}(\alpha)$, such that $\sum_{k=1}^{K} \theta_{qrk} = 1, \ \forall (q, r)$,
- the n-th word $W_{ij}^{dn}$ of document d in $W_{ij}$ is then associated with a latent topic vector $Z_{ij}^{dn}$ according to: $Z_{ij}^{dn} \mid \{A_{ij} Y_{iq} Y_{jr} = 1, \theta\} \sim \mathcal{M}(1, \theta_{qr})$,
- then, given $Z_{ij}^{dn}$, the word $W_{ij}^{dn}$ is assumed to be drawn from a multinomial distribution: $W_{ij}^{dn} \mid Z_{ij}^{dnk} = 1 \sim \mathcal{M}(1, \beta_k = (\beta_{k1}, \dots, \beta_{kV}))$, where V is the vocabulary size.
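The three steps above can be sketched as follows. The helper names `sample_dirichlet` and `sample_document`, the sizes Q, K and V, and the randomly drawn topic matrix `beta` are all hypothetical choices for illustration; in the model itself β is a parameter, not a random draw.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_document(theta_qr, beta, n_words, rng):
    """Generate one document (a list of word indices) for a cluster pair (q, r)."""
    K, V = len(theta_qr), len(beta[0])
    doc = []
    for _ in range(n_words):
        k = rng.choices(range(K), weights=theta_qr)[0]  # Z ~ M(1, theta_qr)
        v = rng.choices(range(V), weights=beta[k])[0]   # W | Z_k = 1 ~ M(1, beta_k)
        doc.append(v)
    return doc

rng = random.Random(0)
Q, K, V = 2, 3, 5                      # hypothetical sizes
alpha = [1.0] * K
# One vector of topic proportions per pair of clusters (q, r).
theta = [[sample_dirichlet(alpha, rng) for _ in range(Q)] for _ in range(Q)]
# Toy topic-word matrix, drawn at random only to obtain valid probabilities.
beta = [sample_dirichlet([1.0] * V, rng) for _ in range(K)]
doc = sample_document(theta[0][1], beta, n_words=8, rng=rng)
```

Every word exchanged between a vertex of cluster q and a vertex of cluster r shares the same topic proportions `theta[q][r]`, which is what ties the textual content to the clustering.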

STBM at a glance...
Figure: The stochastic topic block model (graphical model over the variables and parameters ρ, α, θ, Z, Y, β, π, W, A).

The C-VEM algorithm for inference
The C-VEM algorithm is as follows:
- we use a VEM algorithm to maximize $\tilde{\mathcal{L}}$ with respect to β and R(Z, θ), which essentially corresponds to the VEM algorithm of Blei et al. (2003),
- then, $\log p(A, Y \mid \rho, \pi)$ is maximized with respect to ρ and π to provide estimates,
- finally, $\mathcal{L}(R(\cdot); Y, \rho, \pi, \beta)$ is maximized with respect to Y, which is the only term involved in both $\tilde{\mathcal{L}}$ and the SBM complete data log-likelihood.
Optimization over Y:
- we propose an online approach which cycles randomly through the vertices,
- at each step, a single vertex i is considered and all membership vectors $Y_{j \neq i}$ are held fixed,
- for vertex i, we try every possible cluster assignment $Y_i$ and keep the one which maximizes $\mathcal{L}(R(\cdot); Y, \rho, \pi, \beta)$.
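The online optimization over Y can be sketched as a greedy sweep. Note the main assumption: the objective used below is only the SBM complete-data log-likelihood $\log p(A, Y \mid \rho, \pi)$, a stand-in for the full bound $\mathcal{L}(R(\cdot); Y, \rho, \pi, \beta)$, which also contains the text term. The function names and the toy network are hypothetical.

```python
import math
import random

def sbm_loglik(A, Y, rho, pi):
    """log p(A, Y | rho, pi) for a directed SBM without self-loops.

    Used here as a stand-in objective (assumption); the actual C-VEM
    step maximizes a bound that also involves the documents.
    """
    M = len(A)
    ll = sum(math.log(rho[Y[i]]) for i in range(M))
    for i in range(M):
        for j in range(M):
            if i != j:
                p = pi[Y[i]][Y[j]]
                ll += math.log(p) if A[i][j] else math.log(1.0 - p)
    return ll

def greedy_sweep(Y, Q, objective, rng):
    """One random pass over the vertices; each vertex keeps its best label."""
    Y = list(Y)
    order = list(range(len(Y)))
    rng.shuffle(order)                 # cycle randomly through the vertices
    for i in order:
        best_q, best_val = Y[i], objective(Y)
        for q in range(Q):             # try every cluster assignment for vertex i
            Y[i] = q                   # all other labels Y_j, j != i, stay fixed
            val = objective(Y)
            if val > best_val:
                best_q, best_val = q, val
        Y[i] = best_q
    return Y

# Toy network with two densely connected triangles (hypothetical data).
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
rho = [0.5, 0.5]
pi = [[0.9, 0.1], [0.1, 0.9]]
objective = lambda labels: sbm_loglik(A, labels, rho, pi)
Y0 = [0, 1, 0, 1, 0, 1]
Y1 = greedy_sweep(Y0, 2, objective, random.Random(0))
```

Because a vertex only changes label when the objective strictly improves, a sweep can never decrease the objective; repeated sweeps would typically be run until no label changes.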

Analysis of the Enron Emails
The Enron data set:
- all emails between 149 Enron employees,
- from 1999 to the bankruptcy in late 2001,
- almost 253,000 emails in the whole database.
Figure: Temporal distribution of Enron emails (x-axis: date, from 09/01 to 12/30; y-axis: frequency, from 0 to 2000).
