

SLIDE 1

Scalable Methods for the Analysis of Network-Based Data

Dynamic Egocentric Models for Citation Networks

Duy Vu Arthur Asuncion David Hunter Padhraic Smyth

To appear in Proceedings of the 28th International Conference on Machine Learning, 2011

MURI meeting, June 3, 2011

SLIDE 2

Outline

◮ Egocentric Modeling Framework
◮ Inference for the Models
◮ Application to Citation Network Datasets

SLIDE 3

Egocentric Counting Processes

◮ Goal: Model a dynamically evolving network.
◮ Following standard recurrent event theory, place a counting process N_i(t) on node i, i = 1, . . . , n.
◮ N_i(t) counts the number of “events” involving the ith node.
◮ Combining the N_i(t) gives a multivariate counting process N(t) = (N_1(t), . . . , N_n(t)).
◮ Genuinely multivariate; no assumption about the independence of the N_i(t).
◮ “Egocentric” in Carter’s terminology, because the indices i are nodes, not node pairs.

SLIDE 4

Modeling of Citation Networks

◮ New papers join the network over time.
◮ At arrival, a paper cites others that are already in the network.
◮ The main dynamic development is the number of citations received.
◮ Thus, N_i(t) equals the cumulative number of citations to paper i at time t.
◮ “Egocentric” means N_i(t) is ascribed to nodes. The alternative “relational” framework, using N_(i,j)(t), is not appropriate here: the relationship (i, j) is at risk of an event (a citation) only at a single instant in time.
◮ Further discussion of general time-varying network modeling ideas is given by Butts (2008) and Brandes et al. (2009).

SLIDES 5–6

The Doob-Meyer Decomposition

Each N_i(t) is nondecreasing in time, so N(t) may be considered a submartingale; i.e., it satisfies E[N(t) | past up to time s] ≥ N(s) for all t > s. Any submartingale may be uniquely decomposed as

    N(t) = ∫₀ᵗ λ(s) ds + M(t),

where
◮ λ(t) is the “signal” at time t (this intensity function is what we will model), and
◮ M(t) is a continuous-time martingale.

SLIDES 7–8

Modeling the Intensity Process

The intensity process for node i is given by

    λ_i(t | H_t−) = Y_i(t) α_0(t) exp(β⊤ s_i(t)),

where
◮ Y_i(t) = I(t > t_i^arr) is the “at-risk” indicator,
◮ H_t− is the past of the network up to but not including time t,
◮ α_0(t) is the baseline hazard function,
◮ β is the vector of coefficients to estimate, and
◮ s_i(t) = (s_i1(t), . . . , s_ip(t)) is a p-vector of statistics for paper i.
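For illustration, the intensity for a single node can be sketched as follows. This is a minimal sketch, not the paper's implementation: the constant baseline value and all function and variable names are our own.

```python
import numpy as np

def intensity(t, t_arr, beta, s_t, alpha0=1.0):
    """Cox-type intensity lambda_i(t) = Y_i(t) * alpha0(t) * exp(beta' s_i(t)).

    t      : current time
    t_arr  : arrival time of node i (the node is at risk only after arrival)
    beta   : (p,) coefficient vector
    s_t    : (p,) vector of statistics s_i(t)
    alpha0 : baseline hazard at time t (held constant here for illustration)
    """
    at_risk = 1.0 if t > t_arr else 0.0   # Y_i(t) = I(t > t_i^arr)
    return at_risk * alpha0 * np.exp(beta @ s_t)
```

Note that before a paper's arrival the at-risk indicator forces the intensity to zero, so papers not yet in the network cannot receive citations.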

SLIDE 9

Preferential Attachment Statistics

For each cited paper j already in the network . . .

◮ First-order PA: s_j1(t) = Σ_{i=1}^N y_ij(t). “Rich get richer” effect.
◮ Second-order PA: s_j2(t) = Σ_{i≠k} y_ki(t) y_ij(t). Effect due to being cited by well-cited papers.
◮ Recency-based first-order PA (we take T_w = 180 days): s_j3(t) = Σ_{i=1}^N y_ij(t) I(t − t_i^arr < T_w). Temporary elevation of citation intensity after recent citations.

[Diagram: citation paths into node j]

Statistics in red are time-dependent. Others are fixed once j joins the network.
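With the indicators y_ij(t) collected in a 0/1 adjacency matrix at a fixed time, both preferential attachment statistics reduce to matrix operations. A sketch (the function and variable names are ours, not from the paper):

```python
import numpy as np

def pa_statistics(Y):
    """First- and second-order preferential attachment statistics.

    Y : (n, n) 0/1 matrix with Y[i, j] = 1 if paper i cites paper j.
    s1[j] : in-degree of j (citations received so far)
    s2[j] : number of 2-paths k -> i -> j (j cited by well-cited papers)
    """
    s1 = Y.sum(axis=0)         # column sums: citations received
    s2 = (Y @ Y).sum(axis=0)   # column sums of the 2-path matrix
    return s1, s2
```

For example, if paper 0 cites papers 1 and 2 and paper 1 cites paper 2, then paper 2 has in-degree 2 and one incoming 2-path (0 → 1 → 2).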

SLIDE 10

Triangle Statistics

For each cited paper j already in the network . . .

◮ “Seller” statistic: s_j4(t) = Σ_{i≠k} y_ki(t) y_ij(t) y_kj(t).
◮ “Broker” statistic: s_j5(t) = Σ_{i≠k} y_kj(t) y_ji(t) y_ki(t).
◮ “Buyer” statistic: s_j6(t) = Σ_{i≠k} y_jk(t) y_ki(t) y_ji(t).

[Diagram: citation triangle with node A as Seller, B as Broker, C as Buyer]

Statistics in red are time-dependent. Others are fixed once j joins the network.
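Each triangle statistic also has a matrix form: the count for every node can be read off from element-wise products of the adjacency matrix with its two-path matrices. A sketch under the same adjacency convention as above (names are ours):

```python
import numpy as np

def triangle_statistics(Y):
    """Seller, broker, and buyer triangle statistics for every node j.

    Y : (n, n) 0/1 matrix with Y[i, j] = 1 if paper i cites paper j.
    seller[j]: # of (k, i) with k -> i, i -> j, k -> j
    broker[j]: # of (k, i) with j -> i, k -> j, k -> i
    buyer[j] : # of (k, i) with j -> k, k -> i, j -> i
    """
    YtY = Y.T @ Y   # (YtY)[a, b] = number of common citers of a and b
    YY = Y @ Y      # (YY)[a, b]  = number of 2-paths a -> . -> b
    seller = (Y * YtY).sum(axis=0)  # sum_i y_ij * |{k: k->i, k->j}|
    broker = (Y * YtY).sum(axis=1)  # sum_i y_ji * |{k: k->j, k->i}|
    buyer = (Y * YY).sum(axis=1)    # sum_i y_ji * |{k: j->k, k->i}|
    return seller, broker, buyer
```

In a single closed triangle (paper 2 cites 1 and 0, paper 1 cites 0), paper 0 is the seller, paper 1 the broker, and paper 2 the buyer.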

SLIDE 11

Out-Path Statistics

For each cited paper j already in the network . . .

◮ First-order out-degree (OD): s_j7(t) = Σ_{i=1}^N y_ji(t).
◮ Second-order OD: s_j8(t) = Σ_{i≠k} y_jk(t) y_ki(t).

[Diagram: out-paths from node j]

Statistics in red are time-dependent. Others are fixed once j joins the network.
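The out-path statistics are the row-wise analogues of the in-degree counts above (sketch; names are ours):

```python
import numpy as np

def out_path_statistics(Y):
    """Out-path statistics for every node j.

    Y : (n, n) 0/1 matrix with Y[i, j] = 1 if paper i cites paper j.
    s7[j] : out-degree of j (number of papers j cites)
    s8[j] : number of 2-paths j -> k -> i
    """
    s7 = Y.sum(axis=1)         # row sums: citations made
    s8 = (Y @ Y).sum(axis=1)   # row sums of the 2-path matrix
    return s7, s8
```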

SLIDE 12

Topic Modeling Statistics

Additional statistics, using abstract text where available:

◮ An LDA model (Blei et al., 2003) is learned on the training set.
◮ Topic proportions θ are generated for each training node.
◮ The LDA model is also used to estimate topic proportions θ for each node in the test set.
◮ We construct a vector of similarity statistics: s_j^LDA(t_i^arr) = θ_i ∘ θ_j, where ∘ denotes the element-wise product of two vectors.
◮ We use 50 topics; each component of s_j has a corresponding β.
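The similarity statistic is simply the element-wise (Hadamard) product of the two topic-proportion vectors, so each of the 50 components can carry its own coefficient in β. A sketch (names are ours):

```python
import numpy as np

def lda_similarity(theta_i, theta_j):
    """Per-topic similarity statistics s_j^LDA = theta_i o theta_j.

    theta_i, theta_j : (K,) topic-proportion vectors (each sums to 1).
    The result is the element-wise product; each of its K components
    is paired with its own beta coefficient in the intensity model.
    """
    return theta_i * theta_j
```

A component is large only when both papers place substantial mass on the same topic, so topical overlap can raise (or lower) the citation intensity topic by topic.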

SLIDES 13–14

Partial Likelihood

Recall: The intensity process for node i is

    λ_i(t | H_t−) = Y_i(t) α_0(t) exp(β⊤ s_i(t)).

If α_0(t) ≡ α_0(t, γ), we may use the “local Poisson-ness” of the multivariate counting process to obtain (and maximize) a full likelihood function (details omitted). However, we treat α_0 as a nuisance parameter and take a partial likelihood approach as in Cox (1972): Maximize

    L(β) = ∏_{e=1}^m exp(β⊤ s_{i_e}(t_e)) / [ Σ_{i=1}^n Y_i(t_e) exp(β⊤ s_i(t_e)) ] = ∏_{e=1}^m exp(β⊤ s_{i_e}(t_e)) / κ(t_e).

Trick: Write κ(t_e) = κ(t_{e−1}) + Δκ(t_e), then optimize the Δκ(t_e) calculation.
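The incremental-κ trick can be sketched as follows: between consecutive events only a few nodes change their statistics (or enter the risk set), so the normalizer is updated by the difference rather than recomputed over all n nodes. The event/data layout below is our own illustrative choice, not the paper's implementation:

```python
import numpy as np

def log_partial_likelihood(beta, events):
    """Log partial likelihood with incremental updates of kappa(t_e).

    events : time-ordered list of (cited, changed) pairs, where `cited` is
             the node receiving the citation and `changed` maps node id ->
             its new statistics vector (only for nodes whose statistics or
             at-risk status changed since the previous event).
    """
    kappa = 0.0    # current normalizer: sum_i Y_i(t) exp(beta' s_i(t))
    contrib = {}   # node id -> current exp(beta' s_i(t)) term
    logL = 0.0
    for cited, changed in events:
        for i, s in changed.items():
            new = np.exp(beta @ s)
            kappa += new - contrib.get(i, 0.0)   # Delta-kappa update
            contrib[i] = new
        logL += np.log(contrib[cited]) - np.log(kappa)
    return logL
```

Because Δκ(t_e) touches only the changed nodes, each event costs O(#changed) rather than O(n), which is what makes the method scale to large citation networks.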
SLIDE 15

Data Sets We Analyzed

Three citation network datasets from the physics literature:

1. APS: Articles in Physical Review Letters, Physical Review, and Reviews of Modern Physics from 1893 through 2009. Timestamps are monthly for older articles, daily for more recent ones.
2. arXiv-PH: arXiv high-energy physics phenomenology articles from January 1993 to March 2002. Timestamps are daily.
3. arXiv-TH: High-energy physics theory articles from January 1993 to April 2003. Timestamps are continuous-time (millisecond resolution). Also includes the text of paper abstracts.

              Papers     Citations    Unique Times
    APS       463,348    4,708,819    5,134
    arXiv-PH  38,557     345,603      3,209
    arXiv-TH  29,557     352,807      25,004

SLIDE 16

Three Phases

1. Statistics-building phase: Construct the network history and build up network statistics.
2. Training phase: Construct the partial likelihood and estimate the model coefficients.
3. Test phase: Evaluate the predictive capability of the learned model.

Statistics-building is ongoing even through the training and test phases. The phases are split along citation event times.

Number of unique citation event times in the three phases:

              Building   Training   Test
    APS       4,934      100        100
    arXiv-PH  2,209      500        500
    arXiv-TH  19,004     1,000      5,000

SLIDE 17

Average Normalized Ranks

◮ Compute the “rank” of each true citation among the sorted likelihoods of all possible citations.
◮ Normalize by dividing by the number of possible citations.
◮ Average the normalized ranks of the observed citations.
◮ A lower rank indicates better predictive performance.

[Figure: average normalized rank vs. paper batches for APS, arXiv-PH, and arXiv-TH under the PA, P2PT, and P2PTR180 models (plus LDA and LDA+P2PTR180 for arXiv-TH)]

◮ Batch sizes are 3000, 500, and 500, respectively.
◮ PA: preferential attachment only (s_1(t)); P2PT: s_1, . . . , s_8 except s_3; P2PTR180: s_1, . . . , s_8; LDA: LDA statistics only.
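This evaluation can be sketched as follows; the score matrix stands in for the fitted intensities of all at-risk candidates at each citation event (names are ours):

```python
import numpy as np

def average_normalized_rank(scores, cited):
    """Average normalized rank of the true citations (lower is better).

    scores : (m, n) array of predicted intensities at each of m events;
             papers that are not candidates can be marked with -inf.
    cited  : length-m sequence of the papers actually cited.
    """
    ranks = []
    for row, j in zip(scores, cited):
        candidates = np.isfinite(row)
        rank = int((row[candidates] > row[j]).sum())  # candidates scored higher
        ranks.append(rank / candidates.sum())         # normalize by # candidates
    return float(np.mean(ranks))
```

A perfect model ranks every true citation first, giving an average normalized rank of 0; random scoring gives roughly 0.5.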

SLIDE 18

Recall Performance

Recall: Proportion of true citations among the K largest likelihoods.

[Figure: recall vs. cut-point K for the PA, P2PT, P2PTR180, LDA, and LDA+P2PTR180 models]

◮ PA: preferential attachment only (s_1(t)); P2PT: s_1, . . . , s_8 except s_3; P2PTR180: s_1, . . . , s_8; LDA: LDA statistics only.
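Recall at a cut-point K can be sketched the same way (names are ours):

```python
import numpy as np

def recall_at_k(scores, cited, K):
    """Proportion of true citations ranked among the top-K predicted scores.

    scores : (m, n) array of predicted intensities at each of m events.
    cited  : length-m sequence of the papers actually cited.
    K      : cut-point (number of top-scored candidates retained).
    """
    hits = 0
    for row, j in zip(scores, cited):
        top_k = np.argsort(row)[::-1][:K]   # indices of the K largest scores
        hits += int(j in top_k)
    return hits / len(cited)
```

Recall is nondecreasing in K and reaches 1 once K covers all candidates, which is why the curves on this slide all rise toward 1 as the cut-point grows.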

SLIDE 19

Coefficient Estimates for LDA + P2PTR180 Model

    Statistic       Coefficient (β)
    s_1 (PA)         0.01362
    s_2 (2nd PA)     0.00012
    s_3 (PA-180)     0.02052
    s_4 (Seller)    −0.00126
    s_5 (Broker)    −0.00066
    s_6 (Buyer)     −0.00387
    s_7 (1st OD)     0.00090
    s_8 (2nd OD)     0.02052

[Diagram: closed citation triangle with A as Seller, B as Broker, C as Buyer, contrasted with open configurations involving nodes D and E]

Diverse seller effect: D is more likely to be cited than A.
Diverse buyer effect: E is more likely to be cited than C.

SLIDE 20

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Brandes, U., Lerner, J., and Snijders, T. A. B. Networks evolving step by step: Statistical analysis of dyadic event data. In Advances in Social Network Analysis and Mining, pp. 200–205. IEEE, 2009.

Butts, C. T. A relational event framework for social action. Sociological Methodology, 38(1):155–200, 2008.

Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34:187–220, 1972.

SLIDE 21

Why Such Long Building Phases?

◮ The lengthy building phase mitigates truncation effects at the beginning of network formation and the effects of severely grouped event times.
◮ The training and test windows still cover a substantial period of time (e.g., 2.5 years for APS).
◮ Performance is relatively invariant to the size of the training window: we achieved essentially the same results using windows of size 2,000 and 5,000 for arXiv-TH.

Number of unique citation event times in the three phases:

              Building   Training   Test
    APS       4,934      100        100
    arXiv-PH  2,209      500        500
    arXiv-TH  19,004     1,000      5,000

SLIDE 22

Average Partial Loglikelihood

◮ Compute the average of the partial log-likelihoods over the citation events.

[Figure: average partial log-likelihood vs. paper batches for APS, arXiv-PH, and arXiv-TH under the PA, P2PT, and P2PTR180 models (plus LDA and LDA+P2PTR180 for arXiv-TH)]

◮ Batch sizes are 3000, 500, and 500, respectively.
◮ PA: preferential attachment only (s_1(t)); P2PT: s_1, . . . , s_8 except s_3; P2PTR180: s_1, . . . , s_8; LDA: LDA statistics only.