Structure and Dynamics of Research Collaboration in Computer Science

SLIDE 1

Outline Introduction Data Collection Within-Area Analysis Network-wide Metrics Conclusions

Structure and Dynamics of Research Collaboration in Computer Science

C. Bird, E. Barr, A. Nash, P. Devanbu, V. Filkov, Z. Su

presented by Elina Weinbrand

2014-03-26

SLIDE 2

1. Outline
2. Introduction
  • Motivation
  • Related work
3. Data Collection
4. Within-Area Analysis
  • Degree Distribution
  • Assortativity
  • Longitudinal Assortativity
  • Betweenness Centralization
  • Community Structure
5. Network-wide Metrics
  • Area Overlap
  • Migration
6. Conclusions

SLIDE 4

Motivation

Computer science is a diverse and growing area of scholarly activity, with many sub-areas:
  • Artificial Intelligence (AI)
  • Computational Biology (CBIO)
  • Cryptography (CRYPTO)
  • Databases (DB)
  • Graphics (GRAPH)
  • Programming Languages (PL)
  • Software Engineering (SE)
  • Security (SEC)
  • Theory (THEORY)

SLIDE 7

Motivation

There are many differences between the research areas:
  • Old (e.g., THEORY) vs. newer (e.g., GRAPH)
  • Large number of researchers (e.g., DB and GRAPH) vs. smaller (e.g., CRYPTO and SE)
  • Stable phase (e.g., THEORY) vs. growing rapidly (e.g., SEC)

SLIDE 11

Motivation

There are other, more subtle, informal and folkloric differences in character and style between areas:
  • Intellectually unified vs. several distinct, thriving groups
  • Interacting strongly with other areas vs. more stand-alone
  • Dominated by a few researchers vs. a more diffuse collaborative structure
  • Older and younger researchers collaborating vs. researchers collaborating primarily with others like them

SLIDE 12

Motivation

This paper begins to quantify and study these informal, folkloric differences, producing data that may provide "actionable intelligence" for interested parties such as researchers and funding agencies.

SLIDE 13

Related work

  • Collaboration over time: characterizing and modeling network evolution. Huang et al., 2008
  • Group formation in large social networks: membership, growth, and evolution. Backstrom et al., 2006
  • Community structure in social and biological networks. M. Girvan and M. Newman, 2002

SLIDE 15

DBLP

The dblp service was started by the database systems and logic programming (dblp) research group at the University of Trier, Germany, and initially focused on publications from that field of research. Over the years, dblp gradually expanded to cover all fields of computer science, while the acronym survived. At times, the label "Digital Bibliography & Library Project" has been adopted as a backronym for dblp.

SLIDE 16

DBLP

  • DBLP is a publicly available bibliographic data source.
  • DBLP is maintained through massive human effort, with special attention paid to issues such as author name consistency.
  • DBLP data is publicly available in XML form, which is easily parsed, at http://dblp.uni-trier.de/xml/
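
As a rough illustration of how such XML records might be read, the sketch below parses a tiny, made-up fragment shaped like DBLP `inproceedings` records with Python's standard `xml.etree.ElementTree`. The sample records, the author names and the `parse_papers` helper are assumptions for illustration only. The real dump is one very large file with its own DTD, so a streaming parser would probably be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the shape of DBLP records; the keys,
# authors and titles are invented for this example.
SAMPLE = """<dblp>
  <inproceedings key="conf/x/A08">
    <author>Ann Author</author>
    <author>Bob Builder</author>
    <title>A Paper.</title>
    <booktitle>ICSE</booktitle>
    <year>2008</year>
  </inproceedings>
  <inproceedings key="conf/y/B07">
    <author>Ann Author</author>
    <title>Another Paper.</title>
    <booktitle>PLDI</booktitle>
    <year>2007</year>
  </inproceedings>
</dblp>"""


def parse_papers(xml_text):
    """Return a list of (authors, venue, year) tuples, one per paper."""
    root = ET.fromstring(xml_text)
    papers = []
    for rec in root.iter("inproceedings"):
        authors = [a.text for a in rec.findall("author")]
        venue = rec.findtext("booktitle")
        year = int(rec.findtext("year"))
        papers.append((authors, venue, year))
    return papers


papers = parse_papers(SAMPLE)
```

Each tuple carries what the later steps need: who wrote the paper, where it appeared, and when.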

SLIDE 19

Data Collection Steps

1. Define the list of researchers; solve the name consistency problem.
2. Define the research areas in computer science as sets of first-tier conferences and assign papers to conferences.
3. Create collaboration graphs.

SLIDE 20

Collaboration graphs

Definition: Let C(p) be a predicate (constraint) on papers that selects only the publications we are interested in. Let P be the set of all papers, A the set of all authors, and let W(a, p) be a predicate that is true if and only if author a is an author, or writer, of paper p. We then create the graph G = (V, E) as follows:

V = {a : a ∈ A, p ∈ P, C(p) ∧ W(a, p)}
E = {(a, b) : a, b ∈ V, p ∈ P, C(p) ∧ W(a, p) ∧ W(b, p)}
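
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `(authors, venue, year)` paper tuples, the `collaboration_graph` name and the sample data are assumptions, and self-loops (a = b) are left out for simplicity.

```python
from itertools import combinations


def collaboration_graph(papers, constraint):
    """papers: iterable of (authors, venue, year); constraint: predicate C(p).

    Returns (V, E), where E holds unordered author pairs as frozensets.
    """
    vertices, edges = set(), set()
    for paper in papers:
        if not constraint(paper):
            continue
        authors = set(paper[0])
        vertices |= authors  # every writer of a selected paper is a vertex
        for a, b in combinations(sorted(authors), 2):
            edges.add(frozenset((a, b)))  # co-authors of a paper share an edge
    return vertices, edges


# Hypothetical papers; C(p) keeps only those published at "ICSE".
papers = [
    (["ann", "bob", "cy"], "ICSE", 2008),
    (["ann", "dee"], "PLDI", 2007),
    (["eve"], "ICSE", 2006),
]
V, E = collaboration_graph(papers, lambda p: p[1] == "ICSE")
```

Swapping the constraint (for example, a set of venues and a year range) yields the graph for a different area or period.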

SLIDE 22

Degree Distribution

The degree distributions of the sub-areas are almost identical, save for a scaling factor, and thus do not make good discriminators

SLIDE 23

Assortativity

Assortative mixing in networks is the tendency of vertices to be connected to like vertices.

Definition: Define a set of properties over a graph's vertices and label each vertex with its value for each property. Let $e_{xy}$ be the fraction of all edges in the graph that start at a vertex labelled $x$ and end at a vertex labelled $y$; $e$ is known as the mixing matrix. Let $a_x$ be the fraction of all edges in the graph incident to a vertex labelled $x$.

SLIDE 24

Assortativity

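Definition (cont.): Assortativity is the Pearson correlation coefficient of the property values of any two vertices connected by an edge:

$$\frac{\sum_{xy} xy\,(e_{xy} - a_x a_y)}{\sigma_a^2}$$

where $\sigma_a$ is the standard deviation of the distribution $a_x$.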
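
For a scalar vertex property, this correlation can be computed directly over the edge list. The sketch below is a minimal illustration assuming an undirected graph in which each edge is counted once in each direction; the `assortativity` function name and the toy values are hypothetical.

```python
import math


def assortativity(edges, value):
    """Pearson correlation of `value` across the two ends of each edge.

    Each undirected edge is counted in both directions, so both ends
    share the same mean and variance.
    """
    xs, ys = [], []
    for a, b in edges:
        xs += [value[a], value[b]]
        ys += [value[b], value[a]]
    n = len(xs)
    if n == 0:
        return math.nan
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return math.nan  # undefined when all end values are equal
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var


# Hypothetical property values, e.g. number of publications.
value = {"a": 1, "b": 1, "c": 2, "d": 2}
like_with_like = assortativity([("a", "b"), ("c", "d")], value)
like_with_unlike = assortativity([("a", "c"), ("b", "d")], value)
```

Here the first graph joins only vertices with equal values and the second joins only vertices with different values, which gives the two extremes of the measure.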

SLIDE 25

Assortativity

The assortativity ranges from 1, which indicates that all vertices are connected only to vertices that have similar values for that property, to -1, which indicates a perfect negative correlation in the values of connected vertices.

SLIDE 27

Longitudinal Assortativity

  • Assortativity is a static measure of a graph at a particular point in time; it does not incorporate graph evolution.
  • Longitudinal assortativity measures the correlation of dynamic properties of nodes at the time that an edge is created.

SLIDE 30

Longitudinal Assortativity

1. Edges and vertex properties (such as career length or number of publications) are timestamped when they change or are added.
2. The collaboration graph becomes a multigraph, with one edge carrying a single timestamp for each collaboration between two authors.
3. The collaboration multigraph is decomposed into a sequence of multigraphs. Each multigraph in this sequence contains only those property values and edges whose timestamp is earlier than the point in time under consideration; where a property has several such values, the one with the greatest timestamp is taken.
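
One possible reading of these steps is sketched below; it is not necessarily the authors' exact procedure. For every collaboration edge created up to a time t, each end-point's property value is looked up as it stood when the edge was created, and the pairs are then correlated. The data layout, the function names and the choice to treat a timestamp equal to the edge's own as "already known" are all assumptions.

```python
import math
from bisect import bisect_right


def value_at(history, t):
    """history: [(timestamp, value), ...] sorted by timestamp.

    Returns the value with the greatest timestamp not later than t,
    or None if nothing has been recorded yet.
    """
    i = bisect_right([ts for ts, _ in history], t)
    return history[i - 1][1] if i else None


def longitudinal_assortativity(collabs, histories, t):
    """Correlate end-point values, each taken at edge-creation time,
    over all timestamped edges (a, b, ts) with ts <= t."""
    xs, ys = [], []
    for a, b, ts in collabs:
        if ts > t:
            continue
        va, vb = value_at(histories[a], ts), value_at(histories[b], ts)
        if va is None or vb is None:
            continue  # property not yet observed for one end
        xs += [va, vb]  # count the undirected edge in both directions
        ys += [vb, va]
    n = len(xs)
    if n == 0:
        return math.nan
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return math.nan
    return sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n / var


# Hypothetical data: property values change after the edges are created.
histories = {
    "a": [(0, 1), (2, 5)],
    "b": [(0, 1)],
    "c": [(0, 2)],
    "d": [(0, 2), (2, 9)],
}
collabs = [("a", "b", 1), ("c", "d", 1)]
result = longitudinal_assortativity(collabs, histories, 3)
```

In this toy data the collaborators were alike when they first worked together, even though their values diverge later, which is the kind of effect a static measure taken at the end would miss.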

SLIDE 31

Longitudinal Assortativity

Final assortativity is 0

SLIDE 32

Longitudinal Assortativity

  • Longitudinal assortativity at time step t0 is 1
  • Longitudinal assortativity at time step t1 is 1

SLIDE 33

Results

To what degree do computer scientists in different research areas tend to publish with collaborators who are similar to them?
  • Longitudinal assortativity is examined over three properties: number of publications, number of collaborators, and career length.
  • Surprisingly, the assortativity is low, but positive.
  • Cryptography is a field in which senior researchers tend to collaborate with other senior researchers.

SLIDE 34

Betweenness Centralization

  • The betweenness centrality of a vertex in a graph is calculated as the number of geodesics (shortest paths) passing through that vertex.
  • While centrality is a property of individuals, centralization is a property of a network; it measures the relative difference between the highest and lowest values of the centrality metric over all vertices in the graph.

SLIDE 35

Betweenness Centralization

Definition: Let $b_v(v_i)$ be the betweenness value of vertex $v_i$ and let $v^*$ be the vertex with the highest betweenness in the graph. The betweenness centralization $b_g$ of the entire graph is:

$$b_g = \frac{\sum_{i=1}^{n} \left(b_v(v^*) - b_v(v_i)\right)}{n^3 - 4n^2 + 5n - 2}$$

The denominator is the maximum theoretical value of the sum of differences for a graph with $n$ vertices, which is attained when the graph is in a star configuration.
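
The sketch below computes this quantity for small unweighted graphs, using Brandes' algorithm for the betweenness values. It rests on two assumptions that the slides do not state: each ordered pair (s, t) is counted separately, which is what makes a star graph reach exactly 1 with the denominator above, and a vertex lying on only some of the geodesics between a pair receives a fractional share. All function and variable names are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm on an unweighted, undirected graph given as
    {vertex: set_of_neighbours}. Ordered pairs (s, t) are counted, so
    every undirected geodesic contributes twice."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of geodesics from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back to s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


def betweenness_centralization(adj):
    n = len(adj)
    if n < 3:
        raise ValueError("the denominator is zero for fewer than 3 vertices")
    bc = betweenness(adj)
    top = max(bc.values())
    return sum(top - b for b in bc.values()) / (n**3 - 4 * n**2 + 5 * n - 2)


star = {"c": {"l1", "l2", "l3", "l4"},
        "l1": {"c"}, "l2": {"c"}, "l3": {"c"}, "l4": {"c"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
star_value = betweenness_centralization(star)
ring_value = betweenness_centralization(ring)
```

The star is the fully centralized extreme, while in the ring every vertex is equally central, so the differences in the numerator all vanish.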

SLIDE 36

Results

For most areas, betweenness centralization had an initial peak, an early period most likely dominated by pioneers, followed by a plateau signifying a more diffuse flow of information within the community

SLIDE 37

Results

Betweenness centralization in PL shows an initial peak in 1975, followed by a long trough, then a second peak in 1993.

SLIDE 38

Results

Betweenness centralization in security (SEC) shows an unrelenting increase from 2002 to 2006.

SLIDE 39

Community Structure

The community structure of a network is "the division of network vertices into groups within which the network connections are dense, but between which they are sparser".

Definition: Define a $k \times k$ matrix $e$ whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in group $i$ to vertices in group $j$. The row sum is $a_i = \sum_j e_{ij}$ and the column sum is $b_j = \sum_i e_{ij}$. The modularity measure is:

$$Q = \sum_i (e_{ii} - a_i b_i)$$

Values of $Q$ range from 0 (networks of essentially random structure) to 1.
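
As a small worked illustration, the sketch below evaluates Q for a given division into groups. It assumes an undirected graph with at least one edge and counts every edge once in each direction so that the matrix e is symmetric; the `modularity` function and the example graph (two triangles joined by one edge) are hypothetical. It only scores a given division and does not search for the best one.

```python
from collections import defaultdict


def modularity(edges, group):
    """Q = sum_i (e_ii - a_i * b_i) for an undirected edge list.

    edges: list of (u, v) pairs; group: {vertex: group label}.
    """
    e = defaultdict(float)
    total = 2 * len(edges)  # each edge is counted in both directions
    for u, v in edges:
        e[(group[u], group[v])] += 1 / total
        e[(group[v], group[u])] += 1 / total
    a = defaultdict(float)  # row sums
    b = defaultdict(float)  # column sums
    for (i, j), frac in e.items():
        a[i] += frac
        b[j] += frac
    return sum(e[(i, i)] - a[i] * b[i] for i in set(group.values()))


# Two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
lumped = {v: "x" for v in split}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Splitting the graph along the bridge scores 5/14, whereas putting every vertex in one group scores 0, since a single group cannot do better than chance.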

SLIDE 40

Results

What is the level of modularity in different areas?
  • Theoretical conferences are the least modular, indicating that researchers in the field are well integrated.
  • In less theoretical areas, there is more folklore and intuition that is harder to communicate and share.
  • Note: larger communities are not naturally more modular than smaller ones.

SLIDE 42

Area Overlap

Many researchers publish in more than one research area. Area overlap captures the authors who have published in two areas during the same time period, expressed as a fraction of the authors of one of those areas.

Definition: Let $a$ and $b$ be two research areas in computer science and let $A(a, t)$ be the set of authors who have published in area $a$ during time period $t$. The area overlap is:

$$O_a(b, t) = \frac{|A(a, t) \cap A(b, t)|}{|A(a, t)|}$$
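
With author sets in hand, the overlap is a one-line set computation. The sketch below is illustrative only: the `area_overlap` helper and the author names are made up, and the dictionary is assumed to hold the author sets for one fixed time period t.

```python
def area_overlap(authors_by_area, a, b):
    """O_a(b, t) = |A(a, t) & A(b, t)| / |A(a, t)| for one fixed period t."""
    in_a = authors_by_area[a]
    if not in_a:
        raise ValueError(f"no authors published in {a} during this period")
    return len(in_a & authors_by_area[b]) / len(in_a)


# Hypothetical author sets for a single time period.
authors = {
    "DB": {"ann", "bob", "cy", "dee"},
    "AI": {"cy", "dee", "eve", "fay", "gus"},
}
db_in_ai = area_overlap(authors, "DB", "AI")
ai_in_db = area_overlap(authors, "AI", "DB")
```

Because the denominator is the size of the first area, the measure is not symmetric: the same two shared authors are half of DB but only two fifths of AI.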

SLIDE 43

Migration

A score is created for each author and each computer science research area, based on past publication history and favouring recent publications.

Definition: Let $P(r, a, y)$ be the number of publications by author $r$ in area $a$ in year $y$. The research "score" for each author per area per year is one of the following two:

$$S_1(r, a, y) = \sum_{i=1}^{5} P(r, a, y - i) \cdot \frac{6 - i}{5}$$

$$S_2(r, a, y) = \sum_{i=1}^{3} P(r, a, y - i) + \sum_{i=4}^{5} P(r, a, y - i) \cdot \frac{6 - i}{3}$$

$A_i(r, y)$, the research area of researcher $r$ for a particular year $y$ under score $S_i$, is determined by choosing the area with the highest score.
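
The two scores differ only in how they weight the five preceding years, which the following sketch makes concrete. The data layout (a `{year: count}` dictionary per area), the function names, the alphabetical tie-breaking and the choice to return `None` when every score is zero are assumptions for illustration.

```python
def s1(pubs, y):
    """S1: linearly decaying weights 5/5, 4/5, ..., 1/5 over years y-1..y-5.

    pubs maps year -> publication count for one author in one area.
    """
    return sum(pubs.get(y - i, 0) * (6 - i) / 5 for i in range(1, 6))


def s2(pubs, y):
    """S2: full weight for years y-1..y-3, then 2/3 and 1/3 for y-4, y-5."""
    recent = sum(pubs.get(y - i, 0) for i in range(1, 4))
    older = sum(pubs.get(y - i, 0) * (6 - i) / 3 for i in range(4, 6))
    return recent + older


def main_area(pubs_by_area, y, score=s1):
    """Area with the highest score in year y (ties broken alphabetically),
    or None if the author has no score in any area."""
    best = max(sorted(pubs_by_area), key=lambda a: score(pubs_by_area[a], y))
    return best if score(pubs_by_area[best], y) > 0 else None


# Hypothetical publication history for one author.
history = {"SE": {2005: 2, 2001: 3}, "PL": {2003: 1}}
se_s1 = s1(history["SE"], 2006)   # 2 * 5/5 + 3 * 1/5
se_s2 = s2(history["SE"], 2006)   # 2 + 3 * 1/3
area = main_area(history, 2006)
```

Under both scores the two recent SE papers outweigh the older work, so this author's main area for 2006 comes out as SE.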

SLIDE 44

Migration

Definition: Let $R(a, y)$ be the set of researchers who have area $a$ as their main research area in year $y$. The fraction of researchers who migrated from area $a$ to area $b$ between years $y$ and $y + 1$ is:

$$\frac{|R(a, y) \cap R(b, y + 1)|}{|R(a, y)|}$$
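
Given each researcher's main area per year, the migration fraction is again a set intersection. In this sketch the `migration` helper, the nested-dictionary layout and the sample researchers are hypothetical.

```python
def migration(main_area_by_year, a, b, y):
    """|R(a, y) & R(b, y+1)| / |R(a, y)|.

    main_area_by_year maps year -> {researcher: main area in that year}.
    """
    from_a = {r for r, area in main_area_by_year[y].items() if area == a}
    to_b = {r for r, area in main_area_by_year[y + 1].items() if area == b}
    if not from_a:
        raise ValueError(f"no researchers had {a} as main area in {y}")
    return len(from_a & to_b) / len(from_a)


# Hypothetical main areas for three researchers over two years.
areas = {
    2005: {"ann": "DB", "bob": "DB", "cy": "SE"},
    2006: {"ann": "AI", "bob": "DB", "cy": "SE"},
}
db_to_ai = migration(areas, "DB", "AI", 2005)
db_to_db = migration(areas, "DB", "DB", 2005)
```

Setting a = b gives the fraction of researchers who stayed in their area, which is a convenient baseline when reading the migration figures.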

SLIDE 45

Results

What are the relationships between different research areas?
  • Example: the overlap of DB with software engineering and other disciplines in the same year, tracked over time.
  • The proportion of authors from DB publishing in machine learning conferences confirms the folklore that the two areas are converging.

SLIDE 47

Conclusions

  • We used the DBLP bibliographic database.
  • We divided the network into computer science sub-areas.
  • We applied various network analysis metrics.
  • The findings may highlight potential problems within the community and suggest policies and actions to guide us towards a more effective scientific community.

SLIDE 48

Conclusions

Questions?