SLIDE 1 reliability of bibliographic databases
for scientometrics network analysis
Lovro Šubelj
University of Ljubljana, Faculty of Computer and Information Science
ITIS ‘16
SLIDE 2
acknowledgements
Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)
SLIDE 3 study motivation
- bibliographic databases basis for scientific research
- main source of its evaluation (citations, h-index)
- often studied in biblio-/scientometrics literature
- different databases lead to different conclusions (e.g. degree distribution P(k))
- databases differ substantially between each other
- which bibliographic database is most reliable?
SLIDE 4 bibliographic databases
- scientific bibliographic databases
- hand-curated solutions — Web of Science, Scopus
- automatic services — Google Scholar, CiteSeer
- preprint repositories — arXiv, SocArXiv, bioRxiv
- field-specific libraries — PubMed, DBLP, APS
- national information systems — SICRIS
- and many others
SLIDE 5 comparisons of databases
- amount of literature covered — WoS ≈ Scopus
- timespan of literature covered — WoS > Scopus
- available features and use in scientific workflow
- data acquisition and maintenance methodology
- content and structure differ substantially
- only informal notions of reliability
SLIDE 6 reliability of databases
- content — (amount of) literature covered
- structure — accuracy of citation information
- networks of citations between scientific papers
- comparison of structure of citation networks
SLIDE 7 structure of citation networks
- local/global statistics of citation networks
- networks mostly consistent with few outliers
- outliers due to data acquisition in most cases
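A minimal sketch of the kind of local/global statistics compared, computed with networkx on a toy directed graph (the graph and the particular statistics shown are illustrative, not the talk's exact set):

```python
# Sketch: a few local/global statistics of a directed (citation-like) network.
import networkx as nx

G = nx.gnp_random_graph(500, 0.01, directed=True, seed=0)  # toy stand-in

print("mean degree:", 2 * G.number_of_edges() / G.number_of_nodes())
wcc = max(nx.weakly_connected_components(G), key=len)
print("% nodes in largest WCC:", 100 * len(wcc) / G.number_of_nodes())
print("reciprocity:", nx.reciprocity(G))  # near zero in real citation networks
print("mean clustering:", nx.average_clustering(G.to_undirected()))
```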
SLIDE 8 comparison of citation networks
- one can reason only about individual statistics
- comparison over multiple statistics problematic
- similar problem in machine learning community
- comparison of algorithms over multiple data sets
- compare mean ranks of algorithms over data sets
- Friedman rank test with Nemenyi post-hoc test
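A minimal sketch of this test pair in Python, assuming scipy and the scikit-posthocs package (the score matrix is a random stand-in):

```python
# Sketch: compare k databases over N statistics by mean ranks, following
# the machine-learning recipe of Friedman test + Nemenyi post-hoc test.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs

rng = np.random.default_rng(0)
scores = rng.random((10, 4))  # hypothetical: 10 statistics x 4 databases

# Friedman rank test: H0 = all databases have equal mean ranks.
chi2, p = stats.friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {chi2:.2f}, P = {p:.3f}")

# Nemenyi post-hoc test: pairwise P-values for mean-rank differences.
if p < 0.1:
    print(sp.posthoc_nemenyi_friedman(scores))
```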
SLIDE 9 methodology of comparison
- statistics residuals since “true network” not known
- database reliability seen as consistency with rest
- statistics — residuals — independence — ranks
1. Studentized statistics residuals x̂ij: two-tailed Student t-tests of H0: x̂ij = 0 at P-value = 0.1 (Student t-distribution with d.f. N − 2)
2. Pairwise Spearman correlations ρij: two-tailed Fisher independence z-tests of H0: ρij = 0 at P-value = 0.01 (standard normal distribution)
3. Residuals mean ranks Ri: one-tailed Friedman rank test of H0: Ri = Rj at P-value = 0.1 (χ²-distribution with d.f. N − 1)
4. Residuals mean ranks Ri: two-tailed Nemenyi post-hoc test of H0: Ri = Rj at P-value = 0.1 (Studentized range distribution)
(step 2 verifies ∀ρij: H0, i.e. independence of the statistics; step 4 runs only if the Friedman test rejects H0)
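A rough sketch of steps 1 and 2, under one plausible reading of the definitions above (leave-one-out studentization; variable names are mine):

```python
# Sketch: studentized residuals of each statistic (no "true network" exists,
# so each network is judged against the remaining ones), then pairwise
# Spearman tests to check the statistics are roughly independent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.random((6, 13))  # hypothetical: 6 networks x 13 statistics
n_nets, n_stats = X.shape

# Step 1: leave-one-out studentized residuals.
resid = np.empty_like(X)
for j in range(n_stats):
    for i in range(n_nets):
        rest = np.delete(X[:, j], i)
        resid[i, j] = (X[i, j] - rest.mean()) / rest.std(ddof=1)

# Step 2: pairwise Spearman independence tests at P-value = 0.01.
for j in range(n_stats):
    for l in range(j + 1, n_stats):
        rho, p = stats.spearmanr(resid[:, j], resid[:, l])
        if p < 0.01:
            print(f"statistics {j} and {l} correlate (rho = {rho:.2f})")
```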
SLIDE 10 comparison of citation networks
- statistics — residuals — independence — ranks
- most statistics derived from node distributions
[Figure: "Field bow-tie" decomposition of each citation network (component percentages)]
A WoS: 51.4% / 11.2% / 34.4% / 3.0%
B CiteSeer: 37.7% / 10.5% / 46.8% / 5.0%
C Cora: 51.4% / 8.5% / 40.1% / 0.0%
D HistCite: 52.2% / 44.8% / 1.6% / 1.3%
E DBLP: 16.9% / 74.5% / 7.8% / 0.8%
F arXiv: 74.7% / 6.7% / 18.1% / 0.4%
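The percentages above come from a bow-tie decomposition; a generic networkx recipe (not necessarily the exact definition used here) looks like this:

```python
# Sketch: bow-tie decomposition of a directed network around its largest
# strongly connected component (the "core").
import networkx as nx

G = nx.gnp_random_graph(200, 0.02, directed=True, seed=1)  # toy stand-in

core = max(nx.strongly_connected_components(G), key=len)
seed_node = next(iter(core))
out_part = nx.descendants(G, seed_node) - core  # reachable from the core
in_part = nx.ancestors(G, seed_node) - core     # can reach the core
other = set(G) - core - in_part - out_part      # tendrils, tubes, isolated

n = G.number_of_nodes()
for name, part in [("In", in_part), ("Core", core),
                   ("Out", out_part), ("Other", other)]:
    print(f"% {name}: {100 * len(part) / n:.1f}")
```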
SLIDE 11 comparison of citation networks
- mean ranks of citation networks over statistics
- connected networks are not significantly different
- hand-curated WoS > field-specific DBLP
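"Not significantly different" refers to the Nemenyi critical difference; in the standard form (Demšar, 2006), two networks differ significantly when their mean ranks differ by at least:

```latex
% Nemenyi critical difference for k networks compared over N statistics;
% q_alpha is the critical value of the Studentized range divided by sqrt(2).
\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}
```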
SLIDE 12 comparison with other networks
- comparison robust to selection of networks
- comparison with social networks meaningless
- comparison with other information networks
[Figure: mean ranks (1-6) of paper citation networks (A P→P) at P-value = 0.1: WoS, Cora, arXiv, APS, PubMed, DBLP]
SLIDE 13
- other bibliometric networks
- A paper citation information networks
- B author citation social-information networks
- C author collaboration social networks
[Figure: mean ranks (1-6) at P-value = 0.1 for each network type]
A P→P: WoS, Cora, arXiv, APS, PubMed, DBLP
B A↔A: Cora, arXiv, WoS, PubMed, DBLP, APS
C A−A: DBLP, WoS, Cora, APS, PubMed, arXiv
SLIDE 14 robustness of comparison
- results robust to selection of statistics — subgraphs
- results comparable with other techniques — MDS
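A rough sketch of the MDS cross-check, assuming scikit-learn (the residuals matrix is a random stand-in):

```python
# Sketch: project networks into 2-D (Y1, Y2) by multidimensional scaling
# on their statistics residuals; nearby points = similar citation topology.
import numpy as np
from sklearn.manifold import MDS

names = ["APS", "WoS", "DBLP", "PubMed", "Cora", "arXiv"]
rng = np.random.default_rng(0)
resid = rng.normal(size=(6, 13))  # hypothetical residuals matrix

xy = MDS(n_components=2, random_state=0).fit_transform(resid)
for name, (y1, y2) in zip(names, xy):
    print(f"{name}: Y1 = {y1:+.2f}, Y2 = {y2:+.2f}")
```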
[Figure: robustness checks over the P→P, A↔A and A−A networks of APS, WoS, DBLP, PubMed, Cora and arXiv: MDS projections (axes Y1, Y2, Y3) and statistics residuals from −6 to 6 over δ90, rb, d, r(out,out), r(out,in), r(in,out), r(in,in), γout, γin, k, % Out, % Core, % In (% WCC for A−A), plus subgraph selections G0-G8]
SLIDE 15 conclusions of comparison
- notable differences between databases
- there is no “best” bibliographic database
- most appropriate depends on type of analysis
- hand-curated databases perform well overall
- field-specific databases perform poorly
- recipes for future scientometrics studies
- methodology applicable to any network data
SLIDE 16 identification of research areas
- scientific journals classified in disciplines, fields
- research areas of scientific papers unknown
- clustering papers based on direct citation relations
- graph partitioning/community detection methods
- goal is clusters of topically related papers
- clusters recognizable, comprehensible, robust
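One representative clustering step, sketched with networkx's Louvain implementation (one of several method families compared in the talk; the graph is a toy stand-in for a citation network):

```python
# Sketch: cluster papers by direct citation relations via modularity-based
# community detection (Louvain); requires networkx >= 2.8.
import networkx as nx

G = nx.les_miserables_graph()  # toy stand-in for a citation network
clusters = nx.community.louvain_communities(G, seed=0)
sizes = sorted(map(len, clusters), reverse=True)
print(f"{len(clusters)} clusters, sizes: {sizes}")
```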
SLIDE 17
methods for clustering
SLIDE 18 classes of clustering methods
- distances between clusterings of methods
- smaller number of representative methods
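Distances between clusterings can be measured with standard label-comparison scores; a minimal sketch assuming scikit-learn (toy labels):

```python
# Sketch: distance between two clusterings of the same papers via
# normalized mutual information (1 = identical clusterings).
from sklearn.metrics import normalized_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]  # toy: cluster labels from method A
labels_b = [0, 1, 1, 2, 2, 2]  # toy: cluster labels from method B
print("distance:", 1 - normalized_mutual_info_score(labels_a, labels_b))
```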
SLIDE 19 statistical comparison
- size distributions, degeneracy diagrams etc.
- network analysis and bibliometric metrics
SLIDE 20 expert assessment tool
- hands-on assessment for scientometrics field
- CitNetExplorer for analyzing citation networks
SLIDE 21 hands-on expert assessment
- low resolution — one cluster for scientometrics
- high resolution — four clusters for h-index papers
- topic resolution — limited number of methods
SLIDE 22 conclusions of identification
- methods return substantially different clusterings
- no method performs satisfactorily by all criteria
- simple post-processing performs poorly
- map equation methods provide good trade-off
- entire science can be clustered in about one hour
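A minimal sketch of a map equation run, assuming the infomap Python package (pip install infomap; the API shown is from its 1.x releases and the links are toy):

```python
# Sketch: map equation clustering of a directed citation network with the
# Infomap algorithm; API details may differ across infomap versions.
from infomap import Infomap

im = Infomap("--directed --two-level")
for u, v in [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]:  # toy links
    im.add_link(u, v)
im.run()

print(f"{im.num_top_modules} clusters")
for node_id, module_id in im.modules:  # (node, cluster) pairs
    print(node_id, "->", module_id)
```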
SLIDE 23
references
Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)