SLIDE 1 reliability of bibliographic databases
for scientometrics network analysis
Lovro Šubelj
University of Ljubljana, Faculty of Computer and Information Science
ITIS ‘16
SLIDE 2
acknowledgements
Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)
SLIDE 3 study motivation
- bibliographic databases basis for scientific research
- main source of its evaluation (citations, h-index)
- often studied in biblio-/scientometrics literature
- different databases lead to different conclusions (e.g. degree distribution P(k))
- databases differ substantially between each other
- which bibliographic database is most reliable?
SLIDE 4 bibliographic databases
- scientific bibliographic databases
- hand-curated solutions — Web of Science, Scopus
- automatic services — Google Scholar, CiteSeer
- preprint repositories — arXiv, SocArXiv, bioRxiv
- field-specific libraries — PubMed, DBLP, APS
- national information systems — SICRIS
- and many others
SLIDE 5 comparisons of databases
- amount of literature covered — WoS ≈ Scopus
- timespan of literature covered — WoS > Scopus
- available features and use in scientific workflow
- data acquisition and maintenance methodology
- content and structure differ substantially
- only informal notions of reliability
SLIDE 6 reliability of databases
- content — (amount of) literature covered
- structure — accuracy of citation information
- networks of citations between scientific papers
- comparison of structure of citation networks
SLIDE 7 structure of citation networks
- local/global statistics of citation networks
- networks mostly consistent with few outliers
- outliers due to data acquisition in most cases
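A minimal sketch of the kind of local/global statistics compared, computed with networkx on a toy directed graph (the graph and the particular statistics shown are illustrative, not the talk's exact set):

```python
# Sketch: a few local/global statistics of a directed (citation-like) network.
import networkx as nx

G = nx.gnp_random_graph(500, 0.01, directed=True, seed=0)  # toy stand-in

print("mean degree:", 2 * G.number_of_edges() / G.number_of_nodes())
wcc = max(nx.weakly_connected_components(G), key=len)
print("% nodes in largest WCC:", 100 * len(wcc) / G.number_of_nodes())
print("reciprocity:", nx.reciprocity(G))  # near zero in real citation networks
print("mean clustering:", nx.average_clustering(G.to_undirected()))
```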
SLIDE 8 comparison of citation networks
- one can reason only about individual statistics
- comparison over multiple statistics problematic
- similar problem in machine learning community
- comparison of algorithms over multiple data sets
- compare mean ranks of algorithms over data sets
- Friedman rank test with Nemenyi post-hoc test
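A minimal sketch of this test pair in Python, assuming scipy and the scikit-posthocs package (the score matrix is a random stand-in):

```python
# Sketch: compare k databases over N statistics by mean ranks, following
# the machine-learning recipe of Friedman test + Nemenyi post-hoc test.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs

rng = np.random.default_rng(0)
scores = rng.random((10, 4))  # hypothetical: 10 statistics x 4 databases

# Friedman rank test: H0 = all databases have equal mean ranks.
chi2, p = stats.friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {chi2:.2f}, P = {p:.3f}")

# Nemenyi post-hoc test: pairwise P-values for mean-rank differences.
if p < 0.1:
    print(sp.posthoc_nemenyi_friedman(scores))
```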
SLIDE 9 methodology of comparison
- statistics residuals since “true network” not known
- database reliability seen as consistency with rest
- statistics — residuals — independence — ranks
1. Studentized statistics residuals x̂ij: two-tailed Student t-tests of H0: x̂ij = 0 at P-value = 0.1 (Student t-distribution with d.f. N − 2)
2. Pairwise Spearman correlations ρij: two-tailed Fisher independence z-tests of H0: ρij = 0 at P-value = 0.01 (standard normal distribution)
3. Residuals mean ranks Ri: one-tailed Friedman rank test of H0: Ri = Rj at P-value = 0.1 (χ²-distribution with d.f. N − 1)
4. Residuals mean ranks Ri: two-tailed Nemenyi post-hoc test of H0: Ri = Rj at P-value = 0.1 (Studentized range distribution)
(step 2 verifies ∀ρij: H0, i.e. independence of the statistics; step 4 runs only if the Friedman test rejects H0)
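A rough sketch of steps 1 and 2, under one plausible reading of the definitions above (leave-one-out studentization; variable names are mine):

```python
# Sketch: studentized residuals of each statistic (no "true network" exists,
# so each network is judged against the remaining ones), then pairwise
# Spearman tests to check the statistics are roughly independent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.random((6, 13))  # hypothetical: 6 networks x 13 statistics
n_nets, n_stats = X.shape

# Step 1: leave-one-out studentized residuals.
resid = np.empty_like(X)
for j in range(n_stats):
    for i in range(n_nets):
        rest = np.delete(X[:, j], i)
        resid[i, j] = (X[i, j] - rest.mean()) / rest.std(ddof=1)

# Step 2: pairwise Spearman independence tests at P-value = 0.01.
for j in range(n_stats):
    for l in range(j + 1, n_stats):
        rho, p = stats.spearmanr(resid[:, j], resid[:, l])
        if p < 0.01:
            print(f"statistics {j} and {l} correlate (rho = {rho:.2f})")
```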
SLIDE 10 comparison of citation networks
- statistics — residuals — independence — ranks
- most statistics derived from node distributions
[Figure: "Field bow-tie" decomposition of each citation network (component percentages)]
A WoS: 51.4% / 11.2% / 34.4% / 3.0%
B CiteSeer: 37.7% / 10.5% / 46.8% / 5.0%
C Cora: 51.4% / 8.5% / 40.1% / 0.0%
D HistCite: 52.2% / 44.8% / 1.6% / 1.3%
E DBLP: 16.9% / 74.5% / 7.8% / 0.8%
F arXiv: 74.7% / 6.7% / 18.1% / 0.4%
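The percentages above come from a bow-tie decomposition; a generic networkx recipe (not necessarily the exact definition used here) looks like this:

```python
# Sketch: bow-tie decomposition of a directed network around its largest
# strongly connected component (the "core").
import networkx as nx

G = nx.gnp_random_graph(200, 0.02, directed=True, seed=1)  # toy stand-in

core = max(nx.strongly_connected_components(G), key=len)
seed_node = next(iter(core))
out_part = nx.descendants(G, seed_node) - core  # reachable from the core
in_part = nx.ancestors(G, seed_node) - core     # can reach the core
other = set(G) - core - in_part - out_part      # tendrils, tubes, isolated

n = G.number_of_nodes()
for name, part in [("In", in_part), ("Core", core),
                   ("Out", out_part), ("Other", other)]:
    print(f"% {name}: {100 * len(part) / n:.1f}")
```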
SLIDE 11 comparison of citation networks
- mean ranks of citation networks over statistics
- connected networks are not significantly different
- hand-curated WoS > field-specific DBLP
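"Not significantly different" refers to the Nemenyi critical difference; in the standard form (Demšar, 2006), two networks differ significantly when their mean ranks differ by at least:

```latex
% Nemenyi critical difference for k networks compared over N statistics;
% q_alpha is the critical value of the Studentized range divided by sqrt(2).
\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}
```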
SLIDE 12 comparison with other networks
- comparison robust to selection of networks
- comparison with social networks meaningless
- comparison with other information networks
[Figure: mean ranks (1-6) of paper citation networks (A P→P) at P-value = 0.1: WoS, Cora, arXiv, APS, PubMed, DBLP]
SLIDE 13
- other bibliometric networks
- A paper citation information networks
- B author citation social-information networks
- C author collaboration social networks
[Figure: mean ranks (1-6) at P-value = 0.1 for each network type]
A P→P: WoS, Cora, arXiv, APS, PubMed, DBLP
B A↔A: Cora, arXiv, WoS, PubMed, DBLP, APS
C A−A: DBLP, WoS, Cora, APS, PubMed, arXiv
SLIDE 14 robustness of comparison
- results robust to selection of statistics — subgraphs
- results comparable with other techniques — MDS
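A rough sketch of the MDS cross-check, assuming scikit-learn (the residuals matrix is a random stand-in):

```python
# Sketch: project networks into 2-D (Y1, Y2) by multidimensional scaling
# on their statistics residuals; nearby points = similar citation topology.
import numpy as np
from sklearn.manifold import MDS

names = ["APS", "WoS", "DBLP", "PubMed", "Cora", "arXiv"]
rng = np.random.default_rng(0)
resid = rng.normal(size=(6, 13))  # hypothetical residuals matrix

xy = MDS(n_components=2, random_state=0).fit_transform(resid)
for name, (y1, y2) in zip(names, xy):
    print(f"{name}: Y1 = {y1:+.2f}, Y2 = {y2:+.2f}")
```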
[Figure: robustness checks over the P→P, A↔A and A−A networks of APS, WoS, DBLP, PubMed, Cora and arXiv: MDS projections (axes Y1, Y2, Y3) and statistics residuals from −6 to 6 over δ90, rb, d, r(out,out), r(out,in), r(in,out), r(in,in), γout, γin, k, % Out, % Core, % In (% WCC for A−A), plus subgraph selections G0-G8]
SLIDE 15 conclusions of comparison
- notable differences between databases
- there is no “best” bibliographic database
- most appropriate depends on type of analysis
- hand-curated databases perform well overall
- field-specific databases perform poorly
- recipes for future scientometrics studies
- methodology applicable to any network data
SLIDE 16 identification of research areas
- scientific journals classified in disciplines, fields
- research areas of scientific papers unknown
- clustering papers based on direct citation relations
- graph partitioning/community detection methods
- goal is clusters of topically related papers
- clusters recognizable, comprehensible, robust
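One representative clustering step, sketched with networkx's Louvain implementation (one of several method families compared in the talk; the graph is a toy stand-in for a citation network):

```python
# Sketch: cluster papers by direct citation relations via modularity-based
# community detection (Louvain); requires networkx >= 2.8.
import networkx as nx

G = nx.les_miserables_graph()  # toy stand-in for a citation network
clusters = nx.community.louvain_communities(G, seed=0)
sizes = sorted(map(len, clusters), reverse=True)
print(f"{len(clusters)} clusters, sizes: {sizes}")
```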
SLIDE 17
methods for clustering
SLIDE 18 classes of clustering methods
- distances between clusterings of methods
- smaller number of representative methods
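Distances between clusterings can be measured with standard label-comparison scores; a minimal sketch assuming scikit-learn (toy labels):

```python
# Sketch: distance between two clusterings of the same papers via
# normalized mutual information (1 = identical clusterings).
from sklearn.metrics import normalized_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]  # toy: cluster labels from method A
labels_b = [0, 1, 1, 2, 2, 2]  # toy: cluster labels from method B
print("distance:", 1 - normalized_mutual_info_score(labels_a, labels_b))
```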
SLIDE 19 statistical comparison
- size distributions, degeneracy diagrams etc.
- network analysis and bibliometric metrics
SLIDE 20 expert assessment tool
- hands-on assessment for scientometrics field
- CitNetExplorer for analyzing citation networks
SLIDE 21 hands-on expert assessment
- low resolution — one cluster for scientometrics
- high resolution — four clusters for h-index papers
- topic resolution — limited number of methods
SLIDE 22 conclusions of identification
- methods return substantially different clusterings
- no method performs satisfactorily by all criteria
- simple post-processing performs poorly
- map equation methods provide good trade-off
- entire science can be clustered in about one hour
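A minimal sketch of a map equation run, assuming the infomap Python package (pip install infomap; the API shown is from its 1.x releases and the links are toy):

```python
# Sketch: map equation clustering of a directed citation network with the
# Infomap algorithm; API details may differ across infomap versions.
from infomap import Infomap

im = Infomap("--directed --two-level")
for u, v in [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4)]:  # toy links
    im.add_link(u, v)
im.run()

print(f"{im.num_top_modules} clusters")
for node_id, module_id in im.modules:  # (node, cluster) pairs
    print(node_id, "->", module_id)
```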
SLIDE 23
references
Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)