SLIDE 1

reliability of bibliographic databases

for scientometrics network analysis

Lovro Šubelj

University of Ljubljana, Faculty of Computer and Information Science

ITIS '16

SLIDE 2

acknowledgements

Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)

SLIDE 3

study motivation

  • bibliographic databases basis for scientific research
  • main source of its evaluation (citations, h-index)
  • often studied in biblio-/scientometrics literature
  • different databases give different conclusions (P(k))
  • databases differ substantially between each other
  • which bibliographic database is most reliable?
SLIDE 4

bibliographic databases

  • scientific bibliographic databases
  • hand-curated solutions — Web of Science, Scopus
  • automatic services — Google Scholar, CiteSeer
  • preprint repositories — arXiv, socArXiv, bioRxiv
  • field-specific libraries — PubMed, DBLP, APS
  • national information systems — SICRIS
  • and many others
SLIDE 5

comparisons of databases

  • amount of literature covered — WoS ≈ Scopus
  • timespan of literature covered — WoS > Scopus
  • available features and use in scientific workflow
  • data acquisition and maintenance methodology
  • content and structure differ substantially
  • only informal notions on reliability
SLIDE 6

reliability of databases

  • content — (amount of) literature covered
  • structure — accuracy of citation information
  • networks of citations between scientific papers
  • comparison of structure of citation networks
SLIDE 7

structure of citation networks

  • local/global statistics of citation networks
  • networks mostly consistent with few outliers
  • outliers due to data acquisition in most cases
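
The local and global statistics compared here can be computed with standard tooling. A minimal sketch, assuming Python with networkx; the random graph and the edge-list file name are placeholders, and only a few of the slide's statistics are shown:

```python
import networkx as nx

# Stand-in for a real citation network; in practice you would read the
# edge list, e.g. nx.read_edgelist("citations.txt", create_using=nx.DiGraph).
G = nx.gnm_random_graph(1000, 4000, seed=0, directed=True)

n, m = G.number_of_nodes(), G.number_of_edges()
wcc = max(nx.weakly_connected_components(G), key=len)

print("nodes:", n, "edges:", m)
print("mean degree:", 2 * m / n)
print("% of nodes in largest WCC:", 100 * len(wcc) / n)
# Degree mixing between the citing (out-) and cited (in-) ends of links.
print("r(out,in):", nx.degree_assortativity_coefficient(G, x="out", y="in"))
```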
SLIDE 8

comparison of citation networks

  • one can reason only about individual statistics
  • comparison over multiple statistics problematic
  • similar problem in machine learning community
  • comparison of algorithms over multiple data sets
  • compare mean ranks of algorithms over data sets
  • Friedman rank test with Nemenyi post-hoc test
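
A compact sketch of this recipe (mean ranks over statistics, a Friedman rank test, and a Nemenyi post-hoc critical difference following the standard machine-learning formulation). The residuals matrix and the choice of scipy functions are illustrative assumptions, not the papers' exact implementation:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, studentized_range

# Hypothetical |statistics| x |databases| matrix of absolute residuals;
# rows are network statistics, columns are bibliographic databases.
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(size=(13, 6)))
databases = ["WoS", "CiteSeer", "Cora", "HistCite", "DBLP", "arXiv"]

# Rank the databases within every statistic (rank 1 = most consistent)
# and average the ranks over all statistics.
mean_ranks = rankdata(residuals, axis=1).mean(axis=0)

# Friedman rank test: do the databases differ at all?
chi2, p = friedmanchisquare(*residuals.T)
print(f"Friedman chi2 = {chi2:.2f}, p = {p:.3f}")

# Nemenyi post-hoc test: two databases differ significantly when their
# mean ranks differ by more than the critical difference CD.
N, k = residuals.shape
q = studentized_range.ppf(0.90, k, 1000) / np.sqrt(2)   # large d.f. ~ asymptotic value
cd = q * np.sqrt(k * (k + 1) / (6 * N))
for name, rank in sorted(zip(databases, mean_ranks), key=lambda t: t[1]):
    print(f"{name:>9}: mean rank {rank:.2f}  (CD = {cd:.2f})")
```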
SLIDE 9

methodology of comparison

  • statistics residuals since “true network” not known
  • database reliability seen as consistency with rest
  • statistics — residuals — independence — ranks

  1. Studentized statistics residuals x̂_ij: two-tailed Student t-tests of H0: x̂_ij = 0 at P-value = 0.1, Student t-distribution with d.f. N − 2
  2. Pairwise Spearman correlations ρ_ij: two-tailed Fisher independence z-tests of H0: ρ_ij = 0 at P-value = 0.01, standard normal distribution
  3. Residuals mean ranks R_i: one-tailed Friedman rank test of H0: R_i = R_j at P-value = 0.1, χ²-distribution with d.f. N − 1
  4. Residuals mean ranks R_i: two-tailed Nemenyi post-hoc test of H0: R_i = R_j at P-value = 0.1, studentized range distribution
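
A rough sketch of steps 1 and 2 in Python. The residual definition used here (deviation from the mean over the databases, scaled by the sample standard deviation) is an assumption for illustration and need not match the papers exactly; the scipy calls themselves are standard:

```python
import numpy as np
from scipy import stats

# Hypothetical matrix: rows = network statistics, columns = databases.
rng = np.random.default_rng(1)
x = rng.normal(size=(13, 6))
N = x.shape[1]

# Step 1: studentized residuals per statistic, tested against zero with a
# two-tailed Student t-test (d.f. N - 2) at P-value 0.1.
resid = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)
p_resid = 2 * stats.t.sf(np.abs(resid), df=N - 2)

# Step 2: pairwise Spearman correlations between statistics, tested for
# independence with a two-tailed Fisher z-test at P-value 0.01.
rho, _ = stats.spearmanr(x, axis=1)              # statistics as variables
iu = np.triu_indices_from(rho, k=1)              # each pair counted once
z = np.arctanh(rho[iu]) * np.sqrt(N - 3)
p_indep = 2 * stats.norm.sf(np.abs(z))

print("residuals significant at 0.1:", int((p_resid < 0.1).sum()))
print("correlated statistic pairs at 0.01:", int((p_indep < 0.01).sum()))
```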

SLIDE 10

comparison of citation networks

  • statistics — residuals — independence — ranks
  • most statistics derived from node distributions
Field bow-tie component shares per network (column labels as in the original figure):

      network      component shares
  A   WoS          51.4%   11.2%   34.4%   3.0%
  B   CiteSeer     37.7%   10.5%   46.8%   5.0%
  C   Cora         51.4%    8.5%   40.1%   0.0%
  D   HistCite     52.2%   44.8%    1.6%   1.3%
  E   DBLP         16.9%   74.5%    7.8%   0.8%
  F   arXiv        74.7%    6.7%   18.1%   0.4%
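
Bow-tie shares like the ones above can be computed from a directed network roughly as follows. A sketch with networkx; the random graph stands in for an actual citation network, and the paper's exact component definitions may differ:

```python
import networkx as nx

# Stand-in for a real directed citation network.
G = nx.gnp_random_graph(1000, 0.004, seed=0, directed=True)

# Core: largest strongly connected component.
core = max(nx.strongly_connected_components(G), key=len)
seed_node = next(iter(core))

# In: nodes that can reach the core; Out: nodes reachable from the core.
part_in = nx.ancestors(G, seed_node) - core
part_out = nx.descendants(G, seed_node) - core
rest = set(G) - core - part_in - part_out        # tendrils, tubes, disconnected

n = G.number_of_nodes()
for name, part in [("In", part_in), ("Core", core), ("Out", part_out), ("rest", rest)]:
    print(f"{name:>5}: {100 * len(part) / n:5.1f}%")
```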

SLIDE 11

comparison of citation networks

  • mean ranks of citation networks over statistics
  • connected networks are not significantly different
  • hand-curated WoS > field-specific DBLP
SLIDE 12

comparison with other networks

  • comparison robust to selection of networks
  • comparison with social networks meaningless
  • comparison with other information networks

[Figure: mean-rank diagram at P-value = 0.1, ranks 1 to 6, panel A (paper citation networks, P→P): WoS, Cora, arXiv, APS, PubMed, DBLP]

SLIDE 13
other bibliometric networks

  • A: paper citation information networks (P→P)
  • B: author citation social-information networks (A↔A)
  • C: author collaboration social networks (A−A)

[Figure: mean-rank diagrams at P-value = 0.1, ranks 1 to 6
 A (P→P): WoS, Cora, arXiv, APS, PubMed, DBLP
 B (A↔A): Cora, arXiv, WoS, PubMed, DBLP, APS
 C (A−A): DBLP, WoS, Cora, APS, PubMed, arXiv]

SLIDE 14

robustness of comparison

  • results robust to selection of statistics — subgraphs
  • results comparable with other techniques — MDS

[Figure: MDS projections (axes Y1, Y2, Y3) and statistics residuals (δ90, rb, d, r(out,out), r(out,in), r(in,out), r(in,in), γout, γin, k, % Out, % Core, % In, % WCC) for APS, WoS, DBLP, PubMed, Cora and arXiv, over network types P→P, A↔A, A−A and subgraph selections G0 to G8]
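
The MDS check mentioned above can be reproduced in spirit with scikit-learn. A sketch under assumptions: the statistics matrix is made up, and the paper's exact preprocessing is not reproduced here:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

networks = ["APS", "WoS", "DBLP", "PubMed", "Cora", "arXiv"]
# Hypothetical |networks| x |statistics| matrix of network statistics.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 13))

# Standardize the statistics, then embed the networks in two dimensions so
# that pairwise distances reflect dissimilarity of their statistics.
coords = MDS(n_components=2, random_state=0).fit_transform(StandardScaler().fit_transform(X))
for name, (y1, y2) in zip(networks, coords):
    print(f"{name:>7}: Y1 = {y1:+.2f}, Y2 = {y2:+.2f}")
```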

SLIDE 15

conclusions of comparison

  • notable differences between databases
  • there is no “best” bibliographic database
  • most appropriate depends on type of analysis
  • hand-curated databases perform well overall
  • field-specific databases perform poorly
  • recipes for future scientometrics studies
  • methodology applicable to any network data
SLIDE 16

identification of research areas

  • scientific journals classified in disciplines, fields
  • research areas of scientific papers unknown
  • clustering papers based on direct citation relations
  • graph partitioning/community detection methods
  • goal is clusters of topically related papers
  • clusters recognizable, comprehensible, robust
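
A minimal sketch of clustering papers by direct citation relations, with networkx's Louvain implementation (networkx 2.8 or later) standing in for the method families compared in the study; the random graph is a placeholder for a real citation network:

```python
import networkx as nx

# Stand-in for a real citation network; in practice read the edge list,
# e.g. nx.read_edgelist("citations.txt", create_using=nx.DiGraph).
G = nx.gnm_random_graph(2000, 8000, seed=0, directed=True)

# Cluster on the undirected direct-citation relations.
clusters = nx.community.louvain_communities(G.to_undirected(), seed=0)
clusters.sort(key=len, reverse=True)
print(len(clusters), "clusters; largest:", [len(c) for c in clusters[:5]])
```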
SLIDE 17

methods for clustering

SLIDE 18

classes of clustering methods

  • distances between clusterings of methods
  • smaller number of representative methods
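
One common way to measure distances between clusterings is via standard partition-comparison scores. A sketch with scikit-learn; NMI and ARI are stand-ins for whichever distance the study actually used, and the label vectors are made up:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical cluster labels assigned to the same papers by two methods.
labels_a = np.array([0, 0, 1, 1, 2, 2, 2, 3])
labels_b = np.array([1, 1, 1, 0, 0, 2, 2, 2])

# Both scores are similarities; 1 - score can serve as a distance.
print("NMI:", normalized_mutual_info_score(labels_a, labels_b))
print("ARI:", adjusted_rand_score(labels_a, labels_b))
```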
SLIDE 19

statistical comparison

  • size distributions, degeneracy diagrams etc.
  • network analysis and bibliometric metrics
SLIDE 20

expert assessment tool

  • hands-on assessment for scientometrics field
  • CitNetExplorer for analyzing citation networks
SLIDE 21

hands-on expert assessment

  • low resolution — one cluster for scientometrics
  • high resolution — four clusters for h-index papers
  • topic resolution — limited number of methods
SLIDE 22

conclusions of identification

  • methods return substantially different clusterings
  • no method performs satisfactorily by all criteria
  • simple post-processing performs poorly
  • map equation methods provide good trade-off
  • entire science can be clustered in about one hour
SLIDE 23

references

Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)