

  1. reliability of bibliographic databases for scientometrics network analysis Lovro Šubelj University of Ljubljana, Faculty of Computer and Information Science ITIS ‘16

  2. acknowledgements Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014) Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015) Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)

  3. study motivation • bibliographic databases are the basis for scientific research • main source for its evaluation (citations, h-index) • often studied in the bibliometrics / scientometrics literature • different databases lead to different conclusions (e.g. degree distribution P(k)) • databases differ substantially from each other • which bibliographic database is most reliable?

  4. bibliographic databases • scientific bibliographic databases • hand-curated solutions — Web of Science, Scopus • automatic services — Google Scholar, CiteSeer • preprint repositories — arXiv, socArXiv, bioRxiv • field-specific libraries — PubMed, DBLP, APS • national information systems — SICRIS • and many others

  5. comparisons of databases • amount of literature covered — WoS ≈ Scopus • timespan of literature covered — WoS > Scopus • available features and use in scientific workflows • data acquisition and maintenance methodology • content and structure differ substantially • so far only informal notions of reliability

  6. reliability of databases • content — (amount of) literature covered • structure — accuracy of citation information • networks of citations between scientific papers • comparison of structure of citation networks

  7. structure of citation networks • local / global statistics of citation networks • networks mostly consistent, with a few outliers • outliers due to data acquisition in most cases
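
  As a minimal sketch of one such local statistic, the in-degree distribution P(k) mentioned earlier can be computed with networkx; the toy random graph below stands in for a real citation network (an assumption, not the study's data).

  ```python
  import networkx as nx
  import numpy as np

  # Toy directed graph standing in for a citation network (assumed example).
  G = nx.gnp_random_graph(1000, 0.01, directed=True, seed=0)

  # Empirical in-degree distribution P(k).
  degrees = np.array([d for _, d in G.in_degree()])
  k, counts = np.unique(degrees, return_counts=True)
  P_k = counts / counts.sum()
  print(dict(zip(k.tolist(), np.round(P_k, 4).tolist())))
  ```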

  8. comparison of citation networks • one can reason only about individual statistics • comparison over multiple statistics problematic • similar problem in machine learning community • comparison of algorithms over multiple data sets • compare mean ranks of algorithms over data sets • Friedman rank test with Nemenyi post-hoc test
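
  Below is a minimal sketch of this rank-based comparison using scipy; the residual matrix is random example data and the tabulated Studentized range value q_alpha is the standard Nemenyi critical value for six groups at α = 0.1 (both are assumptions, not values from the study).

  ```python
  import numpy as np
  from scipy import stats

  # Assumed example data: |residuals| of 20 statistics (rows) on 6 databases (columns).
  rng = np.random.default_rng(0)
  residuals = np.abs(rng.normal(size=(20, 6)))
  n, k = residuals.shape

  # Friedman rank test: do the databases' mean ranks differ significantly?
  chi2, p = stats.friedmanchisquare(*residuals.T)
  print(f"Friedman chi2 = {chi2:.2f}, p = {p:.3f}")

  # Mean rank of each database over all statistics (rank 1 = smallest residual).
  mean_ranks = stats.rankdata(residuals, axis=1).mean(axis=0)

  # Nemenyi post-hoc test: two databases differ significantly when their
  # mean ranks differ by more than the critical difference CD.
  q_alpha = 2.589  # tabulated value for k = 6 groups at alpha = 0.1 (assumed)
  cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
  print(f"mean ranks = {np.round(mean_ranks, 2)}, CD = {cd:.2f}")
  ```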

  9. methodology of comparison • statistics residuals since “true network” not known • database reliability seen as consistency with the rest • pipeline: statistics — residuals — independence — ranks
  1. studentized statistics residuals x̂_ij: two-tailed Student t-tests, H0: x̂_ij = 0 at P-value = 0.1 (Student t-distribution with d.f. N − 2); proceed only if H0 holds for all x̂_ij
  2. pairwise Spearman correlations ρ_ij: two-tailed Fisher independence z-tests, H0: ρ_ij = 0 at P-value = 0.01 (standard normal distribution); proceed only if H0 holds for all ρ_ij
  3. residuals mean ranks R_i: one-tailed Friedman rank test, H0: R_i = R_j at P-value = 0.1 (χ²-distribution with d.f. N − 1)
  4. residuals mean ranks R_i: two-tailed Nemenyi post-hoc test, H0: R_i = R_j at P-value = 0.1 (Studentized range distribution)
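
  As a sketch of step 1, one plausible reading of the studentized residuals (the exact formula is defined in the cited papers; this form is an assumption) subtracts, for each statistic, the mean over the N networks and scales by the sample standard deviation.

  ```python
  import numpy as np

  # Assumed example data: statistic i (rows) measured on network j (columns).
  X = np.array([[0.9, 1.1, 1.0, 2.5],
                [3.2, 3.0, 3.1, 3.3]])

  # Assumed studentization: deviation from the mean over networks,
  # scaled by the sample standard deviation of that statistic.
  x_hat = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
  print(np.round(x_hat, 2))  # large |x_hat| flags an outlying network
  ```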

  10. comparison of citation networks • statistics — residuals — independence — ranks • most statistics derived from node distributions
  [figure: field bow-tie decomposition of each network, component percentages]
  A WoS: 11.2%, 51.4%, 34.4%, 3.0%
  B CiteSeer: 10.5%, 37.7%, 46.8%, 5.0%
  C Cora: 8.5%, 51.4%, 40.1%, 0.0%
  D HistCite: 44.8%, 52.2%, 1.6%, 1.3%
  E DBLP: 74.5%, 16.9%, 7.8%, 0.8%
  F arXiv: 6.7%, 74.7%, 18.1%, 0.4%
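
  Below is a minimal sketch of a bow-tie decomposition with networkx, built around the largest strongly connected component; the toy graph is an assumption, not one of the studied networks, and the in/core/out/other naming follows the standard convention rather than the slide's unlabeled percentages.

  ```python
  import networkx as nx

  def bow_tie_percentages(G: nx.DiGraph) -> dict:
      """Bow-tie decomposition around the largest strongly connected
      component (core); returns each component as a percentage of all nodes."""
      core = max(nx.strongly_connected_components(G), key=len)
      seed = next(iter(core))
      in_part = nx.ancestors(G, seed) - core     # nodes that can reach the core
      out_part = nx.descendants(G, seed) - core  # nodes reachable from the core
      other = set(G) - core - in_part - out_part
      n = G.number_of_nodes()
      return {name: round(100 * len(part) / n, 1)
              for name, part in [("in", in_part), ("core", core),
                                 ("out", out_part), ("other", other)]}

  # Toy example (assumed data, not one of the studied networks).
  G = nx.gnp_random_graph(500, 0.01, directed=True, seed=1)
  print(bow_tie_percentages(G))
  ```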

  11. comparison of citation networks • mean ranks of citation networks over statistics • connected networks are not significantly different • hand-curated WoS > field-specific DBLP

  12. comparison with other networks • comparison robust to selection of networks • comparison with social networks meaningless • comparison with other information networks
  [figure: mean ranks 1–6 of the WoS, DBLP, Cora, PubMed, arXiv & APS paper citation networks (P → P) at P-value = 0.1]

  13. other bibliometric networks • A — paper citation information networks (P → P) • B — author citation social-information networks (A ↔ A) • C — author collaboration social networks (A − A)
  [figure: mean ranks 1–6 of WoS, DBLP, Cora, PubMed, arXiv & APS for the A, B and C network types at P-value = 0.1]

  14. robustness of comparison • results robust to selection of statistics — subgraphs G0, G1, …, G8 • results comparable with other techniques — MDS
  [figure: statistics residuals (% In, % Core, % Out, % WCC, k, γ_in, γ_out, degree correlations r(in,in), r(in,out), r(out,in), r(out,out), distances d, δ90, b) and 3-D MDS projections (Y1, Y2, Y3) of the P → P, A ↔ A and A − A networks]
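
  A minimal sketch of the MDS technique mentioned above, using scikit-learn; the residual matrix is random example data (an assumption, not the study's residuals).

  ```python
  import numpy as np
  from sklearn.manifold import MDS

  # Assumed example data: 6 networks (rows) by 20 statistics residuals (columns).
  rng = np.random.default_rng(0)
  residuals = rng.normal(size=(6, 20))

  # Project the networks into 3-D so that pairwise distances between
  # their residual profiles are approximately preserved.
  coords = MDS(n_components=3, random_state=0).fit_transform(residuals)
  print(coords.shape)  # (6, 3)
  ```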

  15. conclusions of comparison • notable differences between databases • there is no “best” bibliographic database • most appropriate depends on type of analysis • hand-curated databases perform well overall • field-specific databases perform poorly • recipes for future scientometrics studies • methodology applicable to any network data

  16. identification of research areas • scientific journals classified into disciplines, fields • research areas of scientific papers unknown • clustering papers based on direct citation relations • graph partitioning / community detection methods • goal is clusters of topically related papers • clusters should be recognizable, comprehensible, robust

  17. methods for clustering

  18. classes of clustering methods • pairwise distances between the clusterings produced by the methods • used to select a smaller number of representative methods
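
  A minimal sketch of one common distance between clusterings, 1 − NMI, using scikit-learn; the choice of measure and the toy labelings are assumptions, and the study may well use a different distance.

  ```python
  from sklearn.metrics import normalized_mutual_info_score

  # Assumed example data: cluster labels from two methods over the same 8 papers.
  labels_a = [0, 0, 1, 1, 1, 2, 2, 2]
  labels_b = [1, 1, 1, 0, 0, 2, 2, 2]

  # Distance between clusterings: 0 when identical, 1 when independent.
  distance = 1 - normalized_mutual_info_score(labels_a, labels_b)
  print(round(distance, 3))
  ```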

  19. statistical comparison • size distributions , degeneracy diagrams etc. • network analysis and bibliometric metrics
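
  As a small sketch of the first item, a cluster size distribution can be read directly off a node-to-cluster assignment (the assignment below is a hypothetical example).

  ```python
  from collections import Counter

  # Hypothetical paper -> cluster assignment.
  clusters = {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 1, "p6": 2}

  # Cluster size distribution: how many clusters have each size.
  sizes = Counter(clusters.values())   # cluster -> size
  size_dist = Counter(sizes.values())  # size -> number of clusters
  print(dict(size_dist))  # {2: 1, 3: 1, 1: 1}
  ```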

  20. expert assessment tool • hands-on assessment for scientometrics field • CitNetExplorer for analyzing citation networks

  21. hands-on expert assessment • low resolution — one cluster for scientometrics • high resolution — four clusters for h -index papers • topic resolution — limited number of methods

  22. conclusions of identification • methods return substantially different clusterings • no method performs satisfactorily by all criteria • simple post-processing performs poorly • map equation methods provide a good trade-off • entire science can be clustered in about one hour
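
  A minimal sketch of map equation clustering with the infomap Python package (v1 API assumed); the toy edge list is example data, not the study's citation network.

  ```python
  from infomap import Infomap  # pip install infomap

  # Map equation clustering of a toy directed citation network (assumed data).
  im = Infomap(directed=True, silent=True)
  for citing, cited in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
      im.add_link(citing, cited)
  im.run()
  print(im.get_modules())  # {node_id: module_id}
  ```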

  23. references
  Lovro Šubelj, Dalibor Fiala & Marko Bajec, Scientific Reports 4, 6496 (2014)
  Lovro Šubelj, Marko Bajec, Biljana M. Boshkoska, Andrej Kastrin & Zoran Levnajić, PLoS ONE 10(5), e0127390 (2015)
  Lovro Šubelj, Nees Jan van Eck & Ludo Waltman, PLoS ONE 11(4), e0154404 (2016)
