Publications, Identity, and Disambiguation

NIH Workshop on Identifiers and Disambiguation in Scholarly Work

Denise Beaubien Bennett
Gainesville, FL
March 18, 2010

"Until George W. Bush became President, the first President Bush never used his middle initials,"

George H.W. Bush's chief of staff, Jean Becker, says. "But once his son became President, the elder Bush began to realize that it was necessary, to help identify which President Bush was being referred to."

  • How confident are we that all mentions of plain “George Bush” refer to Senior?
  • Remember that George H.W. Bush had several roles: CIA Director, Ambassador to China, Vice President

Automated disambiguation

  • Scopus
  • Web of Science
  • CiteSeer
  • DBLP author search engine – query interpreted as set of prefixes (implicit truncation) of name parts
  • Author-ity
  • improving recall and precision over time!
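
The DBLP bullet above describes a query read as a set of prefixes of name parts. A minimal sketch of that idea, assuming a simple rule where every query token must be a prefix of some part of the name; the function `prefix_match` and the sample names are illustrative only and are not DBLP's actual code:

```python
def prefix_match(query: str, name: str) -> bool:
    """Return True if every query token is a prefix of some name part.

    Assumption: matching is case-insensitive and ignores periods and commas.
    """
    tokens = set(query.lower().split())
    parts = name.lower().replace(".", " ").replace(",", " ").split()
    return all(any(part.startswith(tok) for part in parts) for tok in tokens)


# Made-up sample index of author names, for illustration only.
authors = [
    "Denise Beaubien Bennett",
    "Dennis Bennett",
    "Vetle I. Torvik",
    "Neil R. Smalheiser",
]

# "den ben" is truncated implicitly: it matches Denise ... Bennett and Dennis Bennett.
hits = [a for a in authors if prefix_match("den ben", a)]
print(hits)
```

Implicit truncation raises recall (short queries match many name variants) at the cost of precision, which is why the results still need disambiguation.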

Scopus – snapshot from 2007

2007 – one solid cluster, 6 ambiguous outliers

Scopus in 2010: improving

2010 - one solid cluster, 3 ambiguous outlier names

Web of Science

  • Their example shows incompleteness of disambiguation; continue using all variations

with and without apostrophe

4 WoS Distinct Author Sets – clustering is improving

DIY disambiguation

Web of Science

CiteSeer – disambiguated (but not perfect)

unclustered items are mostly typos

alternate name resolves to preferred name

Author-ity clusters

Author-ity pairwise ranking

Author-ity ranking results

Author-ity ranking – the bottom

super-high probability through 130. less than 50% with title far off topic

Voluntary Profiles

Author (or proxy) created and maintained

  • Compliance challenges with ingestion and updating

  • Usually include numbers
  • COS Expertise - 480,000 profiles
  • ResearcherID (to be used by ORCID)
  • RePEc Author Service in IDEAS

COS Community of Science

useful tools 18 months ago

ResearcherID

author-controlled profile

ResearcherID - features

value added from WoS – only works on cites in WoS

ResearcherID dups

keywords helpful when present

RePEc Author Service

  • Relies on authors to maintain their profiles and identify articles as written by them
  • 23,000+ registered authors and 7000+ registered non-authors

from 2007: dups & funnies

disambiguated index is much cleaner in 2010

they track lost and deceased authors

In development

  • Cooperative Identities Hub
  • ISNI
  • ORCID

Manual checking

  • no guarantee of perfection
  • scalability
  • MathSciNet
  • Mathematics Genealogy Project
  • ACM

MathSciNet clusters all papers but preserves name on piece

However…

  • Even the small, discipline-specific database of MathSciNet cannot corral all the duplicate names.
    – only half of the entries disambiguated for:
        • Zhang, Lei
        • Zhang, Li
  • Red herring: how many people only author one paper in their career???
    – about 46% in Medline (sec. 3.5)

Many people, same name

MGP -

ACM – discloses the weighting

ACM Digital Library – not quite yet

After we disambiguate, we can:

  • Link / cluster records within the silo

– highlighting the preferred version

  • Link headings (or records) across silos
  • Analyze / repackage / mashup the data
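
Linking records within a silo and highlighting a preferred version can be sketched as follows. This is an illustration under stated assumptions: the pairs judged "same author" are taken as given, clusters are formed with a small union-find, and the preferred form is simply the most frequent variant (ties broken by length). The function `cluster_records` and the sample names are hypothetical, not any database's actual method.

```python
from collections import Counter, defaultdict


def cluster_records(names, same_pairs):
    """Group record indices into clusters and pick a preferred name per cluster."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in same_pairs:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(names)):
        groups[find(i)].append(i)

    clusters = []
    for members in groups.values():
        variants = Counter(names[i] for i in members)
        # Assumed rule: most frequent variant wins; longer form breaks ties.
        preferred = max(variants, key=lambda v: (variants[v], len(v)))
        clusters.append({"preferred": preferred, "records": members})
    return clusters


# Made-up records: the name as printed on each piece.
names = ["Bennett, D.", "Bennett, Denise B.", "Bennett, Denise B.",
         "Bennett, D. B.", "Zhang, Li"]
same_pairs = [(0, 1), (1, 2), (2, 3)]  # pairs judged to be the same author
clusters = cluster_records(names, same_pairs)
print(clusters)
```

Each record keeps its name as printed; only the cluster carries the preferred heading, which mirrors the "clusters all papers but preserves name on piece" behavior noted for MathSciNet.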

Linking within a silo

  • more examples -- inspiration from outside the university/research world

Linking in Community-maintained IMDB

  • others born the same day or year or place

links to people, films, etc. credit!

Community-maintained - MusicBrainz

members & years

Community-maintained - MusicBrainz

please – no “eyes” no “pears” no hyphen

Linking across silos

  • VIAF – Virtual International Authority File
  • Getty ULAN – Union List of Artist Names
  • Names Project - UK individuals and institutions – for benefit of institutional and subject repositories
  • BKN People – using Bibliographic Ontology (BIBO) to aggregate author silos

  • rely on local silos for maintenance

VIAF – linking across files

authority record in BNF (France) matches these other files

Getty Union List of Artist Names

  • ULAN
  • Used mostly by museums
  • Merges multiple authority files
  • Displays all options and sources
  • Guides to preferred name

name variations preferred among options

relationships sources

Names project (UK)

Names Project (UK)

BKN People: uses BIBO

BKN People: uses BIBO

Analyzing / repackaging the data

– discover outliers through analysis

  • what’s wrong with this picture?

– run the outliers by human checkers
– use the analyzed results to refine the disambiguation

WorldCat Identities

more than birth/death dates the fun stuff

Anne O’Tate (Author-ity) analyze by address

note the fractions of addresses

Anne O’Tate (Author-ity) analyze by topic

neat clustering, compared to “Topics” with 324 results

analyze – author’s impact within silo

IDEAS / RePEc

MathSciNet collaboration distance

the Kevin Bacon of Math

How close are these authors?
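
Collaboration distance is the length of the shortest chain of coauthorships between two authors. A minimal sketch, assuming each paper is given as a list of already-disambiguated author identifiers; the function `collaboration_distance` and the toy data are hypothetical and do not reflect how MathSciNet computes it:

```python
from collections import deque


def collaboration_distance(papers, start, goal):
    """Shortest coauthorship path length between two authors, or None if unlinked."""
    # Build the coauthor graph: authors are nodes, shared papers create edges.
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set()).update(b for b in authors if b != a)

    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0

    # Breadth-first search visits authors in order of increasing distance.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        for coauthor in graph[author]:
            if coauthor == goal:
                return dist + 1
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, dist + 1))
    return None


# Toy data: each inner list is the author list of one paper.
papers = [["A", "B"], ["B", "C"], ["C", "D"], ["E"]]
print(collaboration_distance(papers, "A", "D"))
print(collaboration_distance(papers, "A", "E"))
```

The result is only as good as the disambiguation underneath it: merging two people who share a name creates false shortcuts, and splitting one person breaks real paths.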

DBLP Vis – coauthor intensity

see # papers with coauthor when mouse-over a year

DBLP Vis – coauthor timecolor

see fatter boxes on graph when mouse-over a year

Features to help disambiguate

  • affiliation (how many addresses/year?)
  • email address
  • coauthors
  • keywords from source or all metadata
  • dates - degree years, expected range
  • web page – URL and other data
  • caution - what fuzziness/distance is acceptable? differences by disciplines?
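
The features listed above can be combined into a pairwise "same author?" score. The sketch below is one possible illustration: the `Record` fields, the weights, and the 40-year date cutoff are all assumptions chosen for the example, not values from any of the systems discussed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    name: str
    affiliation: str = ""
    email: str = ""
    coauthors: frozenset = frozenset()
    keywords: frozenset = frozenset()
    year: Optional[int] = None


# Hypothetical weights; a real system would tune or learn these per discipline.
WEIGHTS = {"email": 0.4, "affiliation": 0.2, "coauthors": 0.25, "keywords": 0.15}


def jaccard(a, b):
    """Overlap of two sets as a fraction of their union (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author_score(r1, r2, max_year_gap=40):
    """Score in [0, 1]; higher means more evidence the records share an author."""
    # Dates outside the expected career range rule the pair out entirely.
    if r1.year is not None and r2.year is not None:
        if abs(r1.year - r2.year) > max_year_gap:
            return 0.0
    score = 0.0
    if r1.email and r1.email.lower() == r2.email.lower():
        score += WEIGHTS["email"]
    if r1.affiliation and r1.affiliation.lower() == r2.affiliation.lower():
        score += WEIGHTS["affiliation"]
    score += WEIGHTS["coauthors"] * jaccard(r1.coauthors, r2.coauthors)
    score += WEIGHTS["keywords"] * jaccard(r1.keywords, r2.keywords)
    return score


# Made-up records sharing the same ambiguous name.
a = Record("Zhang, L.", "Univ A", "lz@example.org",
           frozenset({"Wang, H"}), frozenset({"graph theory"}), 2005)
b = Record("Zhang, L.", "Univ A", "lz@example.org",
           frozenset({"Wang, H", "Chen, Y"}),
           frozenset({"graph theory", "combinatorics"}), 2008)
c = Record("Zhang, L.", "Univ B", "li.zhang@example.com",
           frozenset(), frozenset({"oncology"}), 1950)

score_ab = same_author_score(a, b)
score_ac = same_author_score(a, c)
print(score_ab, score_ac)
```

Keyword overlap gets the lowest weight here on purpose: one author can have many interests, so a topic mismatch alone is weak evidence of two different people.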

Use with care: one author, many interests

For contemplation and discussion

Assigning numbers

  • Centralized numbering system – governance issues, unpalatable to some
  • Individual small silo numbering – can be highly accurate
  • Record linking across files – easily accomplished
  • Getting started -- authors could include number(s) with all contact info

Trustworthiness

  • Am I in control of all of my publications?
  • If I’m logged in (to ResearcherID, via my university account, etc.) and I indicate “these items are mine,” should you trust my accuracy?
  • Have I captured all of my items?
    – variants on my name
    – items I forgot
    – items credited without my awareness

Issues to explore

  • Ingestion vs. maintenance
    – very different problems
    – author compliance needed?

  • De-duplication (within and across silos)
  • Management and cooperation for updating
  • Scalability
  • Automated vs. manual techniques
  • Optimizing computational performance
  • Long tail of one-hit authors (how much attention?)

Researchers, projects, products, models

  • Great review (by the Author-ity folks)

Smalheiser NR, Torvik VI. (2009) Author name disambiguation.

Databases and those who created or tinkered with them

  • MathSciNet
  • ULAN
  • DBLP - Han
  • CiteSeer – Giles, Han
  • IMDB – Malin
  • ANAC – Levy sheet music
  • Medline – Torvik and Smalheiser
  • D-Dupe - Getoor
  • rexa.info – McCallum
  • VIAF - Hickey