Benefits and Challenges: CESSE Annual Meeting, July 18, 2013



SLIDE 1

Author / Researcher Databases – Benefits and Challenges

CESSE Annual Meeting, July 18, 2013

Quatrro Confidential

www.Quatrro.com

SLIDE 2

Trends in STM Research Publishing

  • Exponential growth of scholarly output
  • Evolution of social networks and topical communities
  • Authors seeking more visibility and recognition for their contributions
  • Evolving user expectations of online content (functional efficiency, accuracy)
  • Increasing emphasis on data mining, analysis, and integration
  • Governments, institutions, and funding agencies evaluating their “investments” (faculty, departments, grants, collaborations) for productivity and ROI
  • Increased interest in the “Who” of STM research: the producers of the research, not just the research itself

SLIDE 3

Benefits of Clean, Aggregated Author Data

  • More efficient, enhanced editorial workflow (peer review)

– Simpler, faster, higher-quality review process (identify “the best” reviewers)

  • Improved online performance and search results

– Enhanced discovery and more accurate retrieval of author information and content

  • More robust and accurate bibliometric analyses

– Research productivity of institutions, departments, individuals
– Indicators like citations, downloads, articles published, patents
– Supports decisions like funding, promotion, and reappointment
– Better assessment of the impact of money invested in research

  • Increased exposure to and support of the author

– Visibility, tracking, collaborating

  • Support for the broader community

– Analysis, networking, productivity, efficiency

SLIDE 4

ResearcherID – Thomson Reuters

SLIDE 5

ResearcherID – Thomson Reuters

SLIDE 6

Elsevier Scopus ID and Author Profile

Author Profile Page

SLIDE 7

ORCID

  • Its prime aim is to improve the overall research ecosystem by creating unique identifiers for researchers and scholars that link to other references such as publications, grants, and patents.

SLIDE 8

Society/Association Initiatives

SLIDE 9

ACM Author-Izer

SLIDE 10

IEEE Xplore Author Search

  • Author profile user interface before the end of the year.
  • Authors will be asked to QC the data.

SLIDE 11

AIP Publishing

  • One of the world’s largest physical science publishers
  • Overview:

– 5.5 million potential author names
– 6,000 authors with surname “Wang”
– 800,000 articles back to the early 20th century
– Subject areas and keywords

  • Outcome:

– 980,000 academic authors
– 33,000 institutions
– Database of publishing physicists complete with a record of affiliations, areas of expertise, papers published, and co-authors

  • Next step:

– Gather feedback from users and explore additional refinements

SLIDE 12

Why Create an Author Database?

  • Support for authors and researchers

– Create individual author profiles and provide new value-added services
– Enhance the author experience with your publications (service)

  • Support for the specialty / domain which the society serves

– Having an accurate author record of your publications is important
– Enhance interconnectivity and networking of a specific publishing community

  • Societies also need and want to respond to market needs, trends, and expectations
  • Important, valuable information they want to own, maintain, and develop proactively

– Complementary to similar, broader initiatives (ORCID, etc.)

  • Societies believe it is a service their members and community want from them
  • ACM: “…emphasizing its continuing commitment to the interests of its authors and to the computing community.”

SLIDE 13

The Bigger Association/Society Picture

[Diagram labels: Author; Member; Editor, Reviewer; Subscriber; Donor; Marketer; Meeting Attendee; Committee Member]

SLIDE 14

Practical Considerations

SLIDE 15

The Grunt Work

  • Extracting, cleansing and disambiguating the author data is an arduous but essential process: garbage in, garbage out.

– Automated tools using an algorithm and scoring mechanism can be used (for example, to discern whether records for John Smith and J L Smith are likely to be the same person).
– Fully automated solutions are prone to problems (data glitches and missing information result in mapping errors).
– Expert human intervention is required to achieve a desirable level of quality:
    – At the front end, to analyze the data and establish the rule set for the automation;
    – In the processing phase, to ensure data is validated and standardized;
    – During disambiguation, for “hands on analysis and processing” when necessary.

  • Find a partner with sophisticated data cleansing and disambiguation capabilities and experience to help with analysis, strategy and execution.
  • Once completed, profiles including papers authored, affiliations and other info can be created in a largely automated fashion, using existing bibliographic metadata from the publisher and in the “public domain” (e.g., CrossRef).
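The "algorithm and scoring mechanism" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the presenter's tool: the function names, the record keys (`name`, `email`, `org`), and the weights 0.4 / 0.4 / 0.2 are all hypothetical, and names are assumed to be non-empty and written in Western order.

```python
def name_parts(full_name):
    """Lowercase a name, drop periods, and split it into tokens."""
    return full_name.lower().replace(".", " ").split()


def given_names_compatible(a, b):
    """True when paired given-name tokens agree; an initial matches any
    name that starts with the same letter."""
    for x, y in zip(a, b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


def match_score(rec_a, rec_b):
    """Return a score in [0, 1]; the weights are illustrative assumptions."""
    *given_a, surname_a = name_parts(rec_a["name"])
    *given_b, surname_b = name_parts(rec_b["name"])
    if surname_a != surname_b or not given_names_compatible(given_a, given_b):
        return 0.0
    score = 0.4  # compatible names alone are weak evidence
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        score += 0.4
    if rec_a.get("org") and rec_a.get("org") == rec_b.get("org"):
        score += 0.2
    return round(score, 2)
```

In a workflow like the one described, pairs scoring above a chosen threshold might be merged automatically, while mid-range scores would be routed to a human reviewer.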

SLIDE 16

Sourcing and Extracting Author Data

  • Multiple input formats: PDF, TIFF, XML and HTML (OCR needed?)
  • Inconsistent representation of author data in documents
  • Author data represented in unstructured format

[Figure callouts: Name, Affiliation]

SLIDE 17

Issues with Names

  • Same authors with multiple name variants

– Journals use different naming styles

  • Name changes due to marriage

– e.g. if Adela LANDOVÁ married Jakub ŠTYCHKOV, she may be known as Adela ŠTYCHKOVÁ or Adela LANDOVÁ-ŠTYCHKOVÁ.

  • International naming conventions

– Eastern order: family name (surname), then forename (given name)
– Western order: forename (given name), then family name (surname)
– Surname prefixes: Abdel, Abdul, Abu, Af, Akhu, Al, Ben, De, Della, Des, Du, El, Ibn, La, Le, On, Op
– Multiple family names, e.g. María-Jose Carreño Quiñones
– Brazilians may have three or four family names

Example name variants:

First Name | Middle Name | Last Name
T          |             | Scullion
Tom        |             | Scullion
Thomas     | Hyun        | Scullion

SLIDE 18

Issues with Institutional/Affiliation Data

  • Lack of standardization in affiliation names

– University of California at Davis
– University of California Davis
– University of California at Davis School of Medicine
– University of California, Davis

  • Authors migrating from one affiliation to another
  • Data represented in multiple languages

– Institut für Klinische Pharmakologie und Toxikologie, Charité Campus Benjamin Franklin, Garystr. 5, 14195 Berlin
– Institut für Arbeitsphysiologie an der Universität Dortmund
– Institut für Theoretische Physik der Universität Heidelberg
– Laboratoire d’Electrochimie et des Procédés Membranaires

Example records:

First Name | Last Name | Department | Organization | E-mail
Abdurrahman | Sahin | Department of Civil Engineering | Karadeniz Technical University | abdurrahmansahin@hotmail.com
Abdurrahman | Sahin | Department of Earthquake Engineering | Bogazici University | abdurrahman.sahin@boun.edu.tr
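To make the standardization problem concrete, here is a small sketch of how affiliation variants might be reduced to a common comparison key. The function name and the filler-word list are assumptions for illustration; a production system would more likely rely on curated institution lists than on string rules alone.

```python
import re

FILLER_WORDS = {"at", "the"}  # illustrative, not an exhaustive list


def affiliation_key(raw):
    """Collapse case, punctuation, and filler words into a comparison key."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())  # punctuation becomes a space
    tokens = [t for t in text.split() if t not in FILLER_WORDS]
    return " ".join(tokens)
```

Under this rule the three "Davis" spellings share one key, while "University of California at Davis School of Medicine" keeps a distinct key, so deciding whether a school belongs with its parent institution remains a separate (possibly manual) decision.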

SLIDE 19

Other Data Related Issues

  • Accented characters (require conversion into Unicode)
  • Surname prefixes (van, von, de, ...)
  • Names of cities and states being the same in different countries
  • Authors represented by generic e-mail addresses (Yahoo or Gmail) without unique organization IDs
  • E-mail addresses that do not follow standard formats
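Two of these issues lend themselves to a short sketch using Python's standard `unicodedata` module: normalizing accented characters to one Unicode form, and flagging generic e-mail domains that carry no organizational signal. The helper names and the domain list are assumptions for illustration only.

```python
import unicodedata

GENERIC_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}  # illustrative list


def nfc(text):
    """Normalize to NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


def ascii_key(text):
    """Accent-insensitive key for matching only (never for display)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def is_generic_email(email):
    """True when the domain says nothing about the author's organization."""
    return email.rsplit("@", 1)[-1].lower() in GENERIC_DOMAINS
```

The accent-stripped key is lossy, so it is best kept alongside the original spelling rather than replacing it.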

SLIDE 20

Modular Approach to Data Preparation

Data Preparation and Enhancement

Data Parsing

  • Source input documents
  • Identify author data
  • SME verification of identified author data against the input document

Data Validation

  • Error identification using global validation checks across author names and affiliation data

Data Standardization

  • Standardization of author names and affiliation data using predefined rules and knowledge repositories

Disambiguation and Visualization

  • Disambiguation by email and affiliation mapping
  • Disambiguation by co-author analysis
  • Manual validation of email ID if required
  • Creation of unique author profiles
  • Author data clustering and visualization

SLIDE 21

Parsing Module

  • Author records need to be split into their constituent data fields: surname, first name, email, division, organization, city, state, country, etc.

The data parsing module extracts author data from input documents, parses the data and populates the relevant fields in a predefined template.

[Figure labels: Source Document; Parsed Output Data]
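As a rough sketch of "populating the relevant fields in a predefined template", the example below splits one author line into fields. The semicolon-delimited input layout, the `AuthorRecord` template, and `parse_author` are all hypothetical; real source documents are far less regular, which is why SME verification follows parsing.

```python
from dataclasses import dataclass


@dataclass
class AuthorRecord:
    """Hypothetical template holding a subset of the fields named above."""
    surname: str = ""
    first_name: str = ""
    division: str = ""
    organization: str = ""
    email: str = ""


def parse_author(line):
    """Parse 'Surname, First; Division; Organization; email' (assumed layout)."""
    fields = [f.strip() for f in line.split(";")]
    fields += [""] * (4 - len(fields))  # pad any missing trailing fields
    surname, _, first = fields[0].partition(",")
    return AuthorRecord(surname.strip(), first.strip(),
                        fields[1], fields[2], fields[3])
```

Missing fields are left as empty strings so that a later validation step can flag them instead of the parser guessing.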

SLIDE 22

Validation Module

  • Parsed data needs to be validated for accuracy. Automation based on predefined rules, built-in databases and other knowledge repositories can help, but manual intervention is typically required to achieve a desirable level of accuracy.

The data validation module identifies formatting and parsing errors and flags them for human validation and correction.

[Figure: Output validation using predefined rules]
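Rule-based validation of this kind might look like the sketch below, which returns a list of problems for a human reviewer rather than silently correcting anything. The three rules, the record keys, and the loose e-mail pattern are assumptions chosen for illustration, not the module's actual rule set.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(record):
    """Return a list of problems to route to a human reviewer."""
    problems = []
    if not record.get("surname"):
        problems.append("missing surname")
    email = record.get("email", "")
    if email and not EMAIL_RE.match(email):
        problems.append("malformed email")
    if any(ch.isdigit() for ch in record.get("first_name", "")):
        problems.append("digit in first name")
    return problems
```

An empty list means the record passed every automated rule; it does not mean the record is correct.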

SLIDE 23

Standardization Module

  • This process isolates incorrect names in each field by comparing them with standard names in pre-built databases. It supports running partial or complete standardization rules, and manual validation for errors that cannot be corrected automatically.

The self-learning standardization module has built-in thesauri which are continuously updated based on automatic and manual corrections.
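One simple way to picture a thesaurus that is "continuously updated" is a lookup table that grows whenever a reviewer resolves an unknown variant. The `Thesaurus` class below is a hypothetical sketch of that idea and makes no claim about how the actual module learns.

```python
class Thesaurus:
    """Maps known variants to a standard form and learns new variants."""

    def __init__(self, known):
        self.variants = {k.strip().lower(): v for k, v in known.items()}

    def standardize(self, value):
        """Return (standard_form, True), or (value, False) when unknown."""
        key = value.strip().lower()
        if key in self.variants:
            return self.variants[key], True
        return value, False

    def learn(self, variant, standard):
        """Record a manual correction so the next occurrence is automatic."""
        self.variants[variant.strip().lower()] = standard
```

Values returned with `False` would go to manual validation, and the reviewer's decision is then fed back through `learn`.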

SLIDE 24

Disambiguation Process

SLIDE 25

Name Variants

Ex: Author names represented in multiple ways in different articles:

First Name | Middle Name | Last Name | Organization | Email
K. | | Abdel-Ghaffar | University of California |
Khaled | A. S. | Abdel-Ghaffar | University of California | ghaffar@ece.ucdavis.edu
Khaled | | Abdel-Ghaffar | University of California | ghaffar@ece.ucdavis.edu

  • Disambiguation using exact mapping of e-mail, first name and last name:

First Name | Middle Name | Last Name | Organization | Email
Khaled | A. S. | Abdel-Ghaffar | University of California | ghaffar@ece.ucdavis.edu
Khaled | | Abdel-Ghaffar | University of California | ghaffar@ece.ucdavis.edu

  • Disambiguation using mapping of org, first name and last name w/out email IDs:

First Name | Middle Name | Last Name | Organization | Email
K. | | Abdel-Ghaffar | University of California |

Standardized affiliation information helps in automatic disambiguation using email and affiliation.
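The two mapping steps on this slide can be sketched as a two-pass ID assignment. This is an illustrative reading of the slide, not the vendor's algorithm: the record keys (`first`, `last`, `org`, `email`) and the rule that an initial matches a full first name are assumptions.

```python
def first_compatible(a, b):
    """An initial such as 'K.' is treated as compatible with 'Khaled'."""
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    if len(a) == 1 or len(b) == 1:
        return a[:1] == b[:1]
    return a == b


def assign_ids(records):
    """Return one ID per record (None when no safe match is found)."""
    ids, groups = [None] * len(records), {}
    # Pass 1: exact e-mail + first name + last name.
    for i, r in enumerate(records):
        if r.get("email"):
            key = (r["email"].lower(), r["first"].lower(), r["last"].lower())
            ids[i] = groups.setdefault(key, len(groups) + 1)
    # Pass 2: records without e-mail, matched on organization + name.
    for i, r in enumerate(records):
        if ids[i] is not None:
            continue
        candidates = {
            ids[j] for j, s in enumerate(records)
            if ids[j] is not None and s["org"] == r["org"]
            and s["last"].lower() == r["last"].lower()
            and first_compatible(s["first"], r["first"])
        }
        if len(candidates) == 1:  # attach only when the match is unambiguous
            ids[i] = candidates.pop()
    return ids
```

Records left with `None` are the ones that would fall through to co-author analysis or manual validation.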

SLIDE 26

Disambiguated Output – Affiliation Step

Unique Id | First Name | Middle Name | Last Name | Department | Organization | Email
1 | Khaled | A. S. | Abdel-Ghaffar | Department of Electrical & Computer Engineering | University of California | ghaffar@ece.ucdavis.edu
1 | Khaled | A. S. | Abdel-Ghaffar | Department of Electrical and Computer Engineering | University of California | ghaffar@ece.ucdavis.edu
1 | Khaled | A. S. | Abdel-Ghaffar | Department of Electrical and Computer Engineering | University of California | ghaffar@ece.ucdavis.edu

  • The disambiguated output will have name and affiliation fields in full, standardized format, with a unique ID assigned to each author.
  • Records not grouped under a unique ID will be subject to further manual validation to improve the accuracy of the automatic disambiguation process.

SLIDE 27

Manual Validation – Co-Author Analysis

Ex: Disambiguation when author names have different affiliations due to migration

Article ID | First Name | Last Name | Department | Organization | E-mail
00704872 | Tracy | Cameron | | Advanced Neuromodulation Systems, Inc | tracy@ans-medical.com
00623047 | Tracy | Cameron | Biomedical Engineering Unit | Queen's University Systems, Inc. | tracy@biomed.queensu.ca

Article ID | Author Name | Department | Organization | Email | State | Country
00704872 | Tracy Cameron | | Advanced Neuromodulation Systems, Inc. | tracy@ans-medical.com | TX | United States
00704872 | Gerald E. Loeb | | Queen's University | mb@tum.de | |

Article ID | Author Name | Department | Organization | Email | State | Country
00623047 | Tracy Cameron | Biomedical Engineering Unit | Queen's University Systems, Inc. | tracy@biomed.queensu.ca | |
00623047 | Gerald E. Loeb | Biomedical Engineering Unit | Queen's University | mb@tum.de | |

Two author records with the same name but different affiliations are compared by their co-authors to determine whether they represent the same author.
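A co-author check of this kind reduces to a set intersection over the author lists of the two articles. The sketch below is a minimal illustration with a hypothetical function name; it compares names case-insensitively and assumes the co-author names have already been standardized.

```python
def shared_coauthors(authors_a, authors_b, author):
    """Return co-authors (lowercased) that appear on both articles,
    excluding the author whose identity is in question."""
    me = author.lower()
    co_a = {name.lower() for name in authors_a if name.lower() != me}
    co_b = {name.lower() for name in authors_b if name.lower() != me}
    return co_a & co_b
```

For the example above, `shared_coauthors(["Tracy Cameron", "Gerald E. Loeb"], ["Tracy Cameron", "Gerald E. Loeb"], "Tracy Cameron")` yields one shared co-author, which supports treating the two Tracy Cameron records as the same person. A single shared co-author with a common name would be weaker evidence, so a reviewer may still need to confirm.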

SLIDE 28

Co-Author Analysis – Lack of Affiliation Data

  • Co-author relations for bibliographic data
  • Use relations to improve identification and disambiguation

Affiliation | Article 1 | Article 2
Author Name | Filippini, Daniel | Filippini, Daniel
First Name | Daniel | D.
Last Name | Filippini | Filippini
Division | Division of Applied Physics, IFM |
ORG | Linkoping University |
Email | danfi@ifm.liu.se |
Country | Sweden | Sweden

An author record without affiliation data will be disambiguated against another record with the same author name using co-authors.

SLIDE 29

Disambiguation Using Add’l Metadata

  • Author disambiguation using co-author and title keywords:

For records that could not be disambiguated due to lack of affiliation data, additional metadata such as title, keywords, and journal name will be extracted for disambiguation.
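One common way to compare two records by title keywords is Jaccard similarity over the keyword sets. The slide does not say which measure is used, so treat this as an assumed stand-in; the stopword list and function names are likewise illustrative.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def keywords(title):
    """Lowercased alphanumeric tokens from a title, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A high keyword overlap between two same-name records suggests a shared research area, which can be combined with co-author evidence before a merge decision.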

SLIDE 30

AuthEntik™

  • Scope’s Author Data Management Solution
  • A platform-based service for parsing, standardizing and disambiguating author data.
  • Combines automated algorithms, manual validation and standardized data repositories to develop fast, scalable and high-quality databases of disambiguated author names, their affiliations, publications and other related data.
  • AuthEntik can also provide clustering and visualization of relationships across authors.
  • Over 30 million records delivered
  • Other components of the service:

– Knowledge repositories
– Pre-defined rules for automation
– ISO standards
– Multi-lingual capabilities

SLIDE 31

Thank You!

Rich Kobel
Assoc. VP, Business Development
rkobel@scopeknowledge.com
www.scopeknowledge.com