Machine Learning for Author Disambiguation, Gilles Louppe, CERN



slide-1
SLIDE 1

Machine Learning for Author disambiguation

Gilles Louppe

CERN

October 14, 2015

1 / 12

slide-2
SLIDE 2

From publications to signatures

...

[Figure: Publications → Signatures]

Signature for Doe, John

  Title        Lorem ipsum dolor sit amet, consectetur adipiscing elit
  Author       Doe, John
  Affiliation  University of Foo
  Co-authors   Smith, John; Chen, Wang
  Year         2015

2 / 12

slide-3
SLIDE 3

Author disambiguation

For each author, group together all of their signatures, and only those.

[Figure labels: M.S.Smith.1, Z.Liang.4, Z.Liang.5, ..., Z.Liang.83, S.W.Hawking.1]

No more, no less: all and only the correct ones.

3 / 12

slide-4
SLIDE 4

Spread of the problem

As extracted from claimed publications in INSPIRE:

  • Authors have on average 2.06 name variants (synonyms).
  • E.g.: Doe, John; Doe, J.
  • Unique name variants are shared on average by 1.04 authors (homonyms).

Clustering on identical surnames and identical given name initials should therefore yield very good results on average. However, disambiguation issues are expected to grow with the rise of Asian researchers: Caucasian names (currently representative of INSPIRE authors) are almost never ambiguous, while Asian names very often are.
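The name-based grouping described above can be sketched in a few lines of Python. This is only an illustration: the helper names `name_key` and `group_by_name_key` and the toy signatures are hypothetical, and the key uses only the first given name initial, which is a simplification.

```python
from collections import defaultdict


def name_key(name):
    """Return (surname, first given name initial) for a 'Surname, Given' string."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return surname.strip().lower(), initial


def group_by_name_key(signatures):
    """Group signature ids that share the same surname and first initial."""
    groups = defaultdict(list)
    for sig_id, name in signatures:
        groups[name_key(name)].append(sig_id)
    return dict(groups)


# Toy signatures (hypothetical): (signature id, author name as printed).
signatures = [
    (1, "Doe, John"),
    (2, "Doe, J."),
    (3, "Doe, Jane"),
    (4, "Smith, John"),
]
groups = group_by_name_key(signatures)
```

On this toy input the synonyms "Doe, John" and "Doe, J." land in the same group, as intended, but so does "Doe, Jane". Name-based grouping cannot separate such cases.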

4 / 12

slide-5
SLIDE 5

How would you fare ?

5 / 12

slide-6
SLIDE 6

How would you fare ?

✓ Same authors

5 / 12

slide-7
SLIDE 7

How would you fare ?

5 / 12

slide-8
SLIDE 8

How would you fare ?

✓ Same authors

5 / 12

slide-9
SLIDE 9

How would you fare ?

5 / 12

slide-10
SLIDE 10

How would you fare ?

✗ Different authors

5 / 12

slide-11
SLIDE 11

How would you fare ?

5 / 12

slide-12
SLIDE 12

How would you fare ?

✓ Same authors

5 / 12

slide-13
SLIDE 13

Learning from data

  • Manual disambiguation is slow and difficult, even for experienced curators.
  • Couldn’t we automatically find a set of rules to decide whether two signatures belong to the same author?

$$\phi(s_1, s_2) = \begin{cases} 0 & \text{if } s_1 \text{ and } s_2 \text{ belong to the same author,} \\ 1 & \text{otherwise.} \end{cases}$$

  • This is a machine learning task called supervised learning.
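Supervised learning needs labeled examples. One plausible way to obtain them, sketched below, is to enumerate pairs of signatures whose author is already known (for instance from claimed publications) and label each pair 0 for the same author and 1 otherwise. The function `labeled_pairs` and the toy identifiers are hypothetical.

```python
from itertools import combinations


def labeled_pairs(claimed):
    """Build ((s1, s2), label) training pairs from known assignments.

    claimed maps a signature id to its known author id.
    label is 0 if both signatures belong to the same author, 1 otherwise.
    """
    pairs = []
    for s1, s2 in combinations(sorted(claimed), 2):
        label = 0 if claimed[s1] == claimed[s2] else 1
        pairs.append(((s1, s2), label))
    return pairs


# Toy ground truth (hypothetical): three signatures, two authors.
claimed = {"sig1": "author_A", "sig2": "author_A", "sig3": "author_B"}
pairs = labeled_pairs(claimed)
```

Here `pairs` contains one same-author pair (`sig1`, `sig2`) and two different-author pairs.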

6 / 12

slide-14
SLIDE 14

[Diagram]

(s1, s2) → Feature extraction → x = (name sim. = 0.7, title sim. = 0.3, ...) → Machine learning model ϕ → p(s1, s2 have different authors | x)
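As a rough illustration of the last step, the sketch below fits a small logistic regression that maps a feature vector x to an estimate of p(s1, s2 have different authors | x). Logistic regression is a stand-in chosen for brevity; the slides do not say which model is used. The training data, learning rate and function names are all made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - target  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict_proba(model, x):
    """Estimated probability that the two signatures have different authors."""
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Toy pairs (hypothetical): x = (name similarity, title similarity).
# Label 1 means "different authors", 0 means "same author".
X = [(0.9, 0.8), (0.95, 0.6), (0.8, 0.7), (0.3, 0.1), (0.2, 0.3), (0.4, 0.2)]
y = [0, 0, 0, 1, 1, 1]

model = train_logistic(X, y)
p_similar = predict_proba(model, (0.9, 0.7))       # highly similar pair
p_dissimilar = predict_proba(model, (0.25, 0.2))   # dissimilar pair
```

With this toy data, highly similar pairs receive a low probability of having different authors and dissimilar pairs a high one.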

7 / 12

slide-15
SLIDE 15

Feature extraction

Feature                            Combination operator
Full name                          Cosine similarity of (2, 4)-TF-IDF
Given names                        Cosine similarity of (2, 4)-TF-IDF
First given name                   Jaro-Winkler distance
Second given name                  Jaro-Winkler distance
Given name initial                 Equality
Affiliation                        Cosine similarity of (2, 4)-TF-IDF
Co-authors                         Cosine similarity of TF-IDF
Title                              Cosine similarity of (2, 4)-TF-IDF
Journal                            Cosine similarity of (2, 4)-TF-IDF
Abstract                           Cosine similarity of TF-IDF
Keywords                           Cosine similarity of TF-IDF
Collaborations                     Cosine similarity of TF-IDF
References                         Cosine similarity of TF-IDF
Subject                            Cosine similarity of TF-IDF
Year difference                    Absolute difference
White                              Product of estimated probabilities
Black                              Product of estimated probabilities
American Indian or Alaska Native   Product of estimated probabilities
Chinese                            Product of estimated probabilities
Japanese                           Product of estimated probabilities
Other Asian or Pacific Islander    Product of estimated probabilities
Others                             Product of estimated probabilities
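To make one of these operators concrete, here is a minimal pure-Python sketch of a cosine similarity over TF-IDF vectors. It assumes that "(2, 4)-TF-IDF" means TF-IDF computed over character n-grams of length 2 to 4; the slide does not spell this out. The smoothed IDF formula, the helper names and the toy corpus are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max, lowercased."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def fit_idf(corpus):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(char_ngrams(doc)))
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; n-grams unseen at fit time are ignored."""
    tf = Counter(char_ngrams(text))
    return {g: count * idf[g] for g, count in tf.items() if g in idf}


def cosine(u, v):
    dot = sum(weight * v.get(g, 0.0) for g, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus of author names (hypothetical).
corpus = ["Doe, John", "Doe, J.", "Smith, John", "Chen, Wang"]
idf = fit_idf(corpus)

doe_john = tfidf_vector("Doe, John", idf)
sim_variant = cosine(doe_john, tfidf_vector("Doe, J.", idf))
sim_other = cosine(doe_john, tfidf_vector("Chen, Wang", idf))
```

On this toy corpus the name variant "Doe, J." scores higher against "Doe, John" than the unrelated name "Chen, Wang" does, which is the behaviour a name-similarity feature should have.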

8 / 12

slide-16
SLIDE 16

Disambiguation as a clustering problem

  • Author disambiguation = clustering signatures that belong to the same author.
  • Using our model ϕ, the probability that two signatures belong to different authors can be used as a (pseudo) distance metric and, e.g., plugged into a hierarchical clustering algorithm.
  • The complexity of hierarchical clustering is O(N^2). For N = 10^7 signatures, this is impractical. Solution: pre-cluster signatures into blocks of smaller size, then cluster each of these blocks.
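For N = 10^7 signatures there are N(N - 1)/2, roughly 5 × 10^13, pairs to score, which is why blocking matters. The sketch below shows the idea on toy data: signatures are already split into blocks, and each block is clustered independently with a naive average-linkage agglomerative procedure that stops merging once the closest pair of clusters exceeds a threshold. The linkage choice, the 0.5 threshold, the function names and the distance table standing in for the model ϕ are all assumptions made for illustration.

```python
from itertools import combinations


def n_pairs(n):
    """Number of unordered pairs among n signatures."""
    return n * (n - 1) // 2


def average_linkage(items, distance, threshold):
    """Naive agglomerative clustering with average linkage.

    Repeatedly merges the two closest clusters until the smallest average
    inter-cluster distance exceeds the threshold.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(distance(a, b) for a in clusters[i] for b in clusters[j])
            d = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def cluster_by_blocks(blocks, distance, threshold=0.5):
    """Cluster every block independently and concatenate the results."""
    result = []
    for block in blocks.values():
        result.extend(average_linkage(block, distance, threshold))
    return result


# Toy blocks (hypothetical), e.g. produced by grouping on surname and initial.
blocks = {"doe j": ["a", "b", "c"], "smith j": ["d"]}

# Hypothetical model output: p(different authors) for each pair in a block.
p_different = {
    frozenset(("a", "b")): 0.1,
    frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.8,
}


def distance(x, y):
    return p_different[frozenset((x, y))]


clusters = cluster_by_blocks(blocks, distance)
```

Only pairs inside the same block are ever scored, so the cost is driven by the size of the largest blocks rather than by N.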

9 / 12

slide-17
SLIDE 17

Workflow

10 / 12

slide-18
SLIDE 18

Results

               F measure
Baseline [1]   0.9409
Our model      0.9862

[1] Group by same surnames and same given name initials.
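The slide does not say which variant of the F measure is reported. As one common choice for evaluating clusterings, the sketch below computes a pairwise F measure: precision and recall are taken over pairs of signatures placed in the same cluster. The function names and the toy clusterings are hypothetical, and the numbers it produces are unrelated to the results above.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Set of unordered signature pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(cluster, 2))
    return pairs


def pairwise_f_measure(predicted, truth):
    """Harmonic mean of pairwise precision and pairwise recall."""
    pred = same_cluster_pairs(predicted)
    true = same_cluster_pairs(truth)
    true_positives = len(pred & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical): the truth keeps "c" apart from "a" and "b".
truth = [["a", "b"], ["c"], ["d"]]
over_merged = [["a", "b", "c"], ["d"]]

f_over_merged = pairwise_f_measure(over_merged, truth)
f_perfect = pairwise_f_measure(truth, truth)
```

In the toy example, merging "c" with "a" and "b" keeps recall at 1 but drops precision to 1/3, giving an F measure of 0.5.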

11 / 12

slide-19
SLIDE 19

References

  • Implementation available at https://github.com/inveniosoftware/beard.
  • Ethnicity sensitive author disambiguation using semi-supervised learning. Gilles Louppe, Hussein Al-Natsheh, Mateusz Susik, Eamonn Maguire. http://arxiv.org/abs/1508.07744.

12 / 12