InvIdenti: Author Disambiguation for Medical Patents
Bachelor Thesis Presentation
Sanchit Alekh
28 July 2016
Guide (IIIT-A): Prof. Dr. U.S. Tiwary
Guide (RWTH Aachen): PD Dr. Christoph Quix
Enrolment: IIT2012108
Email: iit2012108@iiita.ac.in / alekh@dbis.rwth-aachen.de
Prof. M. Jarke, Lehrstuhl Informatik 5, RWTH Aachen

Slide 2: Contents
1. Introduction and Goals
2. Background
3. Approach and Solution
4. Evaluation
5. Conclusion
6. Scope for Future Work

Slide 3: Introduction and Goals
1. What and Why?
• Author disambiguation: distinguish between inventors with the same or similar names or competence fields
• Identifying by name has severe limitations
• Spelling errors in the patent database introduce ambiguity
• Authors/inventors may share a name and/or expertise area
• Manual approaches are infeasible and not future-proof due to the explosion in the number of patents

Slide 4: Introduction and Goals
2. Software Functionality Goals
• Feature selection: find good and representative features for the disambiguation task
• Importance weighting of features
• Similarity calculation
• Patent clustering
• Patent-publication matching

Slide 5: Introduction and Goals
3. Software Quality Goals
• Software design and architecture: the software should conform to S.O.L.I.D principles for code maintainability and the possibility of future extension
• Support for parallelization and multiprocessor architectures
• Lucid documentation for long-term maintainability
  - UML diagrams
  - JavaDoc™ documentation
  - Wiki pages

Slide 6: Background
1. Project Mi-Mappa
• Complex innovation in medical engineering is not possible without collaboration
• The goal is to develop an integrative competence model based on data mining algorithms
• Assignment of patents and medical products to competence fields
• Actors are selected for a given project based on published texts
• Use of ontology modeling and matching, and data and text mining

Slide 7: Background
2. Related Work
• PatentsView Inventor Disambiguation Workshop, Sept. 2015: neural networks, rule-based methods and ensemble ML methods for inventor disambiguation
• [Fleming et al. 2014] Disambiguation and Co-Authorship Networks of the US Patent Inventor Database (1975-2010): uses a naïve Bayesian classifier technique with blocking
• [Maraut et al. 2014] Identifying author-inventors from Spain: computes a global similarity and clusters inventors based on that

Slide 8: Solution: Outline
1. The underlying data structure is an Inventor-Patent Instance, which stores the metadata as well as textual features
2. An assortment of 10 features is used: 6 metadata features and 4 textual features
3. A different feature similarity metric is used for each feature to compute a weighted similarity matrix between instance pairs
4. Weight training is done with logistic regression, using pre-labelled instances from the dataset provided by Fleming et al.
5. Hierarchical clustering and DBSCAN are used to assign inventor-patent instances to clusters with unique inventors

Slide 9: Solution: Flow
Fig. 9.1 Flowchart of processes involved in InvIdenti

Slide 10: Solution: Inventor-Patent Instance
Fig. 10.1 Inventor-Patent Instances obtained from Patent X
Fig. 10.2 Ten features used to represent the Inventor-Patent Instance

Slide 11: Solution: Similarity
• Feature Similarity Techniques
  1. Name: Levenshtein distance
  2. Location: country + distance (from latitude and longitude)
  3. Assignee: assignee code + Levenshtein distance of the assignee name
  4. Technology class: number of shared classes
  5. Co-inventors: number of shared co-inventors
  6. Textual features: cosine similarity between document vectors
Fig. 11.1 Feature Similarity Calculation for Location, Co-Author and Textual Features
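
Three of the measures listed on this slide can be sketched in a few lines. This is only an illustration, not the InvIdenti implementation: the function names are hypothetical, and the use of the haversine (great-circle) formula for the latitude/longitude distance is an assumption, since the slide does not say which distance formula is used.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed formula, Earth radius 6371 km)."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Levenshtein and haversine return distances, so they still have to be converted to similarities and normalized before they can be combined with the other features.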

Slide 12: Solution: Similarity
• Feature Similarity Transformations
  1. Distance measures are converted to similarity measures
  2. All similarity values are normalized to fall within the range [0, 1]
• Global Similarity
  1. $S_{\text{global}} = \sum_{i=1}^{n} w_i S_i$, where the sum runs over the features, $w_i$ are the feature weights and $S_i$ are the normalized similarity values
  2. Threshold: $\vartheta$
• How to find suitable values for the weights and the threshold?
  - Logistic Regression
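
A minimal sketch of the two transformations and the weighted sum follows. The linear distance-to-similarity mapping and the `max_distance` cap are assumptions made for illustration (the slide does not state how distances are converted), and all function names are hypothetical.

```python
def distance_to_similarity(distance, max_distance):
    """Map a distance in [0, max_distance] linearly onto a similarity in [0, 1].

    Assumed conversion: distances at or beyond max_distance give similarity 0.
    """
    if max_distance <= 0:
        return 1.0
    clipped = min(max(distance, 0.0), max_distance)
    return 1.0 - clipped / max_distance


def global_similarity(weights, similarities):
    """Weighted sum of the normalized per-feature similarities."""
    if len(weights) != len(similarities):
        raise ValueError("one weight per feature similarity is required")
    return sum(w * s for w, s in zip(weights, similarities))


def is_match(weights, similarities, threshold):
    """Two instances match when the global similarity exceeds the threshold."""
    return global_similarity(weights, similarities) > threshold
```

For example, `is_match([0.5, 0.5], [1.0, 0.0], 0.4)` compares a global similarity of 0.5 against a threshold of 0.4 and reports a match.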

Slide 13: Solution: Logistic Regression
• Maximum log-likelihood is used to model the probability $P(Y = 1 \mid X = x)$, based on the binary output variable $Y \in \{0, 1\}$
• The logit function (the inverse of the logistic function) is used to model this probability. Solving the logit equation for $P(Y = 1 \mid X = x)$ gives the sigmoid function, which is bounded in both directions
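
The step from the logit to the sigmoid can be written out as follows. This is the standard textbook derivation, not a transcription of the slide's equation. It assumes a linear predictor $z$ built from the weights $w_i$ and threshold $\vartheta$ used elsewhere in the deck, with the threshold acting as a negative intercept.

```latex
% Assumption: z = \sum_i w_i x_i - \vartheta, and p = P(Y = 1 \mid X = x)
\begin{align*}
\ln\frac{p}{1 - p} &= z, \qquad z = \sum_{i=1}^{n} w_i x_i - \vartheta \\
\frac{p}{1 - p} &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \sigma(z)
\end{align*}
```

Under this assumption $p > 0.5$ exactly when $\sum_{i} w_i x_i > \vartheta$, which is the match condition.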

Slide 14: Solution: Logistic Regression
• Using logistic regression, the aim is to train the model on labelled data to obtain the weights and the threshold
  - There is a match if $\sum_{i=1}^{n} w_i x_i$ is greater than $\vartheta$, and no match if it is less
• For training, there must be a cost function associated with the sigmoid function. The cost function follows a negative-log form
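
The slide's own cost formula is not preserved in this text. A common negative-log cost for logistic regression is the averaged cross-entropy, sketched below as an assumption; the function names and the `eps` clipping constant are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cross_entropy_cost(probabilities, labels, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all labelled pairs.

    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

The cost is small when confident predictions agree with the labels and grows without bound as a confident prediction turns out to be wrong, which is what makes it suitable for gradient-based training.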

Slide 15: Solution: Logistic Regression
• For training in logistic regression, the classic gradient descent method is used, i.e. error correction is made by a factor of the gradient of the cost function
• Therefore, after every iteration of gradient descent, each parameter is updated by subtracting the corresponding partial derivative of the cost function scaled by α, where α is the learning rate
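
A minimal batch gradient descent loop for logistic regression might look like the sketch below. It assumes the cross-entropy cost and treats the match threshold as the negated intercept; the function name, the toy data and the hyperparameter values are all made up for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(features, labels, learning_rate=0.5, iterations=5000):
    """Return (weights, threshold) learned by batch gradient descent.

    Assumption: the model is sigmoid(sum(w_i * x_i) + bias), so a pair is a
    match when sum(w_i * x_i) > -bias; the threshold returned is -bias.
    """
    m, n = len(features), len(features[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Update rule: parameter <- parameter - alpha * (mean gradient)
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, -bias


# Toy similarity vectors: the first two pairs are matches, the last two are not.
features = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
labels = [1, 1, 0, 0]
weights, threshold = train_logistic_regression(features, labels)
predictions = [sum(w * xi for w, xi in zip(weights, x)) > threshold for x in features]
```

On this toy data the learned weights and threshold separate the matching pairs from the non-matching ones.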

Slide 16: Solution: Transitivity
• Simple binary classification using logistic regression does not yield good results. Why?
  1. Many inventors cover several expertise areas
  2. Inventors may change their location or organization/university
  3. Logistic regression often suffers from overfitting
• To remedy this, we propose that an additional property, transitivity, be fulfilled by patents: if patent A matches patent B and B matches C, then A, B and C are assigned to the same inventor

Slide 17: Solution: Transitivity
• In InvIdenti, transitivity is effected by clustering algorithms, i.e. hierarchical clustering and DBSCAN
• In hierarchical clustering, the type of linkage method used controls the extent of transitivity
  1. Single linkage: promotes chaining; best transitivity
  2. Complete linkage: avoids chaining; worst transitivity
  3. Group-average linkage: medium transitivity
• In DBSCAN, the parameter MinPts determines the extent of transitivity. MinPts = 1 guarantees chaining and best transitivity
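
The effect of the linkage method on chaining can be seen on a toy example. The helper name `cluster_similarity` and the similarity values below are invented for illustration. Because the values are similarities rather than distances, single linkage takes the maximum pairwise value and complete linkage the minimum.

```python
def cluster_similarity(cluster_a, cluster_b, sim, linkage="single"):
    """Similarity between two clusters of instance indices under a linkage rule."""
    values = [sim[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "single":      # best pair decides: promotes chaining
        return max(values)
    if linkage == "complete":    # worst pair decides: avoids chaining
        return min(values)
    if linkage == "average":     # mean over all pairs: in between
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {linkage}")


# Instances A=0, B=1, C=2: A~B and B~C are similar, but A~C is not.
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.9],
    [0.2, 0.9, 1.0],
]
ab, c = [0, 1], [2]
single = cluster_similarity(ab, c, sim, "single")
complete = cluster_similarity(ab, c, sim, "complete")
average = cluster_similarity(ab, c, sim, "average")
```

With a threshold of 0.5, single linkage would merge C into the cluster {A, B} through the B~C link, while complete linkage would keep C separate because A and C are dissimilar.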

Slide 18: Solution: Hierarchical Clustering
• Hierarchical agglomerative clustering starts with each patent in a different cluster, and then merges clusters successively based on the best similarity values
• The stopping criterion used is the threshold obtained from logistic regression
• We employ single-linkage clustering, which uses the best similarity value between clusters to merge them
• When the cluster similarities are less than the threshold, the merge process is stopped
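
The procedure described on this slide can be sketched as a naive loop over a precomputed similarity matrix. This is an illustrative version rather than the InvIdenti code: the function name and the toy similarity values are assumptions, and a real implementation would avoid recomputing all pairwise cluster similarities in every round.

```python
def single_linkage_clustering(sim, threshold):
    """Agglomerative clustering with single linkage and a similarity threshold.

    sim is a symmetric matrix of global similarities between instances.
    Merging stops once the best cluster similarity falls below the threshold.
    """
    clusters = [[i] for i in range(len(sim))]  # each instance starts alone
    while len(clusters) > 1:
        best_value, best_pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: the most similar pair across the two clusters
                value = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if value > best_value:
                    best_value, best_pair = value, (a, b)
        if best_value < threshold:
            break
        a, b = best_pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy data: instances 0-1-2 form a chain, 3-4 form a pair, everything else is 0.1.
pairs = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.3, (3, 4): 0.85}
n = 5
sim = [[1.0 if i == j else pairs.get((min(i, j), max(i, j)), 0.1)
        for j in range(n)] for i in range(n)]
clusters = single_linkage_clustering(sim, threshold=0.6)
```

Here instance 2 joins the cluster of 0 and 1 through its strong link to instance 1, even though its similarity to instance 0 is below the threshold.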
