


SLIDE 1

3/19/10 1

Pucktada Treeratpituk (Puck)

PhD Student College of Information Sciences & Technology Penn State University

Background

Education

– 3rd-year PhD, Information Sciences & Technology, Penn State

CiteSeerX digital library (http://citeseerx.ist.psu.edu)
Advisor: Dr. C. Lee Giles

– Intelligent Information Systems Research Lab

– MS. Language Technology, Carnegie Mellon University

– MS. Computer Science, Stanford University

– BS. Computer Science, Mathematics and Economics, Carnegie Mellon University

General Research Interest

– Data Mining, Information Retrieval, Information Extraction, Natural Language Processing, and Digital Libraries

SLIDE 2

Current Research

Author Disambiguation in Digital Libraries

– using machine learning techniques to resolve name ambiguity

– http://citeseerx.ist.psu.edu

Key Phrase Extraction from Scholarly Work

– mining & identifying important topics in each academic paper

Automatic Expertise Extraction

– identifying expertise & research interests for each author, based on the content of their publication records

– http://singularity.ist.psu.edu/expert

Expert Finding in Digital Library

– computing query-dependent expert ranking

Author Disambiguation in CiteSeerX

SLIDE 3

Automatic Expertise Extraction

http://singularity.ist.psu.edu/expert

Author Disambiguation in CiteSeerX

Since CiteSeerX crawls the web for scientific papers (PDF/PS), we have to rely on metadata extracted from the papers for disambiguation.

Approaches

– Learn to estimate the likelihood that two author names from two different papers refer to the same person, based on metadata such as affiliation, paper title, coauthors, etc., then cluster the names.

– Previous Work

SVM + DBSCAN (Huang et al., PKDD’06)
Topic Model (Song et al., JCDL’07)
Random Forest (Treeratpituk et al., JCDL’09) (for MEDLINE)
Also do other types of record matching, such as citations
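The two-step idea on this slide (score pairs of author mentions, then cluster) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the record fields, the hand-picked weights, the threshold, and the use of simple connected-components clustering. The cited systems use trained models (SVM, Random Forest) and DBSCAN instead.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def same_author_score(r1, r2):
    """Hypothetical hand-weighted score in [0, 1].
    A trained classifier would normally replace this function."""
    coauthor = jaccard(r1["coauthors"], r2["coauthors"])
    title = jaccard(r1["title"].lower().split(), r2["title"].lower().split())
    affil = 1.0 if r1["affiliation"] and r1["affiliation"] == r2["affiliation"] else 0.0
    return 0.5 * coauthor + 0.2 * title + 0.3 * affil

def cluster(records, threshold=0.4):
    """Link every pair scoring at or above the threshold (union-find),
    then return the connected components as lists of record indices."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_author_score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up example records for the same ambiguous name.
records = [
    {"title": "Author disambiguation in digital libraries",
     "coauthors": ["C. Giles", "H. Zha"], "affiliation": "Penn State"},
    {"title": "Name disambiguation with random forests",
     "coauthors": ["C. Giles"], "affiliation": "Penn State"},
    {"title": "Protein folding dynamics",
     "coauthors": ["A. Jones"], "affiliation": "MIT"},
]
print(cluster(records))  # [[0, 1], [2]]
```

The first two records share a coauthor and an affiliation, so they are linked; the third shares nothing and stays in its own cluster.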

SLIDE 4


Iterative approach

– Right now, disambiguation is done in batch.

– New documents are added every week, so disambiguation should also be done iteratively.
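One way to picture the iterative mode is a greedy update: each newly crawled record is compared against the existing author clusters and either joins the best match or starts a new cluster. This is only a sketch under assumptions. The `coauthor_overlap` similarity, the record layout, and the threshold are hypothetical, and a real system would also need to revisit earlier decisions, which this sketch never does.

```python
def coauthor_overlap(r1, r2):
    """Hypothetical similarity: Jaccard overlap of the coauthor sets."""
    a, b = set(r1["coauthors"]), set(r2["coauthors"])
    return len(a & b) / len(a | b) if (a or b) else 0.0

def add_record(clusters, record, score=coauthor_overlap, threshold=0.3):
    """Attach `record` to the best-matching existing cluster, or start a
    new cluster if no member scores at or above `threshold`.
    Returns the index of the cluster the record ended up in."""
    best_idx, best_score = None, threshold
    for idx, members in enumerate(clusters):
        s = max(score(record, m) for m in members)
        if s >= best_score:
            best_idx, best_score = idx, s
    if best_idx is None:
        clusters.append([record])
        return len(clusters) - 1
    clusters[best_idx].append(record)
    return best_idx

# Records arriving one at a time, e.g. from a weekly crawl.
clusters = []
first = add_record(clusters, {"coauthors": ["C. Giles", "H. Zha"]})
second = add_record(clusters, {"coauthors": ["A. Jones"]})
third = add_record(clusters, {"coauthors": ["C. Giles"]})
```

Here the third record joins the first record's cluster, while the second starts its own. Existing clusters are not re-compared, so the cost per new document stays proportional to the number of stored records rather than requiring a full batch rerun.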

Interactive Mode


– Will never be 100% perfect.

– Allow user correction (merge/split).

– Provide suggestions.
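The merge/split corrections could be backed by a small bookkeeping structure like the one sketched below. The class name, method names, and integer cluster IDs are all assumptions made for illustration, not a description of how CiteSeerX stores corrections. Generating suggestions is not covered here.

```python
class AuthorClusters:
    """Hypothetical store mapping record IDs to integer author-cluster IDs."""

    def __init__(self, assignment):
        self.assignment = dict(assignment)
        # Next unused cluster ID, used when a split creates a new author.
        self.next_id = max(self.assignment.values(), default=-1) + 1

    def merge(self, keep, absorb):
        """User says two clusters are the same person: fold `absorb` into `keep`."""
        for rec, cid in self.assignment.items():
            if cid == absorb:
                self.assignment[rec] = keep

    def split(self, record_ids):
        """User says these records belong to a different person:
        move them into a fresh cluster and return its ID."""
        new_id = self.next_id
        self.next_id += 1
        for rec in record_ids:
            self.assignment[rec] = new_id
        return new_id

    def members(self, cid):
        """Sorted record IDs currently assigned to cluster `cid`."""
        return sorted(r for r, c in self.assignment.items() if c == cid)

store = AuthorClusters({"p1": 0, "p2": 0, "p3": 1})
store.merge(keep=0, absorb=1)   # p3 now belongs to author 0
new_author = store.split(["p2"])  # p2 is moved to a new author cluster
```

Keeping user corrections as explicit operations like these would also let a later batch or iterative run treat them as constraints, although that step is outside this sketch.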

What’s Next

Thanks…