Pucktada Treeratpituk (Puck) PhD Student College of Information


  1. 3/19/10
     Pucktada Treeratpituk (Puck)
     PhD Student, College of Information Sciences & Technology, Penn State University
     Background
     • Education
       – 3rd-year PhD, Information Sciences & Technology, Penn State
         • CiteSeerX digital library (http://citeseerx.ist.psu.edu)
         • Advisor: Dr. C. Lee Giles – Intelligent Information Systems Research Lab
       – MS, Language Technology, Carnegie Mellon University
       – MS, Computer Science, Stanford University
       – BS, Computer Science, Mathematics and Economics, Carnegie Mellon University
     • General Research Interests
       – Data Mining, Information Retrieval, Information Extraction, Natural Language Processing, and Digital Libraries

  2. Current Research
     • Author Disambiguation in Digital Libraries
       – using machine learning techniques to resolve name ambiguity
       – http://citeseerx.ist.psu.edu
     • Key Phrase Extraction from Scholarly Work
       – mining and identifying the important topics in each academic paper
     • Automatic Expertise Extraction
       – identifying the expertise and research interests of each author based on the content of their publication record
       – http://singularity.ist.psu.edu/expert
     • Expert Finding in Digital Libraries
       – computing query-dependent expert rankings
     Author Disambiguation in CiteSeerX

  3. Automatic Expertise Extraction
     http://singularity.ist.psu.edu/expert
     Author Disambiguation in CiteSeerX
     • CiteSeerX crawls the web for scientific papers (PDF/PS), so disambiguation has to rely on the metadata we extract from the papers themselves.
     • Approach
       – Learn to estimate the likelihood that two author names from two different papers refer to the same person, based on metadata such as affiliation, paper title, and coauthors, then cluster.
       – Previous Work
         • SVM + DBSCAN (Huang et al., PKDD’06)
         • Topic Model (Song et al., JCDL’07)
         • Random Forest (Treeratpituk et al., JCDL’09) (for MEDLINE)
     • We also do other types of record matching, such as citations.
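The two-step approach on this slide (score each pair of author mentions, then cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the cited papers: the hand-picked weights stand in for a learned model such as an SVM or random forest, a simple threshold with connected components stands in for DBSCAN, and the record fields and sample data are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def same_person_score(r1, r2):
    """Hand-weighted stand-in for a learned pairwise model (weights are assumptions)."""
    return (0.5 * jaccard(r1["coauthors"], r2["coauthors"])
            + 0.3 * float(r1["affiliation"] == r2["affiliation"])
            + 0.2 * jaccard(r1["title"].lower().split(), r2["title"].lower().split()))

def cluster(records, threshold=0.3):
    """Link every pair scoring >= threshold and return the connected components."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_person_score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical mentions of the same name string extracted from three papers.
records = [
    {"name": "J. Smith", "affiliation": "Penn State",
     "coauthors": {"C. L. Giles"}, "title": "Name disambiguation in digital libraries"},
    {"name": "J. Smith", "affiliation": "Penn State",
     "coauthors": {"C. L. Giles", "Y. Song"}, "title": "Topic models for author names"},
    {"name": "J. Smith", "affiliation": "Stanford",
     "coauthors": {"A. Ng"}, "title": "Deep networks for vision"},
]
clusters = cluster(records)
print(clusters)
```

In this sketch the first two mentions share an affiliation and a coauthor and end up in one cluster, while the third stays on its own. Comparing all pairs is quadratic, so a real system would presumably compare only mentions whose names are already similar.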

  4. What’s Next
     • Iterative approach
       – Right now, disambiguation is done in batch.
       – New documents are added every week, so disambiguation should also be done iteratively.
     • Interactive mode
       – Disambiguation will never be 100% perfect.
       – Allow user correction (merge/split).
       – Provide suggestions.
     Thanks…
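One way to picture the iterative and interactive modes together is a small cluster store that assigns each new record to its best-matching existing cluster and lets a user merge or split clusters afterwards. The sketch below is an assumption-laden illustration, not the CiteSeerX design: the class `AuthorClusters`, its methods, the coauthor-overlap score, and the sample records are all hypothetical.

```python
class AuthorClusters:
    """Hypothetical incremental cluster store with manual merge/split."""

    def __init__(self, score, threshold=0.5):
        self.score = score          # pairwise "same person" score in [0, 1]
        self.threshold = threshold  # minimum score to join an existing cluster
        self.clusters = {}          # cluster id -> list of records
        self._next_id = 0

    def _new_cluster(self, members):
        cid = self._next_id
        self._next_id += 1
        self.clusters[cid] = list(members)
        return cid

    def add(self, record):
        """Assign one new record without re-clustering the existing ones."""
        best_id, best = None, 0.0
        for cid, members in self.clusters.items():
            s = max((self.score(record, m) for m in members), default=0.0)
            if s > best:
                best_id, best = cid, s
        if best_id is not None and best >= self.threshold:
            self.clusters[best_id].append(record)
            return best_id
        return self._new_cluster([record])

    def merge(self, keep, drop):
        """User correction: fold cluster `drop` into cluster `keep`."""
        self.clusters[keep].extend(self.clusters.pop(drop))
        return keep

    def split(self, cid, should_move):
        """User correction: move the records selected by `should_move` to a new cluster."""
        moved = [r for r in self.clusters[cid] if should_move(r)]
        self.clusters[cid] = [r for r in self.clusters[cid] if not should_move(r)]
        return self._new_cluster(moved)

def coauthor_overlap(r1, r2):
    """Jaccard overlap of coauthor sets (an assumed, deliberately simple score)."""
    a, b = r1["coauthors"], r2["coauthors"]
    return len(a & b) / len(a | b) if (a or b) else 0.0

store = AuthorClusters(coauthor_overlap, threshold=0.5)
a = store.add({"id": "p1", "coauthors": {"Giles"}})
b = store.add({"id": "p2", "coauthors": {"Giles"}})   # joins p1's cluster
c = store.add({"id": "p3", "coauthors": {"Ng"}})      # starts a new cluster
store.merge(a, c)                                     # user merges the two clusters
new_id = store.split(a, lambda r: r["id"] == "p3")    # user undoes it with a split
```

The "provide suggestions" item is left out here; under the same assumptions it could be a ranked list of cluster pairs whose best pairwise score falls just below the threshold.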
