
Data mining, management and visualization in large scientific corpus (Hui Wei): PowerPoint presentation


  1. Data mining, management and visualization in large scientific corpus. Hui Wei

  2. Data collection: some digital libraries do not supply APIs, so we use raw PDF documents as input.

  3. Data collection: (1) extract basic information about each paper, such as authors, title, abstract sentences, and DOI; (2) extract references; (3) extract standard keywords and their frequency from each paper.

  4. Text mining: (1) use JAPE rules to define macros that find important markers, such as the “DOI”, “year”, and “abstract” tags; (2) use the ANNIE NE Transducer and Gazetteer to look up person names, such as authors; (3) use the GATE ontology Gazetteer and JAPE rules to look up computer graphics terms in the content.

  5. Text mining

  6. Keywords ontology

  7. Data repositories: graph repository

  8. Data repositories: data is managed in 4 NoSQL repositories

  9. Data repositories: data distribution and system workflow

  10. Data visualization

  11. Topic river visualization

  12. Thanks hui.wei@beds.ac.uk
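
The data collection steps in slide 3 (finding a paper's DOI and counting standard keywords in its text) can be illustrated with a minimal Python sketch. This is a stand-in built on regular expressions, not the GATE/JAPE pipeline the slides describe; the function names, the simplified DOI pattern, and the sample keyword list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, simplified DOI pattern: "10.", 4-9 digits, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text):
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that may trail the DOI in running text.
    return match.group(0).rstrip(".,;)")


def keyword_frequencies(text, keywords):
    """Count case-insensitive whole-phrase occurrences of each keyword.

    Keywords that never occur are left out of the result.
    """
    lowered = text.lower()
    counts = Counter()
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        occurrences = len(re.findall(pattern, lowered))
        if occurrences:
            counts[keyword] = occurrences
    return counts


# Hypothetical input: plain text already extracted from a PDF.
sample = (
    "DOI: 10.1000/xyz123. Ray tracing and texture mapping. "
    "Ray tracing is common."
)
doi = extract_doi(sample)
frequencies = keyword_frequencies(
    sample, ["ray tracing", "texture mapping", "shading"]
)
```

In a pipeline like the one outlined above, the keyword list would come from the keywords ontology rather than a hard-coded list, and the input text would come from the raw PDF documents.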
