Data mining, management and visualization in large scientific corpus, HUI WEI (PowerPoint presentation)

SLIDE 1

Data mining, management and visualization in large scientific corpus

HUI WEI

SLIDE 2

Data collection

Some digital libraries did not supply APIs, so we use raw PDF documents as input.

SLIDE 3

Data collection

  • 1. Extract basic information about each paper, such as authors, title, abstract sentences, and DOI.
  • 2. Extract references.
  • 3. Extract standard keywords and their frequency from each paper.
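
The keyword-frequency step can be illustrated with a minimal Python sketch. It assumes the PDF has already been converted to plain text, and the `STANDARD_KEYWORDS` list and `keyword_frequencies` helper are hypothetical names chosen for illustration; the slides do not specify the actual vocabulary or implementation.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary; the real keyword list is not given in the slides.
STANDARD_KEYWORDS = ["ray tracing", "texture mapping", "mesh"]

def keyword_frequencies(text, keywords=STANDARD_KEYWORDS):
    """Count case-insensitive whole-phrase occurrences of each keyword in text."""
    lowered = text.lower()
    counts = Counter()
    for kw in keywords:
        # Word boundaries keep "mesh" from matching inside longer words.
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        n = len(re.findall(pattern, lowered))
        if n:
            counts[kw] = n
    return counts

sample = "Ray tracing on a triangle mesh. The mesh is refined before ray tracing."
freqs = keyword_frequencies(sample)
```

Keywords that never occur are left out of the result, so each paper's record only stores the terms it actually mentions.
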
SLIDE 4

Text mining

  • 1. Use JAPE rules to define "Macros" that find important markers, such as "DOI", "year", and "abstract" tags.
  • 2. Use the ANNIE NE Transducer and Gazetteer to look up person names, such as "author".
  • 3. Use the GATE ontology Gazetteer and JAPE rules to look up computer graphics terms in the content.
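
As a rough analogue of the marker and gazetteer lookups above, the following Python sketch uses regular expressions and a small term set. It is not GATE or JAPE code: the patterns, the `GAZETTEER` set, and the `find_markers` function are all assumptions made for illustration, and a real pipeline would rely on GATE's own annotation machinery.

```python
import re

# Illustrative patterns, loosely analogous to JAPE macros; not the author's rules.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
ABSTRACT_RE = re.compile(r"^\s*abstract\b[:.\s]*", re.IGNORECASE | re.MULTILINE)

# Hypothetical gazetteer of domain terms; the real ontology is not shown in the slides.
GAZETTEER = {"ray tracing", "shader", "texture mapping"}

def find_markers(text):
    """Locate DOI, year, and abstract markers, plus any gazetteer terms present."""
    doi = DOI_RE.search(text)
    abstract = ABSTRACT_RE.search(text)
    lowered = text.lower()
    terms = sorted(
        t for t in GAZETTEER
        if re.search(r"\b" + re.escape(t) + r"\b", lowered)
    )
    return {
        # Strip trailing punctuation that often follows a DOI in running text.
        "doi": doi.group(0).rstrip(".,;") if doi else None,
        "years": YEAR_RE.findall(text),
        # Offset of the first character after the "Abstract" tag, if found.
        "abstract_offset": abstract.end() if abstract else None,
        "terms": terms,
    }

sample = (
    "DOI: 10.1000/xyz123\n"
    "Published 2014\n"
    "Abstract: We study ray tracing with a custom shader."
)
result = find_markers(sample)
```

Returning offsets rather than extracted strings mirrors how annotation tools mark spans in the source text, leaving later stages free to decide how much of the abstract to keep.
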

SLIDE 5

Text mining

SLIDE 6

Keywords ontology

SLIDE 7

Data repositories

Graph repository

SLIDE 8

Data repositories

Data is managed in four NoSQL repositories.

SLIDE 9

Data repositories

Data distribution and system workflow

SLIDE 10

Data visualization

SLIDE 11

Topic river visualization

SLIDE 12

Thanks

hui.wei@beds.ac.uk