Presentation of OpenNLP




  1. [ RMLL 2013, Bruxelles, Thursday 11th July 2013 ] Presentation of OpenNLP. Presenter: Dr Ir Robert Viseur.

  2. What is OpenNLP? • Toolkit for the processing of natural language text. • Project of the Apache Software Foundation. • Developed in Java. • Under the Apache License, Version 2.0. • Download and documentation: http://opennlp.apache.org/ .

  3. What are the features? • Support for common NLP tasks: tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking.

  4. What is part-of-speech tagging? • Example: (figure not reproduced in this transcript). • See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html .
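Part-of-speech tagging assigns a grammatical category (noun, verb, adjective, ...) to each token. As an illustration of the output format only: the OpenNLP command-line POS tagger prints each token followed by an underscore and its tag. The sketch below reads such a line back into (token, tag) pairs. The function name `parse_tagged` is hypothetical, and the sample line is written in the style of the manual's example, with Penn Treebank tags assumed.

```python
def parse_tagged(line):
    """Split a 'token_TAG token_TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore so that tokens which themselves
        # contain '_' keep their full text.
        token, sep, tag = item.rpartition("_")
        if not sep:
            # No underscore at all: treat the item as an untagged token.
            token, tag = item, ""
        pairs.append((token, tag))
    return pairs


# Sample line (assumption: English model with Penn Treebank tags).
sample = "Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ"
tagged = parse_tagged(sample)
```

Here `tagged` starts with `("Pierre", "NNP")`, and the comma token is kept as `(",", ",")`.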

  5. What is named entity extraction? • Example: (figure not reproduced in this transcript). • See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html .
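Named entity extraction marks the spans of text that name a person, place, organization, and so on. As an illustration only: the OpenNLP command-line name finder wraps each detected entity in `<START:type> ... <END>` markers. The sketch below pulls the entities back out of such a line with a regular expression. The names `NE_PATTERN` and `extract_entities` are hypothetical, and the sample line follows the style of the manual's example.

```python
import re

# Matches "<START:type> some tokens <END>" and captures the type and the text.
NE_PATTERN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def extract_entities(line):
    """Return a list of (entity_type, entity_text) found in a marked-up line."""
    return [(m.group(1), m.group(2)) for m in NE_PATTERN.finditer(line)]


# Sample line (assumption: output of a person name finder model).
sample = "<START:person> Pierre Vinken <END> , 61 years old , will join the board ."
entities = extract_entities(sample)
```

With this sample, `entities` contains the single pair `("person", "Pierre Vinken")`.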

  6. How does it work? (1/2) • The features rely on pre-trained models. • Each pre-trained model is built for one language and one type of use. • Supported languages: da, de, en, es, nl, pt, se. • Warnings: – The functional coverage varies between languages. – French is not supported! • See http://opennlp.sourceforge.net/models-1.5/ . • Use from the command line or as a Java library. • Warning: models take time to load when using the CLI.

  7. How does it work? (2/2) • Example (English vs Spanish): (figure not reproduced in this transcript).

  8. What are the selection criteria? • Support of the product. • License. • Available languages. • Precision / Recall. • Speed of text processing.
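For the precision / recall criterion, a minimal sketch of how the two figures can be computed when comparing a tool's output with a hand-made reference list. Precision is the share of extracted items that are correct; recall is the share of expected items that were found. The function name and both sample sets are made up for illustration and are not results from the presentation.

```python
def precision_recall(predicted, expected):
    """Compute (precision, recall) for two collections of extracted items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Illustrative data only: what a tool returned vs. what an annotator expected.
predicted = {"Pierre Vinken", "Elsevier", "Nov."}
expected = {"Pierre Vinken", "Elsevier", "Rudolph Agnew", "Consolidated Gold Fields"}
p, r = precision_recall(predicted, expected)
```

In this made-up case two of the three extracted items are correct (precision 2/3) and two of the four expected items are found (recall 1/2).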

  9. Are there free (as in freedom) alternative tools? • Other lightweight tools: • Stanford Log-linear Part-Of-Speech Tagger (POST), • Stanford Named Entity Recognizer (NER), • TagEN, • Java Automatic Term Extraction toolkit. • Frameworks: • In Java: UIMA, GATE. • In other languages: NLTK (Python).

  10. Example: tag cloud creation (1/6) • Starting point: a website. • Example: www.adacore.com . • What we want (from the website content): • common tag cloud, • circular tag cloud. • Main steps: crawl, cleaning of HTML documents, extraction of named entities (persons) and of terminology (+ merge), and display (tag cloud).

  11. Example: tag cloud creation (2/6) • Cleaning: • Remove the HTML tags and keep only the useful content. • Warnings: • NLP tools are sensitive to noise in raw data. • Pay attention to the language of the document. • Use of an HTML boilerplate removal tool (HTML -> TXT). • Tool: Boilerpipe. • See http://code.google.com/p/boilerpipe/ . • Next: normalization of the text.
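To make the HTML -> TXT step concrete, here is a rough sketch using only Python's standard `html.parser` module. It is not Boilerpipe and does much less: it strips tags and drops the contents of `script` and `style` elements, whereas a boilerplate removal tool also tries to discard menus, footers and similar noise. The class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html):
    """Return the text fragments of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page used only to exercise the function.
page = ("<html><head><style>p {color: red}</style></head><body>"
        "<h1>Title</h1><script>var x = 1;</script>"
        "<p>Some <b>useful</b> content.</p></body></html>")
text = html_to_text(page)
```

For the sample page, `text` is `"Title Some useful content."`.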

  12. Example: tag cloud creation (3/6) • Named entity extraction. • By default, OpenNLP adds tags to the text. • Here: extraction of Person named entities. • Terminology extraction. • First: part-of-speech tagging (POST). • Next: identification and filtering (threshold) of: • collocations (e.g. Noun_Noun, Adjective_Noun, ...), • proper nouns (often brands or people).
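The terminology step can be sketched as follows: scan adjacent pairs of POS-tagged tokens, keep those matching the Noun_Noun or Adjective_Noun patterns, count them, and drop the pairs below a frequency threshold. This is one possible reading of the slide, not the presenter's actual code. It assumes Penn Treebank tags (`NN*` for nouns, `JJ*` for adjectives), and the function names, threshold value and sample tokens are hypothetical.

```python
from collections import Counter

# Coarse tag patterns kept as collocations: Noun_Noun and Adjective_Noun.
PATTERNS = {("NN", "NN"), ("JJ", "NN")}


def coarse(tag):
    """Collapse Penn Treebank variants: NN/NNS/NNP/NNPS -> NN, JJ/JJR/JJS -> JJ."""
    return tag[:2]


def collocations(tagged, threshold=2):
    """Count adjacent pairs matching PATTERNS; keep those seen >= threshold times."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (coarse(t1), coarse(t2)) in PATTERNS:
            counts[(w1.lower(), w2.lower())] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Made-up tagged tokens, as (token, tag) pairs.
tagged = [("open", "JJ"), ("source", "NN"), ("software", "NN"), ("uses", "VBZ"),
          ("open", "JJ"), ("source", "NN"), ("tools", "NNS")]
frequent = collocations(tagged, threshold=2)
```

With this sample only `("open", "source")` reaches the threshold; the pairs `("source", "software")` and `("source", "tools")` occur once each and are filtered out.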

  13. Example: tag cloud creation (4/6) • Process (stages recovered from the diagram): Website (Internet) -> Crawl -> Website (local) -> Raw HTML document -> Conversion to text -> Normalization -> POS tagging -> Terminology extraction / NE extraction -> Merge -> Tags -> Tag cloud (for a website).
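The last stages (merge, tags, tag cloud) could look like the sketch below: the counts from terminology extraction and from named entity extraction are added together, then each tag's count is mapped to a font size. The slides do not say how the sizes are chosen, so the linear scaling, the size bounds, the function names and the sample counts are all assumptions.

```python
from collections import Counter


def merge_tags(terms, entities):
    """Merge two {tag: count} mappings; counts of shared tags are added."""
    merged = Counter(terms)
    merged.update(entities)
    return merged


def font_sizes(tags, min_size=10, max_size=32):
    """Map each tag's count linearly onto an integer size in [min_size, max_size]."""
    if not tags:
        return {}
    lo, hi = min(tags.values()), max(tags.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {tag: min_size + (count - lo) * (max_size - min_size) // span
            for tag, count in tags.items()}


# Made-up counts for illustration.
terms = {"open source": 6, "static analysis": 2}
entities = {"Ada Lovelace": 4}
tags = merge_tags(terms, entities)
sizes = font_sizes(tags)
```

Here the most frequent tag gets size 32, the least frequent gets 10, and the one in between gets 21.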

  14. Example: tag cloud creation (5/6) • Result: common tag cloud.

  15. Example: tag cloud creation (6/6) • Result: circular tag cloud.

  16. Thanks for your attention. Any questions?

  17. Contact: Dr Ir Robert Viseur • Email (@CETIC): robert.viseur@cetic.be • Email (@UMONS): robert.viseur@umons.ac.be • Phone: 0032 (0) 479 66 08 76 • Website: www.robertviseur.be • This presentation is covered by the « CC-BY-ND » license.
