Presentation of OpenNLP - Presenter: Dr Ir Robert Viseur


SLIDE 1

[ RMLL 2013, Bruxelles – Thursday 11th July 2013 ]

Presentation of OpenNLP

Presenter: Dr Ir Robert Viseur

SLIDE 2

What is OpenNLP?

  • Toolkit for processing natural language text.
  • Project of the Apache Software Foundation.
  • Developed in Java.
  • Released under the Apache License, Version 2.0.
  • Download and documentation: http://opennlp.apache.org/.

SLIDE 3

What are the features?

  • Support for common NLP tasks:
    • tokenization,
    • sentence segmentation,
    • part-of-speech tagging,
    • named entity extraction,
    • chunking.

SLIDE 4

What is part-of-speech tagging?

  • Example: (figure not reproduced in this transcript)
  • See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
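
To make part-of-speech tagging concrete: the OpenNLP command-line POS tagger prints each token joined to its tag by an underscore, for example `Pierre_NNP` (the English models use Penn Treebank tags such as NNP for a proper noun). The sketch below only reads that output format back into (token, tag) pairs. The class `PosOutputParser` and its methods are hypothetical helpers written for this illustration; they are not part of the OpenNLP API.

```java
import java.util.ArrayList;
import java.util.List;

final class PosOutputParser {

    /** A token together with its part-of-speech tag. */
    static final class TaggedToken {
        final String token;
        final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    /** Splits a line such as "Pierre_NNP Vinken_NNP" into tagged tokens. */
    static List<TaggedToken> parse(String line) {
        List<TaggedToken> result = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            // Split on the LAST underscore so that tokens containing '_' survive.
            int cut = item.lastIndexOf('_');
            if (cut <= 0 || cut == item.length() - 1) {
                continue; // not a token_TAG pair, skip it
            }
            result.add(new TaggedToken(item.substring(0, cut), item.substring(cut + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (TaggedToken t : parse("Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ")) {
            System.out.println(t.token + "\t" + t.tag);
        }
    }
}
```

Splitting on the last underscore is a design choice of this sketch, made so that a token which itself contains an underscore keeps its full text.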

SLIDE 5

What is named entity extraction?

  • Example: (figure not reproduced in this transcript)
  • See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
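
As an illustration of what named entity extraction produces: the OpenNLP command-line name finder marks each entity directly in the text, in the form `<START:person> Pierre Vinken <END>`. The sketch below pulls the entities of one type out of such an annotated line. The class `NameMarkupReader` is a hypothetical helper for this illustration, not an OpenNLP class, and it assumes the markup is well formed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameMarkupReader {

    // Matches "<START:type> some tokens <END>" and captures the type and the text.
    private static final Pattern SPAN =
            Pattern.compile("<START:([^>]+)>\\s*(.*?)\\s*<END>");

    /** Returns the text of every entity of the given type, in document order. */
    static List<String> entities(String annotated, String type) {
        List<String> found = new ArrayList<>();
        Matcher m = SPAN.matcher(annotated);
        while (m.find()) {
            if (m.group(1).equals(type)) {
                found.add(m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<START:person> Pierre Vinken <END> , 61 years old , will join the board .";
        System.out.println(entities(line, "person"));
    }
}
```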

SLIDE 6

How does it work? (1/2)

  • The features rely on pre-trained models.
  • Each pre-trained model is built for one language and for one type of task.
  • Supported languages: da, de, en, es, nl, pt, se.
  • Warnings:
    • The functional coverage varies from one language to another.
    • French is not supported!
  • See http://opennlp.sourceforge.net/models-1.5/.
  • Can be used from the command line or as a Java library.
  • Warning: models take time to load when using the command-line interface (CLI).
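
Because each model covers one language and one task, and loading a model takes time, an application that uses the toolkit as a library would typically load each model once and reuse it. The sketch below shows that idea with a small cache keyed by language and task. Everything in it is an assumption made for illustration: `ModelCache` is a hypothetical class, and the `loader` function stands in for whatever code actually reads a model file.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ModelCache<M> {

    private final Map<String, M> loaded = new ConcurrentHashMap<>();
    private final Function<String, M> loader;

    /** @param loader hypothetical function that reads one model, e.g. from disk */
    ModelCache(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns the model for one language and one task, loading it at most once. */
    M get(String language, String task) {
        String key = language + "/" + task; // e.g. "en/pos" or "es/ner-person"
        return loaded.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        ModelCache<String> cache = new ModelCache<>(key -> {
            System.out.println("loading " + key); // stands in for a slow model load
            return "model for " + key;
        });
        cache.get("en", "pos");
        cache.get("en", "pos"); // served from the cache, no second load
        cache.get("es", "pos");
    }
}
```

With the CLI, by contrast, every invocation starts a new process, so a cache like this cannot help there.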

SLIDE 7

How does it work? (2/2)

  • Example (English vs Spanish): (figure not reproduced in this transcript)

SLIDE 8

What are the selection criteria?

  • Product support.
  • License.
  • Available languages.
  • Precision and recall.
  • Text processing speed.
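
Of these criteria, precision and recall are the ones that can be computed directly: precision is the share of extracted items that are correct, and recall is the share of expected items that were found. The sketch below computes both from two sets of strings. The class name and the sample names in `main` are made up for illustration; a real evaluation would compare a tool's output against a manually annotated reference.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecall {

    /** Share of the extracted items that are correct (0 when nothing was extracted). */
    static double precision(Set<String> extracted, Set<String> expected) {
        return extracted.isEmpty() ? 0.0 : (double) common(extracted, expected) / extracted.size();
    }

    /** Share of the expected items that were found (0 when nothing was expected). */
    static double recall(Set<String> extracted, Set<String> expected) {
        return expected.isEmpty() ? 0.0 : (double) common(extracted, expected) / expected.size();
    }

    /** Number of items present in both sets. */
    private static int common(Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.retainAll(b);
        return both.size();
    }

    public static void main(String[] args) {
        // Made-up data: what a tool extracted vs. what a human annotator expected.
        Set<String> extracted = new HashSet<>(Arrays.asList("Alice Martin", "Bob Stone", "Java"));
        Set<String> expected = new HashSet<>(
                Arrays.asList("Alice Martin", "Bob Stone", "Carol Diaz", "Dan Wu"));
        System.out.println("precision = " + precision(extracted, expected));
        System.out.println("recall    = " + recall(extracted, expected));
    }
}
```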

SLIDE 9

Are there free (as in freedom) alternative tools?

  • Other lightweight tools:
    • Stanford Log-linear Part-Of-Speech Tagger (POST),
    • Stanford Named Entity Recognizer (NER),
    • TagEN,
    • Java Automatic Term Extraction toolkit.
  • Frameworks:
    • In Java: UIMA, GATE.
    • In other languages: NLTK (Python).

SLIDE 10

Example: tag cloud creation (1/6)

  • Starting point: a website.
  • Example: www.adacore.com.
  • What we want (from the website content):
    • common tag cloud,
    • circular tag cloud.
  • Main steps: crawl, cleaning of the HTML documents, extraction of named entities (persons) and of terminology (plus merge), and display (tag cloud).

SLIDE 11

Example: tag cloud creation (2/6)

  • Cleaning:
    • Remove the HTML tags and keep only the useful content.
  • Warnings:
    • NLP tools are sensitive to noise in raw data.
    • Pay attention to the language of the document.
  • Use of an HTML boilerplate removal tool (HTML -> TXT).
    • Tool: Boilerpipe.
    • See http://code.google.com/p/boilerpipe/.
  • Next: normalization of the text.
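
The cleaning and normalization steps can be sketched as follows. This is deliberately not Boilerpipe's approach: Boilerpipe detects and discards boilerplate blocks such as navigation, whereas the naive code below only strips markup with regular expressions and then collapses whitespace. The class `HtmlToText` is a hypothetical helper, and regex-based stripping is an assumption that only holds for simple, well-formed pages.

```java
final class HtmlToText {

    /** Very rough HTML-to-text conversion: drops scripts, styles and tags. */
    static String strip(String html) {
        String text = html
                // Drop blocks whose content is never visible text.
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                // Replace every remaining tag with a space.
                .replaceAll("(?s)<[^>]+>", " ")
                // Decode two common entities only; a real tool handles them all.
                .replace("&nbsp;", " ")
                .replace("&amp;", "&");
        return normalize(text);
    }

    /** Normalization step: collapse runs of whitespace and trim both ends. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>\n<p>Some   <b>useful</b> content.</p></body></html>";
        System.out.println(strip(html));
    }
}
```

Keeping `normalize` as a separate method mirrors the slide, where normalization is its own step after the HTML -> TXT conversion.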

SLIDE 12

Example: tag cloud creation (3/6)

  • Named entity extraction.
    • Standard behavior in OpenNLP: it adds tags to the text.
    • Here: extraction of Person named entities (NE).
  • Terminology extraction.
    • First: part-of-speech tagging (POST).
    • Next: identification and filtering (threshold) of:
      • collocations (e.g. Noun_Noun, Adjective_Noun, ...),
      • proper nouns (often brands or people).
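
A minimal sketch of the terminology step, assuming the text has already been POS-tagged: count adjacent noun-noun and adjective-noun pairs, then keep only those that reach a frequency threshold. The class `CollocationExtractor`, the use of Penn Treebank tag prefixes (`NN`, `JJ`) and the lowercasing of pairs are assumptions of this illustration, not a description of the presenter's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class CollocationExtractor {

    /**
     * Counts adjacent noun-noun and adjective-noun pairs and keeps the
     * pairs seen at least minCount times.
     *
     * @param tokens the tokens of the text, in order
     * @param tags   the POS tag of each token (same length as tokens)
     */
    static Map<String, Integer> extract(String[] tokens, String[] tags, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            boolean firstOk = tags[i].startsWith("NN") || tags[i].startsWith("JJ");
            boolean secondOk = tags[i + 1].startsWith("NN");
            if (firstOk && secondOk) {
                String pair = (tokens[i] + " " + tokens[i + 1]).toLowerCase(Locale.ROOT);
                counts.merge(pair, 1, Integer::sum);
            }
        }
        // Threshold filter: drop pairs seen fewer than minCount times.
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"Open", "source", "software", "is", "open", "source"};
        String[] tags = {"JJ", "NN", "NN", "VBZ", "JJ", "NN"};
        System.out.println(extract(tokens, tags, 2));
    }
}
```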

SLIDE 13

Example: tag cloud creation (4/6)

  • Process (diagram, reconstructed from its labels):
    • Website (Internet) -> Crawl -> Website (local).
    • Raw HTML document -> Conversion to text -> Normalization.
    • Normalized text -> POS tagging -> Terminology extraction.
    • Normalized text -> NE extraction.
    • Terminology + named entities -> Merge -> Tags -> Tag cloud (for a website).
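
The last two stages of the process, merging the extracted tags and turning them into a tag cloud, can be sketched as follows. The merge sums the counts of tags found by both extractors, and the display step maps each count linearly onto a font size. The class `TagCloud`, the linear scaling and the pixel bounds are assumptions chosen for illustration; the slides do not say how the original clouds were weighted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TagCloud {

    /** Merges two tag -> count maps, summing the counts of shared tags. */
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new LinkedHashMap<>(a);
        b.forEach((tag, count) -> merged.merge(tag, count, Integer::sum));
        return merged;
    }

    /** Maps each count linearly onto a font size between minPx and maxPx. */
    static Map<String, Integer> fontSizes(Map<String, Integer> counts, int minPx, int maxPx) {
        int lo = Integer.MAX_VALUE;
        int hi = Integer.MIN_VALUE;
        for (int c : counts.values()) {
            lo = Math.min(lo, c);
            hi = Math.max(hi, c);
        }
        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // When all counts are equal, avoid dividing by zero and use minPx.
            double ratio = hi == lo ? 0.0 : (e.getValue() - lo) / (double) (hi - lo);
            sizes.put(e.getKey(), (int) Math.round(minPx + ratio * (maxPx - minPx)));
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for the NE and terminology extraction results.
        Map<String, Integer> people = new LinkedHashMap<>();
        people.put("alice martin", 3);
        Map<String, Integer> terms = new LinkedHashMap<>();
        terms.put("ada", 9);
        terms.put("alice martin", 1);
        terms.put("compiler", 5);
        System.out.println(fontSizes(merge(people, terms), 10, 30));
    }
}
```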

SLIDE 14

Example: tag cloud creation (5/6)

  • Result: common tag cloud.

SLIDE 15

Example: tag cloud creation (6/6)

  • Result: circular tag cloud.

SLIDE 16

Thank you for your attention. Any questions?

SLIDE 17

Contact

Dr Ir Robert Viseur
Email (@CETIC): robert.viseur@cetic.be
Email (@UMONS): robert.viseur@umons.ac.be
Phone: 0032 (0) 479 66 08 76
Website: www.robertviseur.be

This presentation is covered by the « CC-BY-ND » license.