Master EmLex CiTIUS Design and use of linguistic tools

SLIDE 1

Introduction Linguistic Analysis Information Extraction NLP Applications

Design and use of linguistic tools II
Building linguistic resources with NLP tools
Pablo Gamallo

CiTIUS Universidade de Santiago de Compostela

Master EmLex

CiTIUS Design and use of linguistic tools

SLIDE 2

Table of Contents

1. Introduction
2. Linguistic Analysis
3. Information Extraction
4. NLP Applications

SLIDE 4

Objectives

To use and apply NLP tools on text corpora:

  • tokenization and lemmatization
  • PoS tagging
  • syntactic analysis
  • multi-word extraction
  • named entity recognition
  • sentiment analysis
  • authorship attribution

SLIDE 5

Tools for Natural Language Processing (NLP)

Analysis
  • tokenization
  • lemmatization
  • morpho-syntactic analysis (PoS-taggers)
  • syntactic analysis (dependency parsers)

Extraction
  • terms (multi-words)
  • entities
  • semantic relations
  • concepts
  • opinions, polarity

Applications
  • summarization
  • spell/grammar checking
  • authorship attribution
  • language distance (translation)

SLIDE 8

LinguaKit

Web demo: https://linguakit.com
Open source code: https://github.com/citiususc/Linguakit

SLIDE 10

Tokenization

cat text.txt | PATH/Linguakit-master/linguakit tok es

SLIDE 11

Counting and sorting

cat text.txt | PATH/Linguakit-master/linguakit tok es | wc
cat text.txt | PATH/Linguakit-master/linguakit tok es -sort
cat text.txt | PATH/Linguakit-master/linguakit tok es | sort | uniq -c | sort -nr
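
The counting idiom used here can be tried without Linguakit installed. The sketch below is only an illustration: it assumes the tokenizer prints one token per line (which is what the sort | uniq -c pipeline relies on) and replaces the tokenizer with a hypothetical sample_tokens function.

```shell
#!/bin/sh
# Rank the lines read from stdin by frequency, most frequent first.
# sort groups identical tokens, uniq -c counts each group,
# sort -nr orders the counts numerically in descending order.
freq_list() {
    sort | uniq -c | sort -nr
}

# Hypothetical stand-in for the tokenizer output (one token per line),
# so the sketch runs on its own.
sample_tokens() {
    printf '%s\n' el gato come y el perro come y el pez nada
}

sample_tokens | wc -l      # total number of tokens
sample_tokens | freq_list  # frequency list
```

In real use, sample_tokens would be replaced by the tokenizer command shown on this slide.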

SLIDE 12

PoS tagging and Lemmatization

cat text.txt | PATH/Linguakit-master/linguakit tagger es

SLIDE 13

Counting PoS tags and lemmas

Count common nouns:

cat text.txt | PATH/Linguakit-master/linguakit tagger es | cut -d " " -f 3 | grep "NC" | wc

Count lemma “comer”:

cat text.txt | PATH/Linguakit-master/linguakit tagger es | cut -d " " -f 2 | grep '^comer$' | wc
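
A minimal sketch of the same two counts, runnable without Linguakit. It assumes the tagger prints one "token lemma tag" triple per line, separated by single spaces, as the cut commands above imply. The sample lines and tag values are illustrative, not real tagger output. The sketch uses grep -c, which prints only the number of matching lines, and anchors the tag pattern with ^ so that NC is matched only at the start of the tag.

```shell
#!/bin/sh
# Hypothetical stand-in for the tagger output: "token lemma tag" per line.
sample_tagged() {
    cat <<'EOF'
El el DA0MS0
chico chico NCMS000
come comer VMIP3S0
manzanas manzana NCFP000
y y CC
comen comer VMIP3P0
peras pera NCFP000
EOF
}

# Count tokens whose tag (field 3) starts with NC (common nouns).
count_common_nouns() {
    cut -d ' ' -f 3 | grep -c '^NC'
}

# Count tokens whose lemma (field 2) is exactly the word given as $1.
count_lemma() {
    cut -d ' ' -f 2 | grep -c -x -- "$1"
}

sample_tagged | count_common_nouns
sample_tagged | count_lemma comer
```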

SLIDE 14

Sorting PoS tags and lemmas

Sorting lemmas by frequency:

cat text.txt | PATH/Linguakit-master/linguakit tagger es | cut -d " " -f 2 | sort | uniq -c | sort -nr

Sorting PoS tags by frequency:

cat text.txt | PATH/Linguakit-master/linguakit tagger es | cut -d " " -f 3 | cut -c1-2 | sort | uniq -c | sort -nr
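
The cut -c1-2 step keeps only the first two characters of each tag, so fine-grained tags collapse into coarse categories before counting. A small sketch of that step alone, with made-up tag values standing in for the third column of the tagger output:

```shell
#!/bin/sh
# Illustrative tags only; real tagger output may use different values.
sample_tags() {
    printf '%s\n' DA0MS0 NCMS000 VMIP3S0 NCFP000 CC VMIP3P0 NCFP000
}

# Truncate each tag to its first two characters, then rank the
# resulting categories by frequency.
coarse_tag_freq() {
    cut -c1-2 | sort | uniq -c | sort -nr
}

sample_tags | coarse_tag_freq
```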

SLIDE 15

Dependency Parsing

cat text.txt | PATH/Linguakit-master/linguakit dep es

SLIDE 16

Dependency Parsing: Argument identification

Select the direct objects of the verb “comer”

cat text.txt | PATH/Linguakit-master/linguakit dep es | grep Dobj | grep "comer_VERB" | awk -F ";" '{print $3}' | awk -F "_" '{print $1}'
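
To see what each stage of this pipeline does, it can be run on a few hand-written lines. The line layout below, "(relation;head;dependent)" with words written as lemma_TAG_position, is an assumption inferred from the field separators used in the pipeline. It is not copied from real parser output.

```shell
#!/bin/sh
# Hypothetical dependency triples standing in for the parser output.
sample_deps() {
    cat <<'EOF'
(Subj;comer_VERB_2;chico_NOUN_1)
(Dobj;comer_VERB_2;manzana_NOUN_3)
(Dobj;beber_VERB_5;agua_NOUN_6)
(Dobj;comer_VERB_8;pera_NOUN_9)
EOF
}

# Print the lemma of every direct object of the verb lemma given as $1:
# keep Dobj lines, keep those headed by the verb, take the third
# ";"-separated field, then the part before the first "_".
direct_objects() {
    grep 'Dobj' | grep "${1}_VERB" |
        awk -F ';' '{print $3}' | awk -F '_' '{print $1}'
}

sample_deps | direct_objects comer
```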

SLIDE 17

Named Entity Recognition-Classification (NER-NEC)

cat text.txt | PATH/Linguakit-master/linguakit tagger es -ner
cat text.txt | PATH/Linguakit-master/linguakit tagger es -nec

SLIDE 18

NERC: Selecting Locations and Organizations

Select locations:

cat text.txt | PATH/Linguakit-master/linguakit tagger es -nec | grep NP00G | cut -d " " -f 1 | sort | uniq -c | sort -nr

Select organizations:

cat text.txt | PATH/Linguakit-master/linguakit tagger es -nec | grep NP00O | cut -d " " -f 1 | sort | uniq -c | sort -nr
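
Both selections differ only in the tag prefix, so they can share one helper. The following is a sketch under assumptions: the sample lines imitate a "token lemma tag" layout, and the entity names and full tag values are made up for illustration. Only the prefixes NP00G and NP00O come from the commands above.

```shell
#!/bin/sh
# Hypothetical stand-in for the output of the tagger with -nec.
sample_nec() {
    cat <<'EOF'
Madrid madrid NP00G00
y y CC
Lugo lugo NP00G00
Inditex inditex NP00O00
Madrid madrid NP00G00
EOF
}

# Rank the entities (field 1) whose line contains the tag prefix $1.
entities_by_tag() {
    grep "$1" | cut -d ' ' -f 1 | sort | uniq -c | sort -nr
}

for prefix in NP00G NP00O; do
    echo "== $prefix"
    sample_nec | entities_by_tag "$prefix"
done
```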

SLIDE 20

Multi-Word Extraction

cat text.txt | PATH/Linguakit-master/linguakit mwe es

SLIDE 21

Multi-Word Extraction: Class Practice

Look for texts in a specific field (e.g. medicine, archeology, ...) and use the multi-word extractor to build a terminology.

You can use a PDF to TXT converter: pdftotext text.pdf text.txt

SLIDE 22

Opinion Mining / Sentiment Analysis

cat text.txt | PATH/Linguakit-master/linguakit sent es

SLIDE 23

Opinion Mining: Class Practice

Open the polarity lexicon and add new terms.

You can edit the Spanish lexicon as follows:
gedit PATH/Linguakit-master/sentiment/es/lex_es

SLIDE 24

Semantic Relation Extraction

cat text.txt | PATH/Linguakit-master/linguakit rel es

Open Information Extraction approach, described in:

Gamallo, P. and Garcia, M. (2015). Multilingual Open Information Extraction, Lecture Notes in Computer Science, 9273, Berlin: Springer-Verlag: 711-722. ISSN: 0302-9743.

SLIDE 26

Summarization

cat text.txt | PATH/Linguakit-master/linguakit sum es -p 5

SLIDE 27

Grammar Checking: Avalíngua

echo "Vou a aportar a documentasción" | PATH/Linguakit-master/linguakit aval gl -xml

Online demos for Spanish:
http://fegalaz.usc.es/nlpapi
http://fegalaz.usc.es/avalingua

SLIDE 28

Authorship Attribution

Source code in:

https://github.com/gamallo/Autoria

Requirements:

cpan Math::KullbackLeibler::Discrete
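
The required module computes Kullback-Leibler divergence between discrete distributions, which suggests that texts are compared through such a measure. The sketch below only illustrates the formula D(P||Q) = sum of p * ln(p / q). It is not the Autoria code, and the input layout ("item p q" per line) and the function name are assumptions made for the example.

```shell
#!/bin/sh
# KL divergence D(P||Q) over lines of "item p q".
# Assumes every q is non-zero; items with p = 0 contribute nothing.
# awk's log() is the natural logarithm.
kl_divergence() {
    awk '$2 > 0 { d += $2 * log($2 / $3) } END { printf "%.4f\n", d }'
}

# Two toy distributions over the same two items.
printf '%s\n' 'de 0.5 0.25' 'que 0.5 0.75' | kl_divergence
```

A lower value means the two distributions are closer, and identical distributions give zero.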

SLIDE 29

Authorship Attribution: Class Practice

1. Select one book to be identified, for instance, “Fortunata y Jacinta”, by Galdós.
2. Select three other works by Galdós.
3. Select three works by each of two other authors, for instance, Borges and Unamuno.
4. Create four files in folder ./corpus/all:
   FortunataYJacinta.txt (to be compared against the rest of the files)
   Galdos.txt (merging the other 3 works by Galdós)
   Borges.txt (merging the selected 3 works by Borges)
   Unamuno.txt (merging the selected 3 works by Unamuno)
5. Run the script: sh run.sh FortunataYJacinta.txt
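
The merging in step 4 can be done with cat. The sketch below is one possible way, not part of the Autoria scripts: the works/ folder and the galdos1.txt to galdos3.txt file names are hypothetical placeholders, and dummy texts are created in a temporary directory so that the example runs on its own.

```shell
#!/bin/sh
# Concatenate several files into one.
# $1: output file; remaining arguments: files to merge, in order.
merge_works() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Temporary workspace with dummy texts (placeholder names).
workdir=$(mktemp -d)
mkdir -p "$workdir/works" "$workdir/corpus/all"
for f in galdos1 galdos2 galdos3; do
    echo "text of $f" > "$workdir/works/$f.txt"
done

merge_works "$workdir/corpus/all/Galdos.txt" \
    "$workdir/works/galdos1.txt" \
    "$workdir/works/galdos2.txt" \
    "$workdir/works/galdos3.txt"

wc -l "$workdir/corpus/all/Galdos.txt"
```

The same call, with other file names, would produce Borges.txt and Unamuno.txt.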
