Text Mining Workshop LUCDH Studium Digitale A. Brandsen 25-09-2020



SLIDE 1

Discover the world at Leiden University

Text Mining Workshop

LUCDH Studium Digitale

A. Brandsen

25-09-2020

SLIDE 2

Hello!

I am Alex Brandsen, PhD candidate at the Faculty of Archaeology / DSRP

@alex_brandsen


SLIDE 4

What is Text Mining?

Deriving information from semi- or unstructured text collections

  • Search engines
  • Spam filters
  • Translation
  • Turnitin
  • Customer service

SLIDE 5

Today’s Topics

  • Challenges of text data
  • Pre-processing of text
  • Tasks involving text data

SLIDE 6

After this lecture...

  • You can list the challenges of processing text data
  • You can motivate and describe the most common text pre-processing steps:
      • tokenisation
      • lowercasing
      • stopword removal
  • You can conceptually explain text classification as a supervised learning task

SLIDE 7

Introduction: Challenges

SLIDE 8

Challenges of Processing Text Data

Can you think of any challenges / problems specific to text data?

SLIDE 9

Textual Data is Unstructured

Or at best semi-structured:

SLIDE 10

Textual Data can be Multi-Lingual

  • Dat is best wel een big deal
  • Oh my god, die jurk is fantastisch
  • Me, when lezers meteen aan de slag gaan met iets dat ze op mijn blog hebben gespot: 😮
  • DUO: “Hoi Valentina, we would advise you to delete your tweet with your burgerservicenummer.”
  • Waarschijnlijk is dat seksistische beeld gwn zo ingrained into our society

SLIDE 11

Textual Data is Noisy

  • Optical Character Recognition (OCR): printed text to digital text
  • Special characters
  • Spelling errors

SLIDE 12

Preprocessing

SLIDE 13

Preprocessing: from Raw Text to Features

Documents → GATHER → Raw text → CLEAN & NORMALISE → Clean text → TOKENISE → Tokens → ANNOTATION → Linguistically annotated text → FEATURE CREATION & SELECTION → Features

SLIDE 14

Tokenisation

Cutting a collection of characters (sentence) into tokens (words)

Mr._O'Neill_thinks_that_Germany's_capital_isn’t_busy.
Mr . O'Neill thinks that Germany 's capital is not busy .

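The tokenisation step can be sketched with a few regular expressions. This is a minimal illustration under simplifying assumptions, not the tutorial's own code: the function name `tokenise` and its three rules (expand "n't", split off "'s", separate punctuation) are hypothetical and only cover the example sentence. Real tokenisers handle many more cases.

```python
import re

# One token is: a possessive 's, a word (optionally with inner apostrophes,
# as in O'Neill), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"'s\b|\w+(?:'\w+)*|[^\w\s]")


def tokenise(text):
    """Split a sentence into tokens using three illustrative rules."""
    text = text.replace("’", "'")          # normalise curly apostrophes
    text = re.sub(r"n't\b", " not", text)  # isn't -> is not
    text = re.sub(r"'s\b", " 's", text)    # Germany's -> Germany 's
    return TOKEN_PATTERN.findall(text)


sentence = "Mr. O'Neill thinks that Germany's capital isn’t busy."
print(" ".join(tokenise(sentence)))
```

The "n't" rule is deliberately naive: it would turn "can't" into "ca not", which is one reason real projects usually rely on an established tokeniser.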

SLIDE 15

Lowercasing

Sometimes useful, depends on application

Mr . O'Neill thinks that Germany 's capital is not busy .
mr . o'neill thinks that germany 's capital is not busy .

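Lowercasing a list of tokens is a one-liner in Python. The sketch below (the helper name `lowercase` is an assumption) also shows why it "depends on application": words that differ only in case become indistinguishable.

```python
def lowercase(tokens):
    """Return a lowercased copy of a list of tokens."""
    return [token.lower() for token in tokens]


tokens = ["Mr", ".", "O'Neill", "thinks", "that", "Germany", "'s",
          "capital", "is", "not", "busy", "."]
print(" ".join(lowercase(tokens)))

# Caveat: the country "US" and the pronoun "us" collapse into one token.
print(lowercase(["US"]) == lowercase(["us"]))  # True
```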

SLIDE 16

Removing Stopwords

Common words aren’t “useful” in analysis:

mr . o'neill thinks that germany 's capital is not busy .
mr . o'neill thinks germany capital busy .

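Stopword removal is a filter against a word list. In this sketch the list `STOPWORDS` is a small hypothetical one chosen to reproduce the example; real lists (for instance those shipped with NLP libraries) are much longer and differ per language.

```python
# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"that", "'s", "is", "not", "the", "a", "an", "of", "and"}


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = ["mr", ".", "o'neill", "thinks", "that", "germany", "'s",
          "capital", "is", "not", "busy", "."]
print(" ".join(remove_stopwords(tokens)))
```

Note that the example also drops "not", so "is not busy" and "is busy" end up identical. For tasks such as sentiment analysis it may be worth keeping negations.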

SLIDE 17

Removing inflections

Stemming: cut off the end

  • Apples → apple
  • Studied → studi

Lemmatization: use morphological analysis

  • Apples → apple
  • studied → study

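The difference between the two approaches can be made concrete with toy versions of each. Both functions below are assumptions for illustration only: `toy_stem` strips three hard-coded suffixes (it is not the Porter stemmer or any other published algorithm), and `toy_lemmatise` replaces real morphological analysis with a tiny hand-written lookup table.

```python
# Hypothetical suffix list; real stemmers use many ordered rules.
SUFFIXES = ["ing", "ed", "s"]

# Hypothetical lookup table standing in for morphological analysis.
LEMMAS = {"apples": "apple", "studied": "study"}


def toy_stem(word):
    """Cut off the first matching suffix, leaving at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatise(word):
    """Look the word up; fall back to the lowercased word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for word in ["Apples", "Studied"]:
    print(word, "stem:", toy_stem(word), "lemma:", toy_lemmatise(word))
```

The stem "studi" is not a dictionary word, whereas the lemma "study" is. That is the practical trade-off: stemming is cheap and crude, lemmatization needs linguistic knowledge.
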
SLIDE 18

Tasks Involving Text Data

  • Text classification
      • input is a collection of texts, output is label(s) per text
      • spam categorisation
  • Sequence labelling
      • input is text, output is a sequence of labels
      • named entity tagging
  • Sequence-to-sequence learning
      • input is text, output is text
      • machine translation

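The three task types differ mainly in the shape of their input and output. The snippet below only illustrates those shapes with made-up data; the example texts, the tag names and the translation pair are all assumptions, and no model is involved.

```python
# Text classification: one label per text.
texts = ["request urgent interest urgent", "symposium defense june"]
labels = ["Spam", "Ham"]
assert len(texts) == len(labels)

# Sequence labelling: one label per token ("O" marks "not an entity").
tokens = ["Alex", "works", "in", "Leiden"]
tags = ["PERSON", "O", "O", "LOCATION"]
assert len(tokens) == len(tags)

# Sequence-to-sequence: text in, text out; lengths may differ.
source = "Dat is een grote uitdaging"
target = "That is a big challenge"

for token, tag in zip(tokens, tags):
    print(token, tag)
```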

SLIDE 19

(Un)supervised Machine Learning

Supervised:

  • training data is labelled
  • output is labels
  • “spam or not?”

Unsupervised:

  • training data has no labels
  • output is clusters
  • “these two documents look alike”
  • “this group of texts has the same topic”

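"These two documents look alike" needs a similarity measure that works without labels. One simple option, used here purely as an illustration, is the Jaccard similarity: the number of shared words divided by the number of distinct words in both documents. The documents below are made up, and a clustering algorithm would still be needed on top of such a measure to produce actual clusters.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Shared distinct words divided by all distinct words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up, already pre-processed documents without any labels.
doc_a = "symposium defense june".split()
doc_b = "notas symposium deadline june".split()
doc_c = "request urgent interest urgent".split()

print(jaccard_similarity(doc_a, doc_b))  # 2 shared words out of 5
print(jaccard_similarity(doc_a, doc_c))  # no shared words
```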

SLIDE 20

Classification

SLIDE 21

Classification

Texts → labels

  • Emails → spam or not
  • News item → news category
  • New article → relevant to you or not (ResearchGate / Academia)
  • Review → positive or negative

SLIDE 22

Classification with Machine Learning

Doc id  Content                                      Class
1       request urgent interest urgent               Spam
2       assistance low interest deposit              Spam
3       symposium defense june                       Ham
4       notas symposium deadline june                Ham
5       registration assistance symposium deadline   ?

Docs 1 to 4: examples for machine learning
Doc 5: new unlabeled email, predict class

The machine learning algorithm calculates the probability of an email being spam based on how ‘spammy’ the words in the email are (spamminess is calculated by looking at the distribution of words over the spam and ham categories).

Doc 5, word by word: registration → Neutral, assistance → Spam, symposium → Ham, deadline → Ham. Prediction: Ham!

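The slide does not name a specific algorithm. One common choice that matches the description is multinomial Naive Bayes with add-one smoothing, sketched below on the five example documents. The algorithm choice, all function names and the decision to ignore unseen words are assumptions made for this illustration.

```python
import math
from collections import Counter

TRAINING = [
    ("request urgent interest urgent", "Spam"),
    ("assistance low interest deposit", "Spam"),
    ("symposium defense june", "Ham"),
    ("notas symposium deadline june", "Ham"),
]


def train(examples):
    """Count words per class, documents per class, and the vocabulary."""
    word_counts, doc_counts, vocabulary = {}, Counter(), set()
    for text, label in examples:
        words = text.split()
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def word_likelihood(word, label, model):
    """P(word | class) with add-one (Laplace) smoothing."""
    word_counts, _, vocabulary = model
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocabulary))


def word_leaning(word, model):
    """Which class is a single word more typical of?"""
    if word not in model[2]:
        return "Neutral"  # never seen in training
    spam = word_likelihood(word, "Spam", model)
    ham = word_likelihood(word, "Ham", model)
    return "Spam" if spam > ham else "Ham"


def predict(text, model):
    """Return the class with the highest log-probability score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocabulary:  # unseen words are skipped
                score += math.log(word_likelihood(word, label, model))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
new_email = "registration assistance symposium deadline"
print([word_leaning(word, model) for word in new_email.split()])
print(predict(new_email, model))
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer texts.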

SLIDE 23

Precision and Recall

  • Precision: How many of the positive predictions were correct?
      • How many predicted spam emails are actually spam?
  • Recall: How many of the positive documents did you retrieve?
      • How many of the spam emails did you catch?
  • F1: Harmonic mean of precision and recall
      • How well does my model perform in general?

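The three measures follow directly from counting true positives (TP), false positives (FP) and false negatives (FN): precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · precision · recall / (precision + recall). A minimal sketch with made-up labels (the function name and the example data are assumptions):

```python
def precision_recall_f1(true_labels, predicted_labels, positive="spam"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold labels and predictions for eight emails.
true_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here 2 of the 3 predicted spam emails are really spam (precision 2/3), and 2 of the 4 real spam emails were caught (recall 1/2).
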
SLIDE 24

Sentiment Analysis Example

Let’s look at some statements!

  • Unicorns are awesome
  • Finding unicorns is difficult

SLIDE 25

Sentiment Analysis Example

  • Unicorns are awesome
  • Finding unicorns is difficult

Tokenize!

SLIDE 26

Sentiment Analysis Example

  • unicorns are awesome
  • finding unicorns is difficult

Lower case!

SLIDE 27

Sentiment Analysis Example

  • unicorns are awesome
  • finding unicorns is difficult

Remove stopwords!

SLIDE 28

Sentiment Analysis Example

  • unicorns awesome
  • finding unicorns difficult

Add structure! (Bag of words approach)

statement                    unicorns  finding  awesome  difficult  sentiment
unicorns awesome             1         0        1        0
finding unicorns difficult   1         1        0        1

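In a bag-of-words representation each statement becomes a row of word counts, one column per vocabulary word; word order is discarded. A minimal sketch for the two example statements (the function name is an assumption, and the vocabulary is passed in explicitly to fix the column order):

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


statements = [
    ["unicorns", "awesome"],
    ["finding", "unicorns", "difficult"],
]
vocabulary = ["unicorns", "finding", "awesome", "difficult"]

vectors = [bag_of_words(tokens, vocabulary) for tokens in statements]
for tokens, vector in zip(statements, vectors):
    print(" ".join(tokens), vector)
```

These count vectors, together with a sentiment label per statement, are what a classifier would be trained on.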

SLIDE 29

After this lecture...

  • You can list the challenges of processing text data
  • You can motivate and describe the most common text pre-processing steps:
      • tokenisation
      • lowercasing
      • stopword removal
  • You can conceptually explain text classification as a supervised learning task

SLIDE 30

Tutorial

SLIDE 31

What will we be doing?

1. Get some movie reviews from IMDB
2. Clean the data
3. Look at the most frequent words
4. Sentiment analysis → :) or :(

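Steps 2 and 3 can be sketched in a few lines with the standard library. This is not the tutorial's own code: the reviews are made up, and the function name, the regular expression and the short stopword list are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list.
STOPWORDS = {"the", "a", "and", "was", "is", "it", "this", "i", "of", "but"}


def most_frequent_words(reviews, n=3):
    """Lowercase, tokenise, drop stopwords, and count the remaining words."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# Made-up stand-ins for real movie reviews.
reviews = [
    "The acting was great and the story was great.",
    "Great story, but the ending was boring.",
    "Boring. I fell asleep.",
]

top_words = most_frequent_words(reviews)
print(top_words)
```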

SLIDE 32

Most frequent words

SLIDE 33

Model performance

SLIDE 34

Most important words for classifier

☹ ☺

SLIDE 35

Get started!

Go to alexbrandsen.nl/tmtutorial/tutorial.pdf (or download from Kaltura)

If you already have programming experience, you can also do these other tutorials:

  • https://github.com/mchesterkadwell/bughunt-analysis
  • https://github.com/alexbrandsen/Text-Analysis-for-Humanities-research

(In both tutorials you can click the Binder button to start a Python notebook in your browser, no installation required.)