Discover the world at Leiden University
Text Mining Workshop
LUCDH Studium Digitale
- A. Brandsen
25-09-2020
Hello! I am Alex Brandsen, PhD candidate at the Faculty of Archaeology / DSRP (@alex_brandsen)
Deriving information from semi- or unstructured text collections
processing steps:
learning task
Can you think of any challenges / problems specific to text data?
Or at best semi-structured:
gespot ("spotted"): 😮
burgerservicenummer (citizen service number).”
Documents
-> GATHER -> Raw text
-> CLEAN & NORMALISE -> Clean text
-> TOKENISE -> Tokens
-> ANNOTATION -> Linguistically annotated text
-> FEATURE CREATION & SELECTION -> Features
Cutting a collection of characters (sentence) into tokens (words):
Mr._O'Neill_thinks_that_Germany's_capital_isn’t_busy.
Mr . O'Neill thinks that Germany 's capital is not busy .
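The tokenisation example above can be reproduced with a few regular expressions. This is a minimal sketch, not the tokeniser used in the workshop: the function name `tokenize` and the three rules (normalise apostrophes, expand "n't", split off "'s") are assumptions chosen to match this one sentence, and they will mishandle other contractions such as "can't".

```python
import re

def tokenize(sentence):
    """Toy rule-based tokeniser (illustrative assumption, not a library API)."""
    text = sentence.replace("’", "'")          # normalise curly apostrophes
    text = re.sub(r"n't\b", " not", text)      # isn't -> is not
    text = re.sub(r"'s\b", " 's", text)        # Germany's -> Germany 's
    # Match "'s", words with internal apostrophes (O'Neill), or single punctuation marks
    return re.findall(r"'s|\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("Mr. O'Neill thinks that Germany's capital isn’t busy.")
print(" ".join(tokens))
```

Real tokenisers handle many more cases (abbreviations, URLs, emoji), so for actual research data an established library is usually the safer choice.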
Lowercasing is sometimes useful, depending on the application:
Mr . O'Neill thinks that Germany 's capital is not busy .
mr . o'neill thinks that germany 's capital is not busy .
Common words aren’t “useful” in analysis:
mr . o'neill thinks that germany 's capital is not busy .
mr . o'neill thinks germany capital busy .
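Lowercasing and removing common words can be sketched in a few lines of Python. The stopword set below is a hand-picked assumption containing only the words dropped in the example; real stopword lists are much longer.

```python
# Tokens from the example sentence
tokens = ["Mr", ".", "O'Neill", "thinks", "that", "Germany", "'s",
          "capital", "is", "not", "busy", "."]

# Hypothetical mini stopword list, just enough for this example
STOPWORDS = {"that", "'s", "is", "not"}

lowered = [token.lower() for token in tokens]
filtered = [token for token in lowered if token not in STOPWORDS]

print(" ".join(lowered))
print(" ".join(filtered))
```

Note that dropping "not" changes the meaning of the sentence, which is one reason these steps depend on the application.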
Stemming: cut off the end of the word
Lemmatization: use morphological analysis
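The difference between the two can be illustrated with a toy sketch. Both functions are assumptions for illustration only: `naive_stem` strips a few English suffixes, and `toy_lemmatize` uses a small hand-written lookup table as a stand-in for real morphological analysis.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a morphological analyser
LEMMA_TABLE = {"thinks": "think", "studies": "study", "was": "be", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["thinks", "studies", "was", "mice"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", toy_lemmatize(word))
```

The stem of "studies" is the non-word "studi", while the lemma is "study"; irregular forms such as "was" and "mice" are only resolved by the lemma lookup.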
Supervised:
Unsupervised:
Texts -> labels
Emails -> spam or not
News item -> news category
New article -> relevant to you or not (ResearchGate / Academia)
Review -> positive or negative
Doc id  Content                                     Class
1       request urgent interest urgent              Spam
2       assistance low interest deposit             Spam
3       symposium defense june                      Ham
4       notas symposium deadline june               Ham
5       registration assistance symposium deadline  ?

Docs 1-4: examples for machine learning
Doc 5: new unlabeled email, predict class

The machine learning algorithm calculates the probability of an email being spam based on how ‘spammy’ the words in the email are (spamminess is calculated by looking at the distribution of words over the spam and ham categories).

Doc 5, word by word: registration = Neutral, assistance = Spam, symposium = Ham, deadline = Ham
Prediction: Ham!
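One common way to turn "spamminess" into a prediction is a multinomial Naive Bayes classifier. The slides do not name the algorithm, so this is an assumption: a minimal sketch with add-one smoothing that skips words never seen in training (the "neutral" word), using the four labelled emails from the table. The function names `train` and `predict` are illustrative.

```python
import math
from collections import Counter

training = [
    ("request urgent interest urgent", "Spam"),
    ("assistance low interest deposit", "Spam"),
    ("symposium defense june", "Ham"),
    ("notas symposium deadline june", "Ham"),
]

def train(examples):
    """Count words per class and documents per class."""
    word_counts, doc_counts = {}, Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary

def predict(text, model):
    """Return the most probable class and the log-probability score per class."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word not in vocabulary:
                continue  # unseen word: treated as neutral
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get), scores

model = train(training)
label, scores = predict("registration assistance symposium deadline", model)
print(label, scores)
```

With these four training emails, "assistance" pulls towards Spam, while "symposium" and "deadline" pull more strongly towards Ham, so the sketch predicts Ham for doc 5.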
Let’s look at some statements!
Unicorns are awesome
Finding unicorns is difficult
Tokenize!
Unicorns are awesome
Finding unicorns is difficult
Lower case!
unicorns are awesome
finding unicorns is difficult
Remove stopwords!
unicorns are awesome
finding unicorns is difficult
Add structure!
unicorns awesome
finding unicorns difficult

unicorns  finding  awesome  difficult  sentiment
1                  1
1         1                 1

Bag of words approach
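The unicorn walkthrough can be sketched end to end in Python. This is an illustrative sketch under simple assumptions: whitespace tokenisation, a two-word stopword set, and a column order copied from the table above.

```python
from collections import Counter

statements = ["Unicorns are awesome", "Finding unicorns is difficult"]
STOPWORDS = {"are", "is"}  # assumed mini stopword list for this example

def preprocess(text):
    """Tokenise on whitespace, lowercase, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

documents = [preprocess(statement) for statement in statements]

# Column order taken from the table on the slide
vocabulary = ["unicorns", "finding", "awesome", "difficult"]

# One row per statement, one word count per vocabulary column
vectors = [[Counter(doc)[word] for word in vocabulary] for doc in documents]

for doc, vector in zip(documents, vectors):
    print(doc, vector)
```

Each statement becomes a row of word counts; word order is discarded, which is why this is called a bag of words. A sentiment label per row would then serve as the target for a supervised learner.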
processing steps:
learning task
1. Get some movie reviews from IMDB
2. Clean the data
3. Look at the most frequent words
4. Sentiment analysis → :) or :(
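The four steps above could look roughly like this in Python. Everything here is an assumption for illustration: the three reviews are made up (not real IMDB data), the stopword and sentiment word lists are hand-picked, and the tutorial itself may use a different sentiment method, such as a trained classifier.

```python
import re
from collections import Counter

# 1. Made-up stand-ins for downloaded reviews (hypothetical data)
reviews = [
    "A wonderful film, <br /> with a great cast!",
    "Boring plot and terrible acting.",
    "Great story, great music.",
]

STOPWORDS = {"a", "and", "with", "the"}   # assumed mini stopword list
POSITIVE = {"wonderful", "great"}         # assumed sentiment word lists
NEGATIVE = {"boring", "terrible"}

def clean(text):
    """2. Strip HTML tags, lowercase, tokenise, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

def sentiment(tokens):
    """4. Count positive minus negative words and map the score to a smiley."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return ":)"
    if score < 0:
        return ":("
    return ":|"

cleaned = [clean(review) for review in reviews]

# 3. Most frequent words across all reviews
frequencies = Counter(token for doc in cleaned for token in doc)
labels = [sentiment(doc) for doc in cleaned]

print(frequencies.most_common(3))
print(labels)
```

A word-list approach like this is easy to inspect but misses negation and context, which is where the supervised approach from earlier in the workshop comes in.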
Go to alexbrandsen.nl/tmtutorial/tutorial.pdf (or download from Kaltura).
If you already have programming experience, you can also do these other tutorials:
https://github.com/mchesterkadwell/bughunt-analysis
https://github.com/alexbrandsen/Text-Analysis-for-Humanities-research
(In both tutorials you can click the Binder button to start a Python notebook in your browser, no installation required.)