
SLIDE 1

What? Investigating what a corpus is about

Max Kemman

University of Luxembourg October 25, 2015

Doing Digital History: Introduction to Tools and Technology

SLIDE 2

Recap from last time

What is distant reading? What is an n-gram? What do the Y-axis and X-axis show?

SLIDE 3

Recap - Assignment

How did the assignment go? What did you think of the tools used? Could this be useful for your research?

SLIDE 4

One more thing on HTML: special characters

http://www.ascii.cl/htmlcodes.htm

Find the symbol and the HTML number
é & ü -> &#233; & &#252; -> é & ü
In your HTML, write longue dur&#233;e to get longue durée
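The lookup in the table can also be done programmatically. The sketch below is only an illustration: the helper name to_html_numbers is made up for this example, and it relies on the fact that the HTML number of a character is its Unicode code point.

```python
import html


def to_html_numbers(text: str) -> str:
    """Replace each non-ASCII character with its numeric HTML code (&#NNN;)."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


encoded = to_html_numbers("longue durée")
print(encoded)                 # longue dur&#233;e
print(html.unescape(encoded))  # longue durée
```

The standard-library function html.unescape goes the other way, turning the codes back into characters, much as a browser does when it renders the page.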

SLIDE 5

One more thing: what is an algorithm?

A set of rules to follow to solve a problem
Pretty much like a cooking recipe

a = 0
while (a < 10) {
    a = a + 1
}
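For those who want to run the recipe, here is the same loop as a small Python sketch. The slide does not name a language, so Python is an assumption made for illustration.

```python
# Start at zero and keep adding one until the condition no longer holds.
a = 0
while a < 10:
    a = a + 1

print(a)  # 10
```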

SLIDE 6

Today

The W's of research

What a corpus is about
The entities in a corpus
Another look at our emails
Voyant Tools
Next time
Assignment

SLIDE 7

The W's of research

Thus far:

1. Abundance of sources
2. Writing for the Web
3. Digitisation and Digital Libraries
4. Big Data
5. Distant Reading

Now: we have a digital corpus, what to do with it?

SLIDE 8

Research the corpus

Now come the W's of research:

1. What - Investigating what a corpus is about
2. Where - Investigating the spatial entities in a corpus
3. When - Investigating the temporal entities in a corpus
4. Who - Investigating the social entities in a corpus

SLIDE 9

What?

The first W of interest: what is this corpus actually about? Different methods are possible:

Find a description of the corpus to read
Select a sample of documents to read
Visualize the words used

SLIDE 10

What a corpus is about

SLIDE 11

What is this conference about?

SLIDE 12

Word clouds

Advantages of word clouds

Very easy to create

Visually pleasing
Gives a quick overview

SLIDE 13

What does a word cloud do?

Put very simply, a word cloud does the following:

1. Count the number of occurrences per word
2. Size each word by its frequency
3. Lay out the words to form a shape
4. Optional: colorize words for distinguishing and better readability
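Steps 1 and 2 can be sketched in a few lines of Python. This is a simplified illustration, not how any particular word cloud tool works: the function name word_sizes, the font-size range and the linear scaling are all assumptions, and the layout and colouring steps are left out.

```python
from collections import Counter


def word_sizes(text, min_size=10, max_size=40):
    """Count each word and scale its font size linearly by frequency."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    return {word: min_size + (max_size - min_size) * n / top
            for word, n in counts.items()}


sizes = word_sizes("digital history digital tools digital history")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

The most frequent word gets the largest size; everything else is scaled relative to it.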

SLIDE 14

Layout

Unlike the Ngram viewer: no X or Y axes
The position of each word is meaningless
The meaning is in the size of the words

SLIDE 15

Counting

Word clouds visualize the frequency of words
But how to count words that vary in spelling?

E.g. "Digital" and "digital" and "digitally", "digitize" and "digitization"

Normalization:

Lowercase

Tokenize
Stemming or lemmatizing
Stopwords

SLIDE 16

Lowercase

We were on vacation in France in August 2015
-> we were on vacation in france in august 2015

SLIDE 17

Tokenize

we were on vacation, in france, in august 2015
-> we|were|on|vacation|in|france|in|august|2015
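Lowercasing and tokenizing can be sketched together in Python. The regular expression used here is one simple choice among many and is an assumption of this example: it treats every run of letters or digits as a token, so it would also split a word like "don't" in two.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize("We were on vacation, in France, in August 2015")
print("|".join(tokens))  # we|were|on|vacation|in|france|in|august|2015
```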

SLIDE 18

Stemming or lemmatizing

digitized|digital|digitization|digitizing
Stemming: digit
Lemmatizing: digitiz|digital
Could be very useful, especially with Latin texts
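To give a feel for what a stemmer does, here is a toy suffix stripper in Python. It is deliberately naive and is not a real stemming algorithm such as Porter's: the suffix list is hand-picked for this one example, and the names SUFFIXES and toy_stem are invented here.

```python
# Hand-picked suffixes, longest first, so "ization" is tried before "al".
SUFFIXES = ("ization", "izing", "ized", "ize", "ally", "al")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


words = ["digitized", "digital", "digitization", "digitizing"]
print([toy_stem(w) for w in words])  # ['digit', 'digit', 'digit', 'digit']
```

A lemmatizer would instead look each word up in a dictionary and return a proper base form, which is why it keeps more distinctions than a stemmer.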

SLIDE 19

Stopwords

Most common words in the language: and, or, the
Sometimes: remove numbers
Not of interest (usually)

we|were|on|vacation|in|france|in|august|2015
-> we|were|vacation|france|august
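Stopword removal is a simple filter. In the sketch below the stopword list is a tiny made-up one, chosen only so that the result matches the example on the slide; real lists are much longer and would typically also contain words such as "we" and "were".

```python
# Illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"and", "or", "the", "on", "in"}


def remove_stopwords(tokens, drop_numbers=True):
    """Keep tokens that are not stopwords and, optionally, not numbers."""
    return [t for t in tokens
            if t not in STOPWORDS and not (drop_numbers and t.isdigit())]


tokens = "we|were|on|vacation|in|france|in|august|2015".split("|")
print("|".join(remove_stopwords(tokens)))  # we|were|vacation|france|august
```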

SLIDE 20

What are these grants about? (normalized)

SLIDE 21

Comparing between different parts of the corpus

Sources separated by their citation behaviour

SLIDE 22

Representing a model of the text

What if we do not know how to separate sources? Or if we want to know what other words are related to our keywords?

SLIDE 23

Topic modelling

Documents and words can be directly observed, but topics are latent
How to represent the topics in a corpus?

(Slides on topic modelling from Pim Huijnen and Marijn Koolen)

Statistics to find topics represented by groups of words

Document is a mix of topics
Topic is a mix of words

SLIDE 24

Topic modelling

Assumption: two documents with the same topics will have overlap in words
For a given corpus, the modelling process does the following:

1. Create word probability distribution for topics
2. Create topic probability distribution for documents
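The two distributions fit together as follows: the chance of seeing a word in a document is the sum, over all topics, of the topic's share in the document times the word's share in the topic. The Python sketch below uses invented topics and invented probabilities purely to show this arithmetic; it does not fit a topic model to any corpus.

```python
# Hypothetical word probability distribution per topic (each sums to 1).
topic_words = {
    "war":  {"soldier": 0.5, "front": 0.3, "bread": 0.2},
    "food": {"bread": 0.6, "price": 0.3, "soldier": 0.1},
}
# Hypothetical topic probability distribution for one document (sums to 1).
doc_topics = {"war": 0.7, "food": 0.3}


def word_probability(word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_words[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


print(round(word_probability("bread"), 2))  # 0.32
```

A topic modelling tool works in the opposite direction: it observes only the words in the documents and estimates both distributions from them.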

SLIDE 25

Topic modelling

In short: a corpus is represented by statistical topics
This allows us to:

Separate sources by topics

Find related keywords

SLIDE 26

Comparing different parts of the corpus

Mendeley Research Maps
Comparing the topical similarity
Assigned documents to disciplines to map disciplines by topics
Which form of machine learning would this be?

SLIDE 27

What is the corpus about?

We can now represent the words or the topics of a corpus
But, remember: World War I ≠ "World War I"

SLIDE 28

The entities in a corpus

Thus far we know the frequencies of all the words
But what are we interested in?
What do we need for the other W's?

SLIDE 29

The entities in a corpus

Thus far we know the frequencies of all the words
But what are we interested in?
What do we need for the other W's?

Where - places

When - dates
Who - people

SLIDE 30

People in the corpus

Ter Braake & Fokkens

Fairly easy to discover famous people (with biographical dictionaries and Ngram viewers)
Ngrams help top-down: when you know who to search for
But how to discover who did not become famous, while prominent in their own time?

Need to find all people bottom-up by identifying all the names

SLIDE 31

Bottom-up process

Ter Braake & Fokkens

1. Identify all names in the corpus
2. Give all names an identifier
3. Disambiguate names referring to the same person
4. Compare results with a non-digital corpus
5. Visualize the results
6. Interpret!

SLIDE 32

Identifying names

Combinations of words that start with a capital
This won't work for German
Their algorithm allows for two sequential lower case words: Johan van der Capellen
Note: built for recall, not precision
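The rule described here can be approximated with a regular expression. The Python sketch below is a guess at the general idea, not Ter Braake & Fokkens's actual algorithm: the pattern, the function name find_names and the example sentences are all made up for illustration. It looks for two or more capitalised words, allowing up to two lower-case words in between.

```python
import re

# A capitalised word, followed one or more times by: up to two lower-case
# words and then another capitalised word.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:(?: [a-z]+){0,2} [A-Z][a-z]+)+\b")


def find_names(text):
    """Return every candidate name matched by the capitalisation rule."""
    return [match.group(0) for match in NAME_PATTERN.finditer(text)]


sentence = "In 1781 a pamphlet by Johan van der Capellen was read by Pieter Paulus."
print(find_names(sentence))  # ['Johan van der Capellen', 'Pieter Paulus']
```

The trade-off is easy to provoke. A single name on its own is missed, so find_names("Rembrandt painted.") returns nothing, while find_names("Erasmus lived in Rotterdam.") returns the whole phrase "Erasmus lived in Rotterdam" as one candidate name.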

SLIDE 33

Recall & Precision

Recall: retrieve all relevant entities
Precision: do not retrieve irrelevant entities
For algorithms, it is usually a choice which one to optimize

Recall of people referred to with single name (Erasmus, Rembrandt) would lead to too much noise = lower precision
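Both measures are simple fractions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The Python sketch below uses a made-up result set to show the calculation; the names in it are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the people actually in the text vs. what a tool found.
relevant = {"Erasmus", "Rembrandt", "Johan van der Capellen"}
retrieved = {"Johan van der Capellen", "Rembrandt", "Amsterdam", "Holland"}

precision, recall = precision_recall(retrieved, relevant)
print(precision)         # 0.5
print(round(recall, 2))  # 0.67
```

Here two of the four retrieved items are people (precision 0.5) and two of the three people were found (recall 0.67).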

SLIDE 34

Difficulties

Spelling of names (especially before the 19th century)
People with the same name
Nicknames and changing names
People with the same title
Context matters!

SLIDE 35

Named Entity Recognition

We want to identify the entities:

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought icecream, which cost €2.

SLIDE 36

Named Entities

Or we want to see:

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought icecream, which cost €2.

People: Max

Places: France, Apt
Organizations: Intermarche
Dates: August 2015
Currencies: €2
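A very rough way to get this kind of output is to look up known names in a list (a gazetteer) and to match dates and amounts with patterns. The Python sketch below does exactly that and nothing more: the lists and patterns are hand-made for this one example, and a trained NER tool works very differently because it also uses the context of each word.

```python
import re

# Hand-made lookup lists for this example only.
GAZETTEER = {
    "People": ["Max"],
    "Places": ["France", "Apt"],
    "Organizations": ["Intermarche"],
}
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "Dates": rf"\b(?:{MONTHS}) \d{{4}}\b",   # e.g. "August 2015"
    "Currencies": r"€\d+(?:\.\d+)?",         # e.g. "€2" or "€2.50"
}


def find_entities(text):
    """Return a dict mapping each entity type to the entities found in text."""
    found = {}
    for label, names in GAZETTEER.items():
        found[label] = [name for name in names
                        if re.search(rf"\b{re.escape(name)}\b", text)]
    for label, pattern in PATTERNS.items():
        found[label] = re.findall(pattern, text)
    return found


text = ("We were on vacation in France in August 2015. I went to shop at the "
        "Intermarche. The area around Apt is really nice. Max also bought "
        "icecream, which cost €2.")
for label, entities in find_entities(text).items():
    print(f"{label}: {', '.join(entities)}")
```

The obvious weakness is that a lookup list only finds what is already on the list, which is the top-down problem again.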

SLIDE 37

Another look at our emails

For all 30k emails, we performed text normalisation and named entity recognition
Let's take a look at https://www.wikileaks.org/clinton-emails/emailid/8
Exercise 1: try to normalise the text
Exercise 2: try to discover the named entities: People, Places, Organisations

SLIDE 38

Normalised

See Email8-normalised.txt in Moodle under "Emails"

unclassife, us, department, state, case, f--, doc, date, release, full, hrod, clintonemailcom, sent, friday, july, pm, sullivanjj, stategov, subject, re, pakistan, bomb, ok, go, original, message, sullivan, jacob, sullivanjj, stategov, sent, fri, jul, subject, pakistan, bomb, fyi, put, follow, statement, statement, secretary, clinton, bomb, shrine, sy, ali, hujviri, lahore, shock, sadden, yesterday, attack, one, pakistan, popular, place, worship, shrine, sy, ali, hujviri, data, ganjbakhsh, lahore, claime, live, many, innocent, pakistane, extremist, shown, respect, neither, human, dignity, fundamental, religious, value, pakistani, society, violact, sanctity, rever, shrine, particularly, sinister, attempt, destabilize, pakistan, intimidate, people, attacker, will, succeed, pakistani, public, refuse, cow, violence, condemn, brutal, crime, reaffirm, commitment, support, pakistani, people, effort, defend, democracy, violent, extremist, seek, destroy, thought, prayer, family, victim, people, pakistan

SLIDE 39

Named Entities

Try to do it by hand
NER tool: http://nlp.stanford.edu:8080/ner/

People: Sullivan Jacob, CLINTON, Ali Hujviri
Places: Pakistan, Pakistan, Lahore, Pakistan, Lahore, Pakistan, Pakistan
Organisations: U.S. Department of State Case No, Shrine of Syed Ali Hujviri

SLIDE 40

Visualise the email

Go to http://tagcrowd.com/
Compare with and without stopwords
Compare normal and normalised text

SLIDE 41

What?

So, what's the email about? Do we get different perspectives?

SLIDE 42

Voyant Tools

Go to www.voyant-tools.org/
Use Mozilla Firefox; it doesn't work in Chrome (that's what went wrong during the lecture)
From Moodle: download the files for emails 6000-6019: f6-20-raw.txt and f6-20-normalised.txt
You can paste in text, or upload the file
Continue by hitting Reveal

SLIDE 43

Saving the Voyant session

It is a good idea to copy the URL early on: this allows you to refresh the page if the tool crashes, or to open the tool again later with the data and stopwords you already had
Share the session: hover over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL, which opens a new browser window where you can copy the address

SLIDE 44

Voyant windows

Look at all the windows in Voyant and see if you understand them

1. Cirrus (word cloud)
2. Reader
3. Summary
4. Trends
5. Contexts

SLIDE 45

Voyant Word Clouds

In the Cirrus, hover over the title bar and click the 3rd icon

Select the stopword list you need
Or Edit List to add more words: 1 word per line, click Save
Check "apply globally" to activate in all windows
Use the word cloud to detect common words we're not interested in: unclassified, department, subject, etc.
Hit Confirm
When editing again, the stopwords are ordered alphabetically, so you might not see them at the end anymore

SLIDE 46

Voyant Summary

What is the longest email?
What are distinctive words?
Distinctive words calculated by TF-IDF: what was that again?
Update: the distinctive words feature doesn't work now that we combined all the emails in a single text file
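As a reminder, TF-IDF multiplies how often a word occurs in one document (term frequency) by how rare it is across all documents (inverse document frequency). The Python sketch below uses one common textbook variant of the formula; Voyant's exact calculation may differ, and the two tiny "emails" are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document (each doc is a token list)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
                       for word, count in term_freq.items()})
    return scores


docs = [["pakistan", "bomb", "statement"],
        ["pakistan", "meeting", "schedule"]]
scores = tf_idf(docs)
print(scores[0]["pakistan"])  # 0.0, because the word occurs in every document
print(scores[0]["bomb"] > 0)  # True, because the word is distinctive for email 1

# With everything merged into one document, every score collapses to zero.
combined = tf_idf([docs[0] + docs[1]])
print(set(combined[0].values()))  # {0.0}
```

The last lines suggest why the feature breaks on a combined file: with a single document there is nothing to compare against, so under this formula no word can be distinctive.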

SLIDE 47

Searching specific words

In the Cirrus window, you can click Terms in the top bar to get the list of words ordered by count
You can see immediately per word how it develops over time in the emails
From this list you can select a word by checking the box to the left of it
Alternatively, you can search for words per window. For example, in the Contexts window (lower right), there is a search box at the bottom where you can search for words

SLIDE 48

Interpreting with Voyant

What are the biggest words?
How do they develop throughout the emails?
Does this tell what the emails are about and how it goes?
If not: what is different?

SLIDE 49

Sharing the Voyant

You can either:

Take screenshots of what you want to show
Share the session: hover over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL, which opens a new browser window where you can copy the address. The HTML snippet option gives HTML code that you can embed in your report.
Share specific windows: for example, in the top bar of Trends, click the first icon (see image), and select to export a URL, an HTML snippet for embedding, or a PNG for including in your report

SLIDE 50

Next time

1 November: No class
8 November: When? Temporal entities and timelines

SLIDE 51

Assignment

Perform Voyant analysis of HC emails
Compare (see next slide for all the available files). Do comparisons in separate Voyant windows:

f6-100-raw.txt vs f6-100-normalised.txt to see how text normalisation gives a different perspective
For further comparisons, choose either the raw or the normalised text:
f6-1000-*.txt vs f7-1000-*.txt to see how the emails are different
If Voyant or your computer has difficulty with 1000 emails, compare f6-100-*.txt vs f7-100-*.txt

SLIDE 52

Download files from Moodle:

Emails      Raw               Normalised
6000-6099   f6-100-raw.txt    f6-100-normalised.txt
7000-7099   f7-100-raw.txt    f7-100-normalised.txt
6000-6999   f6-1000-raw.txt   f6-1000-normalised.txt
7000-7999   f7-1000-raw.txt   f7-1000-normalised.txt

SLIDE 53

Assignment

Work in groups of two or three
Use the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.

Hand in the assignment in HTML, include your name and a decent profile photo
500-1000 words, in English

SLIDE 54

Possible questions you might ask of your corpora

What are these emails about?

Do we need to further clean the data?
How are these corpora different?
Does text normalisation lead to different results?

SLIDE 55

Grading

Do note: the finding itself is not the most important part
Email to max.kemman@uni.lu before the start of the next lecture

1pt for free

3pts for HTML
3pts for documentation of your process
3pts for critical reflection on your finding
