Text Mining and Historical Research, Beatrice Alex (PowerPoint presentation)

SLIDE 1

Beatrice Alex balex@inf.ed.ac.uk

MSc Historical Research, University of Edinburgh, March 21st 2014

Text Mining and Historical Research

Friday, 21 March 2014

SLIDE 2

OVERVIEW

What is text mining? Types of text analyses. Trading Consequences: text mining applied.

SLIDE 3

TEXT MINING

Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
Turns unstructured text into structured data (e.g. a relational database or linked data).
Is very useful for analysing large text collections automatically (overcoming data paralysis).
Goal in digital humanities and social sciences (DHSS) research: by analysing large amounts of textual data, help humanities and social sciences (HSS) scholars to discover novel patterns and explore hypotheses.

SLIDE 4

MINING WHAT TEXT?

Electronic text or things that can be turned into it.

Born-electronic text (research papers, literature, tweets, blogs, comments on blogs etc.).
Digitised text documents.
Metadata (collection and document level).
Image subtitles (Flickr image titles and subtitles).
Video/audio transcripts (YouTube transcripts, TED talks, MOOC transcripts etc.).

SLIDE 5

TYPES OF ANALYSES

Named entity recognition. Grounding, e.g. geo-referencing. Relation extraction. Clustering, e.g. topic modelling. Sentiment analysis.

SLIDE 6

NAMED ENTITY RECOGNITION

Identification and classification of entity mentions in text, such as names of persons, locations and organisations, or dates and amounts. Often used for improving access to collections.
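As a minimal illustration of identifying and classifying entity mentions, the sketch below tags names from a tiny hand-made gazetteer and four-digit years found with a regular expression. The gazetteer entries, the `tag_entities` function and the sample sentence are assumptions made for this example; real recognisers rely on much richer linguistic and statistical models.

```python
import re

# Hypothetical toy gazetteer; a real system would use far larger resources.
GAZETTEER = {
    "Edinburgh": "location",
    "Padang": "location",
    "Old Bailey": "organisation",
}

# Four-digit years between 1500 and 2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def tag_entities(text):
    """Return (start, end, mention, label) tuples sorted by position."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), match.end(), name, label))
    for match in YEAR_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "date"))
    return sorted(entities)


sample = "In 1871 cassia bark was shipped from Padang."
for start, end, mention, label in tag_entities(sample):
    print(f"{mention!r} [{label}] at {start}-{end}")
```

A lookup like this only finds names it already knows, which is why it is a sketch of the task rather than of any particular system.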

SLIDE 7

CONNECTED HISTORIES

http://www.connectedhistories.org

SLIDE 8

OLD BAILEY ONLINE

http://www.oldbaileyonline.org

SLIDE 9

GROUNDING

Linking entity mentions in text to a unique identifier, e.g.:

Person names to their Wikipedia pages.
Location names to lat/longs or GeoNames IDs.
Gene names to gene ontologies.

Goal is to disambiguate between mentions with the same surface form (e.g. “Paris”, “Victoria”).
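One simple way to picture disambiguation: keep several candidate entries per surface form and pick one using whatever context is available. The sketch below is an assumption-laden toy, not the Edinburgh Geoparser's method: the identifiers are made up, the coordinates and populations are rough hand-entered values, and the "prefer the hinted country, otherwise the largest place" rule is just a common baseline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    identifier: str   # made-up ID, not a real GeoNames ID
    name: str
    country: str
    lat: float
    lon: float
    population: int   # rough figure, used only for ranking


# Hypothetical mini-gazetteer with approximate values.
CANDIDATES = {
    "Paris": [
        Candidate("toy:paris-fr", "Paris", "FR", 48.86, 2.35, 2_100_000),
        Candidate("toy:paris-us", "Paris", "US", 33.66, -95.56, 25_000),
    ],
}


def ground(mention: str, country_hint: Optional[str] = None) -> Optional[Candidate]:
    """Pick one candidate: prefer the hinted country, then the largest place."""
    options = CANDIDATES.get(mention, [])
    if country_hint is not None:
        hinted = [c for c in options if c.country == country_hint]
        if hinted:
            options = hinted
    return max(options, key=lambda c: c.population, default=None)


print(ground("Paris"))
print(ground("Paris", country_hint="US"))
```

Without a hint the mention resolves to the larger place; a document known to be about Texas would supply the hint that flips the choice.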

SLIDE 10

EDINBURGH GEOPARSER

SLIDE 11

RELATION EXTRACTION

Identifying relations between entities in text or in metadata in order to

Triples:
person - author_of -> book title
commodity - traded_at -> location
person - born_in -> location
person - born_at -> date
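Triples of this shape map naturally onto a small data structure. The sketch below stores a few illustrative instances of the patterns as named tuples and looks up objects by subject and relation; the `Triple` type, the `objects_of` helper and the example values are assumptions for illustration.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "object"])

# Illustrative instances of the triple patterns; the values are examples only.
triples = [
    Triple("Walter Scott", "author_of", "Waverley"),
    Triple("Walter Scott", "born_in", "Edinburgh"),
    Triple("cassia bark", "traded_at", "Padang"),
]


def objects_of(subject, relation, store):
    """Return every object linked to `subject` by `relation`."""
    return [t.object for t in store if t.subject == subject and t.relation == relation]


print(objects_of("Walter Scott", "author_of", triples))
```

The same three-part layout is what makes extracted relations easy to load into a relational database or publish as linked data.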

SLIDE 12

RELATION EXTRACTION

ChartEx, Discovering spatial descriptions and relationships in medieval charters, October 2013.

SLIDE 13

RELATION EXTRACTION

ChartEx, Discovering spatial descriptions and relationships in medieval charters, October 2013.

SLIDE 14

RELATION EXTRACTION

ChartEx, Discovering spatial descriptions and relationships in medieval charters, October 2013.

SLIDE 15

SUMMARY SO FAR

Different types of text analyses have been applied to historical and literary research, and new opportunities are endless. Text mining can assist scholars in their research, but it does not replace them!

Traditional scholarship is well suited to close reading. HSS scholars can focus on questions which can be answered by computational methods. Human interpretation is vital.

Visualisation of TM output is important.

SLIDE 16

SLIDE 17

PROJECT TEAM

Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
Colin Coates, Andrew Watson: historical analysis
Jim Clifford: historical analysis
James Reid, Nicola Osborne: data management, social media
Aaron Quigley, Uta Hinrichs: information visualisation

SLIDE 18

TRADITIONAL HISTORICAL RESEARCH

Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.
Global Fats Supply 1894-98

SLIDE 19

PROJECT GOALS

Text mining, data extraction and information visualisation to explore big historical datasets. Focus on how commodities were traded across the globe in the 19th century. Help historians to discover novel patterns and explore new research questions.

SLIDE 20

DOCUMENT COLLECTIONS

Collection                                          # of Documents   # of Images
House of Commons Parliamentary Papers (ProQuest)    118,526          6,448,739
Early Canadiana Online                              83,016           3,938,758
Directors’ Letters of Correspondence (Kew)          14,340           n/a
Confidential Prints (Adam Matthew)                  1,315            140,010
Foreign and Commonwealth Office Collection          1,000            41,611
Asia and the West (Gale)                            4,725            948,773 (OCRed: 450,841)

SLIDE 21

DOCUMENT COLLECTIONS

Over 10 million document pages and over 7 billion word tokens.
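The page figure can be checked against the image counts listed for the collections, assuming each image corresponds to one document page. The collection with no image count ("n/a") is left out, so the sum is a lower bound.

```python
# Image counts as listed for each collection; the "n/a" entry is omitted.
image_counts = {
    "House of Commons Parliamentary Papers": 6_448_739,
    "Early Canadiana Online": 3_938_758,
    "Confidential Prints": 140_010,
    "Foreign and Commonwealth Office Collection": 41_611,
    "Asia and the West": 948_773,
}

total_images = sum(image_counts.values())
print(f"{total_images:,}")  # 11,517,891
```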

SLIDE 22

MINED INFORMATION

Example sentence: Normalised and grounded entities:

commodity: cassia bark [concept: Cinnamomum cassia]
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
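The bracketed attributes read like semicolon-separated key=value pairs. Assuming that notation (it may only be a display format on the slide, not how the project stores its data), a hypothetical `parse_attributes` helper can turn them into a dictionary:

```python
def parse_attributes(attribute_string):
    """Split 'key=value;key=value' into a dict, converting numeric values to float."""
    attributes = {}
    for pair in attribute_string.split(";"):
        key, _, value = pair.partition("=")
        try:
            attributes[key.strip()] = float(value)
        except ValueError:
            # Non-numeric values such as country codes or "n/a" stay as text.
            attributes[key.strip()] = value.strip()
    return attributes


padang = parse_attributes("lat=-0.94924;long=100.35427;country=ID")
print(padang)
```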

SLIDE 23

MINED INFORMATION

Example sentence: Extracted entity attributes and relations:

origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
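A simple way to arrive at relations like these is to pair each commodity with every date and location found in the same sentence. The sketch below does exactly that; this co-occurrence rule, the dictionary layout and the `relate` function are assumptions for illustration and not a description of the project's extraction method.

```python
# Entities from the example, in a hypothetical dictionary layout.
entities = [
    {"type": "commodity", "text": "cassia bark"},
    {"type": "date", "text": "1871"},
    {"type": "location", "text": "Padang", "role": "origin"},
    {"type": "location", "text": "America", "role": "destination"},
]


def relate(entities):
    """Pair every commodity with every date and location in the same sentence."""
    commodities = [e for e in entities if e["type"] == "commodity"]
    others = [e for e in entities if e["type"] in ("date", "location")]
    return [
        (f"commodity-{o['type']}", c["text"], o["text"])
        for c in commodities
        for o in others
    ]


for relation in relate(entities):
    print(relation)
```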

SLIDE 24

COMMODITY LEXICON

Seed set from customs import records.

SLIDE 25

LEXICON CREATION

Seed lexicon: ~600
Extended lexicon: ~17,000
With pluralisation of single-word entries: ~20,500
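The pluralisation step can be sketched as follows, assuming a few rough English spelling rules. The `pluralise` rules, the `extend_with_plurals` helper and the seed entries are hypothetical; the rules actually used to build the lexicon are not given here.

```python
def pluralise(word):
    """Very rough English pluralisation; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def extend_with_plurals(lexicon):
    """Add a plural form for every single-word entry; keep multi-word entries as is."""
    extended = set(lexicon)
    for entry in lexicon:
        if " " not in entry:
            extended.add(pluralise(entry))
    return extended


seed = {"hide", "berry", "cassia bark", "wax"}  # hypothetical entries
print(sorted(extend_with_plurals(seed)))
```

Only single-word entries gain a plural, which is consistent with the lexicon growing by less than a factor of two at this step.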

SLIDE 26

LEXICON PRECISION

From the top 1,757 entries only 84 (4.8%) had to be filtered. The top 1,757 entities amount to 99.8% of mentions.
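The filtered share follows directly from the two counts. Reading its complement as the precision of the top entries is an interpretation added here, not a figure stated on the slide.

```python
top_entries = 1_757
filtered = 84

filtered_share = filtered / top_entries
kept_share = 1 - filtered_share

print(f"filtered: {filtered_share:.1%}")  # filtered: 4.8%
print(f"kept:     {kept_share:.1%}")      # kept:     95.2%
```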

SLIDE 27

LEXICON PRECISION

From the top 1,757 entries only 84 (4.8%) had to be filtered. The top 1,757 entities amount to 99.8% of mentions.

SLIDE 28

OCR ERRORS

Extract of Early Canadiana Online document 9_00952_3, p. vi.

SLIDE 29

OCR ERRORS

Extract of Early Canadiana Online document 9_00952_3, p. vi.

SLIDE 30

OCR ERRORS

Extract of Early Canadiana Online document 9_00952_3, p. vi.

SLIDE 31

OCR ERRORS

Extract of Early Canadiana Online document 9_00952_3, p. vi.

SLIDE 32

OCR ERRORS

SLIDE 33

BRINGING ARCHIVES ALIVE

SLIDE 34

SUMMARY

You have access to enormous amounts of data. Text mining can be applied to process large text collections, enrich existing text with information or pull out trends which can be visualised. Text mining is a way to enable distant reading, even if such technology is not 100% accurate. OCR errors in digitised collections can skew your results.
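To see how OCR errors can skew counts, consider exact lexicon lookup on clean versus noisy tokens. In the sketch below the lexicon entries and the OCR confusion ("m" read as "rn") are invented for illustration; fuzzy matching with `difflib.get_close_matches` from the Python standard library is shown as one possible mitigation, not as the approach taken in the project.

```python
from difflib import get_close_matches

LEXICON = ["cinnamon", "coffee", "cotton", "sugar"]  # hypothetical entries


def exact_hits(tokens):
    """Count tokens that match a lexicon entry exactly."""
    return sum(1 for token in tokens if token in LEXICON)


def fuzzy_lookup(token, cutoff=0.8):
    """Return the closest lexicon entry, or None if nothing is similar enough."""
    matches = get_close_matches(token, LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else None


clean = ["cinnamon", "sugar", "cotton"]
noisy = ["cinnarnon", "sugar", "cotton"]  # invented OCR confusion: "m" read as "rn"

print(exact_hits(clean), exact_hits(noisy))
print(fuzzy_lookup("cinnarnon"))
```

The noisy text yields one fewer exact hit for the same underlying content, which is the kind of undercount that can distort trends. Fuzzy matching recovers this case but can also introduce false matches, so the cutoff needs care.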

SLIDE 35

USEFUL SITES

Programming Historian: http://programminghistorian.org/
The Historian's Macroscope: Big Digital History: http://www.themacroscope.org/
Palladio: http://palladio.designhumanities.org

SLIDE 36

THANK YOU

Demo: http://tcqdev.edina.ac.uk/search/commodity/ , http://tcqdev.edina.ac.uk/vis/tradConVis_new/
Contact: balex@inf.ed.ac.uk
