Data Mining in the Humanities: Text ⊂ Data (Beatrice Alex)

SLIDE 1

Beatrice Alex balex@inf.ed.ac.uk

DigitalHSS Seminar, University of Edinburgh November 19th 2013

Data Mining in the Humanities Text ⊂ Data

Monday, 25 November 2013

SLIDE 2

MINING WHAT DATA?

Electronic text or things that can be turned into it:

  • Born-electronic text (research papers, literature, tweets, blogs, comments on blogs etc.)
  • Metadata (collection and document level)
  • Image subtitles (Flickr image titles and subtitles)
  • Video/audio transcripts (YouTube transcripts, TED talks, MOOC transcripts etc.)

SLIDE 3

TEXT MINING

  • Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
  • Turns unstructured text into structured data (e.g. relational database or linked data).
  • Is very useful for analysing large text collections automatically (overcoming data paralysis).

Goal in DHSS research: by analysing large amounts of digitised data, help HSS scholars to discover novel patterns and explore hypotheses.

SLIDE 4

TYPES OF ANALYSES

  • Word or n-gram frequencies, concordances or collocation analysis
  • Named entity recognition
  • Grounding, e.g. geo-referencing
  • Relation extraction
  • Clustering, e.g. topic modelling
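The first item in this list can be illustrated with a few lines of Python. This is a minimal sketch, not the tooling used in the talk: the function names, the tokenisation rule and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text, for illustration only.
sample = ("Cassia bark was shipped from Padang. "
          "The cassia bark trade grew, and bark prices rose.")
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "bark", width=2)

for line in kwic:
    print(line)
```

Frequencies show which words or word pairs dominate a collection; the concordance lines show each hit in its immediate context, which is usually where a scholar's close reading starts.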

SLIDE 5

NAMED ENTITY RECOGNITION

Identification and classification of entity mentions in the text, where entity refers to things like:

  • Persons
  • Locations
  • Dates
  • ...

Often used for improving access to collections.

SLIDE 6

CONNECTED HISTORIES

http://www.connectedhistories.org

SLIDE 7

OLD BAILEY ONLINE

http://www.oldbaileyonline.org

SLIDE 8

GROUNDING

Linking entity mentions in text to a unique identifier, e.g.:

  • Wikipedia pages
  • Lat/longs or Geonames IDs
  • Gene ontologies

The goal is to disambiguate between mentions with the same surface form.
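As a rough illustration of disambiguation, the sketch below grounds an ambiguous place name by picking the candidate closest to the unambiguous places mentioned alongside it. This is one common heuristic, assumed here for illustration; it is not claimed to be the method behind any of the projects shown. The gazetteer, its IDs and the rounded coordinates are invented.

```python
import math

# Toy gazetteer: surface form -> candidate records. IDs are made up and
# coordinates are rough, illustrative values, not real gazetteer entries.
GAZETTEER = {
    "Perth": [
        {"id": "toy:perth-scotland", "lat": 56.4, "lon": -3.4},
        {"id": "toy:perth-australia", "lat": -31.9, "lon": 115.9},
    ],
    "Edinburgh": [{"id": "toy:edinburgh", "lat": 55.9, "lon": -3.2}],
    "Fremantle": [{"id": "toy:fremantle", "lat": -32.1, "lon": 115.7}],
}


def distance_km(a, b):
    """Great-circle distance between two records (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def ground(mentions):
    """Map each mention to one gazetteer ID (or None if unknown).

    Unambiguous mentions act as anchors; an ambiguous mention takes the
    candidate with the smallest total distance to those anchors.
    """
    anchors = [GAZETTEER[m][0] for m in mentions
               if len(GAZETTEER.get(m, [])) == 1]
    result = {}
    for m in mentions:
        candidates = GAZETTEER.get(m, [])
        if not candidates:
            result[m] = None                      # not in the gazetteer
        elif len(candidates) == 1 or not anchors:
            result[m] = candidates[0]["id"]       # single or first-listed candidate
        else:
            best = min(candidates,
                       key=lambda c: sum(distance_km(c, a) for a in anchors))
            result[m] = best["id"]
    return result


scottish = ground(["Perth", "Edinburgh"])
australian = ground(["Perth", "Fremantle"])
print(scottish["Perth"], australian["Perth"])
```

The same surface form "Perth" resolves to different identifiers depending on its context, which is the point of grounding.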

SLIDE 9

GEO-REFERENCING

http://placenames.org.uk

SLIDE 11

GEO-REFERENCING

Ian Gregory: Mapping the Lakes, followed by Spatial Humanities. Recognises the importance of geo-referencing in DH research.

SLIDE 12

RELATION EXTRACTION

Identifying relations between entities in text or in metadata.

Triples:

  • person - author_of - book title
  • commodity - traded_at - location
  • person - born_in - location
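Triples of this kind map naturally onto a small data structure. The sketch below is an assumption about how one might hold and query them in memory; the `Triple` class, the `match` helper and the example facts are illustrative and not part of any system described in the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Example facts chosen for illustration; predicates follow the slide's naming.
triples = [
    Triple("Jane Austen", "author_of", "Emma"),
    Triple("Jane Austen", "born_in", "Steventon"),
    Triple("cassia bark", "traded_at", "Padang"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in store
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)]


works = match(triples, subject="Jane Austen", predicate="author_of")
traded = match(triples, predicate="traded_at")
print(works, traded)
```

Once relations are stored this way, a question such as "which commodities were traded at which locations?" becomes a pattern match rather than a reading task.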

SLIDE 13

RELATION EXTRACTION

ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.

SLIDE 16

CLUSTERING

  • Documents or words are clustered into groups based on word probabilities and other features.
  • Single-membership clustering versus multi-membership clustering.
  • Hierarchical clustering.
  • Topic modelling.
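To make the idea concrete, here is a minimal sketch of single-membership, bottom-up (agglomerative) clustering of documents by their word counts. The choice of cosine distance and average linkage is an assumption made for this example, and the four tiny "documents" are invented. Topic modelling, which allows multi-membership, is not shown.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def agglomerate(docs, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Starts with one cluster per document and uses average linkage.
    Returns a list of sets of document names.
    """
    vectors = {name: Counter(text.lower().split()) for name, text in docs.items()}
    clusters = [{name} for name in docs]

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(cosine_distance(vectors[x], vectors[y])
                   for x, y in pairs) / len(pairs)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] |= clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Invented mini-documents, for illustration only.
docs = {
    "a": "sugar trade sugar price",
    "b": "sugar price rose",
    "c": "birth delivered child",
    "d": "child birth midwife",
}
groups = agglomerate(docs, k=2)
print(groups)
```

Recording the order of the merges, rather than stopping at k clusters, gives the tree (dendrogram) that hierarchical clustering is usually visualised with.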

SLIDE 17

HIERARCHICAL CLUSTERING

Allison et al., 2011, Literary Lab Pamphlet 1.

Shakespeare Project and Visualising English Print

SLIDE 18

TOPIC MODELLING

Analysis of Martha Ballard’s diary (27 years of daily entries) by Cameron Blevins

Cameron Blevins’ blog

SLIDE 20

Ted Underwood’s blog

SLIDE 21

SUMMARY SO FAR

Different types of text analysis have already been applied to historical and literary research, but the new opportunities are endless. Text mining can assist scholars in their research, but it is not replacing them!

Traditional scholarship is well suited to close reading. DHSS scholars can focus on questions which can be answered by computational methods. Human interpretation is vital.

Visualisation of text mining (TM) output is important.

SLIDE 23

PROJECT OVERVIEW

  • JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013
  • Text mining, data extraction and information visualisation to explore big historical datasets.
  • Focus on how commodities were traded across the globe in the 19th century.
  • Help historians to discover novel patterns and explore new research questions.

SLIDE 24

PROJECT TEAM

  • Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
  • Colin Coates, Jim Clifford: historical analysis
  • James Reid, Nicola Osborne: data management, social media
  • Aaron Quigley, Uta Hinrichs: information visualisation

SLIDE 25

TRADITIONAL HISTORICAL RESEARCH

  • Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.
  • Global Fats Supply 1894-98

SLIDE 26

DOCUMENT COLLECTIONS

Collection                                         # of Documents   # of Images
House of Commons Parliamentary Papers (ProQuest)          118,526     6,448,739
Early Canadiana Online                                     83,016     3,938,758
Directors’ Letters of Correspondence (Kew)                 14,340           n/a
Confidential Prints (Adam Matthews)                         1,315       140,010
Foreign and Commonwealth Office Collection                  1,000        41,611
Asia and the West (Gale)                                    4,725       948,773 (OCRed: 450,841)

SLIDE 27

SYSTEM

[System architecture diagram. Components shown: Documents; Text Mining; Annotated Documents; XML 2 RDB; Commodities RDB; Lexicons & Gazetteers; Query Interface; Visualisation; Commodities Ontology (SKOS).]

SLIDE 28

MINED INFORMATION

Example sentence:

Normalised and grounded entities:

  • commodity: cassia bark [concept: Cinnamomum cassia]
  • date: 1871 (year=1871)
  • location: Padang (lat=-0.94924;long=100.35427;country=ID)
  • location: America (lat=39.76;long=-98.50;country=n/a)
  • quantity + unit: 6,127 piculs

SLIDE 29

MINED INFORMATION

Example sentence:

Extracted entity attributes and relations:

  • origin location: Padang
  • destination location: America
  • commodity–date relation: cassia bark – 1871
  • commodity–location relation: cassia bark – Padang
  • commodity–location relation: cassia bark – America

SLIDE 30

EDINBURGH GEOPARSER

SLIDE 31

COMMODITY LEXICON

Seed set from customs import records.

SLIDE 34

SIBLING ACQUISITION

DBpedia

SLIDE 35

LEXICON BOOTSTRAPPING

  • Seed lexicon: ~600
  • Extended lexicon: ~17,000
  • With pluralisation of single-word entries: ~20,500
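The last step, adding plural forms for single-word entries, can be sketched as follows. The pluralisation rules here are a deliberately naive assumption (regular English plurals only) and the seed entries are invented; the project's actual rules are not described on the slide.

```python
def pluralise(word):
    """Naive English plural; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_lexicon(entries):
    """Add a plural variant for every single-word entry; keep the rest as is."""
    expanded = set(entries)
    for entry in entries:
        if " " not in entry:
            expanded.add(pluralise(entry))
    return expanded


# Invented seed entries, for illustration only.
seed = {"coffee", "berry", "wax", "cassia bark"}
lexicon = expand_lexicon(seed)
print(sorted(lexicon))
```

Multi-word entries such as "cassia bark" are left untouched, which matches the slide's wording that only single-word entries were pluralised.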

SLIDE 36

EVALUATION

SLIDE 37

INTERMEDIATE RESULTS

Lexicon with 20,476 entries and 16,928 concepts. The prototype detected 31,169,104 commodity mentions in 7 billion words. They correspond to 5,841 different commodities (4,466 concepts) and cover 28.5% of the commodities in the lexicon.
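The coverage figure can be reproduced from the numbers on this slide: 5,841 of 20,476 lexicon entries is 28.5%. The short check below also derives a concept-level coverage and a mean number of mentions per entry; those two figures are computed here for illustration and are not stated on the slide.

```python
lexicon_entries = 20_476      # entries in the lexicon
lexicon_concepts = 16_928     # concepts in the lexicon
found_entries = 5_841         # different commodities found in the text
found_concepts = 4_466        # concepts those commodities map to
mentions = 31_169_104         # commodity mentions detected

entry_coverage = 100 * found_entries / lexicon_entries
concept_coverage = 100 * found_concepts / lexicon_concepts   # derived, not on the slide
mentions_per_entry = mentions / found_entries                # derived, not on the slide

print(f"entry coverage:   {entry_coverage:.1f}%")
print(f"concept coverage: {concept_coverage:.1f}%")
print(f"mean mentions per found entry: {mentions_per_entry:,.0f}")
```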

SLIDE 38

LEXICON PRECISION

From the top 1,757 entries only 84 (4.8%) had to be filtered. The top 1,757 entities amount to 99.8% of mentions.

SLIDE 40

IMPROVING RECALL

Bigram analysis to bootstrap further commodities semi-automatically.
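One way such a bigram analysis might work is sketched below: count two-word sequences whose second word is already a known single-word commodity, and propose the frequent ones that are not yet in the lexicon for manual review. This procedure, the function name, the frequency threshold and the sample data are all assumptions; the slide does not describe the method in detail.

```python
import re
from collections import Counter


def candidate_bigrams(text, lexicon, min_count=2):
    """Propose unseen two-word terms that end in a known commodity word.

    Returns (bigram, count) pairs, most frequent first, for manual review.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    known_words = {entry for entry in lexicon if " " not in entry}
    counts = Counter(
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if second in known_words
    )
    return [(bigram, n) for bigram, n in counts.most_common()
            if n >= min_count and bigram not in lexicon]


# Invented text and lexicon, for illustration only.
lexicon = {"oil", "bark", "cassia bark"}
text = ("Palm oil and cassia bark were exported. Palm oil prices rose. "
        "The oil was cheap. Imports of palm oil and cinchona bark grew, "
        "as did cinchona bark exports.")
candidates = candidate_bigrams(text, lexicon)
print(candidates)
```

The frequency threshold drops one-off noise such as "the oil" in this sample, but a human reviewer still has to confirm each candidate, which is what makes the process semi-automatic.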

SLIDE 41

TEXT MINING PERFORMANCE

Difficult to calculate performance through human evaluation of the output or visualisation.

Indirect evaluation using an annotated gold standard:

  • Let a human annotator mark up commodities in 120 documents manually.
  • Compare the annotation against the text mining output.
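The comparison against a gold standard can be sketched as below. Entity mentions are represented as (document, start, end) spans. Strict scoring requires identical spans; the relaxed variant counts any overlap as a match, which is an assumed reading of "relaxed entity boundaries" rather than the project's documented definition. All spans are invented.

```python
def overlaps(a, b):
    """True if two (doc, start, end) spans share a document and at least one character."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def score(gold, predicted, relaxed=False):
    """Precision, recall and F-score (in %) of predicted spans against gold spans."""
    if relaxed:
        correct_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        found_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        correct_pred = found_gold = len(set(gold) & set(predicted))
    precision = 100 * correct_pred / len(predicted) if predicted else 0.0
    recall = 100 * found_gold / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


# Invented spans: (document id, start offset, end offset).
gold = [("d1", 0, 11), ("d1", 20, 26), ("d2", 5, 8), ("d2", 30, 36)]
predicted = [("d1", 0, 11), ("d1", 23, 26), ("d2", 40, 45)]

strict = score(gold, predicted)
relaxed = score(gold, predicted, relaxed=True)
print("strict:  P=%.2f R=%.2f F=%.2f" % strict)
print("relaxed: P=%.2f R=%.2f F=%.2f" % relaxed)
```

In this toy data the second prediction covers only part of a gold mention, so it counts as an error under strict scoring and as a hit under relaxed scoring.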

SLIDE 42

SYSTEM PERFORMANCE

                                          Precision   Recall   F-score
System   Text mining prototype                58.67    55.01     56.78
System   Relaxed entity boundaries            75.53    70.82     73.10
System   Improved text mining                 71.80    67.84     69.76
System   Relaxed entity boundaries            83.17    78.59     80.81
Human    Inter-annotator agreement            71.65    72.19     71.92
Human    Relaxed entity boundaries            79.49    80.10     79.80

Scores for the commodity recognition for 5 collections.

Difference in training

Difference in annotators’ training
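The F-scores in the table above are the harmonic mean of precision and recall, F = 2PR / (P + R). The check below recomputes them from the reported precision and recall. The row labels are a reading of the flattened table (each "relaxed" row is assumed to belong to the row before it), and small differences in the last digit are expected because the reported precision and recall are themselves rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as listed on the slide.
rows = {
    "System: text mining prototype": (58.67, 55.01, 56.78),
    "System: prototype, relaxed boundaries": (75.53, 70.82, 73.10),
    "System: improved text mining": (71.80, 67.84, 69.76),
    "System: improved, relaxed boundaries": (83.17, 78.59, 80.81),
    "Human: inter-annotator agreement": (71.65, 72.19, 71.92),
    "Human: relaxed boundaries": (79.49, 80.10, 79.80),
}

max_diff = 0.0
for name, (p, r, reported) in rows.items():
    computed = f_score(p, r)
    max_diff = max(max_diff, abs(computed - reported))
    print(f"{name:40s} computed {computed:.2f}  reported {reported:.2f}")
```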

SLIDE 45

EFFECTS OF OCR ERRORS

SLIDE 48

BRINGING ARCHIVES ALIVE

SLIDE 49

LESSONS LEARNED

Lexicon construction can be semi-automated but expert knowledge and historians’ input is ideal for optimisation.

Evaluation shows that automatic lexicon construction and text mining are not error free. But even human experts can disagree.

SLIDE 50

LESSONS LEARNED

Importance of two-way collaboration between technology and humanities experts in digital HSS projects, and value of iterative development and rapid prototyping.

Most OCR errors are noise, but HSS scholars need to be made more aware of OCR errors affecting their search results for historical collections.

Geo-referencing text is very important for historical analysis.

SLIDE 51

THANK YOU

Contact: balex@inf.ed.ac.uk
