Beatrice Alex balex@inf.ed.ac.uk Robarts Centre for Canadian - - PowerPoint PPT Presentation




SLIDE 1

Beatrice Alex balex@inf.ed.ac.uk

Robarts Centre for Canadian Studies, York University, Toronto October 11th 2013

Finding Commodities in the Nineteenth Century British World: A collaboration between text miners and historians

SLIDE 2

OVERVIEW

  • Trading Consequences
  • Text mining
  • Lexicon/thesaurus creation
  • Evaluation and fine-tuning

Robarts Centre for Canadian Studies, York University, 11/10/2013

SLIDE 3

TRADING CONSEQUENCES

SLIDE 4

PROJECT OVERVIEW

  • JISC/SSHRC Digging into Data Challenge II
  • Jan 2012 - Dec 2013
  • Text mining, data extraction and information visualisation to explore big historical datasets.
  • Focus on how commodities were traded across the globe in the 19th century.
  • Help historians to discover novel patterns and explore new research questions.

SLIDE 5

PROJECT TEAM

  • Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
  • Colin Coates, Jim Clifford: historical analysis
  • James Reid: data management & integration
  • Aaron Quigley, Uta Hinrichs: information visualisation

SLIDE 6

SLIDE 7

TRADITIONAL HISTORICAL RESEARCH

Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.

Global Fats Supply 1894-98

SLIDE 8

SLIDE 9

SLIDE 10

DOCUMENT COLLECTIONS

Collection               # of Documents   # of Images
HCPP                            118,526     6,448,739
ECO                              83,016     3,938,758
Kew Directors’ Letters           14,340           n/a
Confidential Prints               1,315       140,010
FCOC (partial)                    1,000        41,611
NEW: NCCO AATW                    4,725       948,773 (OCRed: 450,841)

SLIDE 11

SYSTEM

[System architecture diagram. Components: Documents, Text Mining, Annotated Documents, XML2RDB, Commodities RDB, Lexicons & Gazetteers, Query Interface, Visualisation, Commodities Ontology (SKOS)]

SLIDE 12

USER INTERFACE

SLIDE 13

COMMODITY RELATIONS

SLIDE 14

TEXT MINING

SLIDE 15

TEXT MINING

  • Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.
  • Turns unstructured text into structured data (e.g. a relational database or linked data).
  • Is very useful for analysing large text collections automatically (data paralysis).

SLIDE 16

TM IN DIGITAL HISTORY

Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses. Change to traditional history.

SLIDE 17

TEXT MINING

  • TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation and syntactic parsing (chunking).
  • Our focus is on named entity recognition, entity grounding and relation extraction.
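The first two pre-processing steps can be illustrated with a minimal sketch. It assumes simple regular-expression rules; the slide does not say which tools the project used, and the function names `split_sentences` and `tokenise` are hypothetical. The sample text is invented for illustration.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' when
    followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenise(sentence):
    """Naive tokenisation: numbers with separators, words (with internal
    hyphens or apostrophes), or single punctuation marks."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Invented sample text, not taken from the project data.
sample = "Cassia bark was shipped in 1871. About 6,127 piculs went to America."
sentences = split_sentences(sample)
tokens = tokenise(sentences[1])
```

Later steps such as part-of-speech tagging and chunking would consume these tokens; rules this simple will break on abbreviations and on OCR noise.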

SLIDE 18

MINED INFORMATION

Example sentence:

SLIDE 19

MINED INFORMATION

Example sentence:

Normalised and grounded entities:

  • commodity: cassia bark
  • date: 1871 (year=1871)
  • location: Padang (lat=-0.94924;long=100.35427;country=ID)
  • location: America (lat=39.76;long=-98.50;country=n/a)
  • quantity + unit: 6,127 piculs
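One possible way to hold the normalised and grounded entities listed on this slide is sketched below. The class and field names (`Location`, `Quantity`, `parse_quantity`) are assumptions made for illustration and are not the project's schema; the values are those shown on the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # country code; None where the slide shows n/a

@dataclass
class Quantity:
    value: int
    unit: str

def parse_quantity(text):
    """Normalise a string such as '6,127 piculs' into a number and a unit."""
    number, unit = text.split(maxsplit=1)
    return Quantity(value=int(number.replace(",", "")), unit=unit)

# Values copied from the slide.
padang = Location("Padang", -0.94924, 100.35427, "ID")
america = Location("America", 39.76, -98.50, None)
shipment = parse_quantity("6,127 piculs")
```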

SLIDE 20

MINED INFORMATION

Example sentence:

Extracted entity attributes and relations:

  • origin location: Padang
  • destination location: America
  • commodity–date relation: cassia bark – 1871
  • commodity–location relation: cassia bark – Padang
  • commodity–location relation: cassia bark – America
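The relations on this slide can be represented as simple records. The sketch below assumes a same-sentence co-occurrence rule (every commodity is paired with every date and location found in the same sentence); that rule is an assumption for illustration and may differ from the project's relation extractor. `Relation` and `pair_commodity` are hypothetical names.

```python
from collections import namedtuple

Relation = namedtuple("Relation", ["kind", "commodity", "value"])

def pair_commodity(commodity, dates, locations):
    """Pair one commodity mention with the dates and locations
    recognised in the same sentence (assumed co-occurrence rule)."""
    relations = [Relation("commodity-date", commodity, d) for d in dates]
    relations += [Relation("commodity-location", commodity, loc) for loc in locations]
    return relations

relations = pair_commodity("cassia bark", ["1871"], ["Padang", "America"])
```

Origin and destination attributes would need further cues from the sentence (for example prepositions such as "from" and "to"), which this sketch does not attempt.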

SLIDE 21

NOISY DATA

OCR output contains many errors, and the structure of the page layout is often lost. OCR quality depends on:

  • Sophistication of the OCR engine and scanning equipment.
  • Quality of the original print and paper.
  • Use of historical language.
  • Information in page margins (header, page numbers, etc.).
  • Information in tables.
  • Language of the text.

SLIDE 22

FIXING NOISY DATA

Text normalisation and correction:

  • End-of-line soft hyphen removal: dehyphenate all token-splitting hyphens using a dictionary-based approach.
  • “False f”-to-s conversion: convert all false f characters to s using a corpus.

Example: reduced the number of words unrecognised by a spell checker from 61 to 21 (-> 67%); on average a 12% reduction in word error rate in a random sample (Alex et al., 2012).
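The two correction steps can be sketched as follows. This is a simplified illustration, not the method of Alex et al. (2012): it assumes the dictionary and the corpus are each reduced to a plain set of known lower-case words, and the function names are hypothetical.

```python
import re

def dehyphenate(text, dictionary):
    """Join 'left-<newline>right' only when the joined form is a known
    word; otherwise leave the hyphenated form untouched."""
    def join(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in dictionary else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

def fix_false_f(token, vocabulary):
    """Try replacing 'f' characters with 's' and keep the first variant
    that is a known word; tokens that are already known are kept."""
    if token.lower() in vocabulary or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Enumerate every non-empty subset of the 'f' positions.
    for mask in range(1, 2 ** len(positions)):
        chars = list(token)
        for bit, pos in enumerate(positions):
            if mask & (1 << bit):
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in vocabulary:
            return candidate
    return token

words = {"commodity", "sugar", "success", "coffee"}
joined = dehyphenate("com-\nmodity prices", words)
fixed = [fix_false_f(t, words) for t in ["fugar", "fuccefs", "coffee"]]
```

Checking against a word list keeps genuine hyphenated compounds and genuine f-words (such as "coffee") intact, at the cost of missing words the list does not contain.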

SLIDE 23

FIXING NOISY DATA

SLIDE 24

FIXING NOISY DATA

SLIDE 25

HOW NOISY IS TOO NOISY?

qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx

Extract from document 10.2307/60238580 in FCOC.


SLIDE 27

COMMODITY LEXICON CREATION

SLIDE 28

EXTRACTED INFO

Example sentence:

Normalised and grounded entities:

  • commodity: cassia bark [concept: Cinnamomum cassia]
  • date: 1871 (year=1871)
  • location: Padang (lat=-0.94924;long=100.35427;country=ID)
  • location: America (lat=39.76;long=-98.50;country=n/a)
  • quantity + unit: 6,127 piculs

SLIDE 29

SEED SET

Customs import records.

SLIDE 30

SEED SET

SLIDE 31

SEED SET

SLIDE 32

STRUCTURE

  • How should synonyms be represented?
  • How should commodity mentions be grounded?
  • How do we group commodities together by type?

SLIDE 33

SKOS

  • Simple Knowledge Organization System
  • Designed to bridge between:
      thesauri, classifications and legacy KOS
      OWL-based formal ontologies
  • Looser semantics than strict hierarchies

SLIDE 34

EXAMPLE

ex:Cassia_Bark  rdf:type        skos:Concept
ex:Cassia_Bark  skos:prefLabel  “cassia bark”@en
ex:Cassia_Bark  skos:altLabel   “cinnamomum cassia”@en

SLIDE 35

EXAMPLE

ex:Cassia_Bark  rdf:type        skos:Concept
ex:Cassia_Bark  skos:prefLabel  “cassia bark”@en
ex:Cassia_Bark  skos:altLabel   “cinnamomum cassia”@en
ex:Cassia_Bark  skos:broader    ex:Commodity
ex:Mahogany     rdf:type        skos:Concept
ex:Mahogany     skos:prefLabel  “mahogany”@en
ex:Mahogany     skos:broader    ex:Commodity
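To show how SKOS labels could drive commodity lookup, here is a minimal sketch in plain Python. The triples are one reading of the flattened diagram on this slide (including the `skos:broader` links and the spelling "cinnamomum"), and `build_lexicon` is a hypothetical helper, not part of any SKOS library.

```python
# Triples as (subject, predicate, object); language tags are omitted.
SKOS_TRIPLES = [
    ("ex:Cassia_Bark", "rdf:type", "skos:Concept"),
    ("ex:Cassia_Bark", "skos:prefLabel", "cassia bark"),
    ("ex:Cassia_Bark", "skos:altLabel", "cinnamomum cassia"),
    ("ex:Cassia_Bark", "skos:broader", "ex:Commodity"),
    ("ex:Mahogany", "rdf:type", "skos:Concept"),
    ("ex:Mahogany", "skos:prefLabel", "mahogany"),
    ("ex:Mahogany", "skos:broader", "ex:Commodity"),
]

LABEL_PREDICATES = ("skos:prefLabel", "skos:altLabel")

def build_lexicon(triples):
    """Map every preferred or alternative label to its concept, so that
    synonyms ground to the same concept identifier."""
    return {
        obj.lower(): subj
        for subj, pred, obj in triples
        if pred in LABEL_PREDICATES
    }

lexicon = build_lexicon(SKOS_TRIPLES)
```

Because both labels point at `ex:Cassia_Bark`, a mention of either "cassia bark" or "cinnamomum cassia" grounds to the same concept.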

SLIDE 36

LEXICON DEVELOPMENT

  • Concepts labeled by URIs (global IDs)
  • Reuse rather than coin
  • V1: UMBEL (derived from OpenCyc)
  • V2: DBpedia (ontology based on Wikipedia)


SLIDE 38

EXAMPLE

SLIDE 39

EXAMPLE

SLIDE 40

HIERARCHY

[Hierarchy diagram. Levels: root concept, Wikimedia categories, leaf concepts]

SLIDE 41

SIBLING ACQUISITION

[Diagram: DBpedia, with sibling concepts still to be acquired shown as "?"]

SLIDE 42

LEXICON BOOTSTRAPPING

  • Seed lexicon: ~600
  • DBpedia extended lexicon: ~17,000
  • With pluralisation of single word entries: ~20,500
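The pluralisation step can be sketched as below. The suffix rules are a deliberately naive assumption (the slide does not describe the rules used), and `pluralise` and `extend_with_plurals` are hypothetical names.

```python
def pluralise(word):
    """Naive English pluralisation with three suffix rules."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"

def extend_with_plurals(lexicon):
    """Add a plural form for every single-word entry, mapped to the same
    concept; multi-word entries are left as they are."""
    extended = dict(lexicon)
    for term, concept in lexicon.items():
        if " " not in term:
            extended.setdefault(pluralise(term), concept)
    return extended

# Toy lexicon with invented concept identifiers.
seed = {"hide": "ex:Hide", "berry": "ex:Berry", "wax": "ex:Wax",
        "cassia bark": "ex:Cassia_Bark"}
extended = extend_with_plurals(seed)
```

Using `setdefault` means an existing entry is never overwritten by a generated plural.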

SLIDE 43

EVALUATION

SLIDE 44

INTERMEDIATE RESULTS

  • Lexicon with 20,476 entries and 16,928 concepts.
  • Need to evaluate lexicon precision and recall.
  • Frequency distribution of all commodities detected in our data (31,169,104 in 7 billion words).
  • Found 5,841 different commodities (belonging to 4,466 concepts) in the data: 28.5% of commodities in the lexicon.
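The 28.5% figure follows from the counts on this slide. The short calculation below reproduces it; the concept-level coverage and the mention density are derived here from the same counts and are not stated on the slide, and "7 billion" is treated as an exact value only for the sake of the arithmetic.

```python
lexicon_entries = 20_476
lexicon_concepts = 16_928
found_entries = 5_841
found_concepts = 4_466
mentions = 31_169_104
corpus_words = 7_000_000_000  # "7 billion words", taken as a round figure

entry_coverage = found_entries / lexicon_entries        # share of entries seen in the data
concept_coverage = found_concepts / lexicon_concepts    # derived, not on the slide
mentions_per_1000_words = mentions / corpus_words * 1000  # derived, not on the slide

print(f"{entry_coverage:.1%} of lexicon entries found")
print(f"{concept_coverage:.1%} of lexicon concepts found")
print(f"{mentions_per_1000_words:.2f} commodity mentions per 1,000 words")
```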

SLIDE 45

PRECISION/RECALL

  • Precision: How many lexicon entries are actual commodities?
  • Recall: How many commodities does the lexicon already cover?
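For lexicon evaluation, the two measures can be computed over sets, as in this minimal sketch. The toy lexicon and gold list are invented for illustration and are not project data; the function names are hypothetical.

```python
def precision(lexicon, gold):
    """Share of lexicon entries that are actual commodities."""
    return len(lexicon & gold) / len(lexicon) if lexicon else 0.0

def recall(lexicon, gold):
    """Share of actual commodities that the lexicon already covers."""
    return len(lexicon & gold) / len(gold) if gold else 0.0

# Invented toy data: "london" stands for a non-commodity entry.
toy_lexicon = {"sugar", "mahogany", "london"}
toy_gold = {"sugar", "mahogany", "tea", "coal"}

p = precision(toy_lexicon, toy_gold)  # 2 of 3 entries are commodities
r = recall(toy_lexicon, toy_gold)     # 2 of 4 commodities are covered
```

In practice the gold set would come from manual annotation by the historians, since no complete list of commodities exists to compare against.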

SLIDE 46

INTERMEDIATE RESULTS

SLIDE 47

INTERMEDIATE RESULTS

SLIDE 48

INTERMEDIATE RESULTS

SLIDE 49

INTERMEDIATE RESULTS

SLIDE 50

INTERMEDIATE RESULTS

SLIDE 51

INTERMEDIATE RESULTS

SLIDE 52

INTERMEDIATE RESULTS

SLIDE 53

INTERMEDIATE RESULTS

99.8% of mentions