SLIDE 1

LaTeCH 2014, Gothenburg, April 26th 2014

Bootstrapping a historical commodities lexicon with SKOS and DBpedia

Ewan Klein, Beatrice Alex, Jim Clifford

@digtrade

SLIDE 2

PROJECT OVERVIEW

JISC/SSHRC Digging into Data Challenge II, Jan 2012 - Dec 2013.
Text mining, data extraction and information visualisation to explore big historical datasets.
Focus on how commodities were traded across the globe in the 19th century.
Help historians to discover novel patterns and explore new research questions.

SLIDE 3

PROJECT TEAM

Ewan Klein, Beatrice Alex, Claire Grover, Richard Tobin: text mining
Colin Coates, Andrew Watson: historical analysis
Jim Clifford: historical analysis
James Reid, Nicola Osborne: data management, social media
Aaron Quigley, Uta Hinrichs: information visualisation

SLIDE 4

COMMODITY LEXICON CREATION

SLIDE 5

SEED SET

Seed set from customs import records.

SLIDE 8

STRUCTURE

How should synonyms be represented?

donkey ~ ass

How should commodity mentions be grounded?

cinnamon -> Cinnamomum verum

cinnamon -> Cinnamomum cassia

How do we group commodities together by type?

lemons, limes, oranges -> citrus fruit

SLIDE 9

SKOS

Simple Knowledge Organisation System

A W3C initiative for the representation of thesauri, classification schemes, taxonomies etc.
A standard way to represent knowledge organisation systems using the Resource Description Framework.
Looser semantics than strict hierarchies.

SLIDE 10

EXAMPLE

dbp:Cassia_bark rdf:type skos:Concept
dbp:Cassia_bark skos:prefLabel “cassia bark”@en
dbp:Cassia_bark skos:altLabel “cinnamonum cassia”@en
dbp:Cassia_bark skos:broader dbp:Commodity
dbp:Mahogany rdf:type skos:Concept
dbp:Mahogany skos:prefLabel “mahogany”@en
dbp:Mahogany skos:broader dbp:Commodity
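
The graph on this slide can be mirrored in a few lines of Python. This is a minimal sketch rather than the project's implementation: the `Concept` class and its field names are hypothetical and simply follow the SKOS property names shown in the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Concept:
    """One skos:Concept; the class itself is illustrative, not project code."""
    uri: str                                              # e.g. "dbp:Cassia_bark"
    pref_label: str                                       # skos:prefLabel
    alt_labels: List[str] = field(default_factory=list)   # skos:altLabel
    broader: Optional[str] = None                         # skos:broader (parent URI)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return the concept as (subject, predicate, object) triples."""
        out = [(self.uri, "rdf:type", "skos:Concept"),
               (self.uri, "skos:prefLabel", self.pref_label)]
        out += [(self.uri, "skos:altLabel", alt) for alt in self.alt_labels]
        if self.broader is not None:
            out.append((self.uri, "skos:broader", self.broader))
        return out


cassia = Concept("dbp:Cassia_bark", "cassia bark",
                 alt_labels=["cinnamonum cassia"], broader="dbp:Commodity")
mahogany = Concept("dbp:Mahogany", "mahogany", broader="dbp:Commodity")

for triple in cassia.triples() + mahogany.triples():
    print(triple)
```

Keeping `broader` as a plain URI string, rather than a nested object, reflects the looser semantics of SKOS: a concept only points at its parent and does not have to sit in a strict tree.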

SLIDE 11

SEED SET IN SKOS

Concept                   prefLabel            altLabel
dbp:Cork_(material)       cork
dbp:Cornmeal              cornmeal             indian corn meal, corn meal
dbp:Cotton                cotton               cotton fiber
dbp:Cotton_seed           cotton seed
dbp:Cowry                 cowry                cowrie
dbp:Coypu                 coypu                nutria, river rat
dbp:Cranberry             cranberry
dbp:Croton_cascarilla     croton cascarilla    cascarilla
dbp:Croton_oil            croton oil
dbp:Cubeb                 cubeb                cubib, java pepper
dbp:Culm                  culm
dbp:Dammar_gum            dammar gum           gum dammar
dbp:Deer                  deer
dbp:Dipsacus              dipsacus             teasel
dbp:Domestic_sheep        domestic sheep
dbp:Donkey                donkey               ass
dbp:Dracaena_cinnabari    dracaena cinnabari   sanguis draconis, gum dragon's blood
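
One use of preferred and alternative labels is a lookup from surface forms to concepts, which also answers the synonym question (donkey ~ ass). The sketch below uses a handful of rows in the style of the seed set; `SEED_ROWS` and `build_label_index` are illustrative names, not part of the project.

```python
from typing import Dict, List, Tuple

# (concept URI, prefLabel, altLabels); a small illustrative sample.
SEED_ROWS: List[Tuple[str, str, List[str]]] = [
    ("dbp:Donkey", "donkey", ["ass"]),
    ("dbp:Cassia_bark", "cassia bark", ["cinnamonum cassia"]),
    ("dbp:Mahogany", "mahogany", []),
]


def build_label_index(rows: List[Tuple[str, str, List[str]]]) -> Dict[str, str]:
    """Map every lower-cased label (preferred or alternative) to its concept."""
    index: Dict[str, str] = {}
    for uri, pref, alts in rows:
        for label in [pref] + alts:
            index[label.lower()] = uri
    return index


index = build_label_index(SEED_ROWS)
# Synonyms resolve to the same concept URI.
print(index["ass"] == index["donkey"])
```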

SLIDE 12

EXAMPLE

SLIDE 13

SIBLING ACQUISITION

base thesaurus -> category acquisition -> sibling acquisition
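
The stages named on this slide can be sketched over a toy category map. Everything below is assumed for illustration: the `CATEGORIES` dictionary stands in for category membership in an external resource, the function names are hypothetical, and the sketch applies no filtering of noisy categories.

```python
from typing import Dict, List, Set

# Toy stand-in for category membership: category -> member concepts.
CATEGORIES: Dict[str, Set[str]] = {
    "Citrus": {"Lemon", "Lime", "Orange"},
    "Spices": {"Cinnamon", "Cubeb", "Nutmeg"},
    "Livestock": {"Donkey", "Domestic_sheep"},
}


def acquire_categories(seeds: Set[str]) -> Set[str]:
    """Category acquisition: categories containing at least one seed concept."""
    return {cat for cat, members in CATEGORIES.items() if members & seeds}


def acquire_siblings(seeds: Set[str]) -> Set[str]:
    """Sibling acquisition: non-seed members of the acquired categories."""
    siblings: Set[str] = set()
    for cat in acquire_categories(seeds):
        siblings |= CATEGORIES[cat]
    return siblings - seeds


seeds = {"Lemon", "Cubeb"}
categories: List[str] = sorted(acquire_categories(seeds))
siblings: List[str] = sorted(acquire_siblings(seeds))
print(categories)
print(siblings)
```

With two seeds, the sketch pulls in every other member of the two categories they belong to and leaves the unrelated category untouched.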

SLIDE 14

HIERARCHY

root concept
wikimedia categories
leaf concepts

SLIDE 15

LEXICON IN XML

SLIDE 16

LEXICON BOOTSTRAPPING

Seed lexicon: 319 concepts
Extended lexicon: 16,928 concepts
With pluralisation of single-word entries: 20,476 entries
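
Pluralising single-word entries could look like the sketch below. The suffix rules are deliberately naive and are an assumption; the slides do not say which rules were used.

```python
from typing import List


def pluralise(word: str) -> str:
    """Naive English plural; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def expand_entries(entries: List[str]) -> List[str]:
    """Add a plural for each single-word entry; multiword entries stay as-is."""
    expanded = list(entries)
    for entry in entries:
        if " " not in entry:
            plural = pluralise(entry)
            if plural not in expanded:
                expanded.append(plural)
    return expanded


print(expand_entries(["lemon", "cranberry", "cassia bark"]))
```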

SLIDE 17

EVALUATION

SLIDE 18

DOCUMENT COLLECTIONS

Collection                                          # of Documents   # of Images
House of Commons Parliamentary Papers (ProQuest)    118,526          6,448,739
Early Canadiana Online                              83,016           3,938,758
Directors’ Letters of Correspondence (Kew)          14,340           n/a
Confidential Prints (Adam Matthews)                 1,315            140,010
Foreign and Commonwealth Office Collection          1,000            41,611

SLIDE 19

DOCUMENT COLLECTIONS

Over 10 million document pages and over 7 billion word tokens.

SLIDE 20

INTERMEDIATE RESULTS

Lexicon with 20,476 entries and 16,928 concepts.
Need to evaluate lexicon precision and recall.
Commodity recognition using rule-based (context and linguistically sensitive) matching.
Frequency distribution of all commodities detected in our data (31,169,104 mentions in 7 billion words).
Found 5,841 different commodities (belonging to 4,466 concepts) in the data: 28.5% of lexicon entries (26.4% of concepts).
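
The two coverage percentages follow directly from the counts on this slide. The variable names below are illustrative.

```python
lexicon_entries = 20_476
lexicon_concepts = 16_928
found_entries = 5_841     # different commodities found in the data
found_concepts = 4_466    # concepts those commodities belong to

entry_coverage = 100 * found_entries / lexicon_entries
concept_coverage = 100 * found_concepts / lexicon_concepts
print(f"{entry_coverage:.1f}% of entries, {concept_coverage:.1f}% of concepts")
```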

SLIDE 21

EVALUATION

How well does our commodity recognition perform on a random test set?

Indirect evaluation using an annotated gold standard: a human annotator marked up commodities in 120 documents manually, and the annotations were compared against the text mining output.
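
A comparison of this kind is commonly scored with precision, recall and F1 over exact-match mentions. The sketch below assumes each mention is a (document id, start offset, end offset) triple; that representation and the sample values are illustrative, not taken from the project.

```python
from typing import Set, Tuple

Mention = Tuple[str, int, int]  # (document id, start offset, end offset)


def score(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two sets of mentions."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {("doc1", 0, 6), ("doc1", 20, 31), ("doc2", 5, 9), ("doc2", 40, 44)}
# The second prediction has the right start but the wrong end: a boundary error.
predicted = {("doc1", 0, 6), ("doc1", 20, 26), ("doc2", 5, 9)}

precision, recall, f1 = score(gold, predicted)
print(precision, recall, f1)
```

Under exact matching, a boundary error counts once against precision (a spurious span) and once against recall (a missed gold span).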

SLIDE 22

PROTOTYPE EVALUATION

Error analysis showed that lexicon errors and boundary errors affect precision, while boundary errors, OCR errors and spelling variations affect recall.

SLIDE 23

SLIDE 24

LEXICON PRECISION

Of the top 1,757 entries, only 84 (4.8%) had to be filtered.
The top 1,757 entries account for 99.8% of mentions.
Error types: wrong (village account), too general (crop), ambiguous due to OCR error (lime), not in definition (paper).

SLIDE 26

FALSE NEGATIVES

Hand-annotated texts contain 1,107 commodity mentions (506 different entities).
178 entities (683 mentions) are in the first version of the expanded lexicon.
329 terms (424 mentions) are not in the lexicon:

110 (115 mentions) contain OCR errors, approx. 10% of all commodity mentions.
160 commodities are missing.
59 should not be added.

SLIDE 27

IMPROVING RECALL

Bigram analysis to bootstrap further commodities semi-automatically.
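
One possible reading of this step: count bigrams whose second word is already a known commodity head (such as oil) and whose full form is not yet in the lexicon, then pass the frequent ones to a historian for review. This interpretation is an assumption, and the function name, parameters and sample text below are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def candidate_bigrams(tokens: Iterable[str], heads: Set[str],
                      lexicon: Set[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Frequent 'modifier head' bigrams that are not yet lexicon entries."""
    toks = [t.lower() for t in tokens]
    counts = Counter(
        f"{first} {second}"
        for first, second in zip(toks, toks[1:])
        if second in heads
    )
    return [(bigram, n) for bigram, n in counts.most_common()
            if n >= min_count and bigram not in lexicon]


text = ("palm oil and croton oil were shipped ; palm oil prices rose "
        "while croton oil fell and the oil trade grew").split()
candidates = candidate_bigrams(text, heads={"oil"}, lexicon={"croton oil"})
print(candidates)
```

The frequency threshold drops one-off bigrams such as "the oil", and the manual review keeps the process semi-automatic.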

SLIDE 28

IMPROVEMENTS

i. Removing terms based on frequency analysis
ii. Boundary extension rules
iii. Adding terms based on bigram analysis
iv. Combination of i-iii (with new lexicon: 17,247 concepts and 22,723 entries)

SLIDE 29

SYSTEM PERFORMANCE

SLIDE 30

LESSONS LEARNED

SKOS is useful for organising a lexicon.
We developed a method for bootstrapping from a seed set using categorial similarity of other entities.
Expert knowledge and historians’ input were important for optimisation.
Bootstrapping a lexicon and text mining are not error-free (but even human experts can disagree).

SLIDE 31

USER INTERFACE

SLIDE 32

THANK YOU

Website: http://tradingconsequences.blogs.edina.ac.uk/
Demo: http://tcqdev.edina.ac.uk/search/commodity/ , http://tcqdev.edina.ac.uk/vis/tradConVis
Contact: balex@inf.ed.ac.uk

SLIDE 33

TRADITIONAL HISTORICAL RESEARCH

Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998.
Global Fats Supply 1894-98

SLIDE 34

SYSTEM

Components shown in the system diagram:

Documents
Text Mining
Annotated Documents
XML 2 RDB
Commodities RDB
Lexicons & Gazetteers
Query Interface
Visualisation
Commodities Ontology (SKOS)

SLIDE 35

MINED INFORMATION

Example sentence:

Normalised and grounded entities:

commodity: cassia bark [concept: Cinnamomum cassia]
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs

SLIDE 36

MINED INFORMATION

Example sentence:

Extracted entity attributes and relations:

origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
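
The grounded entities and relations from this example could be held in a single record, as sketched below. The class and field names are hypothetical; only the values come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # country code, or None where the slide gives n/a


@dataclass
class CommodityMention:
    text: str               # commodity string as found in the document
    concept: str            # concept the mention is grounded to
    year: int
    origin: Location
    destination: Location
    quantity: int
    unit: str


mention = CommodityMention(
    text="cassia bark", concept="Cinnamomum cassia", year=1871,
    origin=Location("Padang", -0.94924, 100.35427, "ID"),
    destination=Location("America", 39.76, -98.50, None),
    quantity=6127, unit="piculs",
)
print(mention.origin.name, "->", mention.destination.name)
```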

SLIDE 37

SIBLING ACQUISITION

SLIDE 38

EXAMPLE

SLIDE 39

CATEGORY ACQUISITION

base thesaurus -> category acquisition -> sibling acquisition
