Digital history and big data: Text mining historical documents on trade in the British Empire

SLIDE 1

Beatrice Alex balex@inf.ed.ac.uk

Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013

Digital history and big data:

Text mining historical documents on trade in the British Empire

SLIDE 2

OVERVIEW

  • What is text mining?
  • Text mining in digital history
  • Trading Consequences
  • “Big data”
  • Visualisation
  • Challenge of noisy data
  • Collaborating with historians

SLIDE 3

TEXT MINING

Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. relational database or linked data). Is very useful for analysing large text collections automatically.

SLIDE 4

TEXT MINING

Text mining (TM) methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation and syntactic parsing (chunking). Currently our focus is on named entity recognition, entity grounding and relation extraction.
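The first two pre-processing steps can be illustrated with a very small rule-based sketch. It is not the pipeline used in the project; the function names and regular expressions are assumptions made for this example, and the input text is adapted from the Ridley (1912) excerpt used later in the talk.

```python
import re

# Hypothetical, naive rules: a sentence ends at ., ! or ? followed by
# whitespace and a capital letter. Real historical text needs far more care.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is a number (optionally with thousands separators), a word,
# or a single punctuation character.
TOKEN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenise(sentence):
    """Split one sentence into word, number and punctuation tokens."""
    return TOKEN.findall(sentence)


text = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark. "
        "A large portion was shipped to America.")
sentences = split_sentences(text)
tokens = tokenise(sentences[0])
print(sentences)
print(tokens)
```

Keeping "6,127" as a single token matters for the later steps, because a quantity split at the comma could be misread as two numbers.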

SLIDE 5

TM IN DIGITAL HISTORY

Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses. Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data. This is a sea change from the methods used in ‘traditional’ history.

SLIDE 6

“TRADITIONAL” HISTORICAL RESEARCH

Cinchona plantations in George King’s A Manual of Cinchona Cultivation in India (1880). Global Fats Supply 1894-98

SLIDE 7

TRADING CONSEQUENCES

Digging into Data II project (until Dec. 2013).

Edinburgh team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse

SLIDE 8

TRADING CONSEQUENCES


Bea Alex, Timothy Bristow, Jim Clifford, Colin Coates, Ian Fieldhouse, Claire Grover, Uta Hinrichs, Ewan Klein, Clare Llewellyn, Nicola Osborne, Aaron Quigley, James Reid and Richard Tobin

Contact: dig-trade@inf.ed.ac.uk, Twitter: @digtrade Blog: http://tradingconsequences.blogs.edina.ac.uk/

  • Information visualisation
  • Text mining and ontology management
  • Historical analysis & ontology development
  • Data integration & dissemination

From Padang was exported, in 1871, 6,127 piculs of cassia bark, of which a large portion was shipped to America (Fliickiger and Hanbury). ... (excerpt from Spices, Ridley, 1912)

  • Early Canadiana Online
  • AMD Confidential Prints
  • ProQuest’s House of Commons Parliamentary Papers
  • Kew Garden’s Director’s Correspondence Archive
  • JSTOR’s Foreign and Commonwealth Office collection (sample)

Image credits: "Captive Tomes" by traceyp3031 on Flickr; "Library Archives 05" by peteashton on Flickr.

[Diagram: fragment of the commodity ontology. Concepts shown: :Spice, :Cinnamon_Spice, tc:Cassia_Bark. Properties shown: skos:prefLabel, skos:altLabel, skos:narrowerThan. Labels shown: cassia bark, cinnamon, cinnamomum vera, cinnamomum cassia.]

[Diagram: database records for the example sentence.]

Document: id spices1912ridley; docid spices1912ridley; title Spices; url http://archive.org/details/spiceshenry00ridlrich; pubdate 1912; type text; author Ridley, Henry N.; lang eng

Collection: id books; text books

Location: id geonames:1633419; text Padang; latitude -0.94924; longitude 100.35427; geom 0101000020E61000000DFD135CAC165940E3C281902C60EEBF

CommodityMention: id rb5373; preLabel cinamon; text cassia bark; start_word w446990; end_word w446997; date 1871

CommodityMention: id rb5373; text cassia bark; prefLabel cassia bark; altLabel cinnamonum cassia

LocationMention: id rb5370; text Padang; start_word w446944; end_word w446944; in_country Indonesia; gazref geonames:1633419; feature_type populated place; direction origin

DateMention: id rb5371; text 1871; year 1871; month; day

QuantityMention: id rb5372; text 6127 piculs; quantity 6127; unit piculs
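The geom value in the Location record looks like a hex-encoded point in the extended well-known binary (EWKB) layout used by PostGIS. That reading is an assumption, as is joining the fragments from the slide into one string. Under that assumption, a short sketch with the standard `struct` module shows that the value carries the same coordinates as the latitude and longitude fields; the function name is hypothetical.

```python
import struct

# Hex string from the slide with the extraction whitespace removed (assumption).
GEOM_HEX = "0101000020E61000000DFD135CAC165940E3C281902C60EEBF"


def decode_ewkb_point(hex_string):
    """Decode a little-endian EWKB point with an SRID into a small dict."""
    data = bytes.fromhex(hex_string)
    if data[0] != 1:
        raise ValueError("only little-endian input is handled in this sketch")
    # Layout after the byte-order flag: uint32 type, uint32 SRID, two doubles.
    geom_type, srid, x, y = struct.unpack("<IIdd", data[1:25])
    is_point = (geom_type & 0xFF) == 1
    has_srid = bool(geom_type & 0x20000000)
    if not (is_point and has_srid):
        raise ValueError("expected a point geometry with an SRID")
    return {"srid": srid, "longitude": x, "latitude": y}


point = decode_ewkb_point(GEOM_HEX)
print(point)
```

The decoded SRID is 4326 (WGS 84), and the x and y values agree with the longitude and latitude stored for Padang.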

SLIDE 9

TRADING CONSEQUENCES

What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century? Scope: global, but with focus on Canadian natural resources. Example questions:

  • What were the routes and volumes of international trade in resource commodities in the nineteenth century?

SLIDE 10

DOCUMENT COLLECTIONS

Big data for historians:

SLIDE 11

MINED INFORMATION

Example sentence:

SLIDE 12

MINED INFORMATION

Example sentence:

Extracted entities:

  • commodity: cassia bark
  • date: 1871
  • location: Padang
  • location: America
  • quantity + unit: 6,127 piculs
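The entity list above can be reproduced for this one sentence with a toy lexicon-and-pattern extractor. This is only a sketch of the idea, not the project's named entity recogniser: the word lists, the year pattern and the function name are all assumptions made for illustration.

```python
import re

SENTENCE = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark, "
            "of which a large portion was shipped to America")

# Toy lexicons, assumed for this example only.
COMMODITIES = ["cassia bark", "cinnamon"]
LOCATIONS = ["Padang", "America"]
UNITS = ["piculs", "tons", "lbs"]

QUANTITY = re.compile(
    r"\b(\d{1,3}(?:,\d{3})*|\d+)\s+(" + "|".join(map(re.escape, UNITS)) + r")\b"
)
YEAR = re.compile(r"\b1[5-9]\d{2}\b")  # four-digit years 1500-1999


def extract_entities(sentence):
    """Return (type, text, start offset) triples in order of appearance."""
    found = []
    for term in COMMODITIES:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            found.append(("commodity", m.group(0), m.start()))
    for name in LOCATIONS:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", sentence):
            found.append(("location", m.group(0), m.start()))
    quantity_spans = []
    for m in QUANTITY.finditer(sentence):
        found.append(("quantity", m.group(0), m.start()))
        quantity_spans.append(m.span())
    for m in YEAR.finditer(sentence):
        # Do not read part of a quantity as a year.
        if not any(start <= m.start() < end for start, end in quantity_spans):
            found.append(("date", m.group(0), m.start()))
    return sorted(found, key=lambda entity: entity[2])


entities = extract_entities(SENTENCE)
for kind, text, offset in entities:
    print(kind, text, offset)
```

A fixed word list breaks down quickly on real archives (spelling variants, OCR errors, ambiguous names), which is one reason the ontology and the noise-correction steps matter.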

SLIDE 13

MINED INFORMATION

Example sentence:

Normalised and grounded entities:

  • commodity: cassia bark
  • date: 1871 (year=1871)
  • location: Padang (lat=-0.94924;long=100.35427;country=ID)
  • location: America (lat=39.76;long=-98.50;country=n/a)
  • quantity + unit: 6,127 piculs
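Normalisation and grounding can be sketched as lookups and small parsers. Here a hand-written table stands in for a real gazetteer, filled only with the two entries shown on the slide; the function names and the dictionary layout are assumptions for this example.

```python
# Stand-in gazetteer holding only the values shown on the slide.
GAZETTEER = {
    "Padang": {"lat": -0.94924, "long": 100.35427, "country": "ID"},
    "America": {"lat": 39.76, "long": -98.50, "country": None},
}


def ground_location(name):
    """Return coordinates for a place name, or None if it is unknown."""
    return GAZETTEER.get(name)


def normalise_year(text):
    """Turn a year mention such as '1871' into a structured value."""
    return {"year": int(text)}


def normalise_quantity(text):
    """Turn '6,127 piculs' into a numeric amount and a unit."""
    amount, unit = text.split()
    return {"quantity": int(amount.replace(",", "")), "unit": unit}


print(ground_location("Padang"))
print(normalise_year("1871"))
print(normalise_quantity("6,127 piculs"))
```

Returning None for an unknown place keeps ungrounded mentions visible instead of silently assigning them a wrong location.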

SLIDE 14

MINED INFORMATION

Example sentence:

Extracted entity attributes and relations:

  • origin location: Padang
  • destination location: America
  • commodity–date relation: cassia bark – 1871
  • commodity–location relation: cassia bark – Padang
  • commodity–location relation: cassia bark – America
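The attributes and relations above can be approximated for this sentence with two simple rules: a cue word before a place name gives its direction, and every commodity is linked to each date and location in the same sentence. Both rules, and all names in the code, are assumptions for illustration and not the project's relation extractor.

```python
import re

SENTENCE = ("From Padang was exported, in 1871, 6,127 piculs of cassia bark, "
            "of which a large portion was shipped to America")

# Entities as (type, text) pairs, taken from the slide.
ENTITIES = [
    ("location", "Padang"),
    ("date", "1871"),
    ("commodity", "cassia bark"),
    ("location", "America"),
]


def location_direction(sentence, name):
    """Guess origin or destination from the preposition before the name."""
    if re.search(r"\b[Ff]rom\s+" + re.escape(name) + r"\b", sentence):
        return "origin"
    if re.search(r"\bto\s+" + re.escape(name) + r"\b", sentence):
        return "destination"
    return None


def extract_relations(entities):
    """Link each commodity to every date and location in the sentence."""
    commodities = [text for kind, text in entities if kind == "commodity"]
    relations = []
    for commodity in commodities:
        for kind, text in entities:
            if kind == "date":
                relations.append(("commodity-date", commodity, text))
            elif kind == "location":
                relations.append(("commodity-location", commodity, text))
    return relations


directions = {text: location_direction(SENTENCE, text)
              for kind, text in ENTITIES if kind == "location"}
relations = extract_relations(ENTITIES)
print(directions)
print(relations)
```

Linking everything within a sentence over-generates when a sentence mentions several commodities or places, so a real system would need stricter conditions.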

SLIDE 15

COMMODITY ONTOLOGY

[Diagram: fragment of the commodity ontology. Concepts shown: :Spice, :Cinnamon_Spice, tc:Cassia_Bark. Properties shown: skos:prefLabel, skos:altLabel, skos:narrowerThan. Labels shown: cassia bark, cinnamon, cinnamomum vera, cinnamomum cassia.]

SLIDE 16

IMPROVED SEARCH & VISUALISATIONS

SLIDE 17

IMPROVED SEARCH & VISUALISATIONS

SLIDE 18

IMPROVED SEARCH & VISUALISATIONS

Seminar Talk, School of Computing, Dundee, 19/03/2013

SLIDE 19

NOISY DATA

Optical character recognition (OCR) output contains many errors, and the structure of the page layout is often lost.

Factors that affect OCR quality:

  • Sophistication of the OCR engine and scanning equipment
  • Quality of the original print and paper
  • Use of historical language
  • Information in page margins (header, page numbers, etc.)
  • Information in tables
  • Language of the text

SLIDE 20

FIXING NOISY DATA

Text normalisation and correction:

End-of-line soft hyphen removal

Remove all token-splitting hyphens using a dictionary-based approach.

“False f”-to-s conversion

Convert all false f characters (the long s, ſ, misread by OCR as f) to s using a corpus.

Example: the number of words unrecognised by the spell checker fell from 61 to 21 (reported as a 67% reduction); on average, a 12% reduction in word error rate in a random sample (Alex et al., 2012).
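Both correction steps can be sketched with a word list. This is a simplified illustration and not the method of Alex et al. (2012): the tiny lexicon stands in for the dictionary and corpus mentioned on the slide, and the function names are hypothetical.

```python
import re
from itertools import product

# Tiny stand-in lexicon; a real system would use a large dictionary or corpus.
LEXICON = {"cultivation", "possession", "first", "fish", "well", "known"}


def dehyphenate(text, lexicon):
    """Join words split by an end-of-line hyphen when the joined form is known."""
    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in lexicon:
            return left + right          # soft hyphen: remove it
        return left + "-" + right        # probably a real hyphen: keep it
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


def fix_false_f(token, lexicon):
    """Replace f with s where that turns an unknown token into a known word."""
    if token.lower() in lexicon or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of f-to-s swaps; fine for short tokens.
    for choice in product([False, True], repeat=len(positions)):
        if not any(choice):
            continue
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return token


print(dehyphenate("Cinchona culti-\nvation in India", LEXICON))
print(fix_false_f("poffeffion", LEXICON))
print(fix_false_f("firft", LEXICON))
```

Checking candidates against a lexicon is what keeps genuine f words such as "fish" untouched while "firft" becomes "first".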

SLIDE 21

FIXING NOISY DATA

SLIDE 22

FIXING NOISY DATA

SLIDE 23

HOW NOISY IS TOO NOISY?

Extract from document 10.2307/60238580 in FCOC (the Foreign and Commonwealth Office collection):

qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx
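One rough way to put a number on the question is the share of words in a passage that a word list recognises. The sketch below is only an illustration of that idea; the measure, the function name and the tiny lexicon are assumptions, and the talk does not state how a cut-off should be chosen.

```python
import re


def recognised_ratio(text, lexicon):
    """Return the fraction of alphabetic tokens found in the lexicon."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    known = sum(1 for word in words if word.lower() in lexicon)
    return known / len(words)


# Tiny stand-in lexicon; a real check would use a full dictionary.
LEXICON = {"the", "star", "is", "bright"}

clean_score = recognised_ratio("the star is bright", LEXICON)
noisy_score = recognised_ratio("the ftar is bright", LEXICON)
print(clean_score, noisy_score)
```

A passage like the extract above would score close to zero against any English word list, which is the kind of signal that could flag a page for exclusion or rescanning.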

SLIDE 24

HOW NOISY IS TOO NOISY?

Extract from document 10.2307/60238580 in FCOC (the Foreign and Commonwealth Office collection):

qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT 'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx 'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx

SLIDE 25

THE USERS (HISTORIANS)

Involvement of historians:

  • Everything is based on the use cases and built on the users’ hypotheses and research questions.
  • They are responsible for identifying relevant collections and are involved in the ontology development.
  • They provide feedback so that we can improve the technology iteratively:
      • Partners at York use the prototype for their research and track errors.
      • A workshop at CHESS 2013 was held with a group of independent historians.

Clarity of the text mining accuracy is

SLIDE 26

SUMMARY

  • Text mining historical documents in Trading Consequences.
  • Processing “big data”.
  • Power of visualising structured data.
  • Fixing noisy data.
  • Importance of two-way collaboration between technology experts and users in digital history.

SLIDE 27

THANK YOU

Questions? Fire away or contact me at: balex@inf.ed.ac.uk
