Text Mining and Historical Research, by Beatrice Alex


  1. Text Mining and Historical Research Beatrice Alex balex@inf.ed.ac.uk MSc Historical Research, University of Edinburgh, March 21st 2014 Friday, 21 March 2014

  2. OVERVIEW What is text mining? Types of text analyses. Trading Consequences: text mining applied.

  3. TEXT MINING Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. relational database or linked data). Is very useful for analysing large text collections automatically (overcoming data paralysis). Goal in DHSS research: by analysing large amounts of textual data, help HSS scholars to discover novel patterns and explore hypotheses.

  4. MINING WHAT TEXT? Electronic text or things that can be turned into it. Born-electronic text (research papers, literature, tweets, blogs, comments on blogs etc.). Digitised text documents. Metadata (collection and document level). Image subtitles (Flickr image titles and subtitles). Video/audio transcripts (YouTube transcripts, TED talks, MOOC transcripts etc.).

  5. TYPES OF ANALYSES Named entity recognition. Grounding, e.g. geo-referencing. Relation extraction. Clustering, e.g. topic modelling. Sentiment analysis.

  6. NAMED ENTITY RECOGNITION Identification and classification of entity mentions in text, such as names of persons, locations, organisations etc., and dates, amounts etc. Often used for improving access to collections.
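
As a rough illustration of identifying and classifying entity mentions, here is a minimal rule-based sketch in Python. It is not the method used by any project named in these slides: the tiny `GAZETTEER`, the year pattern and the function name `find_entities` are all assumptions made for the example, and real systems typically rely on much larger resources or trained models.

```python
import re

# Hypothetical toy gazetteer; a real system would use far larger resources
# or a trained statistical model.
GAZETTEER = {
    "Edinburgh": "location",
    "Padang": "location",
    "Old Bailey": "organisation",
}

# Assumed pattern: four-digit years from 1000 to 1999.
YEAR = re.compile(r"\b1[0-9]{3}\b")


def find_entities(text):
    """Return (start, end, surface form, type) tuples sorted by position."""
    found = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), m.group(), etype))
    for m in YEAR.finditer(text):
        found.append((m.start(), m.end(), m.group(), "date"))
    return sorted(found)


entities = find_entities("Cassia bark was shipped from Padang in 1871.")
for start, end, surface, etype in entities:
    print(surface, etype)
```

Each mention keeps its character offsets, which is what allows a collection interface to highlight or index the entity in place.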

  7. CONNECTED HISTORIES http://www.connectedhistories.org

  8. OLD BAILEY ONLINE http://www.oldbaileyonline.org

  9. GROUNDING Linking entity mentions in text to a unique identifier, e.g.:
      Person names to their Wikipedia pages
      Location names to lat/longs or Geonames IDs
      Gene names to gene ontologies
      The goal is to disambiguate between mentions with the same surface form (e.g. “Paris”, “Victoria”).
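
One simple way to picture disambiguation of a place name is to choose the candidate that lies closest to places already grounded in the same document. The sketch below assumes that heuristic; it is not a description of how the Edinburgh Geoparser or any other tool works. The candidate identifiers are made up, and all coordinates are approximate values supplied for illustration.

```python
import math

# Hypothetical candidate entries with approximate coordinates; a real
# gazetteer such as Geonames would supply identifiers and many more fields.
CANDIDATES = {
    "Paris": [
        {"id": "paris-france", "lat": 48.86, "long": 2.35},
        {"id": "paris-texas", "lat": 33.66, "long": -95.56},
    ],
}


def distance_km(a, b):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a["lat"], a["long"], b["lat"], b["long"])
    )
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def ground(mention, context_places):
    """Pick the candidate closest on average to already-grounded places.

    Assumes context_places is a non-empty list of lat/long dictionaries.
    """
    def mean_distance(candidate):
        total = sum(distance_km(candidate, p) for p in context_places)
        return total / len(context_places)
    return min(CANDIDATES[mention], key=mean_distance)


dallas = {"lat": 32.78, "long": -96.80}  # approximate
london = {"lat": 51.51, "long": -0.13}   # approximate

print(ground("Paris", [dallas])["id"])
print(ground("Paris", [london])["id"])
```

The same surface form resolves differently depending on its context, which is the point of grounding to a unique identifier rather than to the string itself.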

  10. EDINBURGH GEOPARSER

  11. RELATION EXTRACTION Identifying relations between entities in text or in metadata. Triples:
      person - author_of -> book title
      commodity - traded_at -> location
      person - born_in -> location
      person - born_at -> date
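
The triple patterns on this slide map naturally onto a small data structure. The sketch below is only an illustration of that shape: the `Triple` type, the helper `objects_of` and the sample rows are assumptions for the example, not output from any system described in the talk.

```python
from collections import namedtuple

# A relation is stored as subject - predicate -> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Illustrative rows only; they follow the patterns named on the slide.
triples = [
    Triple("Walter Scott", "author_of", "Waverley"),
    Triple("Walter Scott", "born_in", "Edinburgh"),
    Triple("cassia bark", "traded_at", "Padang"),
]


def objects_of(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return [
        t.object
        for t in triples
        if t.subject == subject and t.predicate == predicate
    ]


print(objects_of("Walter Scott", "born_in"))
print(objects_of("cassia bark", "traded_at"))
```

Once relations are held in this form they can be loaded into a relational database or published as linked data, the two structured targets mentioned earlier in the talk.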

  12. RELATION EXTRACTION ChartEX, Discovering spatial descriptions and relationships in medieval charters, October 2013.

  15. SUMMARY SO FAR Different types of text analysis have been applied to historical and literary research, but new opportunities are endless. Text mining can assist scholars in their research but it is not replacing them! Traditional scholarship is well suited to close reading. HSS scholars can focus on questions which can be answered by computational methods. Human interpretation is vital. Visualisation of TM output is important.

  17. PROJECT TEAM
      Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining
      Colin Coates, Andrew Watson: historical analysis
      Jim Clifford: historical analysis
      James Reid, Nicola Osborne: data management, social media
      Aaron Quigley, Uta Hinrichs: information visualisation

  18. TRADITIONAL HISTORICAL RESEARCH Global Fats Supply 1894-98; Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v. XII, 1998.

  19. PROJECT GOALS Text mining, data extraction and information visualisation to explore big historical datasets. Focus on how commodities were traded across the globe in the 19th century. Help historians to discover novel patterns and explore new research questions.

  20. DOCUMENT COLLECTIONS
      Collection                                          # of Documents   # of Images
      House of Commons Parliamentary Papers (ProQuest)    118,526          6,448,739
      Early Canadiana Online                              83,016           3,938,758
      Directors’ Letters of Correspondence (Kew)          14,340           n/a
      Confidential Prints (Adam Matthews)                 1,315            140,010
      Foreign and Commonwealth Office Collection          1,000            41,611
      Asia and the West (Gale)                            4,725            948,773 (OCRed: 450,841)

  21. DOCUMENT COLLECTIONS Across these collections: over 10 million document pages, over 7 billion word tokens.

  22. MINED INFORMATION Normalised and grounded entities for the example sentence:
      commodity: cassia bark [concept: Cinnamomum cassia]
      date: 1871 (year=1871)
      location: Padang (lat=-0.94924;long=100.35427;country=ID)
      location: America (lat=39.76;long=-98.50;country=n/a)
      quantity + unit: 6,127 piculs

  23. MINED INFORMATION Extracted entity attributes and relations for the example sentence:
      origin location: Padang
      destination location: America
      commodity–date relation: cassia bark – 1871
      commodity–location relation: cassia bark – Padang
      commodity–location relation: cassia bark – America
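
To make the shape of this mined output concrete, the sketch below stores the grounded entities from the example as structured records and derives the commodity relations from them. The values come from the slides; the class name `Location`, the field names and the tuple layout of `relations` are assumptions for illustration and do not reflect the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    name: str
    lat: float
    long: float
    country: Optional[str]  # None where the slide gives country=n/a


# Values as given on the slides; the record layout is hypothetical.
commodity = {"mention": "cassia bark", "concept": "Cinnamomum cassia"}
year = 1871
origin = Location("Padang", -0.94924, 100.35427, "ID")
destination = Location("America", 39.76, -98.50, None)

# Derive one commodity-date relation and one commodity-location relation
# per grounded location.
relations = [("commodity-date", commodity["mention"], str(year))] + [
    ("commodity-location", commodity["mention"], loc.name)
    for loc in (origin, destination)
]

for relation in relations:
    print(relation)
```

Records of this kind are what turn a sentence in a digitised document into rows that can be queried, counted and plotted on a map.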

  24. COMMODITY LEXICON Seed set from customs import records.

  25. LEXICON CREATION
      Seed lexicon: ~600
      Extended lexicon: ~17,000
      With pluralisation of single word entries: ~20,500
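
The pluralisation step can be pictured as adding a plural form for every single word entry while leaving multi-word entries unchanged. The sketch below assumes a deliberately naive set of English plural rules; the functions `pluralise` and `extend_lexicon` and the sample entries are hypothetical and are not the rules used to build the project lexicon.

```python
def pluralise(word):
    """Naive English pluralisation; irregular nouns are not handled."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def extend_lexicon(entries):
    """Add a plural form for each single word entry."""
    extended = set(entries)
    for entry in entries:
        if " " not in entry:
            extended.add(pluralise(entry))
    return extended


lexicon = extend_lexicon(["spice", "berry", "cassia bark"])
print(sorted(lexicon))
```

Because only single word entries gain a plural, the lexicon grows by less than a factor of two, which is consistent with the step from roughly 17,000 to roughly 20,500 entries.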

  26. LEXICON PRECISION ... From the top 1,757 entries only 84 (4.8%) had to be filtered. The top 1,757 entities amount to 99.8% of mentions.
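
The 4.8% figure follows directly from the two counts on the slide, and the same counts give the share of top entries that were kept. A short check, with variable names chosen for this example only:

```python
top_entries = 1757   # most frequent lexicon entries that were inspected
filtered = 84        # entries that had to be removed

filtered_pct = round(100 * filtered / top_entries, 1)
kept_pct = round(100 * (top_entries - filtered) / top_entries, 1)

print(filtered_pct, kept_pct)
```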

  28. OCR ERRORS Extract of Early Canadiana Online document 9_00952_3, p. vi.

  33. BRINGING ARCHIVES ALIVE

  34. SUMMARY You have access to enormous amounts of data. Text mining can be applied to process large text collections, enrich existing text with information or pull out trends which can be visualised. Text mining is a way to enable distant reading, even if such technology is not 100% accurate. OCR errors in digitised collections can skew your results.