Data Mining the 1918 Influenza Pandemic: An Introduction to the Epidemiology of Information



slide-1
SLIDE 1

Data Mining the 1918 Influenza Pandemic: An Introduction to the Epidemiology of Information

Partner Institutions: Virginia Tech and the University of Toronto

Presentation for the Joint Conference on Digital Libraries
Washington DC, June 12, 2012

Tom Ewing (etewing@vt.edu)
Virginia Tech

Kansas City Star, October 10, 1918, p. 8

slide-2
SLIDE 2

Kansas City Star, October 10, 1918, p. 8
Accessed from Newsbank America’s Historical Newspapers Collection

slide-3
SLIDE 3

An Epidemiology of Information

Research Questions

– Influenza and the War, Fall 1918
– Tracking the Flow of Information
– Prevention, Treatment, and Cure

slide-4
SLIDE 4

Big Data

100,000 PLUS newspaper articles about the influenza pandemic in the United States and Canada, 1917-1919

Data Sources

– Chronicling America
– Peel’s Prairie Provinces
– Readex Newsbank America’s Historical Newspapers *
– Proquest Historical Newspapers *
– Georgia / California Newspaper Projects
– Newspapers Not (Yet) Digitized (Microfilm)

* Subscription only

slide-5
SLIDE 5

Articles (Key word: “influenza”)

Database (Titles)                         1917-1919   Just 1918
Chronicling America                          12,365       6,389
Peel’s Prairie Provinces                      2,147       1,212
Newsbank America’s Historical Newspapers     51,929      31,717
Proquest:
  New York Times                              9,304       3,518
  Washington Post                             1,545       1,069
  San Francisco Chronicle                     1,366         914
  Los Angeles Times                          13,033       1,970
  Chicago Tribune                             3,430       1,455
  Atlanta Constitution                        1,772         931
  Baltimore Sun                               3,586       1,639
  Boston Globe                                1,440         843
Georgia Newspaper Project                       669         517
California Newspaper Project                    203         123
Totals                                      102,789      52,174

slide-6
SLIDE 6

Project Team

Principal Investigators:

– Tom Ewing, Department of History (VT)
– Bernice Hausman, Department of English (VT)
– Bruce Pencek, University Libraries (VT)
– Naren Ramakrishnan, Dept of Computer Science (VT)
– Gunther Eysenbach, Centre for Global eHealth Innovation at University Health Network (UT)

Graduate Research Assistants

– Samah Gad, Dept of Computer Science (VT)
– Michelle Seref, Department of English (VT)
– Laura West, Department of History (VT)

slide-7
SLIDE 7

Kansas City Star, October 10, 1918, p. 8

slide-8
SLIDE 8

Washington Star, October 10, 1918, p. 1

slide-9
SLIDE 9

Evening Times (Washington) October 10, 1918, p. 6

slide-10
SLIDE 10

Evening Missourian, Oct 10, 1918, p. 1

slide-11
SLIDE 11

Evening Public Ledger (Philadelphia), October 10, 1918, p. 1

slide-12
SLIDE 12

Atlanta Constitution, October 10, 1918

slide-13
SLIDE 13

Method: Manual Text Analysis

  • Step One: Identify Relevant Articles

– Key word search within date parameters, or
– Review every issue of newspaper on microfilm

  • Step Two: Read and analyze each article

– Identify main themes
– Use textual evidence to address research questions
– Make comparisons across time / context

  • Step Three: Develop interpretation

– Rhetoric: how does the language convey meaning?
– History: how does the text represent experience?
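Step One's key word search within date parameters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Article` record, the `find_relevant` helper, and the sample records are all hypothetical, and the databases named in these slides each have their own search interfaces.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical record layout; real newspaper databases expose different fields.
    newspaper: str
    published: date
    text: str


def find_relevant(articles, keyword, start, end):
    """Return articles that mention `keyword` and were published within [start, end]."""
    needle = keyword.lower()
    return [
        a for a in articles
        if start <= a.published <= end and needle in a.text.lower()
    ]


# Invented sample records, for illustration only.
sample = [
    Article("Paper A", date(1918, 10, 10), "Influenza closes schools and theaters."),
    Article("Paper A", date(1918, 10, 11), "War bond drive continues."),
    Article("Paper B", date(1919, 1, 5), "Influenza cases decline."),
]

hits = find_relevant(sample, "influenza", date(1918, 9, 15), date(1918, 11, 15))
print(len(hits))
```

A plain substring match like this only covers Step One; the reading and interpretation in Steps Two and Three remain manual.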

slide-14
SLIDE 14

The Digging into Data Challenge

How can big data analysis enhance the methods of manual text analysis, with the goals of 1) contributing to understanding of the 1918 influenza epidemic; and 2) contributing to developing new methods of data mining applicable to other pandemics?

slide-15
SLIDE 15

Chart: Results for “influenza” in Kansas City Star, September 15-November 15, 1918
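A count of results per day, as charted for this slide, could be tallied along the following lines. This is a minimal sketch, assuming the search hits are already available as a list of publication dates; the `daily_counts` helper and the sample dates are hypothetical and are not the project's data.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(hit_dates, start, end):
    """Count hits per day in [start, end], keeping days with zero hits."""
    tally = Counter(d for d in hit_dates if start <= d <= end)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=n) for n in range(n_days)]
    return {day: tally.get(day, 0) for day in days}


# Invented publication dates, for illustration only.
hit_dates = [date(1918, 10, 10), date(1918, 10, 10), date(1918, 10, 12)]

counts = daily_counts(hit_dates, date(1918, 10, 9), date(1918, 10, 12))
for day, n in counts.items():
    print(day.isoformat(), n)
```

Keeping the zero-hit days matters for a time series, since a missing day and a day with no coverage would otherwise look the same.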

slide-16
SLIDE 16

Topic Clouds, for Washington Times Newspaper

slide-17
SLIDE 17

Tone Analysis of Advertisements with word “influenza,” Washington Times and Kansas City Star, October 1918 (n=104)

Chart values: 22, 10, 8, 21, 9, 62, 27, 27, 8

Chart categories: Alarmist; Warning; Encouraging; Patriotic / war; Explanatory / Information; Prevention / Preparation; Selling other products; Cure / Treatment; Recovery

slide-18
SLIDE 18

Building a tone detection algorithm

Categories used to Describe Tone for Paragraphs before/after word “influenza”:

  • Alarmist/Fear
  • Warning/Caution
  • Encouraging/Exhorting
  • Patriotic/War
  • Explanatory/information
  • Prevention/Preparation
  • Humorous/Anecdotal
  • Cure/Treatment of Symptoms
  • Recovery
  • Incomprehensible
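The slides do not describe how the tone detection algorithm works, so the following is only a hypothetical sketch of one simple approach: collect the paragraphs before and after a mention of “influenza” and tag each with any category whose cue words appear in it. The cue-word lists, `tag_tones`, and `context_paragraphs` are all invented for illustration, and only a few of the categories above are covered.

```python
import re

# Hypothetical cue words chosen only for illustration; not the project's lexicon.
TONE_CUES = {
    "Alarmist/Fear": {"deadly", "dread", "panic", "scourge"},
    "Warning/Caution": {"avoid", "beware", "warning", "caution"},
    "Prevention/Preparation": {"prevent", "prevention", "mask", "ventilation"},
    "Cure/Treatment of Symptoms": {"cure", "remedy", "treatment", "relief"},
    "Recovery": {"recovered", "recovering", "convalescent"},
}


def tag_tones(paragraph):
    """Return categories whose cue words occur in the paragraph, most hits first."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    scores = {tone: sum(w in cues for w in words) for tone, cues in TONE_CUES.items()}
    matched = [tone for tone, score in scores.items() if score > 0]
    return sorted(matched, key=lambda tone: (-scores[tone], tone))


def context_paragraphs(paragraphs, keyword="influenza"):
    """Yield each paragraph containing the keyword together with its neighbours."""
    for i, paragraph in enumerate(paragraphs):
        if keyword in paragraph.lower():
            yield paragraphs[max(i - 1, 0): i + 2]


# Invented text, for illustration only.
paragraphs = [
    "City officials issued a warning to avoid crowds.",
    "Influenza has reached the camps.",
    "A simple remedy gives relief, druggists say.",
]

for window in context_paragraphs(paragraphs):
    for p in window:
        print(tag_tones(p), "|", p)
```

A word-list tagger like this would miss irony, period idiom, and OCR errors, which is one reason hand-coded categories are a useful starting point for training or checking any automated method.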

slide-19
SLIDE 19

Major Challenges

  • Matching research questions to practices across disciplines (History, Rhetoric, Public Health, Computer Sciences, Information Sciences)

  • Inconsistency in data due to OCR procedures
  • Gaps in digitized newspapers across databases
  • Incorporating newspapers that are not digital
  • Defining a consistent methodology that yields meaningful results
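The OCR inconsistency listed above means an exact search for “influenza” can miss misread spellings. One possible mitigation, shown here only as a sketch and not as the project's method, is approximate matching with Python's standard `difflib` module. The `ocr_tolerant_hits` helper, the 0.8 cutoff, and the sample sentence are assumptions made for illustration.

```python
import difflib
import re


def ocr_tolerant_hits(text, keyword="influenza", cutoff=0.8):
    """Return tokens similar enough to `keyword` to be plausible OCR variants."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [
        token for token in tokens
        if difflib.SequenceMatcher(None, token, keyword).ratio() >= cutoff
    ]


# Invented sentence with two simulated OCR misreadings, for illustration only.
sample = "The inflnenza epidemic spreads; lnfluenza cases rise, influence of war."
print(ocr_tolerant_hits(sample))
```

The cutoff is a trade-off: a lower value recovers more damaged spellings but starts admitting genuine words such as “influence,” so any threshold would need checking against hand-reviewed pages.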

slide-20
SLIDE 20

Next steps

  • Develop methods to identify tone in articles about influenza from Chronicling America
  • Expand applications across newspaper databases
  • Refine methods for content analysis using data mining techniques
  • Apply methods to track discussion of disease in contemporary forms of media (Facebook, Twitter, Google)

slide-21
SLIDE 21

Roanoke Times, October 27, 1918, p. 2

Data Mining the 1918 Influenza Pandemic: An Introduction to the Epidemiology of Information

Partner Institutions: Virginia Tech and the University of Toronto

Contact information: Tom Ewing etewing@vt.edu