Why Is It Difficult to Detect Outbreaks in Twitter?

slide-1
SLIDE 1

Why Is It Difficult to Detect Outbreaks in Twitter?

Avaré Stewart, Nattiya Kanhabua, Sara Romano, Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl

L3S Research Center / Leibniz Universität Hannover, Germany
SIGIR 2013 Workshop on Health Search and Discovery
1 August 2013, Dublin, Ireland

slide-2
SLIDE 2

Motivation

  • Numerous works use Twitter to infer the existence and magnitude of real-world events in real time

– Earthquake [Sakaki et al., 2010]
– Predicting financial time series [Ruiz et al., 2012]
– Influenza epidemics [Culotta, 2010; Lampos et al., 2011; Paul et al., 2011]

slide-3
SLIDE 3

Early Warnings

slide-4
SLIDE 4

Health related tweets

  • User status updates or news related to public health are common in Twitter

– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in #Kenya http://t.co/....
– As many as 16 people have been found infected with Anthrax in Shahjadpur upazila of the Sirajganj district in Bangladesh.

slide-5
SLIDE 5

Matching Tweets

[Kanhabua et al., CIKM’12]

slide-6
SLIDE 6

Matching Tweets

[Kanhabua et al., CIKM’12]

slide-7
SLIDE 7

Twitter vs. Official Source

slide-8
SLIDE 8

M-Eco System

Medical Ecosystem: Personalized Event-based Surveillance

http://www.meco-project.eu/

slide-9
SLIDE 9

Data Collection

  • Official outbreak reports

– ~3,000 ProMED-mail reports from 2011
– WHO reports have very small coverage

  • Twitter data

– ~1,200 health-related terms (i.e., infectious diseases, their synonyms, pathogens, and symptoms)
– Over 112 million tweets from 2011

  • A series of NLP tools, including

– OpenNLP (tokenization, sentence splitting, POS tagging)
– OpenCalais (named entity recognition)
– HeidelTime (temporal expression extraction)

slide-10
SLIDE 10

Ground Truths

[Kanhabua et al., TAIA’ 12]

slide-11
SLIDE 11

Event Extraction

  • An event is a sentence containing two entities

– (1) a medical condition and (2) a geographic expression
– This is the minimum requirement set by domain experts

  • A victim and the time of an event can be identified from the sentence itself or from its surrounding context

  • Output: a set of event candidates

Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012

[Kanhabua et al., TAIA’ 12]
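The event definition above (a sentence that mentions both a medical condition and a geographic expression) can be illustrated with a minimal sketch. It assumes two tiny hand-made word lists and a regex-based sentence splitter in place of the OpenNLP and OpenCalais tools named in the talk; the names `MEDICAL_CONDITIONS`, `LOCATIONS`, and `extract_event_candidates` are hypothetical and not taken from the slides.

```python
import re

# Hypothetical mini-gazetteers; a real system would use a named entity
# recognizer and a much larger vocabulary.
MEDICAL_CONDITIONS = {"ebola", "cholera", "anthrax", "mumps"}
LOCATIONS = {"uganda", "kenya", "bangladesh", "dadaab"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_event_candidates(text):
    """Return sentences that mention both a medical condition and a location."""
    candidates = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        conditions = sorted(tokens & MEDICAL_CONDITIONS)
        locations = sorted(tokens & LOCATIONS)
        # Minimum requirement: both entity types occur in the same sentence.
        if conditions and locations:
            candidates.append(
                {"sentence": sentence, "conditions": conditions, "locations": locations}
            )
    return candidates


if __name__ == "__main__":
    report = "WHO reported an ongoing Ebola outbreak in Uganda. I have the mumps."
    for candidate in extract_event_candidates(report):
        print(candidate)
```

Only the first sentence of the sample text qualifies, because the second one names a condition but no place. Victim and time extraction from the surrounding context are left out of this sketch.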

slide-12
SLIDE 12

Message Filtering: Challenges

  • Ambiguity

– terms have several meanings
– terms are used in different contexts

  • Incompleteness

– missing or under-reported events
– data processing errors

slide-13
SLIDE 13

Message Filtering: Challenges

  • Ambiguity

– terms have several meanings
– terms are used in different contexts

  • Incompleteness

– missing or under-reported events
– data processing errors

Category    Example tweet
Literature  A two hour train journey, Love In the Time of Cholera ...
Music       Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
Marketing   Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self -testers.
General     Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
Negative    i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
Joke        Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
slide-14
SLIDE 14

Challenge I. Noisy/evolving

  • Evolving data

– Relevant features change over time

slide-15
SLIDE 15

Challenge I. Noisy/evolving

slide-16
SLIDE 16

Approach for Noisy Data

  • MedISys¹

– provides a list of negative keywords created by medical experts

  • Urban Dictionary²

– a Web-based dictionary of slang, ethnic culture words or phrases

¹ http://medusa.jrc.it/medisys/homeedition/en/home.html
² http://www.urbandictionary.com/
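A negative-keyword filter of the kind these resources support might look like the following sketch. The keyword lists are invented for illustration and are not the MedISys or Urban Dictionary lists; `HEALTH_TERMS`, `NEGATIVE_KEYWORDS`, and `is_health_candidate` are hypothetical names.

```python
import re

# Hypothetical lists. The talk draws negative keywords from MedISys and
# slang from Urban Dictionary; these few entries only illustrate the idea.
HEALTH_TERMS = {"cholera", "mumps", "fever", "hepatitis", "influenza"}
NEGATIVE_KEYWORDS = {"bieber", "love in the time of cholera", "mixed by"}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_health_candidate(tweet):
    """Keep a tweet if it has a health term and no negative keyword."""
    text = normalize(tweet)
    tokens = set(re.findall(r"[a-z]+", text))
    if not tokens & HEALTH_TERMS:
        return False
    # Negative keywords may be multi-word phrases, so match on the full text.
    return not any(negative in text for negative in NEGATIVE_KEYWORDS)


if __name__ == "__main__":
    tweets = [
        "I have the mumps...am I alone?",
        "Thought I had Bieber Fever.",
        "A two hour train journey, Love In the Time of Cholera ...",
    ]
    for tweet in tweets:
        print(is_health_candidate(tweet), tweet)
```

A static list like this handles the literature and joke categories but cannot follow vocabulary that changes over time, which is the evolving-data problem raised earlier in the talk.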

slide-17
SLIDE 17

Approach for Noisy Data

¹ http://medusa.jrc.it/medisys/homeedition/en/home.html
² http://www.urbandictionary.com/

slide-18
SLIDE 18

[Kanhabua and Nejdl, WOW’ 13]

slide-19
SLIDE 19

[Kanhabua and Nejdl, WOW’ 13]

slide-20
SLIDE 20

Approach for Feature Changes

slide-21
SLIDE 21

Signal Generation: Challenges

  • Temporal Dynamics

– seasonal infectious diseases
– rare and spontaneous outbreaks

  • Location Dynamics

– frequency and duration
– levels of prevalence or severity

slide-22
SLIDE 22

Signal Generation: Challenges

  • Temporal Dynamics

– seasonal infectious diseases
– rare and spontaneous outbreaks

  • Location Dynamics

– frequency and duration
– levels of prevalence or severity

[Rortais et al., 2010 in Journal of Food Research International]

slide-23
SLIDE 23

Signal Generation: Challenges

  • Temporal Dynamics

– seasonal infectious diseases
– rare and spontaneous outbreaks

  • Location Dynamics

– frequency and duration
– levels of prevalence or severity

slide-24
SLIDE 24

Signal Generation: Challenges

[Emch et al., 2008 in International Journal of Health Geographics]

slide-25
SLIDE 25

Outbreak Categorization

slide-26
SLIDE 26

Outbreak Categorization

How to generate a reliable signal for low aggregate counts?

slide-27
SLIDE 27

Approach

[Kanhabua and Nejdl, WOW’ 13]

slide-28
SLIDE 28

Temporal Diversity

  • Refined Jaccard Index (RDJ-index)

– average Jaccard similarity of all object pairs

$$RDJ = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$$

where $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$ is the Jaccard similarity

  • Note: lower RDJ corresponds to higher diversity
  • Problem: “All-Pair comparison”
  • Solution: Estimation algorithms with probabilistic error bound guarantees

[Deng et al., CIKM’ 12]

slide-29
SLIDE 29

Temporal Diversity

  • Refined Jaccard Index (RDJ-index)

– average Jaccard similarity of all object pairs

$$RDJ = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$$

where $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$ is the Jaccard similarity

  • Note: lower RDJ corresponds to higher diversity
  • Problem: “All-Pair comparison”
  • Solution: Estimation algorithms with probabilistic error bound guarantees

[Deng et al., CIKM’ 12]

Objects are represented by (1) Top-k terms (2) Entities
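To make the RDJ-index concrete, here is a small sketch that computes it by all-pair comparison over objects represented as sets (for example, the top-k terms of each tweet). The function `rdj_sampled` is a plain random-pair estimate added only to show why sampling avoids the quadratic cost; it is not the estimation algorithm of Deng et al. and carries no error bound. All function names are assumptions.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def rdj_exact(objects):
    """Average Jaccard similarity over all pairs; needs n*(n-1)/2 comparisons."""
    n = len(objects)
    if n < 2:
        raise ValueError("RDJ needs at least two objects")
    total = sum(jaccard(a, b) for a, b in combinations(objects, 2))
    return 2.0 * total / (n * (n - 1))


def rdj_sampled(objects, num_pairs, seed=0):
    """Naive estimate of RDJ from randomly drawn pairs of distinct objects."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(objects)), 2)
        total += jaccard(objects[i], objects[j])
    return total / num_pairs


if __name__ == "__main__":
    # Each set stands for the top-k terms of one tweet (made-up values).
    tweets = [{"cholera", "kenya"}, {"kenya", "camp"}, {"cholera", "kenya"}]
    print(rdj_exact(tweets))
```

For the three sample sets the pairwise similarities are 1/3, 1, and 1/3, so the exact index is 5/9. A lower value would indicate a more diverse set of tweets.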

slide-30
SLIDE 30

Threat Assessment: Challenge

  • Users are overwhelmed by the large number of tweets
slide-31
SLIDE 31

Approach

  • Personalized Tweet Ranking for Epidemic Intelligence

– Learning to rank and recommender systems
– User's context as implicit criteria for recommendation

[Diaz-Aviles et al., WWW’ 12, Diaz-Aviles et al., ICWSM’ 12]
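The idea of using a user's context as an implicit ranking criterion can be sketched with a hand-weighted score. This is not learning to rank and not the method of the cited papers: a learned model would fit the weights from user feedback, whereas here `w_disease` and `w_location` are fixed, hypothetical values, and `UserContext`, `score`, and `rank_tweets` are invented names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Hypothetical user profile: diseases and locations the user monitors."""
    diseases: set = field(default_factory=set)
    locations: set = field(default_factory=set)


def score(tweet, ctx, w_disease=1.0, w_location=0.5):
    """Weighted count of context terms found in the tweet."""
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return (w_disease * len(tokens & ctx.diseases)
            + w_location * len(tokens & ctx.locations))


def rank_tweets(tweets, ctx):
    """Sort tweets by descending score; ties keep their input order."""
    return sorted(tweets, key=lambda t: score(t, ctx), reverse=True)


if __name__ == "__main__":
    ctx = UserContext(diseases={"cholera"}, locations={"kenya"})
    tweets = [
        "I have the mumps",
        "Cholera vaccine news",
        "#Cholera breaks out in #Dadaab refugee camp in #Kenya",
    ]
    for tweet in rank_tweets(tweets, ctx):
        print(score(tweet, ctx), tweet)
```

With this profile, the tweet that matches both the disease and the location is ranked first, and the tweet that matches neither is ranked last.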

slide-32
SLIDE 32

Approach

slide-33
SLIDE 33

Signal Search Prototype

slide-34
SLIDE 34

Future Work

  • Real-Time Analysis of Big and Fast Social Web Streams

– Scalable, efficient methods for filtering and generating signals in real time
– Effective methods for aggregating and visualizing information in a meaningful way

slide-35
SLIDE 35

Thank you!

kanhabua@L3S.de

slide-36
SLIDE 36

References

  • [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.

  • [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of the 21st World Wide Web Conference (WWW’2012), 2012.

  • [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2012), 2012.

  • [Kanhabua et al., 2012a] N. Kanhabua, S. Romano, and A. Stewart. Identifying Relevant Temporal Expressions for Real-world Events. In SIGIR 2012 Workshop on Time-aware Information Access (TAIA’2012), 2012.

  • [Kanhabua et al., 2012b] N. Kanhabua, S. Romano, A. Stewart, and W. Nejdl. Supporting Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM’2012, 2012.

  • [Kanhabua and Nejdl, 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW’2013) at WWW’2013, 2013.

  • [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. ACM TIST, 3, 2011.

  • [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.

  • [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.

  • [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.