On the importance of keywords for the application of Twitter posts for traffic incident detection



slide-1
SLIDE 1

On the importance of keywords for the application of Twitter posts for traffic incident detection

Camille Kamga Anil Yazici Sandeep Mudigonda Wei Hao Nathalie Martinez


slide-2
SLIDE 2

Traffic Incidents

  • Roadway incidents → 57.9% of the total delay on road networks.
  • Improve roadway geometric design for safer driving
  • Mitigate incident impacts:
  • 1 min less incident duration → 4-6 min/vehicle delay savings (& 9 gal fuel, 0.7 kg HC, 9 kg CO, 1.3 kg NO)
  • Reduce detection and clearance times
  • Gather and disseminate incident information the fastest way possible → efficient response
  • Harvest the information content of crowd-sourced online Twitter feeds
  • Use as an incident management (IM) support tool

Sources: Oak Ridge National Lab report by Shih-Miao et al., 2004; Texas Transportation Institute, 2012

Crowdsourced social media (Twitter) data can help

slide-3
SLIDE 3

Use of Social Media

  • Web 2.0 → user-generated content → everybody is a “reporter”

Social media feeds as information source

  • Brand adoption; political public opinion; “meet up”;
  • Monitor disease outbreaks; disaster information
  • Transportation
  • Surveys: policy, demand, etc.
  • Transit service disruptions: real-time interaction
  • Potential for extracting real-time information

slide-4
SLIDE 4

Transportation Agency Adoption of Social Media

Iowa DOT, Utah DOT, Florida DOT

slide-5
SLIDE 5

Information Extraction from Social Media

  • “Needle in a haystack” problem (Grant-Muller et al., 2014).
  • Natural language form → 80% unstructured (Liu et al., 2011)
  • Ungrammatical, abbreviated
  • Approach:

1. Information retrieval: query-based

2. Information extraction: text → relevant information

  • “Dictionary” → list of common words → best “candidate” tweets
  • Context dependent, different set for different purposes
  • Lack/ambiguity of context → challenge! (Pereira et al., 2014)

3. Prediction: extracted information → predict future transportation states

slide-6
SLIDE 6

Potential Value of Social Media for IM

  • Most “prominent” (organizational) accounts use incident info from 511, DOT
  • Early detection of incidents is possible, for at least a few incidents
  • Usually from tweets by individuals (personal accounts)
  • Important to distinguish between organizational and personal tweets

Dictionary! Organizational & personal

slide-7
SLIDE 7

Proposed Methodology

Flow diagram (labels): Twitter universe; Twitter API with key words; initial crawled dataset; cleaning; dictionaries of tf-idf weighted words; potential dataset ranked for importance.

  • Manually classify raw data into:
  • Relevant (incident-related) & irrelevant
  • Organizational accounts vs. personal accounts
  • Score tweets using tf-idf “weights” ← importance of words

Crawled tweets (examples):

  • Waking up early to beat BQE traffic sucks #offtowork…
  • Accident in #Queens…
  • Omg a car crashed into …
  • Genius is talent set on fire by courage. - Henry Van

Tweets ranked for importance (examples):

1. Accident in #Queens…
2. Omg a car crashed into …
3. Waking up early to beat BQE traffic sucks #offtowork…
4. …
5. …
10. Genius is talent set on fire by courage. - Henry Van
11. …

$$\mathrm{tf}(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}$$

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$
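The two weights above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tweets are already tokenized into lowercase word lists, N is taken as the number of tweets in the corpus, and the sample corpus is made up for the example rather than taken from the study data.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency normalized by the most frequent term in the tweet."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())

def idf(term, corpus):
    """Log of (number of tweets / number of tweets containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    """Product of the two weights for one term in one tweet."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical, hand-made corpus of tokenized tweets.
corpus = [
    ["accident", "in", "queens"],
    ["car", "crashed", "into", "car"],
    ["traffic", "sucks"],
]
print(tfidf("car", corpus[1], corpus))
```

Words that are frequent inside a tweet but rare across the corpus receive the highest weight, which is what makes them useful dictionary entries.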

slide-8
SLIDE 8

Proposed Methodology

Flow diagram (labels): potential dataset; NB classifier trained on manually coded tweets; classified dataset (accident-related vs. irrelevant); geo-code; classified geo-coded dataset.

1. Accident in #Queens…
2. Omg a car crashed into …
3. Waking up early to beat BQE traffic sucks #offtowork…
4. …
5. …

  • Naïve-Bayesian (NB) classifier: what is the probability that a tweet is relevant given that it includes “car” and “crash”?
  • NB for each account type (organizational vs. personal)

$$P_{NB}(c \mid d) := \frac{p(c)\,\prod_{i=1}^{m} p(f_i \mid c)^{n_i(d)}}{P(d)}$$
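A minimal sketch of the NB posterior for the “car” and “crash” question follows. It reads c as the class, d as the tweet, f_i as a word and n_i(d) as its count in the tweet. Add-one (Laplace) smoothing and the tiny training set are assumptions made for the example; the slides do not say how zero counts were handled.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect class priors and per-class word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    n_docs = sum(class_counts.values())
    priors = {c: n / n_docs for c, n in class_counts.items()}
    return priors, word_counts, vocabulary

def posterior(tokens, model):
    """Return P(c | d) for every class c, computed in log space."""
    priors, word_counts, vocabulary = model
    log_scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for word, n in Counter(tokens).items():
            if word not in vocabulary:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing is an assumption, not taken from the slides.
            p = (word_counts[c][word] + 1) / (total + len(vocabulary))
            score += n * math.log(p)
        log_scores[c] = score
    # Dividing by P(d) is the same as normalizing over the classes.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Hypothetical manually coded tweets (tokens, label).
train = [
    (["car", "crash", "queens"], "relevant"),
    (["accident", "car", "highway"], "relevant"),
    (["traffic", "sucks", "today"], "irrelevant"),
    (["genius", "talent", "courage"], "irrelevant"),
]
model = train_nb(train)
print(posterior(["car", "crash"], model))
```

Training one such model per account type (organizational vs. personal) only requires calling `train_nb` separately on each subset of the coded tweets.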

slide-9
SLIDE 9

Geocoding

  • < 3% of tweets have an accurate geo-location

Account | Tweet text | Geocode reported
--- | --- | ---
@TotalTrafficNYC | Accident cleared in #Queens on The L.I.E. WB at Douglaston Pkwy, stop and go traffic back to x34, delay of 6 mins #traffic | -73.9626, -73.9626, -73.6998, -73.6998, 40.5417, …
@sfgiantsfan1 | @KTVU there was a high speed crash on Thornton ave in Newark car flipped several times before bursting into flames | -122.0731, -122.0731, -121.9876, -121.9876, 37…
@511NY | Accident with property damage on #US9 NB at Montrose station rd | -73.9535, -73.9535, -73.9166, -73.9166, 41.2298, …

slide-10
SLIDE 10

Geocoding

  • Regular expressions (ave, pkwy, hwy, st, rd, at, near, between…)
  • Hashtags (#Queens)
  • Location

Account | Tweet text | Geocode reported | Location
--- | --- | --- | ---
@TotalTrafficNYC | Accident cleared in #Queens on The L.I.E. WB at Douglaston Pkwy, stop and go traffic back to x34, delay of 6 mins #traffic | -73.9626, -73.9626, -73.6998, -73.6998, 40.5417, … | Queens, NY
@sfgiantsfan1 | @KTVU there was a high speed crash on Thornton ave in Newark car flipped several times before bursting into flames | -122.0731, -122.0731, -121.9876, -121.9876, 37… | Newark, CA
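The regular-expression and hashtag cues listed on this slide can be pulled out of tweet text roughly as follows. The patterns and the `location_cues` helper are hypothetical simplifications written for this example, not the expressions used in the study; resolving the cues to coordinates would need a separate gazetteer or geocoding service.

```python
import re

# One word followed by a street-type suffix, e.g. "Douglaston Pkwy".
STREET_RE = re.compile(r"\b(\w+\s(?:ave|pkwy|hwy|st|rd))\b", re.IGNORECASE)
# Hashtags such as #Queens (not every hashtag is a place name).
HASHTAG_RE = re.compile(r"#(\w+)")
# The word right after a locating preposition: at, near, between.
ANCHOR_RE = re.compile(r"\b(?:at|near|between)\s+(\w+)", re.IGNORECASE)

def location_cues(text):
    """Collect candidate location strings from one tweet."""
    return {
        "streets": STREET_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "anchors": ANCHOR_RE.findall(text),
    }

tweet = (
    "Accident cleared in #Queens on The L.I.E. WB at Douglaston Pkwy, "
    "stop and go traffic back to x34, delay of 6 mins #traffic"
)
print(location_cues(tweet))
```

Hashtags such as #traffic show why the extracted cues still need to be checked against known place names before a location is assigned.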

slide-11
SLIDE 11

Impact of dictionaries

Filtered based on a 20th percentile of normalized tf-idf:

Dictionary | Organizational tweets | Personal tweets | Total
--- | --- | --- | ---
Organizational dictionary | 435 | 4 | 439
Personal dictionary | 409 | 49 | 458
Organizational + personal dictionary | 469 | 18 | 487

Keywords from organization accounts: "exit", "ave", "accident", "lane", "block", "delay", "min", "pkwy", "traffic", "right", "back", "stop", "crash", "clear", "close", "left", "vehicle", "road", "disable"

Keywords from personal accounts: "accident", "just", "car", "traffic", "got", "bridge", "block", "crash", "highway", "thank", "get", "road", "today"

Formula: Normalized tfidf(S), computed from tfidf(t, d) with t ∈ S, for all t in d.

6,900 randomly selected public tweets collected using the Twitter API. Raw data manually coded as incident-related vs. irrelevant and as organizational vs. personal.
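The filtering step can be sketched as below. Several details are assumptions made only for illustration: a tweet's score is taken as the sum of the dictionary weights of its words divided by the tweet length, the percentile uses the nearest-rank rule, tweets scoring above the cutoff are kept, and the dictionary weights and tweets are invented.

```python
def normalized_score(tokens, dictionary):
    """Sum of dictionary weights for the tweet's words, divided by tweet length."""
    if not tokens:
        return 0.0
    return sum(dictionary.get(t, 0.0) for t in tokens) / len(tokens)

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for integer q in 0..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

def filter_candidates(tweets, dictionary, q=20):
    """Keep tweets whose score is above the q-th percentile of all scores."""
    scores = [normalized_score(t, dictionary) for t in tweets]
    cutoff = percentile(scores, q)
    return [t for t, s in zip(tweets, scores) if s > cutoff]

# Hypothetical dictionary weights and tokenized tweets.
dictionary = {"accident": 1.0, "crash": 0.8, "car": 0.5, "traffic": 0.4}
tweets = [
    ["accident", "in", "queens"],
    ["omg", "a", "car", "crashed", "into"],
    ["waking", "up", "early", "traffic", "sucks"],
    ["genius", "is", "talent"],
    ["car", "crash"],
]
kept = filter_candidates(tweets, dictionary)
print(len(kept), "of", len(tweets), "tweets kept")
```

Swapping in a different dictionary (organizational, personal, or both) changes which tweets survive the cutoff, which is the effect the table above quantifies.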

slide-12
SLIDE 12

Impact of dictionaries

Relevant tweet | Account type | Using organizational + personal keywords | Using only organizational keywords | Using only personal keywords
--- | --- | --- | --- | ---
#1 State troopers just blocked the ramps leading from route 138 in Canton onto 93 due to serious crash #WCVB | Agency | 0.27 | 0.27 | 0.8
#2 Omg a car crashed into the paramus Wendy's @amandabootsy http://t.co/C4DwTEIyHN | Personal | 0.2 | 0.16 | 0.4
#3 @crosattto it was a bad wreck that a car went straight into the wall and went up in flames. http://t.co/XCvA7QkAF8 | Personal | 0.04 | | 0.1
#4 car on fire on Lower level of Verrazano Bridge. 🚚🔦🚓🚩💧 @ Verrazano Bridge Tolls https://t.co/lpEPEGGXWn | Personal | 0.34 | | 1.5

slide-13
SLIDE 13

Classification using different dictionaries

  • Raw data → 80% training, 20% test
  • NBorg: using only the organizational dictionary.
  • NBall: using the organizational and personal dictionaries.
  • NBpers: using only the personal dictionary.

Classifier | Accuracy in predicting relevant tweets
--- | ---
NBorg | 75.6%
NBall | 85.5%

Classifier | Accuracy in predicting relevant personal tweets
--- | ---
NBorg | 50.5%
NBall | 54%
NBpers | 74.4%
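The 80/20 evaluation can be sketched as follows. The shuffling seed, the helper names, and the reading of “accuracy in predicting relevant tweets” as the share of truly relevant test tweets that the classifier labels relevant are all assumptions for this example; the keyword rule at the end merely stands in for a trained classifier.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy_on_relevant(examples, predict):
    """Share of truly relevant examples that `predict` labels as relevant."""
    relevant = [tokens for tokens, label in examples if label == "relevant"]
    if not relevant:
        return 0.0
    hits = sum(1 for tokens in relevant if predict(tokens) == "relevant")
    return hits / len(relevant)

# Hypothetical coded tweets and a stand-in keyword rule instead of a trained NB.
examples = [
    (["car", "crash"], "relevant"),
    (["bad", "wreck"], "relevant"),
    (["traffic", "sucks"], "irrelevant"),
]

def keyword_rule(tokens):
    return "relevant" if "crash" in tokens else "irrelevant"

train_part, test_part = split_train_test(range(10))
print(len(train_part), len(test_part))
print(accuracy_on_relevant(examples, keyword_rule))
```

Restricting `examples` to tweets from personal accounts before calling `accuracy_on_relevant` gives the second kind of figure reported in the tables above.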

slide-14
SLIDE 14

Geocoding

Account | Tweet text | Geocode reported
--- | --- | ---
@TotalTrafficNYC | Accident cleared in #Queens on The L.I.E. WB at Douglaston Pkwy, stop and go traffic back to x34, delay of 6 mins #traffic | -73.9626, -73.9626, -73.6998, -73.6998, 40.5417, …
@511NY | Accident with property damage on #US9 NB at Montrose station rd | -73.9535, -73.9535, -73.9166, -73.9166, 41.2298, …

slide-15
SLIDE 15

Summary

  • All incident information is useful for early detection
  • Dictionaries derived from prominent accounts give less importance to personal accounts
  • Personal dictionaries are more effective in:
  • Filtering potentially useful tweets
  • Classification of relevant tweets
  • Geocoding requires analysis of regular expressions, hashtags, location of account, neighborhood information

slide-16
SLIDE 16

Remarks

  • More raw data needed for personal tweets
  • Extra effort needed to identify personal vs. organizational accounts (automated)
  • IM → incidence, location and time
  • Geocoding: 3% of all tweets
  • Further text analysis
  • Time of tweet is not always the incident time

slide-17
SLIDE 17

Future Potential

Debris, dead animal. Accident prevention!

Information Driven Operational

slide-18
SLIDE 18


Thank you! @nyserda @nysdot #Questions?