Overview of the NSF-CDI project (Year-3) and Research Progress


Mapping Ideas from Cyberspace to Realspace. Funded by NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177. (2010-2014) http://mappingideas.sdsu.edu/


slide-1
SLIDE 1

Principal Investigator: Dr. Ming-Hsiang Tsou, mtsou@mail.sdsu.edu (Geography). Co-PIs: Dr. Dipak K Gupta (Political Science), Dr. Jean Mark Gawron (Linguistics), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography). San Diego State University, USA.

Mapping Ideas from Cyberspace to Realspace. Funded by NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177. (2010-2014) http://mappingideas.sdsu.edu/

Overview of the NSF-CDI project (Year-3) and Research Progress

  • Ming-Hsiang (Ming) Tsou, mtsou@mail.sdsu.edu, Professor (Geography), San Diego State University, PI of the “Mapping Ideas” project.

Co-PIs:

  • Dr. Dipak K Gupta (Political Science), Dr. Jean Mark Gawron (Linguistics),
  • Dr. Brian Spitzberg (Communication), Dr. Li An (Geography)
slide-2
SLIDE 2

Starting Date: October 1, 2010 (Four Years, $1.38M total)

  • Goal 1: Establish a new multidisciplinary research framework to represent the spatiotemporal diffusion of ideas and the semantic web on the Internet.
  • Goal 2: Create effective visualization and analysis methods for the dynamic geospatial information landscape with three selected topics (e.g. natural disasters, continuous threats for human beings, and radical social movements).
  • Goal 3: Build domain-specific ontology, citation, and (provocative) event knowledge bases with thesaurus and citation networks for the three selected topics and their Semantic Webs.
  • Goal 4: Develop theoretical model(s) capable of integrating the individual (semantic usage, online motivations) and societal (diffusion) motives and practices associated with the spatiotemporal diffusion of ideas.

slide-3
SLIDE 3

Knowledge Discovery in Cyberspace (KDC)

Similar to the multidisciplinary research field, called “knowledge discovery in databases (KDD)” (Fayyad et al. 1996), this emerging research field, knowledge discovery in cyberspace (KDC), will focus on how to handle and analyze very large information and human messages collected from cyberspace and social media. The purpose of KDC is to scale up our research capability of handling millions of records and information items available in social media (such as Twitter) or web pages (searched by Google, Yahoo, or Bing search engines).

(Cited from: Ming-Hsiang Tsou & Michael Leitner (2013): Visualization of social media: seeing a mirage or a message?, Cartography and Geographic Information Science, 40:2, 55-60 )

Goal 1: Establish a new multidisciplinary research framework

slide-4
SLIDE 4

The Uniqueness of KDC

Triangular Knowledge Base (human-centered, interdependent):

  • Place (scale, space, context): San Diego, New York, 92119, SDSU, bus stops, Sea World…
  • Time (dynamic): August 23, 2012 (snapshot), one week, two months, before / after, etc.
  • Messages (content / function, who, what, how, media): tweets, web pages, emails, short messages
slide-5
SLIDE 5

Cyberspace

  • Social Media (Twitter, Facebook, Flickr, Youtube)
  • Web Pages, Weblogs, News, RSS, Emails, etc.

Target Data:

  • Tweets (keywords, regions, API types)
  • Web Pages (keywords, web search engines)

Collected Data:

  • SQL databases (Tweet contents)
  • Excel files (Web Search Results)

Preprocessed Data:

  • SQL-output-Tweets: remove errors and duplicates
  • Geocoded Excel files: add lat/long
  • Improve geolocation results

Transformed Data:

  • Graphics, Bar charts, WordCloud, etc.
  • Original Point Maps (each point represents one web page or one tweet).

  • Selection (research focus)
  • Collection (tools, APIs)
  • Preprocessing (reduce noise, data cleaning, select regions, time scale/series ???)
  • Transformation (mapping + graphs)

KDC: Knowledge Discovery in Cyberspace (7 steps)
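The preprocessing step listed on this slide (remove errors and duplicates from the collected tweets) could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the row layout and the field names `id` and `text` are assumptions, since the slides do not show the database schema.

```python
def preprocess_tweets(rows):
    """Drop malformed rows and duplicate tweets.

    `rows` is assumed to be a list of dicts with hypothetical keys
    "id" (tweet identifier) and "text" (tweet content).
    """
    seen_ids = set()
    clean = []
    for row in rows:
        tweet_id = row.get("id")
        text = (row.get("text") or "").strip()
        # Treat rows without an id or without any text as errors.
        if tweet_id is None or not text:
            continue
        # Keep only the first occurrence of each tweet id.
        if tweet_id in seen_ids:
            continue
        seen_ids.add(tweet_id)
        clean.append({**row, "text": text})
    return clean


sample = [
    {"id": 1, "text": "Watching TED tonight"},
    {"id": 1, "text": "Watching TED tonight"},  # duplicate
    {"id": 2, "text": "   "},                    # empty text, treated as an error
    {"id": None, "text": "no id"},               # missing id, treated as an error
    {"id": 3, "text": " flu season again "},
]
cleaned = preprocess_tweets(sample)
```

Deduplicating on the tweet id rather than on the text keeps retweets with identical wording, which may or may not be what a given study wants.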

slide-6
SLIDE 6

Transformed Data: (continued from last page)

  • Graphics, Bar charts, WordCloud, etc.
  • Original Point Maps (each point represents one web page or one tweet).

Visualized Data:

  • Kernel Density maps, Differential KD maps, Point Density Maps.
  • Excel files (Web Search Results)
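As a rough illustration of the kernel density and differential kernel density maps named above, the sketch below computes a Gaussian kernel density surface on a small grid and subtracts one surface from another. The slides do not define how the differential maps were built, so the cell-by-cell difference of two normalized surfaces is an assumption, as are the function names, the bandwidth, and the planar (already projected) coordinates.

```python
import math


def kernel_density(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density on a regular grid.

    Returns a list of rows indexed as surface[y_index][x_index].
    Coordinates are treated as planar, so points should be projected first.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                dist_sq = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface


def differential_density(surface_a, surface_b):
    """Cell-by-cell difference of two density surfaces on the same grid."""
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(surface_a, surface_b)
    ]


# Hypothetical point sets for two keywords, in arbitrary planar units.
grid = [0.0, 1.0, 2.0]
keyword_a = [(0.0, 0.0), (0.1, 0.0)]
keyword_b = [(2.0, 2.0), (2.0, 1.9)]
diff = differential_density(
    kernel_density(keyword_a, grid, grid, bandwidth=0.5),
    kernel_density(keyword_b, grid, grid, bandwidth=0.5),
)
```

Because each surface is divided by its own point count, the difference highlights where one keyword is relatively more concentrated than the other, regardless of how many points each set has.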

Pattern Recognition:

  • Decision Trees and Rules
  • Nonlinear Regression and Classification Methods
  • Example-based Methods (nearest-neighbor classification)
  • Probabilistic Graphical Dependency Models
  • Relational Learning Models

Knowledge Formalization:

  • Verification (City Mayor Maps, Movie tweets)
  • Discovery
  • Prediction (Election)
  • Description (Outbreaks, election)

  • Explore/Compare Methods (select algorithms)
  • Information Mining (analyze space-time-information relationships)
  • Interpretation / Evaluation

All seven steps are systematic, algorithm-based procedures.

slide-7
SLIDE 7

Real World

Cyber Information Space (BIG DATA): Web Pages, Social Media, Weblogs, Forums, News

Information Mining Tools

CyberDiscovery Tools

  • Yahoo API
  • Bing API
  • Google API

Twitter_GeoSearch_Tool

  • Search API
  • Streaming API

VISION Ontological Analysis Platform

  • Who, Where, When, What, Why? (Networks)
  • Spatial Visualization Tools / Methods
  • Computational Linguistics Analysis Tools / Methods
  • Place – Time – Messages (content/functions)
  • New Theories (explanation), New Models (simulation), New Knowledge

“V”isualizing “I”nformation “S”pace “I”n “O”ntological “N”etworks (VISION)

slide-8
SLIDE 8

Information Communication Channels in Cyberspace

  • Web Pages ( Semi-Public Information Communication )
  • Social Media (Twitter: Semi-Private Information Communication)

Web Pages: Use web search engines (Google, Yahoo, and Bing) to retrieve up to 1,000 web pages per keyword. Then analyze their contents associated with their ranks and geolocations.

Social Media (Tweets): Use Twitter APIs to retrieve tweets based on keywords or #hashtags and geolocations (self-defined hometowns or GPS locations).
slide-9
SLIDE 9

Collect Web Page Contents, Ranks, and Locations: We developed the Cyber-Discovery Search Engine (retrieves up to 1,000 results from Yahoo or Bing).

slide-10
SLIDE 10

Twitter – Spatial search API

Center: 41.961295, -93.281859 Radius: 180 miles

Search API Limitations:

  • 1. Spatial search can only trace back up to seven days. (Regular search can trace back up to 14 days.)
  • 2. Each search result set cannot exceed 1,500 tweets.

Twitter APIs:

  • REST API
  • Stream API
  • Search API
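The spatial search on this slide is defined by a center point and a radius. As a hedged sketch of what such a circular filter means, the code below checks whether a coordinate falls within 180 miles of the center shown above, using the haversine great-circle distance. It does not call any Twitter API; the function names and the Earth radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

# Center and radius taken from the slide.
CENTER = (41.961295, -93.281859)
RADIUS_MILES = 180


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_search_circle(lat, lon, center=CENTER, radius=RADIUS_MILES):
    """True if the point lies inside the circular search region."""
    return haversine_miles(center[0], center[1], lat, lon) <= radius


des_moines = (41.5868, -93.6250)
san_diego = (32.7157, -117.1611)
inside = within_search_circle(*des_moines)
outside = within_search_circle(*san_diego)
```

A filter like this can also be applied after collection, for example to re-check tweets whose coordinates came from a self-reported hometown rather than GPS.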
slide-11
SLIDE 11

Web Pages Search Results vs. Tweets

slide-12
SLIDE 12

Web page visualization maps (using Google or Yahoo search engine results, with web page IP addresses converted into lat/lon using MaxMind lookup tables). IP geolocation gives the “registration location of the Web server” (not the physical location of the machines).

(What is the veracity of geolocation?)
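The idea of converting an IP address to a lat/lon through a lookup table can be sketched as a range search. The table below is entirely hypothetical: it uses reserved documentation address ranges and made-up coordinates, and it does not reproduce the MaxMind file format or API.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to an integer for range comparison."""
    return int(ipaddress.ip_address(ip))


# Hypothetical lookup table: (range_start, range_end, lat, lon).
# The ranges are reserved documentation blocks; the coordinates are illustrative only.
RANGES = sorted([
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), 32.7157, -117.1611),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), 40.7128, -74.0060),
])
RANGE_STARTS = [r[0] for r in RANGES]


def lookup_lat_lon(ip):
    """Return (lat, lon) for the range containing `ip`, or None if unknown."""
    value = ip_to_int(ip)
    # Find the last range whose start is <= value, then check its end.
    index = bisect.bisect_right(RANGE_STARTS, value) - 1
    if index >= 0 and value <= RANGES[index][1]:
        return RANGES[index][2], RANGES[index][3]
    return None


known = lookup_lat_lon("192.0.2.10")
unknown = lookup_lat_lon("203.0.113.5")
```

Whatever table is used, the returned coordinate describes where the address range is registered, which is why the slide distinguishes it from the physical location of the machines.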

slide-13
SLIDE 13

Comparison between the Bing and Yahoo engines (“Jerry Sanders” keyword, % in 12 different web page categories defined by our team members):

  • Bing search: more commercial, informational (wiki), and social media pages.
  • Yahoo search: more blogs, news, and educational pages.

(In general, though, there is some similarity between the two engines.)

Classifying different types of web pages and social media for content and linguistic analysis.

slide-14
SLIDE 14

Spatial Accuracy of Web Page Categories based on IP address geo-conversion.

Highest:

  • Educational: 73.86%
  • Social Media: 68.97%
  • Government: 60.98%

Lowest:

  • Blog: 10.81%
  • Special Interest Group: 12.81%
  • NGO: 20.93%

slide-15
SLIDE 15

Geolocation Accuracy in Different Keywords

  • Green (correct)
  • Blue (incorrect)
  • Gray (unknown)
  • Highest spatial accuracy: McGinn, 33.57%
  • Lowest spatial accuracy: Santorum, 21.29%
  • Highest N/A: Flu, 35.52%
slide-16
SLIDE 16

Web Page Information Landscape (2012 Presidential Election)

Ming-Hsiang Tsou, Jiue-An Yang, Daniel Lusher, Su Han, Brian Spitzberg, Jean Mark Gawron, Dipak Gupta & Li An (2013): Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election, Cartography and Geographic Information Science, DOI:10.1080/15230406.2013.799738

slide-17
SLIDE 17

http://mappingideas.sdsu.edu/mapshowcase/election/webpage/election3.html

slide-18
SLIDE 18

1) Select 30 major U.S. cities and, within a 17-mile radius of each, collect tweets with movie keywords (TED, Spider-Man, etc.).
2) Compare the daily movie box office results with the number of tweets containing each movie keyword.

Twitter Case Study #1: 2012 Summer Comparing FIVE Movie Tweets & Box Office

slide-19
SLIDE 19

Five Movies Correlation Test

Movie         Daily    Weekly
TED           0.8826   0.9989
Spider-Man    0.9409   0.9725
Ice Age       0.8895   0.9528
Dark Knight   0.9523   0.9375
Step Up       0.8931   0.8123

Daily: Daily_Tweets vs. Daily_Box_Revenue
Weekly: 8_to_13_days_before, one_week_before, release_day, one_week_after, two_weeks_after, three_weeks_after, four_weeks_after
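The slide reports correlation values between tweet counts and box office revenue but does not name the statistic. Assuming a Pearson correlation coefficient, a comparison of two daily series could be computed as in this sketch. The sample numbers are invented for illustration and are not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily tweet counts and daily box office revenue (illustrative only).
daily_tweets = [1200, 5400, 4100, 2500, 1800]
daily_box_revenue = [2.0e6, 9.5e6, 7.0e6, 4.2e6, 3.1e6]
r = pearson_r(daily_tweets, daily_box_revenue)
```

The same function applies to the weekly comparison once tweets and revenue are summed over the weekly windows listed on the slide.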

slide-20
SLIDE 20

Tweet_Daily and Box_Daily (TED)

[Figure: daily tweets and daily box office revenue for TED. X-axis: days relative to release (-13 to 28, release day as 0); y-axes: Box Office Revenue and Tweets; series: Tweet_Daily, Box_Daily; annotations: Release, One Week, Two Week.]

slide-21
SLIDE 21

Tweet_Weekly and Box_Weekly (TED)

[Figure: weekly tweets and weekly box office revenue for TED. X-axis: Week (releasing day as 0), from -2 to 4; y-axes: Box Office Revenue and Tweets; series: Box_weekly, Tweets_weekly.]

slide-22
SLIDE 22
slide-23
SLIDE 23

[Figure panels: Before Hurricane Sandy; After Hurricane Sandy]

Case Study #2: 2012 Presidential Election (Tweets)

slide-24
SLIDE 24

Sentiment Analysis

(case study: 2012 Presidential Election) (Before / After Hurricane Sandy)

slide-25
SLIDE 25

User Behavior Analysis: Tweeting Pattern – by Hours

[Figure: tweeting pattern by hour. X-axis: Hour in Day; y-axis: Total Tweets (Z); series: President Election, Public Health, Environmental, Movie.]
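An hourly tweeting pattern like the one charted here can be derived by counting tweets per hour of day. The slide does not explain how the "Total Tweets (Z)" value is scaled, so this sketch assumes each hourly count is divided by the mean hourly count, which lets topics with very different volumes share one axis. The function name and the sample timestamps are illustrative.

```python
from collections import Counter
from datetime import datetime


def hourly_pattern(timestamps):
    """Return 24 values: tweets per hour of day divided by the mean hourly count."""
    counts = Counter(ts.hour for ts in timestamps)
    hourly = [counts.get(hour, 0) for hour in range(24)]
    mean = sum(hourly) / 24
    if mean == 0:
        return [0.0] * 24
    return [count / mean for count in hourly]


# Hypothetical timestamps: one tweet in every hour, plus two extra at 20:00.
stamps = [datetime(2012, 11, 6, hour, 0) for hour in range(24)]
stamps += [datetime(2012, 11, 6, 20, 15), datetime(2012, 11, 6, 20, 45)]
pattern = hourly_pattern(stamps)
```

Timestamps should be converted to each user's local time before counting; otherwise the daily rhythm of users in different time zones is smeared together.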

slide-26
SLIDE 26

Location Consistency

  • Consistent: 60%
  • Non-consistent: 20%
  • N/A: 20%

Locations are aggregated by States

Geo-tagged Location vs. Self-report Location

Domain          GPS Tweet %
Elections       1.018
Health          4.360
Environmental   1.315
Movie           2.875
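The comparison of geo-tagged and self-reported locations, aggregated by state, can be sketched as a simple tally. The record layout (a pair of state codes, with `None` when the self-reported location could not be resolved to a state) is an assumption for illustration, and the sample records are invented.

```python
from collections import Counter


def consistency_shares(records):
    """Share of tweets whose geo-tagged state matches the self-reported state.

    `records` is an iterable of (geotag_state, self_report_state) pairs.
    A missing or empty self-reported state is counted as "n/a".
    """
    counts = Counter()
    for geotag_state, self_state in records:
        if not self_state:
            counts["n/a"] += 1
        elif geotag_state.strip().upper() == self_state.strip().upper():
            counts["consistent"] += 1
        else:
            counts["non-consistent"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total
            for key in ("consistent", "non-consistent", "n/a")}


# Hypothetical records, not project data.
records = [("CA", "ca"), ("CA", "NY"), ("TX", None), ("NY", "NY"), ("CA", "CA")]
shares = consistency_shares(records)
```

Aggregating to the state level, as the slide does, tolerates small differences between a GPS fix and a hometown in the same state.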

slide-27
SLIDE 27

THE INNOVATION DIFFUSION MODEL (E. Rogers, 1962): Diffusion: “a special type of communication concerned with the spread of messages that are perceived as new ideas” (E. Rogers, p. 35).

Torsten Hägerstrand, 1953 – Doctoral Dissertation - Culture Diffusion. 1967 Innovation Diffusion as a Spatial Process (book).

Developing new communication theories for cyberspace social networks: DIFFUSION OF INNOVATIONS

Public (Mass Media, TV, News) vs. Private (Personal Communications)

slide-28
SLIDE 28

Publication (2013)

  • One special issue in the CaGIS journal
  • Five refereed journal papers

slide-29
SLIDE 29

Human Dynamics Mobile Technology Spatial Science

SDSU: Human Dynamics in the Mobile Age (HDMA)

A Proposal for an Area of Research Excellence at San Diego State University, 2013. Five Core Faculty:

  • Ming-Hsiang Tsou (Coordinator of HDMA, Professor of Geography), 2012 SDSU President's Leadership Fund (PLF) Awards for Excellence.
  • Sheldon Zhang (Chair of Sociology, Professor), expert in Human Trafficking, Transnational Organized Crime, Juvenile Delinquency, Probation and Parole, etc.
  • John Elder (Distinguished Professor of Public Health, Director of IBACH), 2012 SDSU President's Leadership Fund (PLF) Awards for Excellence, expert in randomized trials for chronic disease prevention in Latino communities.
  • Piotr Jankowski (Incoming Chair of Geography), expert in Spatial Decision Support Systems.
  • Brian Spitzberg (Senate Distinguished Professor, School of Communication), expert in communication theories, communicative (in)competence, obsessive relational intrusion.

Along with 29 other funding faculty members from 14 different departments in 5 colleges.

slide-30
SLIDE 30

Four new faculty lines will be hired in 2014 and 2015.

(Assistant or Associate Professor levels)

  • One faculty line for the Graduate School of Public Health is suggested for the Institute for Behavioral and Community Health [IBACH] to focus on Behavior-Environmental interactions in Community Health. (2014)
  • One faculty line is suggested for the Department of Geography to focus on spatio-temporal data analytics. (2014)
  • One faculty line is suggested for the Department of Sociology to focus on Urban Sociology with demography and spatial analysis techniques. (2015)
  • One faculty line is suggested for the School of Communication to focus on strategic communication with Big Data and Social Media. (2015)