

SLIDE 1

Opportunities and Challenges in using Big Data to Understand Human Systems

Ryan Kennedy, University of Houston

SLIDE 2

How Big is Big Data?

  • In the 3rd century BC, the Library of Alexandria was thought to contain the entire sum of human knowledge.

  • Today, there is enough digitally stored information to give every person alive 320 times as much information as we think was stored in that library.

SLIDE 3

Opportunities in Big Data

  • Microtargeting
  • Real-time feedback
  • Information on hard-to-find groups
  • Data on social systems
SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7

Big Data Hype

  • (Mayer-Schönberger and Cukier 2014):
  • Approaching “N = all”
  • Correlation is enough (no need for theory)
SLIDE 8

Problems in Big Data

  • Big Data Hubris
  • Overfitting
  • Vulnerability to Artifacts
  • Non-ideal Users
  • Blue Team Issues
  • Red Team Issues
SLIDE 9

Big Data Hubris

  • The belief that volume can solve all problems.
  • Don’t get me wrong, it can solve some (e.g. Xbox project).
  • But there are several things that have to be clear:
  • Sampling frame
  • Convenience samples are still convenience samples
  • Generalizability still has the standard limitations
  • Motivations for uses of technology not always clear
  • Behavioral analogue must be clear (and often requires small data)
SLIDE 10

Can Twitter Predict Elections?

Party                             Election   Share of Twitter Mentions
                                  Results    Original Study   Replication
Christian Democrats (CDU)         28.4%      30.1%            18.6%
Christian Social Union (CSU)      6.8%       5.6%             3.0%
Social Democratic Party (SPD)     24.0%      26.6%            14.7%
Free Democratic Party (FDP)       15.2%      17.3%            11.2%
The Left (Die Linke)              12.4%      12.4%            8.3%
Green Party (Grüne)               11.1%      8.0%             9.3%
Pirate Party (Piraten)            2.1%       -                34.8%
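
One way to summarize the table above is the mean absolute error (MAE) between each study's Twitter mention shares and the actual vote shares. The sketch below is not part of the original slides; it only recomputes that figure from the numbers in the table, and the names `results` and `mean_absolute_error` are illustrative. The Pirate Party has no original-study value in the table, so it is skipped for that column.

```python
# Vote share and Twitter mention share (percent), copied from the table above.
# Each tuple is (election result, original study, replication).
# None marks the value the table does not report for the original study.
results = {
    "CDU":       (28.4, 30.1, 18.6),
    "CSU":       (6.8, 5.6, 3.0),
    "SPD":       (24.0, 26.6, 14.7),
    "FDP":       (15.2, 17.3, 11.2),
    "Die Linke": (12.4, 12.4, 8.3),
    "Grüne":     (11.1, 8.0, 9.3),
    "Piraten":   (2.1, None, 34.8),
}


def mean_absolute_error(rows, column):
    """Average |mention share - vote share| over parties with a reported value."""
    errors = [abs(row[column] - row[0]) for row in rows if row[column] is not None]
    return sum(errors) / len(errors)


mae_original = mean_absolute_error(results.values(), 1)
mae_replication = mean_absolute_error(results.values(), 2)
print(f"Original study MAE: {mae_original:.2f} points")
print(f"Replication MAE:    {mae_replication:.2f} points")
```

On these numbers the original study is off by about 1.8 points on average and the replication by about 9.4, with the Pirate Party alone contributing roughly half of the replication's total error.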
SLIDE 11

Can Twitter Predict Elections?

Tweets mentioning Chavez Tweets mentioning Capriles

SLIDE 12

Better Small Data = Better Big Data

SLIDE 13

Using Big Data to Supplement Small Data

SLIDE 14

Overfitting

  • When many candidate predictors are fit to a relatively small number of data points, the number of strong correlations that will be found by chance alone increases dramatically.
  • This danger is made even worse by modern algorithms that can find highly nonlinear and highly interactive relationships.
  • Out-of-sample prediction helps, but it doesn’t completely solve the problem.
  • Causality is still important.
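
The first point can be illustrated with a small simulation. This is a minimal sketch, not something from the original slides: the target and every predictor are pure random noise, so any correlation found is a chance artifact. The sample size of 20 and the predictor counts are arbitrary assumptions chosen to keep the run fast.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_predictors, n_points, rng):
    """Largest |r| between a pure-noise target and pure-noise predictors."""
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(predictor, target)))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
for n_predictors in (10, 100, 1000, 10000):
    r = best_chance_correlation(n_predictors, n_points=20, rng=rng)
    print(f"{n_predictors:>6} noise predictors -> best |r| = {r:.2f}")
```

The best chance correlation tends to climb as more predictors are searched, even though nothing here is related to anything else.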
SLIDE 15

Google Flu Trends (GFT)

  • Examined 50 million search terms.
  • Used those most heavily correlated with flu prevalence, as measured by CDC regional reports, but curated to weed out non-flu-related searches.
  • Released in 2008, updated in 2009, updated again in 2013.

SLIDE 16

Big data means big danger of overfitting…

  • In Google Flu the search terms are identified through “brute force” (50 million search terms fit to 1152 data points).
  • Similar problems for other “big data” resources (e.g. Twitter).

#feelingalittlesick #nyquilfixeseverything #blowingmyownbrainsout #fallingapart #shamblingnose #YOLO?
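
A rough back-of-the-envelope calculation shows why 50 million terms against 1152 data points is risky. The sketch below is an illustration under stated assumptions, not Google's procedure: it treats every term as independent noise and uses the Fisher z approximation for the sampling distribution of a correlation. The function name and the thresholds are hypothetical.

```python
import math


def expected_chance_hits(n_terms, n_points, r_threshold):
    """Expected number of unrelated terms with |r| above the threshold.

    Uses the Fisher z approximation: for independent noise, atanh(r) is
    roughly normal with mean 0 and standard deviation 1 / sqrt(n_points - 3).
    """
    z = math.atanh(r_threshold) * math.sqrt(n_points - 3)
    p_two_sided = math.erfc(z / math.sqrt(2))  # P(|Z| > z) for a standard normal
    return n_terms * p_two_sided


N_TERMS = 50_000_000   # candidate search terms, from the slide
N_POINTS = 1152        # data points, from the slide
for r in (0.05, 0.10, 0.15, 0.20):
    hits = expected_chance_hits(N_TERMS, N_POINTS, r)
    print(f"|r| > {r:.2f}: about {hits:,.1f} terms expected by chance")
```

Under these assumptions, tens of thousands of unrelated terms would be expected to clear |r| > 0.10. Real search and flu series are seasonal and autocorrelated, so the independence assumption most likely understates how many chance matches a brute-force search would find.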

SLIDE 17

Vulnerability to Artifacts

  • Need to know how your system turns unstructured data into quantitative data.
  • Examples:
  • Google NGram project
  • Stanford Sentiment Analyzer (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html)

SLIDE 18

Vulnerability to Artifacts

SLIDE 19

Vulnerability to Artifacts

SLIDE 20

Vulnerability to Artifacts

SLIDE 21

Non-Ideal Users

  • Analyses often assume an ideal user, but:
  • Users often create multiple accounts
  • Users will sometimes not answer particular questions
  • Users will modify their behavior if they know they are being observed
  • Users will modify their behavior due to irrelevant events
SLIDE 22

Trends: abdominal pain on my right side

SLIDE 23

Blue Team Issues

  • “Data exhaust” is not designed for analysis.
  • The process generating the data is always changing and is geared towards goals other than collecting accurate data.
  • This means that the data-generating process can change without warning and in unpredictable ways (e.g. Google’s Search Algorithm).
  • Even if the data-generating process remains relatively stable, it can still be idiosyncratic (e.g. event data).

SLIDE 24

Changes that could affect Google Flu…

  • Improved Trends geolocation.
  • Recommended searches.
  • Related search listings.
  • Diagnosis related to search terms for symptoms.
  • Customized search results by time and location.
  • Popularization of particular terms.
  • And many many more.
SLIDE 25

Red Team Issues

  • As we become better at monitoring systems, individuals have incentives to manipulate those signals.
  • Bots and puppets.
  • Purchased support.
SLIDE 26

Red Team Issues

SLIDE 27

Methods to Address Issues

  • Strong ground truth measures (need small data to have good big data).
  • Causality still matters.
  • Critically evaluate the sample from which big data is derived.
  • Dynamic re-estimation.
  • Understanding the data-generating process and critically tracking system changes.
  • Anticipating manipulation and setting up early warnings.
  • “Rosetta Stone” for linking digital behavior to specific social context.
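
Dynamic re-estimation can be sketched with synthetic data. This is a minimal illustration under stated assumptions, not the method of any cited study: a one-variable linear model is either fit once and frozen, or re-fit on a rolling window of recent observations, while the true relationship drifts over time. The drift rate, window length, and all names are hypothetical, and the sketch assumes ground truth is available for every period before the one being predicted.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


rng = random.Random(1)  # fixed seed so the run is repeatable
T, TRAIN, WINDOW = 200, 50, 30

# Synthetic data: the slope linking the signal to the truth drifts over time,
# standing in for a data-generating process that changes without warning.
signal = [rng.uniform(0, 10) for _ in range(T)]
truth = [(1.0 + 0.01 * t) * s + rng.gauss(0, 0.5) for t, s in enumerate(signal)]

# Static model: fit once on the first TRAIN periods and never updated.
static_a, static_b = fit_line(signal[:TRAIN], truth[:TRAIN])

static_err, dynamic_err = [], []
for t in range(TRAIN, T):
    # Dynamic model: re-fit on the most recent WINDOW periods before predicting t.
    a, b = fit_line(signal[t - WINDOW:t], truth[t - WINDOW:t])
    static_err.append(abs(static_a + static_b * signal[t] - truth[t]))
    dynamic_err.append(abs(a + b * signal[t] - truth[t]))

static_mae = sum(static_err) / len(static_err)
dynamic_mae = sum(dynamic_err) / len(dynamic_err)
print(f"Static model MAE:       {static_mae:.2f}")
print(f"Re-estimated model MAE: {dynamic_mae:.2f}")
```

The rolling fit tracks the drifting slope while the frozen model falls further behind. In practice ground truth often arrives with a reporting delay, so the usable window would end earlier than it does in this sketch.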
SLIDE 28

Dynamic re-estimation…

Santillana et al. Forthcoming. American Journal of Preventive Medicine.

Can we reach Asymptopia?

SLIDE 29

Understanding Causality and the Data Generating Process

SLIDE 30

Linking Digital Behavior to Context

SLIDE 31

Thank You

Ryan Kennedy, University of Houston