Data Science, Demography and Social Media Challenges and - - PowerPoint PPT Presentation


Data Science, Demography and Social Media Challenges and Opportunities Emilio Zagheni Department of Sociology and eScience Institute University of Washington, Seattle February 2, 2017 Todays seminar 1. How data science and social media


slide-1
SLIDE 1

Data Science, Demography and Social Media

Challenges and Opportunities

Emilio Zagheni

Department of Sociology and eScience Institute, University of Washington, Seattle

February 2, 2017

slide-2
SLIDE 2

Today’s seminar

  • 1. How data science and social media are transforming demography
  • 2. How demographic thinking helps us make sense of messy and biased data
  • 3. How misuse of new tools and data may lead to dangerous outcomes

slide-3
SLIDE 3

Outline

  • a. Background on ‘digital demography’
  • b. An example from my own research: Estimating migration using Facebook advertisement data
  • c. Potential misuse of online advertising platforms
  • d. Making sense of messy data: an example using Twitter data

slide-4
SLIDE 4

credit: Emmanuel Letouzé

slide-5
SLIDE 5

Demography is the study of populations (including non-human populations). It deals with processes related to mortality, fertility, migration. It attempts to explain the causes and consequences of population dynamics.

slide-7
SLIDE 7

“Demography is the quintessential quantitative social science. It bears something of the same relationship to other social sciences that physics bears to other natural sciences”

  • Ken Wachter, Essential Demographic Methods

⇒ Demography is a discipline that plays a central role in the social sciences

slide-9
SLIDE 9

“Biodemography fundamentally deepens our understanding of the underlying evolutionary drivers of demographic patterns across the tree of life”

  • Jim Vaupel

⇒ Demography as an engine of innovation in the biological sciences

slide-10
SLIDE 10

One of Demography’s many traits

Demography is (or aspires to be) a driver of innovation for all sciences and is energized by exchange of ideas with other disciplines

slide-11
SLIDE 11

Parallels with Data Science

Like Demography, Data Science is (or aspires to be) a driver of innovation for all sciences and is energized by exchange of ideas with other disciplines
slide-12
SLIDE 12

What has made Demography successful?

◮ The intrinsic nature of the discipline:

  • It deals with quantities that are relatively easy to measure
  • The object of analysis is suitable for mathematical modeling
  • Everything is a population

◮ Extrinsic factors:

  • Data availability (often collected by authorities for a number of purposes)
  • The questions asked have policy relevance
  • Major demographic issues faced by societies (e.g. population growth, population aging)

slide-13
SLIDE 13

What is the next frontier in Demography? What are the challenges ahead?

slide-15
SLIDE 15

“Digital Demography”

◮ The Web, social media and smartphones have had a sudden and unprecedented impact on our lives and have given researchers new data to study demographic behavior.

◮ ‘Digital demography’ is about:

  • 1. Studying the implications of the digital revolution on demographic behavior
  • 2. Using new data sources to better understand demographic processes

slide-16
SLIDE 16

Using Facebook Advertisement Data to Estimate Migration

Joint work with Ingmar Weber (QCRI) and Krishna Gummadi (MPI)

slide-17
SLIDE 17

What ads looked like in the 1930s...

slide-18
SLIDE 18

Today: Online (targeted) advertising

slide-19
SLIDE 19
slide-20
SLIDE 20

Targeting a demographic group on Facebook

slide-21
SLIDE 21

You can access the data in a programmatic way

slide-22
SLIDE 22

Leveraging Facebook to study Migration

slide-23
SLIDE 23
[Figure: Migrants to US states for different countries of origin. The fraction of 'expats' in Facebook is plotted against the fraction of foreign born in the ACS, with an additional log-log plot. Labeled points: Mexicans in CA, Filipinos in HI, Mexicans in NM.]

slide-24
SLIDE 24
[Figure: Fraction of immigrants by country of destination. log(Fraction of immigrants in Facebook) is plotted against log(Fraction of immigrants, World Bank), with countries grouped by continent: Africa, Asia, Europe, Latin America, North America, Oceania.]

slide-25
SLIDE 25

A tool potentially useful for demographic and survey research, but that could also be misused...

slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28

https://www.propublica.org/article/breaking-the-black-box-what-facebook-knows-about-you

slide-29
SLIDE 29

Making sense of noisy and messy data

slide-30
SLIDE 30

Can you recognize this city?

slide-31
SLIDE 31

Does this look a bit more familiar?

slide-32
SLIDE 32

The original picture

slide-36
SLIDE 36

Can we infer the height of the Space Needle from one of the images?

◮ No distortions ⇒ Compare with buildings around it

◮ Distortions consistent across the image ⇒ you can still compare with buildings nearby

◮ No clear pattern in distortions ⇒ develop a statistical model to understand patterns

slide-39
SLIDE 39

Social Media offer a “distorted” image of the real world

◮ We want to know the true rates for the underlying population

⇒ Combining different sources of information is key to extracting value from potentially biased data

slide-40
SLIDE 40

...Social media data were produced and collected for reasons other than population studies.

There is a lot of useful information in big social data, but we need to work hard to interpret the new data sources.

slide-41
SLIDE 41

Example: Inferring Migration/Mobility patterns from Twitter Data

Zagheni, Garimella, Weber and State 2014

slide-46
SLIDE 46

Geo-located Twitter data

◮ We collected a large sample of geo-located tweets (with geographic coordinates) for the period 2011-2013 in OECD countries

◮ We evaluated short-term mobility (periods of 4 months)

◮ No official statistics to calibrate a model

⇒ We proposed a difference-in-differences approach to estimate trends

slide-47
SLIDE 47

Geographic mobility for Twitter users

[Figure: Mobility rates for Twitter users (vertical axis from 0.00 to 0.05) over the periods May-Aug 2011, Sep-Dec 2011, Jan-Apr 2012, May-Aug 2012 and Sep-Dec 2012. Series: OECD average, Mexico, USA, Germany, Japan.]

Source: Zagheni, Garimella, Weber and State, WWW’14

slide-49
SLIDE 49

Assumptions when ‘ground truth’ data do not exist

Consider the following situation:

$$\underbrace{y_i^t}_{\text{observation from social media for location } i} = \underbrace{n}_{\text{bias for location } i} + \underbrace{x_i^t}_{\text{``true'' rate for location } i}$$

and

$$\underbrace{y_z^t}_{\text{observation from social media for location } z} = \underbrace{m}_{\text{bias for location } z} + \underbrace{x_z^t}_{\text{``true'' rate for location } z}$$

Additive bias different across regions, but constant (or changes by the same amount across regions) over short periods of time

slide-52
SLIDE 52

Assume that we knew the ‘true’ rates (x) for France and Spain

$$x_{FR}^{t+1} = 0.7 \qquad x_{SP}^{t+1} = 0.5$$

$$x_{FR}^{t} = 0.5 \qquad x_{SP}^{t} = 0.4$$

Let’s define $\delta^{t+1}$ as the differential in the variation of these quantities of interest between time $t$ and $(t + 1)$:

$$\delta^{t+1} = \underbrace{(x_{FR}^{t+1} - x_{FR}^{t}) - (x_{SP}^{t+1} - x_{SP}^{t})}_{\text{difference in the increments}} = \,?$$

$$\delta^{t+1} = (0.7 - 0.5) - (0.5 - 0.4) = 0.2 - 0.1 = 0.1$$

slide-53
SLIDE 53

Plato’s allegory of the Cave

slide-56
SLIDE 56

All we see is a distorted image (y) of the ‘true’ rates (x)

$$y_{FR}^{t+1} = 0.2 + 0.7 \qquad y_{SP}^{t+1} = 0.1 + 0.5$$

$$y_{FR}^{t} = 0.2 + 0.5 \qquad y_{SP}^{t} = 0.1 + 0.4$$

What is $\delta^{t+1}$?

$$\delta^{t+1} = \underbrace{(y_{FR}^{t+1} - y_{FR}^{t}) - (y_{SP}^{t+1} - y_{SP}^{t})}_{\text{difference in the increments}} = \,?$$

$$\delta^{t+1} = (0.9 - 0.7) - (0.6 - 0.5) = 0.2 - 0.1 = 0.1$$

Same as before...
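The France and Spain arithmetic can be checked with a few lines of Python. This is only a sketch of the worked example: the rates and biases are the illustrative values from the slides, while the dictionary layout and the function name `diff_in_increments` are assumptions made here for illustration.

```python
# Illustrative "true" rates from the slides (not real data).
true_rates = {
    "FR": {"t": 0.5, "t+1": 0.7},
    "SP": {"t": 0.4, "t+1": 0.5},
}

# Additive biases from the slides: location-specific, constant over time.
bias = {"FR": 0.2, "SP": 0.1}

# What we would observe on social media: bias + true rate.
observed = {
    loc: {period: bias[loc] + rate for period, rate in series.items()}
    for loc, series in true_rates.items()
}


def diff_in_increments(rates, i, z):
    """Return (rate_i[t+1] - rate_i[t]) - (rate_z[t+1] - rate_z[t])."""
    increment_i = rates[i]["t+1"] - rates[i]["t"]
    increment_z = rates[z]["t+1"] - rates[z]["t"]
    return increment_i - increment_z


delta_true = diff_in_increments(true_rates, "FR", "SP")
delta_observed = diff_in_increments(observed, "FR", "SP")

# Rounded to hide floating-point noise in the last digits.
print(round(delta_true, 10), round(delta_observed, 10))
```

Both quantities agree up to floating-point error, because each location's constant bias drops out of its own increment.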

slide-57
SLIDE 57

Difference in differences estimator

$$\delta^{t+1} = (y_i^{t+1} - y_z^{t+1}) - (y_i^{t} - y_z^{t})$$

After substituting:

$$\delta^{t+1} = \underbrace{(x_i^{t+1} - x_i^{t}) - (x_z^{t+1} - x_z^{t})}_{\text{difference in the increments}}$$

Additive values of the bias (m and n) cancel out
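The substitution step can be written out in full. The derivation below is a sketch that uses the additive model from the earlier slide, with bias n for location i and bias m for location z, and assumes both biases are the same at times t and t + 1:

```latex
\begin{align*}
\delta^{t+1}
  &= (y_i^{t+1} - y_z^{t+1}) - (y_i^{t} - y_z^{t}) \\
  &= \big[(n + x_i^{t+1}) - (m + x_z^{t+1})\big]
   - \big[(n + x_i^{t}) - (m + x_z^{t})\big] \\
  &= (n - m) - (n - m)
   + (x_i^{t+1} - x_z^{t+1}) - (x_i^{t} - x_z^{t}) \\
  &= (x_i^{t+1} - x_i^{t}) - (x_z^{t+1} - x_z^{t})
\end{align*}
```

If both biases instead shift by the same amount c between t and t + 1, the first bracket gains (n + c) - (m + c) = n - m, so the cancellation still holds.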

slide-58
SLIDE 58

Twitter example

[Figure: Difference in differences of out-mobility rates, on a scale from -0.015 to 0.015, for AUS, BEL, CAN, CHL, DNK, FIN, FRA, DEU, GRC, HUN, IRL, ITA, JPN, KOR, MEX, NLD, NZL, NOR, PRT, ESP, SWE, CHE, TUR, GBR, USA.]

Source: Zagheni, Garimella, Weber and State, WWW’14

slide-61
SLIDE 61

Remarks

If the bias is expected to be multiplicative:

$$\underbrace{y_i^t}_{\text{observation from social media for location } i} = \underbrace{n}_{\text{bias for location } i} \times \underbrace{x_i^t}_{\text{``true'' rate for location } i}$$

Use a logarithmic transformation

$$\log(y_i^t) = \log(n) + \log(x_i^t)$$

Then use the difference-in-differences estimator on the logs:

$$\delta^{t+1} = \big[\log(y_i^{t+1}) - \log(y_z^{t+1})\big] - \big[\log(y_i^{t}) - \log(y_z^{t})\big]$$
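A short Python sketch of the multiplicative case follows. It reuses the illustrative France and Spain rates from the earlier slides; the multiplicative bias values (2.0 and 0.5) and the function names are hypothetical, chosen here only to show the mechanics.

```python
import math

# Illustrative "true" rates from the earlier slides (not real data).
true_rates = {
    "FR": {"t": 0.5, "t+1": 0.7},
    "SP": {"t": 0.4, "t+1": 0.5},
}

# Hypothetical multiplicative biases, constant over time (assumed values).
mult_bias = {"FR": 2.0, "SP": 0.5}

# Observed rate = bias * true rate.
observed = {
    loc: {period: mult_bias[loc] * rate for period, rate in series.items()}
    for loc, series in true_rates.items()
}


def did(rates, i, z):
    """Difference-in-differences on the levels."""
    return (rates[i]["t+1"] - rates[z]["t+1"]) - (rates[i]["t"] - rates[z]["t"])


def log_did(rates, i, z):
    """Difference-in-differences on the logs."""
    later = math.log(rates[i]["t+1"]) - math.log(rates[z]["t+1"])
    earlier = math.log(rates[i]["t"]) - math.log(rates[z]["t"])
    return later - earlier


# On the levels, a multiplicative bias does not cancel out.
delta_levels_true = did(true_rates, "FR", "SP")
delta_levels_observed = did(observed, "FR", "SP")

# On the logs, log(bias) is additive and cancels.
delta_log_true = log_did(true_rates, "FR", "SP")
delta_log_observed = log_did(observed, "FR", "SP")

print(delta_levels_true, delta_levels_observed)
print(delta_log_true, delta_log_observed)
```

Note that the log estimator measures a different quantity from the level estimator: it is the log of the ratio of the two locations' growth factors, here log((0.7/0.5) / (0.5/0.4)).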

slide-63
SLIDE 63

Tip of the iceberg

◮ Digital demography is potentially relevant for every area of population studies. Examples include

  • 1. How is online dating affecting household formation?
  • 2. How do people behave in the online marriage market? And how do they react to demographic imbalances and shocks?
  • 3. How are new technologies affecting intergenerational relationships?
  • 4. How does online exposure to peers affect health and fertility behavior?
  • 5. What do Google searches reveal about fertility or abortions?

slide-65
SLIDE 65

Underlying themes

  • 1. How to access and make sense of new, messy and biased data sources?
  • 2. To what extent can traditional research designs be re-purposed to address new challenges?
  • 3. What is the impact of new data and new tools on our society?

⇒ SOC 401: “Data Science and Population Processes” is offered in the Fall

slide-66
SLIDE 66

www.csde.washington.edu