The Paradoxes of Social Data: How Heterogeneity Distorts Information in Networks


SLIDE 1

The Paradoxes of Social Data:

How Heterogeneity Distorts Information in Networks

Kristina Lerman USC Information Sciences Institute

http://www.isi.edu/~lerman

SLIDE 2

USC Information Sciences Institute

Local vs Global

The local and global views of the same information are often irreconcilable

  • Global view does not reflect local information
    • Simpson’s paradox in behavioral data
    • Global (population-level) trends may not reflect local (individual-level) tendencies
  • Local views do not reflect the global reality
    • Friendship paradoxes
    • Network structure skews local perceptions of nodes
SLIDE 3

SIMPSON’S PARADOX

  • What is Simpson’s paradox
  • Why it occurs
  • Some real-world examples
  • How to test for it
  • How to find it in data

SLIDE 4

Simpson’s paradox

  • A trend exists in aggregate data but disappears or reverses when data is disaggregated by subgroups

Source: www.methodsman.com

SLIDE 7

Survivor bias and heterogeneous population

Vaupel, J. W. and Yashin, A. I. (1985). Heterogeneity's ruses: some surprising effects of selection on population dynamics. The American Statistician, 39(3):176-185.

The recidivism rate of convicts released from prison declines with time since release.

In reality, there are two populations: incorrigibles and reformed. Over time, fewer incorrigibles remain in the population.

SLIDE 8

Why does Simpson’s paradox occur?

  • Subgroups differ in the background factor
  • The background factor and the independent variable are correlated
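
The two conditions above can be illustrated with a small synthetic dataset. This is a minimal sketch with made-up numbers, not data from the talk: the two subgroups differ in a background factor (a hypothetical baseline of 10 versus 25), and that factor is correlated with the independent variable x because the high-baseline subgroup also has the larger x values.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var

# Hypothetical subgroups: same downward trend, different baselines.
group_a = [(x, 10 - x) for x in range(0, 5)]   # low baseline, small x
group_b = [(x, 25 - x) for x in range(5, 10)]  # high baseline, large x

slope_a = slope(group_a)              # trend inside subgroup A
slope_b = slope(group_b)              # trend inside subgroup B
slope_all = slope(group_a + group_b)  # trend in the pooled data

print("subgroup A:", slope_a)
print("subgroup B:", slope_b)
print("aggregate: ", slope_all)
```

Within each subgroup the trend is negative, yet the pooled trend is positive, because moving along x also moves from the low-baseline subgroup to the high-baseline one.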

SLIDE 9

Survivor bias and heterogeneous population

Vaupel, J. W. and Yashin, A. I. (1985). Heterogeneity's ruses: some surprising effects of selection on population dynamics. The American Statistician, 39(3):176-185.

The average rate appears to decrease because, over time, fewer people from subgroup 1 (incorrigibles) remain in the population.
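
A minimal numerical sketch of this effect. It assumes, for illustration only, that each subgroup reoffends at a constant rate; the rates and the initial 50/50 mix are made-up values, not figures from Vaupel and Yashin.

```python
import math

def aggregate_rate(t, share_incorrigible=0.5,
                   rate_incorrigible=0.5, rate_reformed=0.05):
    """Population-level reoffence rate among people still at risk at time t.

    Each subgroup has a constant rate; only the mix of survivors changes.
    """
    # Share of the original population from each subgroup still at risk.
    left_incorrigible = share_incorrigible * math.exp(-rate_incorrigible * t)
    left_reformed = (1 - share_incorrigible) * math.exp(-rate_reformed * t)
    # Average of the two constant rates, weighted by who is left.
    weighted = (rate_incorrigible * left_incorrigible
                + rate_reformed * left_reformed)
    return weighted / (left_incorrigible + left_reformed)

rates = [aggregate_rate(t) for t in range(0, 11)]
for t, r in enumerate(rates):
    print(t, round(r, 3))
```

Neither subgroup's rate changes, but the population-level rate starts at 0.275 and falls toward 0.05 as incorrigibles leave the at-risk population.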

SLIDE 10

Stack Exchange: deterioration in answer quality

Better answers?: Users appear to write better answers (more likely to be accepted as best answer) later in a session.

Worse answers: When the same data is disaggregated by length of the session, later answers are less likely to be accepted.

[Ferrara, Alipourfard, Burghardt, Gopal & Lerman (2017) “Dynamics of content quality in collaborative knowledge production”, in ICWSM.]

SLIDE 11

Facebook: content consumption rates

[Figure: two panels plotting average time spent (normalized) against time in the session (minutes)]

Slowdown?: Facebook users appear to spend more time reading each story over the course of a session.

Speedup: When the data is disaggregated by session length, users spend less time reading each story later in a session.

[Kooti, Subbian, Mason, Adamic & Lerman (2017) “Understanding short-term changes in online activity sessions”, in WWW.]

SLIDE 12

Social contagion: do friends amplify or suppress response?

Complex contagion?: Additional exposures by friends appear to suppress response (probability to use a hashtag) [1].

Simple contagion?: When disaggregated by cognitive load (number of friends), additional exposures by friends amplify response (probability to retweet) [2].

[Figure: horizontal axis is the number of tweeting friends]

[1] Romero, Meeder & Kleinberg (2011) “Differences in the Mechanics of Information Diffusion Across Topics”, in WWW.
[2] Hodas & Lerman (2012) “How visibility and divided attention constrain social contagion”, in SocialCom.

SLIDE 13

How to test for Simpson’s paradox

SLIDE 14

The shuffle test

Randomize the data with respect to the independent variable

  • Trend should disappear in shuffled data
  • E.g., online shopping: Is there a relationship between item price and how long a user waits to buy it?
    • Randomize the time items were purchased

[Lerman, K. (2018). Computational social scientist beware: Simpson's paradox in behavioral data. Journal of Computational Social Science, 1(1):49-58.]
SLIDE 16

Testing the trend: online shopping

[Figure: average item price vs. days from last purchase, normal and shuffled data]

Online shopping: trend persists in the aggregated data after shuffling.

[Figure: average normalized price vs. days from last purchase for users with 5 purchases, normal and shuffled data]

Online shopping: trend disappears (as expected) in the disaggregated data after shuffling.
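
The shuffle test and the two outcomes above can be sketched on synthetic data. Everything here is an assumption made for illustration: two hypothetical user groups ("frequent" and "infrequent" shoppers), made-up waits and prices, and a plain least-squares slope as the trend measure. Shuffling permutes each user's prices across that user's own purchase times.

```python
import random

def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var

def make_users(n_per_group):
    """Hypothetical shoppers: each user is (group, [(wait_days, price), ...])."""
    users = []
    for _ in range(n_per_group):
        users.append(("frequent",
                      [(w, 10 + w) for w in (1, 2, 3, 4, 5)]))
        users.append(("infrequent",
                      [(w, 100 + w // 10) for w in (10, 20, 30, 40, 50)]))
    return users

def shuffle_within_user(users, rng):
    """Randomize when each item was purchased, separately for every user."""
    shuffled = []
    for group, purchases in users:
        waits = [w for w, _ in purchases]
        prices = [p for _, p in purchases]
        rng.shuffle(prices)
        shuffled.append((group, list(zip(waits, prices))))
    return shuffled

def pooled(users, group=None):
    """All (wait, price) points, optionally restricted to one group."""
    return [pt for g, purchases in users
            if group is None or g == group
            for pt in purchases]

users = make_users(2000)
shuffled = shuffle_within_user(users, random.Random(0))

agg_before = slope(pooled(users))
agg_after = slope(pooled(shuffled))
sub_before = slope(pooled(users, "frequent"))
sub_after = slope(pooled(shuffled, "frequent"))

print("aggregate trend:     ", agg_before, "->", agg_after)
print("disaggregated trend: ", sub_before, "->", sub_after)
```

In this sketch the aggregate trend barely changes after shuffling, which signals that it comes mostly from differences between the groups rather than from behavior within a user, while the trend inside one group drops to roughly zero.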

SLIDE 17

Stack Exchange

[Figure: four panels]

  • Original aggregate data
  • Original disaggregated data
  • Trend remains in the shuffled aggregate data
  • Trends disappear in the shuffled disaggregated data

[Ferrara, Alipourfard, Burghardt, Gopal & Lerman (2017) “Dynamics of content quality in collaborative knowledge production”, in ICWSM.]

SLIDE 18

Deterioration in comment quality on Reddit

  • The more time people spend online, the worse they perform

SLIDE 19

Automating discovery of Simpson’s paradoxes

SLIDE 20

Method to discover Simpson’s paradoxes in data

[Alipourfard, Fennell & Lerman (2017) “Don’t trust the trend: Discovering Simpson’s paradoxes in social data”, in WSDM.]

Step 1: Estimate trend with respect to an independent variable Xp
Step 2: Disaggregate data by conditioning on another variable Xc
Step 3: Compare trends in disaggregated subgroups to trends in aggregate data
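
The three steps can be sketched as a loop over candidate variable pairs. This is a simplified illustration, not the method from the cited paper: it assumes the conditioning variable is already discretized, uses a plain least-squares slope as the trend, and compares signs only, with no significance testing. The function name find_paradoxes and the column names are hypothetical.

```python
from collections import defaultdict

def slope(points):
    """Least-squares slope of y on x, or None if x does not vary."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var = sum((x - mean_x) ** 2 for x, _ in points)
    if var == 0:
        return None
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return cov / var

def sign(value):
    return (value > 0) - (value < 0)

def find_paradoxes(rows, outcome, candidates):
    """Return (Xp, Xc) pairs whose subgroup trends all differ in sign
    from the aggregate trend (the trend reverses or vanishes)."""
    found = []
    for xp in candidates:
        # Step 1: trend of the outcome with respect to Xp in aggregate data.
        overall = slope([(r[xp], r[outcome]) for r in rows])
        if overall is None or overall == 0:
            continue
        for xc in candidates:
            if xc == xp:
                continue
            # Step 2: disaggregate by conditioning on Xc.
            groups = defaultdict(list)
            for r in rows:
                groups[r[xc]].append((r[xp], r[outcome]))
            # Step 3: compare subgroup trends with the aggregate trend.
            trends = [slope(pts) for pts in groups.values()]
            trends = [t for t in trends if t is not None]
            if trends and all(sign(t) != sign(overall) for t in trends):
                found.append((xp, xc))
    return found

# Hypothetical data: quality falls with activity inside each cohort,
# but the more active cohort starts from a higher baseline.
rows = ([{"activity": x, "cohort": 0, "quality": 10 - x} for x in range(5)]
        + [{"activity": x, "cohort": 1, "quality": 25 - x} for x in range(5, 10)])

paradoxes = find_paradoxes(rows, "quality", ["activity", "cohort"])
print(paradoxes)
```

On this toy input the only flagged pair is Xp = activity conditioned on Xc = cohort. A practical version would bin continuous conditioning variables and test whether each fitted trend is statistically meaningful before comparing signs.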

SLIDE 21

Paradoxes discovered in Stack Exchange data

[Alipourfard, Fennell & Lerman (2017) “Don’t trust the trend: Discovering Simpson’s paradoxes in social data”, in WSDM.]

SLIDE 22

Stack Exchange: a new paradox we discovered

Does experience help?: Users who have already written more answers appear to write better answers (more likely to be accepted)

[Alipourfard, Fennell & Lerman (2017) “Don’t trust the trend: Discovering Simpson’s paradoxes in social data”, in WSDM.]

Worse answers: When the same data is disaggregated by reputation, having more experience does not help write better answers.

SLIDE 23

Data-driven discovery

Reputation Rate better explains behavior

[Alipourfard, Fennell & Lerman (2017) “Don’t trust the trend: Discovering Simpson’s paradoxes in social data”, in WSDM.]

SLIDE 24

FRIENDSHIP (AND OTHER) PARADOXES IN NETWORKS

SLIDE 25

Networks distort individuals’ perceptions

A town is voting to officially declare baseball caps fashionable. A polling firm asks people whether they think baseball caps have popular support. People only know their own opinion and what their friends think.

By Kevin Schaul

SLIDE 26

Majority illusion

A minority opinion can appear to be very popular within many local social circles.

SLIDE 28

Friendship paradox

Friendship paradox: On average, your friends have more friends than you do [Feld, 1991].

SLIDE 32

Strong friendship paradox

Strong friendship paradox: Most of your friends have more friends than you do [Kooti, Hodas and Lerman, 2014].
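
Both versions of the paradox can be checked on a toy network. This is a minimal sketch on a made-up six-node graph (one hub plus a few ties among its contacts), not the network drawn on the slides. "Most" is read here as a strict majority of a node's friends, which is an assumption about the exact definition.

```python
from collections import defaultdict

# Hypothetical undirected friendships: node 0 is a hub.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (2, 3)]

friends = defaultdict(set)
for a, b in edges:
    friends[a].add(b)
    friends[b].add(a)

degree = {node: len(nbrs) for node, nbrs in friends.items()}

# Friendship paradox: compare the average number of friends with the
# average, over nodes, of the mean number of friends their friends have.
mean_degree = sum(degree.values()) / len(degree)
mean_friend_degree = sum(
    sum(degree[f] for f in nbrs) / len(nbrs) for nbrs in friends.values()
) / len(friends)

def in_strong_paradox(node):
    """True if a strict majority of the node's friends have more friends."""
    nbrs = friends[node]
    more_popular = sum(1 for f in nbrs if degree[f] > degree[node])
    return more_popular > len(nbrs) / 2

strong_fraction = sum(in_strong_paradox(n) for n in friends) / len(friends)

print("mean degree:", mean_degree)
print("mean degree of friends:", mean_friend_degree)
print("fraction of nodes in strong paradox:", strong_fraction)
```

In this graph the average person has about 2.3 friends while the average friend has about 3.8, and four of the six nodes find that most of their friends are better connected than they are.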

SLIDE 33

How strong is strong friendship paradox?

Network       Type             Nodes       Probability of paradox
LiveJournal   Social           3,997,962   84%
Twitter       Social           780,000     98%
Skitter       Internet         1,696,415   89%
Google        Hyperlink        875,713     77%
ProsperLoan   Social Finance   89,269      88%
ArXiv         Citation         34,546      79%
WordNet       Semantic         146,005     75%

A very large fraction of individual nodes observe that most of their neighbors have a larger degree

SLIDE 34

Generalized friendship paradoxes

[Figure: three bar-chart panels for Twitter and Digg; vertical axis: % users; legend: mean, median; annotations: weak paradox, strong paradox]

Activity paradox: Most of your friends post more messages than you do.
Diversity paradox: Most of your friends receive more diverse information than you do.
Virality paradox: Most of your friends receive more viral information than you do.

[Kooti, et al. (2014) “Network Weirdness: Exploring the origins of network paradoxes”, in ICWSM.]

SLIDE 35

Strong friendship paradox creates majority illusion

When high-degree nodes are more likely to have a trait, the remaining nodes will experience the majority illusion

  • Large degree-trait (k-x) correlation amplifies the illusion
  • Stronger in disassortative networks (smaller r)

[Lerman, Wu & Yan (2016) “The ‘Majority Illusion’ in Social Networks”, in PLoS ONE.]
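
A minimal sketch of the illusion on a made-up eight-node network, assuming the trait is held only by the two best-connected nodes and that a node "sees a majority" when more than half of its friends have the trait. The graph and the threshold are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical network: hubs 0 and 1 know each other and everyone else;
# the remaining nodes are paired up (2-3, 4-5, 6-7).
edges = [(0, 1)]
edges += [(hub, node) for hub in (0, 1) for node in range(2, 8)]
edges += [(2, 3), (4, 5), (6, 7)]

friends = defaultdict(set)
for a, b in edges:
    friends[a].add(b)
    friends[b].add(a)

# Only the two high-degree hubs have the trait.
has_trait = {0, 1}

global_fraction = len(has_trait) / len(friends)

def sees_majority(node):
    """True if more than half of the node's friends have the trait."""
    nbrs = friends[node]
    return sum(1 for f in nbrs if f in has_trait) > len(nbrs) / 2

illusion_fraction = sum(sees_majority(n) for n in friends) / len(friends)

print("nodes with the trait:", global_fraction)
print("nodes that see a majority with the trait:", illusion_fraction)
```

Only 25% of the nodes have the trait, yet 75% of the nodes observe it in most of their friends, because the trait sits on the nodes that appear in almost everyone's friend list.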

SLIDE 36

Friendship paradox and risky behavior

  • Strong friendship paradox can systematically distort individuals’ perceptions
  • Example: College students overestimate peers’ alcohol use

[Figure: two bar charts of “How many alcoholic drinks are consumed at a party”, one for “Myself” and one for “My Friends”; categories: None, 1-2 drinks, 3-4 drinks, 5-6 drinks, 7+ drinks; vertical axis: 0% to 40%]

Source: Most Students Do PartySafe@Cal

SLIDE 37

To summarize

  • Network structure can systematically bias local perceptions
    • Heterogeneous degree distribution (1K structure)
      • Large inequality of connectivity
    • Disassortativity (2K structure)
      • Popular people linking to unpopular people
    • Neighbor assortativity (3K structure)
      • Degree correlation of neighbors
    • Degree-trait correlation
      • Popular people more likely to have the trait, e.g., be rich
  • Open questions: What is the impact of network bias on
    • Collective dynamics in networks, e.g., contagious outbreaks
    • Network sampling and inference
    • Network control and intervention
SLIDE 38

To summarize

  • Simpson’s paradox occurs when an association observed in the subgroups disappears or reverses when the subgroups are combined into one.
  • Also occurs when measuring trends with respect to an independent variable
  • Algorithm to automatically identify subgroups with different trends
  • A tool for data-driven discovery
  • And to formulate new hypotheses about data.
SLIDE 39

THANK YOU!

Sponsors: NSF CIF-1217605; ARO W911NF-15-1-0142, W911NF-16-1-0306

Questions? lerman@isi.edu