Generalists and Specialists: Using Community Embeddings to Quantify Activity Diversity in Online Platforms


Generalists and Specialists: Using Community Embeddings to Quantify Activity Diversity in Online Platforms. Isaac Waller (walleris@cs.toronto.edu) and Ashton Anderson (ashton@cs.toronto.edu), University of Toronto. The Web Conference 2019.


slide-1
SLIDE 1

Generalists and Specialists

Using Community Embeddings to Quantify Activity Diversity in Online Platforms

Isaac Waller

walleris@cs.toronto.edu

Ashton Anderson

ashton@cs.toronto.edu

University of Toronto

The Web Conference 2019

slide-2
SLIDE 2

Generalists and specialists

generalist vs. specialist

full-stack developer vs. React developer

family doctor vs. neurosurgeon


slide-4
SLIDE 4

Generalists and specialists

vulture: generalist

koala: specialist

Koala photo by DAVID ILIFF. License: CC-BY-SA 3.0. Vulture photo by Charles Sharp. License: CC-BY-SA 4.0

slide-5
SLIDE 5

Reddit

Games MakeupAddiction medicalschool soccer math programming Cartalk chromeos Construction funny television Aquariums

slide-6
SLIDE 6

Which is the specialist?

User 1: C = {China, nba, Buddhism, startrek}

User 2: C = {Fitness, powerlifting, bodybuilding, weightroom}

GS(C) = ?


slide-8
SLIDE 8

Word2vec [1]

[1] Mikolov et al. (2013) Distributed Representations of Words and Phrases and their Compositionality

slide-9
SLIDE 9

Word2vec for communities [2, 3]

Input: a (community, user) pair for each comment made in a community

(Games, user1)

(Fitness, user3)

(medicalschool, user2)

(China, user4)

(Science, user2)

(weightlifting, user3)

Output: a vector for each community in the input, where communities with high user overlap are closer to each other

[2] Kumar et al. (2018) Community Interaction and Conflict on the Web

[3] Martin (2017) community2vec: Vector representations of online communities encode semantic relationships
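The input/output contract above can be illustrated with a much simpler stand-in than word2vec. The sketch below is not the skip-gram model from the talk: it represents each community as a sparse vector of per-user comment counts and compares communities by cosine similarity, which is enough to show the property that communities with high user overlap end up close together. The function names and the last three pairs are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy (community, user) pairs, one per comment. The first six come from the
# slide; the last three are invented so that some communities share users.
pairs = [
    ("Games", "user1"), ("Fitness", "user3"), ("medicalschool", "user2"),
    ("China", "user4"), ("Science", "user2"), ("weightlifting", "user3"),
    ("Fitness", "user5"), ("weightlifting", "user5"), ("Games", "user4"),
]

def community_vectors(pairs):
    """Map each community to a sparse vector of per-user comment counts."""
    vectors = defaultdict(Counter)
    for community, user in pairs:
        vectors[community][user] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[user] for user, count in u.items() if user in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = community_vectors(pairs)
# Same commenters, so the two communities are (almost exactly) identical.
print(cosine(vectors["Fitness"], vectors["weightlifting"]))
# No shared commenters, so the similarity is zero.
print(cosine(vectors["Fitness"], vectors["China"]))
```

A learned embedding replaces these sparse count vectors with dense, low-dimensional ones, but the intuition about shared users carries over.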


slide-11
SLIDE 11

A first embedding


slide-13
SLIDE 13

Word analogies

[Figure: word analogy examples in two panels: male to female, and verb tense]

slide-14
SLIDE 14

Community analogies

[Figure: community analogy examples in two panels: university to city, and sports team to sport / city]

slide-15
SLIDE 15

4,392 analogies total

brocku → stcatharinesON as uakron → akron

angelsbaseball → baseball as LAClippers → nba

nus → singapore as UMT → missoula

Colts → indianapolis as oaklandraiders → oakland

PolkStateCollege → WinterHaven as csun → LosAngeles

Coyotes → phoenix as AnaheimDucks → LosAngeles

FLC → folsom as OxfordBrookes → oxford

phillies → philadelphia as Torontobluejays → toronto
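An analogy of the form "a → b as c → d" is commonly scored with vector offsets: compute b - a + c and rank the remaining communities by cosine similarity to that point. The sketch below assumes this standard word2vec-style procedure; the slides do not spell out the exact scoring rule. The 2-D vectors are hand-picked for illustration and `solve_analogy` is a hypothetical helper name.

```python
from math import sqrt

# Hypothetical 2-D vectors chosen by hand so the offsets line up; real
# community embeddings are learned and have far more dimensions.
embedding = {
    "phillies": (1.0, 0.0),
    "philadelphia": (1.0, 1.0),
    "Torontobluejays": (3.0, 0.2),
    "toronto": (3.0, 1.2),
    "baseball": (2.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(embedding, a, b, c, k=5):
    """Rank candidates d for 'a -> b as c -> d' by cosine to b - a + c."""
    target = tuple(
        y - x + z for x, y, z in zip(embedding[a], embedding[b], embedding[c])
    )
    # The three query communities are excluded from the candidate set.
    candidates = [name for name in embedding if name not in (a, b, c)]
    candidates.sort(key=lambda name: cosine(embedding[name], target), reverse=True)
    return candidates[:k]

print(solve_analogy(embedding, "phillies", "philadelphia", "Torontobluejays"))
```

With these toy vectors the top-ranked candidate is `toronto`, matching the last analogy in the list above.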

slide-16
SLIDE 16

Hyperparameter search

[Figure: heatmaps over size (100 to 200) and alpha (0.16 to 0.22); color scale from 23.81% to 93.30%]

72% perfect, 93% top 5

cycling + swimming + running = triathlon
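The "perfect" and "top 5" figures read naturally as top-1 and top-5 accuracy over the analogy set, although the slide does not define them. Under that assumption, a minimal sketch of the metric looks like this; the ranked predictions below are invented and are not model output.

```python
def top_k_accuracy(ranked_predictions, answers, k):
    """Fraction of analogies whose correct answer is among the first k guesses."""
    hits = sum(
        1
        for ranked, answer in zip(ranked_predictions, answers)
        if answer in ranked[:k]
    )
    return hits / len(answers)

# Hypothetical ranked outputs for three analogies (not real model output).
ranked_predictions = [
    ["akron", "ohio", "cleveland"],
    ["nfl", "nba", "sports"],
    ["montana", "boise", "denver"],
]
answers = ["akron", "nba", "missoula"]

print(top_k_accuracy(ranked_predictions, answers, k=1))  # exact matches only
print(top_k_accuracy(ranked_predictions, answers, k=5))  # answer anywhere in the top 5
```

In a hyperparameter search, a score like this would be computed once per (size, alpha) setting and the best-scoring setting kept.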


slide-19
SLIDE 19

Our better embedding

slide-20
SLIDE 20

Back to generalists and specialists

User 1: C = {China, nba, Buddhism, startrek}

User 2: C = {Fitness, powerlifting, bodybuilding, weightroom}

GS(C) = ?


slide-22
SLIDE 22

GS-score

Scale from generalist to specialist

$$\mathrm{GS}(C) = \frac{1}{|C|} \sum_{c \in C} w_c \cos(c, \mu)$$

slide-23
SLIDE 23

GS-score

User 1: GS({China, nba, Buddhism, startrek}) = 0.69 (24th percentile)

User 2: GS({Fitness, powerlifting, bodybuilding, weightroom}) = 0.89 (72nd percentile)

$$\mathrm{GS}(C) = \frac{1}{|C|} \sum_{c \in C} w_c \cos(c, \mu)$$
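The GS-score formula can be sketched in a few lines. The slide does not define µ or the weights w_c, so the code below makes two assumptions: µ is the weighted sum of the user's community vectors (their center of mass), and the sum is divided by the total weight. With every w_c = 1 this reduces to (1/|C|) Σ cos(c, µ). The function name `gs_score` and the 2-D vectors are illustrative only.

```python
from math import sqrt

def gs_score(vectors, weights=None):
    """Weighted mean cosine similarity between each community vector and the
    user's center of mass mu.

    Assumptions (not stated on the slide): mu is the weighted sum of the
    community vectors, and the sum is divided by the total weight. With all
    weights equal to 1 this is (1/|C|) * sum of cos(c, mu).
    """
    if weights is None:
        weights = [1.0] * len(vectors)
    dim = len(vectors[0])
    mu = [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]
    mu_norm = sqrt(sum(x * x for x in mu))

    def cos_to_mu(v):
        v_norm = sqrt(sum(x * x for x in v))
        if v_norm == 0.0 or mu_norm == 0.0:
            return 0.0
        return sum(a * b for a, b in zip(v, mu)) / (v_norm * mu_norm)

    total = sum(weights)
    return sum(w * cos_to_mu(v) for v, w in zip(vectors, weights)) / total

# Hypothetical 2-D community vectors, invented for illustration.
specialist = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1)]    # tightly clustered
generalist = [(1.0, 0.0), (0.0, 1.0), (-0.5, 0.5)]   # spread out
print(gs_score(specialist))
print(gs_score(generalist))
```

Tightly clustered communities give a score near 1, while spread-out communities give a lower score, which matches the ordering of User 2 (0.89) above User 1 (0.69).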

slide-24
SLIDE 24

Data

Reddit: all comments in 2017; 900M comments, 11.4M distinct users; top 10,000 subreddits by activity

GitHub: all commits, pull requests, forks, watches, and stars in 2017; 413M actions, 8.3M distinct users; top 40,000 repos by number of stars

Sources: pushshift.io, gharchive.org

slide-25
SLIDE 25

Results

[Figure: frequency histograms over the range 0.6 to 1.0. Reddit (left) and GitHub (right)]

slide-26
SLIDE 26

Results

Specialists stay engaged with communities longer but generalists stay engaged with the platform longer

[Figure: P(stay for >= 6 months) vs. user's GS-score, two panels]

slide-27
SLIDE 27

Results

Specialists stay engaged with communities longer but generalists stay engaged with the platform longer

[Figure: P(remaining on platform) vs. activity (# of comments), by quartile (1st to 4th), two panels]


slide-29
SLIDE 29

Results

On Reddit, specialists tend to make more exceptional comments

[Figure: P(score > parent) vs. percentile of author GS-score]

slide-30
SLIDE 30

Results

but generalists are exposed to a more diverse set of users

[Figure: parent-universe GS-score vs. user's GS-score]

slide-31
SLIDE 31

Results

Can GS-score predict new communities a user joins?

[Figure: mean average precision vs. user GS-score percentile for four methods: center-of-mass NN, collaborative filtering, popularity, random]
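One plausible reading of the "center-of-mass NN" method is: average the vectors of the communities a user is already active in, then recommend the nearest communities they have not joined. The sketch below implements that reading together with average precision for a single user; mean average precision is then the mean over users. The slides do not give the exact procedure, so the unweighted mean, the helper names, and the 2-D vectors are all assumptions.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(embedding, joined, k=3):
    """Rank communities the user has not joined by cosine similarity to the
    unweighted mean (center of mass) of the communities they have joined."""
    dim = len(next(iter(embedding.values())))
    center = [sum(embedding[c][i] for c in joined) / len(joined) for i in range(dim)]
    candidates = [c for c in embedding if c not in joined]
    candidates.sort(key=lambda c: cosine(embedding[c], center), reverse=True)
    return candidates[:k]

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i at which a relevant item appears."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical 2-D vectors, invented for illustration.
embedding = {
    "Fitness": (1.0, 0.1), "powerlifting": (0.9, 0.0), "bodybuilding": (1.0, 0.2),
    "weightroom": (0.95, 0.1), "China": (0.0, 1.0), "startrek": (-1.0, 0.3),
}
ranked = recommend(embedding, joined={"Fitness", "powerlifting"})
print(ranked)
print(average_precision(ranked, relevant={"weightroom"}))
```

For a user whose communities cluster tightly, the center of mass sits close to related communities, which is consistent with specialists being easier to predict.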



slide-34
SLIDE 34

Community GS-scores

[Figure: community GS-score over time, by quartile (1st to 4th); one panel spans 2015 to 2018, the other spans months 2017-1 to 2017-11]

slide-35
SLIDE 35

In summary

Users on Reddit and GitHub range from generalist to specialist

On Reddit, specialists are more likely to make exceptional comments

Specialists stay engaged with individual communities longer, but generalists stay engaged with the platform longer

Specialists are significantly more predictable than generalists
