SLIDE 1

Mining the Social Web

Asmelash Teka Hadgu teka@l3s.de

L3S Research Center

April 30, 2013

slide-2
SLIDE 2

Outline

Introduction
User Classification
Network Analysis
Content Analysis
Privacy Issues

L3S Web Science 1

SLIDE 3

Web Science

Definition from Web Science Conference1

◮ Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web.

◮ Web Science embraces the study of the Web as a vast universal information network of people and communities.

◮ Studying human behavior and social interaction contributes to our understanding of the Web, while Web data is transforming how social science is conducted.

1http://www.websci13.org/

SLIDE 4

Social Media

Figure: Social Media billboard2

2http://bit.ly/10216Jy

SLIDE 5

Twitter

◮ Politicians use Twitter to mobilize users.
◮ Companies use Twitter for marketing products.

SLIDE 6

User Classification in Twitter [1]

How can we automatically construct user profiles?

SLIDE 7

Applications

◮ Authoritative user extraction - discovering expert users for a target topic.

◮ Personalized web search - personalized social media post retrieval.

◮ User recommendation - suggesting new interesting users to a target user.

SLIDE 8

Example tasks

◮ Political affiliation detection (Right vs. Left)
◮ Ethnicity identification (African-American or not)
◮ Detecting affinity for a particular business (Starbucks fans)

SLIDE 9

Machine Learning Model

◮ Feature construction: profile, tweeting behaviour, linguistic content, and social network features.

◮ Classification algorithm: the Gradient Boosted Decision Trees (GBDT) framework.

SLIDE 11

Profile features

Profile information does not contain enough quality information to be directly used for user classification.

◮ Length of name, number of alphanumeric characters
◮ Capitalization forms in the user name
◮ Use of an avatar picture
◮ Number of followers / friends
◮ Regular expression matches in the bio, e.g.:

(I|i)('m|m|am) [0-9]+ (yo|year old) white (man|woman|boy|girl)
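A bio-matching feature of this kind can be sketched with Python's `re` module. The pattern below is a simplified, hypothetical variant of the slide's (garbled) regex, not the paper's exact one:

```python
import re

# Simplified self-description pattern (illustrative, not the paper's exact regex):
# matches bios like "i'm 23 yo ..." or "im 23 year old ...".
BIO_PATTERN = re.compile(r"(I|i)('m|m| am) (a )?[0-9]+ ?(yo|year old)")

def bio_matches(bio: str) -> bool:
    """Return True if the bio self-reports demographic information."""
    return BIO_PATTERN.search(bio) is not None

print(bio_matches("i'm 23 yo and love coffee"))    # True
print(bio_matches("Software engineer in Berlin"))  # False
```

Each match becomes a binary profile feature for the classifier.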

SLIDE 13

Tweeting behaviour features

A set of statistics capturing the way users interact with the micro-blogging service.

◮ Number of tweets of a user
◮ Number and fraction of retweets of a user
◮ Average number of hashtags and URLs per tweet
◮ Average time between tweets and its standard deviation
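These statistics are straightforward to compute from a user's timeline. A minimal sketch, assuming tweets arrive as hypothetical `(timestamp_seconds, text, is_retweet)` tuples sorted by time:

```python
from statistics import mean, pstdev

def behaviour_features(tweets):
    """Simple tweeting-behaviour statistics (a sketch, not the paper's code)."""
    n = len(tweets)
    retweets = sum(1 for _, _, rt in tweets if rt)
    gaps = [b[0] - a[0] for a, b in zip(tweets, tweets[1:])]
    return {
        "num_tweets": n,
        "retweet_fraction": retweets / n,
        "avg_hashtags": mean(t.count("#") for _, t, _ in tweets),
        "avg_urls": mean(t.count("http") for _, t, _ in tweets),
        "avg_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }

feats = behaviour_features([
    (0, "I hate mondays #work", False),
    (3600, "RT @boss: big news http://t.co/x", True),
    (7200, "coffee time #break #office", False),
])
print(round(feats["retweet_fraction"], 3))  # 0.333
```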

SLIDE 15

Linguistic content features

Linguistic content contains the user’s lexical usage and the main topics of interest to the user.

◮ Prototypical words and hashtags instead of a bag-of-words representation
◮ Generic LDA, domain-specific LDA
◮ Sentiment words

SLIDE 16

Social network features

These features capture the social connections between a user and the accounts they follow, reply to, or retweet.

◮ Friend accounts - prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set.

◮ Number and percentage of prototypical friends

◮ Prototypical replied-to users, prototypical retweeted users

SLIDE 17

Experiments

◮ Political affiliation: more than 80% accuracy
◮ Starbucks fans
◮ Ethnicity

SLIDE 18

Political Polarization on Twitter [2]

How social media shape the networked public sphere and facilitate communication between communities with different political orientations.

SLIDE 19

Data Set

◮ 250,000 politically relevant tweets from more than 45,000 users.

◮ Construct two networks of political communication: retweet and mention networks.

◮ Data set available at: cnets.indiana.edu/groups/nan/truthy

SLIDE 20

Finding

Figure: Political retweet network (left) and mention network (right)

SLIDE 21

Framework

◮ Data gathering
◮ Identifying political content
◮ Political communication networks
◮ Network analysis

SLIDE 22

Identifying Political Content

◮ Political communication - any tweet containing at least one politically relevant hashtag.

◮ Political hashtags constructed from the seed hashtags #p2 and #tcot using Jaccard similarity.

◮ Let S be the set of tweets containing a seed hashtag and T the set of tweets containing another hashtag:

σ(S, T) = |S ∩ T| / |S ∪ T|
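The hashtag-expansion step above reduces to a set computation over tweet ids. A minimal sketch, with hypothetical tweet-id sets:

```python
def jaccard(s, t):
    """Jaccard similarity sigma(S, T) = |S & T| / |S | T| between two tweet-id sets."""
    s, t = set(s), set(t)
    if not s | t:
        return 0.0
    return len(s & t) / len(s | t)

# Tweets containing the seed hashtag vs. tweets containing a candidate hashtag
# (hypothetical tweet ids).
seed_tweets = {1, 2, 3, 4}
candidate_tweets = {3, 4, 5}
print(jaccard(seed_tweets, candidate_tweets))  # 0.4
```

Candidate hashtags whose similarity to a seed exceeds a threshold are added to the politically relevant set.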

SLIDE 23

Community Structure

◮ Community detection using a label propagation method for two communities.

◮ Label propagation - assign an initial arbitrary cluster membership to each node, then iteratively update each node’s label to the label shared by most of its neighbors.

◮ Modularity to measure segregation.
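The label propagation step above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy graph, not the paper's implementation:

```python
import random

def label_propagation(adj, labels, iterations=20, seed=0):
    """Iteratively set each node's label to the majority label of its neighbors."""
    rng = random.Random(seed)
    labels = dict(labels)
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)  # random update order each sweep
        for n in nodes:
            neigh = [labels[m] for m in adj[n]]
            if neigh:
                labels[n] = max(set(neigh), key=neigh.count)
    return labels

# Two triangles joined by a single edge (hypothetical toy network).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
init = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(label_propagation(adj, init))  # {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
```

On the retweet network, this procedure converges to two clearly separated political communities.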

SLIDE 24

Do clusters have similar content?

◮ Associate each user with a profile vector of the hashtags in their tweets, weighted by frequency.

◮ Cosine similarity among users.

SLIDE 25

Do clusters in the retweet network correspond to groups of users of similar political alignment?

◮ Qualitative content analysis from social science.

◮ One author annotates 1,000 random users as ‘left’ or ‘right’.

◮ A second annotator labels 200 random users from the 1,000 above.

◮ Inter-annotator agreement measured using Cohen’s Kappa:

κ = (P(α) − P(ε)) / (1 − P(ε))

where P(α) is the observed rate of agreement between annotators and P(ε) is the expected rate of random agreement given the relative frequency of each class label.
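The formula above is easy to compute directly from two annotation lists. A minimal sketch with hypothetical ‘left’/‘right’ labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (P(a) - P(e)) / (1 - P(e)) for two annotators."""
    n = len(labels_a)
    p_a = sum(x == y for x, y in zip(labels_a, labels_b)) / n  # observed agreement
    p_e = 0.0                                                  # chance agreement
    for cls in set(labels_a) | set(labels_b):
        p_e += (labels_a.count(cls) / n) * (labels_b.count(cls) / n)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical annotations of six users by two annotators.
a = ["left", "left", "right", "right", "left", "right"]
b = ["left", "left", "right", "left", "left", "right"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```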

SLIDE 26

Political Twitter Trends, PTT [3]

◮ Analysis tool for the political polarization of Twitter hashtags.3

3http://politicalhashtagtrends.sandbox.yahoo.com/

SLIDE 27

Data Set

◮ Start with a set of seed political users, such as @BarackObama and @MittRomney, whose political leaning is known.

◮ Get their tweets.

SLIDE 28

Data Set . . .

◮ Collect users that retweet seed users’ tweets.

SLIDE 29

Filtering Users by Location

◮ We want to limit our analysis to the U.S.

SLIDE 30

Evaluating Data Quality

◮ Evaluated against Web directories.
◮ Precision = 0.98 and 0.93 for Wefollow and Twellow, respectively.
◮ Manual inspection: “greatest environmentalist. Also, despise republicans”

SLIDE 31

Detecting Political Hashtags

◮ Look into co-occurrence with the seed political hashtags (#p2, #tcot, #gop, #ows) and the seed terms (‘obama’, ‘romney’, ‘politic’, ‘liberal’, ‘conservative’, ‘democ’, or ‘republic’).

◮ Volume filtering to avoid rare hashtags.

SLIDE 33

Computing Trending Score

◮ Trending - currently popular, i.e. having a higher volume than expected:

trend(h, w) := ( f(h, w) / Σ_{h'∈H} f(h', w) ) / ( Σ_{u≤w} f(h, u) / Σ_{h'∈H} Σ_{u≤w} f(h', u) )

Examples:

◮ #obamagotosama: 01 May 2011 to 08 May 2011.
◮ #ows: 25 Sep. 2011 to 2 Oct. 2011.
◮ Non-trending hashtags: #vote, #democracy.
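The trending score compares a hashtag's share of the current week's volume with its share of all volume up to that week. A minimal sketch, with hypothetical weekly counts:

```python
def trend_score(counts, h, w):
    """trend(h, w): current-week share divided by cumulative historical share.
    counts[hashtag][week] holds weekly tweet frequencies f(h, w)."""
    week_total = sum(c.get(w, 0) for c in counts.values())
    hist_h = sum(v for u, v in counts[h].items() if u <= w)
    hist_total = sum(v for c in counts.values() for u, v in c.items() if u <= w)
    current_share = counts[h].get(w, 0) / week_total
    historical_share = hist_h / hist_total
    return current_share / historical_share

# Hypothetical weekly counts: #ows spikes in week 2, #vote stays flat.
counts = {"#ows":  {0: 10, 1: 10, 2: 80},
          "#vote": {0: 50, 1: 50, 2: 50}}
print(round(trend_score(counts, "#ows", 2), 3))   # > 1: trending
print(round(trend_score(counts, "#vote", 2), 3))  # < 1: not trending
```

A score well above 1 means the hashtag is currently over-represented relative to its history.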

SLIDE 37

Assigning a Leaning to Hashtags

Voting approach: Lean(h, w)_l := Vote(h, w)_l / Σ_{l'∈L} Vote(h, w)_{l'}

◮ Vote(h, w)_l = f(h, w)_l
◮ Vote(h, w)_l = f(h, w)_l / Σ_{h'∈H} f(h', w)_l (normalization)
◮ Vote(h, w)_l = f(h, w)_l / ( Σ_{h'∈H} f(h', w)_l + 40000 ) (cf. Laplace smoothing)
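The voting scheme can be sketched as follows. This is an illustration of the idea only: the counts are hypothetical, and the exact smoothing and normalization in the paper may differ.

```python
def leaning(f_left, f_right, smoothing=0.0):
    """Fraction of (smoothed) left votes for a hashtag in a given week.
    f_left / f_right are the hashtag's tweet counts from left-/right-leaning
    users; `smoothing` pulls low-volume hashtags toward 0.5."""
    vote_left = f_left + smoothing
    vote_right = f_right + smoothing
    return vote_left / (vote_left + vote_right)

print(leaning(30, 10))                       # 0.75: leans left
print(leaning(1, 0))                         # 1.0 on raw counts
print(round(leaning(1, 0, smoothing=5), 3))  # pulled toward 0.5 by smoothing
```

Smoothing prevents a hashtag used once by a single partisan user from getting an extreme leaning.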

SLIDE 42

“Change points” detected

◮ Hypothesis: “change points” are caused by users from the other leaning deliberately trying to ‘hijack’ a hashtag.

SLIDE 43

Leanings over time

SLIDE 44

Leanings over time . . .

SLIDE 45

Leanings over time . . .

SLIDE 46

Detecting “change points” in hashtags

◮ Filter hashtags without sufficient support: total number of weeks > 4.

◮ Relative and absolute change in leaning from the previous week: change from previous week > std and change from previous week > 0.25.

◮ Change from the average value is big: |current value − average value| > std.

◮ Change in leaning is in the direction of the other leaning.
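The four heuristics above can be combined into one predicate over a weekly leaning series. The thresholds follow the slide, but the direction test and exact definitions are my reading of it, not the paper's code:

```python
from statistics import mean, pstdev

def is_change_point(leanings):
    """Check the latest week of a leaning series in [0, 1]
    (1 = fully left, 0 = fully right) against the four rules."""
    if len(leanings) <= 4:                        # rule 1: > 4 weeks of support
        return False
    history, current = leanings[:-1], leanings[-1]
    std = pstdev(history)
    delta = current - history[-1]
    avg = mean(history)
    big_weekly = abs(delta) > std and abs(delta) > 0.25   # rule 2
    big_vs_avg = abs(current - avg) > std                 # rule 3
    toward_other = (avg > 0.5 and delta < 0) or (avg < 0.5 and delta > 0)  # rule 4
    return big_weekly and big_vs_avg and toward_other

# A steadily left-leaning hashtag that suddenly swings right (hypothetical).
print(is_change_point([0.9, 0.85, 0.9, 0.88, 0.3]))   # True
print(is_change_point([0.9, 0.85, 0.9, 0.88, 0.87]))  # False
```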

SLIDE 50

Who Wants to Get Fired? [4]

Work from L3S Research Center

Figure: FireMe! on the news

SLIDE 51

FireMe! Homepage 4

Figure: Fireme! Home page

4http://fireme.l3s.uni-hannover.de/fireme.php

SLIDE 52

Problem Definition

Study the posting behaviour of individuals who tweet that they hate their job or boss.

SLIDE 53

Observation

◮ Many users are not aware of their audience.
◮ Build an alerting system to address the danger of public online data.

SLIDE 54

Data Set

◮ Collect ‘haters’ - users posting sentences like ‘I hate my job’, ‘I hate my boss’, ‘I hate the worst job’ . . .

◮ Collect ‘lovers’ - users posting positive messages like ‘I love my job’, ‘my boss is the best’.

◮ Crawled users for one week, June 18 - June 26, 2012: 21,851 haters and 44,710 lovers.

◮ Randomly select 10,000 users from each group and get the latest 200 tweets of each user.

SLIDE 55

Data Analysis

Characterising groups:

◮ Lovers are more connected: three times as many followers and 20% more friends.

◮ Haters are more active in terms of tweeting speed: twice as many tweets per day.

◮ Haters link to their Facebook profile in their bio.

SLIDE 56

FireMe! Features

Figure: FireMe! alert page

◮ Shows the user, their hate tweet, and the FireMeter! score.
◮ FireMeter! score: the fraction of hate tweets among the latest 100 tweets of a user.
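The FireMeter! score reduces to a simple fraction over a user's recent timeline. A minimal sketch, with a hypothetical (and deliberately short) phrase list:

```python
def fireme_score(tweets, hate_phrases=("i hate my job", "i hate my boss")):
    """Fraction of hate tweets among a user's latest 100 tweets
    (the phrase list here is an illustrative subset)."""
    latest = tweets[:100]
    hate = sum(any(p in t.lower() for p in hate_phrases) for t in latest)
    return hate / len(latest) if latest else 0.0

tweets = ["I hate my job so much", "coffee time", "I hate my boss", "lunch"]
print(fireme_score(tweets))  # 0.5
```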

SLIDE 57

Alert Impact

◮ Sent out over 4,000 alert messages to haters: ‘What are you going to do about it?’

◮ Feedback: 42% deleted the hate tweet, 18% changed privacy settings, 40% did not care.

◮ Overall, 60% of users are concerned about their personal data.

SLIDE 58

FireMe! Features . . .

Figure: FireMe! provides a leaderboard

SLIDE 59

Validation and more analysis . . .

◮ The majority of users deleted the compromising tweets.
◮ The tone of alert messages (neutral, aggressive, action) is important.

SLIDE 60

Reference

[1] M. Pennacchiotti and A.-M. Popescu, “Democrats, republicans and starbucks afficionados: user classification in twitter,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, (New York, NY, USA), pp. 430-438, ACM, 2011.

[2] M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, A. Flammini, and F. Menczer, “Political polarization on twitter,” in Proc. 5th Intl. Conference on Weblogs and Social Media, 2011.

[3] I. Weber, V. Garimella, and A. Teka, “Political hashtag trends,” in Advances in Information Retrieval (P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, and E. Yilmaz, eds.), vol. 7814 of Lecture Notes in Computer Science, pp. 857-860, Springer Berlin Heidelberg, 2013.

[4] R. Kawase, B. P. Nunes, E. Herder, W. Nejdl, and M. A. Casanova, “Who wants to get fired?,” CHI, April 27 - May 2, 2013.
