Mining the Social Web
Asmelash Teka Hadgu (teka@l3s.de)
L3S Research Center
April 30, 2013
Outline
◮ Introduction
◮ User Classification
◮ Network Analysis
◮ Content Analysis
◮ Privacy Issues
L3S Web Science 1
Web Science
Definition from Web Science Conference1
◮ Web Science is the emergent science of the people, organizations, applications, and policies that shape and are shaped by the Web.
◮ Web Science embraces the study of the Web as a vast
universal information network of people and communities.
◮ Studying human behavior and social interaction contributes to our understanding of the Web, while Web data is transforming how social science is conducted.
1http://www.websci13.org/
Social Media
Figure: Social Media billboard2
2http://bit.ly/10216Jy
◮ Politicians use Twitter to mobilize users.
◮ Companies use Twitter for marketing products.
User classification in Twitter [1]
How can we automatically construct user profiles?
Applications
◮ Authoritative user extraction: discovering expert users for a target topic.
◮ Personalized web search: personalized retrieval of social media posts.
◮ User recommendation: suggesting new interesting users to a target user.
Example tasks
◮ Political affiliation detection (right vs. left)
◮ Ethnicity identification (African-American or not)
◮ Detecting affinity for a particular business (Starbucks fans)
Machine Learning Model
◮ Feature construction: profile, tweeting behaviour, linguistic content, and social network features.
◮ Classification algorithm: Gradient Boosted Decision Trees (GBDT).
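As a sketch of the feature-construction step, the four feature families can be concatenated into a single vector for the classifier; the field names below are illustrative, not the paper's.

```python
# Illustrative only: combine the four feature families into one vector that a
# GBDT classifier (e.g. scikit-learn's GradientBoostingClassifier) could consume.
def build_feature_vector(user):
    profile = [len(user["name"]), user["n_followers"], user["n_friends"]]
    behaviour = [user["n_tweets"], user["frac_retweets"]]
    linguistic = [user["frac_sentiment_words"]]
    social = [user["n_prototypical_friends"]]
    return profile + behaviour + linguistic + social

u = {"name": "alice", "n_followers": 120, "n_friends": 80,
     "n_tweets": 500, "frac_retweets": 0.2,
     "frac_sentiment_words": 0.05, "n_prototypical_friends": 7}
print(build_feature_vector(u))  # [5, 120, 80, 500, 0.2, 0.05, 7]
```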
Profile features
Profile information does not contain enough quality information to be directly used for user classification.
◮ Length of the name, number of alphanumeric characters
◮ Capitalization forms in the user name
◮ Use of an avatar picture
◮ Number of followers/friends
◮ Regular expression matches in the bio, e.g.:
  (I|i)('m|am) a [0-9]+ (yo|year old) white (man|woman|boy|girl)
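A minimal sketch of such a bio-matching pattern in Python; the exact regular expression used in the paper differs (the slide's version is partially garbled), so the pattern below is an illustrative assumption.

```python
import re

# Illustrative bio pattern in the spirit of the slide's example,
# not the paper's exact regex.
AGE_GENDER = re.compile(
    r"(?i)\bI\s*('?m|am)\s+a\s+(\d+)\s*(yo|year old)\s+(man|woman|boy|girl)\b"
)

def extract_demographics(bio):
    """Return (age, gender) if the bio matches, else None."""
    m = AGE_GENDER.search(bio)
    return (int(m.group(2)), m.group(4)) if m else None

print(extract_demographics("I'm a 25 yo man from Texas"))  # (25, 'man')
print(extract_demographics("Coffee lover and runner"))     # None
```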
Tweeting behaviour features
A set of statistics capturing the way users interact with the micro-blogging service.
◮ Number of tweets of a user
◮ Number and fraction of retweets of a user
◮ Average number of hashtags and URLs per tweet
◮ Average time between tweets and its standard deviation
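These statistics can be sketched as follows; the tweet representation (timestamp, text) and the "RT " retweet heuristic are simplifying assumptions.

```python
from statistics import mean, pstdev

# Sketch of the tweeting-behaviour statistics; each tweet is
# (timestamp_in_seconds, text).
def behaviour_features(tweets):
    texts = [t for _, t in tweets]
    times = sorted(ts for ts, _ in tweets)
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-tweet times
    return {
        "n_tweets": len(tweets),
        "frac_retweets": sum(t.startswith("RT ") for t in texts) / len(texts),
        "avg_hashtags": mean(t.count("#") for t in texts),
        "avg_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }

feats = behaviour_features([(0, "RT hello"), (60, "#nlp is fun"), (180, "bye")])
print(feats["n_tweets"], feats["avg_gap"])
```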
Linguistic content features
Linguistic content contains the user’s lexical usage and the main topics of interest to the user.
◮ Prototypical words and hashtags instead of a bag-of-words representation.
◮ Generic LDA and domain-specific LDA.
◮ Sentiment words.
Social network features
These features capture the social connections between a user and the users they follow, reply to, or retweet.
◮ Friend accounts: prototypical 'friend' accounts are generated by exploring the social network of users in the training set.
◮ Number and percentage of prototypical friends.
◮ Prototypical replied-to users, prototypical retweeted users.
Experiments
◮ Political affiliation: more than 80% accuracy
◮ Starbucks fans
◮ Ethnicity
Political Polarization on Twitter [2]
How social media shape the networked public sphere and facilitate communication between communities with different political orientations.
Data Set
◮ 250,000 politically relevant tweets from more than 45,000 users.
◮ Two networks of political communication are constructed: the retweet network and the mention network.
◮ Data set available at: cnets.indiana.edu/groups/nan/truthy
Finding
Figure: Political retweet network (left) and mention network (right)
Framework
◮ Data gathering
◮ Identifying political content
◮ Political communication networks
◮ Network analysis
Identifying Political Content
◮ Political communication: any tweet containing at least one politically relevant hashtag.
◮ Political hashtags are constructed from the seed hashtags #p2 and #tcot using Jaccard similarity.
◮ Let S be the set of tweets containing a seed hashtag and T the set of tweets containing another hashtag:

  σ(S, T) = |S ∩ T| / |S ∪ T|
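The Jaccard computation can be sketched directly on sets of tweet IDs (the IDs below are toy data):

```python
# sigma(S, T): Jaccard coefficient between the tweet sets of a seed hashtag
# and a candidate hashtag.
def jaccard(s, t):
    """|S ∩ T| / |S ∪ T| for two sets of tweet IDs."""
    return len(s & t) / len(s | t) if s | t else 0.0

p2_tweets = {1, 2, 3, 4}    # tweet IDs containing #p2 (toy data)
cand_tweets = {3, 4, 5, 6}  # tweet IDs containing a candidate hashtag
print(jaccard(p2_tweets, cand_tweets))  # 2 shared / 6 total ≈ 0.333
```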
Community Structure
◮ Community detection using a label propagation method for two communities.
◮ Label propagation: assign an arbitrary initial cluster membership to each node, then iteratively update each node's label to the label shared by most of its neighbors.
◮ Modularity is used to measure segregation.
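The label propagation step can be sketched as follows; this is a generic variant, not necessarily the exact method used in the paper.

```python
import random

# Compact sketch of label propagation for community detection.
# graph: node -> set of neighbor nodes.
def label_propagation(graph, n_iter=20, seed=0):
    rng = random.Random(seed)
    labels = {v: v for v in graph}  # arbitrary initial cluster membership
    nodes = list(graph)
    for _ in range(n_iter):
        rng.shuffle(nodes)
        for v in nodes:
            if not graph[v]:
                continue
            # adopt the label shared by most neighbors (ties broken arbitrarily)
            counts = {}
            for u in graph[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            labels[v] = max(counts, key=counts.get)
    return labels

# Two triangles joined by a single edge: nodes 0-2 and 3-5.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(g)
print(sorted(labels))  # [0, 1, 2, 3, 4, 5]
```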
Do clusters have similar content?
◮ Associate each user with a profile vector of hashtags in their tweets, weighted by frequency.
◮ Compute cosine similarity among users.
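Comparing two hashtag profile vectors with cosine similarity can be sketched with sparse dict vectors (the profiles below are toy data):

```python
import math

# Cosine similarity between two hashtag-frequency profiles stored as dicts.
def cosine(u, v):
    dot = sum(u[h] * v[h] for h in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

alice = {"#p2": 3, "#hcr": 1}    # toy hashtag profiles
bob = {"#p2": 2, "#obama": 2}
print(round(cosine(alice, bob), 3))  # 0.671
```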
Do clusters in the retweet network correspond to groups of users of similar political alignment?
◮ Qualitative content analysis from social science.
◮ One author annotates 1,000 random users as 'left' or 'right'.
◮ Another annotator labels 200 random users from the 1,000 above.
◮ Inter-annotator agreement is measured using Cohen's Kappa:

  κ = (P(α) - P(ε)) / (1 - P(ε))

where P(α) is the observed rate of agreement between the annotators and P(ε) is the expected rate of random agreement given the relative frequency of each class label.
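Cohen's Kappa from the formula above, sketched for two annotators (the label lists are toy data; the real annotation sets are far larger):

```python
# Cohen's kappa for two annotators' 'left'/'right' labels.
def cohens_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # P(alpha)
    labels = set(a) | set(b)
    # P(epsilon): chance agreement from each annotator's label frequencies
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

ann1 = ["left", "left", "right", "right", "left", "right"]
ann2 = ["left", "left", "right", "left", "left", "right"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667
```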
Political Twitter Trends, PTT [3]
◮ Analysis tool for political polarization of Twitter hashtags 3.
3http://politicalhashtagtrends.sandbox.yahoo.com/
Data Set
◮ Start with a set of seed political users, such as @BarackObama and @MittRomney, whose political leaning is known.
◮ Get their tweets.
Data Set . . .
◮ Collect users that retweet seed users’ tweets.
Filtering Users by Location
◮ We want to limit our analysis to the U.S.
Evaluating Data Quality
◮ Evaluation against web directories.
◮ Precision = 0.98 and 0.93 for Wefollow and Twellow, respectively.
◮ Manual inspection, e.g. a bio reading: "greatest environmentalist. Also, despise republicans"
Detecting Political Hashtags
◮ Look at co-occurrence with the seed political hashtags (#p2, #tcot, #gop, #ows) and the seed terms ('obama', 'romney', 'politic', 'liberal', 'conservative', 'democ', or 'republic').
◮ Volume filtering to avoid rare hashtags.
Computing Trending Score
◮ Trending: currently popular, i.e. having a higher volume than expected.

  trend(h, w) := ( f(h, w) / Σ_{h′∈H} f(h′, w) ) / ( Σ_{u≤w} f(h, u) / Σ_{h′∈H} Σ_{u≤w} f(h′, u) )

Examples:
◮ #obamagotosama: 01 May 2011 to 08 May 2011.
◮ #ows: 25 Sep. 2011 to 2 Oct. 2011.
◮ Non-trending hashtags: #vote, #democracy.
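The trending score can be sketched with per-hashtag weekly counts; the layout `f[h][w]` (number of tweets with hashtag h in week w) and the counts are toy assumptions.

```python
# trend(h, w): hashtag h's share of volume in week w, divided by its share
# of volume over all weeks up to and including w.
def trend(f, h, w):
    weeks = [u for u in next(iter(f.values())) if u <= w]
    share_now = f[h][w] / sum(f[g][w] for g in f)
    share_hist = (sum(f[h][u] for u in weeks)
                  / sum(f[g][u] for g in f for u in weeks))
    return share_now / share_hist

# Toy counts per hashtag per week index.
f = {"#ows": {1: 10, 2: 90}, "#vote": {1: 90, 2: 110}}
print(round(trend(f, "#ows", 2), 3))  # 1.35: #ows is trending in week 2
```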
Assigning a Leaning to Hashtags
Voting approach:

  Lean(h, w) := Vote(h, w)_L / Σ_{l∈L} Vote(h, w)_l

◮ Vote(h, w)_l = f(h, w)_l
◮ Vote(h, w)_l = f(h, w)_l / Σ_{h′∈H} f(h′, w)_l   (normalization)
◮ Vote(h, w)_l = f(h, w)_l / ( Σ_{h′∈H} f(h′, w)_l + Σ_{h′∈H} f(h′, w)_l / 40000 )   (c.f. Laplace smoothing)
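The normalized voting variant can be sketched as follows; the data layout `f[leaning][hashtag]` (counts for a fixed week w) is an assumption for illustration.

```python
# Lean(h, w): the 'left' share of the normalized votes for hashtag h.
def lean(f, h, smoothing=0.0):
    votes = {}
    for l, counts in f.items():
        total = sum(counts.values())  # sum over all hashtags h' for leaning l
        votes[l] = counts.get(h, 0) / (total + smoothing)
    return votes["left"] / sum(votes.values())

# Toy counts: how often each leaning used each hashtag in week w.
f = {"left": {"#p2": 80, "#ows": 20}, "right": {"#tcot": 90, "#ows": 10}}
print(round(lean(f, "#ows"), 2))  # 0.67: #ows leans left in this toy data
```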
“Change points” detected
◮ Hypothesis: "change points" are caused by users from the other leaning deliberately trying to 'hijack' a hashtag.
Leanings over time
Detecting “change points” in hashtags
◮ Filter hashtags without sufficient support: total number of weeks > 4.
◮ Relative and absolute change in leaning from the previous week: change from previous week > std and change from previous week > 0.25.
◮ Change from the average value is big: |current value - average value| > std.
◮ Change in leaning is in the direction of the other leaning.
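The four rules above can be sketched on a weekly series of leaning scores in [0, 1]; the direction test relative to 0.5 is an interpretation, not taken verbatim from the paper.

```python
from statistics import mean, pstdev

# Change-point rules: enough weeks of support, a big jump versus the previous
# week (> std and > 0.25), a value far from the series average, and a jump
# pointing toward the other leaning.
def change_points(leanings, min_weeks=4, abs_thresh=0.25):
    if len(leanings) <= min_weeks:  # insufficient support
        return []
    avg, std = mean(leanings), pstdev(leanings)
    points = []
    for w in range(1, len(leanings)):
        delta = leanings[w] - leanings[w - 1]
        big_jump = abs(delta) > std and abs(delta) > abs_thresh
        far_from_avg = abs(leanings[w] - avg) > std
        # jump must point away from the side of 0.5 where the hashtag sits
        toward_other = (avg - 0.5) * delta < 0
        if big_jump and far_from_avg and toward_other:
            points.append(w)
    return points

series = [0.9, 0.85, 0.9, 0.88, 0.3, 0.87]  # a mostly-left hashtag dips once
print(change_points(series))  # [4]
```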
Who Wants to Get Fired? [4]
Work from L3S Research Center
Figure: FireMe! on the news
FireMe! Homepage 4
Figure: FireMe! home page
4http://fireme.l3s.uni-hannover.de/fireme.php
Problem Definition
Study posting behaviour of individuals who tweet they hate their job or boss.
Observation
◮ Many users are not aware of their audience.
◮ Build an alerting system to address the danger of public online data.
Data Set
◮ Collect 'haters': users posting sentences like 'I hate my job', 'I hate my boss', 'I have the worst job', . . .
◮ Collect 'lovers': users posting positive messages like 'I love my job' or 'my boss is the best'.
◮ Users were crawled for one week, June 18 to June 26, 2012: 21,851 haters and 44,710 lovers.
◮ Randomly select 10,000 users from each group and get the latest 200 tweets of each user.
Data Analysis
Characterising the groups:
◮ Lovers are more connected: three times as many followers and 20% more friends.
◮ Haters are more active in terms of tweeting speed: twice as many tweets per day.
◮ Haters link to their Facebook profile in their bio.
FireMe! Features
Figure: FireMe! alert page
◮ Shows the user, her hate tweet, and the FireMeter! score.
◮ FireMeter! score: the fraction of hate tweets among the latest 100 tweets of a user.
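The FireMeter! score reduces to a simple fraction; the phrase list below is a stand-in for the system's actual hate-tweet detector.

```python
# Sketch of the FireMeter! score: fraction of hate tweets among the latest
# 100 tweets. The phrase list is illustrative, not the real classifier.
HATE_PHRASES = ("i hate my job", "i hate my boss")

def firemeter(tweets):
    latest = tweets[:100]
    hate = sum(any(p in t.lower() for p in HATE_PHRASES) for t in latest)
    return hate / len(latest) if latest else 0.0

tweets = ["I hate my job so much", "nice weather", "I hate my boss", "lunch!"]
print(firemeter(tweets))  # 0.5
```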
Alert Impact
◮ Sent out over 4,000 alert messages to haters: 'What are you going to do about it?'
◮ Feedback: 42% deleted the hate tweet, 18% changed their privacy settings, and 40% did not care.
◮ Overall, 60% of users were concerned about their personal data.
FireMe! Features . . .
Figure: FireMe! provides a leaderboard
Validation and more analysis . . .
◮ The majority of users deleted the compromising tweets.
◮ The tone of the alert messages (neutral, aggressive, action) is important.
References

[1] M. Pennacchiotti and A.-M. Popescu, “Democrats, republicans and starbucks afficionados: user classification in twitter,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’11), New York, NY, USA, pp. 430–438, ACM, 2011.

[2] M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, A. Flammini, and F. Menczer, “Political polarization on Twitter,” in Proc. 5th Intl. Conference on Weblogs and Social Media, 2011.

[3] I. Weber, V. Garimella, and A. Teka, “Political hashtag trends,” in Advances in Information Retrieval (P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, and E. Yilmaz, eds.), vol. 7814 of Lecture Notes in Computer Science, pp. 857–860, Springer Berlin Heidelberg, 2013.

[4] R. Kawase, B. P. Nunes, E. Herder, W. Nejdl, and M. A. Casanova, “Who wants to get fired?,” in CHI, April 27 to May 2, 2013.