Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research - PowerPoint PPT Presentation

Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research Center April 30, 2013

Outline Introduction User Classification Network Analysis Content Analysis Privacy Issues L3S Web Science 1

Web Science Definition from Web Science Conference 1 ◮ Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web. ◮ Web Science embraces the study of the Web as a vast universal information network of people and communities. ◮ Studying human behavior and social interaction contributes to our understanding of the Web, while Web data is transforming how social science is conducted. 1 http://www.websci13.org/ L3S Web Science 2

Social Media Figure: Social Media billboard 2 2 http://bit.ly/10216Jy L3S Web Science 3

Twitter ◮ Politicians use Twitter to mobilize users. ◮ Companies use Twitter for marketing products. L3S Web Science 4

User classification in twitter [1] How can we automatically construct user profiles? L3S Web Science 5

Applications ◮ Authoritative users extraction - Discovering expert users for a target topic. ◮ Personalized web search - Personalized social media posts retrieval. ◮ User recommendation - Suggesting new interesting users to a target user. L3S Web Science 6

Example tasks ◮ Political affiliation detection (Right vs Left) ◮ Ethnicity identification (African-Americans or not) ◮ Detecting affinity for a particular business (Starbucks fans) L3S Web Science 7

Machine Learning Model ◮ Feature Construction: Profile, tweeting behaviour, linguistic content, social network features. ◮ Classification Algorithm: Gradient Boosted Decision Trees, GBDT framework L3S Web Science 8

Profile features Profile information does not contain enough quality information to be directly used for user classification. L3S Web Science 9

Profile features Profile information does not contain enough quality information to be directly used for user classification. ◮ Length of name, Number of alphanumeric chars. ◮ Capitalization forms in user name ◮ Use of avatar picture ◮ Number of followers/ friends ◮ Regular expression matches in bio: ( I | i )( m | am | ′ m | [0 − 9] + ( yo | yearold ) whiteman | woman | boy | girl L3S Web Science 9

Tweeting behaviour features A set of statistics capturing the way users interact with the micro-blogging service. L3S Web Science 10

Tweeting behaviour features A set of statistics capturing the way users interact with the micro-blogging service. ◮ Number of tweets of a user. ◮ Number and fraction of retweets of a user. ◮ Ave. number of hashtags and URLs per tweet ◮ Ave. time and std between tweets L3S Web Science 10

Linguistic content features Linguistic content contains the user’s lexical usage and the main topics of interest to the user. L3S Web Science 11

Linguistic content features Linguistic content contains the user’s lexical usage and the main topics of interest to the user. ◮ prototypical words, hashtags instead of bag-of-words representation. ◮ Generic LDA, Domain-specific LDA ◮ Sentiment words L3S Web Science 11

Social network features These features contain the social connections between a user and those one follows, replies to or whose messages they retweet. ◮ Friend accounts - Prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set. ◮ Number of prototypical friends, percentage number of prototypical friend ◮ Prototypical replied users, Prototypical retweeted users L3S Web Science 12

Experiments ◮ Political affiliation, more than 80% ◮ Starbucks fans ◮ Ethnicity L3S Web Science 13

Political Polarization on Twitter [2] How social media shape the networked public sphere and facilitate communication between communities with different political orientations. L3S Web Science 14

Data Set ◮ 250,000 politically relevant tweets from more than 45,000 users. ◮ Construct two networks of political communication - retweet and mention networks. ◮ Data set available at: cnets.indiana.edu/groups/nan/truthy L3S Web Science 15

Finding Figure: Political retweet network (left) and mention network(right) L3S Web Science 16

Framework ◮ Data gathering ◮ Identifying political content ◮ Political communication networks ◮ Network analysis L3S Web Science 17

Identifying Political Content ◮ Political communication - any tweet containing at least one politically relevant hashtag. ◮ Political hashtags constructed from seed hashtags #p2 and #tcot using Jaccard similarity. ◮ Let S set of tweets containing seed hashtag and T set of tweets containing another hashtag. σ ( S , T ) = S ∩ T S ∪ T L3S Web Science 18

Community Structure ◮ Community detection using a label propagation method for two communities. ◮ Label propagation - Assign an initial arbitrary cluster membership to each node and then iteratively update each node’s label according to the label that is shared by most of its neighbors. ◮ Modularity to measure segregation. L3S Web Science 19

Do clusters have similar content? ◮ Associate each user with a profile vector of hashtags in their tweets, weighted by frequency. ◮ Cosine similarity among users. L3S Web Science 20

Do clusters in the retweet network correspond to groups of users of similar political alignment? ◮ Qualitative content analysis from social science. ◮ One author annotates 1,000 random users as ‘left’ or ‘right’. ◮ Another user annotates 200 random users from the 1,000 users above. ◮ Inter annotator agreement measured using Cohen’s Kappa ◮ k = P ( α ) − P ( ǫ ) 1 − P ( ǫ where P ( α ) is observed rate of agreement between annotators and P ( ǫ ) is expected rate of random agreement given relative frequency of each class label. L3S Web Science 21

Political Twitter Trends, PTT [3] ◮ Analysis tool for political polarization of Twitter hashtags 3 . 3 http://politicalhashtagtrends.sandbox.yahoo.com/ L3S Web Science 22

Data Set ◮ Start with a set of seed political users such as @BarackObama and @MittRomney whose political leaning is known. ◮ Get their tweets. L3S Web Science 23

Data Set . . . ◮ Collect users that retweet seed users’ tweets. L3S Web Science 24

Filtering Users by Location ◮ We want to limit our analysis to the U.S. L3S Web Science 25

Evaluating Data Quality ◮ Against Web Directories. ◮ Precision = 0 . 98, 0 . 93 for Wefollow and Twellow respectively. ◮ Manual inspection: “greatest environmentalist. Also, despise republicans” L3S Web Science 26

Detecting Political Hashtags ◮ Look into co-occurrence with seed political hashtags ( #p2 , #tcot , #gop , #ows ) and (‘obama’, ‘romney’, ‘politic’, ‘liberal’, ‘conservative’, ‘democ’, or ‘republic’) ◮ Volume filtering to avoid rare hashtags. L3S Web Science 27

Computing Trending Score ◮ trending - currently popular. Having a higher volume than expected . L3S Web Science 28

Computing Trending Score ◮ trending - currently popular. Having a higher volume than expected . f ( h , w ) / � h ′∈ H f ( h ′ , w ) trend ( h , w ) := u ≤ w f ( h , u ) / � � u ≤ w f ( h ′ , u ) � h ′∈ H L3S Web Science 28

Computing Trending Score ◮ trending - currently popular. Having a higher volume than expected . f ( h , w ) / � h ′∈ H f ( h ′ , w ) trend ( h , w ) := u ≤ w f ( h , u ) / � � u ≤ w f ( h ′ , u ) � h ′∈ H Examples: ◮ #obamagotosama : 01 May 2011 to 08 May 2011. ◮ #ows : 25 Sep. 2011 to 2 Oct. 2011. L3S Web Science 28

Computing Trending Score ◮ trending - currently popular. Having a higher volume than expected . f ( h , w ) / � h ′∈ H f ( h ′ , w ) trend ( h , w ) := u ≤ w f ( h , u ) / � � u ≤ w f ( h ′ , u ) � h ′∈ H Examples: ◮ #obamagotosama : 01 May 2011 to 08 May 2011. ◮ #ows : 25 Sep. 2011 to 2 Oct. 2011. ◮ Non-trending hashtags: #vote , #democracy . L3S Web Science 28

Assigning a Leaning to Hashtags Using Voting approach: Vote ( h , w ) L Lean(h,w) := l ∈ L Vote ( h , w ) l . � L3S Web Science 29

Assigning a Leaning to Hashtags Using Voting approach: Vote ( h , w ) L Lean(h,w) := l ∈ L Vote ( h , w ) l . � ◮ Vote(h,w) l = f ( h , w ) l L3S Web Science 29

Assigning a Leaning to Hashtags Using Voting approach: Vote ( h , w ) L Lean(h,w) := l ∈ L Vote ( h , w ) l . � ◮ Vote(h,w) l = f ( h , w ) l ◮ Vote(h,w) l = f ( h , w ) l h ′∈ H f ( h ′ , w ) l Normalization � L3S Web Science 29

Assigning a Leaning to Hashtags Using Voting approach: Vote ( h , w ) L Lean(h,w) := l ∈ L Vote ( h , w ) l . � ◮ Vote(h,w) l = f ( h , w ) l ◮ Vote(h,w) l = f ( h , w ) l h ′∈ H f ( h ′ , w ) l Normalization � ◮ Vote(h,w) l = h ′∈ H f ( h ′ , w ) l f ( h , w ) l � h ′∈ H f ( h ′ , w ) l + (c.f. Laplace 40000 � Smoothing) L3S Web Science 29

Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research - PowerPoint PPT Presentation

Mining the Social Web Asmelash Teka Hadgu teka@l3s.de L3S Research Center April 30, 2013 Outline Introduction User Classification Network Analysis Content Analysis Privacy Issues L3S Web Science 1 Web Science Definition from Web

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

Web Mining Web Mining to automatically discover and extract information from Web

Web Mining Web Mining to automatically discover and extract information from Web

Web Mining Andreas Andersson Gustav Strmberg Sandra Stendahl Introduction Web mining o

Introduction to Web Mining What is Web Mining? Discovering useful information from the

Web Services Web Services Towards Web Services Towards Web Services Towards Web Services A

Data Mining: Concepts and Techniques Web Mining Li Xiong Slides credits: Jiawei Han and

Cement, Aggregates, Mining Presentation Cement, Aggregates and Mining Cement, Aggregates and

Frequent Pattern Mining Frequent Sequence Mining Frequent Tree Mining Christian Borgelt

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Lecture 1: Semantic Web and RDF Aidan Hogan aidhog@gmail.com THE WEB The Web is now 26 years

Semantic Web Mining Bettina Berendt Humboldt-Universitt zu Berlin Institut fr

What is Web Mining? The use of data mining techniques to automatically RECOMMENDATION MODELS

the images Egbert Mittelstdt uses an analogue/digital animation stand, at which by means of

power in the north What does the recent restructuring of NZ ODA reveal about power and

Climate Change, Agency, and Knowledge Production By Vanessa Adel Sustainability Research Panel

Overview of IPAC CO Overview of IPAC CO 2 an International an International P Performance

Internet Initiative Japan Inc. Corporate Overview September 2014 TSE1:3774 NASDAQ:IIJI Key

Leadership Group January 31, 2014 Meeting Quick History Established 1951 Six

Dr Jacqueline Baxter The Open University Walton Hall Milton Keynes MK7 6AA

Socio-Economic Opportunities in Technological Change: Collaborative Community Resilience Linton