Bayesian Modeling for Analyzing Online Content and Users
Bin Bi
Computer Science Department, University of California, Los Angeles
bbi@cs.ucla.edu
Online Content Explosion
By Domo
Information Overload
The sheer amount of online content is both a curse and a blessing.
Curse:
- Confusion
- Sub-optimal decisions
- Dissatisfaction
Blessing:
- Investigating topics of interest
- Checking facts
- Getting advice about problems
Goal: Learning to discover high-quality information
q Two schemes
1. Discern good content from bad
2. Identify users who generate high-quality content
q Two domains
1. Social media
2. Search engine
Influencer Discovery on Microblogs
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
Scalable Topic-specific Influence Analysis on Microblogs
Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, Junghoo Cho [WSDM 2014]
Motivation
q Huge amount of textual and social information produced by popular microblogging sites
- As of March 2012, Twitter had over 500 million users creating over 340 million tweets daily
q A popular resource for marketing campaigns
- Monitor opinions of consumers
- Launch viral advertising
q Identifying social influencers is crucial for market analysis
Existing Social Influence Analysis
q Most existing influence analyses are based only on network structures
- e.g., influence maximization work [Kleinberg et al., KDD '03]
- Valuable textual content is ignored
- Only one global influence score is computed for each user
q But we need
- Topic-specific influence based on both network and content
- To differentiate influence across different aspects of life (topics)
Existing Topic-Specific Influence Analysis
q Separate analysis of content and networks
- e.g. Topic-sensitive PageRank (TSPR), TwitterRank
Pipeline: text analysis on content → topic collection → influence analysis on network
Problem: Content and links are often correlated in microblog networks
- A user tends to follow another who tweets in similar topics
Solution Overview
q Goal: Identify topic-specific key influencers on microblogs
q Leverage both content and network
q Tight coupling of content and network analysis
q Followship-LDA (FLDA) Model
- Our topic-specific influence model for microblog networks
- Solid probabilistic foundation in Bayesian modeling
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
Review of Typical LDA
q Latent Dirichlet Allocation (LDA) [Blei et al., JMLR '03] is a generative topic model for latent topic discovery in a text corpus
q Intuition:
from Blei’s slides
Review of Typical LDA (cont’d)
- Each topic is a distribution over words
- Each document is a distribution over topics
- Each word is drawn from one of those topics
- Per-topic word distribution: ϕ
- Per-document topic distribution: θ
- Topic assignments: z
Review of Typical LDA (cont’d)
- In reality, we only observe the documents
- All other structure consists of hidden variables
- Per-topic word distribution: ϕ
- Per-document topic distribution: θ
- Topic assignments: z
Statistical Modeling
q Generative probabilistic modeling
- Treats data as observations
- Contains hidden variables
- Specifies a probabilistic procedure by which the observations are generated
q Inference
- Input: observed data
- Output: values of hidden variables
Graphical Model for LDA
Generative Process for LDA
q For each document d
§ Sample θd from Dirichlet(α)
§ For each word position in d
- Sample a topic z from θd
- Sample a word w from φz
θd: Topic distribution for document d
φz: Word distribution for topic z
(Plate notation: nodes w, z, θ, φ; hyperparameters α, β; plate sizes K, M, Nm)
M: Number of documents; Nm: Number of words in document m; K: Number of topics
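The generative process above can be sketched directly in code. This is a minimal illustration with made-up corpus sizes and hyperparameters, not the implementation behind the results in this deck:

```python
import numpy as np

def generate_lda_corpus(M=100, K=5, V=1000, alpha=0.5, beta=0.1, doc_len=50, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)        # per-topic word distributions phi_z
    docs, thetas = [], []
    for _ in range(M):
        theta = rng.dirichlet([alpha] * K)         # per-document topic distribution theta_d
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)             # sample a topic z from theta_d
            words.append(rng.choice(V, p=phi[z]))  # sample a word w from phi_z
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi
```

Inference is the reverse direction: given only `docs`, recover `thetas` and `phi`.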
Topic-specific Influence Analysis on Microblogs
q Intuition
① Each microblog user has tweet text and a set of followees
② A user tweets in multiple topics
- Alice tweets about technology and food
③ A topic is a distribution over words
- web and cloud are more likely to appear in the technology topic
Example — Alice:
- Content: web, organic, veggie, cookie, cloud, ……
- Followees: Michelle Obama, Mark Zuckerberg, Barack Obama
Topic-specific Influence Analysis on Microblogs
q Intuition (cont’d)
④ A user follows another for different reasons
- Content-based: follow for similar topics
- Content-independent: follow for popularity
⑤ Topics of content-based followships differ from each other
- Mark Zuckerberg is more likely to be followed for the technology topic
- Topic-specific influence: the probability of a user being followed for a given topic
Followship-LDA (FLDA)
q FLDA: A fully-Bayesian generative model specifically designed for microblog networks
- Specifies a stochastic process by which the content and links of each user are generated
- Introduces hidden structure
- Topics, reasons of a user following another, topic-specific influence, etc.
- Inference: reverse the generative process
- What hidden structure is most likely to have generated the observed data?
Hidden Structure in FLDA

q Per-user topic distribution (θ): User → Topic
          Tech   Food   Poli
Alice     0.8    0.19   0.01
Bob       0.6    0.3    0.1
Mark      …      …      …
Michelle  …      …      …
Barack    …      …      …

q Per-topic word distribution (ϕ): Topic → Word
      web    cookie  veggie  organic  congress  …
Tech  0.3    0.1     0.001   0.001    0.001     …
Food  0.001  0.15    0.3     0.1      0.001     …
Poli  0.005  0.001   0.001   0.002    0.25      …

q Per-user followship preference (µ): User → Preference
          Follow for Topic  Follow for Pop.
Alice     0.75              0.25
Bob       0.5               0.5
Mark      …                 …
Michelle  …                 …
Barack    …                 …

q Per-topic followee distribution (σ): topic-specific influence, Topic → User
      Alice  Bob   Mark  Michelle  Barack
Tech  0.1    0.05  0.7   0.05      0.1
Food  0.04   0.1   0.06  0.75      0.05
Poli  0.03   0.02  0.05  0.1       0.8

q Global followee distribution (π): global popularity
        Alice  Bob    Mark   Michelle  Barack
Global  0.005  0.001  0.244  0.25      0.5
Graphical Model for FLDA

Generative process — for the mth user:
q Pick a topic distribution θm
q Pick a followship preference µm
q For the nth word position
- Pick a topic z from topic distribution θm
- Pick a word w from word distribution φz
q For the lth followee
- Choose the cause based on followship preference µm
- If content-related
§ Pick a topic z from topic distribution θm
§ Pick a followee from per-topic followee distribution σz
- Otherwise
§ Pick a followee from global followee distribution π

(Plate-notation example — Alice: topic distribution Tech: 0.8, Food: 0.19, Poli: 0.01; followship preference — follow for content: 0.75, not: 0.25; content: web, organic, …; followees: Michelle Obama, Barack Obama, …)
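The FLDA generative process for a single user can be sketched as follows. This is a minimal illustration using the slide's notation (θm, µm, φz, σz, π), with all distributions assumed given:

```python
import numpy as np

def generate_flda_user(theta_m, mu_m, phi, sigma, pi, n_words, n_followees, rng):
    """Generate one user's words and followships under the FLDA process.

    theta_m: user's topic distribution; mu_m: P(content-based followship);
    phi[z]: word distribution per topic; sigma[z]: followee distribution
    per topic; pi: global (popularity-based) followee distribution.
    """
    K = len(theta_m)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta_m)                       # topic for this word position
        words.append(rng.choice(len(phi[z]), p=phi[z]))    # word from phi_z
    followees = []
    for _ in range(n_followees):
        if rng.random() < mu_m:                            # content-based followship
            z = rng.choice(K, p=theta_m)
            followees.append(rng.choice(len(sigma[z]), p=sigma[z]))
        else:                                              # popularity-based followship
            followees.append(rng.choice(len(pi), p=pi))
    return words, followees
```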
Bayesian Learning for FLDA
q Gibbs Sampling
- A Markov chain Monte Carlo algorithm
1. Begin with some initial value for each latent variable
2. Iteratively sample each variable conditioned on the current values of the other variables
3. Use the samples to approximate the posterior distribution
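As a concrete instance of these steps, here is a minimal collapsed Gibbs sampler for plain LDA (illustrative only; FLDA's sampler additionally handles followship variables):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each word's topic
    conditioned on the current assignments of all other words."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    ndz = np.zeros((M, K))   # topic counts per document
    nzw = np.zeros((K, V))   # word counts per topic
    nz = np.zeros(K)         # total words per topic
    z_asg = [rng.integers(K, size=len(d)) for d in docs]   # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = z_asg[d][i]
            ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_asg[d][i]
                ndz[d, z] -= 1; nzw[z, w] -= 1; nz[z] -= 1   # remove current assignment
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + V * beta)
                z = rng.choice(K, p=p / p.sum())             # resample conditioned on the rest
                z_asg[d][i] = z
                ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    return ndz, nzw
```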
Gibbs Sampler for FLDA
q Derived conditionals for the FLDA Gibbs sampler
- Prob. of the topic of the nth word of the mth user, given the current values of all other variables
- Prob. of the topic of the lth content-based followship of the mth user, given the current values of all other variables
- Prob. that the lth followship of the mth user is independent of content, given the current values of all other variables
Gibbs Sampler for FLDA (cont’d)
q In each pass over the data, for the mth user
- Sample latent variables from their respective conditionals
- Keep counters while sampling
  - c^m_{w,z}: # times word w is assigned to topic z for the mth user
  - d^m_{e,z}: # times followee e is assigned to topic z for the mth user
q Estimate the distributions of the hidden structure from the counters
- Per-user topic distribution
- Per-topic followee distribution (influence)
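The final estimation step amounts to smoothed normalization of the counters. This is an illustrative sketch; the exact estimators are in the paper:

```python
import numpy as np

def estimate_theta(ndz, alpha):
    """Smoothed per-user topic distribution from topic-assignment counts:
    theta[m, z] is proportional to count(m, z) + alpha."""
    K = ndz.shape[1]
    return (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + K * alpha)

def estimate_sigma(dez, beta):
    """Smoothed per-topic followee distribution (topic-specific influence):
    sigma[z, e] is proportional to count(z, e) + beta, where dez[z, e]
    counts how often followee e is assigned to topic z."""
    E = dez.shape[1]
    return (dez + beta) / (dez.sum(axis=1, keepdims=True) + E * beta)
```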
Distributed Gibbs Sampling for FLDA
q Challenge: the Gibbs sampling process is inherently sequential
- Each sample step relies on the most recent values of all other variables.
q Observation: dependency between variable assignments is weak
q Solution: Distributed Gibbs Sampling for FLDA
- Relax the sequential requirement of Gibbs Sampling
- Implemented on Spark
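One way to picture the relaxed scheme (this is not the paper's Spark implementation): each worker samples its partition of users against a stale snapshot of the global counters, and the master merges the resulting count deltas at the end of the round. `sample_partition` is a hypothetical callback standing in for the local sampling step:

```python
import numpy as np

def distributed_gibbs_round(partitions, global_counts, sample_partition):
    """One round of relaxed (approximate) distributed Gibbs sampling.

    `sample_partition(part, local)` mutates its local counter copy while
    sampling that partition and returns the delta it produced.
    """
    snapshot = global_counts.copy()
    deltas = []
    for part in partitions:        # on Spark, these iterations run as parallel tasks
        local = snapshot.copy()    # stale copy: workers don't see each other's updates
        deltas.append(sample_partition(part, local))
    for delta in deltas:           # master merges all deltas at a synchronization barrier
        global_counts += delta
    return global_counts
```

The relaxation trades exactness of each conditional for parallelism, which the weak-dependency observation above justifies.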
Search Framework for Topic-Specific Influencers
q SKIT: search framework for topic-specific key influencers
- Input: free text
- Output: ranked list of key influencers
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
⓵ Derive the topics of interest from the input text
⓶ Compute an influence score for each user
⓷ Sort users by influence score and return the top influencers
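Steps ⓶ and ⓷ can be sketched as follows, assuming step ⓵ has already mapped the query to a topic distribution and `sigma` holds FLDA's per-topic followee (influence) distributions:

```python
import numpy as np

def skit_rank(query_topic_dist, sigma, user_names, top_n=5):
    """Rank influencers for a query: each user's score is their
    topic-specific influence weighted by the query's topic weights;
    sigma[z, u] is the per-topic followee distribution from FLDA."""
    scores = query_topic_dist @ sigma     # expected influence under the query's topics
    order = np.argsort(-scores)           # sort users by score, descending
    return [(user_names[u], float(scores[u])) for u in order[:top_n]]
```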
Experiments – Effectiveness on Twitter Data
q Twitter Dataset crawled in 2010
- 1.76M users, 2.36B words, 183M links, 159K distinct words
q Interesting findings
- Overall ~ 15% of followships are content-independent
- Top-5 globally popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen DeGeneres (comedian)
Topic: "Information Technology"
- Top-10 keywords: data, web, cloud, software, open, windows, Microsoft, server, security, code
- Top-5 influencers: Tim O'Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger)

Topic: "Jobs"
- Top-10 keywords: business, job, jobs, management, manager, sales, services, company, service, small, hiring
- Top-5 influencers: Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com

Topic: "Cycling and Running"
- Top-10 keywords: bike, ride, race, training, running, miles, team, workout, marathon, fitness
- Top-5 influencers: Lance Armstrong, Levi Leipheimer, George Hincapie (all three US Postal pro cycling team members), Johan Bruyneel (US Postal team director), RSLT (a pro cycling team)
Experiments – Effectiveness on Weibo Data
q Tencent Weibo Dataset from KDD Cup 2012
- 2.33M users, 492M words, 51M links, 714K distinct words
- VIP users
- Organized in categories
- Manually labeled “key influencers” in the corresponding categories
- Used as “ground truth”
Efficiency of Distributed FLDA
q Setup
- Sequential: a high-end server (192 GB RAM, 3.2 GHz)
- Distributed: 27 servers (32 GB RAM, 2.5 GHz, 8 cores each) connected by 1 Gbit Ethernet; 1 master and 200 workers
q Dataset
- Tencent Weibo: 2.33M users, 492M words, 51M links
- Twitter: 1.76M users, 2.36B words, 183M links
q Sequential vs. Distributed runtime
          Sequential  Distributed
Weibo     4.6 days    8 hours
Twitter   21 days     1.5 days
Conclusion for FLDA work
q FLDA model for topic-specific influence analysis
- Content + Network
- Differentiates various reasons why one follows another
q Distributed Gibbs sampling algorithm for FLDA
- Results comparable to those of the sequential Gibbs sampling algorithm
- Scales nicely with data sizes
q Significantly higher quality results than existing work
Authority Analysis on Content Sharing Websites
Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services
Bin Bi, Ben Kao, Chang Wan, Junghoo Cho [KDD 2014]
Who’s a master of landscape photography? Who’s an expert in shooting city lights?
Topic-specific Authority Analysis on Content Sharing Websites
Resources
- Video
- Photo
- Travel blog
- ……
How do we find a topic-specific authority who creates high-quality resources?
Why Topic-specific?
(Examples: a city-lights expert vs. a landscape expert)
No one is authoritative on every topic, and users have different topical needs.
- Who's a master of sunset photography?
- Who's an expert in portrait photography?
LDA-based Naïve Solution
q Adapt LDA to the data in the sharing log
- User → Document
- Tag → Word
q Two implications:
- A user is interested in topic T ⇔ she frequently posts resources with tags specific to T
- The more frequently a user uses tags covering T, the more authoritative she should be on T
(Sharing log → LDA)
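The adaptation itself is just a data-reshaping step; a minimal sketch:

```python
from collections import defaultdict

def sharing_log_to_corpus(sharing_log):
    """Map the sharing log onto LDA's inputs: each user becomes a
    'document' and each tag occurrence becomes a 'word', so per-document
    topic distributions become per-user topical interests."""
    docs = defaultdict(list)
    for user, resource, tags in sharing_log:
        docs[user].extend(tags)   # all tags a user posts form one document
    return dict(docs)
```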
Like Log
q A supplementary source of signal about content quality is needed
q The like log (like clicks / votes) provides a valuable signal
q Challenge: users don't specify the topical causes behind their like clicks
Topic-specific Authority Analysis (TAA) Model
q Jointly model topical interest and topical authority:
(Sharing log + like log → TAA)
Intuition for Characterizing Authoritativeness
q Users’ authority
- Different from each other
q Each user’s authority
- Specific to individual topics
q Introduce ηu to characterize topical authority
- A K-dimensional random vector over topics: ηu = (ηu1, ηu2, …, ηuK)
- Specific to individual user u
Intuition for Characterizing Like Clicks
q Introduce fur to represent like feedback
- A binary random variable
- Specific to user u and resource r
q u likes r ⇒ the topical authority exhibited by r matches u's topical interest

fur = 1 if u favorited r, and fur = 0 otherwise
Intuition for Characterizing Like Clicks (cont’d)
q Identify hidden topical causes behind like clicks
- Designed a model for topic discovery
q With the topics, we specify the likelihood of a like click
(Example photo: did the user like it for "San Francisco" or for "City Lights"?)
Intuition behind Topic Discovery for Like Clicks
q Logistic likelihood function: captures the similarity between the topic distributions of resource r and user u's interest, weighted by poster u''s topical authority
q fur = 1 indicates that poster u' should be an expert in the topics prominent in both u's interest and resource r, so the weights of these topics have to be boosted
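A sketch of this idea (the exact likelihood is in the paper; this follows the slide's description literally):

```python
import numpy as np

def like_probability(theta_u, theta_r, eta_poster):
    """Sketch of the TAA like likelihood: the probability that user u
    likes resource r grows with the topic-wise match between u's interest
    (theta_u) and r's topic distribution (theta_r), weighted by the
    poster's topical authority (eta_poster)."""
    score = np.sum(theta_u * theta_r * eta_poster)   # topic-wise weighted match
    return 1.0 / (1.0 + np.exp(-score))              # logistic link
```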
Topic-specific Authority Analysis (TAA)
Graphical Model / Generative Process (users and posters with resources)

For user u:
q Pick a topic distribution θu
q Pick an authority vector ηu from MVN(µ, Σ)
q For the nth tag position
- Pick a topic z from topic distribution θu
- Pick a tag t from word distribution φz
For each (u, r) pair:
q Pick a like response fur from Bernoulli(logistic likelihood)
Quantities of Interest
q ηu quantifies user u's unique authoritativeness over topics
q θu characterizes user u's topical interest
q ϕt indicates the probabilities of tag t belonging to individual topics
Bayesian Learning for TAA
q Bayesian learning for a model with a logistic likelihood has long been recognized as a hard problem [Gramacy et al. 2012]
q We extend recent work [Polson et al. 2013] for inference of TAA
- Introduce Pólya-Gamma variables into the posterior distribution
q Derived conditionals for Gibbs sampling: for η, for z, and for the Pólya-Gamma variables δ
Experiments
q Data collections

Data    #users  #photos  #tag asgmts  #like clicks
Flickr  21,054  204,335  3,014,813    1,562,805
500px   33,581  318,906  3,520,179    1,837,049

q Metrics for effectiveness
- MRR (Mean Reciprocal Rank)
- Spearman's rank correlation coefficient
(Bar charts: MRR and Spearman's rank correlation coefficient on Flickr and 500px for Most-tagged, LDA, Most-favorited, TwitterRank, Link-PLSA-LDA, and TAA; TAA scores highest on both metrics and both datasets.)
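For reference, MRR can be computed as follows (an illustrative helper, not the paper's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: average over queries of 1/rank of the first relevant item
    in the ranked list (0 when no relevant item appears)."""
    total = 0.0
    for query, ranking in ranked_lists.items():
        rr = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[query]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)
```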
Predictive Power
(Line charts: held-out perplexity vs. number of topics (20–100) for Link-PLSA-LDA and TAA on Flickr and 500px; TAA achieves lower perplexity.)
q Perplexity metric
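Perplexity is the standard held-out measure exp(−log-likelihood per token); lower perplexity means stronger predictive power:

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity of held-out data: exp of the negative per-token
    log-likelihood. Lower is better."""
    return math.exp(-log_likelihood / n_tokens)
```

For example, a model assigning every token probability 0.25 has perplexity 4.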
Case visualization
Conclusion for TAA work
q Propose a novel TAA model for topic-specific authority analysis on content sharing websites
- Leverage both sharing log and like log
q Propose a new inference algorithm to learn the parameters in TAA
- Gibbs sampling with data augmentation
q Higher quality results and stronger predictive power than the state of the art
Future Work
q Bayesian modeling can be used in a wide range of applications
q Recommendation
- Extending FLDA and TAA to recommender systems in various domains
q Feature Extraction
- Including posterior distributions of latent topics as features for a learning model
q User Modeling
- Dealing with data sparsity by borrowing information across users
References
q Bin Bi and Junghoo Cho. Modeling a Retweet Network via an Adaptive Bayesian Approach. Proceedings of the 25th International World Wide Web Conference (WWW), 2016.
q Bin Bi, Hao Ma, Paul Hsu, Wei Chu, Kuansan Wang, and Junghoo Cho. Learning to Recommend Related Entities to Search Users. Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015.
q Youngchul Cha, Keng-hao Chang, Hari Bommaganti, Ye Chen, Tak Yan, Bin Bi, and Junghoo Cho. A Universal Topic Framework (UniZ) and Its Application in Online Search. Proceedings of the 30th ACM SIGAPP Symposium On Applied Computing (SAC), 2015.
q Bin Bi, Ben Kao, Chang Wan, and Junghoo Cho. Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2014.
q Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, and Junghoo Cho. Scalable Topic-Specific Influence Analysis on Microblogs. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 2014.
References
q Bin Bi, Milad Shokouhi, Michal Kosinski, and Thore Graepel. Inferring the Demographics of Search Users: Social Data Meets Search Queries. Proceedings of the 22nd International World Wide Web Conference (WWW), 2013.
q Bin Bi and Junghoo Cho. Automatically Generating Descriptions for Resources by Tag Modeling. Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), 2013.
q Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, and Junghoo Cho. Incorporating Popularity in Topic Models for Social Network Analysis. Proceedings of the 36th International ACM SIGIR Conference (SIGIR), 2013.
q Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, and Eric Lo. DQR: A Probabilistic Approach to Diversified Query Recommendation. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012.
q Bin Bi, Sau Dan Lee, Ben Kao, and Reynold Cheng. CubeLSI: An Effective and Efficient Method for Searching Resources in Social Tagging Systems. Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE), 2011.
q Bin Bi, Lifeng Shang, and Ben Kao. Collaborative Resource Discovery in Social Tagging Systems. Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM), 2009.