SLIDE 1

Bayesian Modeling for Analyzing Online Content and Users

Bin Bi

Computer Science Department
University of California, Los Angeles
bbi@cs.ucla.edu

SLIDE 2

Online Content Explosion

By Domo

2

SLIDE 3

Information Overload

The sheer amount of online content is both a blessing and a curse.

Blessing:
  • Investigating topics of interest
  • Checking facts
  • Getting advice about problems

Curse:
  • Confusion
  • Sub-optimum decisions
  • Dissatisfaction

3

SLIDE 4

Goal: Learning to discover high-quality information

Two schemes:
1. Discern good content from the bad
2. Identify users who generate high-quality content

Two domains:
1. Social media
2. Search engine

4

SLIDE 5

Influencer Discovery on Microblogs

[Figure: SKIT search box — the query "healthy food" returns influencers such as Michelle Obama, Jamie Oliver, …]

Scalable Topic-specific Influence Analysis on Microblogs

Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, Junghoo Cho [WSDM 2014]

5

SLIDE 6

Motivation

Huge amount of textual and social information produced by popular microblogging sites
  • As of March 2012, Twitter had over 500 million users creating over 340 million tweets daily

A popular resource for marketing campaigns
  • Monitor opinions of consumers
  • Launch viral advertising

Identifying social influencers is crucial for market analysis

6

SLIDE 7

Existing Social Influence Analysis

Most existing influence analyses are based only on network structures
  • e.g., influence maximization work [Kempe et al., KDD'03]
  • Valuable textual content is ignored
  • Only one global influence score is computed per user

But we need to
  • Compute topic-specific influence based on both network and content
  • Differentiate influence in different aspects of life (topics)

7

SLIDE 8

Existing Topic-Specific Influence Analysis

Separate analysis of content and networks
  • e.g., Topic-sensitive PageRank (TSPR), TwitterRank
  • Pipeline: text analysis on content → topic collection → influence analysis on network

Problem: Content and links are often correlated in microblog networks
  • A user tends to follow another who tweets on similar topics

8

SLIDE 9

Solution Overview

Goal: Identify topic-specific key influencers on microblogs
  • Leverage both content and network
  • Tightly couple content and network analysis

Followship-LDA (FLDA) Model
  • Our topic-specific influence model for microblog networks
  • Solid probabilistic foundation in Bayesian modeling

[Figure: SKIT search box — the query "healthy food" returns influencers such as Michelle Obama, Jamie Oliver, …]

9

SLIDE 10

Review of Typical LDA

Latent Dirichlet Allocation (LDA) [Blei et al., JMLR '03] is a generative topic model for latent topic discovery in a text corpus.

Intuition: (figure from Blei's slides)

10

SLIDE 11

Review of Typical LDA (cont’d)

  • Each topic is a distribution over words
  • Each document is a distribution over topics
  • Each word is drawn from one of those topics

Per-topic word distribution: φ
Per-document topic distribution: θ
Topic assignments: z

11

SLIDE 12

Review of Typical LDA (cont’d)

  • In reality, we only observe the documents
  • The other structures are hidden variables

Per-topic word distribution: φ
Per-document topic distribution: θ
Topic assignments: z

12

SLIDE 13

Statistical Modeling

q Generative probabilistic modeling

  • Treats data as observations
  • Contains hidden variables
  • Specifies a probabilistic procedure by which the observations are generated

Inference
  • Input: observed data
  • Output: values of hidden variables

13

SLIDE 14

Graphical Model for LDA

Generative process for LDA. For each document d:
  • Sample θd from Dirichlet(α)
  • For each word position in d:
      • Sample a topic z from θd
      • Sample a word w from φz

[Plate diagram: α → θ → z → w ← φ ← β, with plates M (documents), Nm (words per document), K (topics)]

θd: topic distribution for document d
φz: word distribution for topic z
M: number of documents; Nm: number of words in document m; K: number of topics

14
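The generative process above can be sketched directly with NumPy; the dimensions and hyperparameter values here are toy choices for illustration, not those used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 8, 5        # topics, vocabulary size, documents (toy sizes)
alpha, beta = 0.5, 0.1   # symmetric Dirichlet hyperparameters

# Per-topic word distributions phi_z ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))   # per-document topic mixture
    n_words = rng.integers(10, 20)
    z = rng.choice(K, size=n_words, p=theta_d)   # a topic for each word position
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])  # a word from phi_z
    docs.append(w)
```

Each document is then just its observed words `w`; the topic assignments `z`, mixtures `theta_d`, and word distributions `phi` are the hidden structure that inference must recover.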

SLIDE 15

Topic-specific Influence Analysis on Microblogs

Intuition

① Each microblog user has tweet text and a set of followees
② A user tweets in multiple topics
  • Alice tweets about technology and food
③ A topic is a distribution over words
  • "web" and "cloud" are more likely to appear in the technology topic

[Example: Alice — content: web, organic, veggie, cookie, cloud, …; followees: Michelle Obama, Mark Zuckerberg, Barack Obama]

15

SLIDE 16

Topic-specific Influence Analysis on Microblogs

Intuition (cont'd)

④ A user follows another for different reasons
  • Content-based: follow for similar topics
  • Content-independent: follow for popularity
⑤ Topics of content-based followships differ from each other
  • Mark Zuckerberg is more likely to be followed for the technology topic
  • Topic-specific influence: the probability of a user being followed for a given topic

16

SLIDE 17

Followship-LDA (FLDA)

FLDA: A fully Bayesian generative model specifically designed for microblog networks
  • Specifies a stochastic process by which the content and links of each user are generated
  • Introduces hidden structure
      • Topics, reasons for a user following another, topic-specific influence, etc.
  • Inference: reverse the generative process
      • What hidden structure is most likely to have generated the observed data?

17

SLIDE 18

Hidden Structure in FLDA

Five pieces of hidden structure (with toy examples):

Per-user topic distribution (θ):
             Tech   Food   Poli
  Alice      0.8    0.19   0.01
  Bob        0.6    0.3    0.1
  Mark       …      …      …
  Michelle   …      …      …
  Barack     …      …      …

Per-topic word distribution (φ):
         web    cookie  veggie  organic  congress  …
  Tech   0.3    0.1     0.001   0.001    0.001     …
  Food   0.001  0.15    0.3     0.1      0.001     …
  Poli   0.005  0.001   0.001   0.002    0.25      …

Per-user followship preference (µ):
             Follow for Topic  Follow for Pop.
  Alice      0.75              0.25
  Bob        0.5               0.5
  Mark       …                 …
  Michelle   …                 …
  Barack     …                 …

Per-topic followee distribution (σ) — topic-specific influence:
         Alice  Bob   Mark  Michelle  Barack
  Tech   0.1    0.05  0.7   0.05      0.1
  Food   0.04   0.1   0.06  0.75      0.05
  Poli   0.03   0.02  0.05  0.1       0.8

Global followee distribution (π) — global popularity:
           Alice  Bob    Mark   Michelle  Barack
  Global   0.005  0.001  0.244  0.25      0.5

SLIDE 19

Graphical Model for FLDA

For the mth user:
  • Pick a topic distribution θm
  • Pick a followship preference µm
  • For the nth word position:
      • Pick a topic z from topic distribution θm
      • Pick a word w from word distribution φz
  • For the lth followee:
      • Choose the cause based on followship preference µm
      • If content-related:
          • Pick a topic z from topic distribution θm
          • Pick a followee from per-topic followee distribution σz
      • Otherwise:
          • Pick a followee from global followee distribution π

[Plate-notation figure, annotated with Alice's example: topics Tech 0.8 / Food 0.19 / Poli 0.01; follow-for-content preference 0.75 vs. 0.25; words "web", "organic", …; followees Michelle Obama, Barack Obama, …]

19
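A minimal sketch of the FLDA generative process above, with toy dimensions and, as an illustrative assumption, a Beta prior on each user's followship preference:

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, U = 3, 10, 4                  # topics, vocabulary size, users (toy sizes)
alpha, beta, gamma = 0.5, 0.1, 0.1  # toy Dirichlet hyperparameters

phi   = rng.dirichlet(np.full(V, beta), size=K)   # per-topic word distributions
sigma = rng.dirichlet(np.full(U, gamma), size=K)  # per-topic followee distributions
pi    = rng.dirichlet(np.full(U, gamma))          # global followee distribution

users = []
for m in range(U):
    theta_m = rng.dirichlet(np.full(K, alpha))    # per-user topic distribution
    mu_m = rng.beta(1.0, 1.0)                     # P(a followship is content-based)
    # words: pick a topic from theta_m, then a word from phi_z
    words = [int(rng.choice(V, p=phi[rng.choice(K, p=theta_m)])) for _ in range(15)]
    followees = []
    for _ in range(5):
        if rng.random() < mu_m:                   # content-based followship
            z = rng.choice(K, p=theta_m)
            followees.append(int(rng.choice(U, p=sigma[z])))
        else:                                     # content-independent (popularity)
            followees.append(int(rng.choice(U, p=pi)))
    users.append((words, followees))
```

Only the words and followee lists are observed; inference reverses this process to recover θ, µ, φ, σ, and π.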

SLIDE 20

Bayesian Learning for FLDA

Gibbs Sampling
  • A Markov chain Monte Carlo algorithm

1. Begin with some initial value for each latent variable
2. Iteratively sample each variable conditioned on the current values of the other variables
3. Use the samples to approximate the posterior distribution

20
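The three steps can be illustrated on a toy target that is not FLDA's posterior — a standard bivariate normal with correlation 0.8, whose conditional distributions are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                                 # correlation of the toy target

# Step 1: initial values for the latent variables
x, y = 0.0, 0.0
samples = []

for it in range(5000):
    # Step 2: sample each variable given the current value of the other;
    # for a standard bivariate normal, x | y ~ N(rho * y, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if it >= 1000:                        # discard burn-in
        samples.append((x, y))

# Step 3: the retained samples approximate the target distribution
xs, ys = np.array(samples).T
est = np.corrcoef(xs, ys)[0, 1]           # should be close to rho
```

For FLDA the conditionals are the derived distributions over topic and followship assignments on the next slide, but the loop structure is the same.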

SLIDE 21

Gibbs Sampler for FLDA

Derived conditionals for the FLDA Gibbs sampler:
  • Prob. of the topic of the nth word of the mth user, given the current values of all other variables
  • Prob. of the topic of the lth content-based followship of the mth user, given the current values of all other variables
  • Prob. that the lth followship of the mth user is independent of content, given the current values of all other variables

21

SLIDE 22

Gibbs Sampler for FLDA (cont’d)

In each pass over the data, for the mth user:
  • Sample latent variables from their respective conditionals
  • Keep counters while sampling:
      • c^m_{w,z}: # times word w is assigned to topic z for the mth user
      • d^m_{e,z}: # times followee e is assigned to topic z for the mth user

Estimate distributions for the hidden structure:
  • per-user topic distribution
  • per-topic followee distribution (influence)

22
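A sketch of the estimation step, under the usual LDA-style assumption that each distribution is a smoothed, normalized count (the exact estimators in the paper may differ):

```python
import numpy as np

# Toy counters after sampling (assumed shapes): c[m, w, z] and d[m, e, z]
rng = np.random.default_rng(3)
M, V, U, K = 4, 10, 4, 3
alpha, gamma = 0.5, 0.1
c = rng.integers(0, 5, size=(M, V, K))   # word-topic counts per user
d = rng.integers(0, 3, size=(M, U, K))   # followee-topic counts per user

# Per-user topic distribution: theta[m, z] proportional to topic counts + alpha
topic_counts = c.sum(axis=1) + d.sum(axis=1)        # (M, K)
smoothed = topic_counts + alpha
theta = smoothed / smoothed.sum(axis=1, keepdims=True)

# Per-topic followee distribution (influence): sigma[z, e] proportional to
# followee-topic counts aggregated over users + gamma
followee_counts = d.sum(axis=0).T                   # (K, U)
smoothed_f = followee_counts + gamma
sigma = smoothed_f / smoothed_f.sum(axis=1, keepdims=True)
```

The Dirichlet hyperparameters act as pseudo-counts, so topics and followees never get exactly zero probability.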

SLIDE 23

Distributed Gibbs Sampling for FLDA

Challenge: The Gibbs sampling process is sequential
  • Each sampling step relies on the most recent values of all other variables

Observation: The dependency between variable assignments is weak

Solution: Distributed Gibbs sampling for FLDA
  • Relax the sequential requirement of Gibbs sampling
  • Implemented on Spark

23
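The relaxation can be sketched as follows. This mirrors the common approximate-distributed-Gibbs pattern — partition users across workers, let each worker sample a full pass against a stale snapshot of the global counts, then merge the deltas once per pass — and is an illustration of the idea, not the actual Spark implementation:

```python
import numpy as np

def worker_pass(user_ids, global_counts, sample_fn):
    """One worker: sample its users against a stale snapshot, return the delta."""
    local = global_counts.copy()          # stale snapshot of global counts
    for u in user_ids:
        sample_fn(u, local)               # updates local counts only
    return local - global_counts

def distributed_pass(all_users, global_counts, sample_fn, n_workers=4):
    """One full pass: workers run (conceptually in parallel), then synchronize."""
    parts = np.array_split(np.asarray(all_users), n_workers)
    deltas = [worker_pass(p, global_counts, sample_fn) for p in parts]
    return global_counts + sum(deltas)    # merge once per pass

# toy usage: K=3 topic counts, with a dummy sampler that increments one topic
rng = np.random.default_rng(4)
def dummy_sample(u, c):
    c[rng.integers(3)] += 1

counts = distributed_pass(list(range(8)), np.zeros(3), dummy_sample)
```

Within a pass, workers see slightly stale counts — the "weak dependency" observation is what makes this approximation acceptable in practice.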

SLIDE 24

Search Framework for Topic-Specific Influencers

SKIT: a search framework for topic-specific key influencers
  • Input: free text
  • Output: ranked list of key influencers

① Derive the topics of interest from the input text
② Compute an influence score for each user
③ Sort users by influence score and return the top influencers

[Figure: SKIT search box — the query "healthy food" returns influencers such as Michelle Obama, Jamie Oliver, …]

24
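A hypothetical sketch of the three steps, assuming the influence score combines the query's topic weights with the per-topic followee distribution σ (the slides give the steps, not the formula); the σ values reuse the toy table from the FLDA hidden-structure example:

```python
import numpy as np

def skit_rank(query_topic_dist, sigma, user_names, top_n=3):
    # score(u) = sum_z p(z | query) * sigma[z, u]
    scores = query_topic_dist @ sigma          # (K,) @ (K, U) -> (U,)
    order = np.argsort(scores)[::-1][:top_n]   # sort users by influence score
    return [(user_names[i], float(scores[i])) for i in order]

users = ["Alice", "Bob", "Mark", "Michelle", "Barack"]
sigma = np.array([[0.10, 0.05, 0.70, 0.05, 0.10],   # Tech
                  [0.04, 0.10, 0.06, 0.75, 0.05],   # Food
                  [0.03, 0.02, 0.05, 0.10, 0.80]])  # Poli
query = np.array([0.05, 0.90, 0.05])   # step 1: "healthy food" ~ mostly Food
print(skit_rank(query, sigma, users, top_n=2))   # steps 2-3: score and sort
```

With a mostly-Food query, the user with the highest Food influence (Michelle) comes out on top.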

SLIDE 25

Experiments – Effectiveness on Twitter Data

Twitter dataset crawled in 2010
  • 1.76M users, 2.36B words, 183M links, 159K distinct words

Interesting findings
  • Overall, ~15% of followships are content-independent
  • Top-5 globally popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen DeGeneres (comedian)

"Information Technology"
  Top-10 keywords: data, web, cloud, software, open, windows, Microsoft, server, security, code
  Top-5 influencers: Tim O'Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger)

"Jobs"
  Top-10 keywords: business, job, jobs, management, manager, sales, services, company, service, small, hiring
  Top-5 influencers: Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com

"Cycling and Running"
  Top-10 keywords: bike, ride, race, training, running, miles, team, workout, marathon, fitness
  Top-5 influencers: Lance Armstrong, Levi Leipheimer, George Hincapie (all 3 US Postal pro cycling team members), Johan Bruyneel (US Postal team director), RSLT (a pro cycling team)

25

SLIDE 26

Experiments – Effectiveness on Weibo Data

Tencent Weibo dataset from KDD Cup 2012
  • 2.33M users, 492M words, 51M links, 714K distinct words
  • VIP users:
      • Organized in categories
      • Manually labeled "key influencers" in the corresponding categories
      • Used as "ground truth"

26

SLIDE 27

Efficiency of Distributed FLDA

Setup
  • Sequential: a high-end server (192 GB RAM, 3.2 GHz)
  • Distributed: 27 servers (32 GB RAM, 2.5 GHz, 8 cores) connected by 1 Gbit Ethernet; 1 master and 200 workers

Dataset
  • Tencent Weibo: 2.33M users, 492M words, 51M links
  • Twitter: 1.76M users, 2.36B words, 183M links

Sequential vs. distributed running time:
            Sequential  Distributed
  Weibo     4.6 days    8 hours
  Twitter   21 days     1.5 days

27

SLIDE 28

Conclusion for FLDA work

FLDA model for topic-specific influence analysis
  • Content + network
  • Differentiates the various reasons why one user follows another

Distributed Gibbs sampling algorithm for FLDA
  • Results comparable to the sequential Gibbs sampling algorithm
  • Scales nicely with data size

Significantly higher-quality results than existing work

28

SLIDE 29

Authority Analysis on Content Sharing Websites

Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services

Bin Bi, Ben Kao, Chang Wan, Junghoo Cho [KDD 2014]

[Figure: example photos — "Who's a master of landscape photography? Who's an expert in shooting city lights?"]

29

SLIDE 30

Topic-specific Authority Analysis on Content Sharing Websites

Resource

  • Video
  • Photo
  • Travel blog
  • ……

How to find a topic- specific authority who creates high- quality resources?

30

SLIDE 31

Why Topic-specific?

[Figure: a city-lights expert vs. a landscape expert]

  • No one is authoritative on every topic
  • Users have different topical needs ("Who's a master of sunset photography? Who's an expert in portrait photography?")

31

SLIDE 32

LDA-based Naïve Solution

Adapt LDA to data in the sharing log
  • User → document
  • Tag → word

Two implications:
  • A user is interested in topic T ⇔ she frequently posts resources with tags specific to T
  • The more frequently a user uses tags covering T ⇔ the more authoritative she should be on T

[Diagram: Sharing Log → LDA]

32

SLIDE 33

Like Log

  • A supplementary source of information about content quality is needed
  • The like log (like clicks / votes) provides a valuable signal
  • Challenge: users don't specify the topical causes behind their like clicks
33

SLIDE 34

Topic-specific Authority Analysis (TAA) Model

Jointly model topical interest and topical authority:

[Diagram: Sharing Log + Like Log → TAA]
34

SLIDE 35

Intuition for Characterizing Authoritativeness

Users' authority
  • Differs from user to user

Each user's authority
  • Is specific to individual topics

Introduce η_u to characterize topical authority
  • A K-dimensional random vector over topics: η_u = (η_u1, η_u2, …, η_uK)
  • Specific to individual user u

35

SLIDE 36

Intuition for Characterizing Like Clicks

Introduce f_ur to represent like feedback
  • A binary random variable, specific to user u and resource r:
        f_ur = 1 if u favorited r, 0 otherwise

u likes r ⇒ the topical authority exhibited by r matches u's topical interest
36

SLIDE 37

Intuition for Characterizing Like Clicks (cont’d)

Identify the hidden topical causes behind like clicks
  • Designed a model for topic discovery

With the topics, we specify the likelihood.

[Figure: example photo — San Francisco? City lights?]
37

SLIDE 38

Intuition behind Topic Discovery for Like Clicks

Logistic likelihood function: captures the similarity between the topic distributions of resource r and user u's interest, weighted by the poster's topical authority.

f_ur = 1 indicates that poster u' should be an expert in the topics prominent in both u's interest and resource r, so the weights of these topics are boosted.

38
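A sketch of one plausible form of this likelihood — assuming p(f_ur = 1) is a sigmoid of the interest/resource topic overlap weighted by the poster's authority; the exact parameterization in the paper may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_like(theta_u, theta_r, eta_poster):
    # Assumed form: p(f_ur = 1) = sigmoid(sum_z theta_u[z] * theta_r[z] * eta[z]);
    # topics prominent in BOTH u's interest and resource r are weighted by the
    # poster's authority eta on those topics.
    return sigmoid(np.dot(theta_u * theta_r, eta_poster))

theta_u = np.array([0.7, 0.2, 0.1])    # u's interest (e.g., mostly city lights)
theta_r = np.array([0.8, 0.1, 0.1])    # resource's topic distribution
eta_hi  = np.array([3.0, 0.0, 0.0])    # poster authoritative on topic 0
eta_lo  = np.array([0.0, 0.0, 3.0])    # poster authoritative elsewhere
# a topically matching, authoritative poster yields a higher like probability
assert p_like(theta_u, theta_r, eta_hi) > p_like(theta_u, theta_r, eta_lo)
```

Observing f_ur = 1 therefore pushes the posterior weight of the poster's authority up on exactly the topics shared by u's interest and the resource.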

SLIDE 39

Topic-specific Authority Analysis (TAA): Graphical Model and Generative Process

For user u:
  • Pick a topic distribution θu
  • Pick an authority vector ηu from MVN(µ, Σ)
  • For the nth tag position:
      • Pick a topic z from topic distribution θu
      • Pick a tag t from word distribution φz

For each (u, r) pair:
  • Pick a like response fur from Bernoulli(logistic likelihood)

[Plate diagram with plates for users and for posters with resources]

39
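The TAA generative process can be sketched with toy dimensions; as simplifying assumptions, each "resource" is identified with its poster, and the Bernoulli parameter takes a logistic combination of u's interest, the poster's topics, and the poster's authority:

```python
import numpy as np

rng = np.random.default_rng(5)

K, V, U = 3, 12, 6                   # topics, tag vocabulary, users (toy sizes)
mu, Sigma = np.zeros(K), np.eye(K)   # MVN prior on authority vectors

phi   = rng.dirichlet(np.full(V, 0.1), size=K)      # per-topic tag distributions
theta = rng.dirichlet(np.full(K, 0.5), size=U)      # per-user topic distributions
eta   = rng.multivariate_normal(mu, Sigma, size=U)  # per-user authority vectors

# tags: for each tag position, pick a topic from theta_u, then a tag from phi_z
tags = [[int(rng.choice(V, p=phi[rng.choice(K, p=theta[u])])) for _ in range(10)]
        for u in range(U)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# like responses f_ur ~ Bernoulli(logistic likelihood), here assumed to combine
# u's interest, poster r's topics, and poster r's authority eta[r]
f = {(u, r): bool(rng.random() < sigmoid(np.dot(theta[u] * theta[r], eta[r])))
     for u in range(U) for r in range(U) if u != r}
```

The sharing log (tags) and the like log (f) are the observations; θ, η, and φ are the hidden quantities inference must recover.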

SLIDE 40

Quantities of Interest

  • η_u: quantifies user u's unique authoritativeness over topics
  • θ_u: characterizes user u's topical interest
  • φ_t: indicates the probabilities of tag t belonging to individual topics

40

SLIDE 41

Bayesian Learning for TAA

Bayesian learning for a model with a logistic likelihood has long been recognized as a hard problem [Gramacy et al., 2012]

We extend recent work [Polson et al., 2013] for inference in TAA
  • Introduce Polya-Gamma variables into the posterior distribution

Derived conditionals for Gibbs sampling: for η, z, and δ

41

SLIDE 42

Experiments

Data collections:
           #users   #photos  #tag assignments  #like clicks
  Flickr   21,054   204,335  3,014,813         1,562,805
  500px    33,581   318,906  3,520,179         1,837,049

Metrics for effectiveness:
  • MRR (Mean Reciprocal Rank)
  • Spearman's rank correlation coefficient

[Bar charts: MRR (0.05–0.4) and Spearman's rank correlation coefficient (0.00–0.35) for Most-tagged, LDA, Most-favorited, TwitterRank, Link-PLSA-LDA, and TAA on Flickr and 500px]

42
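Both metrics are straightforward to compute; a self-contained sketch on toy data (the Spearman helper assumes no tied values):

```python
import numpy as np

def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: mean of 1/rank of the first relevant item in each ranked list."""
    rr = []
    for ranking, rel in zip(ranked_lists, relevant):
        hits = [i + 1 for i, item in enumerate(ranking) if item in rel]
        rr.append(1.0 / hits[0] if hits else 0.0)
    return float(np.mean(rr))

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks (no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

rankings = [["a", "b", "c"], ["b", "a", "c"]]
relevant = [{"b"}, {"b"}]
print(mean_reciprocal_rank(rankings, relevant))   # (1/2 + 1/1) / 2 = 0.75

pred = [0.9, 0.7, 0.3, 0.1]      # predicted authority scores
truth = [4, 3, 2, 1]             # ground-truth ordering
print(spearman(pred, truth))     # identical orderings -> 1.0
```

MRR rewards placing the first labeled key influencer near the top; Spearman measures agreement of the full predicted ordering with the ground truth.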

SLIDE 43

Predictive Power

  • Perplexity metric (lower is better)

[Line charts: perplexity vs. number of topics (20–100) for Link-PLSA-LDA and TAA, on Flickr (3,000–21,000) and 500px (3,000–27,000)]

43
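A sketch of the perplexity computation for held-out documents under an LDA-style model (lower perplexity = stronger predictive power); the toy numbers are illustrative only:

```python
import numpy as np

def perplexity(docs, thetas, phi):
    # perplexity = exp(-(1/N) * sum over held-out words of log p(w)),
    # with p(w) = sum_z theta_d[z] * phi[z, w]
    log_lik, n_words = 0.0, 0
    for words, theta_d in zip(docs, thetas):
        for w in words:
            log_lik += np.log(theta_d @ phi[:, w])
        n_words += len(words)
    return float(np.exp(-log_lik / n_words))

phi = np.array([[0.7, 0.2, 0.1],     # topic 0 word distribution
                [0.1, 0.2, 0.7]])    # topic 1 word distribution
docs = [[0, 0, 1], [2, 2, 1]]        # held-out documents (word ids)
thetas = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
print(perplexity(docs, thetas, phi))  # lower = better fit to held-out data
```

A model whose topic mixtures match the held-out documents assigns them higher probability and therefore lower perplexity.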

SLIDE 44

Case visualization

44

SLIDE 45

Conclusion for TAA work

Proposed the novel TAA model for topic-specific authority analysis on content sharing websites
  • Leverages both the sharing log and the like log

Proposed a new inference algorithm to learn the parameters of TAA
  • Gibbs sampling with data augmentation

Higher-quality results and stronger predictive power than the state of the art

45

SLIDE 46

Future Work

Bayesian modeling can be used in a wide range of applications:

Recommendation
  • Extending FLDA and TAA to recommender systems in various domains

Feature Extraction
  • Including posterior distributions of latent topics as features for a learning model

User Modeling
  • Dealing with data sparsity by borrowing information across users

46

SLIDE 47

References

  • Bin Bi and Junghoo Cho. Modeling a Retweet Network via an Adaptive Bayesian Approach. Proceedings of the 25th International World Wide Web Conference (WWW), 2016.
  • Bin Bi, Hao Ma, Paul Hsu, Wei Chu, Kuansan Wang, and Junghoo Cho. Learning to Recommend Related Entities to Search Users. Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015.
  • Youngchul Cha, Keng-hao Chang, Hari Bommaganti, Ye Chen, Tak Yan, Bin Bi, and Junghoo Cho. A Universal Topic Framework (UniZ) and Its Application in Online Search. Proceedings of the 30th ACM SIGAPP Symposium On Applied Computing (SAC), 2015.
  • Bin Bi, Ben Kao, Chang Wan, and Junghoo Cho. Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2014.
  • Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, and Junghoo Cho. Scalable Topic-Specific Influence Analysis on Microblogs. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 2014.

47

SLIDE 48

References

  • Bin Bi, Milad Shokouhi, Michal Kosinski, and Thore Graepel. Inferring the Demographics of Search Users: Social Data Meets Search Queries. Proceedings of the 22nd International World Wide Web Conference (WWW), 2013.
  • Bin Bi and Junghoo Cho. Automatically Generating Descriptions for Resources by Tag Modeling. Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), 2013.
  • Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, and Junghoo Cho. Incorporating Popularity in Topic Models for Social Network Analysis. Proceedings of the 36th International ACM SIGIR Conference (SIGIR), 2013.
  • Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, and Eric Lo. DQR: A Probabilistic Approach to Diversified Query Recommendation. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012.
  • Bin Bi, Sau Dan Lee, Ben Kao, and Reynold Cheng. CubeLSI: An Effective and Efficient Method for Searching Resources in Social Tagging Systems. Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE), 2011.
  • Bin Bi, Lifeng Shang, and Ben Kao. Collaborative Resource Discovery in Social Tagging Systems. Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM), 2009.

48

SLIDE 49

Thank you!