Bayesian Modeling for Analyzing Online Content and Users
Bin Bi
Computer Science Department, University of California, Los Angeles
bbi@cs.ucla.edu
Online Content Explosion
By Domo
Information Overload
The sheer amount of online content is both a curse and a blessing.
Curse:
- Confusion
- Sub-optimal decisions
- Dissatisfaction
Blessing:
- Investigating topics of interest
- Checking facts
- Getting advice about problems
Goal: Learning to discover high-quality information
q Two schemes
1. Discern good content from bad
2. Identify users who generate high-quality content
q Two domains
1. Social media
2. Search engine
Influencer Discovery on Microblogs
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
Scalable Topic-specific Influence Analysis on Microblogs
Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, Junghoo Cho [WSDM 2014]
Motivation
q Huge amount of textual and social information produced by popular microblogging sites
- As of March 2012, Twitter had over 500 million users creating over 340 million tweets daily
q A popular resource for marketing campaigns
- Monitor opinions of consumers
- Launch viral advertising
q Identifying social influencers is crucial for market analysis
Existing Social Influence Analysis
q Most existing influence analyses are based only on network structures
- e.g., influence maximization work [Kleinberg et al., KDD '03]
- Valuable textual content is ignored
- Only one global influence score is computed for each user
q But we need
- Topic-specific influence based on both network and content
- To differentiate influence across different aspects of life (topics)
Existing Topic-Specific Influence Analysis
q Separate analysis of content and networks
- e.g. Topic-sensitive PageRank (TSPR), TwitterRank
Pipeline: text analysis on content → topic collection → influence analysis on network
Problem: Content and links are often correlated in microblog networks
- A user tends to follow another who tweets in similar topics
Solution Overview
q Goal: Identify topic-specific key influencers on microblogs
q Leverage both content and network
q Tight coupling of content and network analysis
q Followship-LDA (FLDA) Model
- Our topic-specific influence model for microblog networks
- Solid probabilistic foundation in Bayesian modeling
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
Review of Typical LDA
q Latent Dirichlet Allocation (LDA) [Blei et al., JMLR '03] is a generative topic model for latent topic discovery in a text corpus
q Intuition:
from Blei’s slides
Review of Typical LDA (cont’d)
- Each topic is a distribution over words
- Each document is a distribution over topics
- Each word is drawn from one of those topics
- Per-topic word distribution: ϕ
- Per-document topic distribution: θ
- Topic assignments: z
Review of Typical LDA (cont’d)
- In reality, we only observe the documents
- All other structure consists of hidden variables
- Per-topic word distribution: ϕ
- Per-document topic distribution: θ
- Topic assignments: z
Statistical Modeling
q Generative probabilistic modeling
- Treats data as observations
- Contains hidden variables
- Specifies a probabilistic procedure by which the observations are generated
q Inference
- Input: observed data
- Output: values of hidden variables
Graphical Model for LDA
Generative Process for LDA
q For each document d
§ Sample θd from Dirichlet(α)
§ For each word position in d
- Sample a topic z from θd
- Sample a word w from φz
θd: Topic distribution for document d
φz: Word distribution for topic z
(Plate notation: nodes w, z, θ, φ; hyperparameters α, β; plate sizes K, M, Nm)
M: Number of documents; Nm: Number of words in document m; K: Number of topics
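The generative process above can be sketched directly in code. This is a minimal illustration with made-up corpus sizes and hyperparameters, not the implementation behind the results in this deck:

```python
import numpy as np

def generate_lda_corpus(M=100, K=5, V=1000, alpha=0.5, beta=0.1, doc_len=50, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)        # per-topic word distributions phi_z
    docs, thetas = [], []
    for _ in range(M):
        theta = rng.dirichlet([alpha] * K)         # per-document topic distribution theta_d
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)             # sample a topic z from theta_d
            words.append(rng.choice(V, p=phi[z]))  # sample a word w from phi_z
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi
```

Inference is the reverse direction: given only `docs`, recover `thetas` and `phi`.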
Topic-specific Influence Analysis on Microblogs
q Intuition
① Each microblog user has tweet text and a set of followees
② A user tweets in multiple topics
- Alice tweets about technology and food
③ A topic is a distribution over words
- web and cloud are more likely to appear in the technology topic
Example — Alice:
- Content: web, organic, veggie, cookie, cloud, ……
- Followees: Michelle Obama, Mark Zuckerberg, Barack Obama
Topic-specific Influence Analysis on Microblogs
q Intuition (cont’d)
④ A user follows another for different reasons
- Content-based: follow for similar topics
- Content-independent: follow for popularity
⑤ Topics of content-based followships differ from each other
- Mark Zuckerberg is more likely to be followed for the technology topic
- Topic-specific influence: the probability of a user being followed for a given topic
Followship-LDA (FLDA)
q FLDA: A fully-Bayesian generative model specifically designed for microblog networks
- Specifies a stochastic process by which the content and links of each user are generated
- Introduces hidden structure
- Topics, reasons of a user following another, topic-specific influence, etc.
- Inference: reverse the generative process
- What hidden structure is most likely to have generated the observed data?
Hidden Structure in FLDA

q Per-user topic distribution (θ): User → Topic
          Tech   Food   Poli
Alice     0.8    0.19   0.01
Bob       0.6    0.3    0.1
Mark      …      …      …
Michelle  …      …      …
Barack    …      …      …

q Per-topic word distribution (ϕ): Topic → Word
      web    cookie  veggie  organic  congress  …
Tech  0.3    0.1     0.001   0.001    0.001     …
Food  0.001  0.15    0.3     0.1      0.001     …
Poli  0.005  0.001   0.001   0.002    0.25      …

q Per-user followship preference (µ): User → Preference
          Follow for Topic  Follow for Pop.
Alice     0.75              0.25
Bob       0.5               0.5
Mark      …                 …
Michelle  …                 …
Barack    …                 …

q Per-topic followee distribution (σ): topic-specific influence, Topic → User
      Alice  Bob   Mark  Michelle  Barack
Tech  0.1    0.05  0.7   0.05      0.1
Food  0.04   0.1   0.06  0.75      0.05
Poli  0.03   0.02  0.05  0.1       0.8

q Global followee distribution (π): global popularity
        Alice  Bob    Mark   Michelle  Barack
Global  0.005  0.001  0.244  0.25      0.5
Graphical Model for FLDA

Generative process — for the mth user:
q Pick a topic distribution θm
q Pick a followship preference µm
q For the nth word position
- Pick a topic z from topic distribution θm
- Pick a word w from word distribution φz
q For the lth followee
- Choose the cause based on followship preference µm
- If content-related
§ Pick a topic z from topic distribution θm
§ Pick a followee from per-topic followee distribution σz
- Otherwise
§ Pick a followee from global followee distribution π

(Plate-notation example — Alice: topic distribution Tech: 0.8, Food: 0.19, Poli: 0.01; followship preference — follow for content: 0.75, not: 0.25; content: web, organic, …; followees: Michelle Obama, Barack Obama, …)
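The FLDA generative process for a single user can be sketched as follows. This is a minimal illustration using the slide's notation (θm, µm, φz, σz, π), with all distributions assumed given:

```python
import numpy as np

def generate_flda_user(theta_m, mu_m, phi, sigma, pi, n_words, n_followees, rng):
    """Generate one user's words and followships under the FLDA process.

    theta_m: user's topic distribution; mu_m: P(content-based followship);
    phi[z]: word distribution per topic; sigma[z]: followee distribution
    per topic; pi: global (popularity-based) followee distribution.
    """
    K = len(theta_m)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta_m)                       # topic for this word position
        words.append(rng.choice(len(phi[z]), p=phi[z]))    # word from phi_z
    followees = []
    for _ in range(n_followees):
        if rng.random() < mu_m:                            # content-based followship
            z = rng.choice(K, p=theta_m)
            followees.append(rng.choice(len(sigma[z]), p=sigma[z]))
        else:                                              # popularity-based followship
            followees.append(rng.choice(len(pi), p=pi))
    return words, followees
```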
Bayesian Learning for FLDA
q Gibbs Sampling
- A Markov chain Monte Carlo algorithm
1. Begin with some initial value for each latent variable
2. Iteratively sample each variable conditioned on the current values of the other variables
3. Use the samples to approximate the posterior distribution
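As a concrete instance of these steps, here is a minimal collapsed Gibbs sampler for plain LDA (illustrative only; FLDA's sampler additionally handles followship variables):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each word's topic
    conditioned on the current assignments of all other words."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    ndz = np.zeros((M, K))   # topic counts per document
    nzw = np.zeros((K, V))   # word counts per topic
    nz = np.zeros(K)         # total words per topic
    z_asg = [rng.integers(K, size=len(d)) for d in docs]   # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = z_asg[d][i]
            ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_asg[d][i]
                ndz[d, z] -= 1; nzw[z, w] -= 1; nz[z] -= 1   # remove current assignment
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + V * beta)
                z = rng.choice(K, p=p / p.sum())             # resample conditioned on the rest
                z_asg[d][i] = z
                ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    return ndz, nzw
```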
Gibbs Sampler for FLDA
q Derived conditionals for the FLDA Gibbs sampler
- Prob. of the topic of the nth word of the mth user, given the current values of all other variables
- Prob. of the topic of the lth content-based followship of the mth user, given the current values of all other variables
- Prob. that the lth followship of the mth user is independent of content, given the current values of all other variables
Gibbs Sampler for FLDA (cont’d)
q In each pass over the data, for the mth user
- Sample latent variables from their respective conditionals
- Keep counters while sampling
  - c^m_{w,z}: # times word w is assigned to topic z for the mth user
  - d^m_{e,z}: # times followee e is assigned to topic z for the mth user
q Estimate the distributions of the hidden structure from the counters
- Per-user topic distribution
- Per-topic followee distribution (influence)
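The final estimation step amounts to smoothed normalization of the counters. This is an illustrative sketch; the exact estimators are in the paper:

```python
import numpy as np

def estimate_theta(ndz, alpha):
    """Smoothed per-user topic distribution from topic-assignment counts:
    theta[m, z] is proportional to count(m, z) + alpha."""
    K = ndz.shape[1]
    return (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + K * alpha)

def estimate_sigma(dez, beta):
    """Smoothed per-topic followee distribution (topic-specific influence):
    sigma[z, e] is proportional to count(z, e) + beta, where dez[z, e]
    counts how often followee e is assigned to topic z."""
    E = dez.shape[1]
    return (dez + beta) / (dez.sum(axis=1, keepdims=True) + E * beta)
```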
Distributed Gibbs Sampling for FLDA
q Challenge: the Gibbs sampling process is inherently sequential
- Each sample step relies on the most recent values of all other variables.
q Observation: dependency between variable assignments is weak
q Solution: Distributed Gibbs Sampling for FLDA
- Relax the sequential requirement of Gibbs Sampling
- Implemented on Spark
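One way to picture the relaxed scheme (this is not the paper's Spark implementation): each worker samples its partition of users against a stale snapshot of the global counters, and the master merges the resulting count deltas at the end of the round. `sample_partition` is a hypothetical callback standing in for the local sampling step:

```python
import numpy as np

def distributed_gibbs_round(partitions, global_counts, sample_partition):
    """One round of relaxed (approximate) distributed Gibbs sampling.

    `sample_partition(part, local)` mutates its local counter copy while
    sampling that partition and returns the delta it produced.
    """
    snapshot = global_counts.copy()
    deltas = []
    for part in partitions:        # on Spark, these iterations run as parallel tasks
        local = snapshot.copy()    # stale copy: workers don't see each other's updates
        deltas.append(sample_partition(part, local))
    for delta in deltas:           # master merges all deltas at a synchronization barrier
        global_counts += delta
    return global_counts
```

The relaxation trades exactness of each conditional for parallelism, which the weak-dependency observation above justifies.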
Search Framework for Topic-Specific Influencers
q SKIT: search framework for topic-specific key influencers
- Input: free text
- Output: ranked list of key influencers
Query "healthy food" → SKIT → Michelle Obama, Jamie Oliver, …
⓵ Derive the topics of interest from the input text
⓶ Compute an influence score for each user
⓷ Sort users by influence score and return the top influencers
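Steps ⓶ and ⓷ can be sketched as follows, assuming step ⓵ has already mapped the query to a topic distribution and `sigma` holds FLDA's per-topic followee (influence) distributions:

```python
import numpy as np

def skit_rank(query_topic_dist, sigma, user_names, top_n=5):
    """Rank influencers for a query: each user's score is their
    topic-specific influence weighted by the query's topic weights;
    sigma[z, u] is the per-topic followee distribution from FLDA."""
    scores = query_topic_dist @ sigma     # expected influence under the query's topics
    order = np.argsort(-scores)           # sort users by score, descending
    return [(user_names[u], float(scores[u])) for u in order[:top_n]]
```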
Experiments – Effectiveness on Twitter Data
q Twitter Dataset crawled in 2010
- 1.76M users, 2.36B words, 183M links, 159K distinct words
q Interesting findings
- Overall ~ 15% of followships are content-independent
- Top-5 globally popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen DeGeneres (comedian)
Topic: "Information Technology"
- Top-10 keywords: data, web, cloud, software, open, windows, Microsoft, server, security, code
- Top-5 influencers: Tim O'Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger)

Topic: "Jobs"
- Top-10 keywords: business, job, jobs, management, manager, sales, services, company, service, small, hiring
- Top-5 influencers: Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com

Topic: "Cycling and Running"
- Top-10 keywords: bike, ride, race, training, running, miles, team, workout, marathon, fitness
- Top-5 influencers: Lance Armstrong, Levi Leipheimer, George Hincapie (all three US Postal pro cycling team members), Johan Bruyneel (US Postal team director), RSLT (a pro cycling team)
Experiments – Effectiveness on Weibo Data
q Tencent Weibo Dataset from KDD Cup 2012
- 2.33M users, 492M words, 51M links, 714K distinct words
- VIP users
- Organized in categories
- Manually labeled “key influencers” in the corresponding categories
- Used as “ground truth”
Efficiency of Distributed FLDA
q Setup
- Sequential: a high-end server (192 GB RAM, 3.2 GHz)
- Distributed: 27 servers (32 GB RAM, 2.5 GHz, 8 cores each) connected by 1 Gbit Ethernet; 1 master and 200 workers
q Dataset
- Tencent Weibo: 2.33M users, 492M words, 51M links
- Twitter: 1.76M users, 2.36B words, 183M links
q Sequential vs. Distributed runtime
          Sequential  Distributed
Weibo     4.6 days    8 hours
Twitter   21 days     1.5 days
Conclusion for FLDA work
q FLDA model for topic-specific influence analysis
- Content + Network
- Differentiates various reasons why one follows another
q Distributed Gibbs sampling algorithm for FLDA
- Results comparable to those of the sequential Gibbs sampling algorithm
- Scales nicely with data sizes
q Significantly higher quality results than existing work
Authority Analysis on Content Sharing Websites
Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services
Bin Bi, Ben Kao, Chang Wan, Junghoo Cho [KDD 2014]
Who’s a master of landscape photography? Who’s an expert in shooting city lights?
Topic-specific Authority Analysis on Content Sharing Websites
Resources
- Video
- Photo
- Travel blog
- ……
How do we find a topic-specific authority who creates high-quality resources?
Why Topic-specific?
(Examples: a city-lights expert vs. a landscape expert)
No one is authoritative on every topic, and users have different topical needs.
- Who's a master of sunset photography?
- Who's an expert in portrait photography?
LDA-based Naïve Solution
q Adapt LDA to the data in the sharing log
- User → Document
- Tag → Word
q Two implications:
- A user is interested in topic T ⇔ she frequently posts resources with tags specific to T
- The more frequently a user uses tags covering T, the more authoritative she should be on T
(Sharing log → LDA)
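The adaptation itself is just a data-reshaping step; a minimal sketch:

```python
from collections import defaultdict

def sharing_log_to_corpus(sharing_log):
    """Map the sharing log onto LDA's inputs: each user becomes a
    'document' and each tag occurrence becomes a 'word', so per-document
    topic distributions become per-user topical interests."""
    docs = defaultdict(list)
    for user, resource, tags in sharing_log:
        docs[user].extend(tags)   # all tags a user posts form one document
    return dict(docs)
```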
Like Log
q A supplementary source of signal about content quality is needed
q The like log (like clicks / votes) provides a valuable signal
q Challenge: users don't specify the topical causes behind their like clicks
Topic-specific Authority Analysis (TAA) Model
q Jointly model topical interest and topical authority:
(Sharing log + like log → TAA)
Intuition for Characterizing Authoritativeness
q Users’ authority
- Different from each other
q Each user’s authority
- Specific to individual topics
q Introduce ηu to characterize topical authority
- A K-dimensional random vector over topics: ηu = (ηu1, ηu2, …, ηuK)
- Specific to individual user u
Intuition for Characterizing Like Clicks
q Introduce fur to represent like feedback
- A binary random variable
- Specific to user u and resource r
q u likes r ⇒ the topical authority exhibited by r matches u's topical interest

fur = 1 if u favorited r, and fur = 0 otherwise
Intuition for Characterizing Like Clicks (cont’d)
q Identify hidden topical causes behind like clicks
- Designed a model for topic discovery
q With the topics, we specify the likelihood of a like click
(Example photo: did the user like it for "San Francisco" or for "City Lights"?)
Intuition behind Topic Discovery for Like Clicks
q Logistic likelihood function: captures the similarity between the topic distributions of resource r and user u's interest, weighted by poster u''s topical authority
q fur = 1 indicates that poster u' should be an expert in the topics prominent in both u's interest and resource r, so the weights of these topics have to be boosted
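A sketch of this idea (the exact likelihood is in the paper; this follows the slide's description literally):

```python
import numpy as np

def like_probability(theta_u, theta_r, eta_poster):
    """Sketch of the TAA like likelihood: the probability that user u
    likes resource r grows with the topic-wise match between u's interest
    (theta_u) and r's topic distribution (theta_r), weighted by the
    poster's topical authority (eta_poster)."""
    score = np.sum(theta_u * theta_r * eta_poster)   # topic-wise weighted match
    return 1.0 / (1.0 + np.exp(-score))              # logistic link
```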
Topic-specific Authority Analysis (TAA)
Graphical Model / Generative Process (users and posters with resources)

For user u:
q Pick a topic distribution θu
q Pick an authority vector ηu from MVN(µ, Σ)
q For the nth tag position
- Pick a topic z from topic distribution θu
- Pick a tag t from word distribution φz
For each (u, r) pair:
q Pick a like response fur from Bernoulli(logistic likelihood)
Quantities of Interest
q ηu quantifies user u's unique authoritativeness over topics
q θu characterizes user u's topical interest
q ϕt indicates the probabilities of tag t belonging to individual topics
Bayesian Learning for TAA
q Bayesian learning for a model with a logistic likelihood has long been recognized as a hard problem [Gramacy et al. 2012]
q We extend recent work [Polson et al. 2013] for inference of TAA
- Introduce Pólya-Gamma variables into the posterior distribution
q Derived conditionals for Gibbs sampling: for η, for z, and for the Pólya-Gamma variables δ
Experiments
q Data collections

Data    #users  #photos  #tag asgmts  #like clicks
Flickr  21,054  204,335  3,014,813    1,562,805
500px   33,581  318,906  3,520,179    1,837,049

q Metrics for effectiveness
- MRR (Mean Reciprocal Rank)
- Spearman's rank correlation coefficient
(Bar charts: MRR and Spearman's rank correlation coefficient on Flickr and 500px for Most-tagged, LDA, Most-favorited, TwitterRank, Link-PLSA-LDA, and TAA; TAA scores highest on both metrics and both datasets.)
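For reference, MRR can be computed as follows (an illustrative helper, not the paper's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: average over queries of 1/rank of the first relevant item
    in the ranked list (0 when no relevant item appears)."""
    total = 0.0
    for query, ranking in ranked_lists.items():
        rr = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[query]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)
```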
Predictive Power
(Line charts: held-out perplexity vs. number of topics (20–100) for Link-PLSA-LDA and TAA on Flickr and 500px; TAA achieves lower perplexity.)
q Perplexity metric
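Perplexity is the standard held-out measure exp(−log-likelihood per token); lower perplexity means stronger predictive power:

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity of held-out data: exp of the negative per-token
    log-likelihood. Lower is better."""
    return math.exp(-log_likelihood / n_tokens)
```

For example, a model assigning every token probability 0.25 has perplexity 4.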
Case visualization
Conclusion for TAA work
q Propose a novel TAA model for topic-specific authority analysis on content sharing websites
- Leverage both sharing log and like log
q Propose a new inference algorithm to learn the parameters in TAA
- Gibbs sampling with data augmentation
q Higher quality results and stronger predictive power than the state of the art
Future Work
q Bayesian modeling can be used in a wide range of applications
q Recommendation
- Extending FLDA and TAA to recommender systems in various domains
q Feature Extraction
- Including posterior distributions of latent topics as features for a learning model
q User Modeling
- Dealing with data sparsity by borrowing information across users
References
q Bin Bi and Junghoo Cho. Modeling a Retweet Network via an Adaptive Bayesian Approach. Proceedings of the 25th International World Wide Web Conference (WWW), 2016.
q Bin Bi, Hao Ma, Paul Hsu, Wei Chu, Kuansan Wang, and Junghoo Cho. Learning to Recommend Related Entities to Search Users. Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015.
q Youngchul Cha, Keng-hao Chang, Hari Bommaganti, Ye Chen, Tak Yan, Bin Bi, and Junghoo Cho. A Universal Topic Framework (UniZ) and Its Application in Online Search. Proceedings of the 30th ACM SIGAPP Symposium On Applied Computing (SAC), 2015.
q Bin Bi, Ben Kao, Chang Wan, and Junghoo Cho. Who Are Experts Specializing in Landscape Photography? Analyzing Topic-specific Authority on Content Sharing Services. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2014.
q Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, and Junghoo Cho. Scalable Topic-Specific Influence Analysis on Microblogs. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 2014.
References
q Bin Bi, Milad Shokouhi, Michal Kosinski, and Thore Graepel. Inferring the Demographics of Search Users: Social Data Meets Search Queries. Proceedings of the 22nd International World Wide Web Conference (WWW), 2013.
q Bin Bi and Junghoo Cho. Automatically Generating Descriptions for Resources by Tag Modeling. Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), 2013.
q Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, and Junghoo Cho. Incorporating Popularity in Topic Models for Social Network Analysis. Proceedings of the 36th International ACM SIGIR Conference (SIGIR), 2013.
q Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, and Eric Lo. DQR: A Probabilistic Approach to Diversified Query Recommendation. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012.
q Bin Bi, Sau Dan Lee, Ben Kao, and Reynold Cheng. CubeLSI: An Effective and Efficient Method for Searching Resources in Social Tagging Systems. Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE), 2011.
q Bin Bi, Lifeng Shang, and Ben Kao. Collaborative Resource Discovery in Social Tagging Systems. Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM), 2009.