

slide-1
SLIDE 1

Discovering the multifaceted information hidden within large user-generated text streams

Daniel Preotiuc-Pietro

daniel@dcs.shef.ac.uk 23.04.2014

slide-2
SLIDE 2

Context

  • vast increase in user-generated content
  • Online Social Networks: most time-consuming activity on the Internet
  • multiple modalities: text, time, location, user info, images, etc.
  • social network structure
  • Challenges:
  • Engineering: data volume
  • Algorithmic: restricted information, grounded in context, streaming, noise

slide-3
SLIDE 3

Motivation

Assumption: Text has different use conditioned on factors such as time, location, etc.

Aim: Build models which incorporate these factors

Tasks:

  • Supervised prediction applications
  • internal, external
  • Study the effect of these factors in text use
  • Improve performance of downstream applications
slide-4
SLIDE 4

Outline

i. Introduction
ii. Data processing
iii. Temporal patterns
iv. Text forecasting real-world outcomes
v. Spatio-temporal clustering
vi. User level properties

slide-5
SLIDE 5

TrendMiner project

  • ‘Large scale, cross-lingual trend mining and summarization of real time media streams’
  • 6+4 organisations; we work with University of Southampton and SORA on machine learning
  • application to predicting political polls and aiding political analysts to make sense of social media data

www.trendminer-project.eu

slide-6
SLIDE 6

Text Processing

RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref

  • unorthodox capitalisation
  • OOV words
  • creative spellings
  • shortenings
  • new conventions
  • lack of context

slide-7
SLIDE 7

Processing Architecture

  • Fast: real time processing, Hadoop MapReduce (I/O bound), online and batch processing

  • Scalable: adding more machines
  • Modular: easy to add new modules
  • Pipeline: the user specifies his needs
  • Extensible: different sources of data (USMF format)
  • Data consistency: JSON format, append to ‘analysis’
  • Reusable: open-source

(ICWSM 2012)
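
The "append to ‘analysis’" convention listed above can be illustrated with a minimal sketch. It assumes each message is one JSON object with a `text` field and that every module writes its output under its own key inside an `analysis` object. The module names (`tokens`, `hashtags`) and the `run_pipeline` helper are hypothetical and are not taken from the TrendMiner code base.

```python
import json
import re

def tokenise(record):
    """Hypothetical module: whitespace tokenisation of the message text."""
    return record["text"].split()

def hashtags(record):
    """Hypothetical module: extract hashtags from the message text."""
    return re.findall(r"#\w+", record["text"])

# Registry of available modules; the user selects which ones to run.
MODULES = {"tokens": tokenise, "hashtags": hashtags}

def run_pipeline(line, selected):
    """Parse one JSON line, run the selected modules and store each result
    under the 'analysis' key, leaving the original fields untouched."""
    record = json.loads(line)
    analysis = record.setdefault("analysis", {})
    for name in selected:
        analysis[name] = MODULES[name](record)
    return json.dumps(record)

line = json.dumps({"text": "lvly speech by cameron #indyref"})
out = json.loads(run_pipeline(line, ["tokens", "hashtags"]))
```

Because each module only adds a key, modules can be chained or re-run in any order without invalidating earlier output, which is one plausible reading of the modular and data-consistency points above.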

slide-8
SLIDE 8

Components

slide-9
SLIDE 9

Gaussian Processes

(EMNLP 2013)

Task: Forecast hashtag frequency in Social Media

  • identify and categorise complex temporal patterns

Non-parametric Bayesian framework

  • kernelised
  • probabilistic formulation
  • propagation of uncertainty
  • exact posterior inference for regression
  • Non-parametric extension of Bayesian regression
  • very good results, but hardly used in NLP
slide-10
SLIDE 10

Gaussian Processes

Define prior over functions

Compute posterior

(ACL 2014 Tutorial)
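
The two steps above (prior over functions, then posterior) can be sketched for one-dimensional regression. This is a minimal illustration under stated assumptions: a zero-mean prior, a squared exponential (SE) kernel with unit variance and lengthscale, Gaussian noise of 0.01, and toy data from a sine curve. None of these settings come from the slides.

```python
import math

def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared exponential (SE) covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_posterior(xs, ys, x_star, noise=1e-2):
    """Posterior mean and variance at x_star under a zero-mean GP prior."""
    n = len(xs)
    K = [[se_kernel(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [se_kernel(x, x_star) for x in xs]
    alpha = solve(K, ys)       # (K + noise * I)^-1 y
    v = solve(K, k_star)       # (K + noise * I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = se_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.sin(x) for x in xs]
mean_near, var_near = gp_posterior(xs, ys, 1.0)   # at an observed input
mean_far, var_far = gp_posterior(xs, ys, 10.0)    # far from all observations
```

Near the data the posterior mean follows the observations with low variance; far from the data it reverts to the prior mean with the prior variance, which is the propagation of uncertainty mentioned earlier.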

slide-11
SLIDE 11

Extrapolation

slide-12
SLIDE 12

Examples of time series

[Figure: example hashtag time series for #FAIL, #RAW, #SNOW and #FYI; plot label: SE]

slide-13
SLIDE 13

Experimental results

slide-14
SLIDE 14

Experimental results

Compared to Mean prediction

slide-15
SLIDE 15

Text classification

Task: Assign the hashtag to a given tweet

  • Most frequent (MF)
  • Naive Bayes model (NB-E)
  • Naive Bayes with GP forecast as prior (NB-P)

            MF        NB-E      NB-P
Match@1     7.28%     16.04%    17.39%
Match@5     19.90%    29.51%    31.91%
Match@50    44.92%    59.17%    60.85%
MRR         0.144     0.237     0.252
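
The difference between an empirical prior and a forecast prior can be sketched as follows. This is an illustrative toy, not the system evaluated above: the word likelihoods and prior values are made-up numbers, and `nb_scores`, `rank` and `reciprocal_rank` are hypothetical helper names.

```python
import math

def nb_scores(tokens, log_prior, word_probs, unseen=1e-6):
    """Log-posterior (up to a constant) of each hashtag for a tweet:
    log P(tag) + sum over words of log P(word | tag)."""
    scores = {}
    for tag, prior in log_prior.items():
        loglik = sum(math.log(word_probs[tag].get(w, unseen)) for w in tokens)
        scores[tag] = prior + loglik
    return scores

def rank(scores):
    """Hashtags sorted from most to least probable."""
    return sorted(scores, key=scores.get, reverse=True)

def reciprocal_rank(ranking, gold):
    """1 / position of the gold hashtag; the mean over tweets gives MRR."""
    return 1.0 / (ranking.index(gold) + 1)

# Toy word likelihoods (made-up numbers for illustration only).
word_probs = {
    "#snow": {"cold": 0.3, "morning": 0.2},
    "#fail": {"cold": 0.2, "morning": 0.3},
}
tokens = ["cold", "morning"]

# Prior from overall empirical frequency.
empirical = {"#snow": math.log(0.3), "#fail": math.log(0.7)}
# Prior from a forecast of hashtag frequency for the current time interval.
forecast = {"#snow": math.log(0.8), "#fail": math.log(0.2)}

rank_e = rank(nb_scores(tokens, empirical, word_probs))
rank_p = rank(nb_scores(tokens, forecast, word_probs))
```

With identical word likelihoods, the ranking is decided entirely by the prior, so a forecast that anticipates a surge in one hashtag moves it to the top.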

slide-16
SLIDE 16

User behaviour

Task: Predict venue check-in frequencies

  • Modelled using GPs
  • Compared to Mean

[Figure: results for Linear, SE, PER, PS and Select]

slide-17
SLIDE 17

Individual user behaviour

Task: Predict venue type of user check-in

  • highly periodic
  • compared to standard Markov predictors

Method           Accuracy
Random           11.11%
M.Freq Categ.    35.21%
Markov-1         36.13%
Markov-2         34.21%
Daily period     38.92%
Weekly period    40.65%

(WebScience 2013)
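
The two families of predictors compared above can be sketched on toy data. This is a simplified illustration, not the evaluated implementation: the category names, the hourly time index and both function names are assumptions.

```python
from collections import Counter, defaultdict

def markov1_predict(history):
    """Predict the next venue category as the most frequent successor
    of the last observed category (first-order Markov predictor)."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        successors[prev][nxt] += 1
    last = history[-1]
    if not successors[last]:
        # Fall back to the overall most frequent category.
        return Counter(history).most_common(1)[0][0]
    return successors[last].most_common(1)[0][0]

def periodic_predict(checkins, slot, period):
    """Predict the most frequent category seen at the same position within
    the period (e.g. the same hour of day for period=24).
    Assumes at least one past check-in falls in that slot."""
    same_slot = Counter(cat for t, cat in checkins if t % period == slot % period)
    return same_slot.most_common(1)[0][0]

history = ["Home", "Work", "Food", "Work", "Home", "Work", "Food"]
next_cat = markov1_predict(history)

# (hour index, category) pairs over three days; hypothetical data.
checkins = [(8, "Work"), (13, "Food"), (32, "Work"), (37, "Food"), (56, "Work")]
at_lunch = periodic_predict(checkins, slot=61, period=24)
```

The Markov predictor conditions only on the previous check-in, while the periodic predictor conditions only on the time slot, which is why strongly periodic behaviour favours the latter.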

slide-18
SLIDE 18

(ACL 2013)

Text based forecasting

Task: predicting real world outcomes

Aim: replace expensive polls with streaming text

  • predict political voting intention (not elections!)
  • based on social media (Twitter) text
  • strong baselines (last day, mean)
  • 2 different use cases (UK and Austria)
  • UK: 42k users, 60m tweets, 3 parties, 2 years
slide-19
SLIDE 19

Linear regression

$w x_t + \beta = y_t$

slide-20
SLIDE 20

Linear regression

$w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2$

slide-21
SLIDE 21

Linear regression

$w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2 + \psi_{el}(w, \rho)$

LEN – Elastic Net
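
The regularised objective above can be evaluated directly for a candidate solution. The sketch assumes the usual elastic-net form of the penalty (a weighted sum of the L1 norm and the squared L2 norm), with the two weights standing in for the regularisation parameter; the slides do not spell out this parametrisation, and the function names and toy data are hypothetical.

```python
def elastic_net_penalty(w, l1_weight, l2_weight):
    """Assumed standard form: l1_weight * ||w||_1 + l2_weight * ||w||_2^2."""
    return (l1_weight * sum(abs(v) for v in w)
            + l2_weight * sum(v * v for v in w))

def len_objective(w, beta, X, y, l1_weight, l2_weight):
    """Sum of squared errors of the linear model plus the penalty."""
    sq_err = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + beta
        sq_err += (pred - y_i) ** 2
    return sq_err + elastic_net_penalty(w, l1_weight, l2_weight)

# Toy data: the target depends only on the first feature.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 0.0, 2.0]

exact_fit = len_objective([2.0, 0.0], 0.0, X, y, l1_weight=0.5, l2_weight=0.1)
unregularised = len_objective([2.0, 0.0], 0.0, X, y, l1_weight=0.0, l2_weight=0.0)
```

The L1 term pushes uninformative weights to exactly zero, while the L2 term keeps correlated features from being dropped arbitrarily.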

slide-22
SLIDE 22

Bilinear regression

  • main issue is noise: many non-informative users
  • we look for a model of sparse words & sparse users
  • bi-convex optimisation problem
  • solved by alternately fixing each set of weights and iterating until convergence

slide-23
SLIDE 23

Bilinear regression

$u X_t w^T + \beta = y_t$

slide-24
SLIDE 24

Bilinear regression

$w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2$

slide-25
SLIDE 25

Bilinear regression

$w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2 + \psi_{el}(w, \rho_1) + \psi_{el}(u, \rho_2)$

BEN – Bilinear Elastic Net
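
The alternating scheme can be made explicit with a short derivation. It assumes u is a row vector over users, X_i a users-by-words matrix and w a row vector over words, which matches the shapes implied by the product u X_i w^T; the auxiliary symbols z_i and v_i are introduced here for illustration only.

```latex
% Step 1: with u fixed, collapse the user dimension.
% z_i is a row vector over words, so the model is linear in w:
z_i = u X_i, \qquad u X_i w^T + \beta = z_i w^T + \beta

% Step 2: with w fixed, collapse the word dimension.
% v_i is a column vector over users, so the model is linear in u:
v_i = X_i w^T, \qquad u X_i w^T + \beta = u v_i + \beta
```

Each step is therefore an ordinary elastic-net regression, one over word weights and one over user weights, and the two steps are repeated until the objective stops decreasing.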

slide-26
SLIDE 26

Bilinear regression

$w_t, u_t, \beta = \arg\min \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2)$

slide-27
SLIDE 27

Bilinear regression

$w_t, u_t, \beta = \arg\min \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2)$

slide-28
SLIDE 28

Bilinear regression

$w_t, u_t, \beta = \arg\min \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2)$

slide-29
SLIDE 29

Bilinear regression

$w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{\ell_1 \ell_2}(w, \rho_1) + \psi_{\ell_1 \ell_2}(u, \rho_2)$

BGL – Bilinear Group LASSO
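
The effect of the group penalty can be seen on a small example. The sketch assumes the standard L1/L2 mixed norm (sum over rows of the Euclidean norm of each row), with rows indexing features and columns indexing tasks; the slides do not state the exact definition, and the matrices below are toy values.

```python
import math

def l1_l2_penalty(W, rho):
    """rho * sum over rows of the Euclidean norm of each row.
    Rows index features (words or users); columns index tasks (parties)."""
    return rho * sum(math.sqrt(sum(v * v for v in row)) for row in W)

# Two weight matrices with the same non-zero entries, arranged differently.
shared_rows = [[3.0, 4.0], [0.0, 0.0]]   # one feature used by both tasks
scattered = [[3.0, 0.0], [0.0, 4.0]]     # a different feature for each task

p_shared = l1_l2_penalty(shared_rows, rho=1.0)
p_scattered = l1_l2_penalty(scattered, rho=1.0)
```

The shared arrangement is cheaper (5.0 versus 7.0), so this penalty favours selecting the same words and users across all tasks and zeroing out whole rows.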

slide-30
SLIDE 30

Quantitative results

Root Mean Squared Error (RMSE) forecasting results over 50 testing polls (in VI %)

[Figure legend: BGL, BEN, Polls]

slide-31
SLIDE 31

Quantitative results

Party | Tweet | Score | Author
CON | PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo | 1.334 | Journalist
CON | Have Liberal Democrats broken electoral rules? Blog on Labour complaint to cabinet secretary | -0.991 | Journalist
LAB | Blog Post Liverpool: City of Radicals Website now Live <link> #liverpool #art | 1.954 | Art Fanzine
LAB | I am so pleased to head Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS | -0.552 | Politician (Labour)
LBD | RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user) | 0.874 | LibDem MP
LBD | Blog Post Liverpool: City of Radicals 2011 – More Details Announced #liverpool #art | -0.521 | Art Fanzine

slide-32
SLIDE 32
User features

  • The real-world outcome and users share:
  • i. region info: London (L), South England (S), Midlands & Wales (MW), North (N), Scotland (Sc) - observed
  • ii. gender: Male (M), Female (F) - inferred using statistical text-based classifier
  • iii. age: 18-24, 25-39, 40-59, 60+ - unknown

slide-33
SLIDE 33

Recap: Bilinear regression

$w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{\ell_1 \ell_2}(w, \rho_1) + \psi_{\ell_1 \ell_2}(u, \rho_2)$

BGL – Bilinear Group LASSO

slide-34
SLIDE 34

Region & Demographics

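$w, u, \beta = \arg\min \sum_{r=1}^{\epsilon} \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_{tr} X_{ir} w_{tr}^T + \beta_{tr} - y_{tir})^2 + \sum_{r=1}^{\epsilon} \left( \psi_{\ell_1 \ell_2}(w_r, \rho_1) + \psi_{\ell_1 \ell_2}(w_t, \rho_1) + \psi_{\ell_1 \ell_2}(u_r, \rho_2) \right)$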

BGGR

slide-35
SLIDE 35

Region & Demographics

Regional model

          S      L      MW     N      Sc     μ
B_μ       2.9    3.9    3.2    3.2    3.8    3.4
B_last    3.0    4.9    4.3    4.0    5.3    4.3
BGGR      2.6    3.9    3.2    3.0    3.7    3.3

Gender model

          M      F      μ
B_μ       2.6    2.1    2.4
B_last    2.6    2.4    2.5
BGGR      2.1    2.1    2.1

slide-36
SLIDE 36

Region & Demographics

London Predictions Female Predictions

slide-37
SLIDE 37

Region & Demographics

Conservatives, Positive London

slide-38
SLIDE 38

NewsSummaries dataset

(LACSS 2014)

Task: Predict socioeconomic EU indicators

Dataset:

  • News summaries from Open Europe think tank
  • Daily summaries of EU and member states related news together with their news source
  • Feb 2006 – Nov 2013; 1,913 days; 94 months
  • 296 news outlets (with >10 summaries)
  • Features: unigrams + bigrams

slide-39
SLIDE 39

Predictions

ESI (Economic Sentiment Indicator), Unemployment

       ESI              Unemployment
LEN    9.253 (9.89%)    0.9275 (8.75%)
BEN    8.209 (8.77%)    0.9047 (8.52%)

slide-40
SLIDE 40

Economic Sentiment Indicator

slide-41
SLIDE 41

Unemployment

slide-42
SLIDE 42

Deep linguistic features

  • Unigrams (8,912) (cameron)
  • Bigrams (33,206) (david__cameron)
  • POS (10,277): Unigrams together with their part-of-speech (cameron/NNP)
  • NE (1,013): Entities - Location, Person or Organisation (Person:David_Cameron)
  • Annotations (3,392): Link entities to DBpedia, e.g. political party (Org:Conservative_Party), office held (Office:Prime_minister)
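
The two simplest feature types above can be extracted in a few lines. This is a minimal sketch that assumes lowercased whitespace tokenisation and the double-underscore bigram separator seen in the example (david__cameron); the function name is hypothetical, and the POS, NE and annotation features would need external taggers that are not shown.

```python
from collections import Counter

def ngram_features(text):
    """Counts of unigrams and bigrams, with bigrams joined by '__'."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(a + "__" + b for a, b in zip(tokens, tokens[1:]))
    return unigrams + bigrams

feats = ngram_features("David Cameron meets David Cameron")
```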
slide-43
SLIDE 43

Deep linguistic features

Features                                  ESI      Unempl.
Unigrams                                  8.21     1.27
Bigrams                                   9.66     1.61
Unigrams + Bigrams                        8.91     1.47
POS                                       7.87     1.14
Entities                                  9.59     1.45
POS + NE                                  8.09     1.12
NE + Annotations                          12.67    1.62
POS + NE + Annotations                    10.50    1.31
Unigrams + NE + Annotations               10.92    1.31
Unigrams + Bigrams + NE + Annotations     10.81    1.53

slide-44
SLIDE 44

Clustering

Dimensionality reduction is used to aid browsing large data collections

Topic models:

  • find ‘topics’ in a collection of documents
  • ‘topic’ = a set of semantically coherent words
  • each document is assigned to a few ‘topics’
  • each word is assigned with a probability to each ‘topic’ (soft clustering)
  • extra factors can be accommodated, e.g. spatio-temporal dependencies and evolution

slide-45
SLIDE 45

Temporal topic models

Latent Dirichlet Allocation (LDA)

Dirichlet Multinomial Regression (DMR)

slide-46
SLIDE 46
Temporal & Regional models

  • LDA: Documents analysed over time, no temporal conditioning
  • Temporal DMR (MId): Documents authored in the same interval share similar topics
  • Temporal DMR (TimeRBF): Neighbouring time intervals influence each other
  • Regional DMR (OutletId): Documents with similar news source share similar topics
  • Regional DMR (DomainId): Documents with similar domain name share similar topics
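
How document features condition the topic prior can be sketched as follows. In DMR the Dirichlet prior of each document is the exponential of a linear function of its features; the RBF time features, their centres and width, and the weight values below are illustrative assumptions, not the settings used in the experiments.

```python
import math

def rbf_time_features(t, centres, width):
    """One radial basis function per centre; nearby times get similar features."""
    return [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centres]

def dmr_alpha(features, lambdas):
    """Document-specific Dirichlet prior: alpha_k = exp(x . lambda_k)."""
    return [math.exp(sum(x * l for x, l in zip(features, lam))) for lam in lambdas]

centres = [0.0, 12.0, 24.0]          # hypothetical centres, in months
lambdas = [[1.0, -1.0, 0.0],         # topic 0: favoured early in the period
           [0.0, 1.0, 0.0]]          # topic 1: favoured around month 12

alpha_0 = dmr_alpha(rbf_time_features(0.0, centres, 6.0), lambdas)
alpha_11 = dmr_alpha(rbf_time_features(11.0, centres, 6.0), lambdas)
alpha_12 = dmr_alpha(rbf_time_features(12.0, centres, 6.0), lambdas)
```

Documents from months 11 and 12 receive nearly the same prior, while month 0 differs, which is how neighbouring intervals influence each other. An indicator feature per interval, outlet or domain would instead share the prior only among documents with an identical value.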

slide-47
SLIDE 47

Spatio-temporal experiments

Method                           Perplexity
LDA                              4,597
DMR MId                          4,575
DMR TimeRBF                      4,262
DMR TimeRBF+OutletId             4,086
DMR TimeRBF+OutletId+DomainId    4,036
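
For reference, perplexity is the exponential of the negative average log-likelihood per held-out token, so lower is better. The sketch below uses made-up token probabilities; how the per-token probabilities are obtained from a topic model is not shown.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean log-likelihood per held-out token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4,000 word types.
uniform = [math.log(1.0 / 4000)] * 10
ppl_uniform = perplexity(uniform)

# A model that assigns twice the probability to every observed token.
sharper = [math.log(2.0 / 4000)] * 10
ppl_sharper = perplexity(sharper)
```

A perplexity of about 4,000 therefore corresponds to a model that is, on average, as uncertain as a uniform choice among 4,000 words.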

slide-48
SLIDE 48

Experiments: temporal & regional

Top domains: .it 3.44, .fr 0.09, .tv 0.08, .ee 0.06, .ir 0.05

Top outlets: ft.com 0.79, corriere.it 0.68, repubblica.it 0.49, elpais.com 0.45

slide-49
SLIDE 49

Experiments: temporal & regional

slide-50
SLIDE 50

Experiments: temporal & regional

Top domains: .fr 0.27, .org 0.10, .es 0.08, .ca 0.06, .ch 0.03

Top outlets: guardian.co.uk 0.61, diplomatie.gouv.fr 0.60, bluesstatedigital.com 0.55, dw-world.de 0.49

slide-51
SLIDE 51

Experiments: temporal & regional

slide-52
SLIDE 52

User-level properties

  • User-level properties: age, gender, location, social grade, impact
  • Aim: understand text use in context of these features - ‘profile’ users
  • Task:
  • build a model with good predictive value on held-out users
  • interpret the features of this model
slide-53
SLIDE 53

User impact

Impact score:

$\ln\left(\frac{\text{listings} \cdot \text{followers}^2}{\text{followees}}\right)$

Data: 38k UK users, 48m deduplicated messages, all tweets from 1 year

Features: profile info and text under the user’s control

(EACL 2014)
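
The impact score above can be computed as follows. The add-one smoothing on each count is an assumption made here so that the logarithm and the division are defined for accounts with zero counts; the slide itself shows only the unsmoothed ratio, and the function name is hypothetical.

```python
import math

def impact_score(followers, followees, listings, smoothing=1.0):
    """ln(listings * followers^2 / followees), with `smoothing` added to each
    count (an assumption, to avoid log(0) and division by zero)."""
    return math.log((listings + smoothing) * (followers + smoothing) ** 2
                    / (followees + smoothing))

new_account = impact_score(followers=0, followees=0, listings=0)
popular = impact_score(followers=99, followees=9, listings=0)
follow_back = impact_score(followers=99, followees=999, listings=0)
```

Squaring the follower count means that following many accounts in the hope of being followed back does not by itself raise the score: the follower-to-followee ratio has to be high as well.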

slide-54
SLIDE 54

User impact

  • Models:
  • Linear Regression (LIN)
  • Gaussian processes (GP) with ARD kernel
  • Features:
  • User account (18)
  • Topics from user text (100): derived using spectral clustering on word co-occurrence matrix

Pearson correlation
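
An ARD (automatic relevance determination) kernel gives every input feature its own lengthscale. The sketch below assumes the squared exponential form of the kernel; the lengthscale values are made up to show the effect and are not fitted values from the study.

```python
import math

def ard_se_kernel(x, z, lengthscales, variance=1.0):
    """Squared exponential kernel with one lengthscale per input dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, z, lengthscales))
    return variance * math.exp(-0.5 * sq)

x = [1.0, 5.0]
lengthscales = [1.0, 100.0]   # feature 0 matters, feature 1 barely does

same = ard_se_kernel(x, x, lengthscales)
relevant_change = ard_se_kernel(x, [3.0, 5.0], lengthscales)
irrelevant_change = ard_se_kernel(x, [1.0, 7.0], lengthscales)
```

Changing a feature with a short lengthscale makes two users look very different to the model, while changing a feature with a long lengthscale leaves the covariance almost unchanged. Fitted lengthscales can therefore be read as a measure of feature relevance.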

slide-55
SLIDE 55

User impact

Feature | Importance
Using default profile image | 0.73
Total number of tweets (entire history) | 1.32
Number of unique @-mentions in tweets | 2.31
Number of tweets (in dataset) | 3.47
Links ratio in tweets | 3.57
T1 (Weather): mph, humidity, barometer, gust, winds | 3.73
T2 (Healthcare, Housing): nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital | 5.44
T3 (Politics): senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections | 6.07
Proportion of days with non-zero tweets | 6.96
Proportion of tweets with @-replies | 7.10

slide-56
SLIDE 56

User impact

Impact distribution for users with high (H) values of this feature as opposed to low (L). Red line is the mean impact score.

Panels: Number of tweets; Number of unique @-mentions

slide-57
SLIDE 57

User impact

Topic: damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer

Topic: senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections

Impact distribution for users with high (H) values of this feature. Red line is the mean impact score.

slide-58
SLIDE 58

User impact

User scenario:

  • 1. high number of tweets
  • 2. talk about T3 (showbiz)
  • 3. talk about T4 (politics)
  • 4. use links (L)
  • 5. do not use links (NL)
slide-59
SLIDE 59

Collaborators

Vasileios Lampos, UCL, www.lampos.net
Trevor Cohn, Melbourne, http://dcs.shef.ac.uk/~tcohn/
Sina Samangooei, Southampton, www.sinjax.net
Nikos Aletras, Sheffield, http://dcs.shef.ac.uk/~nikos/

slide-60
SLIDE 60

References

(ICWSM 2012) Trendminer: An Architecture for Real Time Analysis of Social Media Text.

  • D. Preotiuc-Pietro, S. Samangooei, T. Cohn, N. Gibbins, M. Niranjan

(HT 2013) Where’s @wally: A classification approach to Geolocating users based on their social ties.

  • D. Rout, D. Preotiuc-Pietro, K. Bontcheva, T. Cohn (‘Ted Nelson’ award)

(WebScience 2013) Mining User Behaviours: A study of check-in patterns in Location Based Social Networks.

  • D. Preotiuc-Pietro, T. Cohn

(ACL 2013) A user-centric model of voting intention from Social Media.

  • V. Lampos, D. Preotiuc-Pietro, T. Cohn

(EMNLP 2013) A temporal model of text periodicities using Gaussian Processes.

  • D. Preotiuc-Pietro, T. Cohn

(EACL 2014) Predicting and Characterising User Impact on Twitter.

  • V. Lampos, N. Aletras, D. Preotiuc-Pietro, T. Cohn

(LACSS 2014) Extracting Socioeconomic Patterns from the News: Modelling Text and Outlet Importance Jointly.

  • V. Lampos, D. Preotiuc-Pietro, S. Samangooei, D. Gelling, T. Cohn
slide-61
SLIDE 61

Thank you!