Mining the social web: A series of statistical NLP case studies
Vasileios Lampos
Department of Computer Science, University College London, May 2014
v.lampos@ucl.ac.uk Slides: http://bit.ly/1v3Jeiy
(Lansdall et al., 2012)
(Lampos, Cristianini, 2010 & 2012)
(Lampos et al., 2013)
(Lampos et al., 2014)
[Figure: 933-day time series for joy in Twitter content; x-axis: date (Jul 09 to Jan 12), y-axis: normalised emotional valence; raw joy signal and 14-day smoothed joy; annotated dates: riots, cuts, Christmas, royal wedding, Halloween, Valentine's Day, Easter]
Joy terms: happy, enjoy, love, glad, joyful, elated...
[Figure: difference in mean for anger and fear; x-axis: date (Jul 09 to Jan 12); marked: date of budget cuts, date of riots]
(Lansdall et al., 2012), (Strapparava, Valitutti, 2004) → WordNet Affect
[Figure: days of the week (Saturday to Friday) plotted against the 1st and 2nd principal components]
[Figure: days in 2011 (numbered 1 to 365) plotted against the 1st and 2nd principal components]
New Year (1), Valentine’s (45), Christmas Eve (358), New Year’s Eve (365); O.B. Laden’s death (122), Winehouse’s death & Breivik (204), UK riots (221)
(Lampos, 2012), (Strapparava, Valitutti, 2004) → WordNet Affect
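observations $x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ (rows of $X$)
responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ (entries of $y$)
weights $w_j \in \mathbb{R}$, $j \in \{1, ..., m\}$ (entries of $w$)
$w^* = \arg\min_{w} \|Xw - y\|^2_{\ell_2} \Rightarrow w^* = (X^\top X)^{-1} X^\top y$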
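A minimal sketch of the closed-form least-squares solution, assuming NumPy is available. The function name `ols_fit` and the toy data are hypothetical and not taken from the slides. It solves the normal equations directly rather than forming the inverse of $X^\top X$.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    # Solving the linear system is numerically safer than inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: 4 observations, 2 features, generated by w = [2, -1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w_star = ols_fit(X, y)
```

Because the toy responses are noise-free, the recovered weights should match the generating ones up to floating-point error.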
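observations $x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ (rows of $X$)
responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ (entries of $y$)
weights $w_j \in \mathbb{R}$, $j \in \{1, ..., m\}$ (entries of $w$)
$w^* = \arg\min_{w} \left\{ \|Xw - y\|^2_{\ell_2} + \lambda \|w\|^2_{\ell_2} \right\}$
(Hoerl, Kennard, 1970)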
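A minimal sketch of ridge regression in closed form, assuming NumPy. The minimiser of $\|Xw - y\|^2_{\ell_2} + \lambda \|w\|^2_{\ell_2}$ is $(X^\top X + \lambda I)^{-1} X^\top y$; the name `ridge_fit` and the toy data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Hypothetical toy data: 4 observations, 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

w_no_penalty = ridge_fit(X, y, lam=0.0)   # reduces to ordinary least squares
w_penalised = ridge_fit(X, y, lam=10.0)   # weights shrink towards zero
```

Setting `lam=0.0` recovers the least-squares weights, while a larger `lam` yields a weight vector with a smaller norm.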
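observations $x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ (rows of $X$)
responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ (entries of $y$)
weights $w_j \in \mathbb{R}$, $j \in \{1, ..., m\}$ (entries of $w$)
$w^* = \arg\min_{w} \left\{ \|Xw - y\|^2_{\ell_2} + \lambda \|w\|_{\ell_1} \right\}$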
(Efron et al., 2004)
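A minimal sketch of how an $\ell_1$-penalised least-squares problem can be solved, assuming NumPy. The slide cites least angle regression (Efron et al., 2004); this sketch swaps in plain iterative shrinkage-thresholding (ISTA) because it fits in a few lines. The names `soft_threshold` and `lasso_ista` and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise ||Xw - y||^2 + lam * ||w||_1 by proximal gradient steps."""
    w = np.zeros(X.shape[1])
    # Lipschitz constant of the gradient of ||Xw - y||^2.
    L = 2.0 * np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Hypothetical toy data: 4 observations, 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

w_dense = lasso_ista(X, y, lam=0.0)    # no penalty: least-squares weights
w_empty = lasso_ista(X, y, lam=20.0)   # strong penalty: every weight is zero
```

The penalty sets weights exactly to zero once `lam` is large enough, which is what makes the method usable for feature selection.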
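observations $x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ (rows of $X$)
responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ (entries of $y$)
$w^* = \arg\min_{w} \left\{ \|Xw - y\|^2_{\ell_2} + \lambda \|w\|_{\ell_1} \right\}$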
[Figure: Twitter’s flu score vs HPA’s flu rate for region D; x-axis: day number (2009), y-axis: flu rate / score (z-scores)]
(Lampos, Cristianini, 2010)
‘unwel’, ‘temperatur’, ‘headach’, ‘appetit’, ‘symptom’, ‘diarrhoea’, ‘muscl’, ‘feel’, ‘flu’, ‘cough’, ‘nose’, ‘vomit’, ‘diseas’, ‘sore’, ‘throat’, ‘fever’, ‘ach’, ‘runni’, ‘sick’, ‘ill’, ...
[Figure: HPA flu rate vs inferred flu rate; x-axis: day number (2009), y-axis: flu rate]
(Lampos, Cristianini, 2010)
[Figure: actual vs inferred rainfall rate (mm), one panel each for Bristol and London; x-axis: days]
(Lampos, Cristianini, 2012)
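observations $x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$ (rows of $X$)
responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$ (entries of $y$)
weights $w_j \in \mathbb{R}$, $j \in \{1, ..., m\}$ (entries of $w$)
$w^* = \arg\min_{w} \left\{ \|Xw - y\|^2_{\ell_2} + \lambda_1 \|w\|^2_{\ell_2} + \lambda_2 \|w\|_{\ell_1} \right\}$
(Zou, Hastie, 2005)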
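A minimal sketch of the elastic net penalty, assuming NumPy and the parameterisation $\lambda_1 \|w\|^2_{\ell_2} + \lambda_2 \|w\|_{\ell_1}$. It shows the objective and the proximal step that a shrinkage-thresholding solver would apply; the function names are hypothetical.

```python
import numpy as np

def elastic_net_objective(X, y, w, lam1, lam2):
    """||Xw - y||^2 + lam1 * ||w||_2^2 + lam2 * ||w||_1."""
    residual = X @ w - y
    return residual @ residual + lam1 * (w @ w) + lam2 * np.sum(np.abs(w))

def prox_elastic_net(v, step, lam1, lam2):
    """Proximal operator of step * (lam1 * ||.||_2^2 + lam2 * ||.||_1).

    Soft-thresholding handles the l1 part; the division handles the l2 part.
    """
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam2, 0.0)
    return shrunk / (1.0 + 2.0 * step * lam1)

v = np.array([3.0, -0.5, 1.0])
w_prox = prox_elastic_net(v, step=1.0, lam1=0.5, lam2=1.0)
value = elastic_net_objective(
    np.eye(2), np.array([1.0, 0.0]), np.array([1.0, -1.0]), lam1=0.5, lam2=1.0
)
```

The $\ell_1$ part zeroes small entries while the $\ell_2$ part rescales the survivors, so the penalty combines sparsity with shrinkage.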
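Linear: $x_i^\top w$
$x_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$
$w_j$, $j \in \{1, ..., m\}$
Bilinear:
$Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$
$u_k$, $k \in \{1, ..., p\}$
$w_j$, $j \in \{1, ..., m\}$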
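$Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$
$u_k$, $k \in \{1, ..., p\}$
$w_j$, $j \in \{1, ..., m\}$
$f(Q_i) = u^\top Q_i w + \beta$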
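A minimal sketch of the bilinear prediction $u^\top Q_i w + \beta$, assuming NumPy and a $p \times m$ matrix $Q_i$. The function name `bilinear_predict` and the toy numbers are hypothetical.

```python
import numpy as np

def bilinear_predict(Q_i, u, w, beta):
    """Return u^T Q_i w + beta for one p-by-m observation matrix Q_i."""
    return u @ Q_i @ w + beta

# Hypothetical toy example with p = 2 rows and m = 3 columns.
Q_i = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 1.0]])
u = np.array([1.0, 2.0])        # one weight per row of Q_i
w = np.array([1.0, 1.0, 1.0])   # one weight per column of Q_i
prediction = bilinear_predict(Q_i, u, w, beta=0.5)
```

With either weight vector held fixed, the prediction is linear in the other one, which is what makes alternating optimisation possible.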
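$Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$
$u_k$, $k \in \{1, ..., p\}$
$w_j$, $j \in \{1, ..., m\}$
$\{u^*, w^*, \beta^*\} = \arg\min_{u, w, \beta} \sum_{i=1}^{n} (u^\top Q_i w + \beta - y_i)^2 + \psi(u, \lambda_{u_1}, \lambda_{u_2}) + \psi(w, \lambda_{w_1}, \lambda_{w_2})$
$\psi(v, \lambda_1, \lambda_2) = \lambda_1 \|v\|^2_{\ell_2} + \lambda_2 \|v\|_{\ell_1}$
(Lampos et al., 2013)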
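$\arg\min_{u, w, \beta} \sum_{i=1}^{n} (u^\top Q_i w + \beta - y_i)^2 + \lambda_{u_1} \|u\|^2_{\ell_2} + \lambda_{u_2} \|u\|_{\ell_1} + \lambda_{w_1} \|w\|^2_{\ell_2} + \lambda_{w_2} \|w\|_{\ell_1}$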
Fix u, learn w, and vice versa
Iterating through convex optimisation tasks: convergence (Al-Khayyal, Falk, 1983; Horst, Tuy, 1996)
FISTA (Beck, Teboulle, 2009) implemented in SPAMS (Mairal et al., 2010): large-scale optimisation solver, quick convergence
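A minimal sketch of the alternating scheme, assuming NumPy. To stay short it deliberately departs from the slides: each convex sub-problem uses a closed-form ridge ($\ell_2$ only) step instead of the elastic net solved with FISTA in SPAMS, and the bias $\beta$ is dropped. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(A, y, lam):
    """Closed-form ridge solution for design matrix A."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def objective(Q, y, u, w, lam):
    """Squared error of u^T Q_i w plus ridge penalties on u and w."""
    pred = np.einsum("k,ikj,j->i", u, Q, w)
    return np.sum((pred - y) ** 2) + lam * (u @ u + w @ w)

def alternating_fit(Q, y, lam=0.1, n_steps=20, seed=0):
    """Alternate between the two convex sub-problems of the bilinear model."""
    n, p, m = Q.shape
    rng = np.random.default_rng(seed)
    u = rng.normal(size=p)
    w = rng.normal(size=m)
    history = []
    for _ in range(n_steps):
        # Fix u: each u^T Q_i is an m-vector, so w solves a ridge problem.
        w = ridge_fit(np.einsum("k,ikj->ij", u, Q), y, lam)
        # Fix w: each Q_i w is a p-vector, so u solves a ridge problem.
        u = ridge_fit(np.einsum("ikj,j->ik", Q, w), y, lam)
        history.append(objective(Q, y, u, w, lam))
    return u, w, history

# Synthetic data: n = 30 observations, p = 3, m = 4.
rng = np.random.default_rng(1)
Q = rng.normal(size=(30, 3, 4))
y = np.einsum("k,ikj,j->i", rng.normal(size=3), Q, rng.normal(size=4))
u_hat, w_hat, history = alternating_fit(Q, y)
```

Each step minimises the shared objective exactly over one block of variables, so the recorded objective values never increase from one step to the next.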
[Figure: RMSE on held-out data vs global objective function through iterations; x-axis: step]
(Preoţiuc-Pietro et al., 2012)
[Figure: voting intention (%) over time for CON, LAB and LIB]
(Preoţiuc-Pietro et al., 2012)
[Figure: voting intention (%) over time for SPÖ, ÖVP, FPÖ and GRÜ]
µ: constant prediction based on µ(y)
[Figure: voting intention (%) over time; two panels for the UK (CON, LAB, LIB) and two panels for Austria (SPÖ, ÖVP, FPÖ, GRÜ)]
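$\arg\min_{W, \beta} \|XW - Y\|^2_{\ell_F} + \lambda \sum_{j=1}^{m} \|W_j\|_{\ell_2}$
$\ell_{2,1}$-norm regularisation (Argyriou et al., 2008; Liu et al., 2009) extends the notion of group lasso (Yuan, Lin, 2006)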
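A minimal sketch of the $\ell_{2,1}$ norm and its row-wise shrinkage step, assuming NumPy and a weight matrix with one row per feature and one column per task. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum over rows of the l2 norm of each row."""
    return np.sum(np.linalg.norm(W, axis=1))

def prox_l21(W, t):
    """Proximal operator of t * l21_norm: shrink each row as a group."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows whose norm is at most t are set to zero for every task at once.
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

# Hypothetical weights: 3 features (rows) shared across 2 tasks (columns).
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
norm_value = l21_norm(W)
W_shrunk = prox_l21(W, t=1.0)
```

Because whole rows are shrunk together, a feature is either kept for all tasks or discarded for all of them.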
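$Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}^{\tau}$, $i \in \{1, ..., n\}$
$W_j$, $j \in \{1, ..., m\}$
$u_t^\top Q_i w_t$, with $u_t$ and $w_t$ the columns of $U$ and $W$ for task $t$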
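$Q_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$
$y_i \in \mathbb{R}^{\tau}$, $i \in \{1, ..., n\}$
$W_j$, $j \in \{1, ..., m\}$
$\arg\min_{U, W, \beta} \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t^\top Q_i w_t + \beta_t - y_{ti})^2 + \lambda_u \sum_{k=1}^{p} \|U_k\|_{\ell_2} + \lambda_w \sum_{j=1}^{m} \|W_j\|_{\ell_2}$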
[Figure: voting intention (%) over time in a grid of panels; columns: Polls, BEN, BGL; UK row: CON, LAB, LIB; Austria row: SPÖ, ÖVP, FPÖ, GRÜ]
Party | Tweet | Score | Author
CON | PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo | 1.334 | Journalist
LAB | I am so pleased to hear Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS | −0.552 | Politician (Labour)
LBD | RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user) | 0.874 | LibDem MP
SPÖ | Inflationsrate in ... Teurer wurde Wohnen, Wasser, Energie. Translation: Inflation rate in Austria slightly down in July from 2,2 to 2,1%. Accommodation, Water, Energy more expensive. | 0.745 | Journalist
ÖVP | kann das buch “res publica” von johannes #voggenhuber wirklich empfehlen! so zum nachdenken und so... #europa #demokratie Translation: can really recommend the book “res publica” by johannes #voggenhuber! Food for thought and so on #europe #democracy | −2.323 | User
GRÜ | Protestsong gegen die Abschaffung des Bachelor-Studiums Internationale Entwicklung: <link> #IEbleibt #unibrennt #uniwut Translation: Protest songs against the closing-down of the bachelor course of International Development: <link> #IDremains #uniburns #unirage | 1.45 | Student Union
(Preoţiuc-Pietro et al., 2012) (Lampos et al., 2014)
Impact score $S$: based on the ratio $\phi_{in}^2 / \phi_{out}$
Histogram of the user impact scores in our data set, µ(S) = 6.776
[Figure: histogram; x-axis: impact score (S), y-axis: probability density]
@guardian @David_Cameron @PaulMasonNews @lampos @nikaletras @spam?
a1: # of tweets
a2: proportion of retweets
a3: proportion of non-duplicate tweets
a4: proportion of tweets with hashtags
a5: hashtag-tokens ratio in tweets
a6: proportion of tweets with @-mentions
a7: # of unique @-mentions in tweets
a8: proportion of tweets with @-replies
a9: links ratio in tweets
a10: # of favourites the account made
a11: total # of tweets (entire history)
a12: using default profile background (binary)
a13: using default profile image (binary)
a14: enabled geolocation (binary)
a15: population of account’s location
a16: account’s location latitude
a17: account’s location longitude
a18: proportion of days with nonzero tweets
Label | Cluster’s words ranked by centrality
Weather (τ1) | mph, humidity, barometer, gust, winds, hpa, temperature, kt
Healthcare, Finance, Housing (τ2) | nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital, rent, healthcare, therapist, condo, investment, furnished, medical, #nyc, occupational, investors, #ny
Politics (τ3) | senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections, charged, election, charges, #religion, arrest, repeal, dems, #christian, reform
Showbiz, Movies (τ4) | damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer, kristen, stefan, robert, catholic, stewart, katherine, lois, jackson, vampire, natalie, #vampirediaries
Commerce (τ5) | chevrolet, inventory, coupon, toyota, mileage, sedan, nissan, adde, jeep, 4x4, 2002, #coupon, enhanced, #deal, dodge
Twitter hashtags (τ6) | #teamfollowback, #500aday, #tfb, #instantfollowback, #ifollowback, #instantfollow, #followback
Social unrest (τ7) | #egypt, #tunisia, #iran, #israel, #palestine, tunisia, arab, #jan25, iran, israel, protests, egypt, #yemen, #iranelection, israeli, #jordan, regime, yemen, #gaza, protesters, #lebanon
... | ...
User attributes (A), A + top-words (AW), A + n clusters (AC)
Automatic Relevance Determination (ARD) (Rasmussen and Williams, 2006)

Model            Linear (RR)       Nonlinear (GP)
                 r      RMSE       r      RMSE
A                .667   2.642      .759   2.298
AW               .712   2.529      .768   2.263
AC, |τ| = 50     .703   2.518      .774   2.234
AC, |τ| = 100    .714   2.480      .780   2.210

Most predictive / relevant features: default profile image, # of historical tweets, # of unique @-mentions, # of tweets (last year), links (ratio), topic:weather, topic:healthcare-finance, topic:politics, days with nonzero tweets (ratio), @-replies (ratio)
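A minimal sketch of the two evaluation measures reported in the table, Pearson correlation (r) and root mean squared error (RMSE), using only the Python standard library. The function names and the toy numbers are hypothetical and unrelated to the reported results.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation between two equally long sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy example: predictions are perfectly correlated with the targets but biased.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]
r_value = pearson_r(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

The toy example shows why both measures are worth reporting: the correlation is perfect while the error is still large.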
[Figure: impact score distribution for user accounts with high (H) or low (L) values for the most relevant user attributes: tweets in entire history (α11), unique @-mentions (α7), links (α9), @-replies (α8), days with nonzero tweets (α18); solid line: µ(S) in our data, dashed line: µ(S) in user class]
[Figure: five panels, A to E, one per comparison below]
A: Interactive (IA) vs non-interactive (NIA) users; interactive users tweet regularly, make many @-mentions and @-replies, and mention many different users
B: IA vs clique-interactive (CIA) users; CIA users are interactive but do not mention many different users
C: Uses links (L) vs does not (NL) when discussing the most prevalent topics (Politics, Showbiz)
D: Topic focused (TF) vs topic overall (TO)
E: ‘Serious’ (ST) vs ‘light’ (LT) topics
http://www.i-sense.org.uk/
http://www.lampos.net/research/talks-posters
Al-Khayyal and Falk. Jointly Constrained Biconvex Programming. MOR, 1983.
Argyriou, Evgeniou and Pontil. Convex multi-task feature learning. Machine Learning, 2008.
Beck and Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. J. Imaging Sci., 2009.
Efron, Hastie, Johnstone and Tibshirani. Least Angle Regression. The Annals of Statistics, 2004.
Gayo-Avello. A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter
Gayo-Avello, Metaxas and Mustafaraj. Limits of Electoral Predictions using Twitter. ICWSM, 2011.
Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. 2009.
Hoerl and Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.
Horst and Tuy. Global Optimization: Deterministic Approaches. 1996.
Lampos and Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.
Lampos and Cristianini. Nowcasting Events from the Social Web with Statistical
Lampos, Preoţiuc-Pietro and Cohn. A user-centric model of voting intention from Social
Lampos, Aletras, Preoţiuc-Pietro and Cohn. Predicting and Characterising User Impact
Liu, Ji and Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. UAI, 2009.
von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
Mairal, Jenatton, Obozinski and Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
Metaxas, Mustafaraj and Gayo-Avello. How (not) to predict elections. SocialCom, 2011.
O’Connor, Balasubramanyan, Routledge and Smith. From Tweets to polls: Linking text sentiment to public opinion time series. ICWSM, 2010.
Preoţiuc-Pietro, Samangooei, Cohn, Gibbins and Niranjan. Trendminer: An architecture for real time analysis of social media text. ICWSM, 2012.
Rasmussen and Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Strapparava and Valitutti. Wordnet-Affect: An affective extension of WordNet. LREC, 2004.
Tausczik and Pennebaker. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. JLSP, 2010.
Tumasjan, Sprenger, Sandner and Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 2010.
Yuan and Lin. Model selection and estimation in regression with grouped variables. JRSS, 2006.
Zhao and Yu. On model selection consistency of LASSO. JMLR, 2006.
Zou and Hastie. Regularization and variable selection via the elastic net. JRSS, 2005.