
User-generated content mining: From collective disease rates to individual demographics

User-generated content mining: From collective disease rates to individual demographics. Vasileios Lampos, Computer Science @ UCL (@lampos | lampos.net). Language Technology Lab, University of Cambridge, Oct. 27, 2016.


  1. User-generated content mining: From collective disease rates to individual demographics Vasileios Lampos Computer Science @ UCL @lampos | lampos.net Language Technology Lab University of Cambridge Oct. 27, 2016

  2. Structure of the presentation
     1. Introductory remarks
     2. Collective disease surveillance from search query data
        - Google Flu Trends and inference inaccuracies
        - Steps towards improvement
     3. Mining socio-economic demographics from social media users
        - Occupational class
        - Income
        - Socioeconomic status
     4. Concluding remarks

  3. Context and Motivation

  4. Context and Motivation
     How can we use online user-generated content (UGC) to our benefit?

  5. User-generated content for health. WHY?
     + Online content can potentially access a larger and more representative part of the population
       Note: Health surveillance systems are based on the subset of people who actively seek medical attention
     + More timely information (almost instant)
     + Geographical regions with less established health monitoring systems could benefit
     + Small cost when data access and modelling expertise are in place

  6. Google Flu Trends: The idea
     Can we turn online search query statistics into estimates of the rate of influenza-like illness (ILI) in the real-world population?

  7. Google Flu Trends: Supervised learning
     [Figure: search query frequency time series, $X \in \mathbb{R}^{M \times N}$, and flu rates from a health agency representing doctor consultations, $y \in \mathbb{R}^{M}$]
     $\operatorname{logit}(y) = \beta_0 + \beta_1 \times \operatorname{logit}(q) + \varepsilon$ (Ginsberg et al., 2009)

  8. Google Flu Trends: Supervised learning
     [Figure: search query frequency time series, $X \in \mathbb{R}^{M \times N}$, and flu rates from a health agency representing doctor consultations, $y \in \mathbb{R}^{M}$]
     $\operatorname{logit}(y) = \beta_0 + \beta_1 \times \operatorname{logit}(q) + \varepsilon$ (Ginsberg et al., 2009)
     q is the aggregate frequency of a selected subset of the N candidate search queries
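The single-variable model on slides 7 and 8 can be sketched as follows. This is a minimal illustration, assuming the coefficients are fitted by ordinary least squares in logit space; the function names (`fit_logit_linear`, `predict_ili`) and the toy inputs are hypothetical and not taken from the talk.

```python
import math

def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_linear(q, y):
    """Least-squares fit of logit(y) = b0 + b1 * logit(q).

    q: aggregate query frequencies (one per week), each in (0, 1)
    y: ILI rates for the same weeks, each in (0, 1)
    """
    xs = [logit(v) for v in q]
    ys = [logit(v) for v in y]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def predict_ili(q_new, b0, b1):
    """Map a new aggregate query frequency to an estimated ILI rate."""
    return inv_logit(b0 + b1 * logit(q_new))
```

Working in logit space keeps every estimate strictly between 0 and 1, which suits a rate.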

  9. Google Flu Trends: Failure
     [Figure: % ILI from 07/01/09 to 07/01/13; series: Lagged CDC, Google Flu, Google Flu + CDC, CDC; annotation: Google estimates more than double CDC estimates (Lazer et al., 2014)]
     The estimates of the online Google Flu Trends tool were approx. two times larger than the ones from the CDC in 2012/13

  10. Google Flu Trends: Hypotheses for failure
      - “Big Data” criticism
      - The statistical learning model was not good enough
      - Feature selection was not good enough, bringing in spurious search queries
      - Media hype about flu significantly affects inference accuracy
      - The ground truth is not perfect; it is rather a “silver” standard

  11. Google Flu Trends: Hypotheses for failure
      X “Big Data” criticism
      ✓ The statistical learning model was not good enough
      ✓ Feature selection was not good enough, bringing in spurious search queries
      ? Media hype about flu significantly affects inference accuracy
      ✓ ? The ground truth is not perfect; it is rather a “silver” standard

  12. Advances in nowcasting influenza-like illness rates using online search logs
      Lampos, Miller, Crossan & Stefansen (Nature Scientific Reports, 2015)

  13. Data
      Google search logs
      - weekly search counts of 49,708 search queries
      - corresponding total volume of weekly searches
      - user search sessions geolocated in the US
      - anonymised & aggregate data
      - Jan. 2004 to Dec. 2013 (521 weeks, ~ decade)
      ILI rates from CDC

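  14. Elastic Net for linear regularised regression
      query frequency: $x_i \in \mathbb{R}^m$, $i \in \{1, \dots, n\}$ (matrix $X$)
      ILI rates: $y_i \in \mathbb{R}$, $i \in \{1, \dots, n\}$ (vector $y$)
      weights, bias: $w_j, \beta \in \mathbb{R}$, $j \in \{1, \dots, m\}$, with $w^{*} = [w; \beta]$
      $$\operatorname*{argmin}_{w, \beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta - \sum_{j=1}^{m} x_{ij} w_j \right)^2 + \lambda_1 \sum_{j=1}^{m} |w_j| + \lambda_2 \sum_{j=1}^{m} w_j^2 \right\}$$
      The $\lambda_1$ term is the L1-norm penalty and the $\lambda_2$ term is the L2-norm penalty; a sparse set of weights (w) is encouraged (Zou & Hastie, 2005)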
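The elastic net objective on slide 14 can be sketched as follows, together with one possible way of minimising it. The slide names no solver, so cyclic coordinate descent with soft-thresholding is an assumption here, chosen because it makes the sparsity effect of the L1 term visible. All names are hypothetical, and the sketch assumes no feature column is all zeros when `lam2` is 0.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def elastic_net_objective(X, y, w, beta, lam1, lam2):
    """Squared error + lam1 * L1 penalty + lam2 * squared L2 penalty."""
    sq_err = sum(
        (yi - beta - sum(xij * wj for xij, wj in zip(xi, w))) ** 2
        for xi, yi in zip(X, y)
    )
    l1 = sum(abs(wj) for wj in w)
    l2 = sum(wj * wj for wj in w)
    return sq_err + lam1 * l1 + lam2 * l2

def fit_elastic_net(X, y, lam1, lam2, n_iter=200):
    """Cyclic coordinate descent on the objective above (a sketch, not tuned)."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    beta = sum(y) / n
    for _ in range(n_iter):
        for j in range(m):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for xi, yi in zip(X, y):
                pred_without_j = beta + sum(xi[k] * w[k] for k in range(m) if k != j)
                rho += xi[j] * (yi - pred_without_j)
                zj += xi[j] ** 2
            w[j] = soft_threshold(rho, lam1 / 2.0) / (zj + lam2)
        # The bias is unpenalised, so its optimum is the mean residual.
        beta = sum(
            yi - sum(xij * wj for xij, wj in zip(xi, w)) for xi, yi in zip(X, y)
        ) / n
    return w, beta
```

With a large enough `lam1`, weights of weakly informative features are set to exactly zero, which is the sparsity the slide refers to.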

  15. Nonlinearities in the data (1)
      [Figure: ILI rate against query frequency for the queries “flu symptoms in children” and “flu symptoms in adults”; panel label: logit space]

  16. Nonlinearities in the data (2)
      [Figure: ILI rate against query frequency for the queries “flu remedies” and “tamiflu dosage”; panel label: logit space]

  17. Gaussian Processes for nonlinear modelling
      Say $x \in \mathbb{R}^d$ are the inputs and we want to learn $f: \mathbb{R}^d \to \mathbb{R}$.
      $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, with the mean function $m$ drawn on inputs and the covariance function (kernel) $k$ drawn on pairs of inputs.
      Formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution.
      Why do we use Gaussian Processes?
      + Kernelised, models nonlinearities
      + Interpretability (Automatic Relevance Determination)
      + Performance
      (Rasmussen & Williams, 2006)
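To make the definition on slide 17 concrete, here is a minimal sketch of the GP posterior mean for one-dimensional inputs. It assumes a zero prior mean function, a squared-exponential kernel, and fixed hyperparameters. None of these choices come from the slide, and the helper names are hypothetical.

```python
import math

def k_se(x, x2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel on scalar inputs."""
    return sigma_f ** 2 * math.exp(-((x - x2) ** 2) / (2.0 * ell ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def gp_posterior_mean(x_train, y_train, x_star, kernel=k_se, noise_var=1e-6):
    """Posterior mean k_*^T (K + noise_var * I)^{-1} y under a zero-mean prior."""
    n = len(x_train)
    K = [
        [kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(K, y_train)
    k_star = [kernel(x_star, xi) for xi in x_train]
    return sum(ks * a for ks, a in zip(k_star, alpha))
```

With very small noise the posterior mean nearly interpolates the training targets, and far from the data it falls back to the prior mean of zero.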

  18. Common covariance functions (kernels)
      Squared-exp (SE): $k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$; type of structure: local variation
      Periodic (Per): $k(x, x') = \sigma_f^2 \exp\left(-\frac{2}{\ell^2} \sin^2\left(\pi \frac{x - x'}{p}\right)\right)$; type of structure: repeating structure
      Linear (Lin): $k(x, x') = \sigma_f^2 (x - c)(x' - c)$; type of structure: linear functions
      [Figure: for each kernel, a plot of $k(x, x')$ and functions $f(x)$ sampled from the GP prior]
      (Duvenaud, 2014)
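The three kernels on slide 18 translate directly into code for scalar inputs. The sketch below uses the slide's parameterisation; the default hyperparameter values are arbitrary assumptions for illustration.

```python
import math

def k_se(x, x2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel: local variation."""
    return sigma_f ** 2 * math.exp(-((x - x2) ** 2) / (2.0 * ell ** 2))

def k_per(x, x2, sigma_f=1.0, ell=1.0, p=1.0):
    """Periodic kernel with period p: repeating structure."""
    s = math.sin(math.pi * (x - x2) / p)
    return sigma_f ** 2 * math.exp(-2.0 * s ** 2 / ell ** 2)

def k_lin(x, x2, sigma_f=1.0, c=0.0):
    """Linear kernel with offset c: linear functions."""
    return sigma_f ** 2 * (x - c) * (x2 - c)
```

Note that `k_se` and `k_per` depend only on the difference `x - x2`, whereas `k_lin` depends on where the two inputs sit relative to `c`.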

  19. Combining kernels in a GP
      It is possible to add or multiply kernels (among other operations)
      Lin × Lin: quadratic functions
      SE × Per: locally periodic
      Lin × SE: increasing variation
      Lin × Per: growing amplitude
      [Figure: plot of each product kernel]
      (Duvenaud, 2014)
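Because sums and products of valid kernels are again valid kernels, the combinations on slide 19 can be built with two small combinators. This is a sketch with hypothetical helper names (`add`, `product`) and arbitrary default hyperparameters.

```python
import math

def k_se(x, x2, sigma_f=1.0, ell=1.0):
    return sigma_f ** 2 * math.exp(-((x - x2) ** 2) / (2.0 * ell ** 2))

def k_per(x, x2, sigma_f=1.0, ell=1.0, p=1.0):
    s = math.sin(math.pi * (x - x2) / p)
    return sigma_f ** 2 * math.exp(-2.0 * s ** 2 / ell ** 2)

def k_lin(x, x2, sigma_f=1.0, c=0.0):
    return sigma_f ** 2 * (x - c) * (x2 - c)

def add(k1, k2):
    """Kernel whose value is the sum of two kernels."""
    return lambda x, x2: k1(x, x2) + k2(x, x2)

def product(k1, k2):
    """Kernel whose value is the product of two kernels."""
    return lambda x, x2: k1(x, x2) * k2(x, x2)

lin_x_lin = product(k_lin, k_lin)  # quadratic functions
se_x_per = product(k_se, k_per)    # locally periodic
lin_x_se = product(k_lin, k_se)    # increasing variation
lin_x_per = product(k_lin, k_per)  # growing amplitude
```

For example, with `x2` fixed at 1 and the default offset, `lin_x_lin(x, 1.0)` equals `x ** 2`, which is why Lin × Lin yields quadratic functions.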

  20. Exploring nonlinearities with Gaussian Processes
      GP kernel on query clusters:
      $$k(x, x') = \left( \sum_{i=1}^{C} k_{\mathrm{SE}}(c_i, c_i') \right) + \sigma_n^2 \cdot \delta(x, x')$$
      + protects inferences from radical changes in the frequency of isolated queries
      + models the contribution of various themes (clusters) to the final prediction (by-product: interpretability)
      + learns a sum of lower-dimensional functions: smaller input space, easier learning task, fewer samples required, more statistical traction obtained
      - [trade-off] assumption that relationships between queries in separate clusters provide no information about ILI
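The cluster kernel on slide 20 can be sketched as below. The sketch assumes that $c_i$ denotes the sub-vector of query frequencies belonging to cluster i, that each cluster has its own SE hyperparameters, and that the delta term fires when the two input vectors are identical. The slide does not spell out these details, and all names are hypothetical.

```python
import math

def k_se_vec(a, b, sigma_f, ell):
    """Squared-exponential kernel on two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sigma_f ** 2 * math.exp(-sq_dist / (2.0 * ell ** 2))

def cluster_kernel(x, x2, clusters, params, sigma_n):
    """Sum of per-cluster SE kernels plus a noise term.

    clusters: list of index lists, one per query cluster (assumed disjoint)
    params:   list of (sigma_f, ell) pairs, one per cluster
    sigma_n:  noise standard deviation
    """
    total = 0.0
    for idx, (sigma_f, ell) in zip(clusters, params):
        c = [x[j] for j in idx]
        c2 = [x2[j] for j in idx]
        total += k_se_vec(c, c2, sigma_f, ell)
    if x == x2:  # simplification of delta(x, x'): identical inputs only
        total += sigma_n ** 2
    return total
```

Each cluster contributes its own additive term, so a sudden change in one query only affects the term of the cluster that contains it.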

  21. Inference performance
      [Figure: bar chart of MAPE (%) for the Google Flu Trends old model, Elastic Net, and Gaussian Process (10 clusters), on test data and on test data at peaking moments; bar labels: 24.8%, 20.4%, 15.8%, 11.9%, 11%, 10.8%]
      Mean absolute percentage (%) of error (MAPE) in flu rate estimates (2008-2013)
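The error metric on slide 21 is computed as below. The sketch uses the standard MAPE definition and assumes every true rate is nonzero; the numbers in the usage note are made up for illustration and are not results from the talk.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes every value in y_true is nonzero.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)
```

For instance, `mape([2.0, 4.0], [1.0, 5.0])` averages a 50% and a 25% relative error and returns 37.5.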

  22. Comparative inference plots

  23. Comparative inference plots What happened here?

  24. From 4 Dec. 2011 to 28 Apr. 2012…
      [Figure: top-5 most influential search queries for flu rate inferences, on a 0% to 25% axis]
      GFT original model: rsv, flu symptoms, benzonatate, symptoms of pneumonia, upper respiratory infection
      Elastic Net: ear thermometer, musinex, how to break a fever, flu like symptoms, fever reducer

  25. I am skipping…
      (1) How, and, hence, why the GP-clustering works
      (2) The obvious auto-regressive extensions
      (3) How we incorporated statistical NLP to further improve models (submitted paper)

  26. Inferring user-level information from user-generated content
      - occupational class
      - income
      - socio-economic status (SES)
      Preotiuc-Pietro, Lampos & Aletras (ACL 2015)
      Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras (PLOS ONE, 2015)
      Lampos, Aletras, Geyti, Zou & Cox (ECIR 2016)

  27. About Twitter

  28. About Twitter
      - 140 characters per published status (tweet)
      - users can follow and be followed
      - embedded usage of topics (using #hashtags)
      - user interaction (re-tweets, @mentions, likes)
      - real-time nature
      - biased demographics (13-15% of UK’s population, age bias etc.)
      - information is noisy and not always accurate

  29. Linguistic expression and demographics
      “Socioeconomic variables are influencing language use.” (Bernstein, 1960; Labov, 1972/2006)
      + Validate this hypothesis on a broader, larger data set using social media
      + Applications
        - research, as in computational social science, health, and psychology
        - commercial
