SLIDE 1

Mining online data for public health surveillance

Vasileios Lampos (a.k.a. Bill)

Computer Science University College London

@lampos

SLIDE 2

Structure

➡ Using online data for health applications
➡ From web searches to syndromic surveillance
   i. Google Flu Trends: original failure and correction
   ii. Better feature selection using semantic concepts
   iii. Snapshot: multi-task learning for disease models
SLIDE 3

Online data

[Figure: logos of online data sources, including Wikipedia]
SLIDE 4

Online data for health (1/3)

When & why? — coverage — speed — cost
How? — collaborate with experts — access to user activity data — machine learning — natural language processing
Evaluation? — (partial) ground truth — model interpretation — real-time
SLIDE 8

Online data for health (2/3)

Google Flu Trends (discontinued)

SLIDE 9

Online data for health (2/3)

Flu Detector, fludetector.cs.ucl.ac.uk

SLIDE 10

Online data for health (2/3)

HealthMap, healthmap.org

SLIDE 11

Online data for health (3/3)

Health intervention — impact?

SLIDE 13

Online data for health (3/3)

Vaccinations against flu — impact?

Lampos, Yom-Tov, Pebody, Cox (2015) doi:10.1007/s10618-015-0427-9
Wagner, Lampos, Yom-Tov, Pebody, Cox (2017) doi:10.2196/jmir.8184

SLIDE 14

Google Flu Trends (GFT)

[Figure: weekly US ILI percentage, 2004–2008]

Goal: find a function f that maps online search activity to these ILI rates.
SLIDE 16

GFT — Supervised learning

Regression
— Observations (X): frequencies of n search queries for a location L and m contiguous time intervals of length τ
— Targets (y): rates of influenza-like illness (ILI) for L over the same m time intervals, obtained from a health agency
— Learn a function f such that f: X ∈ ℝ^{n×m} ⟶ y ∈ ℝ^m

frequency of q_i = count of q_i / total count of all queries
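As a concrete illustration of these shapes, here is a minimal numpy sketch; all counts, sizes and rates are synthetic stand-ins, not the actual GFT data (rows are weeks, columns are queries):

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, m_queries = 128, 45           # weekly intervals x candidate search queries

raw_counts = rng.poisson(50, (n_weeks, m_queries))    # count of q_i per week
total_counts = rng.poisson(1_000_000, (n_weeks, 1))   # total count of all queries

X = raw_counts / total_counts          # frequency of q_i = count of q_i / total count
y = rng.uniform(0.01, 0.08, n_weeks)   # stand-in ILI rates from a health agency
```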
SLIDE 18

GFT v.1 — Model

logit(P) = β0 + β1 × logit(Q) + ε

— P: percentage (probability) of doctor visits
— Q: aggregate frequency of a set of search queries
— β0: regression bias term
— β1: regression weight (one weight only)
— ε: independent, zero-centred noise
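A hedged sketch of fitting this univariate model on synthetic stand-in series (the coefficients below are illustrative, not the published ones):

```python
import numpy as np

def logit(a):
    return np.log(a / (1.0 - a))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
Q = np.clip(rng.beta(2, 50, 170), 1e-6, 1 - 1e-6)   # aggregate query frequency, in (0,1)
P = inv_logit(-2.0 + 1.1 * logit(Q) + rng.normal(0, 0.1, 170))  # ILI visit percentage

b1, b0 = np.polyfit(logit(Q[:128]), logit(P[:128]), 1)  # least squares fit on the logits
P_hat = inv_logit(b0 + b1 * logit(Q[128:]))             # nowcasts for the held-out weeks
```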
SLIDE 22

GFT v.1 — “Logit”, why?

the logit function: logit(α) = log(α / (1 − α)), α ∈ (0, 1)

[Figure: (x, y) pairs with y ∈ (0, 1), plotted raw and after a z-scored logit transform]

values close to 0.5 are “squashed”; border values (close to 0 or 1) are “emphasised”
SLIDE 24

GFT v.1 — Data

— 9 US regions considered
— 50 million search queries (the most frequent), geolocated in these 9 US regions
— Weekly ILI rates from CDC: 170 weeks, 28/9/2003 to 11/5/2008, with ILI rate > 0
— First 128 weeks: training, 9 × 128 = 1,152 samples
— Last 42 weeks: testing (per region)
SLIDE 25

GFT v.1 — Feature selection (1/2)

1. Single-query flu models are trained for each US region:
   50 million queries × 9 US regions = 450 million models
2. Inference accuracy is estimated for each query, using linear (Pearson) correlation as the metric
3. Starting from the best-performing query and adding one query at a time, a new model is trained and evaluated (see the sketch after this list)

[Figure: mean correlation vs. number of queries; performance peaks at 45 queries]
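The sketch below imitates this greedy procedure at toy scale; the way queries are combined (summing frequencies) is a simplifying assumption, not necessarily the exact GFT aggregation:

```python
import numpy as np

def forward_select(X, y, max_queries=100):
    """X: weeks x queries frequency matrix; y: ILI rates over the same weeks."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranked = np.argsort(corrs)[::-1]          # best single queries first
    scores = []
    for k in range(1, max_queries + 1):
        agg = X[:, ranked[:k]].sum(axis=1)    # aggregate the top-k query frequencies
        scores.append(np.corrcoef(agg, y)[0, 1])
    best_k = int(np.argmax(scores)) + 1       # the original study settled on 45 queries
    return ranked[:best_k], scores
```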
SLIDE 27

GFT v.1 — Feature selection (2/2)

Search query topic                       Top 45 queries (n)   Weighted
Influenza complication                   11                   18.15
Cold/flu remedy                           8                    5.05
General influenza symptoms                5                    2.60
Term for influenza                        4                    3.74
Specific influenza symptom                4                    2.54
Symptoms of an influenza complication     4                    2.21
Antibiotic medication                     3                    6.23
General influenza remedies                2                    0.18
Symptoms of a related disease             2                    1.66
Antiviral medication                      1                    0.39
Related disease                           1                    6.66
Unrelated to influenza                    0                    0.00
Total                                    45                   49.40
SLIDE 28

GFT v.1 — Performance (1/2)

Evaluated on 42 weeks (per region) from 2007–2008
Evaluation metric: Pearson correlation; μ(r) = .97, with min(r) = .92 and max(r) = .99
Performance looked great at the time, but this is not a proper performance evaluation!

Why? Potentially misleading metric (not the loss function here) and a rather small testing time span (< 1 flu season)
SLIDE 29

GFT v.1 — Performance (2/2)

[Figure: GFT v.1 estimates vs. CDC ILI percentage, Mid-Atlantic US region, 2004–2008; Pearson correlation r = .96]
SLIDE 30

GFT v.2 — Data & evaluation

— weekly frequency of 49,708 search queries (US), filtered by a relaxed health-topic classifier; intersection of frequent queries across all US regions
— from 4/1/2004 to 28/12/2013 (521 weeks)
— corresponding weekly US ILI rates from CDC
— test on 5 flu seasons: 5 year-long test sets (2008–13)
— train on increasing data sets starting from 2004, using all data prior to a test period
SLIDE 31

GFT v.1 was simple to a (significant) fault

[Figure: GFT estimates vs. CDC ILI rates (US), 2009–2013]

Queries with the largest weights in the model:
“rsv” — 25%, “flu symptoms” — 18%, “benzonatate” — 6%, “symptoms of pneumonia” — 6%, “upper respiratory infection” — 4%
SLIDE 33

GFT v.2 — Linear multivariate regression

Least squares:

argmin_{w,β} Σ_{i=1..n} (x_i w + β − y_i)²

— y ∈ ℝⁿ: ILI rates from CDC for n weeks; y_i ∈ ℝ for week i
— X ∈ ℝ^{n×m}: frequencies of m search queries for n weeks; x_i ∈ ℝᵐ for week i, i ∈ {1, …, n}
— w ∈ ℝᵐ: weights for the m search queries
— β ∈ ℝ: intercept term

⚠ Least squares regression is not applicable here, because we have very few training samples (n) but many features (search queries; m). Models derived from least squares will tend to overfit the data, resulting in bad solutions.
SLIDE 38

GFT v.2 — Regularisation with elastic net

argmin_{w,β} ( Σ_{i=1..n} (x_i w + β − y_i)² + λ1 Σ_{j=1..m} |w_j| + λ2 Σ_{j=1..m} w_j² )

— the first term is the least squares objective
— λ1, λ2 ∈ ℝ⁺: L1- & L2-norm regularisers for the weights
— Encourages sparse models (feature selection): many weights will be set to zero!
— Handles collinear features (search queries)
— The number of selected features is not limited to the number of samples (n)
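A minimal scikit-learn sketch of this regulariser on synthetic data; ElasticNetCV's (alpha, l1_ratio) parametrisation maps onto λ1 and λ2, and every name and number below is a stand-in:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n_weeks, m_queries = 260, 2000
X = rng.lognormal(mean=-12, size=(n_weeks, m_queries))   # stand-in query frequencies
w_true = np.zeros(m_queries)
w_true[:40] = rng.uniform(10, 50, 40)                    # only 40 queries truly matter
y = X @ w_true + rng.normal(0, 1e-4, n_weeks)            # stand-in ILI rates

model = ElasticNetCV(l1_ratio=[0.5, 0.9, 0.99], cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)                   # sparse: most weights are zero
print(f"{len(selected)} of {m_queries} queries selected")
```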
SLIDE 43

GFT v.2 — Feature selection

1st layer: keep search queries whose frequency time series has a Pearson correlation ≥ 0.5 with the CDC ILI rates (in the training data)
2nd layer: the elastic net assigns weights equal to 0 to features (search queries) that are identified as statistically irrelevant to our task

Number of queries selected across all training data sets, μ (σ):

# queries   r ≥ 0.5     GFT       Elastic net
49,708      937 (334)   46 (39)   278 (64)
SLIDE 44

GFT v.2 — Evaluation (1/2)

Target variable: y = y_1, …, y_N. Estimates: ŷ = ŷ_1, …, ŷ_N.

Mean Absolute Error: MAE(ŷ, y) = (1/N) Σ_{t=1..N} |ŷ_t − y_t|

Mean Absolute Percentage of Error: MAPE(ŷ, y) = (1/N) Σ_{t=1..N} |(ŷ_t − y_t) / y_t|

Mean Squared Error: MSE(ŷ, y) = (1/N) Σ_{t=1..N} (ŷ_t − y_t)²
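The same three metrics as plain numpy functions:

```python
import numpy as np

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def mape(y_hat, y):
    # assumes y > 0, true for ILI rates; often reported multiplied by 100 (%)
    return np.mean(np.abs((y_hat - y) / y))

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)
```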
SLIDE 45

GFT v.2 — Evaluation (2/2)

[Figure: elastic net estimates vs. GFT and CDC ILI rates (US), 2009–2013]

GFT: r = .89, MAE = 3.81·10⁻³, MAPE = 20.4%
Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%
SLIDE 47

GFT v.2 — Nonlinearities in the data

[Scatter plots: US ILI rates (CDC) vs. the frequencies of the queries ‘flu’, ‘flu medicine’, ‘how long is flu contagious’, ‘how to break a fever’ and ‘sore throat treatment’; in each case a linear fit struggles to capture the raw data]
SLIDE 52

GFT v.2 — Gaussian Processes (1/4)

— A Gaussian Process (GP) learns a distribution over functions that can explain the data
— Fully specified by a mean (m) and a covariance (kernel) function (k); we set m(x) = 0 in our experiments
— A collection of random variables, any finite number of which have a multivariate Gaussian distribution

f(x) ~ GP(m(x), k(x, x′)),  x, x′ ∈ ℝᵐ,  f: ℝᵐ → ℝ

N(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))

Inference: f* ~ N(0, K), with (K)_{ij} = k(x_i, x_j)
SLIDE 55

GFT v.2 — Gaussian Processes (2/4)

Common GP kernels (covariance functions):

Squared-exp (SE): k(x, x′) = σ_f² exp(−(x − x′)² / (2ℓ²)) — local variation
Periodic (Per): k(x, x′) = σ_f² exp(−(2/ℓ²) sin²(π (x − x′) / p)) — repeating structure
Linear (Lin): k(x, x′) = σ_f² (x − c)(x′ − c) — linear functions

[Figure: each kernel plotted, with functions f(x) sampled from the corresponding GP prior]
SLIDE 56

GFT v.2 — Gaussian Processes (3/4)

Adding or multiplying GP kernels produces a new valid GP kernel:

Lin × Lin → quadratic functions
SE × Per → locally periodic structure
Lin × SE → increasing variation
Lin × Per → growing amplitude

[Figure: sample functions drawn from GP priors with each composite kernel]
SLIDE 57

GFT v.2 — Gaussian Processes (4/4)

[Figures: (x, y) pairs with an obvious nonlinear relationship; least squares regression gives a poor fit, while a GP with a sum of two kernels (periodic + squared exponential) captures the structure]
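A hedged scikit-learn sketch of the toy example above, using a sum of a periodic (ExpSineSquared) and a squared-exponential (RBF) kernel plus a noise term; the data are synthetic stand-ins:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(3)
x = np.linspace(0, 60, 120)[:, None]                    # predictor variable
y = 10 + 0.15 * x.ravel() + 3 * np.sin(x.ravel()) + rng.normal(0, 0.5, 120)

kernel = ExpSineSquared() + RBF(length_scale=10.0) + WhiteKernel()  # Per + SE + noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)
y_mean, y_std = gp.predict(x, return_std=True)          # posterior mean and uncertainty
```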
SLIDE 60

GFT v.2 — k-means and GP regression

— Cluster the queries selected by the elastic net into C clusters with k-means
— Clusters are determined using cosine similarity as the distance metric (on query frequency time series)
— This groups queries with similar topicality & usage patterns

k(x, x′) = Σ_{i=1..C} k_SE(x_{c_i}, x′_{c_i}) + σ² · δ(x, x′),  x = {x_{c_1}, …, x_{c_10}}  (here C = 10 clusters)

k_SE(x, x′) = σ² exp(−‖x − x′‖² / (2ℓ²))

(one SE kernel per query cluster, plus a noise term)
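A sketch of the clustering step under one simplifying assumption: scikit-learn's k-means is Euclidean, so the rows are L2-normalised first, which makes the clustering behave like one based on cosine similarity; the frequency matrix is a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(4)
Q = np.abs(rng.normal(size=(278, 260)))          # queries x weekly frequencies (stand-in)

# Euclidean k-means on unit-norm rows approximates clustering by cosine similarity
labels = KMeans(n_clusters=10, n_init=10).fit_predict(normalize(Q))
clusters = [np.flatnonzero(labels == c) for c in range(10)]  # query indices per cluster
```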
SLIDE 61

GFT v.2 — Performance

[Figure: GP estimates vs. elastic net and CDC ILI rates (US), 2009–2013]

Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%
GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
SLIDE 63

GFT v.2 — Queries’ added value

y_t = Σ_{i=1..p} φ_i y_{t−i} + Σ_{i=1..J} ω_i y_{t−52−i}   (AR and seasonal AR)
    + Σ_{i=1..q} θ_i ε_{t−i} + Σ_{i=1..K} ν_i ε_{t−52−i}   (MA and seasonal MA)
    + Σ_{i=1..D} w_i h_{t,i}                               (regression)
    + ε_t

Autoregression: combine CDC ILI rates from the previous week(s) with the ILI rate estimate from search queries for the current week
Various week lags explored (1, 2, …, 6 weeks)
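One way to realise such a seasonal AR/MA model with an exogenous search-based regressor is statsmodels' SARIMAX; this is a sketch under that assumption, with synthetic series and example lag orders, not the paper's exact specification:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 260
ili = 0.03 + 0.02 * np.sin(2 * np.pi * np.arange(n) / 52) + rng.normal(0, 0.003, n)
search_est = ili + rng.normal(0, 0.004, n)   # stand-in for the search-based estimate h_t

model = sm.tsa.SARIMAX(ili, exog=search_est,
                       order=(1, 0, 1),                # AR(p = 1) and MA(q = 1) terms
                       seasonal_order=(1, 0, 1, 52))   # seasonal AR/MA at a 52-week lag
fit = model.fit(disp=False)
nowcasts = fit.fittedvalues                  # one-step-ahead in-sample estimates
```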
SLIDE 64

GFT v.2 — Performance

1-week lag for the CDC data:
AR(CDC): r = .97, MAE = 1.87·10⁻³, MAPE = 8.2%
AR(CDC, GP): r = .99, MAE = 1.05·10⁻³, MAPE = 5.7%

2-week lag for the CDC data:
AR(CDC): r = .87, MAE = 3.36·10⁻³, MAPE = 14.3%
AR(CDC, GP): r = .99, MAE = 1.35·10⁻³, MAPE = 7.3%

For reference, GP alone: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
SLIDE 66

GFT v.2 — Non-optimal feature selection

— Queries irrelevant to flu are still retained, e.g. “nba injury report” or “muscle building supplements”
— Feature selection is primarily based on correlation, and then on a linear relationship
— Introduce a semantic feature selection to
   — enhance causal connections (implicitly)
   — circumvent the painful training of a classifier
SLIDE 67

GFT v.3 — Word embeddings

— Word embeddings are vectors of a certain dimensionality (usually from 50 to 1024) that represent words in a corpus
— Derive these vectors by predicting contextual word co-occurrence in large corpora (word2vec) using a shallow neural network approach:
   — Continuous Bag-Of-Words (CBOW): predict the centre word from the surrounding ones
   — skip-gram: predict the surrounding words from the centre one
— Other methods available: GloVe, fastText
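A minimal gensim sketch of training such embeddings; the two-tweet corpus is obviously a stand-in, and the parameters (CBOW, 512 dimensions) simply mirror the setup on the next slide:

```python
from gensim.models import Word2Vec

# tokenised tweets; the real corpora contain hundreds of millions of tweets
tweets = [["flu", "medicine", "for", "a", "sore", "throat"],
          ["feeling", "feverish", "and", "staying", "in", "bed"]]

model = Word2Vec(sentences=tweets, vector_size=512, sg=0,   # sg=0 selects CBOW
                 window=5, min_count=1, workers=4)
vec = model.wv["flu"]                                       # a 512-dimensional embedding
```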
SLIDE 68

GFT v.3 — Word embedding data sets

Use tweets geolocated in the UK to learn word embeddings that may capture
— informal language used in searches
— British English language / expressions
— cultural biases

(a) 215 million tweets (February 2014 to March 2016), CBOW, 512 dimensions, 137,421 words covered
https://doi.org/10.6084/m9.figshare.4052331.v1

(b) 1.1 billion tweets (2012 to 2016), skip-gram, 512 dimensions, 470,194 words covered
https://doi.org/10.6084/m9.figshare.5791650.v1
SLIDE 69

GFT v.3 — Cosine similarity

cos(v, u) = (v · u) / (‖v‖ ‖u‖) = Σ_{i=1..n} v_i u_i / (√(Σ_{i=1..n} v_i²) √(Σ_{i=1..n} u_i²)),  for word embeddings v, u

Additive analogy (positive context: ‘king’, ‘woman’; negative context: ‘man’):
max_v (cos(v, ‘king’) + cos(v, ‘woman’) − cos(v, ‘man’)) ⟹ v = ‘queen’

Multiplicative analogy:
max_v (cos′(v, ‘king’) × cos′(v, ‘woman’) / cos′(v, ‘man’)) ⟹ v = ‘queen’,
where cos′(·, ·) = (cos(·, ·) + 1) / 2
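Both the cosine formula and the additive analogy fit in a few lines of numpy; `emb` is a hypothetical word-to-vector dictionary standing in for the Twitter embeddings:

```python
import numpy as np

def cos(v, u):
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))

def analogy(emb, positive, negative):
    """emb: dict word -> vector. Maximise summed cosine to positive minus negative."""
    def score(w):
        return (sum(cos(emb[w], emb[p]) for p in positive)
                - sum(cos(emb[w], emb[n]) for n in negative))
    exclude = set(positive) | set(negative)
    return max((w for w in emb if w not in exclude), key=score)

# analogy(emb, positive=["king", "woman"], negative=["man"])  ->  "queen"
```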
SLIDE 73

GFT v.3 — Analogies in Twitter embd.

The … for …, not the …, is …

woman      king            man        queen
him        she             he         her
better     bad             good       worse
England    Rome            London     Italy
Messi      basketball      football   Lebron
Guardian   Conservatives   Labour     Telegraph
Trump      Europe          USA        Farage
rsv        fever           skin       flu
SLIDE 77

GFT v.3 — Better query selection (1/3)

1. Query embedding = average token embedding
2. Derive a concept (C) by specifying a positive (P) and a negative (N) context (sets of n-grams)
3. Rank all queries (Q) using their similarity score with this concept:

S(Q, C) = Σ_{i=1..k} cos(e_Q, e_{P_i}) / (Σ_{j=1..z} cos(e_Q, e_{N_j}) + γ)

e_Q: query embedding; e_{N_j}: embedding of a negative concept n-gram; γ: constant to avoid division by 0
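A sketch of the ranking score S(Q, C); `emb` is again a hypothetical word-to-vector dictionary, and gamma defaults to an arbitrary small constant:

```python
import numpy as np

def cos(v, u):
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))

def embed(ngram, emb):
    """Query/n-gram embedding = average of its token embeddings."""
    return np.mean([emb[tok] for tok in ngram.split() if tok in emb], axis=0)

def concept_score(query, positive, negative, emb, gamma=0.001):
    e_q = embed(query, emb)
    num = sum(cos(e_q, embed(p, emb)) for p in positive)
    den = sum(cos(e_q, embed(n, emb)) for n in negative) + gamma
    return num / den   # rank all candidate queries by this score
```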
SLIDE 80

GFT v.3 — Better query selection (2/3)

Positive context                               Negative context                                      Most similar queries
#flu, fever, flu, flu medicine, gp, hospital   bieber, ebola, wikipedia                              cold flu medicine; flu aches; cold and flu; cold flu symptoms; colds and flu
flu                                            flu gp, flu hospital, flu medicine, ebola, wikipedia  flu aches; flu; colds and flu; cold and flu; cold flu medicine
SLIDE 81

GFT v.3 — Better query selection (3/3)

[Figure: histogram of concept similarity scores (S) across search queries]

Given that the distribution of concept similarity scores appears to be unimodal, we use a threshold of θ standard deviations from the mean (μ_S + θσ_S) to determine the selected queries
SLIDE 82

GFT v.3 — Hybrid feature selection

— Embedding-based feature selection is an unsupervised technique, thus not optimal
— If we combine it with the previous ways of selecting features, will we obtain better inference accuracy?
— We test 7 feature selection approaches:
   similarity → elastic net (1)
   correlation → elastic net (2) → GP (3)
   similarity → correlation → elastic net (4) → GP (5)
   similarity → correlation → GP (6)
   correlation → GP (7)
SLIDE 83

GFT v.3 — GP model details

Matérn kernel: k_M^{(ν)}(x, x′) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ),
where r = ‖x − x′‖ and K_ν is a modified Bessel function

k_SE(x, x′) = σ² exp(−r² / (2ℓ²))

Full kernel: k(x, x′) = Σ_{i=1..2} k_M^{(ν=3/2)}(x, x′; σ_i, ℓ_i) + k_SE(x, x′; σ_3, ℓ_3) + σ_4² δ(x, x′)

Skipped in the interest of time! If you’re interested, check Section 3.1 of https://doi.org/10.1145/3038912.3052622
SLIDE 84

GFT v.3 — Data & evaluation

— weekly frequency of 35,572 search queries (UK) from 1/1/2007 to 9/8/2015 (449 weeks)
— access to a private Google Health Trends API for health-oriented research
— corresponding ILI rates for England (Royal College of General Practitioners and Public Health England)
— test on the last 3 flu seasons in the data (2012–2015)
— train on increasing data sets starting from 2007, using all data prior to a test period
SLIDE 85

GFT v.3 — Performance (1/3)

(a) similarity → elastic net
(b) correlation → elastic net
(c) similarity → correlation → elastic net

            (a)      (b)      (c)
MAPE        36.23%   47.15%   61.05%
MAE × 0.1   0.19     0.21     0.30
r           0.91     0.88     0.87
SLIDE 87

GFT v.3 — Performance (2/3)

Elastic net with and without word-embeddings filtering

[Figure: ILI rate per 100,000 people, 2013–2015; RCGP/PHE ILI rates vs. elastic net with correlation-based feature selection and with hybrid feature selection]

Queries retained by correlation-based selection alone, with their weight as a ratio over the highest weight:
prof. surname (70.3%), name surname (27.2%), heal the world (21.9%), heating oil (21.2%), name surname recipes (21%), tlc diet (13.3%), blood game (12.3%), swine flu vaccine side effects (7.2%)
SLIDE 89

GFT v.3 — Performance (3/3)

(a) correlation → GP
(b) correlation → elastic net → GP
(c) similarity → correlation → elastic net → GP
(d) similarity → correlation → GP

            (a)      (b)      (c)      (d)
MAPE        25.81%   30.30%   35.88%   34.17%
MAE × 0.1   0.16     0.17     0.23     0.22
r           0.94     0.93     0.92     0.89
SLIDE 91

Multi-task learning

— m tasks (problems) t_1, …, t_m
— observations X_{t_1}, y_{t_1}, …, X_{t_m}, y_{t_m}
— learn models f_{t_i}: X_{t_i} → y_{t_i} jointly (and not independently)

Why? When tasks are related, multi-task learning is expected to perform better than learning each task independently. Model learning is possible even with few training samples.
SLIDE 92

Multi-task learning for disease modelling

— m tasks (problems) t_1, …, t_m
— observations X_{t_1}, y_{t_1}, …, X_{t_m}, y_{t_m}
— learn models f_{t_i}: X_{t_i} → y_{t_i} jointly (and not independently)

Can we improve disease models (flu) from online search:
— when sporadic training data are available?
— across the geographical regions of a country?
— across two different countries?
SLIDE 93

Multi-task learning GFT (1/5)

[Figure: map of the 10 US HHS regions]

Can multi-task learning across the 10 US regions help us improve the national ILI model?
SLIDE 94

Multi-task learning GFT (1/5)

Can multi-task learning across the 10 US regions help us improve the national ILI model?

5 years of training data:
      Elastic Net   MT Elastic Net   GP     MT GP
MAE   0.25          0.25             0.35   0.35
r     0.97          0.97             0.96   0.96
SLIDE 95

Multi-task learning GFT (1/5)

1 year of training data:
      Elastic Net   MT Elastic Net   GP     MT GP
MAE   0.44          0.51             0.46   0.53
r     0.88          0.85             0.87   0.85
SLIDE 96

Multi-task learning GFT (2/5)

Can multi-task learning across the 10 US regions help us improve the regional ILI models?

1 year of training data:
      Elastic Net   MT Elastic Net   GP     MT GP
MAE   0.47          0.54             0.49   0.53
r     0.87          0.84             0.86   0.85
SLIDE 97

Multi-task learning GFT (3/5)

Can multi-task learning across the 10 US regions help us improve regional models under sporadic health reporting?
— Split the US regions into two groups: one with the 2 most populous regions (4 and 9 on the map), the other with the remaining 8 regions
— Train and evaluate models for the 8 regions under the hypothesis that only sporadic health reports exist
— Downsample the data from the 8 regions using burst error sampling (random data blocks removed) with rate γ (γ = 1: no sampling; γ = 0.1: a 10% sample)

[Figure: MAE vs. sampling rate γ for EN, MTEN, GP and MTGP]
SLIDE 99

Multi-task learning GFT (4/5)

— Correlations between US regions induced by the covariance matrix of the MT GP model
— The multi-task learning model seems to be capturing existing geographical relations

[Figure: heatmap of the learned correlations between regions R1–R10 and the US national task]
SLIDE 100

Multi-task learning GFT (5/5)

Can multi-task learning across countries (US, England) help us improve the ILI model for England?

5 years of training data:
      Elastic Net   MT Elastic Net   GP     MT GP
MAE   0.47          0.60             0.49   0.70
r     0.90          0.89             0.90   0.89
SLIDE 101

Multi-task learning GFT (5/5)

1 year of training data:
      Elastic Net   MT Elastic Net   GP     MT GP
MAE   0.59          0.98             0.60   1.00
r     0.86          0.85             0.86   0.84
SLIDE 102

Conclusions

— Online (user-generated) data can help us improve our current understanding of public health matters
— The original Google Flu Trends was based on a good idea, but on a very limited modelling effort, resulting in major errors
— Subsequent models improved the statistical modelling as well as the semantic disambiguation between possible features, and delivered better / more robust performance
— Multi-task learning improves disease models further
— Future direction: models without strong supervision
SLIDE 103

Acknowledgements

Industrial partners
— Microsoft Research (Elad Yom-Tov)
— Google
Public health organisations
— Public Health England
— Royal College of General Practitioners
Funding: EPSRC (“i-sense”)
Collaborators: Andrew Miller, Bin Zou, Ingemar J. Cox
SLIDE 104

Thank you.

Vasileios Lampos (a.k.a. Bill)

Computer Science University College London @lampos

SLIDE 105

References

GFT v.1 — Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature 457, pp. 1012–1014 (2009).
GFT v.2 — Lampos, Miller, Crossan and Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5, 12760 (2015).
GFT v.3 — Lampos, Zou and Cox. Enhancing feature selection using word embeddings: The case of flu surveillance. WWW ’17, pp. 695–704 (2017).
MTL — Zou, Lampos and Cox. Multi-task learning improves disease models from Web search. WWW ’18, In Press (2018).