Machine Learning Prediction of Blood Alcohol Content: A Digital Signature of Behavior (PowerPoint presentation)


SLIDE 1

Machine Learning Prediction of Blood Alcohol Content:

A Digital Signature of Behavior

KIRSTIN ASCHBACHER, PH.D.

Associate Professor, Division of Cardiology, School of Medicine, University of California, San Francisco. Kirstin.Aschbacher@ucsf.edu

SLIDE 2

The Cost of Excessive Alcohol Use

  • A leading cause of preventable death
  • 1 in 10 deaths among adults, ages 20-64
  • US costs: $224 billion in 2006
  • High comorbidity with other mental health disorders (e.g., PTSD/MDD)

https://www.cdc.gov/features/alcohol-deaths/index.html

SLIDE 3

How a BACtrack Works

  • Consumed alcohol is absorbed into the bloodstream
  • Alcohol in the bloodstream moves across the membranes of the lung’s air sacs (alveoli)
  • The concentration of alcohol in the alveolar air is directly related to the concentration in the blood
  • As the alveolar air is exhaled, the alcohol in it can be detected by the breath alcohol testing device

https://www.bactrack.com/pages/bactrack-consumption-report

SLIDE 4

SLIDE 5

External Validation of Accuracy

Summary: Compared the accuracy of 3 smartphone-paired breathalyzers against a police-grade breathalyzer and against blood alcohol levels.

Conclusions: Two devices – including BACtrack – were deemed accurate relative to the police-grade device, with differences in BAC within +/- 0.01. BACtrack was as closely related to blood alcohol levels as the police-grade device.

http://injuryprevention.bmj.com/content/injuryprev/23/Suppl_1/A15.1.full.pdf

SLIDE 6

The Business Need → The Data Product

  • 1. Target Markets & Pain Points:

  • Some users/health providers would like tools to help make alcohol use safer

  • 2. Data Product:

  • If we could predict when a given user will have a BAC >= .08, we could target them with messaging, or offer a chat-bot/coach

  • 3. BAC Detection → Real-time Messaging
SLIDE 7

Machine Learning Prediction of Blood Alcohol Content:

K Aschbacher, R Avram, G Tison, K Rutledge, M Pletcher, J Olgin, G Marcus

  • Objective: To identify a digital signature of self-monitored BAC levels that predicts the times, locations, and circumstances under which a user is likely to exceed the legal BAC driving limit of 0.08%.
  • Methods: >1 million observations from 33,452 distinct users of the BACtrack device (accuracy comparable to police-grade devices).
  • Behavioral, timestamp, and geolocation data
  • Machine learning was conducted by fitting data to a Gradient Boosted Classification Tree (GBCT), using train/cv/test partitions

SLIDE 8

Are BACtrack data relevant to health at scale?

SLIDE 9

Is there an association between BAC levels and Motor Vehicle Death Rates?

SLIDE 10

Some of the states with the highest death rates have the fewest BACtrack users… more rural?

SLIDE 11

Higher BAC levels are associated with a Higher Death Rate, but more so in states with fewer users


SLIDE 12

Predicted MV death rate for any given value of BAC and n-users

SLIDE 13

Data Wrangling

SLIDE 14

Data Management

  • Clean & Organize
  • Machine Learning
  • ssh + conda + tmux + jupyter

SLIDE 15

Data Security

  • Data is collected anonymously from users of the BACtrack app, which syncs with BACtrack smartphone-enabled breathalyzers
  • Data is viewed in aggregate only and comes from users with data storage activated and location services turned on; it does not represent data from all users
  • We use AWS Redshift VPC security groups and S3 data encryption methods, and the data itself is deidentified
  • We analyze data on a secure cluster and interface with the data via ssh
SLIDE 16

Machine Learning:

Gradient Boosted Classification with XGBoost

SLIDE 17

Gradient Boosted Classification

  • 1. Weak learners (trees) are combined to make strong learners
  • 2. Generalization capability is high
  • 3. Overfitting is low – especially with cross-validation & tuning
  • 4. Handles missing data well
  • 5. Models non-linearities
  • 6. Can be productionized
  • 7. Our Label/Outcome: BAC < .08 versus BAC >= .08
SLIDE 18

An Unfortunate Example of a Decision Tree

https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

SLIDE 19

Ensemble Learning Methods combine weak learners to create strong learners

https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

  • Boosting methods compute a set of weights for each training example at each level of the tree
  • Higher weights are given to incorrectly classified examples
  • Hence the tree attempts to find features to explain those examples at the next round

SLIDE 20

Problem: We don’t have a lot of features!

  • 1. BAC level
  • 2. Timestamps
  • 3. User-entered data
  • BAC guess: a user’s subjective guess prior to measuring (“validated”)
  • Photos (sparse)
  • Notes (sparse)
  • 4. Geolocation (lat/lon) and zip codes
SLIDE 21

The Neurocircuitry of Reward & Addiction

  • Neuroadaptations drive behavior change over time, characterized by:
  • Reactivity to cues/triggers
  • Loss of pleasure / seeking stress relief
  • Deficits in self-regulatory systems
  • When, where, and for whom?
  • Circadian variation in self-monitoring
  • Geographic variation in boredom/stress
  • The longer you’re engaging, the more entrenched this pattern may be for you

Substance Abuse and Mental Health Services Administration (US); Office of the Surgeon General (US). Facing Addiction in America: The Surgeon General's Report on Alcohol, Drugs, and Health [Internet]. Washington (DC): US Department of Health and Human Services; 2016 Nov. Figure 2.3, The Three Stages of the Addiction Cycle and the Brain Regions Associated with Them. Available from: https://www.ncbi.nlm.nih.gov/books/NBK424849/figure/ch2.f3/ Volkow et al., N Engl J Med 2016.

SLIDE 22

What’s the Digital Signature of a Habit?

Definition:

  • “An acquired behavior pattern… regularly followed until it has become almost involuntary.”

Pattern:

  • Frequency → Time
  • Triggers → Time/Location
  • Engagement with self-monitoring (reflects reward value of tracking)

SLIDE 23

Patterns in Time

  • Temporal patterns are not investigated as often in traditional scientific studies
  • Self-monitoring has a temporal signature
  • API-connected devices capture time-based signatures

SLIDE 24

When do People Monitor?

  • As expected, users are more likely to self-monitor on weekends and in the evenings
  • Specifically, users monitor their BAC about 5-6 times more often on Friday and Saturday nights, compared to weekday work hours
  • Surprise! There’s a self-monitoring bump on weekdays around 7am… eye-openers?

(Note: values inside the graph are in units of thousands – e.g., 1.6 = 1.6k or 1,600 measurements for that day and hour.)

SLIDE 25

When is BAC highest?

  • Supporting data validity, users have higher measured BAC levels on weekends
  • And in the evenings…
  • Interestingly, measured BAC levels peak around 1-2am… when the bars tend to close…
  • BAC self-monitoring peaks in the “wee hours” of the evening
  • Tuesday is the “soberest” day
SLIDE 26

Is Location a Trigger for Drinking?

  • Many animal studies of alcohol use employ “conditioned place preference” (CPP).
  • When you pair alcohol with a certain place, an animal learns to prefer that place.
  • This suggests that places (locations) can be cues for alcohol consumption.

SLIDE 27

Getting distances from Geolocation:

  • To scale efficiently – do as much as you can with tools like Redshift
  • AWS Redshift does a lot of things… even trigonometry
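The great-circle trigonometry that Redshift's SIN/COS/ASIN SQL functions can evaluate in the warehouse looks like this as a Python sketch (the haversine formula; the function name is ours):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# San Francisco to Los Angeles: roughly 560 km.
d = haversine_km(37.7749, -122.4194, 34.0522, -118.2437)
```

Pushing this arithmetic into Redshift means the per-row distance computation scales with the warehouse rather than with a single analysis machine.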
SLIDE 28

Big Data can suffer from the problem of: Garbage In – Garbage Out

SLIDE 29

Striking but Accurate

SLIDE 30

SLIDE 31

Shorter Distances Since the Prior Measurement Predict Higher BAC Levels

  • The distance a user has traveled between subsequent BAC measures helps predict subsequent BAC levels
  • Conditioned Place Preference suggests that short distances will be associated with higher BAC values
  • However, we may need to restrict this to distances between drinking episodes rather than measurements
  • INTERPRETATION: The highest BAC is predicted if, since your last measure, you traveled less than 1.5 km (and your distance data was not missing, i.e., > -999).

SLIDE 32

Evaluating & Optimizing Performance of Gradient Boosted Classification Trees with XGBoost

  • 1. Performance under default settings
  • 2. Class balancing
  • 3. Tuning the learning rate along with the number of trees & max depth
  • 4. Iterative feature engineering
  • 5. Final model performance & interpretation

SLIDE 33

Balancing Classes

SLIDE 34

Default Model Performance & Impact of Class Balancing

DEV SET RESULTS (N=97,327)

                    Imbalanced Classes   Balanced Classes
ROC-AUC             82.65%               82.70%
Accuracy            77.97%               73.28%
High BAC F1-Score   55%                  63%
Precision           69%                  53%
Recall              46%                  77%

  • Default settings are: 10 estimators, learning_rate=.3, max_depth=6
  • Balancing: positive scale weight = 2.38


SLIDE 35

Tuning Hyperparameters

Learning Rate   Max_Depth   Best # of Trees   CV-AUC
1.0             12          4                 81.38%
1.0             6           30                82.63%
0.3             12          82                84.27%
0.3             6           ~489              84.30%
0.1             12          >500              84.93%
0.1             6           >500              84.21%

  • Used 3-fold CV to evaluate
  • Higher learning rates → fewer trees (n_estimators); faster!!
  • However, possibly worse AUC
  • Tune them together
  • Also consider the complexity of trees (‘max_depth’)

SLIDE 36

Model Development is Iterative: Using Feature Importances to inform feature engineering

SLIDE 37

Highly Ranked Features (weight)

  • 1. User’s “guess” or subjective estimate of their BAC level
  • 2. The hour of day
  • 3. The average BAC level for all the prior BACs that user has measured
  • 4. The number of times a user has previously measured his/her BAC level
SLIDE 38

Prior Behavior Predicts Future Behavior

  • Prior experience with BACs near the limit + guessing your BAC is over the limit
  • bac_cumulative_avg is the average of all prior BAC measures for that user prior to the given measurement
  • bac_guess is the user’s subjective guess about their BAC level

SLIDE 39

What About Repeated Measures? Time Delta Features Provide Insights

DEV SET RESULTS (N=97,327)

                    Without Time Deltas   With Time Deltas
ROC-AUC             82.70%                86.70%
Accuracy            73.28%                75.44%
High BAC F1-Score   63%                   67%
Precision           53%                   56%
Recall              77%                   83%

  • Parameters: 10 estimators, learning_rate=.3, max_depth=6
  • Balancing: positive scale weight = 2.38
  • Users sometimes measure their BAC repeatedly over a period of time because they are waiting for it to come down
  • Minutes since last measurement may therefore be a very predictive feature
  • Also, the distance feature is hard to interpret correctly because it may simply be a proxy for measurements taken close together in time and space
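A minimal pandas sketch of the time-delta feature (the column names here are hypothetical, not the actual schema):

```python
import pandas as pd

# Toy measurement log: one row per BAC reading.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "measured_at": pd.to_datetime([
        "2018-06-01 22:00", "2018-06-01 22:45", "2018-06-02 01:10",
        "2018-06-01 23:00", "2018-06-01 23:30"]),
})
df = df.sort_values(["user_id", "measured_at"])

# Minutes since that user's previous measurement (NaN for a first reading).
df["mins_since_last"] = (
    df.groupby("user_id")["measured_at"].diff().dt.total_seconds() / 60
)
```

Grouping by user before taking the difference is what keeps the feature from leaking across users when measurements are interleaved in time.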

SLIDE 40

Final Model Performance in Test Set

TEST SET RESULTS (N=194,653)

                    Default Settings   Tuned Settings
ROC-AUC             86.70%             88.16%
Accuracy            75.44%             77.43%
High BAC F1-Score   67%                69%
Precision           56%                58%
Recall              83%                84%

ROC AUC: 88.16%

Final Model Params: learning_rate 0.03, max_depth 8, min_child_weight 1, subsample 0.8, colsample_bytree 0.8, num_boost_round 500

SLIDE 41

Final Model: Feature Importance

GAIN (Improvement in Accuracy):

  • Time since last measurement (min, days)
  • Your guess about your BAC
  • Your average prior BAC
  • How many times you’ve measured BAC
  • Hour of Day
  • The max and range of your prior BACs
  • Average BAC levels over last few times

WEIGHT (Number of splits):

  • Time since last measurement (min)
  • Your average prior BAC
  • How many times you’ve measured BAC
  • Hour of Day
  • Your guess about your BAC
  • The average distance you’ve traveled between prior BAC measures

  • Length of engagement with the device
  • Population of zip code
SLIDE 42

Big Picture Conclusions

  • 1. BAC levels exceeding the safe legal driving limit of 0.08% can be predicted with good accuracy using machine learning to quantify a digital phenotype.
  • 2. BAC prediction from minimal information establishes the foundation to conduct precision-medicine behavioral interventions using a digital app and BAC tracking device.

SLIDE 43

Big Picture Conclusions

  • 1. Feature engineering enhanced performance more than hyperparameter tuning
  • 2. It is important to leverage time-series measurements and behavioral neuropsychology content knowledge
  • 3. The impact and optimal handling of repeated measures in ML models seems not well-defined, in contrast to GLMs – potentially an important area for future algorithms research

SLIDE 44

Implications for Digital Interventions and Product Development

  • 1. Habits vs Goals
  • 2. Beyond Self-Monitoring
  • States of mood, energy, self-consciousness
  • Thoughts as triggers
  • Reflections on BAC levels (self-relevant meaning of the data)
  • 3. Breaking the Vicious Reward Cycle
  • Trigger management strategies
  • Anticipatory stress relief
  • Willpower is a limited resource – known relationships to time/glucose
  • Psychoeducation about withdrawal – emptiness, negative emotions, and lack of joy can be withdrawal symptoms driven by reward neurocircuitry
  • Brain-train behavioral inhibition
  • Mindfulness for social/emotional distress tolerance
  • Substitution strategies
  • Information, not advice (Motivational Interviewing)
  • Abstinence Violation strategies
SLIDE 45

Thank you. Kirstin.Aschbacher@ucsf.edu