Effectiveness of Deep Learning vs. Machine Learning in a Health Care Use Case


SLIDE 1

Analytics Machine Learning/Deep Learning Showcases

RxToDx

A Data Science Machine Learning/Deep Learning Showcase

Effectiveness of Deep Learning vs. Machine Learning in a Health Care Use Case

Dima Rekesh / Julie Zhu / Ravi Rajagopalan, Optum Technology, November 2017

SLIDE 2

Objective: what have we learned from a Deep Learning health care use case?

  • Evaluate Deep Learning on a well-known, production Machine Learning problem using a typical time-series data set.
  • Seek best practices from an industry leader (NVIDIA).
  • Impute and predict the likelihood of an individual having a medical condition, using members' previous two years of prescription pharmacy claims data.

SLIDE 3

Deep Learning

How is it different?

  • Multiple layers in a neural network, with intermediate data representations that facilitate dimensionality reduction.
  • Interprets non-linear relationships in the data.
  • Derives patterns from data with very high dimensionality.

Why do we care?

  • Ability to create value with little or no domain knowledge required.
  • Ability to incorporate data from across multiple, seemingly unrelated sources.
  • Ability to tolerate very noisy data.
SLIDE 4

What have we learned?

  • Doesn't need SMEs' inputs; eliminates manual feature engineering.
  • Predicts multiple targets at a time.
  • Higher performance with higher-volume data.
  • Capable of automating model development.

SLIDE 5

Results: summary and take-aways

  • DL proved to be more accurate than conventional ML methods.
  • Neural nets required no manual feature engineering, pointing to a reduction in the person-hours required to create and maintain them.
  • A deep neural network was capable of predicting at least four different diseases more accurately than conventional ML models (which can predict only one disease at a time). This points to a drastic reduction in costs.
  • Modern GPUs are required: it takes ~24 hours to train on the full data set of 4.5M records on the latest (NVIDIA P100) GPU.

SLIDE 6

Depression Impact -- $126M Rx cost per year

www.slideshare.net/psychiatryjfn/disorders-of-mood1

Total annual claims at Optum (for the cohort): $965 million
Depression-related claims (for the cohort): $126 million

SLIDE 7

Deep Learning doesn't rely on SMEs' inputs

SME inputs + ML features for the XGBoost model (old markers, by feature importance):

Feature                                   Importance
DEPR_327                                  0.0601
number_prescribers_(10, inf]              0.0315
sum_amt_standard_cost_(621.48, inf]       0.0286
DEPR_31702                                0.0258
DEPR_31705                                0.0258
number_rx_(9, inf]                        0.0200
DEPR_31700                                0.0186
DEPR_31500                                0.0172
number_rx_(4, 9]                          0.0172
number_rx_(3, 4]                          0.0157
tot_drug_units_(5651, inf]                0.0143
DEPR_29702                                0.0143
number_prescribers_(5, 7]                 0.0129
tot_days_supply_(7, 27]                   0.0129
number_prescribers_(7, 10]                0.0129
sum_amt_standard_cost_(270.11, 477.88]    0.0114
DEPR_57018                                0.0114
tot_days_supply_(1027, 1551]              0.0100
DEPR_321                                  0.0100
DEPR_606                                  0.0100

SME inputs for the logistic model (old markers, by feature importance):

Feature                                   Importance
DEPR_320_32005                            0.2155
DEPR_322                                  0.1178
DEPR_2                                    0.0977
DEPR_32306_18_19_20_24_26                 0.0512

The Deep Learning model, by contrast, takes raw data (drug codes) without SME inputs or feature engineering.

SLIDE 8

Machine Learning Model Process Flow Chart

SLIDE 9

The revolution: Machine Learning vs. Deep Learning

Machine Learning: class-specific feature creation and feature engineering; predicts one class at a time (e.g., Depression: 1 or 0).

Deep Learning: auto-encoded embeddings over raw data; predicts multiple classes at a time (e.g., Depression 1, Asthma 1, ATDD 1, ...).

By directly using raw data without feature engineering, and by predicting multiple targets at a time, the Deep Learning approach saves more than 50% of model development time and resources.

For decades, ML relied on human-engineered features in fields as diverse as image processing (e.g., edge detection) and NLP (linguistics, stop words). DL renders feature engineering obsolete.

SLIDE 10

Higher performance with higher volume of data

[Chart: XGBoost model vs. LSTM with 4.5 million records vs. RNN with 0.5 million records]

SLIDE 11

Deep Learning vs. ML model (XGBoost): cost analysis (annual cost)

[Charts: count comparison and cost comparison, Deep Learning vs. ML model; values shown: 3.2K, 22K, 259K, 806M, 56M, 9M]
* Includes non-depression-related claims

  • Deep Learning identifies an additional USD 56 million in claims that are not identified by the ML model (XGBoost).
  • Deep Learning identifies an additional 22K patients that are not identified by the ML model (XGBoost).

SLIDE 12

Automated ML/DL platform in an AI system

  • Autonomous, instant learning
  • Hyperparameter tuning
  • Free of manual feature engineering and model selection
  • Handles sparse, highly dimensional data
  • Data-driven results
  • Multiple targets at a time
  • Able to explain results: how and why

Machine Learning/Deep Learning Robot

SLIDE 13

Multi-disease predictions

SLIDE 14

Hypertension – 4.5M records

[Chart: specialist model vs. 4-disease model]

SLIDE 15

ATDD – 4.5M records

[Chart: specialist model vs. 4-disease model]

SLIDE 16

Depression – 4.5M records

[Chart: specialist model vs. 4-disease model]

SLIDE 17

Asthma – 4.5M records

[Chart: specialist model vs. 4-disease model]

SLIDE 18

1D Convolutional Networks: simple, fast, local

[Diagram: a kernel of size 4 slides with stride 2 along the time axis over an input of length 16, producing a feature map of length 7 (f1..f7); each filter sees only a short range]

  • Kernel size: 4
  • Stride: 2
  • Input = 16, feature map = 7

Fewer weights. Observe that in images, objects are "local".
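A minimal Keras sketch of the arithmetic above (assumed Keras 2.x API; the filter count is illustrative): a size-4 kernel with stride 2 over a length-16 input yields a feature map of length 7.

```python
from keras.models import Sequential
from keras.layers import Conv1D

# One 1-D convolution over a length-16 sequence with a single channel.
# Output length = (16 - 4) // 2 + 1 = 7, matching the slide.
model = Sequential([
    Conv1D(filters=8, kernel_size=4, strides=2, input_shape=(16, 1))
])
model.summary()  # output shape: (None, 7, 8)
```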

SLIDE 19

RNNs: n inputs, m outputs

[Diagram: a chain of recurrent cells unrolled over time consumes the input "He went to school"; the state accumulates enough information to generate one output or a sequential output, e.g. the translation "él fue a la escuela"]

RNNs are pervasively used for NLP and language-to-language translation.
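A hedged sketch of the idea (not the deck's model; vocabulary and layer sizes are assumptions): a recurrent layer consumes an n-step sequence, and its final state is enough to generate one output.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# n token ids in, one output read off the final hidden state.
model = Sequential([
    Embedding(input_dim=1000, output_dim=32),  # token ids -> vectors
    LSTM(64),                                  # state summarizes the sequence
    Dense(1, activation='sigmoid'),            # single (non-sequential) output
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```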

SLIDE 20

Zero padding

Helpful with inputs consisting of different-length sequences.

[Diagram: three inputs of irregular lengths are zero-padded into a regular 3 x 8 matrix before being fed to the neural network]
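For illustration, a minimal sketch using Keras's pad_sequences (the toy sequences are made up): three inputs of different lengths become the regular 3 x 8 matrix shown in the diagram.

```python
from keras.preprocessing.sequence import pad_sequences

# Three variable-length code sequences, zero-padded to a common length 8.
seqs = [[20, 220, 575],
        [12, 700],
        [20, 220, 575, 12, 700, 12, 220, 575]]
padded = pad_sequences(seqs, maxlen=8, value=0)
print(padded.shape)  # (3, 8)
```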

SLIDE 21

Embeddings

Helpful with categorical, non-contiguous inputs.

Each input is a number from 1 to 1,000 (e.g., codes for prescription drugs), and numerically adjacent codes such as 779 and 780 are not "close" in meaning. One-hot encoding turns a 1x4 input into a 1000x4 matrix: each input becomes a vector of 1,000 ones and zeros. Better, but a lot of numbers and a lot of memory. An embedding transformation of dimension 3 turns the input into a 3x4 matrix instead: each input is a vector of just 3 numbers, and hopefully "close" vectors really are close.
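A NumPy sketch of the transformation (random weights stand in for a learned embedding matrix; codes are illustrative): four codes from a 1,000-code vocabulary go from a 1,000-wide one-hot representation down to 3 dimensions.

```python
import numpy as np

codes = np.array([779, 780, 12, 3])    # e.g. drug codes in 1..1000
one_hot = np.eye(1000)[codes - 1]      # 4 x 1000: sparse and memory-hungry
E = np.random.randn(1000, 3)           # embedding matrix (learned in practice)
dense = one_hot @ E                    # 4 x 3: each code is now 3 numbers
# Equivalent, much cheaper lookup: dense == E[codes - 1]
```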

SLIDE 22

Keras Learned Embeddings

Use a fully connected layer; the embedding is learned together with the rest of the model.

The transformation is the same as on the previous slide (one-hot 1000x4, embedding dimension 3, output 3x4), but implemented as a fully connected layer whose weights are learned during training.
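The same transformation as a trainable Keras layer (assumed Keras 2.x API); the lookup-table weights are learned jointly with the rest of the model by backpropagation.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# A 1000 x 3 weight table, trained together with downstream layers.
model = Sequential([Embedding(input_dim=1000, output_dim=3, input_length=4)])
out = model.predict(np.array([[779, 780, 12, 3]]))
print(out.shape)  # (1, 4, 3): four codes, three dimensions each
```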

SLIDE 23

Word2vec embeddings: unsupervised

CBOW (Continuous Bag of Words): predict the word given its context. Skip-grams: predict the context (including far-away words) given a word.

[Diagram: a window of 3 over the sequence 32 11 45, training for co-occurrence. Skip-grams: input 11, output pairs [11, 32] and [11, 45]. CBOW: input [32, 45], output 11.]
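A hedged gensim sketch of both training modes on toy code sequences (gensim 4.x parameter names; earlier versions use size= instead of vector_size=):

```python
from gensim.models import Word2Vec

sentences = [['32', '11', '45'], ['11', '32', '45']]  # toy code sequences
cbow = Word2Vec(sentences, vector_size=3, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=3, window=3, min_count=1, sg=1)
print(cbow.wv['11'])  # 3-dimensional vector trained on co-occurrence
```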

SLIDE 24

Word2vec Embeddings (CBOW)

[Diagram: context DCC codes from the input sequence (e.g., 33907) feed a hidden layer of linear neurons through weights w_ij, followed by a softmax output layer. Given DCC = 33907, the output is the probability that the DCC code at the nearby location is 45501 (the target DCC), versus 45502, 45503, 83600, ...]
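A toy NumPy forward pass of the network above (vocabulary, hidden size, and indices are illustrative; weights are random rather than trained): context codes are averaged through the linear hidden layer, then a softmax scores every candidate target code.

```python
import numpy as np

V, H = 1000, 3                       # vocab size, hidden size (assumed)
W_in = np.random.randn(V, H)         # input-side weights w_ij
W_out = np.random.randn(H, V)        # hidden-to-output weights
context = [7, 42]                    # toy indices standing in for DCC codes
h = W_in[context].mean(axis=0)       # hidden layer of linear neurons
logits = h @ W_out
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax over candidate target codes
# p[k] = probability that the DCC code at the nearby location is code k
```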

SLIDE 25

Embeddings: Word2Vec + LSTM

  • Approach 1:
  • Build a Word2Vec model on drug sequences using gensim
  • Replace the drug codes with their respective vectors
  • Use the vectorized inputs for the LSTM model
  • Approach 2 (sketched below):
  • Build a Word2Vec model on drug sequences using gensim
  • Initialize the weights of the Keras Embedding layer using the Word2vec output
  • Run the Embedding + LSTM model in Keras
  • Observations:
  • Approach 1 gave promising results but wasn't scalable: memory constraints kicked in during vectorization
  • Approach 2 gave good enough results, but not enough to beat a pure Keras Embedding + LSTM model
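A hedged sketch of Approach 2 (toy data; gensim 4.x and Keras 2.x APIs assumed): seed a Keras Embedding layer with Word2Vec vectors, then train Embedding + LSTM end to end.

```python
import numpy as np
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

sentences = [['20', '220', '575'], ['12', '700', '12']]  # toy drug sequences
w2v = Word2Vec(sentences, vector_size=16, window=3, min_count=1)

vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}  # 0 = padding
weights = np.zeros((len(vocab) + 1, 16))
for word, idx in vocab.items():
    weights[idx] = w2v.wv[word]

model = Sequential([
    Embedding(len(vocab) + 1, 16, weights=[weights]),  # Word2vec-initialized
    LSTM(32),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```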

SLIDE 26

Rx2dx project: evaluated network architectures

[Diagram: zero-padded time sequences (up to 256 steps) pass through an embedding and an RNN (up to 256 units) or 1-D CNN; static variables pass through a fully connected layer (up to 64 units); the two branches are concatenated and fed to a classifier over 1..N classes.]

Specialist as well as multi-disease networks were examined; a sketch follows.
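A hedged sketch of this architecture in the Keras functional API (layer sizes follow the slide's upper bounds; the vocabulary size, static-variable width, and class count are assumptions):

```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, concatenate

seq_in = Input(shape=(256,))                  # zero-padded sequence, up to 256
static_in = Input(shape=(8,))                 # static vars: age, gender, costs...
x = Embedding(input_dim=1000, output_dim=32)(seq_in)
x = LSTM(256)(x)                              # RNN branch (or a 1-D CNN)
s = Dense(64, activation='relu')(static_in)   # FC branch, up to 64 units
merged = concatenate([x, s])
out = Dense(4, activation='sigmoid')(merged)  # classifier over 1..N diseases
model = Model(inputs=[seq_in, static_in], outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```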

SLIDE 27

Hardware: IBM Minsky, a 4x GPU server

The only server offering NVLink between CPUs and GPUs on the Power architecture.

  • 20 cores POWER8 at 3.25 GHz (x8 hardware threads)
  • 1024 GB RAM
  • 2 x 2.5" 1 TB SSDs
  • Mellanox QDR InfiniBand

This architecture will make a difference on mixed workloads with a lot of CPU-to-GPU communication (real-time batch generation).
SLIDE 28

The software stack

nvidia-docker with framework-enabled docker containers; we predominantly used Keras + Theano. Swift for long-term reference data sets, inputs, and results.

[Diagram: GPUs 0..N exposed through nvidia-docker (a base docker image with device pass-through) to framework containers such as TensorFlow, Theano, MXNet, and Torch, backed by a docker registry, with command-line or web/GUI access via Jupyter]

  • CUDA drivers: the bare-metal machine is loaded with CUDA drivers; then one installs docker and then nvidia-docker.
  • At this point, hundreds of open-source DL-enabled containers become available for instant download.
  • At Optum, we already have an internal docker registry that we can utilize to store and manage internal images.

SLIDE 29

Can we do better? Analyzing data limitations

  • Issue 1: Contradicting labels. There were several cases with the same sequence but different labels.
  • Issue 2: Small sample representation. There were several sequences that had only one sample in the data; 99% of misclassifications were from these small-sample sequences.

Example of Issue 1: IDs 1, 2, and 101 share the identical DCC sequence 20 220 575 12 700 12 220 575 but carry different labels.
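A pandas sketch of how both issues can be flagged (toy data; the deck does not show the actual pipeline):

```python
import pandas as pd

df = pd.DataFrame({
    'seq':   [(20, 220, 575, 12, 700, 12, 220, 575),
              (20, 220, 575, 12, 700, 12, 220, 575),
              (30, 5, 85)],
    'label': [1, 0, 1],
})

# Issue 1: identical sequences carrying more than one distinct label.
label_variety = df.groupby('seq')['label'].nunique()
contradicting = label_variety[label_variety > 1]

# Issue 2: sequences represented by a single sample.
counts = df['seq'].value_counts()
singletons = counts[counts == 1]
```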

SLIDE 30

Can we do better: what has the model learned?

  • Identify the sub-sequences that have the most impact on prediction.
  • Sequentially eliminate codes from a sequence and estimate the impact on the predicted probability.
  • Identify the sub-sequence that maximizes the predicted probability (see the sketch below).

Original sequence: 27 10 27 27 30 5 85 35 40 27 30 75 27 27 30 50
Most effective sub-sequence: 27 27 27 30 40 27 30 27 27 30 50
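A hedged sketch of the elimination procedure (a greedy variant; `model` is an assumed trained Keras binary classifier over zero-padded sequences, not the deck's actual code):

```python
from keras.preprocessing.sequence import pad_sequences

def most_effective_subsequence(model, seq, maxlen=256):
    """Greedily drop codes whose removal raises the predicted probability."""
    def prob(s):
        x = pad_sequences([s], maxlen=maxlen)
        return float(model.predict(x)[0, 0])

    best, best_p = list(seq), prob(seq)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for i in range(len(best)):
            cand = best[:i] + best[i + 1:]
            p = prob(cand)
            if p > best_p:                  # removal increased the probability
                best, best_p, improved = cand, p, True
                break
    return best, best_p
```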
SLIDE 31

Summary of approaches attempted (see the callbacks sketch below):
▪ Static variables: age, gender, costs, etc.
▪ Zero padding to a fixed length
▪ Specialist models (each disease gets a separate model) or a generalist model (all diseases are imputed by the same model); multi-disease models were explored as well
▪ Explored GRU and Conv1D networks; LSTMs came out to be the best
▪ Regularization (weight decay and dropout)
▪ Automatic hyper-parameter grid search
▪ Termination by patience (once validation loss no longer decreases)
▪ Saving checkpoints / restarting from them
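Termination by patience and checkpointing map directly onto standard Keras callbacks; a minimal sketch (the file name and patience value are illustrative):

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor='val_loss', patience=5),          # stop on plateau
    ModelCheckpoint('rx2dx_best.h5', monitor='val_loss',
                    save_best_only=True),                   # save checkpoints
]
# model.fit(x_train, y_train, validation_split=0.1, callbacks=callbacks)
```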

SLIDE 32

Future
▪ More work on summarizing what the model has learned
▪ Multi-GPU training
▪ Accelerated LSTM libraries (PyTorch, Keras)
▪ Attention
▪ Multi-network ensembles
▪ More complex models that better accommodate rare sequences