A k-means approach to clustering disease progressions




SLIDE 1

A k-means approach to clustering disease progressions

Duc Thanh Anh Luong, Varun Chandola
Department of Computer Science & Engineering, University at Buffalo
IEEE ICHI 2017, August 26, 2017

SLIDE 2

Outline

  • Motivation
  • K-means approach
  • An application for Chronic Kidney Disease
  • Generating patient-specific disease profiles
SLIDE 3

Motivation

  • Find subgroups of patients who have similar disease progressions
  • Identify the underlying mechanism of each subgroup
  • Provide better treatment for each subgroup
SLIDE 4

Motivation

  • Different patients have different disease progressions
  • Consider the case of Chronic Kidney Disease

Can we group patients by their progressions into a few groups?

[Figure: eGFR versus days from first clinical record, one trajectory per patient ID (ten patients)]

Are there a few general trends of disease progression?

SLIDE 5

Motivation

[Figure: eGFR versus days from first clinical record. Two panels: "Trajectories of 500 patients" and "Trajectories after being clustered"]

SLIDE 6

Clustering problem and k-means algorithm

  • Cluster a set of data points into k clusters
  • Can be solved with the k-means algorithm

Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
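For reference, the standard k-means procedure on ordinary data points can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from the talk; the function names, the sample points, and the choice of squared Euclidean distance are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and update steps until no label changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Made-up points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
```

The rest of the talk keeps this alternating structure but replaces the data points, the centroid, and the distance with trajectory-specific counterparts.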

SLIDE 7

K-means approach

  • Data object: patient disease progression
  • Centroid: regression line (drawn in red)
  • Distance metric

[Figures: three panels of eGFR versus days from first clinical record, illustrating a data object, a centroid, and the distance metric]
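The two trajectory-specific ingredients can be sketched in a few lines. The slides do not state the form of the regression or the exact distance, so the code below assumes an ordinary least squares straight line over the pooled observations of a cluster and takes the mean squared residual of a patient's observations from that line as the distance. The function names and the sample values are hypothetical.

```python
def fit_line(observations):
    """Least squares fit of eGFR = intercept + slope * day.

    observations is a list of (day, egfr) pairs, possibly pooled
    from several patients. Returns (intercept, slope).
    """
    n = len(observations)
    mean_t = sum(t for t, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in observations)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in observations)
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if all days coincide
    return mean_y - slope * mean_t, slope


def distance_to_line(trajectory, line):
    """Mean squared residual of one patient's observations from a line.

    This choice of distance is an assumption made for the sketch.
    """
    intercept, slope = line
    return sum(
        (y - (intercept + slope * t)) ** 2 for t, y in trajectory
    ) / len(trajectory)


# Made-up example: a slowly declining patient and a candidate centroid.
patient = [(0, 60.0), (500, 55.0), (1000, 50.0)]
centroid = fit_line(patient)  # the line through this patient's own points
print(centroid, distance_to_line(patient, centroid))
```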

SLIDE 8

K-means approach

  • Initial step: randomly assign patients into k clusters
  • Update step: perform regression for each cluster to obtain the “centroid”
  • Assignment step: assign each patient to the cluster that has the closest centroid
  • Check: did no patient move to another group? If yes, end; if no, return to the update step
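The loop above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' implementation: the centroid is a least squares straight line, the distance is the sum of squared residuals, and the random initial assignment is balanced so that no cluster starts empty. All names and the synthetic trajectories are hypothetical.

```python
import random


def fit_line(observations):
    """Least squares fit of eGFR = intercept + slope * day."""
    n = len(observations)
    mean_t = sum(t for t, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in observations)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in observations)
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_t, slope


def squared_error(trajectory, line):
    """Sum of squared residuals of one trajectory from a line (assumed distance)."""
    intercept, slope = line
    return sum((y - (intercept + slope * t)) ** 2 for t, y in trajectory)


def cluster_progressions(trajectories, k, max_iter=100, seed=0):
    if len(trajectories) < k:
        raise ValueError("need at least k trajectories")
    rng = random.Random(seed)
    # Initial step: random assignment, balanced so every cluster is non-empty.
    labels = [i % k for i in range(len(trajectories))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: regress over the pooled observations of each cluster.
        for j in range(k):
            pooled = [
                obs
                for traj, lab in zip(trajectories, labels)
                if lab == j
                for obs in traj
            ]
            if pooled:  # an emptied cluster keeps its previous line
                centroids[j] = fit_line(pooled)
        # Assignment step: move each patient to the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_error(traj, centroids[j]))
            for traj in trajectories
        ]
        if new_labels == labels:  # no patient moved, so stop
            break
        labels = new_labels
    return labels, centroids


# Synthetic trajectories: three stable patients and three declining ones.
days = [0, 250, 500, 750, 1000]
stable = [[(t, base) for t in days] for base in (54.0, 55.0, 56.0)]
declining = [[(t, base - 0.03 * t) for t in days] for base in (54.0, 55.0, 56.0)]
trajectories = stable + declining
labels, centroids = cluster_progressions(trajectories, k=2)
```

Like ordinary k-means, a loop of this kind can settle in a local optimum that depends on the initial assignment, so running it from several seeds is a common precaution.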

SLIDE 9

Dataset & Preprocessing

DARTNet patients (n = 69,817)
  • Excluded: invalid birth year and sex value (n = 6,418)
  • Excluded: number of serum creatinine records < 1 (n = 181)
  • Excluded: invalid data records (n = 9)
“Preprocessed” DARTNet patients (n = 63,209)
Having eGFR values less than 60 for more than three months (n = 29,585)
  • Excluded: number of serum creatinine records < 10 (n = 17,158)
  • Excluded: observation duration < 1 year (n = 5,285)
Final CKD cohort (n = 7,142)
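The last three selection criteria can be expressed as a simple filter over per-patient records. The sketch below is only one possible reading of the flowchart. It assumes one eGFR value per serum creatinine record, treats "three months" as 90 days, and interprets "less than 60 for more than three months" loosely as two sub-60 readings more than 90 days apart; a stricter definition might also require every reading in between to be below 60. The function names and sample records are hypothetical.

```python
def has_sustained_low_egfr(records, threshold=60.0, min_days=90):
    """True if sub-threshold eGFR readings span more than min_days (assumed reading)."""
    low_days = [day for day, egfr in records if egfr < threshold]
    return bool(low_days) and max(low_days) - min(low_days) > min_days


def in_ckd_cohort(records, min_records=10, min_duration_days=365):
    """Apply the three cohort criteria to one patient's (day, egfr) records."""
    if len(records) < min_records:
        return False  # fewer than 10 serum creatinine records
    days = [day for day, _ in records]
    if max(days) - min(days) < min_duration_days:
        return False  # observed for less than one year
    return has_sustained_low_egfr(records)


# Made-up patients, keyed by a hypothetical identifier.
patients = {
    "A": [(i * 100, 50.0) for i in range(12)],  # long follow-up, low eGFR
    "B": [(i * 100, 50.0) for i in range(5)],   # too few records
    "C": [(i * 15, 50.0) for i in range(12)],   # observed for under a year
    "D": [(i * 100, 80.0) for i in range(12)],  # eGFR never below 60
}
cohort = [pid for pid, records in patients.items() if in_ckd_cohort(records)]
```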

SLIDE 10

Clustering result

SLIDE 11

Demographic distribution in clusters

[Figure: age distribution by cluster (x-axis: cluster, y-axis: age)]

Cluster      Share of cohort   Patients
Cluster 1    5.91%             422
Cluster 2    14.86%            1061
Cluster 3    13.72%            980
Cluster 4    14.59%            1042
Cluster 5    14.98%            1070
Cluster 6    9.61%             686
Cluster 7    11.55%            825
Cluster 8    9.49%             678
Cluster 9    4.56%             326
Cluster 10   0.73%             52

SLIDE 12

Other clinical markers

SLIDE 13

Generating patient-specific disease profiles

[Figure: eGFR versus days from first clinical record, with the residuals marked]

Residuals → Gaussian processes

Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian processes for machine learning. Vol. 1. Cambridge: MIT Press, 2006.
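One way to read this slide is that a patient's residuals from the cluster trajectory are modeled with a Gaussian process, and the individual prediction is the cluster trajectory plus the predicted residual. The sketch below implements plain Gaussian process regression in the Python standard library under that reading. The squared exponential kernel, the hyperparameter values, the 1.96 standard deviation limits, and all numbers in the example are assumptions, since the slides do not specify them.

```python
import math


def rbf(t1, t2, length_scale, signal_var):
    """Squared exponential kernel (an assumed choice)."""
    return signal_var * math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_cholesky(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][m] * z[m] for m in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[m][i] * x[m] for m in range(i + 1, n))) / low[i][i]
    return x


def gp_predict(train_t, residuals, test_t,
               length_scale=200.0, signal_var=25.0, noise_var=1.0):
    """Posterior mean and variance of the latent residual at each test time."""
    gram = [
        [rbf(a, b, length_scale, signal_var) + (noise_var if i == j else 0.0)
         for j, b in enumerate(train_t)]
        for i, a in enumerate(train_t)
    ]
    low = cholesky(gram)
    alpha = solve_cholesky(low, residuals)
    means, variances = [], []
    for t in test_t:
        k_star = [rbf(t, a, length_scale, signal_var) for a in train_t]
        means.append(sum(k * w for k, w in zip(k_star, alpha)))
        v = solve_cholesky(low, k_star)
        variances.append(signal_var - sum(k * x for k, x in zip(k_star, v)))
    return means, variances


# Made-up patient and a hypothetical cluster trajectory (intercept, slope).
cluster_intercept, cluster_slope = 40.0, -0.005
days = [0.0, 150.0, 300.0, 450.0, 600.0]
egfr = [41.0, 38.5, 39.5, 36.0, 37.5]
residuals = [y - (cluster_intercept + cluster_slope * t) for t, y in zip(days, egfr)]
future = [700.0, 800.0, 900.0]
means, variances = gp_predict(days, residuals, future)
for t, m, v in zip(future, means, variances):
    center = cluster_intercept + cluster_slope * t + m
    half_width = 1.96 * math.sqrt(max(v, 0.0))
    print(t, center - half_width, center, center + half_width)
```

Far from the observed days the predicted residual shrinks toward zero, so the individual prediction falls back to the cluster trajectory while the limits widen.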

SLIDE 14

Generating patient-specific disease profiles

[Figure: two panels of eGFR versus day; one panel is titled "Patient 391 Cluster 5". Legend: cluster's trajectory, individual predicted trajectory, upper and lower limit, actual eGFR value]

SLIDE 15

Conclusion & Future Work

  • Clustering disease progressions – k-means approach
  • Generating individual prediction – Gaussian processes
  • Extend the approach to cope with multiple clinical markers
  • Give quantitative evaluation of clusters
    • Tightness
    • Separation
SLIDE 16

Thank you