1
Health Cloud Project
Integrated Media Systems Center
University of Southern California
Dimitrios Stripelis
stripeli@usc.edu
2
Purpose
- Compute Machine Learning models from independent Spark clusters
- Combine partial models to construct a unified ML model
3
Framework Schematically
- One main Portal for submitting requests
- Independent Spark clusters, each residing on a remote hospital network
4
Framework Operations
- The user accesses the Portal (Server 1) and requests the construction of an ML model from each remote Spark cluster
- Each cluster receives the request and computes the model through Spark MLlib
- Once the computation finishes, every model, along with algorithm-specific auxiliary data, is returned to the Portal in JSON format for unification
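As a rough illustration of the return step, the sketch below serializes one partial model to JSON on the cluster side and parses it again on the Portal side. The field names (`algorithm`, `server`, `model`, `auxiliary`) and both helper functions are assumptions made for this example; the slides do not specify the actual JSON schema.

```python
import json

# Hypothetical schema: these four field names are illustrative only.
REQUIRED_FIELDS = {"algorithm", "server", "model", "auxiliary"}


def model_to_json(algorithm, server, model, auxiliary):
    """Serialize one partial model plus its auxiliary data to a JSON string."""
    payload = {
        "algorithm": algorithm,
        "server": server,
        "model": model,
        "auxiliary": auxiliary,
    }
    return json.dumps(payload, sort_keys=True)


def model_from_json(text):
    """Parse a payload on the Portal side and check that no field is missing."""
    payload = json.loads(text)
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"payload is missing fields: {sorted(missing)}")
    return payload
```

Validating the payload on arrival keeps a malformed result from one cluster from silently entering the unification step.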
5
ML Algorithms
Currently the Framework supports two principal algorithms:
- Naive Bayes
- Linear Regression with Stochastic Gradient Descent (SGD)
Extensible for:
- classification & regression: SVM, decision trees
- collaborative filtering: alternating least squares (ALS)
- clustering: k-means, Gaussian Mixture, Latent Dirichlet Allocation (LDA)
- optimization: limited-memory BFGS (L-BFGS)
6
Datasets
We evaluated the Framework’s efficiency against medical datasets available at the UCI Machine Learning Repository. The datasets were related to:
- Single Photon Emission Computed Tomography (SPECT) images
- Diabetes 130
- Parkinsons Telemonitoring Data Set
7
Implementation
We developed the Health Cloud Framework’s infrastructure on Microsoft Azure, using three type D1 servers.
Portal Role
We use the main Portal (server 1) to submit the machine learning computation requests to each remote server (servers 2, 3) by passing the following arguments:
1. Accessible external hostname for each server
2. Name of the Machine Learning algorithm to be computed
3. Path to the training data file on the remote server
4. Path to the testing data file on the remote server
5. Algorithm-specific parameters for model computation
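The five arguments above can be pictured as one request record per remote server. The following is a minimal sketch, assuming a hypothetical `ModelRequest` class; only the flag names (`--server`, `--algorithm`, `--training-file`, `--testing-file`, `--parameters`) are taken from the script invocation shown on the Real Execution slide.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRequest:
    """One computation request for one remote server (hypothetical helper)."""
    server: str          # 1. accessible external hostname
    algorithm: str       # 2. name of the ML algorithm
    training_file: str   # 3. path to the training data on the remote server
    testing_file: str    # 4. path to the testing data on the remote server
    parameters: dict = field(default_factory=dict)  # 5. algorithm-specific

    def to_args(self):
        """Render the request as a flat list of command-line arguments."""
        args = [
            "--server", self.server,
            "--algorithm", self.algorithm,
            "--training-file", self.training_file,
            "--testing-file", self.testing_file,
        ]
        if self.parameters:
            args.append("--parameters")
            args.extend(f"{key}={value}" for key, value in self.parameters.items())
        return args
```

Keeping the parameters in a dictionary leaves room for algorithms with different parameter sets without changing the request structure.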
8
Implementation
- After we submit the request to the Framework, we initialize a Spark cluster on each server, i.e. a single Master and a single Worker on each machine, and we execute the appropriate jar file for the requested Machine Learning algorithm (currently NaiveBayes or LinearRegression).
- Synchronous Execution
Once the model is computed, the JSON file is constructed and sent to the main Portal. We then terminate the Spark cluster on that server and proceed with the computation of the ML model on the second machine.
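The synchronous flow can be sketched as a plain loop that handles one server at a time. The three callables (`start_cluster`, `run_job`, `stop_cluster`) are placeholders assumed for this example; the slides do not show how the Spark cluster is actually started, driven, or stopped.

```python
def run_synchronously(requests, start_cluster, run_job, stop_cluster):
    """Process requests one server at a time and collect the returned results.

    `requests` is a list of dicts with at least a "server" key. The three
    callables are hypothetical hooks supplied by the caller.
    """
    results = []
    for request in requests:
        start_cluster(request["server"])
        try:
            # Blocks until the model for this server has been computed.
            results.append(run_job(request))
        finally:
            # The cluster is shut down before moving on to the next server,
            # even if the job raised an error.
            stop_cluster(request["server"])
    return results
```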
9
Significance
- One of the main contributions of the Framework is that we can configure the computation of an ML algorithm separately on each server, using different training and testing datasets, and experiment with the algorithm-specific parameters so that we can optimize the requested results without transferring any data between the servers.
- Furthermore, this implementation gives us the flexibility to combine the same or even different Machine Learning models, produced from dissimilar datasets and domains, in order to construct a unified model, which can in turn lead to a more generic ML model with almost the same accuracy as the initial models.
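The slides do not describe the unification rule itself. Purely as an illustration of how partial models might be combined without moving any training data, the sketch below averages linear regression models, weighting each by the number of rows it was trained on. The averaging rule, the function name, and the dictionary keys are all assumptions, and the sketch covers only models of the same type and dimension.

```python
def average_linear_models(models):
    """Combine linear models by a row-count-weighted average (assumed rule).

    Each model is a dict with "weights" (list of floats), "intercept"
    (float) and "num_rows" (int); these keys are hypothetical.
    """
    if not models:
        raise ValueError("at least one model is required")
    total_rows = sum(m["num_rows"] for m in models)
    if total_rows <= 0:
        raise ValueError("total number of training rows must be positive")
    dim = len(models[0]["weights"])
    if any(len(m["weights"]) != dim for m in models):
        raise ValueError("all models must have the same number of weights")

    weights = [
        sum(m["weights"][i] * m["num_rows"] for m in models) / total_rows
        for i in range(dim)
    ]
    intercept = sum(m["intercept"] * m["num_rows"] for m in models) / total_rows
    return {"weights": weights, "intercept": intercept, "num_rows": total_rows}
```

Only the model coefficients and a row count cross the network here, which is consistent with the goal of keeping the raw data on each server.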
10
Real Execution
Server 2 - NaiveBayes parameters:
- Type: Bernoulli (the dataset consists of 0s and 1s)
- Additive Smoothing: 0.01
Server 3 - LinearRegression parameters:
- Number of Iterations: 3
- Step Size of Gradient Descent: 3.0
We call the following script from the Portal (Server 1) for the NaiveBayes and LinearRegression computations, and we receive the subsequent JSON files:
./ml_cluster_exec.sh --server instance-trans2.cloudapp.net --algorithm NaiveBayes \
    --training-file /u01/health_data/2servers_data/SPECT.train.part1.csv \
    --testing-file /u01/health_data/SPECT.test.csv \
    --parameters type=bernoulli smoothing=0.01 \
    --server instance-trans3.cloudapp.net --algorithm LinearRegression \
    --training-file /u01/health_data/2servers_data/parkinsons_updrs.data.part1.csv \
    --testing-file /u01/health_data/2servers_data/parkinsons_updrs.data.test.csv \
    --parameters iterations=3 stepsize=3
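The invocation above carries one block of flags per server. How `ml_cluster_exec.sh` parses them is not shown in the slides; the sketch below is one possible way to split such a flat argument list into per-server groups, where each `--server` flag opens a new block. The function name and the list-valued result format are assumptions.

```python
def split_requests(argv):
    """Group a flat argument list into one dict per --server block.

    Every flag maps to the list of values that follow it, so
    "--parameters a=1 b=2" becomes {"parameters": ["a=1", "b=2"]}.
    """
    requests = []
    current = None  # dict for the server block being filled
    key = None      # most recently seen flag name
    for token in argv:
        if token.startswith("--"):
            key = token[2:]
            if key == "server":
                current = {}
                requests.append(current)
            elif current is None:
                raise ValueError("arguments must start with --server")
            current[key] = []
        elif current is None or key is None:
            raise ValueError(f"unexpected value before any flag: {token}")
        else:
            current[key].append(token)
    return requests
```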
11
Development
- Distribute requests and retrieve results asynchronously
- Extend the Health Cloud Framework to support the full spectrum of Spark MLlib algorithms
Research Oriented
- Based on current experimental features, continue exploring novel ML models by combining information derived from intermediate ones
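As a sketch of what asynchronous distribution could look like, the example below submits every request at once through a thread pool and collects each result as it arrives. The `run_job` callable is a placeholder for whatever contacts a remote cluster, and the use of Python's `concurrent.futures` is an assumption for illustration, not something the slides prescribe.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_asynchronously(requests, run_job, max_workers=4):
    """Submit all requests concurrently and return results keyed by server.

    `requests` is a list of dicts with a "server" key; `run_job` is a
    hypothetical callable that blocks until one remote model is ready.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_job, req): req["server"] for req in requests}
        # Results are handled in completion order, not submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Compared with the synchronous flow, a slow cluster no longer delays the start of work on the others.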