Accelerate Innovation in the Enterprise with Distributed ML / DL
Solutions and Reference Architecture
Nanda Vijaydev, BlueData (recently acquired by HPE)
The AI Conference, New York
Agenda
- AI, Machine Learning (ML), and Deep Learning (DL)
- Example Enterprise Use Cases
- Deployment Challenges for Distributed ML / DL
- Distributed TensorFlow and Horovod on Containers with Intel Xeon processors and Intel MKL
- Lessons Learned and Key Takeaways
AI, Machine Learning, and Deep Learning
Let’s Get Grounded…What is AI?
What are Machine Learning and Deep Learning?
Artificial intelligence (AI)
Mimics human behavior. Any technique that enables machines to solve a task the way humans do.
Example: Siri
Machine learning (ML)
Algorithms that allow computers to learn from examples without being explicitly programmed.
Example: Google Maps
Deep learning (DL)
Subset of ML that uses deep artificial neural networks as models, inspired by the structure and function of the human brain.
Example: Self-driving car
Why Should You Be Interested in AI / ML / DL?
Everyone wants AI / ML / DL and advanced analytics…
- AI and advanced analytics infrastructure could constitute 15-20% of the market by 2021 ¹
- AI and advanced analytics represent 2 of the top 3 CIO priorities
- Enterprise AI adoption: 2.7X growth in the last 4 years ²
…but they face many challenges
- Use cases
- New roles, skill gaps
- Culture and change
- Data preparation
- Legacy infrastructure
¹ IDC. Goldman Sachs. HPE Corporate Strategy, 2018.
² Gartner, “2019 CIO Survey: CIOs Have Awoken to the Importance of AI”
Key Questions Remain …
- How do you integrate your AI and data ecosystem for ML / DL and advanced analytics?
- How do you modernize, consume, and prepare your EDW or Hadoop big data foundation for AI?
- How do you get started with gaining intelligence from your data?
- What is the best way to prepare your company for a data-centric and AI future?
- What opportunities does AI bring to your business?
- What are the major use cases?
AI / ML / DL Adoption in the Enterprise
- Health: Personalized medicine, image analytics
- Manufacturing: Predictive and prescriptive maintenance
- Consumer tech: Chatbots
- Financial services: Fraud detection, ID verification
- Government: Cyber-security, smart cities and utilities
- Energy: Seismic and reservoir modeling
- Service providers: Media delivery
- Retail: Video surveillance, shopping patterns
Example Enterprise Use Cases
Financial Services Use Cases
Fraud Detection
- Real-Time Transactions
- Credit Card
- Merchant
- Collusion
- Impersonation
- Social Engineering Fraud
Risk Modeling & Credit Worthiness Check
- Loan Defaults
- Delayed Payments
- Liquidity
- Market & Currencies
- Purchases and Payments
- Time Series
CLV Prediction and Recommendation
- Historical Purchase View
- Pattern Recognition
- Retention Strategy
- Upsell
- Cross-Sell
- Nurturing
Customer Segmentation
- Behavioral Analysis
- Understanding Customer Quadrant
- Effective Messaging & Improved Engagement
- Targeted Customer Support
- Enhanced Retention
Other
- Image Recognition
- NLP
- Security
- Video Analysis
Wide Range of ML / DL Use Cases for Wholesale / Commercial Banking, Credit Card / Payments, Retail Banking, etc.
CLV: Customer Lifetime Value
Fraud Detection Use Case
- One of the most common use cases for ML / DL in Financial Services is to detect and prevent fraud
- This requires:
  – Distributed Big Data processing frameworks such as Spark
  – ML / DL tools such as TensorFlow, H2O, and others
  – Continuous model training and deployment
  – Multiple large data sets
Fraud Detection Use Case (cont’d)
- Data science teams need the ability to create distributed ML / DL environments for sandbox use as well as trial-and-error experimentation
- This requires:
  – Hardware acceleration (e.g. Xeon, MKL)
  – Multiple different ML / DL and data science tools
  – Fast and repeatable deployment of clusters
ML / DL in Healthcare – Use Cases
- Precision Medicine and Personal Sensing
  – Disease prediction, diagnosis, and detection (e.g. genomics research)
  – Using data from local sensors (e.g. mobile phones) to identify human behavior
- Electronic Health Record (EHR) correlation
  – “Smart” health records
- Improved Clinical Workflow
  – Decision support for clinicians
- Claims Management and Fraud Detection
  – Identify fraudulent claims
- Drug Discovery and Development
Use Case: Precision Medicine
- Many types of data
  – Genomic
  – Microbiome
  – Epigenome
  – Etc.
- Huge volumes of data (petabytes to exabytes)
[Figure: 360° View of the Patient. The Patient is linked to Demographics, Visit, Care Site, Rx, Studies, Genomics, Diagnosis, and Labs.]
Deployment Challenges for Distributed ML / DL
Why Distributed ML / DL?
- Speed
- Large Data Volumes
- Fault Tolerance
Distributed ML / DL – Challenges
- Complexity, lack of repeatability and reproducibility across environments
- Sharing data, not duplicating data
- Need agility to scale compute resources up and down
- Deploying multiple distributed platforms, libraries, applications, and versions
- One size environment fits none
- Need a flexible and future-proof solution
[Figure: environments spanning Laptop, On-Prem Cluster, and Off-Prem Cluster]
Example Deployment Challenges
- How to run clusters on heterogeneous host hardware
  – CPUs and GPUs, including multiple GPU versions
- How to maximize use of expensive hardware resources
- How to minimize manual operations
  – Automating the cluster creation and deployment process
  – Creating reproducible clusters and reproducible results
  – Enabling on-demand provisioning and elasticity
Example Deployment Challenges (cont’d)
- How to support the latest versions of software
  – Deployment complexity and upgrades
  – Version compatibility
- How to ensure enterprise-class security
  – Network, storage, user authentication, and access
Modern Technology Innovations
Docker is software that performs operating-system-level virtualization, also known as containerization. Containerization allows multiple isolated instances to run on a single server.
Source: https://en.wikipedia.org/wiki/docker_(software)
- Simplify Deployments
- Innovate Faster
- Deploy Anywhere
Distributed ML / DL and Containers
- ML / DL applications are compute- and hardware-intensive
- They can benefit from the flexibility, agility, and resource sharing attributes of containerization
- But care must be taken in how this is done, especially in a large-scale distributed environment
AI-Driven Solutions for the Enterprise
[Figure: solution stack with layers IT, Data (HDFS/NFS, Data Store, Cloud), Data Platforms, Data Science and ML / DL Tools, and Solutions. Example Industry Use Cases: Fraud Detection, Genome Research, Video Surveillance, Customer 360. Labeled concerns: User Access, Security, Time to Deploy, Multi-Tenant, Data Duplication.]
Turnkey Container-Based Solution
BlueData EPIC™ Software Platform
- IOBoost™ – Extreme performance and scalability
- ElasticPlane™ – Self-service, multi-tenant clusters
- DataTap™ – In-place access to data on-prem or in the cloud
[Figure: BlueData EPIC platform. Users: Data Scientists, Developers, Data Engineers, Data Analysts. Tools: BI/Analytics Tools, Big Data Tools, ML / DL Tools, Data Science Tools, Bring-Your-Own. Compute: CPUs, GPUs. Storage: NFS, HDFS. Deployment: On-Premises, Public Cloud.]
TensorFlow and Horovod on Containers with Intel Xeon processors and Intel MKL
Distributed TensorFlow – Concepts
- Running TensorFlow training in parallel on multiple devices
- Goal is to improve accuracy and speed
- Different layers may be trained on different nodes (model parallelism)
- The same model can be applied to different subsets of the data on different nodes (data parallelism)
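The data-parallel idea above can be sketched in plain Python without TensorFlow. This is a minimal illustration, not the presenter's code: it assumes a toy one-parameter model `y = w * x` with a mean squared error loss, and the helper names (`gradient`, `split`, `data_parallel_step`) are hypothetical. Each simulated worker computes a gradient on its own shard, and the averaged gradient drives a single shared update.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data, num_workers):
    """Split data into equal-sized contiguous shards, one per worker."""
    if len(data) % num_workers != 0:
        raise ValueError("data length must be divisible by num_workers")
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w, data, num_workers, lr):
    """One training step: local gradients per shard, then a shared update."""
    shards = split(data, num_workers)
    # In a real cluster each of these runs on a different node.
    local_grads = [gradient(w, shard) for shard in shards]
    # Synchronization point: every worker ends up with the same average.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = data_parallel_step(0.0, data, num_workers=2, lr=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, data)
```

With equal-sized shards, the averaged shard gradients equal the full-batch gradient, so `w_parallel` and `w_single` agree. That equivalence is why the workers must synchronize before each update.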
Distributed TensorFlow – Schemes
- Data parallelism implementation
  – Needs to sync model parameters
  – Uses a centralized or decentralized scheme to communicate parameter updates
- Centralized schemes use a Parameter Server to communicate parameter updates (gradients) between nodes
- Decentralized schemes use the ring-allreduce scheme
- Horovod is an open source framework developed by Uber that supports allreduce
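The ring-allreduce scheme can be simulated in a single process to show how the data moves. The sketch below is a simplified model under stated assumptions, not Horovod's implementation: workers are list indices, "sending" is a list copy, and the function name `ring_allreduce` is hypothetical. Each worker splits its gradient vector into one chunk per worker. A scatter-reduce phase passes partial sums around the ring, then an allgather phase circulates the finished chunks.

```python
def ring_allreduce(worker_values):
    """Simulate a sum ring-allreduce over one equal-length list per worker."""
    n = len(worker_values)
    length = len(worker_values[0])
    if any(len(v) != length for v in worker_values) or length % n != 0:
        raise ValueError("need equal-length vectors divisible into n chunks")
    size = length // n
    # chunks[r][c] is worker r's local copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)]
              for v in worker_values]

    # Phase 1, scatter-reduce: at each step worker r sends chunk (r - step)
    # to its right neighbour, which adds it to its own copy. Messages are
    # collected first so that all sends in a step happen "simultaneously".
    for step in range(n - 1):
        msgs = [((r - step) % n, list(chunks[r][(r - step) % n]))
                for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n]))
                for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            chunks[(r + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(grads)
```

Every worker ends with the same summed vector. Dividing each element by the number of workers gives the averaged gradient used in data-parallel training. No central server is involved: each worker talks only to its ring neighbour.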
Meet Horovod
- Distributed training framework for
  – TensorFlow, PyTorch, Keras
- Separates infrastructure capabilities from ML
- Installs easily on existing ML framework
  – pip install horovod
- Uses bandwidth-optimal communication protocol
  – RDMA, InfiniBand if available
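One way to read "bandwidth optimal" is in terms of how much data each node must move per training step. The arithmetic below is a rough sketch under simplifying assumptions: a single, non-sharded parameter server, gradients and parameters of equal size, and no compression. The function names are hypothetical. In a ring-allreduce each worker sends 2 × (N − 1) / N times the model size, which stays below twice the model size regardless of N, whereas traffic through one central server grows linearly with N.

```python
def ring_allreduce_bytes_sent_per_worker(model_bytes, num_workers):
    """Bytes each worker sends in one ring-allreduce (both phases)."""
    return 2 * (num_workers - 1) * model_bytes / num_workers


def parameter_server_bytes_through_server(model_bytes, num_workers):
    """Bytes through one central server: N gradients in, N parameter copies out."""
    return 2 * num_workers * model_bytes


model_bytes = 100 * 1024 * 1024  # assumed 100 MiB of parameters
for n in (2, 4, 16):
    ring = ring_allreduce_bytes_sent_per_worker(model_bytes, n)
    server = parameter_server_bytes_through_server(model_bytes, n)
    print(f"workers={n}: ring per worker={ring:.0f} B, server total={server} B")
```

Real deployments often shard parameter servers, so this comparison illustrates the scaling trend rather than a measured result.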
TensorFlow with Horovod on Docker
[Figure: Horovod cluster on multiple Docker containers and machines. Each container runs MPI 3.1.3, TensorFlow 1.7*, and MKL; all containers access Shared Data.]
TensorFlow with Horovod
- tensorflow_word2vec.py from the examples in the Horovod git repository, https://github.com/horovod/horovod
- Data comes from shared NFS mounts, automatically surfaced by BlueData into containers
- Passwordless ssh set up during cluster creation
- All prerequisites installed on all nodes, including:
  – MKL – Math Kernel Library
  – tensorflow, pytorch, scikit-learn, ... (compute frameworks)
  – openmpi (to distribute the job)
  – tensorboard for visualization
App Store with Pre-Built ML / DL Images
- Docker images for multiple applications and versions
- Ability to create and add new images
TensorFlow with Horovod (cont’d)
mpirun -np 4 \
    --allow-run-as-root \
    -d -H bluedata-302.bdlocal:2,bluedata-301.bdlocal:4 \
    -bind-to none -map-by slot \
    -x LD_LIBRARY_PATH \
    -x PATH \
    -mca pml ob1 \
    -mca btl ^openib python tensorflow_word2vec_logs.py
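Typing a long mpirun command by hand for every cluster is error-prone, so it can help to generate it. The helper below is a hypothetical sketch, not part of BlueData or Horovod: `build_mpirun_command` is an assumed name, and it reproduces only flags that appear in the slide's command, leaving out the `-d` debug flag. It also checks that the requested process count does not exceed the slots declared with `-H`.

```python
def build_mpirun_command(hosts, script, num_procs=None, allow_run_as_root=False):
    """Build an mpirun argument list from a {hostname: slots} mapping.

    hosts: insertion-ordered dict of host name to slot count.
    script: path of the Python training script to launch.
    num_procs: total processes; defaults to the sum of all slots.
    """
    total_slots = sum(hosts.values())
    if num_procs is None:
        num_procs = total_slots
    if num_procs > total_slots:
        raise ValueError(
            f"requested {num_procs} processes but only {total_slots} slots")
    host_arg = ",".join(f"{name}:{slots}" for name, slots in hosts.items())

    cmd = ["mpirun", "-np", str(num_procs)]
    if allow_run_as_root:
        cmd.append("--allow-run-as-root")  # often needed inside containers
    cmd += [
        "-H", host_arg,
        "-bind-to", "none", "-map-by", "slot",
        "-x", "LD_LIBRARY_PATH", "-x", "PATH",  # forward env to remote ranks
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]
    return cmd


cmd = build_mpirun_command(
    {"bluedata-302.bdlocal": 2, "bluedata-301.bdlocal": 4},
    "tensorflow_word2vec_logs.py",
    num_procs=4,
    allow_run_as_root=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a node where Open MPI is installed and passwordless ssh to the listed hosts is configured.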
Lessons Learned and Key Takeaways
Lessons Learned and Takeaways
- Enterprises are using ML / DL today to solve difficult problems (example use cases: fraud detection, disease prediction)
- Distributed ML / DL in the enterprise requires a complex stack, with multiple different tools (TensorFlow is one popular option)
- The only constant is change … be prepared
  – Business needs, use cases, and tools will constantly evolve
- Deployments are challenging, with many potential pitfalls
  – Containerization can deliver agility and cost-saving benefits
Lessons Learned and Takeaways (cont’d)
- Leverage a flexible, scalable, and elastic platform for success