Machine Learning Applications – Ben Chandler, Hewlett Packard Labs



SLIDE 1

A Platform for Accelerating Machine Learning Applications

Ben Chandler Hewlett Packard Labs

April 6th, 2016

SLIDE 2

Optimized HW/SW Platforms

HPE Big Data and HPC portfolio strategy

Design and deliver comprehensive solutions with purpose-built platforms

1. Innovate, design & deliver best-in-class hardware and software to support the foundational infrastructure needs of Big Data customers
2. Provide vertical solutions by building a software stack and partner ecosystem
3. Enable Advisory Services to help manage the customer's technology journey

Drive HPC and Big Data across all enterprises

SLIDE 3

Modernize your datacenter for massive parallel processing innovation

Deliver automated intelligence, real-time insights and optimized performance

Optimized performance. Real-time insights. Automated intelligence.

Extreme performance capabilities to process, manage and analyze data-, I/O- and storage-intensive application workloads with high speed, scale and efficiency, and to enable high flexibility for open infrastructure innovation.

Navigate the data-driven transformation journey across all enterprises with new HPC and Big Data capabilities that accelerate time-to-value for increased competitive differentiation

Deep Learning Innovation HPC Compute & Storage Solution, HPE Vertica for SQL on Hadoop, Integrity MC990 X for Database Processing, Risk Compliant Archive Solution, Trade & Match Server Solution, HPC for Trader Workstation

Apollo 6500, Apollo 4520, Apollo 2000, Apollo 4510, HPE Moonshot, Apollo 4000 Series

SLIDE 4

Deliver automated intelligence in real-time for Deep Learning

Unprecedented performance and scale with HPE Apollo 6500 high density GPU solution

HPE Apollo 6500 is an ideal HPC and Deep Learning platform, providing unprecedented performance with 8 GPUs, a high-bandwidth fabric and a configurable GPU topology to match deep learning workloads.

Customer benefits
– Up to 8 high-powered GPUs per tray (node); 2P Intel E5-2600 v4 support
– Choice of high-speed, low-latency fabrics with 2x I/O expansion
– Workload optimized using flexible configuration capabilities
– Faster model training time, better fusion of data*

Use cases
– Video, image, text, audio and time-series pattern recognition
– Large, highly complex, unstructured simulation and modeling
– Real-time and near-real-time analytics

Transform to a hybrid infrastructure. Enable workplace productivity. Protect your digital enterprise. Empower a data-driven organization.

Automated Intelligence delivered by HPE Apollo 6500 and Deep Learning software solutions

* Benchmarking results provided at or shortly after announcement


SLIDE 5

HPE Apollo 6500 solution innovation

System Design Innovation to maximize GPU capacity and performance with lower TCO

HPE Apollo 6500

– Dense GPU server optimized for Deep Learning and HPC workloads
– Density optimization
– High-performance fabrics

Cluster management enhancements (massive scaling, open APIs, tight integration, multiple user interfaces)

Unique solution differentiators (new technologies, products):
– GPU density
– Configurable GPU topologies
– More network bandwidth
– Power and cooling optimization
– Manageability
– Better productivity

Deep Learning and HPC software platform enablement (HPE CCTK, Caffe, CUDA, Google TensorFlow, HPE IDOL)


SLIDE 6

Roadmap

– Motivating evidence
– The CogX project and vision
– Open-source availability

SLIDE 7

A simple data-intensive program

val movie1 = ...
val movie2 = ...
val average = (movie1 + movie2) / 2

[Dataflow graph: movie1 and movie2 feed +, then / 2, producing average]

SLIDE 8

Simplified architecture diagram

[Diagram: CPU with CPU memory; GPU with GPU memory]

SLIDE 9

Naïve data flow in practice

val average = (movie1 + movie2) / 2

[Diagram: CPU with CPU memory; GPU with GPU memory]

SLIDE 10

Optimized data flow in practice

val average = fusedOp(movie1, movie2, 2)

[Diagram: CPU with CPU memory; GPU with GPU memory]
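The payoff of fusing the add and divide into one operation is fewer passes over memory. A minimal NumPy sketch (illustrative only, not CogX's implementation) contrasting the two schedules:

```python
import numpy as np

def average_unfused(movie1, movie2):
    # Naive schedule: two separate "kernels". The intermediate sum is
    # materialized in memory before the divide reads it back.
    tmp = movie1 + movie2
    return tmp / 2.0

def average_fused(movie1, movie2):
    # Fused schedule: one pass, each element read once and written once,
    # with no intermediate buffer (the role fusedOp plays on the GPU).
    out = np.empty_like(movie1)
    it = np.nditer([movie1, movie2, out],
                   op_flags=[['readonly'], ['readonly'], ['writeonly']])
    for a, b, o in it:
        o[...] = (a + b) / 2.0
    return out

x = np.linspace(0.0, 1.0, 8)
y = np.linspace(1.0, 2.0, 8)
assert np.allclose(average_unfused(x, y), average_fused(x, y))
```

Both schedules compute the same result; the difference is memory traffic, which dominates on bandwidth-bound GPU workloads.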

SLIDE 11

Performance portability on GPUs


SLIDE 12

Roadmap

– Motivating evidence
– The CogX project and vision
– Open-source availability

SLIDE 13

Vision

performance-portable, high-productivity programming for accelerators


SLIDE 14

CogX

What is CogX?

  • Domain-specific embedded language with an associated optimizing compiler and runtime
  • Array programming language embedded in a state-machine execution model
  • Targets advanced analytics workloads on massively parallel distributed systems
  • Design goals:
    – Optimal deployment on parallel hardware
    – Fast design iterations
    – Enforce scalability
    – Broad COTS hardware support
    – Compatible with shared infrastructure
    – High productivity for analysts and algorithm engineers

SLIDE 15

CogX compute model

  • Compute graphs:
    – Fields
    – Operators
    – Sensors/Actuators
    – Feedback/Time

[Compute graph diagram]
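These four ingredients can be sketched as a toy state machine (hypothetical class and method names, not the CogX API): fields hold state, operators are pure functions over fields, a sensor injects external data each tick, and feedback edges commit new field values atomically at the end of every step.

```python
import numpy as np

class ToyComputeGraph:
    def __init__(self, sensor):
        self.sensor = sensor     # callable: produces external data each tick
        self.fields = {}         # field name -> current value (state)
        self.feedback = {}       # field name -> update fn (the feedback edges)

    def field(self, name, init):
        self.fields[name] = init

    def feed(self, name, update):
        self.feedback[name] = update

    def step(self):
        # One tick: the sensor fires, operators read the *current* field
        # values, then all feedback edges commit atomically.
        self.fields['movie'] = self.sensor()
        nxt = {name: fn(self.fields) for name, fn in self.feedback.items()}
        self.fields.update(nxt)

g = ToyComputeGraph(sensor=lambda: np.ones((2, 2)))
g.field('background', np.zeros((2, 2)))
g.feed('background', lambda f: 0.999 * f['background'] + 0.001 * f['movie'])
g.step()   # background is now 0.001 everywhere
```

The atomic commit at the end of `step` is what makes time explicit: every operator in a tick sees the same consistent snapshot of the fields.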

SLIDE 16

CogX compute model

val movie = ColorMovie("courtyard.mp4")
val background = VectorField(movie.fieldShape, Shape(3))
val nextBackground = 0.999f * background + 0.001f * movie
background <== nextBackground
val suspicious = reduceSum(abs(movie - background))
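In plain terms, this model keeps an exponential moving average of the video as a background estimate, and `suspicious` measures how much the current frame deviates from it. A NumPy sketch of the same update rule (the frame data here is made up for illustration):

```python
import numpy as np

def step(background, frame):
    # One tick of the compute graph: the feedback edge
    # `background <== nextBackground` installs the new state.
    next_background = 0.999 * background + 0.001 * frame
    # Total absolute deviation of the frame from the background model.
    suspicious = float(np.sum(np.abs(frame - background)))
    return next_background, suspicious

# For a static scene the background converges to the frame,
# so the suspicion signal decays toward zero.
frame = np.full((4, 4, 3), 0.5)      # stand-in for one video frame
background = np.zeros_like(frame)
history = []
for _ in range(5000):
    background, suspicious = step(background, frame)
    history.append(suspicious)
```

Anything that moves against a converged background (an intruder crossing the courtyard) produces a spike in the suspicion signal.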

SLIDE 17

Demo: Hello World application


SLIDE 18

CogX compute model

val movie = ColorMovie("courtyard.mp4")

[Compute graph: ColorMovie sensor → movie_t]

SLIDE 19

CogX compute model

val background = VectorField(movie.fieldShape, Shape(3))

[Compute graph: ColorMovie sensor → movie_t; background_t field added]

SLIDE 20

CogX compute model

val nextBackground = 0.999f * background + 0.001f * movie

[Compute graph: background_t * 0.999f and movie_t * 0.001f feed + → nextBackground_t]

SLIDE 21

CogX compute model

background <== nextBackground

[Compute graph: the feedback edge routes nextBackground_t to background_t+1]

SLIDE 22

CogX compute model

val suspicious = reduceSum(abs(movie - background))

[Compute graph: movie_t - background_t → abs → reduceSum → suspicious_t, alongside the background update]

SLIDE 23

CogX compute model

[Compute graph unrolled over time: background_0 = 0; for each frame, movie_t and background_t feed * 0.999f, * 0.001f and + to produce background_t+1, and abs with reduceSum to produce suspicious_t]

SLIDE 24

Opportunities for optimization

[Compute graph: ColorMovie sensor, the background update (* 0.999f, * 0.001f, +, feedback) and the suspicion path (abs, reduceSum)]

SLIDE 25

Opportunities for optimization

Initially: 6 separate device kernels.

[Compute graph with each operator running as its own device kernel]

SLIDE 26

Opportunities for optimization

After a “single-output” kernel fuser pass: 2 device kernels remain.

[Compute graph with operators grouped into two device kernels]

SLIDE 27

Opportunities for optimization

After a “multi-output” kernel fuser pass: only a single device kernel remains.

[Compute graph with all operators grouped into one device kernel]

SLIDE 28

CogX compiler: translating CogX to OpenCL with kernel fusion

User CogX model (Scala) → syntax tree (ops, fields) → kernel circuit (kernels, field buffers) → optimized kernel circuit (merged kernels), via parsing, OpenCL code generation, and optimizations including kernel fusion.

CogX code snippet:

val A = ScalarField(10,10)
val B = ScalarField(10,10)
val C = A * B
val D = ScalarField(10,10)
val E = C + D

[Diagram: an OpenCL multiply kernel (A * B → C) feeding an OpenCL add kernel (C + D → E) is fused into a single OpenCL multiply/add kernel reading A, B and D directly and producing E]
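The idea behind the single-output pass can be sketched as a toy graph rewriter (far simpler than the real compiler pass): merge any node whose result has exactly one consumer into that consumer, splicing the producer's inputs in place of its output.

```python
def consumers(graph, name):
    # Nodes whose input list mentions `name`.
    return [out for out, (_, ins) in graph.items() if name in ins]

def fuse_single_output(graph):
    """Merge every node with exactly one consumer into that consumer."""
    fused = dict(graph)
    for name in list(graph):
        if name not in fused:
            continue
        cons = consumers(fused, name)
        if len(cons) != 1:
            continue
        op, ins = fused[cons[0]]
        prod_op, prod_ins = fused.pop(name)
        i = ins.index(name)
        fused[cons[0]] = (prod_op + '_' + op, ins[:i] + prod_ins + ins[i + 1:])
    return fused

# The slide's snippet: C = A * B; E = C + D.
graph = {'C': ('mul', ['A', 'B']),
         'E': ('add', ['C', 'D'])}
fused = fuse_single_output(graph)
# One fused multiply/add kernel remains, reading A, B, D directly:
# {'E': ('mul_add', ['A', 'B', 'D'])}
```

A real fuser must also check that the merged kernels are elementwise-compatible and that no other node needs the intermediate buffer; this sketch only captures the graph rewrite.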

SLIDE 29

CogX core functions and operators

  • Basic operators
  • +, -, *, /, %
  • Logical operators
  • >, >=, <, <=, ===, !===
  • Pointwise functions
  • cos, cosh, acos
  • sin, sinh, asin
  • tan, tanh, atan2
  • sq, sqrt, log, signum
  • pow, reciprocal
  • exp, abs, floor
  • Comparison functions
  • max, min
  • Shape manipulation
  • flip, shift, shiftCyclic
  • transpose, subfield
  • expand, select, stack
  • matrixRow, reshape
  • subfields, trim
  • vectorElement, vectorElements
  • transposeMatrices
  • transposeVectors
  • replicate, slice
  • FFT/DCT
  • fft, fftInverse
  • fftRI, fftInverseRI
  • fftRows, fftInverseRows
  • fftColumns, fftInverseColumns
  • dct, dctInverse, dctTransposed
  • dctInverseTransposed
  • Complex numbers
  • phase, magnitude, conjugate
  • realPart, imaginaryPart
  • Convolution-like
  • crossCorrelate, crossCorrelateSeparable

  • convolve, convolveSeparable
  • projectFrame, backProjectFrame
  • crossCorrelateFilterAdjoint
  • convolveFilterAdjoint
  • Gradient/divergence
  • backwardDivergence
  • backwardGradient
  • centralGradient
  • forwardGradient
  • Linear algebra
  • dot, crossDot
  • reverseCrossDot
  • Debugging
  • probe
  • Type coercion
  • toScalarField, toVectorField
  • toMatrixField, toComplexField
  • toComplexVectorField, toColorField
  • toGenericComplexField
  • Type construction
  • complex, polarComplex
  • vectorField, complexVectorField
  • matrixField, colorField
  • Reductions
  • reduceSum, blockReduceSum
  • reduceMin, blockReduceMin
  • reduceMax, blockReduceMax
  • fieldReduceMax, fieldReduceMin
  • fieldReduceSum, fieldReduceMedian
  • Normalizations
  • normalizeL1, normalizeL2
  • Resampling
  • supersample, downsample, upsample
  • Special operators
  • winnerTakeAll
  • random
  • solve
  • transform
  • warp
  • <==
SLIDE 30

CogX software stack

Application

CogX debugger

CogX compiler and standard library

Neural network toolkit, sandbox toolkit, I/O toolkit, Scala CogX runtime, C++ CogX runtime, HDF5 loader, JOCL, OpenCL, HDF5, cluster package (Apache Mesos)

[Stack diagram legend: CogX core, CogX libraries/toolkits, external libraries]

Applications are written by users

− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions

SLIDE 31

CogX toolkit functions

  • Computer Vision
  • Annotation tools
  • Color space transformations
  • Polynomial dense optic flow
  • Segmentation
  • Solvers
  • Boundary-gated nonlinear diffusion
  • FISTA solver (with sub-variants)
  • Golden section solver
  • Incremental k-means implementation
  • LSQR solver (with sub-variants)
  • Poisson solver (with sub-variants)
  • Filtering
  • Contourlets
  • 4 frequency-domain filters
  • Mathematical morphology operators
  • 27 space-domain filters (from a simple box filter up to local polynomial expansion and steerable Gabor filters)
  • Steerable pyramid filter
  • Wavelets
  • Variants of whitening transforms
  • Contrast normalization
  • Domain transfer filter
  • Gaussian pyramid
  • Monogenic phase congruency
  • Dynamical Systems
  • Kalman filter
  • Linear system modeling support
  • CPU matrix pseudo-inverse
  • Statistics
  • Normal and uniform distributions
  • Histograms
  • Moment calculations
  • Pseudo-random number generator sensors

SLIDE 32

Labeling Dynamic Ordinal Depth

Goal: “direct” readout of “in front of”, “behind”, “emerging”, or “disappearing” in video streams
– Scene segmentation based on motion signals only (not contrast edges, stereo, ...)
– Uses CogX, software from HPE Labs, to maximize use of GPUs
– Near real-time processing, ~2 fps on an HP Z820 workstation; some processing in CPU kernels

[Pipeline diagram: video stream → optic flow → discretized motion → motion onset/offset; motion field preprocessing → motion regions → region processing (region properties, region traces) → boundary ownership → occlusion status → occluders → region completion → ordinal depth]

Visualizing ordinal depth and occlusions. Unoccluded moving parts of an object are highlighted. Occluder is marked in red.

SLIDE 33

Functional Control Flow of CogMO Algorithm

CogMO – Ordinal Depth

[Flow: optic flow → enumerating motion surfaces → motion surfaces → assigning boundary ownership → ordinal depth]

SLIDE 34

Video: CogMO algorithm


SLIDE 35

Roadmap

– Motivating evidence
– The CogX project and vision
– Open-source availability

SLIDE 36

HPE Cognitive Computing Toolkit

Application

CogX debugger

CogX compiler and standard library

Neural network toolkit, sandbox toolkit, I/O toolkit, Scala CogX runtime, C++ CogX runtime, HDF5 loader, JOCL, OpenCL, HDF5, cluster package (Apache Mesos)

[Stack diagram legend: CogX core, CogX libraries/toolkits, external libraries]

Applications are written by users

− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions

SLIDE 37

HPE Cognitive Computing Toolkit

Application

CogX debugger

CogX compiler and standard library

Neural network toolkit, sandbox toolkit, I/O toolkit, Scala CogX runtime, HDF5 loader, JOCL, OpenCL, HDF5

[Stack diagram legend: CogX core, CogX libraries/toolkits, external libraries]

Applications are written by users

− Introductory and training examples for single-GPU and distributed computation
− Performance benchmarks covering the core and neural network package
− Several larger-scale demo applications integrating multiple CogX functions

SLIDE 38

High-level comparison

CogX vs. TensorFlow

Core data abstraction
– CogX: tensor fields (single precision, restrictions on dimensions)
– TensorFlow: tensors (typed multi-dimensional arrays)

Core compute abstraction
– CogX: OpenCL functions emitted and compiled at runtime; user kernels
– TensorFlow: C++/CUDA functions compiled into the TensorFlow project

Graph optimizations
– CogX: kernel fusion
– TensorFlow: not available

Distribution across GPUs
– CogX: simulated annealing placer
– TensorFlow: unreleased (graph partitioning, greedy placer)

Debugging
– CogX: single-step runtime debugging; text-based profiler
– TensorFlow: non-interactive log file parser; better graph visualization; unreleased profiler

Automatic differentiation
– CogX: supported as a library for neural-network-specific operations
– TensorFlow: supported by most of the core API

Fault tolerance
– CogX: not yet implemented
– TensorFlow: automatic check-pointing and restart of the graph

Control flow
– CogX: not yet implemented
– TensorFlow: predicated execution

Runtime optimization
– CogX: not yet implemented
– TensorFlow: interleaved processing of iterations, placer


SLIDE 40

TensorFlow plugin: high productivity, high performance

[Pipeline: simple Python API → Protobuf intermediate representation → optimizer → CUDA generator / C generator → TensorFlow custom op; packaged as a Python plugin to TensorFlow]

SLIDE 41

TensorFlow plugin: a familiar programming model


Example: element-wise L2 norm of three 2 x 2 tensors

out[pos] = sqrt(in_0[pos]*in_0[pos] + … + in_2[pos]*in_2[pos])

[Diagram: input tensors and workgroup shape map each position to the output tensor]

SLIDE 42

TensorFlow plugin: high productivity, high performance


High productivity:

def op(in0, in1, in2):
    pos = position_in(in0.shape)
    out = output_like(in0)
    a = in0[pos]
    b = in1[pos]
    c = in2[pos]
    out[pos] = sqrt(a*a + b*b + c*c)
    return out

High performance: [benchmark chart]
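For reference, the same computation written against plain NumPy (no plugin machinery), which the generated CUDA/C kernel is functionally equivalent to:

```python
import numpy as np

def l2norm3(in0, in1, in2):
    # Element-wise L2 norm across three same-shaped tensors: the
    # computation the plugin op above expresses per-position.
    return np.sqrt(in0 * in0 + in1 * in1 + in2 * in2)

a = np.full((2, 2), 3.0)
b = np.full((2, 2), 4.0)
c = np.full((2, 2), 12.0)
result = l2norm3(a, b, c)   # every element is sqrt(9 + 16 + 144) = 13.0
```

The plugin version trades this eager, whole-array formulation for a per-position kernel body that the generators can compile to a single fused GPU pass.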

SLIDE 43

Ben Chandler benjamin.chandler@hpe.com
