High Performance Computing @ AUB - GradEx Workshop - Mher Kazandjian


SLIDE 1

High Performance Computing @ AUB

GradEx Workshop

Mher Kazandjian

American University of Beirut

November 2018

SLIDE 2

How is this talk structured?

  • History of computing
  • Scientific computing workflows
  • Computer architecture overview
  • Do's and Don'ts
  • Demos and walkthroughs
SLIDE 3

Goals

  • Demonstrate how you (as users) can benefit from AUB's HPC facilities
  • Attract users, because:
  • we want to boost scientific computing research
  • we want to help you
  • we have capacity

This presentation is based on actual feedback and use cases collected from users over the past year

SLIDE 4

History of computing

Alan Turing 1912-1954

SLIDE 5

Growth over time

12 orders of magnitude since 1960

SLIDE 6

Growth over time

~12 orders of magnitude since 1960

For the $1000 you could spend on hardware in 1970, the same money today buys hardware that can do ~10^12 times more calculations

SLIDE 7

What is HPC used for today?

  • Solving scientific problems
  • Data mining and deep learning
  • Military research and security
  • Cloud computing
  • Blockchain (cryptocurrency)
SLIDE 8

What is HPC used for today?

  • https://blog.openai.com/ai-and-compute/
SLIDE 9

Multicores hit the markets in ~2005


Growth over time

Users at home started benefiting from parallelism

Prior to that, applications that scaled well were restricted to mainframes / datacenters and HPC clusters

SLIDE 10

HPC @ AUB in 2006

8 compute nodes, specs per node:

  • 4 cores
  • 8 GB RAM

~ 80 GFlops

SLIDE 11

HPC is all about scalability

  • The high-speed network is the "most" important component

SLIDE 12

But what is scalability?

Performance improvements as the number of cores (resources) increases for the same problem size

  • this is known as strong scalability
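A common way to quantify this (standard terminology, not spelled out on the slide): speedup on p cores = T_serial / T_parallel(p), and ideal (linear) scaling means the speedup equals p.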
SLIDE 13

But what is scalability?

This is a CPU under a microscope

SLIDE 14

But what is scalability?

Prog.exe on 1 processor: 2 sec (serial runtime = T_serial)

SLIDE 15

But what is scalability?

Prog.exe on 2 processors: 1 sec (parallel runtime = T_parallel)

SLIDE 16

But what is scalability?

Prog.exe on 4 processors: 0.5 sec (parallel runtime = T_parallel)

SLIDE 17

But what is scalability?

Prog.exe on 4 processors: 0.5 sec. Very nice!! But this is usually never the case.
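In this idealized example the speedup is T_serial / T_parallel = 2 sec / 0.5 sec = 4 on 4 processors, i.e. perfect linear scaling; real applications almost always fall short of this.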

SLIDE 18

First demo – First scalability diagram

SLIDE 19

But what is scalability?

Repeat the same process across multiple processors

[diagram: many independent copies of Prog.exe, one per processor]

SLIDE 20

But what is scalability?

Wait!

  • how do these processors talk to each other?
  • how much data needs to be transferred for a certain task?
  • how fast do the processes communicate with each other?
  • how often should the processes communicate with each other?

[diagram: many independent copies of Prog.exe, one per processor]
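A common back-of-the-envelope answer to these questions (a standard model, not from the slides): the time to move a message is roughly latency + message size / bandwidth, so many small messages are dominated by latency while a few large ones are dominated by bandwidth.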

SLIDE 21

At the single chip level

Through the cache memory of the CPU

  • Typical latency: ~ ns (or less)
  • Typical bandwidth: > 150 GB/s

SLIDE 22

At the single chip level

Through the RAM (Random Access Memory)

  • Typical latency: ~ a few to tens of ns
  • Typical bandwidth: ~ 10 to 50 GB/s (sometimes more)

https://ark.intel.com/#@Processors

SLIDE 23

Second demo: bandwidth and some lingo

  • An array is just a bunch of bytes
  • Bandwidth is the speed with which information is transferred
  • A floating-point number (double precision) is 8 bytes
  • an array of one million elements is 1000 x 1000 x 8 bytes = 8 MB
  • if I measure the time to initialize this array I can measure how fast the CPU can access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
  • bandwidth = size of array / time to initialize it (see the sketch below)
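A minimal C sketch of this idea (an illustration written for this transcript, not the actual demo code): it times the first-touch initialization of a one-million-element array of doubles and reports the implied bandwidth.

    /* sketch: estimate memory bandwidth by timing array initialization */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1000 * 1000;               /* one million doubles ~ 8 MB */
        double *a = malloc(n * sizeof(double));
        if (a == NULL)
            return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)              /* visit every element once */
            a[i] = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gb = n * sizeof(double) / 1e9;
        printf("initialized %.4f GB in %.6f s -> %.2f GB/s (a[0]=%g)\n",
               gb, dt, gb / dt, a[0]);

        free(a);
        return 0;
    }

Compiled with something like gcc -O2; note that the first touch also pays page-fault overhead, so timing a second pass over the same array usually gives a number closer to the hardware limit.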
SLIDE 24

Second demo: bandwidth and some lingo

  • An array is just a bunch of bytes
  • Bandwidth is the speed with which information is transferred
  • A floating-point number (double precision) is 8 bytes
  • an array of one million elements is 1000 x 1000 x 8 bytes = 8 MB
  • if I measure the time to initialize this array I can measure how fast the CPU can access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
  • bandwidth = size of array / time to initialize it

Intel i7-6700HQ

  • https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz-
  • Advertised bandwidth = 34 GB/s
  • measured bandwidth (quick single-thread test) = 22.8 GB/s
SLIDE 25

At the single motherboard level

Through the RAM (Random Access Memory)

  • Typical latency: ~ a few to tens of ns
  • Typical bandwidth: ~ 10 to 100 GB/s (sometimes more)

Through QPI (QuickPath Interconnect)

  • typical latency for small data ~ ns
  • typical bandwidth 100 GB/s

TIP: server = node = compute node = NUMA node

SLIDE 26

Second demo: bandwidth multi-threaded

  • https://github.com/jeffhammond/STREAM

https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI

  • 2 x sockets, expected bandwidth ~102 GB/s
  • measured ~ 75 GB/s
  • on a completely idle node ~95 GB/s is possible

Another benchmark: 2-socket Intel Xeon server
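A rough multi-threaded version of the same measurement (a sketch in the spirit of STREAM, written for this transcript; it is not the STREAM benchmark itself):

    /* sketch: OpenMP-threaded copy kernel to estimate aggregate memory bandwidth */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 50L * 1000 * 1000;           /* 50M doubles ~ 400 MB per array */
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        if (a == NULL || b == NULL)
            return 1;

        #pragma omp parallel for
        for (long i = 0; i < n; i++) {              /* parallel first touch */
            a[i] = 1.0;
            b[i] = 0.0;
        }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < n; i++)                /* copy: one read + one write per element */
            b[i] = a[i];
        double dt = omp_get_wtime() - t0;

        double gb = 2.0 * n * sizeof(double) / 1e9; /* bytes moved: read a + write b */
        printf("threads=%d copy bandwidth=%.1f GB/s (b[0]=%g)\n",
               omp_get_max_threads(), gb / dt, b[0]);

        free(a);
        free(b);
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. gcc -O2 -fopenmp) and run with OMP_NUM_THREADS set to the core count; the 75-95 GB/s figures quoted above are the kind of numbers such a kernel reports on a 2-socket node of that generation.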

SLIDE 27

At the cluster level (multiple nodes)

Through the network (Ethernet)

  • Typical latency: ~ 10 to 100 microseconds
  • Typical bandwidth: ~ 100 MB/s to a few hundred MB/s

SLIDE 28

At the cluster level (multiple nodes)

Through the network (InfiniBand, a high-speed network)

  • Typical latency: a few microseconds down to < 1 microsecond
  • Typical bandwidth: > 3 GB/s

Benefits over Ethernet:

  • Remote direct memory access
  • higher bandwidth
  • much lower latency

https://en.wikipedia.org/wiki/InfiniBand

SLIDE 29

What hardware do we have at AUB?

  • Arza:
  • 256 core, 1 TB RAM IBM cluster
  • production simulations, benchmarking
  • http://website.aub.edu.lb/it/hpc/Pages/home.aspx
  • vLabs
  • see Vassili’s slide
  • very flexible, easy to manage, windows support
  • public cloud
  • infinite resources – limited by $$$
  • two pilot projects being tested – will be open soon for testing
SLIDE 30

Parallelization libraries / software

SMP parallelism

  • OpenMP
  • CUDA
  • Matlab
  • Spark (recently deployed and tested)

distributed parallelism (cluster wide)

  • MPI
  • Spark
  • MPI + OpenMP (hybrid)
  • MPI + CUDA
  • MPI + CUDA + OpenMP
  • Spark + CUDA (not tested – any volunteers?)
SLIDE 31

Linux/Unix culture

> 99% of HPC clusters worldwide use some kind of Linux / Unix

  • Clicking your way to install software is easy for you (on Windows or Mac), but a nightmare for power users.

  • Linux is:
  • open-source
  • free
  • secure (at least much more secure than Windows et al.)
  • no need for an antivirus that slows down your system
  • respects your privacy
  • huge community support in scientific computing
  • 99.8% of all HPC systems worldwide since 1996 are non-Windows machines

https://github.com/mherkazandjian/top500parser

SLIDE 32

Software stack on the HPC cluster

  • Matlab
  • C, Java, C++, Fortran
  • Python 2 and Python 3
  • Jupyter notebooks
  • TensorFlow (deep learning)
  • Scala
  • Spark
  • R
  • RStudio, R server (new)
SLIDE 33

Cluster usage: Demo

  • The scheduler: resource manager
  • bjobs
  • bqueues
  • bhosts
  • lsload
  • important places
  • /gpfs1/my_username
  • /gpfs1/apps/sw
  • basic Linux knowledge
  • sample job script
SLIDE 34

Cluster usage: Documentation

https://hpc-aub-users-guide.readthedocs.io/en/latest/

https://github.com/hpcaubuserguide/hpcaub_userguide

The guide is for you:

  • we want you to contribute to it directly
  • please send us pull requests
SLIDE 35

Cluster usage: Job scripts

https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

SLIDE 36

Cluster usage: Job scripts

https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

In the user guide, there are samples and templates for many use cases:

  • we will help you write your own if your use case is not covered
  • this is 90% of the getting started task
  • recent success story:
  • Spark server job template
SLIDE 37

Cluster usage: Job scripts

https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

SLIDE 38

How to benefit from the HPC hardware?

  • run many serial jobs that do not need to communicate
  • aka embarrassingly parallel jobs

(nothing embarrassing about it though, as long as you get your job done)

  • e.g.
  • train several neural networks with different layer numbers
  • do a parameter sweep for a certain model

./my_prog.exe --param 1 &
./my_prog.exe --param 2 &
./my_prog.exe --param 3 &

These would execute simultaneously

  • difficulty: very easy
SLIDE 39

How to benefit from the HPC hardware?

  • run many serial jobs that do not need to communicate

Demo

SLIDE 40

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • e.g.
  • Matlab
  • C/C++/Python/Java

Difficulty: very easy to medium (problem dependent)

SLIDE 41

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • C (see the OpenMP sketch below)
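A minimal OpenMP example in C (an illustration written for this transcript; the slide's original code was not captured): the same loop runs on one or many threads depending on OMP_NUM_THREADS.

    /* sketch: sum an array with OpenMP threads on a single node */
    #include <stdio.h>
    #include <omp.h>

    #define N (10 * 1000 * 1000)                    /* 10M doubles ~ 80 MB */

    static double a[N];

    int main(void)
    {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)                /* initialize in parallel */
            a[i] = 1.0;

        double t0 = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)                /* each thread sums a chunk */
            sum += a[i];
        double dt = omp_get_wtime() - t0;

        printf("threads=%d sum=%.1f time=%.5f s\n",
               omp_get_max_threads(), sum, dt);
        return 0;
    }

Compiled with e.g. gcc -O2 -fopenmp; running it with OMP_NUM_THREADS=1, 2, 4, ... is exactly the kind of measurement the scalability diagrams earlier in the talk are built from.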
SLIDE 42

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • C
SLIDE 43

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • Demo:

Matlab parfor

SLIDE 44

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • Demo:

Matlab parfor

SLIDE 45

How to benefit from the HPC hardware?

  • run an SMP parallel program (i.e. on one node, using threads)
  • Demo:

Matlab parfor

SLIDE 46

How to benefit from the HPC hardware?

  • run a hybrid MPI + OpenMP parallel job
  • Demo:

Gauß (astrophysics N-Body code) scalability diagram

  • single node

[diagram: one MPI process per node with several OpenMP threads inside it]

SLIDE 47

How to benefit from the HPC hardware?

  • hybrid MPI + OpenMP parallel job

Gauß (astrophysics N-Body code) scalability diagram

  • single node

[diagram: one MPI process per node with several OpenMP threads inside it; see the sketch below]
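A minimal hybrid MPI + OpenMP skeleton in C (an illustration written for this transcript, not the Gauß code): MPI splits the work across nodes and OpenMP threads split each rank's share across the cores of its node.

    /* sketch: hybrid MPI + OpenMP program, one MPI rank per node, threads within it */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* request an MPI library that tolerates OpenMP threads */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const long n = 1000000;
        long local = 0, total = 0;

        /* each rank handles a strided slice of the work, threaded with OpenMP */
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < n; i += nranks)
            local += 1;

        /* combine the per-rank results across the cluster */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d total=%ld (expected %ld)\n",
                   nranks, omp_get_max_threads(), total, n);

        MPI_Finalize();
        return 0;
    }

Built with the MPI compiler wrapper plus OpenMP flags (e.g. mpicc -O2 -fopenmp) and launched through the scheduler with one rank per node; OMP_NUM_THREADS then controls how many threads each rank uses.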

SLIDE 48

How to benefit from the HPC hardware?

  • run a deep learning job
  • Demo:

TensorFlow

SLIDE 49

How to benefit from the HPC hardware?

  • run a deep learning job
  • Demo:

TensorFlow

SLIDE 50

How to benefit from the HPC hardware?

  • Jupyter notebooks (connect through web interface)
  • R-server (connect through web interface)
  • Spark (full cluster configuration – up to 1 TB RAM usage)
  • Map-Reduce
SLIDE 51

How to benefit from the HPC hardware?

  • Jupyter notebooks (connect through web interface)
  • R-server (connect through web interface)
  • Spark (full cluster configuration – up to 1 TB RAM usage)
  • Map-Reduce
SLIDE 52

How to benefit from the HPC hardware?

  • Optimal performance benefits
  • Go low level
  • C/Fortran/C++
  • need to have good design
  • good understanding of architecture
  • Currently, the only customers running such codes are:
  • Chemistry department research group
  • Physics department
  • Computer science research group getting involved too
SLIDE 53

Workflows: best practices

  • prototype on your machine:

laptop, terminal/workstation at your department

  • when you think your job could benefit from HPC resources:
  • talk to us (we can help you assess your program better)
  • prepare a clean prototype
  • we will provide you with a pilot project access
  • tune / parallelize your application [ we can help you with that if needed ]
  • run production jobs
  • if you need specific hardware that is not available on campus:
  • go to the cloud
  • ideal for testing / benchmarking your code/app on the latest and the greatest hardware
SLIDE 54

Containers

  • Run on top of the kernel
  • zero overhead
  • portable
  • currently:
    + deep learning containers are used on the cluster
    + R Studio server (new)
  • we can help you produce a custom container tailored to your problem
    + you can create your own container too (no need for admin rights)
  • pros:
    + reproducibility and portability
  • cons:
    + you must be a bit of a geek to set up a container and willing to put in the effort (lots of help available online though)

SLIDE 55

Thank you for attending!