November 2018
High Performance Computing @ AUB
American University of Beirut
GradEx Workshop
Mher Kazandjian
How this talk is structured:
- History of computing
- Scientific computing workflows
- Computer architecture overview
Alan Turing 1912-1954

If you had $1000 in 1970, hardware costing the same today would perform ~10^12 times more calculations.


Multicore CPUs hit the market around 2005.
Users at home started benefiting from parallelism.
Prior to that, applications that scaled well were confined to mainframes, datacenters, and HPC clusters.
8 compute nodes Specs per node
Performance improvements as the number of cores (resources) increases for the same problem size
This is a CPU under a microscope
Prog.exe on 1 core: 2 sec (serial runtime = T_serial)
Prog.exe on 2 cores: 1 sec (parallel runtime = T_parallel)
Prog.exe on 4 cores: 0.5 sec (parallel runtime = T_parallel)
Very nice! But this ideal (linear) speedup is usually never achieved in practice.
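The gap between ideal and real speedup is commonly described by Amdahl's law: if a fraction s of the runtime is inherently serial, the speedup on p cores is bounded by 1 / (s + (1 - s)/p). A minimal sketch (the 5% serial fraction is an illustrative assumption, not a number from the slides):

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    # Amdahl's law: upper bound on the speedup when a fraction
    # of the runtime cannot be parallelized.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Example: a program that is 5% serial (hypothetical value).
# Even on 64 cores the speedup saturates well below 64x.
for p in (1, 2, 4, 8, 64):
    print(f"{p} cores: at most {amdahl_speedup(0.05, p):.2f}x speedup")
```

Note how the bound flattens quickly: the serial fraction, not the core count, ends up dominating.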
Repeat the same process across multiple processors
Wait!
Through the cache memory of the CPU: typical latency ~1 ns (or less), typical bandwidth > 150 GB/s.
Through the RAM (Random Access Memory): typical latency ~ a few to tens of ns, typical bandwidth ~10 to 50 GB/s (sometimes more). https://ark.intel.com/#@Processors
access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
Intel i7-6700HQ
Through QPI (Quick Path Interconnect): the link between CPU sockets on a multi-socket server.
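A quick way to feel these numbers is to time how long writing a large buffer takes, since setting every byte forces traffic to RAM. A minimal pure-Python sketch (the 200 MB size is an arbitrary choice; the measured figure depends entirely on your machine and includes interpreter overhead, so expect it to sit below the hardware peak):

```python
import time

n_bytes = 200_000_000          # 200 MB, arbitrary illustrative size
chunk = b"\xff" * 1_000_000    # 1 MB write pattern
buf = bytearray(n_bytes)
view = memoryview(buf)

t0 = time.perf_counter()
for off in range(0, n_bytes, len(chunk)):
    view[off:off + len(chunk)] = chunk   # visit every address in the buffer
elapsed = time.perf_counter() - t0

bandwidth_gbs = n_bytes / elapsed / 1e9
print(f"effective write bandwidth: {bandwidth_gbs:.1f} GB/s")
```

Comparing the printed number against the cache and RAM figures above shows which level of the memory hierarchy the workload is actually bound by.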
TIP: server = node = compute node; each CPU socket is a NUMA node, connected via QPI.
https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GT s-Intel-QPI
Another benchmark: a 2-socket Intel Xeon server.
Through the network (Ethernet): typical latency ~10 to 100 microseconds, typical bandwidth ~100 MB/s to a few hundred MB/s.
Through the network (InfiniBand, a high-speed interconnect): typical latency ~ a few microseconds down to < 1 microsecond, typical bandwidth > 3 GB/s. Benefits over Ethernet:
https://en.wikipedia.org/wiki/InfiniBand
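To see what these bandwidth figures mean in practice, compare the time to move 1 GB of data over each level of the hierarchy (using representative mid-range values from the slides above):

```python
GB = 1e9  # bytes

# representative mid-range bandwidths in bytes/s, taken from the slides
links = {
    "ethernet (100 MB/s)": 100e6,
    "infiniband (3 GB/s)": 3e9,
    "RAM (30 GB/s)": 30e9,
}

for name, bw in links.items():
    print(f"{name}: {GB / bw:.3f} s per GB")
```

Moving the same gigabyte is ~30x faster over InfiniBand than over the Ethernet figure, and ~300x faster when it never leaves local RAM, which is why data movement, not arithmetic, often dominates cluster-wide runs.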
SMP parallelism
distributed parallelism (cluster wide)
> 99% of HPC clusters worldwide run some kind of Linux / Unix.
https://github.com/mherkazandjian/top500parser
The guide is for you: https://hpc-aub-users-guide.readthedocs.io/en/latest/ (source: https://github.com/hpcaubuserguide/hpcaub_userguide)
https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html
In the user guide there are samples and templates for many use cases: https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html
(nothing embarrassing about it, though, as long as you get your job done)
./my_prog.exe --param 1 &
./my_prog.exe --param 2 &
./my_prog.exe --param 3 &
wait    # do not exit until all three background runs finish
These would execute simultaneously
Demo
Difficulty: very easy to medium (problem dependent)
matlab parfor
Gauß (astrophysics N-Body code) scalability diagram
(diagram: one MPI process per node, OpenMP threads within each node)
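The hybrid MPI + OpenMP pattern splits the problem domain across MPI ranks, with each rank then using threads locally. A pure-Python sketch of the rank-level decomposition (no MPI runtime needed; in a real run each rank computes only its own slice, and the body counts and rank count here are illustrative):

```python
n_bodies = 10   # illustrative problem size
n_ranks = 4     # illustrative number of MPI ranks

def local_slice(rank, n, size):
    # Contiguous block decomposition; the remainder is spread
    # over the first `extra` ranks so sizes differ by at most 1.
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# In a real code each rank runs this once with its own rank id;
# here we loop over ranks just to show the full decomposition.
for rank in range(n_ranks):
    start, stop = local_slice(rank, n_bodies, n_ranks)
    print(f"rank {rank} owns bodies [{start}, {stop})")
```

Within each rank's slice, OpenMP (or any local threading) then parallelizes the per-body force loop, which is what the scalability diagram above is measuring.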
TensorFlow
laptop, terminal/workstation at your department
+ deep learning containers are used on the cluster
+ RStudio Server (new)
+ you can create your own container too (no need for admin rights)
+ reproducibility and portability
+ you must be a bit of a geek to set up a container and be willing to put in the effort (lots of help is available online, though)