Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC
- Dr. Fu Li
li@qcftech.com Quantum Cloud Future (Beijing) Technologies Co., Ltd.
Cook Book
1. What is Massive Core System (MCS)?
1.1. HPC system
1.2. GPU system
1.3. MicroSlides: Fabric-based SoC
3.1. MPI and OpenMP in HPC
3.2. Memory coalescing and cudaDMA in GPU computing
4.1. The hardware (Socionext)
4.2. The architecture
4.3. The result (ARM vs x86 vs GPU)
Figure (topic map), labels: Quantum Theory and Spectroscopy; Molecular Dynamics; Fast Fourier Transform; HPC; Content-Centric Networking; Cloud Storage; Doppler; ASIC; Boba; FPGA; MPI, OpenMP; CUDA; Statistical Mechanics; GPU switch; PacketShader
Introduction to Quantum Cloud
With a background in quantum calculation, we (1) run large-scale molecular dynamics simulations on HPC clusters using Amber and Gromacs, and (2) optimize Fourier transforms and matrix operations on multicore systems.
We then found that the GPU is a great tool for both molecular dynamics and matrix operations.
Later we found similar systems built from massive numbers of CPU cores.
Today we will show some practical examples of our scale-out algorithms on these systems.
System and Cores: Communication Matters
QCF & SOCIONEXT
Figure: systems placed by Number of Cores (1 to 100,000) and System Power Consumption in watts (10 to 1M).
General-purpose: PC, Server, Blade Server, Super Computer
Special-purpose: GPU, GPU Cluster
General-purpose: Microslides
Also shown: Traditional ARM Server, ARM SoC
Annotations: 2006, 2012, 2018; intra-CPU connection, inter-CPU connection, cluster connection
Data Communication Between Systems Is the Obstacle
Figure labels: cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking, cache/storage, I/O
A hierarchical structure is critical for the von Neumann architecture.
Levels of parallelism: instruction-level, OS-level, algorithm-level
Techniques: batch, share-nothing, stateless computing; big RAM; avoid context switching; TLB, cache-conscious; big.LITTLE; GPU, FPGA; fast cache, cache prefetch; vector processing, SIMD/AVX
Consolidation will be the next wave of innovation in chip design and system optimization.
Parallel and Scaling
Fabric-Based ARM SoC
From SOCIONEXT
Platform    Watts per core
ARM SoC     1
x86         16 to 25
GPU         0.3 to 0.5
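To make the per-core figures concrete, the sketch below multiplies them out for a fixed core count. Only the watts-per-core values come from the slide; the 1,000-core target and the function name `total_power_watts` are assumptions chosen for illustration. Cores on different platforms are not equivalent in performance, so this compares power draw only.

```python
# Watts per core as (low, high) ranges, taken from the slide.
WATTS_PER_CORE = {
    "ARM SoC": (1.0, 1.0),
    "x86": (16.0, 25.0),
    "GPU": (0.3, 0.5),
}


def total_power_watts(platform, cores):
    """Return the (low, high) total power in watts for `cores` cores."""
    low, high = WATTS_PER_CORE[platform]
    return (low * cores, high * cores)


if __name__ == "__main__":
    cores = 1000  # hypothetical core count, not from the slide
    for platform in WATTS_PER_CORE:
        low, high = total_power_watts(platform, cores)
        print(f"{platform}: {low:.0f} to {high:.0f} W for {cores} cores")
```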
Cluster Management Tools
PBS
  Basis: batch process
  Pros: very fast; very flexible; normally used with MPI
  Cons: no isolation
  Scenario: scientific calculation
KVM-based tool (name not given in the source)
  Basis: kvm
  Pros: very secure; very stable; system-level isolation
  Cons: high overhead; slow
  Scenario: private cloud
Kubernetes
  Basis: container
  Pros: fast; secure; production ready
  Cons: container app; not flexible enough
  Scenario: application CI
Mesos
  Basis: container/non-container
  Pros: fast; compatible with process and container; production ready; can be secure
  Cons: complexity
  Scenario: Datacenter OS
Share-Nothing + Message Queue Architecture
Stateless computing architecture
Figure labels: host, core, core, I/O core
Use a dedicated core to do I/O for the host to increase throughput.
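A minimal sketch of this pattern follows, under stated assumptions: threads stand in for cores, `queue.Queue` stands in for the message queue, and a Python list stands in for the disk or network. The names `compute_worker`, `io_worker`, and `run` are hypothetical and do not come from the slides. Compute workers keep no state between messages and never touch the output; the single I/O worker is the only writer.

```python
import queue
import threading

STOP = None  # sentinel message telling a worker to shut down


def compute_worker(tasks, results):
    """Stateless worker: read a message, compute, emit a message."""
    while True:
        item = tasks.get()
        if item is STOP:
            results.put(STOP)  # tell the I/O worker this worker is done
            break
        results.put((item, item * item))  # placeholder computation


def io_worker(results, sink, n_compute):
    """Dedicated I/O worker: the only place where output is written."""
    finished = 0
    while finished < n_compute:
        msg = results.get()
        if msg is STOP:
            finished += 1
        else:
            sink.append(msg)  # stands in for a disk or network write


def run(inputs, n_compute=3):
    tasks, results = queue.Queue(), queue.Queue()
    sink = []
    workers = [
        threading.Thread(target=compute_worker, args=(tasks, results))
        for _ in range(n_compute)
    ]
    io = threading.Thread(target=io_worker, args=(results, sink, n_compute))
    for t in workers + [io]:
        t.start()
    for x in inputs:
        tasks.put(x)
    for _ in workers:
        tasks.put(STOP)  # one sentinel per compute worker
    for t in workers + [io]:
        t.join()
    return sorted(sink)


if __name__ == "__main__":
    print(run(range(5)))
```

CPython threads do not execute Python bytecode in parallel, so this shows the message flow rather than a speedup. Mapping each worker to its own process, for example with the standard `multiprocessing` module, would be closer to one worker per core.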
Example: PacketShader on GPU
Example: Rendering on Arm
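Figure, panel Render@Baremetal: Intel vs ARM on the scenes buggy, fishy cat, bmps, teeglasFX, splash, poked
Figure, panel Render@Container: bmw27, classroom, and benchmark on bare metal and with 1, 2, and 4 containers
Results: 3x improvement under concurrency; 1.8x improvement under multi-instance concurrency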
Example: Rendering on Arm
Figure: Intel vs ARM SoC on three measures: performance, scaled 1, scaled 2
scaled 1: performance scaled by frequency and core count
scaled 2: performance scaled by frequency, core count, and watts
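The slide does not state the scaling formula. One plausible reading, shown here purely as an assumption, divides a higher-is-better score by the product of clock frequency and core count (scaled 1), and additionally by power draw (scaled 2). The function names and all input values below are hypothetical, not measurements from the talk.

```python
def scaled_1(performance, freq_ghz, cores):
    """Score per GHz per core (assumed reading of 'scaled 1')."""
    return performance / (freq_ghz * cores)


def scaled_2(performance, freq_ghz, cores, watts):
    """Score per GHz per core per watt (assumed reading of 'scaled 2')."""
    return scaled_1(performance, freq_ghz, cores) / watts


if __name__ == "__main__":
    # Hypothetical machine: score 100 at 2.0 GHz, 10 cores, 50 W.
    print(scaled_1(100.0, 2.0, 10))
    print(scaled_2(100.0, 2.0, 10, 50.0))
```

Under this reading, a platform with a lower raw score can still rank higher once frequency, core count, and power are factored out.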
Example: AI on Arm
Caffe@Container: ARM vs Intel vs GPU (scaled)
Figure: scaled results for three CIFAR-10 runs (CIFAR 10 - 1, CIFAR 10 - 2, CIFAR 10 - 3) on Intel, ARM, and GPU (1070)
Example: AI on Arm SoC
Figure: two panels, Training and Inference, each comparing Intel and SoC on caffe, scaled caffe, darknet, and scaled darknet
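Quantum Cloud Future (Beijing) Technologies Co., Ltd. (hereafter Quantum Cloud) is a vertical-industry cloud computing company focused on the film and television industry. Quantum Cloud specializes in moving the film and television industry to the cloud. Working with internationally known film and television companies and visual effects studios, it provides industry customers with one-stop solutions that include production software, graphics workstations, high-performance storage, and rendering services.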
ADDRESS: Room 2101, Office Tower A, Sanlitun SOHO, No. 8 Gongti North Road, Chaoyang District, Beijing
NUMBER: 010-53518265
EMAIL: info@lzyco.com
WEBSITE: www.lzyco.com