SLIDE 1

Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC

Dr. Fu Li
li@qcftech.com
Quantum Cloud Future (Beijing) Technologies Co., Ltd.

SLIDE 2

Quantum Cloud Future (Beijing) Technology Co. Ltd.

Cook Book

  • 1. What is a Massive Core System (MCS)?
      1.1. HPC system
      1.2. GPU system
      1.3. Microslides: fabric-based SoC
  • 2. Why is scale-out computing important in MCS?
  • 3. How to make MCS faster?
      3.1. MPI and OpenMP in HPC
      3.2. Memory coalescing and cudaDMA in GPU computing
  • 4. QCF's scale-out computing model for Microslides
      4.1. The hardware (Socionext)
      4.2. The architecture
      4.3. The result (ARM vs x86 vs GPU) (new)

SLIDE 3

[Diagram: topic map of the company's background. Labels: Quantum Theory and Spectroscopy; Statistical Mechanics; Molecular Dynamics; Fast Fourier Transform; HPC; MPI, OpenMP; CUDA; GPU switch; PacketShader; Content-Centric Networking; Cloud Storage; Doppler ASIC; Boba FPGA.]

Introduction to Quantum Cloud

With a background in quantum calculation, we 1) perform large-scale molecular dynamics simulations on HPC clusters using Amber and Gromacs, and 2) optimize Fourier transforms and matrix operations on multicore systems.

SLIDE 4

Introduction to Quantum Cloud

Then we found that the GPU is a great tool for both molecular dynamics and matrix operations.

SLIDE 5

Introduction to Quantum Cloud

Later we found similar systems built from massive numbers of CPU cores.

SLIDE 6

Introduction to Quantum Cloud

Today we will show some practical examples of our scale-out algorithms on these systems.

SLIDE 7

System and Cores: Communication Matters

[Figure (QCF & SOCIONEXT): number of cores (1 to 100,000) plotted against system power consumption (10 W to 1 MW). General-purpose systems shown: PC server, blade server, supercomputer.]

SLIDE 8

System and Cores: Communication Matters

[Figure (QCF & SOCIONEXT): number of cores (1 to 100,000) plotted against system power consumption (10 W to 1 MW). General-purpose systems shown: PC server, blade server, supercomputer. Special-purpose systems shown: GPU, GPU cluster.]

SLIDE 9

System and Cores: Communication Matters

[Figure (QCF & SOCIONEXT): number of cores (1 to 100,000) plotted against system power consumption (10 W to 1 MW). General-purpose systems shown: PC server, blade server, supercomputer. Special-purpose systems shown: GPU, GPU cluster. Also shown: traditional ARM server, ARM SoC.]

SLIDE 10

System and Cores: Communication Matters

[Figure (QCF & SOCIONEXT): number of cores (1 to 100,000) plotted against system power consumption (10 W to 1 MW). Systems shown: PC server, blade server, supercomputer, GPU, GPU cluster, traditional ARM server, ARM SoC, Microslides of ARM CPU, Microslides of ARM SoC. Category labels: special-purpose, general-purpose, general-purpose.]

SLIDE 11

System and Cores: Communication Matters

[Figure (QCF & SOCIONEXT): number of cores (1 to 100,000) plotted against system power consumption (10 W to 1 MW). Systems shown: PC server, blade server, supercomputer, GPU, GPU cluster, traditional ARM server, ARM SoC, Microslides of ARM CPU, Microslides of ARM SoC. Category labels: special-purpose, general-purpose, general-purpose. Annotations, in extraction order: 2006, 2018, 2012; intra-CPU connection, inter-CPU connection, cluster connection.]

SLIDE 12

Data Communication Between Systems Is the Obstacle

[Diagram: the same stack of components drawn twice (cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking), plus the labels cache/storage and I/O.]

Hierarchical structure is critical for the von Neumann architecture.

SLIDE 13

Data Communication Between Systems Is the Obstacle

[Diagram: the same stack of components drawn twice (cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking), plus the labels cache/storage and I/O.]

SLIDE 14

Data Communication Between Systems Is the Obstacle

[Diagram: the same stack of components drawn twice (cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking), plus the labels cache/storage and I/O. Parallelism levels marked: instruction-level, OS-level, algorithm-level.]

SLIDE 15

Data Communication Between Systems Is the Obstacle

[Diagram: the same stack of components drawn twice (cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking), plus the labels cache/storage and I/O. Parallelism levels marked: instruction-level, OS-level, algorithm-level.]

Techniques named on the slide: batch, share-nothing; stateless computing; big RAM; avoid context switching; TLB, cache-conscious; big.LITTLE; GPU, FPGA; fast cache, cache prefetch; vector processing, SIMD/AVX.

SLIDE 16

Data Communication Between Systems Is the Obstacle

[Diagram: the same stack of components drawn twice (cores, L1 cache, L2/L3 cache, intra-CPU fabric, sockets, bus, memory, networking), plus the labels cache/storage and I/O. Parallelism levels marked: instruction-level, OS-level, algorithm-level.]

Techniques named on the slide: batch, share-nothing; stateless computing; big RAM; avoid context switching; TLB, cache-conscious; big.LITTLE; GPU, FPGA; fast cache, cache prefetch; vector processing, SIMD/AVX.

Consolidation will be the next wave of innovation in chip design and system optimization:

  • IO consolidation: networking, bus, fabric
  • storage consolidation: memory, cache, networking buffer

SLIDE 17

Parallel and Scaling

SLIDE 18

Fabric-Based ARM SoC

From SOCIONEXT

  • PCIe fabric for networking
  • 768 cores
  • c2c: 10 Gbps, 36 µs latency
  • 1 TB DDR4 RAM
  • 700 W TDP per chassis
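
To get a feel for what the link figures above imply, a simple latency-plus-bandwidth cost model can be sketched. This is an assumption made for illustration, not a model given in the talk: it takes the 10 Gbps and 36 microsecond figures at face value and ignores protocol overhead and contention. The function names are hypothetical.

```python
LINK_BANDWIDTH_BPS = 10e9   # 10 Gbps c2c link, from the slide
LINK_LATENCY_S = 36e-6      # 36 microseconds, from the slide


def transfer_time_s(message_bytes, latency_s=LINK_LATENCY_S,
                    bandwidth_bps=LINK_BANDWIDTH_BPS):
    """Assumed cost model: fixed latency plus serialization time."""
    return latency_s + (message_bytes * 8) / bandwidth_bps


def break_even_bytes(latency_s=LINK_LATENCY_S,
                     bandwidth_bps=LINK_BANDWIDTH_BPS):
    """Message size at which serialization time equals the fixed latency."""
    return latency_s * bandwidth_bps / 8


if __name__ == "__main__":
    for size in (64, 4096, 1 << 20):
        print(f"{size:>8} bytes: {transfer_time_s(size) * 1e6:8.1f} us")
    print(f"break-even size: {break_even_bytes():.0f} bytes")
```

Under these assumptions, messages well below roughly 45 KB are dominated by the fixed latency, which is one reason to batch small messages on such a fabric.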

Power per core (watts/core):

    ARM SoC    1
    x86        16 to 25
    GPU        0.3 to 0.5
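
The ARM SoC figure in the comparison above is consistent with the chassis numbers on this slide. A minimal check, using only the 700 W and 768 core values from the slide (the function name is hypothetical):

```python
def watts_per_core(chassis_tdp_watts, cores):
    """Average power budget per core for one chassis."""
    return chassis_tdp_watts / cores


arm_soc = watts_per_core(700, 768)   # chassis TDP and core count from the slide
x86_range = (16, 25)                 # watts/core, from the slide
gpu_range = (0.3, 0.5)               # watts/core, from the slide

if __name__ == "__main__":
    print(f"ARM SoC: {arm_soc:.2f} W/core")
    low, high = (w / arm_soc for w in x86_range)
    print(f"x86 draws {low:.1f}x to {high:.1f}x more power per core")
```

This gives about 0.91 W per core, which the slide rounds to 1.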

SLIDE 19

Cluster Management Tools

PBS

  • Basis: batch process
  • Pros: very fast; very flexible; normally used with MPI
  • Cons: no isolation
  • Scenario: scientific calculation

OpenStack

  • Basis: KVM
  • Pros: very secure; very stable; system-level isolation
  • Cons: high overhead; slow
  • Scenario: private cloud

Kubernetes

  • Basis: container
  • Pros: fast; secure; production ready
  • Cons: container apps only; not flexible enough
  • Scenario: application CI

Mesos

  • Basis: container and non-container
  • Pros: fast; compatible with processes and containers; production ready; can be secure
  • Cons: complexity
  • Scenario: datacenter OS

SLIDE 20

Share-Nothing + Message Queue Architecture

Stateless computing architecture

[Diagram: a host with compute cores and one IO core.]

Use an "individual" core to do IO for the host, to increase throughput.
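
The idea on this slide can be sketched in a few lines. This is an illustrative sketch and not QCF's implementation: threads and `queue.Queue` stand in for cores and the message queue, the compute workers keep no state between messages, and a single dedicated worker handles all output. All names (`compute_worker`, `io_worker`, `run`) are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a compute worker to shut down


def compute_worker(inbox, outbox):
    """Stateless compute core: each result depends only on the message received."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        job_id, payload = msg
        outbox.put((job_id, sum(x * x for x in payload)))


def io_worker(outbox, sink, n_jobs):
    """Dedicated IO core: the only place where results leave the host."""
    for _ in range(n_jobs):
        sink.append(outbox.get())


def run(jobs, n_compute=2):
    inbox, outbox, sink = queue.Queue(), queue.Queue(), []
    workers = [threading.Thread(target=compute_worker, args=(inbox, outbox))
               for _ in range(n_compute)]
    io = threading.Thread(target=io_worker, args=(outbox, sink, len(jobs)))
    for t in workers + [io]:
        t.start()
    for job in enumerate(jobs):      # messages are (job_id, payload) pairs
        inbox.put(job)
    for _ in workers:                # one STOP per compute worker
        inbox.put(STOP)
    for t in workers + [io]:
        t.join()
    return dict(sink)


if __name__ == "__main__":
    print(run([[1, 2], [3], []]))
```

Because the compute workers share nothing and only exchange messages, the same structure can in principle be moved to separate processes or separate nodes by swapping the queue for a networked message queue.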

SLIDE 21

Example: PacketShader on GPU

SLIDE 22

Example: Rendering on Arm

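[Figure, two panels. Render@Baremetal: Intel vs ARM on the scenes buggy, fishy cat, bmps, teeglasFX, splash, poked (axis 1 to 4). Render@Container: the bmw27 and classroom benchmarks on bare metal and with 1, 2, and 4 containers (axis 0.5 to 2).]

Annotations: 3x improvement under concurrency; 1.8x improvement with multiple concurrent instances.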

SLIDE 23

Example: Rendering on Arm

[Figure: bar chart comparing Intel and ARM SoC on three measures: performance, scaled 1, scaled 2 (axis 7.5 to 30).]

  • scaled 1: performance scaled by frequency and core count
  • scaled 2: performance scaled by frequency, core count, and watts
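
The slide does not give the exact formulas behind the two scaled measures. One plausible reading, offered only as an assumption, is to divide the raw score by clock frequency times core count, and then additionally by power. The function name and the sample numbers below are hypothetical and are not results from the talk.

```python
def scaled_performance(raw_score, freq_ghz, cores, watts=None):
    """Normalize a raw benchmark score.

    Assumed "scaled 1": raw score per GHz per core.
    Assumed "scaled 2": the same value, additionally divided by watts.
    """
    per_ghz_core = raw_score / (freq_ghz * cores)
    if watts is None:
        return per_ghz_core
    return per_ghz_core / watts


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the calculation.
    raw, freq, cores, watts = 24.0, 2.0, 12, 4.0
    print("scaled 1:", scaled_performance(raw, freq, cores))
    print("scaled 2:", scaled_performance(raw, freq, cores, watts))
```

Whatever the exact formula, normalizing this way lets a low-frequency, low-power SoC be compared with a high-frequency x86 part on a per-resource basis rather than on raw throughput.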

SLIDE 24

Example: AI on Arm

Caffe@Container: ARM vs Intel vs GPU (scaled)

[Figure: bar chart for three CIFAR-10 runs (CIFAR 10 - 1, CIFAR 10 - 2, CIFAR 10 - 3) comparing Intel, ARM, and GPU 1070 (axis 0.4 to 1.6).]

SLIDE 25

Example: AI on Arm SoC

[Figure, two panels (Training and Inference): bar charts comparing Intel and SoC on caffe, scaled caffe, darknet, and scaled darknet. One panel has axis ticks 4 to 16, the other 2.25 to 9.]

SLIDE 26

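Quantum Cloud Future (Beijing) Technology Co. Ltd. (hereafter Quantum Cloud) is a vertical-industry cloud computing company focused on the film and television industry. Quantum Cloud concentrates on moving film and television production to the cloud. It works with internationally known film studios and visual-effects companies to give film and television customers a one-stop solution covering production software, graphics workstations, high-performance storage, and rendering services.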

ADDRESS

Room 2101, Office Tower A, Sanlitun SOHO, 8 Gongti North Road, Chaoyang District, Beijing

NUMBER

010-53518265

EMAIL

info@lzyco.com

WEBSITE

www.lzyco.com

THANKS