Popcorn Linux: System Software for Heterogeneous Hardware



slide-1
SLIDE 1

Popcorn Linux:

System Software for Heterogeneous Hardware

Sang-Hoon Kim

Postdoctoral Associate Systems Software Research Group May 25, 2018

slide-2
SLIDE 2

Trend towards heterogeneous systems

  • Microprocessor trends have clearly shifted since 2005

[https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data]

Limited single-thread performance

  • Thermal and power budget
  • Dark silicon effect

Increase core counts

Exploit heterogeneity

Specialize cores

slide-3
SLIDE 3

Micro-architectural heterogeneity is already here

[Figure: power vs. compute capacity (performance) for energy-efficient LITTLE cores and high-performance big cores; ARM big.LITTLE / DynamIQ, as used in the iPhone X and Galaxy S8]

slide-4
SLIDE 4

Micro-architectural heterogeneity is already here


But only for a homogeneous instruction set architecture (ISA). Can we utilize heterogeneous ISAs?

slide-5
SLIDE 5

Different ISA, different execution profile

  • “Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor,” Venkat and Tullsen (UCSD), ISCA’14

– RISC vs. CISC
– Register-memory architecture vs. load/store architecture
– Vector instruction support (e.g., SIMD)
– Power efficiency per instruction
– Pipeline depth
– Degree of parallelism

slide-6
SLIDE 6

Different ISA, different execution profile

  • “Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor,” Venkat and Tullsen (UCSD), ISCA’14

[Figure: performance of bzip2 for different peak power budgets, comparing Alpha and x86 in execution Phase 1 and Phase 2]

slide-7
SLIDE 7

ISA affinity opens up opportunities

  • Can improve performance and energy consumption by migrating work to an optimal-ISA node

[Figure: performance and EDP compared across the three configurations below]

Homogeneous

  • Alpha

Single-ISA

  • big Alpha
  • medium Alpha
  • little Alpha

Heterogeneous-ISA

  • ARM Thumb
  • x86_64
  • Alpha
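Migrating work to the best-suited node can be pictured as a selection over per-node measurements. The sketch below is illustrative only: the `Profile` type, the `best_node` helper, and all timing and energy numbers are made up for the example and do not come from the cited study. EDP here is the energy-delay product (energy multiplied by execution time; lower is better).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Cost of one workload phase on one node (hypothetical numbers)."""
    node: str
    seconds: float
    joules: float

    @property
    def edp(self) -> float:
        # Energy-delay product: energy multiplied by execution time.
        return self.joules * self.seconds


def best_node(profiles, metric="edp"):
    """Return the node name that minimizes the chosen metric."""
    keys = {
        "edp": lambda p: p.edp,
        "performance": lambda p: p.seconds,
        "energy": lambda p: p.joules,
    }
    return min(profiles, key=keys[metric]).node


# Made-up measurements for a single phase of some application.
phase = [
    Profile("x86_64", seconds=2.0, joules=120.0),   # EDP = 240
    Profile("aarch64", seconds=3.0, joules=60.0),   # EDP = 180
]

print(best_node(phase, "performance"))  # fastest node
print(best_node(phase, "edp"))          # node with the lowest EDP
```

With these numbers the fastest node and the lowest-EDP node differ, which is why the optimization target has to be stated explicitly.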

slide-8
SLIDE 8

Challenges in exploiting the ISA affinity

  • Relocate execution across machine boundaries

– Single-chip/board heterogeneous-ISA architectures are not available
– Not obvious even between homogeneous-ISA machines

  • Deal with discrepancies between ISAs

– Let us assume the ISAs have the same endianness and primitive data type sizes
– However, register sets, stack layouts, executable layouts, … still differ

  • Want to run applications as-is

– Cost(developer/software) >>>> cost(hardware)
– Can enable future-proofing, which is important for legacy code!

slide-9
SLIDE 9

Popcorn Linux considers programmability

NPB IS: full_verify() written for MPI, for OpenCL, and as serial code

MPI

    void full_verify(void)
    {
        MPI_Status  status;
        MPI_Request request;
        INT_TYPE    i, j;
        INT_TYPE    k, last_local_key;

        for( i=0; i<total_local_keys; i++ )
            key_array[--key_buff_ptr_global[key_buff2[i]] - total_lesser_keys] = key_buff2[i];
        last_local_key = (total_local_keys<1)? 0 : (total_local_keys-1);

        if( my_rank > 0 )
            MPI_Irecv( &k, 1, MP_KEY_TYPE, my_rank-1, 1000, MPI_COMM_WORLD, &request );
        if( my_rank < comm_size-1 )
            MPI_Send( &key_array[last_local_key], 1, MP_KEY_TYPE, my_rank+1, 1000, MPI_COMM_WORLD );
        if( my_rank > 0 )
            MPI_Wait( &request, &status );
        ...
    }

OpenCL

    void full_verify( void )
    {
        cl_kernel k_fv0, k_fv1;
        cl_mem    m_j;
        cl_int    ecode;
        INT_TYPE *g_j;
        INT_TYPE  j = 0, i;
        size_t    j_size;
        size_t    fv0_lws[1], fv0_gws[1];
        size_t    fv1_lws[1], fv1_gws[1];

        j_size = sizeof(INT_TYPE) * (FV2_GLOBAL_SIZE / FV2_GROUP_SIZE);
        m_j = clCreateBuffer(context, CL_MEM_READ_WRITE, j_size, NULL, &ecode);

        k_fv1 = clCreateKernel(program, "full_verify1", &ecode);
        k_fv0 = clCreateKernel(program, "full_verify0", &ecode);

        ecode  = clSetKernelArg(k_fv0, 0, sizeof(cl_mem), (void*)&m_key_array);
        ecode |= clSetKernelArg(k_fv0, 1, sizeof(cl_mem), (void*)&m_key_buff2);
        fv0_lws[0] = work_item_sizes[0];
        fv0_gws[0] = NUM_KEYS;
        ecode = clEnqueueNDRangeKernel(cmd_queue, k_fv0, 1, NULL, fv0_gws, fv0_lws, 0, NULL, NULL);

        ecode  = clSetKernelArg(k_fv1, 0, sizeof(cl_mem), (void*)&m_key_buff2);
        ecode |= clSetKernelArg(k_fv1, 1, sizeof(cl_mem), (void*)&m_key_buff1);
        fv1_lws[0] = work_item_sizes[0];
        fv1_gws[0] = NUM_KEYS;
        ecode = clEnqueueNDRangeKernel(cmd_queue, k_fv1, 1, NULL, fv1_gws, fv1_lws, 0, NULL, NULL);
        ...
    }

Serial

    void full_verify(void)
    {
        INT_TYPE i, j;

        for( i=0; i<NUM_KEYS; i++ )
            key_buff2[i] = key_array[i];
        for( i=0; i<NUM_KEYS; i++ )
            key_array[--key_buff_ptr_global[key_buff2[i]]] = key_buff2[i];
        ...
    }

slide-10
SLIDE 10

Popcorn Linux

Software framework to run applications “as-is” on heterogeneous-ISA hardware

http://popcornlinux.org

slide-11
SLIDE 11

Outline

  • Why heterogeneous-ISA systems?
  • Introduction to Popcorn Linux
  • Our approaches in Popcorn Linux

– Compiler
– Runtime
– Operating System

  • Ongoing work

slide-12
SLIDE 12

Previously:

Popcorn Linux for replicated kernels

  • Run multiple kernels on a single system

– Run a kernel on a subset of processors in a system
– Primarily for OS scalability

  • Provide a single system image over the multiple kernels
  • Migrate processes across the kernel boundary

[Figure: OS 0, OS 1 and OS 2 run on subsets of Core 0 to Core 3 and together present a single operating system image]

slide-13
SLIDE 13

Popcorn Linux for heterogeneous ISAs

  • Extend the replicated kernel concept over multiple nodes

– Exploit the execution migration feature

  • Allow threads in a process to be split over multiple nodes
  • Support execution migration across ISA-different nodes

[Figure: an x86 OS (x86 Core 0, x86 Core 1) and an ARM OS (ARM Core 0, ARM Core 1) form a single operating system image, with a memory consistency protocol running over a high-speed, low-latency interconnect]

slide-14
SLIDE 14

Popcorn Linux yields performance and energy gains over homogeneous-ISA setups

  • “Breaking the boundaries in heterogeneous-ISA datacenters,” Barbalace et al., ASPLOS’17

– Workload sets drawn from an HPC benchmark suite (NPB)
– Yields 30% energy savings on average (max is 66% for set-3)

[Figure: energy consumption (kJ) for static x86(1), static x86(2), balanced x86 and balanced ARM across workload sets set-0 to set-9 and their average; annotations mark the 66% gain and the 30% average]

slide-15
SLIDE 15

How does Popcorn Linux work?

[Figure: the Popcorn compiler toolchain turns application source (.c) into a Popcorn multi-ISA binary; on each node the process runs on top of a Popcorn runtime and a Popcorn kernel]

  • Compiler: generates the multi-ISA binary
  • Runtime: transforms dynamic, ISA-specific program state
  • Kernel: migrates execution and provides a distributed execution environment

slide-16
SLIDE 16

Popcorn Compiler

Compilation

  • Built on top of clang/LLVM
  • Application source lowered into LLVM IR

– Insert migration points: migration happens only at “equivalence points”, e.g., function entry/exit
– Analyze liveness of variables

  • IR passed through each ISA backend for generating code

– Instrumentation to generate metadata (e.g., live locations)

  • A post-processing step aligns code and data in a uniform layout
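The liveness step can be illustrated with a toy backward analysis. This is a minimal sketch, not the Popcorn compiler pass: it assumes straight-line code with no branches, uses a made-up `Instr` record instead of LLVM IR, and treats every call as a migration point. The result lists, for each migration point, the values that are live across it and would therefore need a recorded location in the metadata.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Toy instruction (hypothetical format, not LLVM IR)."""
    op: str
    defs: frozenset   # names written by the instruction
    uses: frozenset   # names read by the instruction


def live_out_sets(body):
    """Backward pass over straight-line code: live-out set per instruction."""
    live = frozenset()
    result = [frozenset()] * len(body)
    for i in range(len(body) - 1, -1, -1):
        result[i] = live
        # Standard transfer function: live-in = (live-out - defs) | uses
        live = (live - body[i].defs) | body[i].uses
    return result


def migration_metadata(body):
    """Map each call (assumed migration point) to the values live across it."""
    live_out = live_out_sets(body)
    return {
        i: live_out[i] - ins.defs   # the call's own result is not live across it
        for i, ins in enumerate(body)
        if ins.op == "call"
    }


body = [
    Instr("add", frozenset({"a"}), frozenset({"x", "y"})),   # a = x + y
    Instr("call", frozenset({"r"}), frozenset({"a"})),       # r = f(a)
    Instr("mul", frozenset({"b"}), frozenset({"r", "x"})),   # b = r * x
    Instr("ret", frozenset(), frozenset({"b"})),             # return b
]

print(migration_metadata(body))
```

In this example only `x` survives the call at index 1, so only `x` would need a per-ISA location entry at that migration point.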

slide-17
SLIDE 17

Popcorn Compiler

Multi-ISA binary

  • Migratable across ISAs

– Single .data section, multiple .text sections (one per ISA)
– Global data (.data), code (.text) and TLS aligned across all compilations, so pointers are valid across all ISAs
– State transformation metadata added to the binary for translating registers/stack between ISA-specific formats

[Figure: the Popcorn compiler toolchain (compile, link, post-processing) turns C/C++ source into a multi-ISA binary containing data, x86_64 code, ARM64 code, RISC-V code and transformation metadata]
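One way to picture the aligned layout: every symbol gets the same address in every per-ISA compilation, with enough room reserved for the largest variant. The sketch below only illustrates that idea and is not how the Popcorn post-processing tools are implemented; the `uniform_layout` helper, the base address, the 16-byte alignment and the symbol sizes are all assumptions.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def uniform_layout(sizes_by_isa, base=0x400000, alignment=16):
    """Assign each symbol one address shared by all ISAs.

    sizes_by_isa: {isa_name: {symbol: size_in_bytes}} (hypothetical input).
    Each symbol reserves the size of its largest per-ISA variant.
    """
    symbols = sorted(set().union(*(s.keys() for s in sizes_by_isa.values())))
    layout, cursor = {}, base
    for sym in symbols:
        cursor = align_up(cursor, alignment)
        layout[sym] = cursor
        cursor += max(sizes.get(sym, 0) for sizes in sizes_by_isa.values())
    return layout


# Made-up function sizes: the same source compiles to different code sizes.
sizes = {
    "x86_64": {"foo": 40, "bar": 18},
    "aarch64": {"foo": 52, "bar": 24},
}

for symbol, address in uniform_layout(sizes).items():
    print(f"{symbol}: {address:#x}")
```

Because `foo` and `bar` land at identical addresses in both variants, a function pointer taken on one ISA still points at the right function after migration.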

slide-19
SLIDE 19

Popcorn Runtime

  • Transform registers and stack between ISA-specific formats

– Refer to the transformation metadata in the binary

  • Two-phase process

– Read compiler metadata describing function activation layouts
– Rewrite the stack in its entirety from source to destination ISA format

  • After transformation, the runtime invokes migration

– Pass the destination ISA’s register state and stack to the OS

slide-20
SLIDE 20

Popcorn Runtime

Stack transformation

[Figure: the call frames of baz() (top of stack), bar() and foo() are rewritten from the source layout to the destination layout in three steps]

Function  Call site  Source: frame size, return address  Destination: frame size, return address
baz       10         32 bytes, 0x410548                  48 bytes, 0x410532
bar       37         16 bytes, 0x410204                  32 bytes, 0x410198
foo       193        32 bytes, 0x412820                  40 bytes, 0x412700
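The frame-by-frame rewrite can be sketched with the call-frame records shown on this slide. This is a simplified model under stated assumptions: the `Frame` type, `build_stack` and `rewrite_stack` are hypothetical names, the live values are invented, and register mapping and the placement of values inside a frame are not modeled. Only the frame sizes, call sites and return addresses are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    function: str
    call_site: int
    size: int              # call frame size in bytes
    return_address: int
    live_values: dict = field(default_factory=dict)  # hypothetical: name -> value


# (function, call site) -> (frame size, return address), as listed on the slide.
SOURCE_META = {
    ("baz", 10): (32, 0x410548),
    ("bar", 37): (16, 0x410204),
    ("foo", 193): (32, 0x412820),
}
DEST_META = {
    ("baz", 10): (48, 0x410532),
    ("bar", 37): (32, 0x410198),
    ("foo", 193): (40, 0x412700),
}


def build_stack(meta, live_values):
    """Build a stack (top frame first) from per-call-site metadata."""
    return [
        Frame(fn, site, size, ret, dict(live_values.get(fn, {})))
        for (fn, site), (size, ret) in meta.items()
    ]


def rewrite_stack(source_stack, dest_meta):
    """Re-create every frame in the destination ISA's layout."""
    dest_stack = []
    for frame in source_stack:
        # The (function, call site) pair identifies the same program point on both ISAs.
        size, ret = dest_meta[(frame.function, frame.call_site)]
        dest_stack.append(
            Frame(frame.function, frame.call_site, size, ret, dict(frame.live_values))
        )
    return dest_stack


source = build_stack(SOURCE_META, {"baz": {"i": 3}, "foo": {"n": 100}})
dest = rewrite_stack(source, DEST_META)
for frame in dest:
    print(frame.function, frame.size, hex(frame.return_address))
```

The call-site identifier is what ties the two layouts together: it names the same program point in both binaries even though frame sizes and return addresses differ.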

slide-22
SLIDE 22

Popcorn Runtime

Stack transformation

[Figure: the runtime invokes migration in the Popcorn kernel, handing over the register set mapped to the target architecture and the transformed stack]

slide-23
SLIDE 23

Popcorn Kernel

  • Based on Linux kernel v4.4.55

– Working on x86-64 and aarch64

  • Designed to be architecture-agnostic

– Except for register and PTE manipulation

  • Relocate/distribute threads over multiple nodes
  • Migrating the entire memory is infeasible
  • Should provide sequential consistency

slide-24
SLIDE 24

Popcorn Kernel

Migrating execution

  • Equivalent to context switching across machines
  • At origin: save the execution context

– The runtime provides the register set
– thread_struct + mm_struct

  • At remote: restore the context on a thread

– Fork a kernel thread, and downgrade it to a user thread
– Construct mm_struct and associate it with the thread
– Set up the register set and thread_struct
– Return from kernel space → resume execution as if returning from a system call

[Figure: execution contexts are transferred from the origin to the remote]

slide-25
SLIDE 25

Distributed thread execution in action

[Figure: distributed thread execution. View from applications: Process A with threads 1, 2 and 3. Actual execution: original threads stay on Node 1 (origin) while remote threads run on Node 0 and Node 2 (remotes) within Rack 0, connected by a high-speed, low-latency interconnect. Shown are single-thread migration and multiple-thread relocation, VMAs fetched on demand, exclusive page access for writes and shared page access for reads. Legend: writable page, readable page, invalid page, VMA]

slide-26
SLIDE 26

Providing a consistent memory view to distributed threads

  • The origin controls ownership and data

– The origin owns all pages in the beginning
– Remotes contact the origin to obtain ownership and data for pages

  • Read-replicate, write-invalidate protocol at page granularity

– To exploit the common cases in memory-intensive workloads

  • Implemented in the virtual memory system of the operating system

– Transparent from the application’s perspective

[Figure: page access states (no permission, shared, exclusive) at the origin, Remote 0 and Remote 1 for an example sequence of reads (R1, R2, R3) and a write (W1), including revocation]
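A read-replicate, write-invalidate protocol can be modeled per page as a small state table kept at the origin. The sketch below is a generic illustration of that protocol family, not the Popcorn kernel code: the `PageDirectory` class and the node names are assumptions, and data transfer, messaging and concurrency are left out.

```python
from enum import Enum


class Access(Enum):
    INVALID = "no permission"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class PageDirectory:
    """Origin-side bookkeeping for a single page (simplified model)."""

    def __init__(self, nodes, origin="origin"):
        self.state = {node: Access.INVALID for node in nodes}
        self.state[origin] = Access.EXCLUSIVE   # origin owns all pages initially

    def read(self, node):
        """Read fault: replicate the page, downgrading any exclusive holder."""
        if self.state[node] is not Access.INVALID:
            return                               # already readable locally
        for other, access in self.state.items():
            if access is Access.EXCLUSIVE:
                self.state[other] = Access.SHARED
        self.state[node] = Access.SHARED

    def write(self, node):
        """Write fault: revoke all other copies, then grant exclusive access."""
        for other in self.state:
            if other != node:
                self.state[other] = Access.INVALID
        self.state[node] = Access.EXCLUSIVE


page = PageDirectory(["origin", "remote0", "remote1"])
page.read("remote0")    # origin and remote0 now share the page
page.read("remote1")    # three read-only replicas
page.write("remote0")   # all other replicas are invalidated
print({node: access.value for node, access in page.state.items()})
```

Reads are cheap once a replica exists, while every write pays for invalidating the other copies, which is why pages written by several nodes become a bottleneck.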

slide-27
SLIDE 27

Taming concurrent page faults with leader/follower model

  • Coalesce multiple faults and handle them with a single operation
  • Leader

– The first thread to start handling a fault on a given page
– Executes the fault handling operation for the page, e.g., bring the page from remotes, fix up the page table, flush the TLB, …

  • Followers

– Threads that can utilize the leader’s outcome
– Wait for the completion of the leader’s fault handling

  • Otherwise

– Wait or retry

[Figure: concurrent local write, local read and remote read faults on pages 0xbeee000, 0xbeef000 and 0xbef0000]
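The leader/follower idea resembles request coalescing: the first faulting thread does the work and later threads faulting on the same page wait for its result. The following user-space sketch illustrates that pattern with Python threads; `FaultCoalescer` and `fetch_page` are hypothetical names, and the sketch treats every fault on a page as compatible with the leader's, so the "wait or retry" case is not modeled.

```python
import threading


class FaultCoalescer:
    """Coalesce concurrent faults on the same page (simplified model)."""

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page   # hypothetical callback that handles the fault
        self._lock = threading.Lock()
        self._in_flight = {}            # page address -> Event set when the leader is done

    def handle_fault(self, page):
        with self._lock:
            done = self._in_flight.get(page)
            is_leader = done is None
            if is_leader:
                done = self._in_flight[page] = threading.Event()
        if is_leader:
            try:
                self._fetch_page(page)  # e.g., bring the page in, fix up mappings
            finally:
                with self._lock:
                    del self._in_flight[page]
                done.set()              # wake up all followers
            return "leader"
        done.wait()                     # reuse the leader's outcome
        return "follower"


fetched = []
coalescer = FaultCoalescer(fetched.append)
roles = []
threads = [
    threading.Thread(target=lambda: roles.append(coalescer.handle_fault(0xBEEF000)))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(fetched)} fetch(es) served {len(roles)} faults")
```

How many threads end up as followers depends on timing; the invariant is that the page is fetched exactly once per leader, never once per faulting thread.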

slide-28
SLIDE 28

Reducing false page sharing

  • Inherent in a page-level consistency protocol

– A page can bounce between nodes if they access different data objects in the same page

  • Behavior analysis tool helps to identify false page sharing

– Analyze page fault events collected in profiling mode
– Pinpoint the location in code: number of faults, type of faults, type of program objects
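A trace-based check for false page sharing can be sketched as follows. This is not the Popcorn analysis tool: the `FaultEvent` record, its fields and the sample trace are assumptions made for illustration. A page is flagged when at least two nodes fault on it, at least one fault is a write, and some pair of nodes touches disjoint sets of objects.

```python
from collections import defaultdict
from typing import NamedTuple


class FaultEvent(NamedTuple):
    """One page fault from a profiling run (hypothetical record format)."""
    page: int
    node: int
    obj: str    # program object that was accessed
    kind: str   # "read" or "write"


def false_sharing_candidates(events):
    """Return {page: summary} for pages that look falsely shared."""
    by_page = defaultdict(list)
    for ev in events:
        by_page[ev.page].append(ev)

    report = {}
    for page, evs in by_page.items():
        objs_by_node = defaultdict(set)
        for ev in evs:
            objs_by_node[ev.node].add(ev.obj)
        nodes = list(objs_by_node)
        has_write = any(ev.kind == "write" for ev in evs)
        # Some pair of nodes never touches a common object on this page.
        disjoint = any(
            objs_by_node[a].isdisjoint(objs_by_node[b])
            for i, a in enumerate(nodes)
            for b in nodes[i + 1:]
        )
        if len(nodes) > 1 and has_write and disjoint:
            report[page] = {
                "faults": len(evs),
                "objects": sorted({ev.obj for ev in evs}),
            }
    return report


trace = [
    FaultEvent(0xBEEF000, 0, "counter_a", "write"),
    FaultEvent(0xBEEF000, 1, "counter_b", "write"),
    FaultEvent(0xBEEF000, 0, "counter_a", "write"),
    FaultEvent(0xBEEE000, 0, "shared_queue", "write"),
    FaultEvent(0xBEEE000, 1, "shared_queue", "read"),
]

for page, summary in false_sharing_candidates(trace).items():
    print(hex(page), summary)
```

Only the first page is reported: both nodes fault on it but never on the same object, so moving one of the two counters to a different page would remove the bouncing. The second page is true sharing and is left alone.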

slide-29
SLIDE 29

Reducing false page sharing

Modified lines of code (Mod. LoC) per application

Suite    Application   Mod. LoC
Simple   Grep          +21 -12
         Kmeans        +6 -3
NPB      Common        +1 -1
         BT            +5 -2
         EP            +2 -1
         FT            +1 -1
PARSEC   Blackscholes  -
Polymer  Common        +86 -67
         BFS           +10 -4
         BP            +13 -10
         PageRank      +32 -30

It took a Ph.D. student 4 days to reduce false page sharing in 9 applications

slide-30
SLIDE 30

The memory consistency protocol allows applications to scale their performance

[Figure: normalized performance on 1 to 8 nodes, initial vs. optimized, for (a) GRP, (b) KMN, (c) BT, (d) EP, (e) FT, (f) BLK, (g) BFS, (h) BP and (i) PR]

Results from 8 homogeneous x86 nodes

slide-31
SLIDE 31

Summary: How does Popcorn Linux work?

  • Compiler generates “multi-ISA” binaries

– .text for every ISA (symbol-aligned)
– One .data for the entire machine (symbol-aligned)

  • Runtime transforms dynamic ISA-specific program state

– Stack, registers, etc. are rewritten on the fly

  • Operating system migrates execution and provides a consistent execution environment across machines

– Guarantees sequential data consistency to distributed threads

slide-32
SLIDE 32

Ongoing work

  • Towards a heterogeneous rack-scale system

– Previously: ARM + x86 prototype platform
– aarch64: APM883208-X1, 8 cores @ 2.4 GHz, 16 GB RAM, PCIe 8x
– x86_64: Intel Xeon E5-1650 v2, 6 cores (2 HT) @ 3.50 GHz, 16 GB RAM, PCIe 8x

[Figure: the x86 and ARM machines are connected over PCIe through Dolphin PXH810 adapters]

slide-33
SLIDE 33

Rack-scale prototype platform

Currently on-line

  • 8 Intel Xeon (x86_64)
  • 8 Cavium ThunderX (ARM64)

Working on

  • APM X-Gene2
  • IBM Power8
  • RISC-V

34

slide-34
SLIDE 34

Interconnects

  • Dolphin interconnect

– Dolphin PXH810 over PCIe, up to 56 Gb/s
– Between tightly coupled nodes

  • InfiniBand

– Mellanox ConnectX-4/3 NICs and SX6036 switch, up to 56 Gb/s
– Utilize the Remote DMA (RDMA) feature
– For global communication

  • Ethernet

– Based on standard TCP/IP and sockets
– As a standard, versatile interconnect

slide-35
SLIDE 35

Ongoing work

  • Towards a heterogeneous rack-scale system
  • Incorporate more heterogeneity

– IBM Power8
– RISC-V

  • Task scheduling in a heterogeneous-ISA rack
  • Popcorn as a security infrastructure

– E.g., increase entropy to prevent ROP attacks

  • Cross-ISA execution in a virtualization setting

slide-36
SLIDE 36

Thank you!

* Popcorn Linux Team *

Supervising: Changwoo Min, Binoy Ravindran
Compiler, runtime: Anthony Carno, Mohamed Karaoui, Robert Lyerly
Kernel: Horen Chuang, Sang-Hoon Kim

slide-37
SLIDE 37

Backup slides

38

slide-38
SLIDE 38

Performance and energy gains over homogeneous-ISA

[Figure: energy consumption (kJ) and EDP (J*1e6s) for static x86(1), static x86(2), balanced x86 and balanced ARM across workload sets set-0 to set-9 and their average; annotations mark the 66% gain, the 30% average energy saving and the 11% EDP reduction]

  • Workload sets drawn from an HPC benchmark suite (NPB)
  • The smaller the energy-delay product (EDP), the better
  • Popcorn yields 30% energy savings on average (max is 66% for set-3)
  • Popcorn yields an 11% reduction in EDP

[“Breaking the boundaries in heterogeneous-ISA datacenters,” Barbalace et al., ASPLOS’17]

slide-39
SLIDE 39

ISA affinity opens up opportunities

– “The Impact of ISAs on Performance,” Akram and Sawalha, WDDD/ISCA’17
– “OS Support for Thread Migration and Distribution in the Fully Heterogeneous Datacenter,” Olivier et al., HotOS’17

[Figure: PARSEC blackscholes]

slide-40
SLIDE 40

Motivation

  • Proliferation of heterogeneous-ISA platforms

– Discrete: Xeon Phi, GPUs
– Integrated on-die/SoC: CPU + GPU (AMD A-series); CPU + accelerator slices (Tilera TILEncore Gx-series); CPU + GPU + DSP + … (Qualcomm Snapdragon)

  • Mix of OS-capable and non-OS-capable processors

slide-41
SLIDE 41

Instruction Set Architecture (ISA)

[Figure: the same source code goes through an ARM compiler (Reduced Instruction Set Computer, RISC) and an x86 compiler (Complex Instruction Set Computer, CISC), each made up of a language parser, optimization and a binary emitter]

slide-42
SLIDE 42

Popcorn compilation

  • Built on top of clang/LLVM

– clang/LLVM 3.7.1, GNU gold 2.27 (~12.4k LoC)
– Address space alignment (~700 LoC) and post-processing (~1.7k LoC) tools
– State transformation/migration libraries (~5.9k LoC)
– Minor updates to musl-libc 1.1.18, libelf, and the GNU OpenMP runtime

slide-43
SLIDE 43

State Transformation

  • How fast is state transformation?

– Reference: scheduling is typically done every 10 milliseconds

slide-44
SLIDE 44

Migration Points

45

slide-45
SLIDE 45

Significant rewriting cost: NPB example

  • From shared memory/OpenMP to MPI
  • From serial code to OpenCL

OpenMP and MPI versions of NASA NPB

Benchmark     CG    EP    FT    IS    MG
OpenMP LOC    1150  297   1106  1108  1481
MPI modified  98%   44%   98%   46%   97%

OpenCL and serial versions of SNU NPB

Benchmark     CG    EP    FT    IS    MG
Serial LOC    506   163   606   454   852
OpenCL added  303%  164%  143%  177%  189%

“Popcorn: bridging the programmability gap in heterogeneous-ISA platforms,” A. Barbalace et al., EuroSys, 2015.

slide-46
SLIDE 46

Thread migration in action

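[Figure: thread migration under a single system image. The original thread on Kernel 1 migrates to ARM and continues as a remote thread on Kernel 2; each kernel keeps its own VMA map, and the binary provides .text (x86) and .text (ARM). When the remote thread reaches `str x1, [sp,#0xbeef]`, a page fault at @str and a page fault at sp + 0xbeef occur, and the page containing sp + 0xbeef is sent over]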

slide-47
SLIDE 47

Popcorn Kernel

  • Provide consistent memory
  • Migrate threads

slide-48
SLIDE 48

The Rack

  • Bundle

– The building block for “The Rack”
– A set of nodes that are tightly coupled to each other, to control the latency of the memory consistency protocol

  • Bundles are connected via a high-speed switching interconnect

[Figure: a bundle pairing an ARM node with an x86 node]

slide-49
SLIDE 49

The Rack

[Figure: the rack consists of Bundle 0 to Bundle 3, each pairing an ARMv8 node and a Xeon node over PCIe; the bundles are connected by an InfiniBand interconnect, and an application’s ARM-affinity and x86-affinity threads are spread across them]