An Implementation of Fast memset() Using Hardware Accelerators (PowerPoint PPT Presentation)



SLIDE 1

An Implementation of Fast memset() Using Hardware Accelerators

Runtime and Operating Systems for Supercomputers Workshop, June 12, 2018

Kishore Pusukuri (NetApp Inc), Rob Gardner (Oracle Inc), Jared Smolens (Arm Inc)

SLIDE 2

Multicore Systems & Big-memory Applications

  • Multicore systems with huge caches and many cores are ubiquitous

➢ e.g., SPARC T7-1: 32 cores (256 vCPUs), 64MB L3 cache, and 512GB RAM
➢ Maximize performance of emerging big-memory workloads (databases, graph analytics, key-value stores, and HPC workloads)

  • Big-memory applications

➢ Require many virtual-to-physical address translations in page tables and TLBs
➢ e.g., can spend ~51% of execution cycles just on TLB misses [ISCA’13]

[ISCA’13] A. Basu et al., Efficient Virtual Memory for Big Memory Servers. In Proceedings of ISCA, 2013.

SLIDE 3

Huge Pages

  • Modern hardware and OSes introduced support for huge pages

➢ e.g., Linux (kernel 2.6.39) on T7 supports 8MB, 2GB, and 16GB pages (default 8KB)
➢ Improves performance and system utilization by reducing the TLB miss rate

  • Creating a huge page is an expensive operation

➢ Need to zero the page through kernel memset()
➢ e.g., while creating a 2GB page takes 322 msec, zeroing it takes 320 msec
➢ Pages are zeroed during kernel initialization (booting) and whenever an application requests one
➢ Big-memory applications need 100s of huge pages → the zeroing operation impacts kernel initialization times (application startup times/serviceability)

SLIDE 4

This Work

  • Exploits T7 and its co-processors (DAX) to speed up the creation of huge pages by up to 11x → improves database startup time by up to 6x

  • Presents an enhanced memset() that uses T7 co-processors → improves JVM Garbage Collector latencies by 4x

On a SPARC T7-1 running Oracle Linux (2.6.39)

SLIDE 5

Outline

1. Background
2. DAX memset()
3. Hybrid DAX memset()
4. Case Studies

SLIDE 6

Kernel memset()

  • Called ~80 million times during kernel initialization
  • 99.9% of calls are for zeroing

SLIDE 7

Zeroing Huge Pages

  • clear_huge_page() takes 320 msec to zero a 2GB huge page

    void clear_huge_page(struct page *page, unsigned long addr,
                         unsigned int pages_per_huge_page)
    {
        ...
        for (i = 0; i < pages_per_huge_page; i++) {
            cond_resched();
            /* calls kernel memset() */
            clear_user_highpage(page + i, addr + i * PAGE_SIZE);
        }
    }

SLIDE 8

Improving clear_huge_page()

  • Kernel Threads
  • Work Queues (a pool of worker threads)

    Technique        8MB page   2GB page
    Kernel Threads   2.0x       4.9x
    Work Queues      4.0x       5.0x

  • Multithreaded kernel memset()

➢ Complexity of using multiple kernel threads? (Software Engineering)
➢ Spawning multiple kernel threads on a loaded system? (Performance)
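The work-queue idea above can be sketched in userspace with POSIX threads (a hypothetical illustration, not the kernel's implementation): split the region into per-worker chunks and zero each chunk in parallel. NWORKERS, parallel_zero, and zero_chunk are names of my own choosing.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace sketch of chunked parallel zeroing. */
#define NWORKERS 4

struct chunk { unsigned char *base; size_t len; };

static void *zero_chunk(void *arg)
{
    struct chunk *c = arg;
    memset(c->base, 0, c->len);   /* each worker zeroes its own slice */
    return NULL;
}

/* Zero `size` bytes at `buf` using NWORKERS threads. */
void parallel_zero(unsigned char *buf, size_t size)
{
    pthread_t tids[NWORKERS];
    struct chunk chunks[NWORKERS];
    size_t per = size / NWORKERS;

    for (int i = 0; i < NWORKERS; i++) {
        chunks[i].base = buf + (size_t)i * per;
        /* last worker also takes the remainder */
        chunks[i].len = (i == NWORKERS - 1) ? size - (size_t)i * per : per;
        pthread_create(&tids[i], NULL, zero_chunk, &chunks[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
}
```

The kernel version would queue work items on a workqueue instead of spawning pthreads; the chunking logic is the same.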

SLIDE 9

SPARC T7 & DAX (1)

  • The T7 processor has 8 co-processor units (DAX)
  • Data is directly read from and written to the memory space
  • Multiple tasks (or commands) can be submitted simultaneously; the hypervisor controls queuing and scheduling
  • DAX calls are asynchronous (the mwait instruction can be used to monitor their status → facilitates heterogeneous computing)

SLIDE 10

SPARC T7 & DAX (2)

  • How to request DAX?

➢ Fill in a ccb (coprocessor control block) with the operation to be done, pointers to various memory regions, sizes, etc.
➢ Call the hypervisor API: the hypervisor distributes ccb commands (tasks or DAX requests) across DAX units

  • Only works on contiguous memory -- page boundaries cannot be crossed by a single command (task)
  • Max. ccb blocks per ccb_command (or DAX request): 15
  • Max. concurrent ccb_commands: 64
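Those two limits bound how much work one submission round can carry; a small sketch of the implied arithmetic (the helper names are illustrative, not part of the hypervisor API):

```c
/* Capacity implied by the documented limits: up to 15 CCBs per
 * ccb_command and up to 64 concurrent ccb_commands, i.e. at most
 * 15 * 64 = 960 contiguous regions per submission round. */
#define CCBS_PER_COMMAND 15
#define MAX_CONCURRENT_COMMANDS 64

/* Number of ccb_commands needed to cover nregions CCBs (ceiling). */
int commands_needed(int nregions)
{
    return (nregions + CCBS_PER_COMMAND - 1) / CCBS_PER_COMMAND;
}

/* Can all regions be submitted in a single concurrent round? */
int fits_in_one_round(int nregions)
{
    return commands_needed(nregions) <= MAX_CONCURRENT_COMMANDS;
}
```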
SLIDE 11

Outline

1. SPARC M7 DAX
2. DAX memset()
3. Hybrid DAX memset()
4. Case Studies

SLIDE 12

DAX memset()

  • Assumes contiguous memory (effective for huge pages)
  • Used as memset(address, fill_value, size), where [address..address+size] must reside on one page

    /* fill coprocessor control block (ccb) */
    ccb_fill_t ccb;
    ...
    ccb.ctl.hdr.opcode = CCB_MSG_OPCODE_FILL;
    ccb.ctl.hdr.at_dst = CCB_AT_VA;
    ccb.imm_op = fill_value;
    ccb.ctl.size = size;
    ...
    /* ccb_submit to DAX */
    hypercall(completion_area, &ccb, ...);
    ...
    mwait();
    ...

SLIDE 13

DAX memset() (cont…)

[Diagram: the CPU fills CCBs and issues a hypercall; the hypervisor distributes the CCB submits across the 8 DAX units, which fill the 2GB huge page (contiguous memory) with zeros while the CPU waits in mwait.]

Creating 2GB: 320 msec → 31 msec, 10x speedup

SLIDE 14

DAX memset() (cont…)

[Diagram: same flow, but the CPU itself fills 1/16 of the page (128 MB) while the DAX units zero the remaining memory (heterogeneous computing).]

Creating 2GB: 320 msec → 29 msec, 11x speedup

Creating 8MB: 8x speedup

SLIDE 15

DAX memset() vs Multithreaded Techniques

    Technique        8MB page   2GB page
    Kernel Threads   2.0x       4.9x
    Work Queues      4.0x       5.0x
    DAX              8.0x       11.0x

SLIDE 16

Outline

1. SPARC M7 DAX
2. DAX memset()
3. Hybrid DAX memset()
4. Case Studies

SLIDE 17

Hybrid DAX memset (1)

  • Identify the discontinuities in physical memory corresponding to the entire virtual buffer -- need to do a page table lookup for each page

  • Derive a scatter-gather list: a list of contiguous memory chunks (starting address of the memory region and its length)

➢ e.g., { [0x100, 200]; [0x400, 100]; [0x1000, 200]; [0x6000, 400]; [0x2000, 300] }

  • Feed each item of the list as a CCB to the DAX

➢ Max. ccbs per ccb_command (or task): 15; max. concurrent ccb_commands: 64

  • Kernel memset() functionality: should also work on non-contiguous memory (assume no huge pages)
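A hedged sketch of the scatter-gather derivation, assuming an 8KB base page and a caller-supplied translation function (build_sg_list and demo_va_to_pa are hypothetical names; the real code walks the page tables):

```c
#include <stddef.h>

#define PAGE_SZ 8192  /* T7 base page size assumed */

struct sg_entry { unsigned long pa; size_t len; };

/* Walk the buffer page by page, translate each page, and coalesce
 * physically contiguous pages into one scatter-gather entry.
 * Returns the number of entries written, or -1 on overflow. */
int build_sg_list(unsigned long va, size_t size,
                  unsigned long (*va_to_pa)(unsigned long),
                  struct sg_entry *sg, int max_entries)
{
    int n = 0;
    for (size_t off = 0; off < size; off += PAGE_SZ) {
        unsigned long pa = va_to_pa(va + off);
        if (n > 0 && sg[n - 1].pa + sg[n - 1].len == pa) {
            sg[n - 1].len += PAGE_SZ;        /* contiguous: extend */
        } else {
            if (n == max_entries) return -1; /* list overflow */
            sg[n].pa = pa;
            sg[n].len = PAGE_SZ;
            n++;
        }
    }
    return n;
}

/* Example stand-in translator for demonstration: the first two pages
 * are physically contiguous, later pages live in a distant region. */
unsigned long demo_va_to_pa(unsigned long va)
{
    unsigned long idx = va / PAGE_SZ;
    return (idx < 2) ? 0x100000UL + idx * PAGE_SZ
                     : 0x900000UL + (idx - 2) * PAGE_SZ;
}
```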

SLIDE 18

Hybrid DAX memset (2)

  • Effectively distribute the scatter-gather list across all 8 DAX units -- balancing load across DAX units

Scatter-gather list [address, length]:

    R0: [0x100, 200]
    R1: [0x400, 100]
    R2: [0x1000, 200]
    R3: [0x6000, 400]
    R4: [0x2000, 300]

Bin-packing algorithm:

    number of bins = 2
    max capacity = 1200/2 = 600
    Bin-0 ← R0 + R1 + R2 + R3 (first 100)
    Bin-1 ← R3 (remaining 300) + R4 (300)

    Bin-0: [0x100, 200] [0x400, 100] [0x1000, 200] [0x6000, 100]  → one ccb command → DAX
    Bin-1: [0x6100, 300] [0x2000, 300]                            → one ccb command → DAX

6.5x speedup for 32 MB size:

  • Default memset: 6504 usec
  • Hybrid DAX memset: 1008 usec
SLIDE 19

Optimizing Hybrid DAX memset (1)

  • Derive the scatter-gather list in two stages

➢ Process half of the memory and build the scatter-gather list (then derive bins)
➢ Feed the bins (ccb_commands) to DAX units
➢ Use the CPU to process the second half of the memory while DAX is processing the ccb_commands of the first half

  • The overhead (va_to_pa) of generating the scatter-gather list is 402 usec (for 32 MB)

Speedup: 6.5x → 7.9x

  • Default memset: 6504 usec
  • Hybrid DAX memset: 1008 usec
  • Hybrid DAX memset (opt-1): 821 usec
SLIDE 20

Optimizing Hybrid DAX memset (2)

  • Fill 1/8th of the memory region using the CPU (while DAX is processing the other 7/8th)

Speedup: 6.5x → 8.7x

  • Default memset(): 6504 usec
  • Hybrid DAX memset(): 1008 usec
  • Hybrid DAX memset() (opt-1): 821 usec
  • Hybrid DAX memset() (opt-1 + opt-2): 746 usec
  • Achieves ~9x speedup in zeroing 32MB of memory
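The opt-2 split can be sketched as follows; split_region is an illustrative helper, and the 8KB page alignment of the boundary is my assumption, not stated by the slides:

```c
#include <stddef.h>

#define PAGE_SZ 8192  /* assumed base page size */

struct split { size_t dax_len; size_t cpu_len; };

/* The CPU zeroes 1/8 of the region while DAX handles the remaining
 * 7/8; the CPU share is rounded down to a page multiple so the DAX
 * portion stays page-aligned. */
struct split split_region(size_t size)
{
    struct split s;
    s.cpu_len = size / 8;
    s.cpu_len -= s.cpu_len % PAGE_SZ;   /* keep boundary page-aligned */
    s.dax_len = size - s.cpu_len;
    return s;
}
```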
SLIDE 21

Case Studies

  • Database SGA Preparation Times (Startup Times)

➢ improves by up to 5x (256GB SGA, with 8MB pages)

  • Java JVM GC Latency

➢ improves by up to 4.0x (provided an ioctl() interface for applications)

SLIDE 22

Limitations

  • Only effective for sizes > 256KB, as the DAX setup takes ~15 usec
  • Contention for DAX devices when the load is heavy?
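The size threshold suggests a simple dispatch, sketched here with a stubbed DAX path (smart_memset, use_dax, and dax_fill are hypothetical names, not the paper's implementation):

```c
#include <stddef.h>
#include <string.h>

#define DAX_THRESHOLD (256 * 1024)

/* Is the request large enough to amortize the ~15 usec DAX setup? */
int use_dax(size_t n)
{
    return n > DAX_THRESHOLD;
}

/* Stand-in for the CCB fill + hypercall + mwait path. */
static void dax_fill(void *dst, int c, size_t n)
{
    memset(dst, c, n);
}

/* Dispatch: DAX for large fills, plain memset() otherwise.
 * Returns 1 if the DAX path was chosen. */
int smart_memset(void *dst, int c, size_t n)
{
    if (use_dax(n)) {
        dax_fill(dst, c, n);
        return 1;
    }
    memset(dst, c, n);
    return 0;
}
```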
SLIDE 23

Conclusions

  • Demonstrates the potential benefits of utilizing T7 coprocessors

➢ Speeds up the creation of huge pages by up to 11.0x
➢ Improves Java JVM GC latencies by up to 4.0x

SLIDE 24

This Work

  • Exploits T7 and its co-processors (DAX) to speed up the creation of huge pages by up to 11x → improves database startup time by up to 6x

  • Presents an enhanced memset() that uses T7 co-processors → improves JVM Garbage Collector latencies by 4x

On a SPARC T7-1 running Oracle Linux (2.6.39)