HMC-Sim 2.0: A Simulation Platform for Exploring Custom Memory Cube Operations
John D. Leidel, Yong Chen May 23, 2016 AsHES 2016
Overview
Introduction & Overview
CMC Simulation
Sample CMC Mutexes
Future Research
did not provide sufficient accessibility and bandwidth
http://gc64.org
2015.
2) available without an NDA
connectivity, clock frequency, packet configuration, etc.)
stages as defined by the spec
applications affect memory throughput & latency
modeling and simulation
reductions, logical reductions, processing near memory, etc.
would it be?
broken (Sandia SST)
focus on CMC, not learning HMC-Sim internals
creativity in implementing CMC ops
should be maintained
support for logging, CMC ops will need this as well
sure users can develop/distribute their CMC ideas separately from the simulator
simulation results!
infrastructure to construct a single CMC operation mapped to a single opcode in a shared library
interface to load the CMC shared library at runtime
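[Figure: CMC architecture. libhmcsim.a holds the HMC data structures and commands (RD16, RD32, ..., WR16, WR32, ...) together with the CMC data structure and function pointers (CMC04, CMC05, ...). Individual CMC operations (CMC04, CMC05, CMC07, CMC20, CMC21, CMC22, CMC23, ...) are supplied by separate shared libraries: libMY_CMC_1.so, libMY_CMC_2.so, libSomeCMC.so.]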
logging
(from the list of 70)
lengths
(can be custom response)
hmc_cmc_t

Data Structures:
    hmc_rqst_t rqst
    uint32_t cmd
    uint32_t rqst_len
    uint32_t rsp_len
    hmc_response_t rsp_cmd
    uint8_t rsp_cmd_code
    uint32_t active
    void *handle

Function Pointers:
    int (*cmc_register)(hmc_rqst_t *, uint32_t *, uint32_t *, uint32_t *, hmc_response_t *, uint8_t *);
    int (*cmc_execute)(void *, uint32_t, uint32_t, uint32_t, uint32_t, uint64_t, uint32_t, uint64_t, uint64_t, uint64_t *, uint64_t *);
    void (*cmc_str)(char *);
CMC Tutorial: http://gc64.org/?page_id=140
extern int hmcsim_load_cmc( struct hmcsim_t *hmc, char *cmc );

Loading flow:
1. Is HMC-Sim initialized? If no, return error.
2. Begin registering the CMC library.
3. Initiate the dynamic loader: dlopen( char *cmc, RTLD_NOW ).
4. Check whether the shared library loaded; continue only if yes.
5. Register the CMC function pointers with dlsym(handle,FUNC):
    int (*cmc_register)(hmc_rqst_t *, uint32_t *, uint32_t *, uint32_t *, hmc_response_t *, uint8_t *);
    int (*cmc_execute)(void *, uint32_t, uint32_t, uint32_t, uint32_t, uint64_t, uint32_t, uint64_t, uint64_t, uint64_t *, uint64_t *);
    void (*cmc_str)(char *);
6. Execute the registration function: int (*cmc_register)(...).
7. Save the data to the hmc_cmc_t structure.
8. Return success.
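The loading flow above can be sketched with the POSIX dynamic loader. This is a minimal illustration, not the HMC-Sim source: the typedefs for hmc_rqst_t and hmc_response_t are stand-ins, cmc_slot_t and load_cmc_sketch are hypothetical names, and the exported symbol names passed to dlsym are assumed to match the function pointer names shown on the slide.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the simulator's own types (assumptions for this sketch). */
typedef int hmc_rqst_t;
typedef int hmc_response_t;

typedef int (*cmc_register_fn)(hmc_rqst_t *, uint32_t *, uint32_t *,
                               uint32_t *, hmc_response_t *, uint8_t *);
typedef int (*cmc_execute_fn)(void *, uint32_t, uint32_t, uint32_t, uint32_t,
                              uint64_t, uint32_t, uint64_t, uint64_t,
                              uint64_t *, uint64_t *);
typedef void (*cmc_str_fn)(char *);

/* Hypothetical mirror of the hmc_cmc_t fields listed on the slide. */
typedef struct {
    hmc_rqst_t rqst;
    uint32_t cmd, rqst_len, rsp_len;
    hmc_response_t rsp_cmd;
    uint8_t rsp_cmd_code;
    uint32_t active;
    void *handle;
    cmc_register_fn cmc_register;
    cmc_execute_fn cmc_execute;
    cmc_str_fn cmc_str;
} cmc_slot_t;

/* Returns 0 on success and -1 on any failure. */
int load_cmc_sketch(int sim_initialized, const char *path, cmc_slot_t *slot)
{
    assert(path != NULL && slot != NULL);

    if (!sim_initialized)
        return -1;                        /* simulator not initialized */

    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL)
        return -1;                        /* shared library did not load */

    /* Symbol names are assumptions; a real library may export others. */
    cmc_register_fn reg = (cmc_register_fn)dlsym(handle, "cmc_register");
    cmc_execute_fn  exe = (cmc_execute_fn)dlsym(handle, "cmc_execute");
    cmc_str_fn      str = (cmc_str_fn)dlsym(handle, "cmc_str");
    if (reg == NULL || exe == NULL || str == NULL) {
        dlclose(handle);
        return -1;
    }

    memset(slot, 0, sizeof *slot);
    /* Let the library describe its own opcode, lengths, and response. */
    if (reg(&slot->rqst, &slot->cmd, &slot->rqst_len, &slot->rsp_len,
            &slot->rsp_cmd, &slot->rsp_cmd_code) != 0) {
        dlclose(handle);
        return -1;
    }

    slot->handle = handle;
    slot->cmc_register = reg;
    slot->cmc_execute = exe;
    slot->cmc_str = str;
    slot->active = 1;
    return 0;
}
```

Resolving all three symbols before calling the registration function means a partially implemented library is rejected before any of its code runs. On older glibc versions the program may need to be linked with -ldl.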
extern int hmcsim_process_rqst(...)

Request processing flow:
1. Take the request from the HMC vault request queue.
2. Decode the packet header & tail.
3. Find an available response queue slot. Slot available? If no, return error.
4. Examine the request command code. CMC command?
    No: process the normal HMC command.
    Yes: process the CMC command. Is the CMC command active? If yes, retrieve the execution function pointer from struct hmc_cmc_t cmc[] and execute the CMC command using the function pointer.
5. Response required? If yes, register the response.
6. Return success.
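The CMC branch of this flow amounts to a table lookup followed by an indirect call. The sketch below illustrates that idea only; cmc_entry_t, dispatch_cmc, and toy_execute are hypothetical names, the parameter names on the execute function are guesses (the slide gives types only), and returning an error for an inactive or unknown command is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same type list as the cmc_execute pointer on the slide. */
typedef int (*cmc_execute_fn)(void *, uint32_t, uint32_t, uint32_t, uint32_t,
                              uint64_t, uint32_t, uint64_t, uint64_t,
                              uint64_t *, uint64_t *);

/* Hypothetical, reduced view of one registered CMC operation. */
typedef struct {
    uint32_t cmd;              /* request command code */
    uint32_t active;           /* nonzero once registered */
    cmc_execute_fn cmc_execute;
} cmc_entry_t;

/* Look up cmd in the table and run it through its function pointer.
 * Returns the operation's result, or -1 if cmd is unknown or inactive
 * (the error behavior is an assumption of this sketch). */
int dispatch_cmc(cmc_entry_t *table, size_t n, uint32_t cmd, uint64_t addr,
                 uint64_t *rqst_payload, uint64_t *rsp_payload)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].cmd != cmd)
            continue;
        if (!table[i].active || table[i].cmc_execute == NULL)
            return -1;
        /* Unused arguments are passed as zero in this illustration. */
        return table[i].cmc_execute(NULL, 0, 0, 0, 0, addr, 0, 0, 0,
                                    rqst_payload, rsp_payload);
    }
    return -1;
}

/* Toy operation for demonstration: writes addr + 1 into the response. */
int toy_execute(void *hmc, uint32_t dev, uint32_t quad, uint32_t vault,
                uint32_t bank, uint64_t addr, uint32_t length, uint64_t head,
                uint64_t tail, uint64_t *rqst_payload, uint64_t *rsp_payload)
{
    (void)hmc; (void)dev; (void)quad; (void)vault; (void)bank;
    (void)length; (void)head; (void)tail; (void)rqst_payload;
    rsp_payload[0] = addr + 1;
    return 0;
}
```

Because the simulator core only sees the function pointer, a new operation can be added by registering another table entry without recompiling the dispatch code.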
HMC_LOCK
[Figure: 128-bit mutex layout. Bits 127:64 hold the Thread/Task ID; bits 63:0 hold the Lock.]
memory location
space
size for normal HMC RD/WR requests
to implement our mutexes
Operation: hmc_lock
    Pseudocode: IF ( ADDR[63:0] == 0 ){ ADDR[127:64] = TID; ADDR[63:0] = 1; RET 1 }ELSE{ RET 0 }
    Command Enum: CMC125
    Request Command: 125
    Request Length: 2 FLITS
    Response Command: WR_RS
    Response Length: 2

Operation: hmc_trylock
    Pseudocode: IF ( ADDR[63:0] == 0 ){ ADDR[127:64] = TID; ADDR[63:0] = 1; RET ADDR[127:64] }ELSE{ RET ADDR[127:64] }
    Command Enum: CMC126
    Request Command: 126
    Request Length: 2 FLITS
    Response Command: RD_RS
    Response Length: 2

Operation: hmc_unlock
    Pseudocode: IF ( ADDR[127:64] == TID && ADDR[63:0] == 1 ){ ADDR[63:0] = 0; RET 1 }ELSE{ RET 0 }
    Command Enum: CMC127
    Request Command: 127
    Request Length: 2 FLITS
    Response Command: WR_RS
    Response Length: 2
HMC_LOCK
    if( LOCK == 0 ){
        TID = MY_TID;
        LOCK = 1;
        return 1;
    }else{
        return 0;
    }

HMC_TRYLOCK
    if( LOCK == 0 ){
        TID = MY_TID;
        LOCK = 1;
        return TID;
    }else{
        return TID;
    }

HMC_UNLOCK
    if( TID == MY_TID && LOCK == 1 ){
        LOCK = 0;
        return 1;
    }else{
        return 0;
    }
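The three operations translate almost directly into C. The following sketch models the 128-bit mutex as two 64-bit fields and is only an illustration of the pseudocode: hmc_mutex_t and the *_op function names are hypothetical, and the code is not itself thread-safe, since in the simulated device the atomicity comes from the operation executing inside the memory cube.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of the 128-bit mutex:
 * lock is bits 63:0, tid is bits 127:64. */
typedef struct {
    uint64_t lock;
    uint64_t tid;
} hmc_mutex_t;

/* HMC_LOCK: take the lock if it is free. Returns 1 on success, else 0. */
uint64_t hmc_lock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->lock == 0) {
        m->tid = my_tid;
        m->lock = 1;
        return 1;
    }
    return 0;
}

/* HMC_TRYLOCK: take the lock if it is free. Always returns the owner's
 * thread ID, so the caller succeeded iff the result equals my_tid. */
uint64_t hmc_trylock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->lock == 0) {
        m->tid = my_tid;
        m->lock = 1;
    }
    return m->tid;
}

/* HMC_UNLOCK: release only if the caller owns the lock.
 * Returns 1 on success, else 0. */
uint64_t hmc_unlock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->tid == my_tid && m->lock == 1) {
        m->lock = 0;
        return 1;
    }
    return 0;
}
```

Note that after an unlock the tid field still holds the previous owner, so thread IDs should be nonzero and a caller should compare the trylock result against its own ID rather than against zero.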
locks on a single mutex location
tasks from 2-100
configurations
cycles for any thread to obtain the lock
cycles for any thread to obtain the lock
for all threads to obtain the lock
Algorithm 1: CMC Mutex Algorithm
    for Nthreads do
        HMC_LOCK(ADDR)
        if LOCK_SUCCESS then
            HMC_UNLOCK(ADDR)
        else
            HMC_TRYLOCK(ADDR)
            while LOCK_FAILED do
                HMC_TRYLOCK(ADDR)
            end while
            HMC_UNLOCK(ADDR)
        end if
    end for
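To see how Algorithm 1 behaves under contention, the sketch below steps a set of simulated threads round-robin, letting each issue one request per round against a single mutex. It is a toy model, not HMC-Sim: rounds are not device cycles, there is no queuing or link model, and the state names, MAX_THREADS, and run_mutex_algorithm are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 128   /* arbitrary cap for this sketch */

typedef struct { uint64_t lock; uint64_t tid; } hmc_mutex_t;

static uint64_t op_lock(hmc_mutex_t *m, uint64_t t)
{
    if (m->lock == 0) { m->tid = t; m->lock = 1; return 1; }
    return 0;
}

static uint64_t op_trylock(hmc_mutex_t *m, uint64_t t)
{
    if (m->lock == 0) { m->tid = t; m->lock = 1; }
    return m->tid;
}

static uint64_t op_unlock(hmc_mutex_t *m, uint64_t t)
{
    if (m->tid == t && m->lock == 1) { m->lock = 0; return 1; }
    return 0;
}

enum state { ST_START, ST_SPIN, ST_HOLD, ST_DONE };

/* Runs Algorithm 1 for nthreads simulated threads, one request per
 * thread per round. Returns the number of rounds until every thread has
 * locked and unlocked once, or -1 on bad input. */
int run_mutex_algorithm(int nthreads, int *acquisitions)
{
    if (nthreads < 1 || nthreads > MAX_THREADS || acquisitions == NULL)
        return -1;

    hmc_mutex_t m = { 0, 0 };
    enum state st[MAX_THREADS];
    int done = 0, rounds = 0;

    for (int i = 0; i < nthreads; i++)
        st[i] = ST_START;
    *acquisitions = 0;

    while (done < nthreads) {
        rounds++;
        for (int i = 0; i < nthreads; i++) {
            uint64_t tid = (uint64_t)i + 1;   /* nonzero thread IDs */
            switch (st[i]) {
            case ST_START:                    /* HMC_LOCK */
                if (op_lock(&m, tid) == 1) { st[i] = ST_HOLD; (*acquisitions)++; }
                else st[i] = ST_SPIN;
                break;
            case ST_SPIN:                     /* HMC_TRYLOCK until owner */
                if (op_trylock(&m, tid) == tid) { st[i] = ST_HOLD; (*acquisitions)++; }
                break;
            case ST_HOLD:                     /* HMC_UNLOCK */
                if (op_unlock(&m, tid) == 1) { st[i] = ST_DONE; done++; }
                break;
            case ST_DONE:
                break;
            }
        }
    }
    assert(m.lock == 0);   /* every acquisition was released */
    return rounds;
}
```

In this simplified model the lock is handed from one thread to the next each round, so the round count grows linearly with the thread count; the cycle counts reported in the slides come from the simulator's queuing and link model, which this sketch does not attempt to reproduce.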
[Figure: HMC-SIM Minimum Lock Cycle Counts. Cycle counts vs. thread count for the 4L-4GB and 8L-8GB devices.]
[Figure: HMC-SIM Maximum Lock Cycle Counts. Cycle counts vs. thread count for the 4L-4GB and 8L-8GB devices.]
Device      Min Cycle Count   Max Cycle Count   Avg Cycle Count
4Link-4GB   6                 392               226.48
8Link-8GB   6                 387               221.48
cycles (not host cycles)
higher maximum latency
[Figure: HMC-SIM Average Lock Cycle Counts. Cycle counts vs. thread count for the 4L-4GB and 8L-8GB devices.]
lower average and maximum latencies
applications dependent upon primitive locking operations (embedded applications), the additional queuing capacity with more links is helpful
device promotes sub-linear scaling for both device configurations!
empty (tag) bit operations
lock bits and data payloads with a single command
array into tag bit vectors and data payloads for better concurrency
bit operations
2016
[Figure: Barrier latency in cycles vs. number of threads. Series: 4 Link Average, 4 Link Maximum, 8 Link Average, 8 Link Maximum.]
[Figure: Lock latency in cycles vs. number of threads. Series: 4 Link Average, 8 Link Average, Linear Scaling.]