Characterizing the Performance of “Big Memory” on Blue Gene Linux
Kazutomo Yoshii (Mathematics and Computer Science Division, Argonne National Laboratory), Kamil Iskra, P. Chris Broekema (ASTRON), Harish Naik, Pete Beckman
ZeptoOS Project
– System noise study: Selfish suite
– I/O forwarding: ZOID (ZeptoOS I/O Daemon)
– Memory subsystems: Big Memory
– Performance analysis: Tau, Ktau
– Linux-based compute node kernel
– Torus, collective, and barrier
– Single clock source
– 5 out of 10 in the Green500 (June 2009)
– 32-bit 4-way SMP running at 850 MHz
– Peak: 3.4 Gflops/core (2 * 2 * 850 MHz)
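The peak figure above can be checked with a few lines of arithmetic. This is a minimal sketch: the slide gives only the factors 2 * 2 * 850, so reading them as a fused multiply-add (2 flops) on a 2-wide floating-point unit is an assumption, and the per-node number simply multiplies by the 4 cores of the "4-way SMP".

```python
# Peak floating-point rate from the slide's formula 2 * 2 * 850.
# Assumption: the two factors of 2 stand for a fused multiply-add
# (2 flops per instruction) and a 2-wide FPU; the slide does not label them.
CLOCK_MHZ = 850          # from the slide
FLOPS_PER_FMA = 2        # assumed meaning of the first factor
FPU_WIDTH = 2            # assumed meaning of the second factor
CORES_PER_NODE = 4       # "4-way SMP"


def peak_gflops_per_core(clock_mhz=CLOCK_MHZ):
    """Peak Gflops of one core: flops per cycle times clock rate."""
    return FLOPS_PER_FMA * FPU_WIDTH * clock_mhz / 1000.0


def peak_gflops_per_node(clock_mhz=CLOCK_MHZ, cores=CORES_PER_NODE):
    """Peak Gflops of one node, assuming all cores reach their peak."""
    return peak_gflops_per_core(clock_mhz) * cores


if __name__ == "__main__":
    print("per core:", peak_gflops_per_core(), "Gflops")
    print("per node:", peak_gflops_per_node(), "Gflops")
```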
– Noise free
– Thread per core, single user
– No additional capabilities: remote login, VFS, ...
– although no I/O, device drivers
– Node-level performance?
– Scalability?
2 usec
[Figure: random access (read-only) memory bandwidth in MB/s for CNK, Linux 64K, and Linux 4K]
– A TLB miss costs approx. 0.3 usec
– 64 TLB entries per core
– Affects random and strided access patterns
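To see why the page size matters for random or strided access, it helps to compute how much memory the TLB can cover at once. The sketch below uses the two numbers from the slide (64 entries per core, approx. 0.3 usec) and assumes that the 0.3 usec is the cost of one TLB miss and that every entry maps a page of the same size. The function names are illustrative only, and the 256 MB case is an upper bound, since not every entry would be used that way in practice.

```python
TLB_ENTRIES_PER_CORE = 64      # from the slide
TLB_MISS_COST_USEC = 0.3       # from the slide (approximate, assumed per miss)

KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(page_size, entries=TLB_ENTRIES_PER_CORE):
    """Memory covered when every TLB entry maps one page of page_size bytes."""
    return entries * page_size


def miss_overhead_sec(num_misses, cost_usec=TLB_MISS_COST_USEC):
    """Time spent handling TLB misses for a given number of misses."""
    return num_misses * cost_usec * 1e-6


if __name__ == "__main__":
    for label, size in [("4K", 4 * KIB), ("64K", 64 * KIB), ("256M", 256 * MIB)]:
        print(label, "pages cover", tlb_reach_bytes(size) // KIB, "KiB")
    print("one million misses cost about", miss_overhead_sec(1_000_000), "s")
```

Under these assumptions, 4 KB pages cover 256 KiB and 64 KB pages cover 4 MiB, so a working set larger than that misses on most random accesses, while a handful of 256 MB entries can cover the whole working set.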
NOTE: NAS Parallel Benchmarks version 3.3, IBM XL compiler
[Diagram: process address space (kernel, stack, heap, text) described by Virtual Memory Areas (VMAs), with Page Table Entries (PTEs), TLBs, the TLB handler, and the page fault handler]
– Monopolize CPU resources
– Context switches are undesirable
– One thread per core is best
– Pin down memory for network devices
[Diagram: Zepto process address space (kernel, shared mmap regions, Big Memory region) for a Zepto binary; VMAs, PTEs, and TLBs are served by the memory allocator, TLB handler, page fault handler, and Zepto memory manager, i.e. install 256 MB TLBs]
[Flowchart: TLB miss handling. Decisions: PTE? Zepto task? Within Big Memory? VMA? Actions: install TLB from PTE; install Big Memory TLBs (semi-statically); install PTE from VMA; memory fault]
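The decision flow on this slide can be sketched as a small function. The order of the checks is an assumption, since the extracted text does not preserve the arrows of the flowchart: here the handler looks for an existing PTE first, then checks for a Zepto task touching the Big Memory region, then falls back to the VMA lookup. The `Task` class, its fields, and the returned strings are hypothetical stand-ins for kernel state, not the ZeptoOS implementation.

```python
from dataclasses import dataclass, field

PAGE = 4096  # illustrative page size for the regular (non Big Memory) path


@dataclass
class Task:
    """Hypothetical stand-in for the per-process state the handler consults."""
    is_zepto: bool = False
    big_memory: range = range(0)               # address range of the Big Memory region
    ptes: set = field(default_factory=set)     # page numbers that already have a PTE
    vmas: list = field(default_factory=list)   # address ranges of valid VMAs


def handle_tlb_miss(task, addr):
    """Return the action taken for a TLB miss at addr (assumed check order)."""
    page = addr // PAGE
    if page in task.ptes:                          # PTE?
        return "install TLB from PTE"
    if task.is_zepto and addr in task.big_memory:  # Zepto task? Within Big Memory?
        return "install Big Memory TLBs"
    if any(addr in vma for vma in task.vmas):      # VMA?
        task.ptes.add(page)                        # page fault path creates the PTE
        return "install PTE from VMA"
    return "memory fault"


if __name__ == "__main__":
    t = Task(is_zepto=True,
             big_memory=range(0x10000000, 0x20000000),
             vmas=[range(0x1000, 0x3000)])
    for a in (0x10000040, 0x1800, 0x1800, 0x900000):
        print(hex(a), "->", handle_tlb_miss(t, a))
```

The point of the Big Memory branch is that a miss inside the region is resolved by installing large TLB entries directly, without going through PTEs or the page fault handler.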
[Figure: random access (read-only) memory bandwidth in MB/s for CNK, Linux ZCB, Linux 64K, and Linux 4K]
NOTE: NPB 3.3 / MPICH 1.0.7 / DCMF 1.0, IBM XL compiler, SMP mode
NOTE: POP 2.0.1 / X1 benchmark data set, IBM XL compiler, SMP mode
– With Tau enabled, POP took 131 sec on CNK
[Figure: system load by task (receive UDP/IP packets; copy data to ring buffer; send ring buffer to CN; receive data from CN; send results to storage; total system load) for Stock 16-bit, Zepto 16-bit, and Zepto 4-bit]
– Intel Xeon, AMD Opteron, IBM Power
– Software compatibility
– Fewer bugs
– Lower cost
– Network devices, memory subsystems, FPU are
– Increased memory performance
– Porting the communication library became easier
– Big Memory on other CPUs
– Extension to DUAL and VN node modes
– Tickless kernel
NOTE: Total Mop/s NPB 3.3 / MPICH 1.0.7 / DCMF 1.0 for Linux, 2.0 for CNK SMP mode
NOTE: Mop/s per proc NPB 3.3 / MPICH 1.0.7 / DCMF 1.0 SMP mode