The mystery of the computer: bits and data

  1. The mystery of the computer: bits and data. Mikko Kivelä, Department of Computer Science, Aalto University. CS-A1120 Programming 2, 2 March 2020. Lecture notes based on material created by Petteri Kaski.

  2. Billion computations per second

         def test(m: Long) = {
           var i = 1L
           var s = 0L
           while (i <= m) { // s = 1 + 2 + ... + m
             s = s + i
             i = i + 1
           }
           s
         }
         val NANOS_PER_SEC = 1e9
         val test_start_time = System.nanoTime
         test(4000000000L)
         val test_end_time = System.nanoTime
         val test_duration = test_end_time - test_start_time
         println("test took %.2f seconds".format(test_duration / NANOS_PER_SEC))
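The timing experiment on this slide can be turned into a rough throughput figure. The following is a minimal sketch, not part of the original slides: the object name `Throughput`, the helper `closedForm`, and the smaller loop bound are assumptions chosen for illustration. It runs the same summation loop, checks the result against the closed form m(m+1)/2, and reports loop iterations per second.

```scala
object Throughput {
  // Same summation loop as on the slide: s = 1 + 2 + ... + m
  def sumTo(m: Long): Long = {
    var i = 1L
    var s = 0L
    while (i <= m) {
      s = s + i
      i = i + 1
    }
    s
  }

  // Closed form m(m+1)/2; the even factor is halved first to delay overflow.
  // (Hypothetical helper, only used to check the loop.)
  def closedForm(m: Long): Long =
    if (m % 2 == 0) (m / 2) * (m + 1) else m * ((m + 1) / 2)

  def main(args: Array[String]): Unit = {
    val m = 100000000L // smaller than the slide's 4000000000L, to finish quickly
    val start = System.nanoTime()
    val s = sumTo(m)
    val seconds = (System.nanoTime() - start) / 1e9
    assert(s == closedForm(m))
    println(f"sum = $s, loop iterations per second = ${m / seconds}%.3e")
  }
}
```

The measured figure depends on the machine and on JVM warm-up, so a single run gives only an order-of-magnitude estimate.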

  3. Intel Skylake – machine code example (***)

     Example: The innermost loop of a matrix multiplication subroutine implemented with Intel X86-64 machine code with AVX2 and FMA extensions supported by the Skylake architecture

         1029: c4 e2 7d 19 02       vbroadcastsd (%rdx),%ymm0
         102e: c4 e2 7d 19 0c 0a    vbroadcastsd (%rdx,%rcx,1),%ymm1
         1034: c4 e2 7d 19 14 4a    vbroadcastsd (%rdx,%rcx,2),%ymm2
         103a: 48 83 c2 08          add $0x8,%rdx
         103e: c5 fd 28 18          vmovapd (%rax),%ymm3
         1042: c4 e2 fd b8 e3       vfmadd231pd %ymm3,%ymm0,%ymm4
         1047: c4 e2 f5 b8 eb       vfmadd231pd %ymm3,%ymm1,%ymm5
         104c: c4 e2 ed b8 f3       vfmadd231pd %ymm3,%ymm2,%ymm6
         1051: c5 fd 28 58 20       vmovapd 0x20(%rax),%ymm3
         1056: c4 e2 fd b8 fb       vfmadd231pd %ymm3,%ymm0,%ymm7
         105b: c4 62 f5 b8 c3       vfmadd231pd %ymm3,%ymm1,%ymm8
         1060: c4 62 ed b8 cb       vfmadd231pd %ymm3,%ymm2,%ymm9
         1065: c5 fd 28 58 40       vmovapd 0x40(%rax),%ymm3
         106a: c4 62 fd b8 d3       vfmadd231pd %ymm3,%ymm0,%ymm10
         106f: c4 62 f5 b8 db       vfmadd231pd %ymm3,%ymm1,%ymm11
         1074: c4 62 ed b8 e3       vfmadd231pd %ymm3,%ymm2,%ymm12
         1079: c5 fd 28 58 60       vmovapd 0x60(%rax),%ymm3
         107e: c4 62 fd b8 eb       vfmadd231pd %ymm3,%ymm0,%ymm13
         1083: c4 62 f5 b8 f3       vfmadd231pd %ymm3,%ymm1,%ymm14
         1088: c4 62 ed b8 fb       vfmadd231pd %ymm3,%ymm2,%ymm15
         108d: 48 01 c8             add %rcx,%rax
         1090: 48 ff cb             dec %rbx
         1093: 75 94                jne 1029

     https://github.com/pkaski/cluster-play/blob/master/haswell-mm-test/libmynative.c
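For readers who want to see what such a loop computes, here is a hedged sketch in Scala of one plausible reading of the listing: three values of one matrix are broadcast per iteration, four 4-element vectors are loaded from a row of the other matrix, and twelve accumulators collect a 3 x 16 block of the product. The names `MicroKernel`, `kernel`, `stride`, and `steps` are assumptions for illustration, the row-major flat-array layout is inferred from the addressing modes rather than taken from the linked source, and the real code fuses each multiply-add into one rounding step.

```scala
object MicroKernel {
  val Rows = 3  // three broadcast values per iteration (ymm0..ymm2)
  val Cols = 16 // four vectors of four doubles (offsets 0x00, 0x20, 0x40, 0x60)

  /** Accumulates acc(i, j) += a(i, k) * b(k, j) for k = 0 until steps.
    * a and b are row-major flat arrays sharing one row length `stride`
    * (assumed, since the listing uses a single stride register %rcx).
    */
  def kernel(a: Array[Double], b: Array[Double], stride: Int, steps: Int): Array[Double] = {
    val acc = new Array[Double](Rows * Cols) // plays the role of ymm4..ymm15
    var k = 0
    while (k < steps) { // dec %rbx / jne 1029
      var j = 0
      while (j < Cols) {
        val bv = b(k * stride + j) // vmovapd: element j of the current row of b
        var i = 0
        while (i < Rows) {
          // vbroadcastsd + vfmadd231pd: acc(i, j) += a(i, k) * b(k, j)
          acc(i * Cols + j) += a(i * stride + k) * bv
          i += 1
        }
        j += 1
      }
      k += 1
    }
    acc
  }
}
```

The hardware version performs the work of the two inner loops with twelve vector instructions per iteration, each handling four doubles at once.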

  4. ? Intel Skylake – machine code example (***) (the listing from slide 3, shown again with a question mark) https://github.com/pkaski/cluster-play/blob/master/haswell-mm-test/libmynative.c

  5. NVIDIA Volta – machine code example (***)

     Example: Part of an inner loop of an algorithm (vertex-localized graph motif search) with NVIDIA GV100 GPU machine code (Compute Capability 7.0)

         LOP3.LUT R8, R6, R8, R19, 0x96, !PT;                /* 0x0000000806087212 */ /* 0x000fe400078e9613 */
         LOP3.LUT R64, R11, R64, RZ, 0x3c, !PT;              /* 0x000000400b407212 */ /* 0x000fc400078e3cff */
         LOP3.LUT R62, R62, R5, R4.reuse, 0x96, !PT;         /* 0x000000053e3e7212 */ /* 0x100fe400078e9604 */
         LOP3.LUT R17, R17, R15.reuse, R7.reuse, 0x78, !PT;  /* 0x0000000f11117212 */ /* 0x180fe400078e7807 */
         LOP3.LUT R8, R8, R15, R7, 0x78, !PT;                /* 0x0000000f08087212 */ /* 0x000fe400078e7807 */
         LOP3.LUT R18, R19.reuse, R18, R4.reuse, 0x96, !PT;  /* 0x0000001213127212 */ /* 0x140fe400078e9604 */
         LOP3.LUT R5, R19, R10, R4, 0x96, !PT;               /* 0x0000000a13057212 */ /* 0x000fe400078e9604 */
         LOP3.LUT R9, R6, R9, R19, 0x96, !PT;                /* 0x0000000906097212 */ /* 0x000fc400078e9613 */
         LOP3.LUT R7, R64, R15, R7, 0x78, !PT;               /* 0x0000000f40077212 */ /* 0x000fe400078e7807 */
         LOP3.LUT R61, R61, R12, R19, 0x96, !PT;             /* 0x0000000c3d3d7212 */ /* 0x000fe400078e9613 */
         LOP3.LUT R59, R17, R59, RZ, 0x3c, !PT;              /* 0x0000003b113b7212 */ /* 0x000fe400078e3cff */
         LOP3.LUT R60, R60, R5, R6, 0x96, !PT;               /* 0x000000053c3c7212 */ /* 0x000fe400078e9606 */
         LOP3.LUT R58, R9, R58, RZ, 0x3c, !PT;               /* 0x0000003a093a7212 */ /* 0x000fe400078e3cff */
         LOP3.LUT R51, R18, R51, RZ, 0x3c, !PT;              /* 0x0000003312337212 */ /* 0x000fc400078e3cff */
         LOP3.LUT R50, R8, R50, RZ, 0x3c, !PT;               /* 0x0000003208327212 */ /* 0x000fe400078e3cff */
         LOP3.LUT R57, R7, R57, RZ, 0x3c, !PT;               /* 0x0000003907397212 */ /* 0x000fe200078e3cff */
         @P0 BRA 0x8d0;                                      /* 0xfffff96000000947 */ /* 0x000fee000383ffff */

     https://github.com/pkaski/motif-localized
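Almost every instruction in this listing is a three-input bitwise operation whose behaviour is selected by an 8-bit lookup table (0x96, 0x3c, 0x78). A minimal Scala sketch of that idea follows, assuming the lookup-table convention documented for the PTX `lop3` instruction, where the three input bits form an index into the table. The names `Lop3` and `lop3` are hypothetical, and treating the SASS operands as (a, b, c) in the order listed is an assumption.

```scala
object Lop3 {
  /** Three-input bitwise operation selected by an 8-bit lookup table.
    * For each bit position, the bits of a, b, c form the index (a b c),
    * and the result bit is bit number `index` of `lut`.
    */
  def lop3(a: Int, b: Int, c: Int, lut: Int): Int = {
    var result = 0
    var bit = 0
    while (bit < 32) {
      val index = (((a >>> bit) & 1) << 2) | (((b >>> bit) & 1) << 1) | ((c >>> bit) & 1)
      result |= ((lut >>> index) & 1) << bit
      bit += 1
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val (a, b, c) = (0x0F0F1234, 0x33CC5678, 0x55AA9ABC)
    // Under this convention the three tables in the listing correspond to:
    assert(lop3(a, b, c, 0x96) == (a ^ b ^ c))   // three-way XOR
    assert(lop3(a, b, c, 0x3c) == (a ^ b))       // XOR of the first two inputs
    assert(lop3(a, b, c, 0x78) == (a ^ (b & c))) // a XOR (b AND c)
    println("lookup tables behave as expected")
  }
}
```

With this reading, the loop body is a long chain of XOR and AND operations on 32-bit words.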

  6. ? NVIDIA Volta – machine code example (***) (the listing from slide 5, shown again with a question mark) https://github.com/pkaski/motif-localized

  7. The mystery of the computer: What are the principles of how computers work? What is computing?

  8. Why is it important that a programmer understand the central principles of computers?

  9. • A computer is a machine: understanding the basic principles of how this machine works is a fundamental part of a programmer's professional competence
     • Skills for applications where the computer needs to be used at the limits of its performance
     • The physical device (“hardware”) and programs (“software”) interact all the way from design to execution
     • Curiosity and the joy of finding out how things work

  10. dgx01.triton.aalto.fi (***) (NVIDIA DGX-1, 8 x Tesla V100 GPU, 40960 cores, 3.2 kW, 170 teraflops)

  11. Finland: Mahti & Puhti (***) New Finnish supercomputer @ CSC Kajaani (Atos BullSequana). Puhti: 320 Nvidia V100 Volta GPUs (2.7 petaflops), ~27 000 Intel Xeon cores (2.5 petaflops). Mahti: ~180 000 AMD EPYC cores (7.5 petaflops). https://research.csc.fi/techspecs/

  12. Summit: #1 top500.org (***) ~4600 computational nodes, every node has six NVIDIA Volta V100s (~4600 x 6 x 5120 32-bit cores = ~140 million cores, 1312 MHz clock rate, 15 MW), ~200 petaflops https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
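The core counts quoted for dgx01 and Summit follow from simple multiplication. A small sketch that reproduces the arithmetic is given here; the object and value names are hypothetical, and the figure of 5120 cores per V100 is taken from the Summit slide and assumed to apply to the DGX-1 as well.

```scala
object CoreCounts {
  val CoresPerV100 = 5120L // 32-bit cores per Tesla V100, as on the Summit slide

  // dgx01: eight V100 GPUs in one NVIDIA DGX-1
  val dgxCores: Long = 8L * CoresPerV100

  // Summit: about 4600 nodes with six V100s each (approximate node count)
  val summitCores: Long = 4600L * 6L * CoresPerV100

  def main(args: Array[String]): Unit = {
    println(s"DGX-1 cores:  $dgxCores")
    println(s"Summit cores: $summitCores")
  }
}
```

Because the node count is approximate, the Summit result should be read as roughly 140 million rather than as an exact figure.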
