The mystery of the computer –– programmable computer (PowerPoint PPT presentation)




SLIDE 1


Mikko Kivelä

Department of Computer Science, Aalto University

23 March 2020

Lecture notes based on material created by Petteri Kaski

The mystery of the computer –– programmable computer

SLIDE 2

The mystery of the computer (4 rounds)

  • 2. Bits and data –– binary numbers, manipulating bits with Scala
  • 3. Combinational logic –– circuits built of logic gates, simulated with Scala
  • 4. Sequential logic –– clock and feedback added to the Scala simulation, processor data path
  • 5. Programmable computer –– assembly language programming on a Scala-simulated “armlet” processor

SLIDE 3
  • 5. Programmable computer

SLIDE 4

What are the principles of how computers operate?

(sequential logic –– round 4)

How to build a programmable machine? What is computing?

SLIDE 5

Building blocks for sequential logic

(clock-triggered feedbacks)

[Figure: building blocks –– wire, logic gates (AND, OR, NOT), input elements (constants 0 and 1)]

SLIDE 6

armlet data path

[Figure: data path diagram –– units: arithmetic logic unit, load completion unit, memory interface unit, instruction decoder unit; registers $0–$7; signals mem_data, mem_addr, mem_read_e, mem_write_e, lcu_e, read_in, instr_in, immed_in]

SLIDE 7

Instruction set

(instructions configuring the data path)

nop              # no operation
mov $L, $A       # $L = $A (copy the value of $A to $L)
and $L, $A, $B   # $L = bitwise AND of $A and $B
ior $L, $A, $B   # $L = bitwise (inclusive) OR of $A and $B
eor $L, $A, $B   # $L = bitwise exclusive-OR of $A and $B
not $L, $A       # $L = bitwise NOT of $A
add $L, $A, $B   # $L = $A + $B
sub $L, $A, $B   # $L = $A - $B
neg $L, $A       # $L = -$A
lsl $L, $A, $B   # $L = $A shifted to the left by $B bits
lsr $L, $A, $B   # $L = $A shifted to the right by $B bits
asr $L, $A, $B   # $L = $A (arithmetically) shifted to the right by $B bits
mov $L, I        # $L = I (copy the immediate data I to $L)
add $L, $A, I    # $L = $A + I
sub $L, $A, I    # $L = $A - I
and $L, $A, I    # $L = bitwise AND of $A and I
ior $L, $A, I    # $L = bitwise (inclusive) OR of $A and I
eor $L, $A, I    # $L = bitwise exclusive-OR of $A and I
lsl $L, $A, I    # $L = $A shifted to the left by I bits
lsr $L, $A, I    # $L = $A shifted to the right by I bits
asr $L, $A, I    # $L = $A (arithmetically) shifted to the right by I bits
loa $L, $A       # $L = [contents of memory word at address $A]
sto $L, $A       # [contents of memory word at address $L] = $A
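The three shift variants differ only in how the vacated bit positions are filled. A minimal Scala sketch of lsl/lsr/asr on 16-bit words (the 16-bit word size matches the binary listings on the following slides; the helper names are illustrative, not part of the armlet toolkit):

```scala
// Model armlet-style shifts on 16-bit words kept in the low bits of an Int.
def mask16(x: Int): Int = x & 0xFFFF

def lsl16(a: Int, b: Int): Int = mask16(a << b)   // shift in zeros from the right
def lsr16(a: Int, b: Int): Int = mask16(a) >>> b  // shift in zeros from the left
def asr16(a: Int, b: Int): Int = {
  // Arithmetic shift: replicate the sign bit (bit 15) on the left.
  val signed = (mask16(a) << 16) >> 16            // sign-extend 16 -> 32 bits
  mask16(signed >> b)
}

println(lsr16(0x8000, 1).toHexString)  // 4000: logical shift fills with 0
println(asr16(0x8000, 1).toHexString)  // c000: arithmetic shift keeps the sign bit
```

The difference matters exactly when bit 15 (the sign bit of a two's-complement word) is set.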

SLIDE 8

Representing instructions in binary (**)

sub $2, $0, $1

0001000010000111

ior $7, $1, 12345

0000001111011100 0011000000111001
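The second 16-bit word of the two-word `ior $7, $1, 12345` encoding is simply the immediate value written in binary, which can be checked directly (a quick sanity check of that one word, not a decoder for the instruction format):

```scala
// The immediate word of "ior $7, $1, 12345" from the slide:
val immedWord = "0011000000111001"
println(Integer.parseInt(immedWord, 2))  // 12345
```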

SLIDE 9

Data path with the Trigger tool [demo]

import armlet._
new DataPathTrigger()

SLIDE 10

Data path executes given instructions one by one –– What about programmability? What is missing?

SLIDE 11

def test(m: Long) = {
  var i = 1L
  var s = 0L
  while (i <= m) { // s = 1 + 2 + ... + m
    s = s + i
    i = i + 1
  }
  s
}

val NANOS_PER_SEC = 1e9
val test_start_time = System.nanoTime
test(4000000000L)
val test_end_time = System.nanoTime
val test_duration = test_end_time - test_start_time
println("test took %.2f seconds".format(test_duration / NANOS_PER_SEC))
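For small inputs the loop can be checked against the closed form 1 + 2 + … + m = m(m + 1)/2 (a quick sanity check of the same function, repeated here so the snippet is self-contained):

```scala
// Same summation loop as in the lecture example.
def test(m: Long) = {
  var i = 1L
  var s = 0L
  while (i <= m) { s = s + i; i = i + 1 }
  s
}

// Triangular-number closed form as a reference.
for (m <- List(1L, 10L, 1000L))
  assert(test(m) == m * (m + 1) / 2)
println(test(10L))  // 55
```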

Scala example from the first lecture

SLIDE 12

[demo]

SLIDE 13

Program = a sequence of instructions in the memory

Machine executes the program, one instruction at a time, automatically

00000: 0000000000011010
00001: 0000000000001010
00002: 0000000001011010
00003: 0000000000000001
00004: 0000000010011010
00005: 0000000000000000
00006: 0000000000100011
00007: 0000000000000000
00008: 0000000000100101
00009: 0000000000010001
00010: 0001010010000110
00011: 0000001001011110
00012: 0000000000000001
00013: 0000000000011111
00014: 0000000000000001
00015: 0000000000100100
00016: 0000000000000110
00017: 0000000000111111

SLIDE 14

How to make the execution automatic?

(using the tools of sequential logic)

SLIDE 15

What is execution?

SLIDE 16

1) Load instructions from memory, one at a time, and 2) Direct the instruction to the data path

SLIDE 17

Is this all?

SLIDE 18

Instructions are loaded from the memory, “one at a time” But in which order? Where exactly do we get the next instruction?

SLIDE 19

Program Counter (PC)

= register which at time t tells where the execution of the program is at (= from which memory address the instruction executed at time t was loaded)

SLIDE 20

Increasing the program counter

  • By default: PC(t+1) = PC(t) + 1 (start: PC(0) = 0)
  • The default order can be altered with instructions controlling the flow of execution:
  • Jump instructions: PC(t+1) = a target address
  • Branch instructions (“branch”): PC(t+1) = a target address if the branching condition is true; otherwise the default order (PC(t+1) = PC(t) + 1)
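The default increment and its two exceptions can be written as a single update rule. A toy Scala sketch of that rule (the case names are illustrative, not taken from the armlet simulator):

```scala
// Next program counter value, following the rules above.
sealed trait Flow
case object Default extends Flow                 // no control-flow instruction
final case class Jump(target: Int) extends Flow  // unconditional jump
final case class Branch(target: Int, taken: Boolean) extends Flow

def nextPC(pc: Int, flow: Flow): Int = flow match {
  case Default              => pc + 1
  case Jump(t)              => t
  case Branch(t, true)      => t
  case Branch(t, false)     => pc + 1  // condition false: default order
}

println(nextPC(7, Default))            // 8
println(nextPC(7, Jump(42)))           // 42
println(nextPC(7, Branch(42, false)))  // 8
```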
SLIDE 21

Program Status Register (PSR)

= register where the result of the latest comparison instruction is saved (branch instructions use this result to control the flow of execution)

SLIDE 22

Jump

  • A jump instructs to continue the program execution at a target address
  • In practice: a jump moves the target address into the program counter register (similar to a mov instruction)

jmp >target   # jump to label @target
# ... some code here ...
@target:
# ... some further code here ...

SLIDE 23

Comparison + Branching

  • Compare two values (register vs register, or register vs constant)
  • Conditional branching based on the latest comparison result
  • The jump is executed if and only if the branching condition is true
  • Otherwise execution continues in the default order

cmp $7, 0   # compare $7 (left) and 0 (right)
beq >done   # jump to label @done if left == right
# ... some code here ...
@done:
# ... some further code here ...
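The key point is that cmp and beq communicate through stored state, not directly. A toy Scala model of that interaction (the names psrEqual and beqTaken are illustrative, not the armlet implementation):

```scala
// Toy model: cmp stores the comparison result, branches consult it later.
var psrEqual = false  // "left == right" bit of a toy program status register

def cmp(left: Int, right: Int): Unit = { psrEqual = (left == right) }
def beqTaken: Boolean = psrEqual

cmp(7, 0)
println(beqTaken)  // false: execution continues in the default order
cmp(0, 0)
println(beqTaken)  // true: execution jumps to the target label
```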

SLIDE 24

Armlet processor (*)

[Figure: processor diagram –– units: arithmetic logic unit, load completion unit, memory interface unit, control and execution unit; registers $0–$7, PC, PS; signals mem_data, mem_addr, mem_read_e, mem_write_e, lcu_e, read_in, reset_e, hlt_f]

SLIDE 25

Round 5: Programmable computer

  • Instructions controlling the flow of execution (jump and conditional branching)
  • Machine code (binary) and symbolic machine code (“assembly language”)
  • Symbolic machine code programming
  • Machine code programming and simulation environment: the “armlet” architecture (the “Ticker” tool)

SLIDE 26

Instruction set

(instructions controlling the flow of execution)

cmp $A, $B  # compare $A (left) and $B (right)
cmp $A, I   # compare $A (left) and I (right)
jmp $A      # jump to address $A
beq $A      # ... if left == right (in the most recent comparison)
bne $A      # ... if left != right
bgt $A      # ... if left > right (signed)
blt $A      # ... if left < right (signed)
bge $A      # ... if left >= right (signed)
ble $A      # ... if left <= right (signed)
bab $A      # ... if left > right (unsigned)
bbw $A      # ... if left < right (unsigned)
bae $A      # ... if left >= right (unsigned)
bbe $A      # ... if left <= right (unsigned)
jmp I       # jump to address I
beq I       # ... if left == right (in the most recent comparison)
bne I       # ... if left != right
bgt I       # ... if left > right (signed)
blt I       # ... if left < right (signed)
bge I       # ... if left >= right (signed)
ble I       # ... if left <= right (signed)
bab I       # ... if left > right (unsigned)
bbw I       # ... if left < right (unsigned)
bae I       # ... if left >= right (unsigned)
bbe I       # ... if left <= right (unsigned)
hlt         # halt execution
trp         # trap (break out of execution for debugging)

(comparison; jump and branch based on the result of the latest comparison; halt)
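The signed (bgt, blt, …) and unsigned (bab, bbw, …) branch families can disagree on the very same 16-bit words, because a word such as 0xFFFF is 65535 when read unsigned but −1 when read as two's complement. A small Scala check of this distinction (the helper names are illustrative):

```scala
// Interpret a 16-bit word both ways.
def asUnsigned16(w: Int): Int = w & 0xFFFF
def asSigned16(w: Int): Int = (w << 16) >> 16  // sign-extend bit 15

val left  = 0xFFFF  // 65535 unsigned, -1 signed
val right = 0x0001

// bab: branch if above (unsigned left > right)
println(asUnsigned16(left) > asUnsigned16(right))  // true
// bgt: branch if greater (signed left > right)
println(asSigned16(left) > asSigned16(right))      // false
```

This is why the instruction set needs two complete families of ordering branches.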

SLIDE 27

var t = 10
var i = 1
var s = 0
while (t != 0) {
  s = s + i
  i = i + 1
  t = t - 1
}

Our first armlet program

SLIDE 28

Machine language program (armlet binary, loadable/executable):

00000: 0000000000011010
00001: 0000000000001010
00002: 0000000001011010
00003: 0000000000000001
00004: 0000000010011010
00005: 0000000000000000
00006: 0000000000100011
00007: 0000000000000000
00008: 0000000000100101
00009: 0000000000010001
00010: 0001010010000110
00011: 0000001001011110
00012: 0000000000000001
00013: 0000000000011111
00014: 0000000000000001
00015: 0000000000100100
00016: 0000000000000110
00017: 0000000000111111

Symbolic armlet machine language program (assembly-language program):

mov $0, 10
mov $1, 1
mov $2, 0
@loop:
cmp $0, 0
beq >done
add $2, $2, $1
add $1, $1, 1
sub $0, $0, 1
jmp >loop
@done:
hlt

Machine code compiler = assembler: translates the symbolic program into the loadable/executable armlet binary.
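The program on this slide can also be simulated directly in Scala: a minimal fetch–execute sketch over a symbolic form of the instructions (a toy interpreter under illustrative names, not the armlet simulator or its encoding):

```scala
// Symbolic form of the sum program above, as a toy instruction sequence.
sealed trait Instr
final case class MovI(l: Int, i: Int) extends Instr          // mov $l, i
final case class CmpI(a: Int, i: Int) extends Instr          // cmp $a, i
final case class Beq(target: Int) extends Instr              // beq
final case class Add(l: Int, a: Int, b: Int) extends Instr   // add $l, $a, $b
final case class AddI(l: Int, a: Int, i: Int) extends Instr  // add $l, $a, i
final case class SubI(l: Int, a: Int, i: Int) extends Instr  // sub $l, $a, i
final case class Jmp(target: Int) extends Instr              // jmp
case object Hlt extends Instr                                // hlt

val prog = Vector[Instr](
  MovI(0, 10), MovI(1, 1), MovI(2, 0),          // t = 10; i = 1; s = 0
  CmpI(0, 0), Beq(9),                           // @loop: exit when t == 0
  Add(2, 2, 1), AddI(1, 1, 1), SubI(0, 0, 1),   // s += i; i += 1; t -= 1
  Jmp(3),                                       // back to @loop
  Hlt)                                          // @done

val r = Array.fill(8)(0)   // registers $0..$7
var pc = 0                 // program counter
var eq = false             // "equal" bit of a toy program status register
var running = true
while (running) prog(pc) match {
  case MovI(l, i)    => r(l) = i;           pc += 1
  case CmpI(a, i)    => eq = (r(a) == i);   pc += 1
  case Beq(t)        => pc = if (eq) t else pc + 1
  case Add(l, a, b)  => r(l) = r(a) + r(b); pc += 1
  case AddI(l, a, i) => r(l) = r(a) + i;    pc += 1
  case SubI(l, a, i) => r(l) = r(a) - i;    pc += 1
  case Jmp(t)        => pc = t
  case Hlt           => running = false
}
println(r(2))  // 55 = 1 + 2 + ... + 10
```

The fetch–execute loop is exactly the mechanism the control and execution unit implements in hardware: read the instruction at PC, update registers and PC, repeat until hlt.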

SLIDE 29

Examples in the reading material

  • Sum of the data in an array
  • Maximum value in an array
  • Sorting an array
  • Using the stack when you run out of registers

SLIDE 30

Halting and trapping the execution

  • The trap instruction trp can be used to signal a break in execution so that the programmer can inspect what the processor is doing –– useful for “hunting bugs”
  • Halting with the instruction hlt stops the execution (without the possibility to continue it)

SLIDE 31

armlet programming in the Ticker machine code programming environment [demo]

import armlet._
new Ticker()

(or by running the launchTicker object in an Eclipse exercise package)

SLIDE 32

In the exercises

  • We will convince ourselves that the armlet processor we designed is a programmable machine
  • Programming in (symbolic) machine language is hard work for the programmer
  • After solving the exercises you will appreciate modern programming languages

SLIDE 33

Exercises (armlet assembly programming)

  • expression –– additions and subtractions with registers
  • wordOps –– manipulating bits in registers
  • range –– the range of numbers in an array (= max – min)
  • multiply –– algorithm for multiplication (armlet does not support multiplication at the circuit level)
  • mostFrequent –– finding the most frequent value in an array (when a sorting algorithm is given)
  • remainder16 –– computing the remainder of a division between numbers in two registers
  • remainder32 (challenge problem) –– computing the remainder of a division between two 32-bit numbers in memory
  • gcd (challenge problem) –– finding the greatest common divisor of numbers in two registers

SLIDE 34

The mystery of the computer is now solved?

SLIDE 35

Yes –– at the level of the principles of computing hardware and of programming it directly –– we can now build a programmable machine

SLIDE 36

The universality principle of computing:

A programmable machine can, with the right programming, simulate another programmable machine

SLIDE 37

Computing and programming are concepts which are independent of the device and the programming language

(In practice, the physical device, for example the amount of memory that it has, and the efficiency of the simulation restrict the above statement slightly)

SLIDE 38

The mystery of the computer is now solved?

SLIDE 39

No –– At the level of software and details of hardware

SLIDE 40

[Figure: layers of a computer system –– hardware, operating system, software]

SLIDE 41

But can you, after all, express more with Scala code than with the armlet assembly code?

SLIDE 42

No –– in principle no: at the hardware level, Scala programs are executed as machine code similar to the armlet machine code (translated during execution from Java bytecode) (**)

SLIDE 43

Example of Java bytecode (*)

Scala:

var i = 1L
var s = 0L
while (i <= 100000000L) {
  s = s + i
  i = i + 1
}

Java bytecode:

 0: lconst_1
 1: lstore_2
 2: lconst_0
 3: lstore    4
 5: lload_2
 6: ldc2_w    #15; //long 100000000l
 9: lcmp
10: ifgt      26
13: lload     4
15: lload_2
16: ladd
17: lstore    4
19: lload_2
20: lconst_1
21: ladd
22: lstore_2
23: goto      5
26:

(more in the reading material)

SLIDE 44

From Java bytecode to machine code (**)

JVM bytecode:

 0: lconst_1
 1: lstore_2
 2: lconst_0
 3: lstore    4
 5: lload_2
 6: ldc2_w    #15; //long 100000000l
 9: lcmp
10: ifgt      26
13: lload     4
15: lload_2
16: ladd
17: lstore    4
19: lload_2
20: lconst_1
21: ladd
22: lstore_2
23: goto      5
26:

x86–64 machine code:

xorq %rax, %rax
incq %rax
xorq %rdx, %rdx
L1:
addq %rax, %rdx
incq %rax
cmpq $100000001, %rax
jne  L1
L2:

(Intel Core i7 machine code, here represented as assembly code for human readability –– the machine only executes binary code)

SLIDE 45

In binary (**)

[represented as hexadecimal]

4831C0        xorq %rax, %rax
48FFC0        incq %rax
4831D2        xorq %rdx, %rdx
4801C2        addq %rax, %rdx
48FFC0        incq %rax
483D01E1F505  cmpq $100000001, %rax
75F2          jne -14

(23 × 8 = 184 bits of Intel Core i7 machine code)

SLIDE 46

But can you, after all, express more with Scala code than with the armlet assembly code?

SLIDE 47

Yes –– in practice yes: Scala is a more expressive language, and you can be much more productive programming in it than in assembly language –– Scala is “only a program”, which makes using the hardware easier for the programmer

SLIDE 48

Epilog: Hardware in the future (***)

  • https://doi.org/10.1109/MCSE.2017.32
  • https://www.semiwiki.com/forum/content/7191-iedm-2017-intel-versus-globalfoundries-leading-edge.html
  • http://dx.doi.org/10.2200/S00516ED2V01Y201306CAC024
  • http://darksilicon.org/papers/taylor_dark_silicon_horsemen_dac_2012.pdf
  • https://community.cadence.com/cadence_blogs_8/b/breakfast-bytes/posts/arm-at-smc

SLIDE 49

Growth of computing power & Moore’s law (***)

Moore’s law = the number of transistors doubles roughly every 1.5 years

[Figure: “42 Years of Microprocessor Trend Data” –– years 1970–2020, log scale 10⁰–10⁷: transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (watts), number of logical cores]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010–2017 by K. Rupp

Source (CC BY 4.0): https://github.com/karlrupp/microprocessor-trend-data

SLIDE 50

http://spectrum.ieee.org/static/special-report-50-years-of-moores-law

SLIDE 51

https://doi.org/10.1109/MCSE.2017.32

SLIDE 52

Example: Intel processors (***)

Tick-tock model = alternate between shrinking the manufacturing process (“tick”) and improving the architecture (“tock”); currently a process–architecture–optimisation cycle
Process sizes are given in nanometers (nm) → smaller = more transistors

[Figure: Haswell, 22 nm chip]

SLIDE 53

Intel Skylake architecture (***) –– 14 nm

Skylake AVX2 core: 2 × 8 = 16 floating point operations per cycle (binary32)

Intel Core i5-7360U: 2 AVX2 cores = 32 floating point operations per cycle
(2.3 GHz base, 3.6 GHz turbo)
[74 Gflops base, 115 Gflops turbo]

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
http://ark.intel.com/products/97129/Intel-Core-i7-7700K-Processor-8M-Cache-up-to-4_50-GHz
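The bracketed Gflops figures follow from throughput × clock rate: operations per cycle times cycles per second. A quick Scala check of that arithmetic, using the per-cycle counts stated on this slide:

```scala
// 2 AVX2 cores x 16 binary32 flops per core per cycle (as on the slide).
val flopsPerCycle = 2 * 16
val baseGflops  = flopsPerCycle * 2.3   // 2.3 GHz base clock
val turboGflops = flopsPerCycle * 3.6   // 3.6 GHz turbo clock
println(f"$baseGflops%.1f Gflops base, $turboGflops%.1f Gflops turbo")
// 73.6 Gflops base, 115.2 Gflops turbo (the slide rounds to 74 and 115)
```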

SLIDE 54

Intel Skylake –– machine code example (***)

1029: c4 e2 7d 19 02      vbroadcastsd (%rdx),%ymm0
102e: c4 e2 7d 19 0c 0a   vbroadcastsd (%rdx,%rcx,1),%ymm1
1034: c4 e2 7d 19 14 4a   vbroadcastsd (%rdx,%rcx,2),%ymm2
103a: 48 83 c2 08         add $0x8,%rdx
103e: c5 fd 28 18         vmovapd (%rax),%ymm3
1042: c4 e2 fd b8 e3      vfmadd231pd %ymm3,%ymm0,%ymm4
1047: c4 e2 f5 b8 eb      vfmadd231pd %ymm3,%ymm1,%ymm5
104c: c4 e2 ed b8 f3      vfmadd231pd %ymm3,%ymm2,%ymm6
1051: c5 fd 28 58 20      vmovapd 0x20(%rax),%ymm3
1056: c4 e2 fd b8 fb      vfmadd231pd %ymm3,%ymm0,%ymm7
105b: c4 62 f5 b8 c3      vfmadd231pd %ymm3,%ymm1,%ymm8
1060: c4 62 ed b8 cb      vfmadd231pd %ymm3,%ymm2,%ymm9
1065: c5 fd 28 58 40      vmovapd 0x40(%rax),%ymm3
106a: c4 62 fd b8 d3      vfmadd231pd %ymm3,%ymm0,%ymm10
106f: c4 62 f5 b8 db      vfmadd231pd %ymm3,%ymm1,%ymm11
1074: c4 62 ed b8 e3      vfmadd231pd %ymm3,%ymm2,%ymm12
1079: c5 fd 28 58 60      vmovapd 0x60(%rax),%ymm3
107e: c4 62 fd b8 eb      vfmadd231pd %ymm3,%ymm0,%ymm13
1083: c4 62 f5 b8 f3      vfmadd231pd %ymm3,%ymm1,%ymm14
1088: c4 62 ed b8 fb      vfmadd231pd %ymm3,%ymm2,%ymm15
108d: 48 01 c8            add %rcx,%rax
1090: 48 ff cb            dec %rbx
1093: 75 94               jne 1029

Example: the inner loop of a matrix multiplication algorithm written in Intel x86–64 machine code, using the AVX2 & FMA instruction sets supported by the Skylake microarchitecture

https://github.com/pkaski/cluster-play/blob/master/haswell-mm-test/libmynative.c

SLIDE 55

NVIDIA Volta (***) –– 12 nm

Streaming multiprocessor (SM): 64 floating point cores (binary32), 32 floating point cores (binary64), 8 tensor cores (binary16)

Tesla V100, Volta GV100 (80 SM) [full GV100: 84 SM]:
= 5120 floating point cores (binary32), (1312 MHz base, 1530 MHz boost) [13.4 Tflops base, 15.7 Tflops boost]
= 2560 floating point cores (binary64), (1312 MHz base, 1530 MHz boost) [6.7 Tflops base, 7.8 Tflops boost]
= 640 tensor cores (binary16), (1312 MHz base, 1530 MHz boost) [107 Tflops base, 125 Tflops boost]

21.1 billion transistors, 815 mm²
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
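The floating point Tflops figures again follow from cores × flops per cycle × clock, assuming each core performs one fused multiply–add (2 flops) per cycle; that assumption reproduces the stated numbers:

```scala
// One fused multiply-add = 2 flops per core per cycle (assumption that
// reproduces the slide's binary32/binary64 figures).
def tflops(cores: Int, ghz: Double): Double = cores * 2 * ghz / 1000.0

println(f"${tflops(5120, 1.312)}%.1f Tflops base (binary32)")   // 13.4
println(f"${tflops(5120, 1.530)}%.1f Tflops boost (binary32)")  // 15.7
println(f"${tflops(2560, 1.312)}%.1f Tflops base (binary64)")   // 6.7
```

(The tensor core figures are computed differently, since one tensor core performs a whole small matrix multiply–accumulate per cycle.)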

SLIDE 56

NVIDIA Volta – machine code example (***)

LOP3.LUT R8, R6, R8, R19, 0x96, !PT;                  /* 0x0000000806087212 */ /* 0x000fe400078e9613 */
LOP3.LUT R64, R11, R64, RZ, 0x3c, !PT;                /* 0x000000400b407212 */ /* 0x000fc400078e3cff */
LOP3.LUT R62, R62, R5, R4.reuse, 0x96, !PT;           /* 0x000000053e3e7212 */ /* 0x100fe400078e9604 */
LOP3.LUT R17, R17, R15.reuse, R7.reuse, 0x78, !PT;    /* 0x0000000f11117212 */ /* 0x180fe400078e7807 */
LOP3.LUT R8, R8, R15, R7, 0x78, !PT;                  /* 0x0000000f08087212 */ /* 0x000fe400078e7807 */
LOP3.LUT R18, R19.reuse, R18, R4.reuse, 0x96, !PT;    /* 0x0000001213127212 */ /* 0x140fe400078e9604 */
LOP3.LUT R5, R19, R10, R4, 0x96, !PT;                 /* 0x0000000a13057212 */ /* 0x000fe400078e9604 */
LOP3.LUT R9, R6, R9, R19, 0x96, !PT;                  /* 0x0000000906097212 */ /* 0x000fc400078e9613 */
LOP3.LUT R7, R64, R15, R7, 0x78, !PT;                 /* 0x0000000f40077212 */ /* 0x000fe400078e7807 */
LOP3.LUT R61, R61, R12, R19, 0x96, !PT;               /* 0x0000000c3d3d7212 */ /* 0x000fe400078e9613 */
LOP3.LUT R59, R17, R59, RZ, 0x3c, !PT;                /* 0x0000003b113b7212 */ /* 0x000fe400078e3cff */
LOP3.LUT R60, R60, R5, R6, 0x96, !PT;                 /* 0x000000053c3c7212 */ /* 0x000fe400078e9606 */
LOP3.LUT R58, R9, R58, RZ, 0x3c, !PT;                 /* 0x0000003a093a7212 */ /* 0x000fe400078e3cff */
LOP3.LUT R51, R18, R51, RZ, 0x3c, !PT;                /* 0x0000003312337212 */ /* 0x000fc400078e3cff */
LOP3.LUT R50, R8, R50, RZ, 0x3c, !PT;                 /* 0x0000003208327212 */ /* 0x000fe400078e3cff */
LOP3.LUT R57, R7, R57, RZ, 0x3c, !PT;                 /* 0x0000003907397212 */ /* 0x000fe200078e3cff */
@P0 BRA 0x8d0;                                        /* 0xfffff96000000947 */ /* 0x000fee000383ffff */

Example: a part of the inner loop of a parallelised GF(2⁸) multilinear sieving algorithm implemented in NVIDIA GV100 GPU machine code (Compute Capability 7.0)

https://github.com/pkaski/motif-localized

SLIDE 57
  • 1 nm = 1·10⁻⁹ m = 0.000 000 001 m
  • The covalent radius of a silicon atom is around 0.111 nm
  • One side of the cubic unit cell of a silicon crystal lattice is around 0.54 nm (at 300 K)
  • Intel currently uses a 10 nm process
  • Commercial 3 nm processes have been announced to start within a few years
  • Berkeley Lab has published a “1 nm gate”
  • Note: nm values are not always exactly comparable between manufacturers

SLIDE 58

Billion instructions per second, for each core, for a factory full of computers (***)

http://www.google.com/about/datacenters/inside/locations/hamina/
http://dx.doi.org/10.2200/S00516ED2V01Y201306CAC024

SLIDE 59

Contents (rounds and modules)

  • I The mystery of the computer
  • 1. Warmup round
  • 2. Bits and data
  • 3. Combinational logic
  • 4. Sequential logic
  • 5. Programmable computer
  • II Programming abstractions and analysis
  • 6. Collections and functions
  • 7. Efficiency
  • 8. Recursion
  • 9. Algorithms and representations of information
  • III Frontiers
  • 10. Concurrency and parallelism
  • 11. Virtualization and scalability
  • 12. Machines that learn?
SLIDE 60

Summary (1/3)

  • Hardware–software interface, “from bottom up”
  • Bits, gates, busses, combinational circuits, feedbacks and clock, sequential circuits, data path, processor, memory unit, programmable machine
  • Instruction set, microarchitecture
  • Machine code, symbolic machine code, assembler, programming environment

SLIDE 61

Summary (2/3)

  • What did we learn about programming? (among other things)
  • One thousand million instructions per second (!)
  • In the end, everything is bits
  • Designing hardware is also programming
  • Program, execution, computing
  • Computing and programming are concepts which are independent of the actual hardware and programming language –– any programmable machine can, with proper programming, simulate any other programmable machine
  • The use of simulation as a teaching tool and professionally
SLIDE 62
Summary (3/3)

  • What did we learn about programming? (among other things)
  • Programming is the skill of harnessing computing to a desired purpose, building your own tools when needed while making use of existing tools
  • Playing with, testing, and critically evaluating the tools is important
  • Encapsulation and well-thought-out abstractions (functions, classes, packages, etc.) help in managing more complex programs
  • Programming languages (like Scala) are “only” software tools which help harness the hardware to do what the programmer wants

SLIDE 63

Module II

Programming abstractions and analysis