 
How to find minimum of function f defined on (x, y)-plane?

"Gradient descent":
Starting from (x_0, y_0), try to figure out direction where f decreases fastest.

Could do line search to find minimum in that direction.
Then find a new direction.

Better: Step down that direction.
Then find a new direction.

Silly: Line search in x direction; line search in y direction; repeat.

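The "silly" strategy can be tried on a toy objective. The sketch below is only an illustration under stated assumptions: the quadratic f(x, y) = x^2 + y^2 + c*x*y, the starting point (1, 1), and the tolerance are all made up for this example, and the closed-form updates are exact line searches only for this particular f. It counts how many x-then-y sweeps are needed when the two variables are independent (c = 0) and when they interact strongly (c = 1.9).

```python
def f(x, y, c):
    """Toy objective: a convex quadratic bowl when |c| < 2; c couples x and y."""
    return x * x + y * y + c * x * y


def alternating_line_search(c, x=1.0, y=1.0, tol=1e-12, max_sweeps=10_000):
    """Alternate exact line searches in the x direction and the y direction.

    Returns the number of sweeps (one x search plus one y search)
    needed to bring f below tol, capped at max_sweeps.
    """
    sweeps = 0
    while f(x, y, c) >= tol and sweeps < max_sweeps:
        x = -c * y / 2.0  # exact minimizer of f over x with y held fixed
        y = -c * x / 2.0  # exact minimizer of f over y with x held fixed
        sweeps += 1
    return sweeps


uncoupled = alternating_line_search(c=0.0)
coupled = alternating_line_search(c=1.9)
print("sweeps when x and y are independent:", uncoupled)
print("sweeps when x and y interact strongly:", coupled)
```

When x and y do not interact, one sweep is enough. Once they interact, each sweep shrinks the error by only a constant factor, so the search zig-zags toward the minimum.
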
Keccak optimization

Goal: Fastest C code for Keccak on a Cortex-M4 CPU core.

You start with simple C code implementing Keccak.

You compile it; see how fast it is; modify it to try to make it faster; repeat; eventually stop trying.

You publish your fastest code.
Maybe lots of people use it, and care about its speed.

Compiler writer learns about your Keccak Cortex-M4 C code.

Compiles it; sees how fast it is.
Modifies compiler to try to make the compiled code faster.
Repeats; eventually stops trying.
Publishes a new compiler version.

Later: Maybe you try the new compiler. Whole process repeats.

You treat compiler as constant.
Compiler treats code as constant.

Define f(x, y) as time taken by code x with compiler y.

x_0: initial code.
y_0: initial compiler.

You try to minimize f(x, y_0).
x_1: new code from this line search in x direction.

Compiler writer: f(x_1, y).
y_1: new compiler from this line search in y direction.

This whole approach is silly.

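Written out as an iteration, the back-and-forth above looks roughly as follows. The index k is an assumption added here to continue the pattern; the slide itself only names x_0, y_0, x_1, y_1. The approximation signs reflect that each party "eventually stops trying" rather than finding an exact minimizer.

```latex
\begin{align*}
x_{k+1} &\approx \operatorname*{arg\,min}_{x} f(x, y_k)
  && \text{(programmer; compiler held fixed)} \\
y_{k+1} &\approx \operatorname*{arg\,min}_{y} f(x_{k+1}, y)
  && \text{(compiler writer; code held fixed)}
\end{align*}
```

Provided neither party ever ships something slower than what it started with, this gives f(x_{k+1}, y_{k+1}) <= f(x_{k+1}, y_k) <= f(x_k, y_k): progress, but only along one axis at a time.
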
min{f(x, y)} is the time taken by fastest Keccak Cortex-M4 asm.

Slowly bouncing between x-line searches, y-line searches is a silly way to approach this min.

Clearly min can be achieved by many different pairs (x, y).
Which pair is easiest to find?

Generalize from C to other languages: which language makes min easiest to find?

Why did goal say "C code"?
End user doesn't need C.

Does end user need Cortex-M4?

CPU designer learns about your Keccak Cortex-M4 asm.
Modifies the CPU design to try to make this code faster.
Repeats; eventually stops trying.

Years later, sells a new CPU.
You reoptimize for this CPU.

Sometimes CPUs try extending or replacing instruction set, but this is poorly coordinated with programmers, compiler writers.

Generalize f(x, y) definition:
f(x, y) is time taken by code x on platform y.

If compiler y on code x produces asm y(x) for Cortex-M4:
f(x, y) = f(y(x), Cortex-M4).

Without the CPU changing:
Minimize f(a, Cortex-M4).
Search for (x, y) with y(x) = a.

Typical CPU designer:
View a as a constant; try to minimize f(a, y).
Silly optimization approach.

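The identity f(x, y) = f(y(x), Cortex-M4) says that "compiler plus CPU" is itself a platform. A minimal sketch of that composition follows. Everything in it is hypothetical: the instruction names, the cycle costs in `TOY_CPU_CYCLES` (these are not real Cortex-M4 timings), and the two toy compilers exist only to make the definition concrete.

```python
# Hypothetical per-instruction cycle costs; these are NOT real Cortex-M4 timings.
TOY_CPU_CYCLES = {"load": 2, "xor": 1, "rotate": 1, "store": 2}


def time_on_cpu(asm):
    """f(a, CPU): cycles taken by assembly a on the toy CPU."""
    return sum(TOY_CPU_CYCLES[op] for op in asm)


def naive_compiler(source):
    """Toy compiler: wraps every source operation in a load and a store."""
    asm = []
    for op in source:
        asm.extend(["load", op, "store"])
    return asm


def register_compiler(source):
    """Toy compiler: keeps the value in a register, so one load and one store."""
    return ["load"] + list(source) + ["store"]


def time_on_platform(source, compiler):
    """f(x, y) = f(y(x), CPU): code x timed on the platform 'compiler y + CPU'."""
    return time_on_cpu(compiler(source))


source = ["xor", "rotate", "xor"]                        # code x
target_asm = ["load", "xor", "rotate", "xor", "store"]   # hand-written asm a

naive_time = time_on_platform(source, naive_compiler)
register_time = time_on_platform(source, register_compiler)
reaches_target = register_compiler(source) == target_asm  # a pair with y(x) = a

print(naive_time, register_time, reaches_target)
```

With the CPU held fixed, the real target is the assembly a. Any (code, compiler) pair that produces a is equally good, which is why the search is for the pair that is easiest to find.
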
"I know the minimum! I've developed the fastest circuit that computes Keccak. This circuit is my CPU."

Wait a minute: "CPU" concept is more restrictive than "chip".

Perspective of CPU designer: This chip can do anything!

People want this chip to support SHA-1, SHA-2, SHA-3, SHAmir; all sorts of block ciphers; public-key cryptosystems; non-cryptographic computations.

11 12 Adding fast Keccak circuit CPU designer’s metric: (“Keccak coprocessor”) to CPU What is best performance adds area to CPU. for a specified mix of operations within a particular CPU area? Adding fast coprocessors for desired mix of operations adds even more area to CPU. For same CPU area, obtain much better throughput by building many copies of original CPU core without these coprocessors. Fast Keccak chip is special case. Doesn’t reflect general case.
11 12 Adding fast Keccak circuit CPU designer’s metric: (“Keccak coprocessor”) to CPU What is best performance adds area to CPU. for a specified mix of operations within a particular CPU area? Adding fast coprocessors for desired mix of operations CPU designer is much more likely adds even more area to CPU. to consider incorporating a small Keccak coprocessor. For same CPU area, obtain much better throughput by building many copies of original CPU core without these coprocessors. Fast Keccak chip is special case. Doesn’t reflect general case.
11 12 Adding fast Keccak circuit CPU designer’s metric: (“Keccak coprocessor”) to CPU What is best performance adds area to CPU. for a specified mix of operations within a particular CPU area? Adding fast coprocessors for desired mix of operations CPU designer is much more likely adds even more area to CPU. to consider incorporating a small Keccak coprocessor. For same CPU area, obtain much better throughput “So we should design the by building many copies smallest Keccak circuit?” of original CPU core without these coprocessors. Fast Keccak chip is special case. Doesn’t reflect general case.
Adding fast Keccak circuit (“Keccak coprocessor”) to CPU adds area to CPU.

Adding fast coprocessors for desired mix of operations adds even more area to CPU.

For same CPU area, obtain much better throughput by building many copies of original CPU core without these coprocessors.

Fast Keccak chip is special case. Doesn’t reflect general case.

CPU designer’s metric: What is best performance for a specified mix of operations within a particular CPU area?

CPU designer is much more likely to consider incorporating a small Keccak coprocessor.

“So we should design the smallest Keccak circuit?”
Response: Maybe, but will this extreme be faster than using existing CPU instructions without coprocessor?
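The area argument above can be made concrete with a toy model. The sketch below is only an illustration: the area budget, core areas, coprocessor speedup, and Keccak fractions are hypothetical numbers, not measurements from the talk, and `mix_throughput` and `compare` are names invented here. It assumes a workload in which a fixed fraction of the work is Keccak and fills a fixed area with identical cores.

```python
def mix_throughput(area_budget, core_area, keccak_fraction, keccak_speedup=1.0):
    """Work units per time unit for identical cores filling area_budget.

    One plain core is normalized to 1 work unit per time unit, where a
    work unit is keccak_fraction Keccak work plus the rest in other work.
    """
    cores = area_budget // core_area  # only whole cores fit in the budget
    time_per_unit = keccak_fraction / keccak_speedup + (1.0 - keccak_fraction)
    return cores / time_per_unit


# Hypothetical numbers, chosen only for illustration.
AREA_BUDGET = 1200  # total chip area, arbitrary units
PLAIN_CORE = 100    # area of the original CPU core
COPRO_CORE = 150    # area of a core plus a fast Keccak coprocessor
SPEEDUP = 10.0      # assumed Keccak speedup from the coprocessor


def compare(keccak_fraction):
    """Return (plain, coprocessor) throughput for the same area budget."""
    plain = mix_throughput(AREA_BUDGET, PLAIN_CORE, keccak_fraction)
    copro = mix_throughput(AREA_BUDGET, COPRO_CORE, keccak_fraction, SPEEDUP)
    return plain, copro


if __name__ == "__main__":
    for fraction in (0.01, 0.9):
        plain, copro = compare(fraction)
        print(f"Keccak fraction {fraction}: plain={plain:.2f} coprocessor={copro:.2f}")
```

Under these assumed numbers, the plain cores win when Keccak is 1% of the mix, and the coprocessor cores win only when the mix is dominated by Keccak, which is the special case rather than the general one.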
Intel typically designs quite large CPU cores: 32KB L1 data cache, 32KB L1 instruction cache, several fast multipliers, many different instructions, out-of-order unit, etc.

“So it’s small cost for Intel to add instruction-set extension for my favorite crypto!”
Response: Yes, but even smaller benefit for Intel’s mix of operations.
Intel did add instruction for 1 round of AES.

How many parallel S-boxes are in an AES-round coprocessor?
Can be 16: big; fast.
8: smaller but slower.
4: even smaller but slower.
...
1: probably not worthwhile compared to skipping coprocessor and using other CPU instructions.

An instruction for 4 rounds of SHA-256 is in a few Intel CPUs.
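The S-box trade-off can be sketched with a little arithmetic. One AES round applies the S-box to each of the 16 state bytes, so a circuit with fewer parallel S-boxes has to reuse them over several passes. The area figures below (`SBOX_AREA`, `OTHER_AREA`) are hypothetical units invented for illustration, and the model ignores everything except the S-box passes.

```python
import math

AES_STATE_BYTES = 16  # one AES round applies the S-box to each of 16 state bytes

# Hypothetical area units, for illustration only.
SBOX_AREA = 10   # assumed area of one S-box circuit
OTHER_AREA = 40  # assumed area of the rest of the round logic


def sbox_passes(parallel_sboxes):
    """Passes through the S-box hardware needed to cover all 16 bytes."""
    return math.ceil(AES_STATE_BYTES / parallel_sboxes)


def round_area(parallel_sboxes):
    """Toy area model: fixed round logic plus one unit per S-box."""
    return OTHER_AREA + SBOX_AREA * parallel_sboxes


if __name__ == "__main__":
    for n in (16, 8, 4, 2, 1):
        print(f"{n:2d} S-boxes: area {round_area(n):3d}, passes {sbox_passes(n):2d}")
```

With a single S-box the round needs 16 passes, which is where the comparison against skipping the coprocessor and using ordinary CPU instructions becomes relevant.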
Lightweight crypto

Frequent claim in literature, where X might be
• Keccak;
• any secure hash;
• a secure cipher; ...:
“Resource-constrained IoT devices need the smallest circuit for X.”

Response: Even if speed is acceptable, who will use smallest X circuit? Why should minimum area for X give minimum area for IoT+X?
An idea from Adam Langley

Consider a device that
receives public keys from trusted sources;
receives data supposedly signed under these public keys;
verifies these signatures.
e.g. an SSL client.

Painful historical event: all clients needed upgrades to support new hash functions since old functions were broken.
A public key is a signature-verification program in a limited language.

Langley’s idea: Replace this language with a full programming language. Then can upgrade hash function (or upgrade to post-quantum signatures!) by changing public keys, with no changes to clients.

Same for public-key encryption systems: public key is program.
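A minimal sketch of the "public key is a program" idea follows. It is not Langley's design: a Python closure stands in for a program in some agreed, sandboxed language, and the signature scheme is a toy Lamport one-time signature chosen only because it fits in a few lines. The names `make_keypair` and `client_accepts` are invented for this example. The point is that the client code never mentions a hash function, so switching from SHA-256 to SHA3-256 changes only the key.

```python
import hashlib
import secrets


def make_keypair(hash_name):
    """Toy Lamport one-time key. Returns (sign, verify); verify is the public key."""
    def h(data):
        return hashlib.new(hash_name, data).digest()

    nbits = 8 * len(h(b""))
    # Secret key: two random strings per message-digest bit.
    secret = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(nbits)]
    # Public data: the hashes of those strings, baked into the verifier below.
    public = [(h(a), h(b)) for a, b in secret]

    def bits(message):
        digest = h(message)
        return [(digest[i // 8] >> (i % 8)) & 1 for i in range(nbits)]

    def sign(message):
        # Reveal one secret string per digest bit (safe for one message only).
        return [secret[i][b] for i, b in enumerate(bits(message))]

    def verify(message, signature):
        if len(signature) != nbits:
            return False
        return all(h(s) == public[i][b]
                   for i, (s, b) in enumerate(zip(signature, bits(message))))

    return sign, verify


def client_accepts(public_key_program, message, signature):
    """The client knows nothing about hash functions: it only runs the key."""
    return bool(public_key_program(message, signature))


if __name__ == "__main__":
    for name in ("sha256", "sha3_256"):  # "upgrade" by issuing a new key
        sign, public_key = make_keypair(name)
        sig = sign(b"firmware v2")
        print(name, client_accepts(public_key, b"firmware v2", sig))
```

A real deployment would need a precisely specified language and resource limits for running keys received from outside, which this sketch does not attempt.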
Say verification device is a chip of area A. How small can public keys be?

Have to consider, e.g., size of a SHA-256 program, size of a Keccak program, etc.