
Hardware Acceleration of Cryptography: a presentation by Patrick Schaumont



  1. Hardware Acceleration of Cryptography. Patrick Schaumont, Professor, Bradley Department of ECE, Virginia Tech

  2. Outline
     1. Fundamentals of Parallelism
     2. Embedded Architecture of MSP430, MSP432
     3. Hardware Acceleration in Embedded Architectures
     4. AES Hardware Accelerator
     5. Direct Memory Access
     6. Power Dissipation
     7. Literature
     This lecture is about:
     • Accelerators in microcontrollers
     • Embedded computing
     • Efficient crypto (fast, low energy)
     This lecture is NOT about:
     • FPGA
     • Multi-core
     • High-speed crypto

  3. Outline (repeated from slide 2); the next section is 1. Fundamentals of Parallelism.

  4. Sequential, Concurrent and Parallel
     • Consider three tasks (A, B, C) and two processors (1, 2)
     [Figure: task graph. A produces v1 and B produces v2; C consumes both.]

  5. Sequential, Concurrent and Parallel
     • Sequentially running A, B and C on Proc 1
     • v1 and v2 are stored in Proc 1's memory
     [Figure: timeline. Proc 1 runs A, then B, then C; Proc 2 is idle.]

  6. Sequential, Concurrent and Parallel
     • Running A on Proc 1 and B on Proc 2 in parallel
     • v1 is stored in Proc 1's memory
     • v2 is communicated from Proc 2 to Proc 1
     [Figure: timeline. Proc 1 runs A, then C; Proc 2 runs B while Proc 1 runs A.]

  7. Sequential, Concurrent and Parallel
     • Concurrently running A and B on Proc 1, C on Proc 2
     • v1 and v2 are communicated from Proc 1 to Proc 2
     • There are many concurrency mechanisms: threading, hyperthreading, SMT, TMT, ...
     [Figure: timeline. Proc 1 runs A + B interleaved; Proc 2 runs C.]

  8. Control and Data Dependency
     • A Data Dependency is a relation between two operations such that the result of one operation is used by the next.
     • A Control Dependency is a relation between two operations such that one operation must execute after the other.
     [Figure: chain A → B → C.]

  9. Control and Data Dependency
     • A Data Dependency is a relation between two operations such that the result of one operation is used by the next.
       A Data Dependency is a fundamental property of the application.
     • A Control Dependency is a relation between two operations such that one operation must execute after the other.
       A Control Dependency is caused by a resource constraint.
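
The two kinds of dependency can be illustrated with a small C sketch. The task names (`task_a`, `task_b`, `task_c`) and their bodies are hypothetical and only mirror the A, B, C tasks of the earlier slides; nothing here comes from the lecture's own code.

```c
#include <stdint.h>

/* Hypothetical tasks; names and bodies are illustrative only. */
static uint32_t task_a(void) { return 5; }            /* produces v1 */
static uint32_t task_b(void) { return 3; }            /* produces v2 */
static uint32_t task_c(uint32_t v1, uint32_t v2) {    /* consumes v1 and v2 */
    return v1 * v2;
}

/* On one processor the statements run in program order.
 * - task_c needs v1 and v2: these are data dependencies, and they exist
 *   no matter how many processors are available.
 * - task_b runs after task_a only because a single processor executes one
 *   statement at a time: a control dependency from a resource constraint. */
uint32_t run_sequential(void) {
    uint32_t v1 = task_a();
    uint32_t v2 = task_b();
    return task_c(v1, v2);
}

/* Swapping the first two statements is legal because they share no data;
 * only the data dependencies into task_c must be preserved. */
uint32_t run_reordered(void) {
    uint32_t v2 = task_b();
    uint32_t v1 = task_a();
    return task_c(v1, v2);
}
```

Both functions return the same value, which is the point: the order of A and B is an artifact of the single processor, while the order "A and B before C" is fixed by the data.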

  10. Software and Hardware
      • Software is sequential/concurrent: as parallel as allowed by resource constraints. Many control dependencies!
      • Hardware is parallel: maximally parallel. Ideally, only data dependencies.
      Software (ARM assembly; the definitions of labels ||$C$L6|| and ||$C$L7|| are not in the transcript):
          PUSH  {A4, V1, V2, LR}
          MOVS  A3, #0
          MOVW  A4, sbox+0
          MOVT  A4, sbox+0
          MOVS  A2, #0
          ADD   V1, A1, A2, LSL #2
          LDRB  V2, [A3, +V1]
          LDRB  V2, [A4, +V2]
          ADDS  A2, A2, #1
          UXTB  A2, A2
          CMP   A2, #4
          STRB  V2, [A3, +V1]
          BLT   ||$C$L7||
          ADDS  A3, A3, #1
          UXTB  A3, A3
          CMP   A3, #4
          BLT   ||$C$L6||
          POP   {A4, V1, V2, PC}
      [Figure: hardware shown as four SBOX blocks side by side.]
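
The assembly on this slide reads like compiler output for a nested loop that pushes each byte of a 4x4 state through a lookup table. A C sketch that could plausibly compile to similar code follows. The function name `sub_bytes`, the `state[row][col]` layout, and passing the table as a parameter are assumptions (the listing refers to a global named `sbox`).

```c
#include <stdint.h>

/* Replace every byte of a 4x4 state with its table entry.
 * Software visits the 16 bytes one at a time; the loop counters, compares
 * and branches are control dependencies that hardware with parallel
 * S-box units would not need. */
void sub_bytes(uint8_t state[4][4], const uint8_t table[256]) {
    for (uint8_t col = 0; col < 4; col++) {         /* outer counter (A3 in the listing) */
        for (uint8_t row = 0; row < 4; row++) {     /* inner counter (A2 in the listing) */
            state[row][col] = table[state[row][col]];
        }
    }
}
```

Each table lookup depends only on the byte it replaces, so all sixteen substitutions are independent of one another; the sequential order is imposed by the single load/store path of the CPU.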

  11. Hardware acceleration
      • If $T_{HW} < T_{SW}$, overall performance may improve
      [Figure: timelines. A section of sequential software takes $T_{SW}$; the same work in parallel hardware takes $T_{HW}$.]

  12. Hardware acceleration
      • Better: if $(T_{HW} + T_{COM}) < T_{SW}$, overall performance may improve
      [Figure: timelines. The software section of length $T_{SW}$ is replaced by communication ($T_{COM}$) plus hardware execution ($T_{HW}$); data dependencies connect software and hardware.]

  13. Speedup
      Speedup =? $\frac{T_{SW}}{T_{COM} + T_{HW}}$
      [Figure: timelines for sequential software ($T_{SW}$) and parallel hardware ($T_{COM}$, $T_{HW}$).]

  14. Speedup
      Better: Speedup =? $\frac{T_{TOTAL}}{T_{TOTAL} - (T_{SW} - (T_{COM} + T_{HW}))}$
      [Figure: as before, with $T_{TOTAL}$ spanning the whole program.]

  15. Speedup
      Speedup = $\frac{1}{(1 - p) + p/s}$
      with
      • $p = T_{SW} / T_{TOTAL}$ ~ parallelizable portion
      • $s = T_{SW} / (T_{COM} + T_{HW})$ ~ acceleration

  16. Speedup
      Speedup = $\frac{1}{(1 - p) + p/s}$
      with
      • $p = T_{SW} / T_{TOTAL}$ ~ parallelizable portion
      • $s = T_{SW} / (T_{COM} + T_{HW})$ ~ acceleration
      [Callout on the slide: "About time!"]
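
The time-based expression and the $p$, $s$ form describe the same quantity when $p$ is taken as the fraction $T_{SW} / T_{TOTAL}$ of the run time that is moved to hardware. A small C sketch makes this easy to check numerically. The function names and the sample numbers in the note below are illustrative assumptions, not measurements from the lecture.

```c
/* Speedup as the ratio of total run time before and after acceleration.
 * After acceleration, the software section t_sw is replaced by t_com + t_hw. */
double speedup_from_times(double t_total, double t_sw, double t_com, double t_hw) {
    return t_total / (t_total - (t_sw - (t_com + t_hw)));
}

/* The same speedup in the (p, s) form, which has the shape of Amdahl's law:
 * p is the accelerated fraction of the run time, s is its local speedup. */
double speedup_p_s(double t_total, double t_sw, double t_com, double t_hw) {
    double p = t_sw / t_total;
    double s = t_sw / (t_com + t_hw);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, with a total of 100 time units of which 60 are replaced by 5 units of communication and 10 units of hardware execution, both functions give 100 / 55, or about 1.82. The local acceleration is 4x, but the untouched 40 units limit the overall gain.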

  17. Outline (repeated from slide 2); the next section is 2. Embedded Architecture of MSP430, MSP432.

  18. Target Platform
      • MSP432P401R: ARM Cortex M4; 256K Flash / 64K SRAM; $20
      • MSP430FR5994: MSP430 (16 bit); 256K FRAM / 8K SRAM; $17

  19. MSP432P401R [Figure]

  20. MSP432P401R
      [Figure labels: BUS; CPU; Memory, holding the program (read) and data (read/write).]

  21. MSP432P401R
      [Figure labels: Direct Memory Access; CRYPTO Accelerator.]

  22. MSP432P401R
      [Figure labels: Bus Slaves; Bus Masters.]

  23. MSP430FR5994 [Figure]

  24. MSP430FR5994
      [Figure labels: DMA; BUS; CPU; Memory, holding the program (read) and data (read/write); CRYPTO Accelerator.]

  25. Memory Map (MSP430FR5994)
      • Unified view of all bus slaves in a single memory space
      • FRAM, SRAM store program and variables
      • Peripherals contain memory-mapped registers
      Address range (hex)    Region
      4000 to 43FFF          256K FRAM
      1C00 to 3BFF           8K SRAM
      020 to FFF             Peripherals

  26. Memory Map (MSP430FR5994)
      • Unified view of all bus slaves in a single memory space
      • FRAM, SRAM store program and variables
      • Peripherals contain memory-mapped registers
      Address range (hex)    Region
      4000 to 43FFF          256K FRAM
      1C00 to 3BFF           8K SRAM
      020 to FFF             Peripherals
      Software: MOV.B #1,&P1OUT
      Hardware: the BUS carries the value 1 to the P1OUT register in the Peripherals region
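
The effect of `MOV.B #1,&P1OUT` can be mimicked in C. The sketch below is a host-side model, not device code: an array stands in for the peripheral region so that it can run on a PC, and the address used for P1OUT is an assumption for illustration (on a real target the definition comes from `<msp430.h>`). The helper names `reg_write8`, `reg_read8` and `set_p1out_to_one` are hypothetical.

```c
#include <stdint.h>

/* Host-side stand-in for the low part of the memory map (up to 0xFFF).
 * On the real device the bus decodes the address; here an array plays
 * that role. */
#define PERIPH_END 0x0FFFu
static volatile uint8_t periph_space[PERIPH_END + 1u];

/* Assumed register address, for illustration only. */
#define P1OUT_ADDR 0x0202u

/* Byte-wide register write: the C counterpart of MOV.B #value,&address. */
void reg_write8(uint16_t addr, uint8_t value) {
    periph_space[addr] = value;
}

/* Byte-wide register read. */
uint8_t reg_read8(uint16_t addr) {
    return periph_space[addr];
}

/* Equivalent of MOV.B #1,&P1OUT: an ordinary store whose address happens
 * to select a peripheral register instead of FRAM or SRAM. */
void set_p1out_to_one(void) {
    reg_write8(P1OUT_ADDR, 1);
}
```

The `volatile` qualifier matters on a real target: it keeps the compiler from removing or reordering accesses whose only visible effect is on hardware.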

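  27. Example: LED Blinker
          #include <msp430.h>

          int main(void) {
              WDTCTL = WDTPW | WDTHOLD;    // stop the watchdog timer
              P1OUT &= ~BIT0;              // clear bit 0 of P1OUT
              P1DIR |= BIT0;               // set bit 0 of P1DIR (pin is an output)
              while (1) {
                  P1OUT ^= BIT0;           // toggle bit 0 of P1OUT
                  __delay_cycles(100000);
              }
          }

      // Assembly (the definitions of labels $1 and $C$L4 are not in the transcript)
          BIC.B   #1,&P1OUT+0
          OR.B    #1,&P1DIR+0
          XOR.B   #1,&P1OUT+0
          PUSHM.A #1, r13
          MOV.W   #33330, r13
          SUB.W   #1, r13
          JNE     $1
          POPM.A  #1, r13
          JMP     $C$L4
      [Figure: the port 1 registers P1IN, P1OUT and P1DIR.]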

  28. Outline (repeated from slide 2); the next section is 3. Hardware Acceleration in Embedded Architectures.

  29. Hardware Acceleration
      • Let's design a multiplier peripheral (MSP430)
      • Our benchmark program
          unsigned long mymul(unsigned a, unsigned b) {
              unsigned long r;
              r = (unsigned long) a * b;
              return r;
          }

          volatile int arg1 = 5, arg2 = 3;

          int main() {
              return mymul(arg1, arg2);
          }

  30. Hardware Acceleration
      • Let's design a multiplier peripheral (MSP430)
      • Our benchmark program
          main:
              MOV.W   #5,0(SP)
              MOV.W   #3,2(SP)
              MOV.W   0(SP),r12            ; arguments
              MOV.W   2(SP),r13
              CALLA   #mymul
          mymul:
              SUBA    #8,SP
              MOV.W   r13,6(SP)
              MOV.W   r12,4(SP)
              CALLA   #__mspabi_mpyul
              MOV.W   r12,0(SP)
              MOV.W   r13,2(SP)
              ADDA    #8,SP
              RETA
      [Figure: stack frame just before ADDA #8,SP. From the top: return address, b, a, rhi, rlo; SP points at rlo.]

  31. Hardware Acceleration
      • Our benchmark program
          main:
              MOV.W   #5,0(SP)
              MOV.W   #3,2(SP)
              MOV.W   0(SP),r12            ; arguments
              MOV.W   2(SP),r13
              CALLA   #mymul
          mymul:
              SUBA    #8,SP
              MOV.W   r13,6(SP)
              MOV.W   r12,4(SP)
              CALLA   #__mspabi_mpyul
              MOV.W   r12,0(SP)
              MOV.W   r13,2(SP)
              ADDA    #8,SP
              RETA

          __mspabi_mpyul:                  ; library function
              MOV.W   R12,R11
              MOV.W   R13,R14
              CLR.W   R15
              CLR.W   R12
              CLR.W   R13
              CLRC
              RRC.W   R11
              JMP     mpyul_add_loop1
          mpyul_add_loop:
              RRA.W   R11
          mpyul_add_loop1:
              JNC     shift_test_mpyul
              ADD.W   R14,R12
              ADDC.W  R15,R13
          shift_test_mpyul:
              RLA.W   R14
              RLC.W   R15
              TST.W   R11
              JNE     mpyul_add_loop
              RET
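
The library routine in this listing is a shift-and-add multiplier: it examines the bits of one operand from least to most significant and, for each bit that is set, adds a correspondingly shifted copy of the other operand to a 32-bit accumulator. A C sketch of the same technique follows. The function name `shift_add_mul` is hypothetical, and the code is a restatement of the algorithm, not the library's source.

```c
#include <stdint.h>

/* Shift-and-add multiplication of two 16-bit values into a 32-bit product.
 * The loop body runs once per remaining bit of a, so the run time depends
 * on the position of the most significant set bit of a. */
uint32_t shift_add_mul(uint16_t a, uint16_t b) {
    uint32_t product = 0;     /* accumulator (R13:R12 in the listing) */
    uint32_t addend = b;      /* shifted copy of b (R15:R14 in the listing) */
    while (a != 0) {
        if (a & 1u) {         /* lowest bit set: add the shifted operand */
            product += addend;
        }
        addend <<= 1;         /* next bit position weighs twice as much */
        a >>= 1;              /* move on to the next bit of a */
    }
    return product;
}
```

Every iteration costs several instructions plus a conditional branch, which is the sequential cost that a single-cycle hardware multiplier removes.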

  32. Hardware Acceleration
      • Our accelerated program
          main:
              MOV.W   #5,0(SP)
              MOV.W   #3,2(SP)
              MOV.W   0(SP),r12            ; arguments
              MOV.W   2(SP),r13
              CALLA   #myhwmul
          myhwmul:                         ; software HAL
              MOV.W   r12,&HW1
              MOV.W   r13,&HW2
              MOV.W   #0,&CTL1
              MOV.W   #1,&CTL1
              MOV.W   &HW3,r12
              MOV.W   &HW4,r13
              RETA
      [Figure: arrows between the software HAL (myhwmul) and the Hardware Multiplier, labeled "data+control dependency" near the register writes and "data dependency" near the result reads.]

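  33. Hardware Multiplier
      [Figure: the CPU connects through the System Interconnect to the multiplier's bus interface and to other bus slaves. Bus signals: per_addr (16 or 20 bits), per_din (16 bits), per_we, per_en, per_dout (16 bits, also driven by other bus slaves). Behind the bus interface, registers a, ctl and b feed a 1-cycle mul whose result is held in rlo and rhi.]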

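  34. Hardware Multiplier Peripheral Design
      [Figure: peripheral datapath. Address decoding of per_addr, per_we and per_en produces write_a, write_b, write_c and read_r. per_din (16 bits) loads registers a and b; per_din[0] loads the 1-bit ctl register. The 16 x 16 multiplier produces a 32-bit product for rlo and rhi (16 bits each); edge-detector blocks sit between the control logic and the registers. read_r (2 bits) selects rlo, rhi or 0 onto per_dout.]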
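
The behavior of such a peripheral, as seen from the bus, can be approximated by a small C model. This is a sketch under stated assumptions: the register offsets are invented (the slides name HW1 to HW4 and CTL1 but give no addresses), the type and function names `hwmul_t`, `hwmul_write` and `hwmul_read` are hypothetical, and the choice to capture the product on a 0-to-1 transition of ctl is inferred from the HAL sequence (write 0, then 1) and the edge-detector blocks in the figure.

```c
#include <stdint.h>

/* Hypothetical register offsets, for illustration only. */
enum { REG_A = 0, REG_B = 2, REG_CTL = 4, REG_RLO = 6, REG_RHI = 8 };

typedef struct {
    uint16_t a, b;        /* operand registers */
    uint16_t ctl;         /* 1-bit control register, loaded from per_din[0] */
    uint16_t rlo, rhi;    /* captured 32-bit product */
} hwmul_t;

/* Bus write: address decoding selects which register per_din is loaded into. */
void hwmul_write(hwmul_t *m, uint16_t addr, uint16_t din) {
    switch (addr) {
    case REG_A:
        m->a = din;
        break;
    case REG_B:
        m->b = din;
        break;
    case REG_CTL: {
        uint16_t next = din & 1u;
        if (m->ctl == 0 && next == 1) {          /* assumed: rising edge captures */
            uint32_t p = (uint32_t)m->a * m->b;
            m->rlo = (uint16_t)p;
            m->rhi = (uint16_t)(p >> 16);
        }
        m->ctl = next;
        break;
    }
    default:
        break;            /* address belongs to another bus slave: ignore */
    }
}

/* Bus read: result registers drive per_dout; any other address reads as 0. */
uint16_t hwmul_read(const hwmul_t *m, uint16_t addr) {
    if (addr == REG_RLO) return m->rlo;
    if (addr == REG_RHI) return m->rhi;
    return 0;
}

/* Software HAL: the same access sequence as myhwmul on the earlier slide. */
uint32_t myhwmul(hwmul_t *m, uint16_t x, uint16_t y) {
    hwmul_write(m, REG_A, x);
    hwmul_write(m, REG_B, y);
    hwmul_write(m, REG_CTL, 0);
    hwmul_write(m, REG_CTL, 1);
    uint32_t lo = hwmul_read(m, REG_RLO);
    uint32_t hi = hwmul_read(m, REG_RHI);
    return (hi << 16) | lo;
}
```

Starting from a zero-initialized `hwmul_t`, `myhwmul(&m, 5, 3)` returns 15. Returning 0 for undecoded reads is a common way to let several bus slaves share one read bus by OR-ing their outputs, which would fit the "0" input of the read multiplexer in the figure.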
