
A RISC-V ISA EXTENSION FOR ULTRA-LOW POWER IOT WIRELESS SIGNAL PROCESSING

Carolynn Bernier, Hela Belhadj Amor, Zdenĕk Přikryl | Oct 1, 2019


  1. A RISC-V ISA EXTENSION FOR ULTRA-LOW POWER IOT WIRELESS SIGNAL PROCESSING Carolynn Bernier, Hela Belhadj Amor, Zdenĕk Přikryl Oct 1, 2019

  2. ULP WIRELESS DESIGN @ LETI
     [Figure: timeline of Leti ULP wireless designs (2003, 2005, 2010, today). Design labels: RFID Atmel-Starchip VHBR 65nm; Digbee ULP RF SoC; Letibee; Foxy UWB; UWB Impulse; LDR-TCR radio; Hybrid UWB/RFID; UMETAG; Wake-up radio. Die-photo block labels: LC filter, config, baseband, PLL, input & output, RX, MN, TX.]
     C. Bernier | October 1, 2019

  3. SOFTWARE RADIO FOR ULP IOT
     Motivation: a software-defined "smart" wireless transceiver for IoT
     • PHY-agnostic solution for LPWA-IoT
       • Address "multi-mode" markets and lower hardware bug-fix costs
       • Offer future-proofed designs to our clients
       • Our clients' advanced prototypes have evolving needs: satellite IoT, ultra-wideband localization, LPWA-IoT
     • A new experimental platform
       • Design new "RF software sensors"
       • Use lightweight ML algorithms to extract information from the RF signal
     [Figure label: Software-Defined Transceiver]

  4. SOFTWARE RADIO FOR ULP IOT
     • Bottleneck: existing software-defined radio (SDR) solutions are NOT ULP!
       • High cost (200 to 5K USD)
       • General purpose → high power
     [Akeela, 2018]

  5. SOFTWARE RADIO FOR ULP IOT
     Solution: design of a ULP-SDR
     • Heterogeneous multi-core platform
     • Challenge: target mW-level power consumption
     [Figure: SDR-based IoT node. Blocks: wideband RF, configurable DFE, ULP SDR, MCU (application, protocol stack), MEM, PMU, sensor I/F. RF input options: 2.4 GHz (ISM) or 1.6 GHz (satellite) or UWB or subGHz (ISM) … Annotations: "Similar requirements in most IoT transceivers (BW < 5 MHz)" and "Differing requirements in most IoT transceivers".]

  6. SYSTEM REQUIREMENTS
     • Target architecture
       • A very small and fast core (signoff ~300 MHz) associated with a TCPM and a TCDM
       • Software DSP limited to decimated sample streams
       • DFE includes easily configurable and common HW operators: FIR filters, down-converters, AGC…
     • Real-time processing of complex samples
       • Samples are temporarily stored in a sample buffer and processed in blocks
       • Integer processing only
     • Limit size of memory → big impact on power → configurable in size
       • TCPM (high-speed non-volatile)
       • TCDM (stack usage!)
       • Sample buffer
     • Limit read/write to TCM
     • Single-cycle sleep
       • Wait for next block of samples
       • Radio = OFF/ON

  7. COMPUTING FOR WIRELESS DSP
     • Wireless DSP requires linearity and low distortion
       • Operators MUST NOT saturate
       • Operators MUST NOT overflow → but checking for overflows is too costly
     • Wireless DSP must conserve dynamic range (DR)
       • The useful signal is often contained in the least significant bits
       • Beware of quantization noise → take care when rescaling the signal!
     • Most wireless signals are complex: i(t) + j*q(t)
       • Frequent use of MUL, ADD, SUB, MAG, SHIFT, … instructions on 8/16/32-bit complex data
       • Demodulation/compensation algorithms are mostly based on correlations → i.e. multiplication
     • Input signal stream is typically <= 8 bits
       • i.e. data streams are typically 8 / 16 / 32 bits
       • → fits well on a 32-bit machine
     [Figure: complex multiplication of (I1, Q1) by (I2, Q2) using four multipliers, one subtraction and one addition]
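
The complex multiplication and the bit growth it causes can be sketched in a few lines of Python. This is an illustration only, not the instruction semantics of the proposed extension: the helper names and the packing layout (I in the low half-word, Q in the high half-word) are assumptions made for the example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack_iq(i, q, bits=16):
    """Pack I (low half) and Q (high half) into one word. Layout is an assumption."""
    mask = (1 << bits) - 1
    return ((q & mask) << bits) | (i & mask)


def unpack_iq(word, bits=16):
    """Inverse of pack_iq: returns the signed (I, Q) pair."""
    return to_signed(word, bits), to_signed(word >> bits, bits)


def complex_mul(a, b):
    """(i1 + j*q1) * (i2 + j*q2): four multiplies, one subtract, one add."""
    i1, q1 = a
    i2, q2 = b
    return i1 * i2 - q1 * q2, i1 * q2 + q1 * i2


def bits_needed(value):
    """Smallest two's-complement width that can hold value."""
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    return magnitude_bits + 1


# Worst-case corner for signed 8-bit inputs: the imaginary part reaches 32768.
worst = complex_mul((-128, -128), (-128, -128))
```

With signed 8-bit inputs, the corner case above produces an imaginary part of 32768, which needs 17 bits. This is one way to see why the output dynamic range has to be managed explicitly when overflow checks are considered too costly.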

  8. WHICH PROCESSOR FOR OUR SDR ?
     • Academic: dedicated processors
       • Custom SIMD [Chen, HPCA16]
         • Promising power consumption
         • Dedicated architectures: difficult to program
         • No software toolchains
       • Custom MCU [Wu, GlobalSIP16]
         • Low frequency clock
         • Large surface overheads
         • Inefficient use of advanced CMOS nodes
     • Commercial: GP processors, DSP
       • Previous work: M3/M0+ vs. RISCY [Belhadj, DATE19]
       • Lessons learned: GP processor can rival dedicated SoA processor architectures (with additional benefits)
       • Lessons learned: size of register file has huge impact on cycle count → RISC-V advantage!
       • Lessons learned: post-increment, HW loop, SIMD → not important in our test benches (mix of DSP computing and control)

  9. PROCESSOR CUSTOMIZATION
     • RISC-V-based acceleration?
     • Extend RISC-V ISA using dedicated instructions
     • Codasip Studio: → an easy task?
     • Instruction Accurate (IA) model of new instructions
     • Dedicated to RF DSP computing, "zero cost" hardware implementation
     [Figure: Codasip Studio toolset. HDK (CA): automatic RTL generation, powerful high-level synthesis, Verilog / VHDL. SDK (IA/CA): automatic toolchain generation, standards-based tools & models. Verification: automation and processor validation. VSP: virtual prototypes.]

  10. EXPLORING THE INSTRUCTION JUNGLE
      • Wanted
        • Minimal set of USEFUL instructions
        • Only 32-bit opcodes for low decoding complexity
      • Opportunities
        • Wide opcodes mean up to 5 operands!
        • First operation on 8-bit data is ALWAYS a complex multiplication
        • Advanced CMOS allows single-cycle operators
        • Tiny relative cost of ALU operators
      • REJECTED
        • More general solution preferred: halving variants (e.g. RADD)
        • Not clearly indispensable: CSMUL (complex-scalar multiply)
        • Useless: saturating instructions, MIN/MAX, 8-bit SIMD, CONJ
      [Figure annotation: 45 nm, 0.9 V [M. Horowitz, ISSCC 2014]]

  11. PROPOSED EXTENSION
      • 15 instructions using 3 major opcodes
      • "Zero-cost"
        • Reconfigurable HW
        • Systematic output DR adjust
      • "Low-cost"
        • 4 output / 2 input port register file
        • Duplicated ALU
      • "Higher-cost"
        • 3 more 32-bit multipliers

  12. WIRELESS DSP TESTBENCHES
      • Testbench 1: FSK demodulation
      • Testbench 2: LoRa preamble synchronization
        • Spreading Factor (SF) = 7, 11
      • Testbench 3: 16- and 32-bit FFT
        • Radix-4 decimation-in-frequency, complex FFT with bit-reversed outputs, N = 128, 2048
        • Based on source code from a port of the ARM CMSIS DSP library to RISC-V
      • Testbench 4: CORDIC algorithm
        • 10-iteration CORDIC algorithm applied to 32-bit complex input data
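
As an illustration of the kind of integer-only kernel Testbench 4 exercises, here is a minimal 10-iteration CORDIC sketch. The slides do not say which CORDIC mode the testbench uses; vectoring mode (magnitude and phase of a complex sample), the Q16 angle scaling and all names below are assumptions made for the example.

```python
import math

ITERATIONS = 10
# atan(2^-k) in Q16 fixed point (radians scaled by 2**16); the scaling is an assumption.
ATAN_Q16 = [round(math.atan(2.0 ** -k) * (1 << 16)) for k in range(ITERATIONS)]
# Constant gain accumulated by the CORDIC rotations (about 1.6468 for 10 iterations).
CORDIC_GAIN = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * k)) for k in range(ITERATIONS))


def cordic_vectoring(i, q):
    """Rotate (i, q) onto the I axis using only shifts and adds; requires i > 0.

    Returns (gain-scaled magnitude, phase in Q16 radians).
    """
    phase = 0
    for k in range(ITERATIONS):
        if q >= 0:
            # Rotate clockwise by atan(2^-k).
            i, q = i + (q >> k), q - (i >> k)
            phase += ATAN_Q16[k]
        else:
            # Rotate counter-clockwise by atan(2^-k).
            i, q = i - (q >> k), q + (i >> k)
            phase -= ATAN_Q16[k]
    return i, phase


mag, phase = cordic_vectoring(30000, 40000)
```

The loop body contains only additions, subtractions and shifts, which matches the integer-only constraint stated in the system requirements. The returned magnitude still carries the constant CORDIC gain, which a real kernel would typically fold into a later scaling step.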

  13. RESULTS
      Power model (relative power per instruction class):

        Instruction class                Baseline   +Extensions
        All instr. except NOP and MUL    1          1.05
        MUL                              1.14       1.14
        MULC16-32 / MULC16               -          1.3
        MULC32                           -          1.59

      • Expect at least ~50% power reductions with reduced clock and VDD.

        Testbench                      Cycle count improvement (IA model)   Energy improvement (est.)
        FSK Demod                      22 %                                 -
        LoRa, SF=7                     49 %                                 46 %
        LoRa, SF=11                    52 %                                 50 %
        16-bit FFT, N=128              55 %                                 53 %
        16-bit FFT, N=2048             57 %                                 55 %
        32-bit FFT, N=128              34 %                                 32 %
        32-bit FFT, N=2048             34 %                                 30 %
        32-bit CORDIC, 10 iteration    28 %                                 -
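
The slide does not state how the energy estimates were derived. One plausible reading, sketched below as an assumption, is that each executed cycle is weighted by the relative power of its instruction class from the power-model table. The dictionary keys and the instruction mix used here are hypothetical.

```python
# Relative power per instruction class, taken from the power-model table.
# NOP is not covered because the table gives no value for it.
BASELINE_POWER = {"other": 1.0, "mul": 1.14}
EXTENDED_POWER = {"other": 1.05, "mul": 1.14, "mulc16": 1.3, "mulc32": 1.59}


def relative_energy(cycles_by_class, power_by_class):
    """Sum of cycle counts weighted by relative power (arbitrary energy units)."""
    return sum(count * power_by_class[name] for name, count in cycles_by_class.items())


def energy_improvement(baseline_cycles, extended_cycles):
    """Fractional energy saving of the extended core versus the baseline core."""
    e_base = relative_energy(baseline_cycles, BASELINE_POWER)
    e_ext = relative_energy(extended_cycles, EXTENDED_POWER)
    return 1.0 - e_ext / e_base


# Hypothetical mix: 100 baseline cycles with no MUL, reduced to 51 cycles (49 % fewer).
improvement = energy_improvement({"other": 100}, {"other": 51})
```

Under this simplified mix, a 49 % cycle reduction gives an energy saving of about 46 %, because every remaining cycle costs 1.05 instead of 1. Mixes that rely more heavily on the complex-multiply classes would narrow the saving further.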

  14. FUTURE WORK
      • Finish CA model & run power/area analysis in 22 nm
      • Reconfigurable hardware blocks designed in CodAL. Ex: 32-bit multiplication
      [Figure: reconfigurable 32-bit multiplier. The 16-bit halves src1[15:0], src1[31:16], src2[15:0] and src2[31:16] feed four partial products p00[31:0], p10[31:0], p01[31:0] and p11[31:0]. Case "32-bit integer multiplication": the partial products are combined into P[63:0], with labels p_x[32:0], […00, p_x, 00..] and [p11[31:0], p00[31:0]]. Case "16-bit complex multiplication": a two's-complement stage is shown and the outputs are p_real[31:0] and p_imag[31:0].]

  15. Special thanks to: Hela Belhadj Amor, Zdenĕk Přikryl, Jerry Ardizzone
      And: Ivan Miro Panades, Yves Durand, Henri-Pierre Charles, Simone Bacles-Min, Romain Lemaire
      … and all of LISAN!
      Leti, technology research institute
      Commissariat à l’énergie atomique et aux énergies alternatives
      Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France
      www.leti.fr

  16. PROCESSOR CUSTOMIZATION
      • Step 1: ISA exploration using IA model
        element opc_name {
          use instance_data_type as name of instances;
          assembler {textual form of the instruction};
          binary {the instruction's binary coding};
          semantics {
            The instruction's behavior is described using a subset of the ANSI C language.
          };
        };
      Slide callouts: "Used by IA and CA models"; "Used by IA model" (next to the semantics section); "Call to memory interface if_ldst".

  17. RECONFIGURABLE MULTIPLIER (8 BIT EXAMPLE HERE)
      State 1: the block performs 8-bit integer multiplication
        a[7:0] * b[7:0] = P[15:0]
        p00[7:0] = a[3:0] * b[3:0]
        p10[7:0] = a[7:4] * b[3:0]
        p01[7:0] = a[3:0] * b[7:4]
        p11[7:0] = a[7:4] * b[7:4]
        P[15:0] = p00[7:0] + (p10[7:0] << 4) + (p01[7:0] << 4) + (p11[7:0] << 8)
      [Figure: the four partial products p00, p10, p01 and p11]
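
The partial-product decomposition of State 1 can be checked with a short Python sketch. It treats the operands as unsigned 8-bit values, which is an assumption (the slide does not specify signedness), and the function names are made up for the example.

```python
def partial_products(a, b):
    """Four 4x4-bit products of the nibbles of unsigned 8-bit operands a and b."""
    a_lo, a_hi = a & 0xF, (a >> 4) & 0xF
    b_lo, b_hi = b & 0xF, (b >> 4) & 0xF
    p00 = a_lo * b_lo  # a[3:0] * b[3:0]
    p10 = a_hi * b_lo  # a[7:4] * b[3:0]
    p01 = a_lo * b_hi  # a[3:0] * b[7:4]
    p11 = a_hi * b_hi  # a[7:4] * b[7:4]
    return p00, p10, p01, p11


def mul8_from_partials(a, b):
    """Recombine the partial products into the full 16-bit product."""
    p00, p10, p01, p11 = partial_products(a, b)
    return p00 + (p10 << 4) + (p01 << 4) + (p11 << 8)
```

The cross terms p10 and p01 are each weighted by 2^4 and the high term p11 by 2^8, so the recombination matches a plain 8-bit multiply for every input pair.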

  18. RECONFIGURABLE MULTIPLIER (8 BIT EXAMPLE HERE)
      State 2: the block performs a 4-bit complex integer multiplication:
        (I1 + j*Q1) * (I2 + j*Q2) = P_real + j*P_imag
      Input is redefined:
        I1[3:0] = a[3:0]
        Q1[3:0] = a[7:4]
        I2[3:0] = b[3:0]
        Q2[3:0] = b[7:4]
      [Figure: operands Q1, I1 and Q2, I2 feeding the partial products p00, p10, p01 and p11]

  19. RECONFIGURABLE MULTIPLIER (8 BIT EXAMPLE HERE)
      State 2: the block performs a 4-bit complex integer multiplication:
        (I1 + j*Q1) * (I2 + j*Q2) = P_real + j*P_imag
        P_real = I1*I2 - Q1*Q2 = p00[7:0] - p11[7:0]
        P_imag = I1*Q2 + Q1*I2 = p01[7:0] + p10[7:0]
      [Figure: operands Q1, I1 and Q2, I2 feeding the partial products p00, p10, p01 and p11]
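
State 2 reuses the same four partial products and only changes how they are combined. The sketch below assumes that I and Q are signed two's-complement nibbles, which the slides do not state explicitly, and uses hypothetical helper names.

```python
def to_signed4(nibble):
    """Interpret the low 4 bits as a two's-complement value in [-8, 7]."""
    nibble &= 0xF
    return nibble - 16 if nibble & 0x8 else nibble


def complex_mul_4bit(a, b):
    """4-bit complex multiply on packed bytes: I in bits [3:0], Q in bits [7:4]."""
    i1, q1 = to_signed4(a), to_signed4(a >> 4)
    i2, q2 = to_signed4(b), to_signed4(b >> 4)
    p00 = i1 * i2  # a[3:0] * b[3:0]
    p10 = q1 * i2  # a[7:4] * b[3:0]
    p01 = i1 * q2  # a[3:0] * b[7:4]
    p11 = q1 * q2  # a[7:4] * b[7:4]
    return p00 - p11, p01 + p10  # (P_real, P_imag)


# (3 - 2j) * (-4 + 5j): a packs I1=3, Q1=-2 and b packs I2=-4, Q2=5.
p_real, p_imag = complex_mul_4bit(0xE3, 0x5C)
```

Compared with State 1, no shifting is needed: the two "diagonal" products are subtracted to form the real part and the two cross products are added to form the imaginary part, which is what makes sharing the multiplier array attractive.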
