Multicore DSP Architecture and Programming, O. Dahl (PowerPoint presentation)

SLIDE 1

System requirements System design System development Summary

Multicore DSP Architecture and Programming

O. Dahl¹

¹ Electrical Engineering, Linköping University, Linköping, Sweden

Guest lecture in TDDD56 Multicore and GPU Programming, LiU, December 5, 2011

SLIDE 2

Personal background

  • At LiU since 2011-01-01, at ISY (Institutionen för systemteknik, the Department of Electrical Engineering): Associate Professor in System Integration (a new subject at the department), http://www.da.isy.liu.se/~olad/
  • Moved from ST-Ericsson
  • Started at Ericsson in November 2006: worked with applications, software architecture, LTE design, and simulation for software development
  • Before that: engineer, consultant, manager, and associate professor in Computer Science and Automatic Control
  • Experience in software development, system engineering, system development, simulation, real-time systems, and control
  • PhD in Automatic Control, Lund, 1992

SLIDE 3

Problem to solve

How do we make a wireless modem for 3GPP LTE (and for older standards as well, e.g. WCDMA and GSM)?

SLIDE 4

Challenges

Meet requirements in:
  • High-speed wireless mobile communication (> 100 Mbit/s)
  • Standards compliance (3GPP)
  • Competitiveness: silicon size (square millimeters), power consumption (mW to W)
  • Flexibility (many standards, backwards compatibility)

SLIDE 5

References

  • The Smartphone Disruption [Gustafsson, 2011]
  • ST-Ericsson M7400 [ST-Ericsson, 2011]
  • 3GPP [3GPP, 2011]
  • ePUMA [ePUMA, 2011], with contributions from Joar Sohl and Andreas Karlsson
  • Coresonic [Coresonic, 2011]
  • ST-Ericsson EVP [ST-Ericsson, 2009]
  • SystemC and TLM: http://www.systemc.org (temporarily down until December 7 due to the merger with Accellera; see e.g. [Doulos, 2011])
  • Virtual platforms, e.g. [Corleto, 2009]
  • Wikipedia

SLIDE 6

Outline

1. System requirements: 3GPP; LTE - basic concepts
2. System design: DSP; DSP - ePUMA; ASIC; Control processors
3. System development
4. Summary

SLIDE 8

3GPP LTE

  • Specifications from [3GPP, 2011], e.g. 36.201, 36.211, 36.212
  • Increased data rates, e.g. 100-300 Mbit/s downlink, > 50 Mbit/s uplink
  • Scalable channel bandwidth
  • OFDM, MIMO
  • Packet-switched, all-IP solution (no circuit switching)
  • Sub-5 ms latency
  • Overview in e.g. [Agilent, 2009]

SLIDE 9

Digital modulation

  • Map a sequence of bits to a complex number
  • QAM: Quadrature Amplitude Modulation [Wikipedia, 2011a]
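A minimal Python sketch of the bit-to-symbol mapping, assuming square QAM with Gray coding on each axis and scaling to unit average power. The function names are hypothetical, and the exact LTE mapping tables for QPSK, 16QAM and 64QAM (3GPP TS 36.211) are not reproduced, so individual bit patterns may land on different constellation points than in LTE.

```python
import math


def gray_to_binary(g: int) -> int:
    """Invert a Gray code: return the position whose Gray code is g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def axis_level(bits: list[int], levels: int) -> int:
    """Map Gray-coded bits to one of the odd amplitudes -(levels-1), ..., levels-1."""
    g = int("".join(str(b) for b in bits), 2)
    return 2 * gray_to_binary(g) - (levels - 1)


def qam_map(bits: list[int], bits_per_symbol: int) -> list[complex]:
    """Map a bit sequence to square-QAM symbols (2 -> QPSK, 4 -> 16QAM, 6 -> 64QAM).

    The first half of each group of bits selects I, the second half selects Q.
    Symbols are scaled to unit average power.
    """
    if bits_per_symbol % 2 or len(bits) % bits_per_symbol:
        raise ValueError("bits_per_symbol must be even and divide len(bits)")
    half = bits_per_symbol // 2
    levels = 1 << half                               # amplitude levels per axis
    scale = 1 / math.sqrt(2 * (levels**2 - 1) / 3)   # mean |s|^2 of unscaled M-QAM is 2(M-1)/3
    symbols = []
    for k in range(0, len(bits), bits_per_symbol):
        group = bits[k:k + bits_per_symbol]
        i = axis_level(group[:half], levels)
        q = axis_level(group[half:], levels)
        symbols.append(complex(i, q) * scale)
    return symbols
```

For example, `qam_map([0, 0, 1, 1], 2)` gives two QPSK symbols, (-1-1j)/√2 and (1+1j)/√2; with `bits_per_symbol=6` each group of six bits selects one of 64 points, i.e. the 6 bits per resource element of 64QAM.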

SLIDE 10

I and Q - complex numbers

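Introduce the carrier frequency $\omega_c$.
Send $s(t) = I(t)\cos(\omega_c t) + Q(t)\sin(\omega_c t)$.
Receive, with disturbance $n(t)$, $\hat{s}(t) = s(t) + n(t)$.
Define $s_1(t) = \hat{s}(t)\cos(\hat{\omega}_c t)$ and calculate

$$
\begin{aligned}
s_1(t) &= \hat{s}(t)\cos(\hat{\omega}_c t) \\
&= I(t)\cos(\omega_c t)\cos(\hat{\omega}_c t) + Q(t)\sin(\omega_c t)\cos(\hat{\omega}_c t) + n(t)\cos(\hat{\omega}_c t) \\
&= \tfrac{1}{2} I(t)\big(\cos((\omega_c - \hat{\omega}_c)t) + \cos((\omega_c + \hat{\omega}_c)t)\big) + \tfrac{1}{2} Q(t)\big(\sin((\omega_c + \hat{\omega}_c)t) + \sin((\omega_c - \hat{\omega}_c)t)\big) + n(t)\cos(\hat{\omega}_c t)
\end{aligned}
$$

Low-pass filtering removes the components at $\omega_c + \hat{\omega}_c$; with $\hat{\omega}_c \approx \omega_c$ we have $\cos((\omega_c - \hat{\omega}_c)t) \approx 1$ and $\sin((\omega_c - \hat{\omega}_c)t) \approx 0$, so $2 s_1(t) \approx I(t)$.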

SLIDE 11

I and Q - complex numbers

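Similarly, define $s_2(t) = \hat{s}(t)\sin(\hat{\omega}_c t)$ and calculate

$$
\begin{aligned}
s_2(t) &= \hat{s}(t)\sin(\hat{\omega}_c t) \\
&= I(t)\cos(\omega_c t)\sin(\hat{\omega}_c t) + Q(t)\sin(\omega_c t)\sin(\hat{\omega}_c t) + n(t)\sin(\hat{\omega}_c t) \\
&= \tfrac{1}{2} I(t)\big(\sin((\hat{\omega}_c + \omega_c)t) + \sin((\hat{\omega}_c - \omega_c)t)\big) + \tfrac{1}{2} Q(t)\big(\cos((\hat{\omega}_c - \omega_c)t) - \cos((\hat{\omega}_c + \omega_c)t)\big) + n(t)\sin(\hat{\omega}_c t)
\end{aligned}
$$

Low-pass filtering removes the components at $\hat{\omega}_c + \omega_c$; with $\hat{\omega}_c \approx \omega_c$ this gives $2 s_2(t) \approx Q(t)$.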
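The derivation can be checked numerically. The Python sketch below assumes an ideal receiver oscillator ($\hat{\omega}_c = \omega_c$), no noise and constant I and Q, and uses averaging over a whole number of carrier periods as a crude low-pass filter; the carrier and sample frequencies are arbitrary illustrative values and the function name is hypothetical.

```python
import math


def iq_roundtrip(i_val: float, q_val: float, fc: float = 1e6,
                 fs: float = 64e6, n_periods: int = 50) -> tuple[float, float]:
    """Modulate constant I and Q onto a carrier, then recover them.

    Receiver: multiply by cos/sin of the (assumed perfectly known) carrier,
    then low-pass filter by averaging over a whole number of carrier periods,
    which removes the components at twice the carrier frequency.
    """
    wc = 2 * math.pi * fc
    n = round(fs / fc * n_periods)               # samples covering n_periods carrier periods
    acc_1 = acc_2 = 0.0
    for k in range(n):
        t = k / fs
        s = i_val * math.cos(wc * t) + q_val * math.sin(wc * t)   # s(t), no noise
        acc_1 += s * math.cos(wc * t)                             # s1(t)
        acc_2 += s * math.sin(wc * t)                             # s2(t)
    return 2 * acc_1 / n, 2 * acc_2 / n                           # 2*s1 ~ I, 2*s2 ~ Q
```

Because the $2\omega_c$ terms average out exactly over whole periods, `iq_roundtrip(0.7, -0.3)` returns (0.7, -0.3) up to floating-point rounding. With a carrier offset or a real low-pass filter the recovery becomes approximate, as the ≈ in the slides indicates.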

SLIDE 12

OFDM

  • Send data on multiple frequencies (subcarriers)
  • Send during a symbol interval $T_u$
  • Use subcarrier spacing $\Delta f = 1/T_u$
  • In LTE, $\Delta f = 15$ kHz (mostly), i.e. $T_u = 1/\Delta f \approx 66.7\ \mu\text{s}$

SLIDE 13

OFDM - orthogonality

Fourier transform of a pulse [Wikipedia, 2011b]

SLIDE 14

OFDM - orthogonality

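Orthogonality: the signals on two subcarriers, $x_1(t) = a_1 e^{j2\pi k_1 \Delta f t}$ and $x_2(t) = a_2 e^{j2\pi k_2 \Delta f t}$, fulfil

$$
\int_{mT_u}^{(m+1)T_u} x_1(t)\,x_2^*(t)\,dt = \int_{mT_u}^{(m+1)T_u} a_1 a_2^* e^{j2\pi (k_1 - k_2)\Delta f t}\,dt = 0 \quad \text{for } k_1 \neq k_2
$$

since $\Delta f\,T_u = 1$, so the integrand completes exactly $|k_1 - k_2|$ full periods over the symbol interval.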
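A numerical check of the orthogonality condition, as a Python sketch: the integral is approximated by a Riemann sum with an arbitrarily chosen number of points N and normalized by $T_u$, so that equal subcarriers give 1. The constant and function names are hypothetical.

```python
import cmath

DELTA_F = 15e3        # LTE subcarrier spacing in Hz (from the slides)
T_U = 1 / DELTA_F     # symbol interval T_u, about 66.7 us
N = 1024              # integration points per symbol (arbitrary choice)


def normalized_inner_product(k1: int, k2: int, m: int = 0) -> complex:
    """Riemann-sum approximation of (1/T_u) * integral over [m*T_u, (m+1)*T_u]
    of x1(t) * conj(x2(t)), with a1 = a2 = 1."""
    dt = T_U / N
    total = 0j
    for n in range(N):
        t = m * T_U + n * dt
        total += cmath.exp(2j * cmath.pi * (k1 - k2) * DELTA_F * t)
    return total * dt / T_U
```

For distinct subcarrier indices (with |k1 - k2| < N, so that the sampled sum does not alias) the result is zero up to rounding for any symbol index m, while `normalized_inner_product(5, 5)` is 1.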

SLIDE 15

OFDM - implementation using FFT

OFDM can be implemented using the FFT (Fast Fourier Transform) at the receiver side and the IFFT (inverse FFT) at the sender side.
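To make the IFFT/FFT structure concrete, here is a Python sketch that uses a naive O(N²) DFT in place of a real FFT. The subcarrier mapping (symbols on the lowest bins, the rest zero) is a simplification chosen for illustration; LTE additionally leaves the DC subcarrier unused and inserts a cyclic prefix, both omitted here, and the channel is assumed ideal.

```python
import cmath


def idft(X: list[complex]) -> list[complex]:
    """Inverse DFT, O(N^2); an IFFT computes the same result in O(N log N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]


def dft(x: list[complex]) -> list[complex]:
    """Forward DFT, O(N^2); an FFT computes the same result in O(N log N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def ofdm_modulate(symbols: list[complex], n_fft: int) -> list[complex]:
    """Sender: one modulation symbol per subcarrier, unused subcarriers zero, then IDFT."""
    if len(symbols) > n_fft:
        raise ValueError("more symbols than subcarriers")
    X = list(symbols) + [0j] * (n_fft - len(symbols))
    return idft(X)


def ofdm_demodulate(samples: list[complex], n_used: int) -> list[complex]:
    """Receiver: DFT, then read back the used subcarriers."""
    return dft(samples)[:n_used]
```

With `tx = [1+1j, -1+1j, 1-1j, -1-1j]`, `ofdm_demodulate(ofdm_modulate(tx, 8), 4)` gives back `tx` up to rounding: each subcarrier is recovered independently, which is the orthogonality property in discrete form.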

SLIDE 16

OFDM and modulation - sender

[Wikipedia, 2011c]

SLIDE 17

OFDM and modulation - receiver

[Wikipedia, 2011c]

SLIDE 18

Coding and Decoding

  • The main coding algorithm is turbo coding with code rate R = 1/3
  • Convolutional coding (for the BCH, broadcast channel)

Turbo encoder [Wikipedia, 2011d]
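A Python sketch of a rate-1/3 turbo encoder in the LTE style: two identical 8-state recursive systematic convolutional (RSC) encoders with feedback polynomial $g_0(D) = 1 + D^2 + D^3$ and feedforward polynomial $g_1(D) = 1 + D + D^3$, the second fed through a quadratic permutation polynomial (QPP) interleaver. Trellis termination and tail bits are omitted, the function names are hypothetical, and the interleaver parameters used below (K = 40, f1 = 3, f2 = 10) should be checked against the parameter table in 3GPP TS 36.212 before being relied on.

```python
def rsc_parity(bits: list[int]) -> list[int]:
    """Recursive systematic convolutional encoder, parity output only.

    Feedback g0(D) = 1 + D^2 + D^3, feedforward g1(D) = 1 + D + D^3.
    Registers s1, s2, s3 hold w[k-1], w[k-2], w[k-3]; no trellis termination.
    """
    s1 = s2 = s3 = 0
    parity = []
    for u in bits:
        w = u ^ s2 ^ s3              # w[k] = u[k] + w[k-2] + w[k-3] (mod 2)
        parity.append(w ^ s1 ^ s3)   # z[k] = w[k] + w[k-1] + w[k-3] (mod 2)
        s1, s2, s3 = w, s1, s2       # shift the register
    return parity


def qpp_interleaver(k: int, f1: int, f2: int) -> list[int]:
    """Quadratic permutation polynomial: pi(i) = (f1*i + f2*i^2) mod k."""
    perm = [(f1 * i + f2 * i * i) % k for i in range(k)]
    if sorted(perm) != list(range(k)):
        raise ValueError("f1, f2 do not give a permutation for this k")
    return perm


def turbo_encode(bits: list[int], perm: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Rate-1/3 parallel concatenation: systematic bits, parity 1, parity 2 (interleaved input)."""
    interleaved = [bits[p] for p in perm]
    return list(bits), rsc_parity(bits), rsc_parity(interleaved)
```

With `perm = qpp_interleaver(40, 3, 10)`, `turbo_encode(bits, perm)` for 40 input bits returns 3 × 40 = 120 coded bits (systematic, parity 1, parity 2), i.e. code rate 1/3 before termination and rate matching.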

SLIDE 19

Parallel signal processing

  • OFDM symbols are received in series from the radio interface and processed in parallel
  • Processing stages include e.g. FFT, demodulation, control decoding, and data decoding
  • Uplink processing proceeds in parallel

SLIDE 20

Channel estimation

  • Estimate the properties of the channel
  • Compensate for channel effects
  • Communication with the base station
  • Reference signals (pilot symbols)
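A Python sketch of pilot-based channel estimation for one OFDM symbol: a least-squares estimate $\hat{H}_k = Y_k / X_k$ on the pilot subcarriers, linear interpolation in between, and zero-forcing compensation. The function names and the pilot positions in the usage example are hypothetical; the actual LTE reference-signal pattern (3GPP TS 36.211) and more robust estimators (e.g. MMSE, interpolation across time) are not modeled.

```python
import bisect


def ls_channel_estimate(rx: list[complex], pilots: dict[int, complex]) -> list[complex]:
    """Estimate the channel on every subcarrier from known pilot symbols.

    rx:     received frequency-domain values Y[k] for one OFDM symbol
    pilots: subcarrier index -> known transmitted pilot X[k]
    Least-squares H = Y / X at the pilots, linear interpolation in between,
    and the nearest pilot value held outside the pilot span.
    """
    pos = sorted(pilots)
    h = [rx[p] / pilots[p] for p in pos]
    est = []
    for k in range(len(rx)):
        j = bisect.bisect_left(pos, k)          # first pilot with pos[j] >= k
        if j == 0:
            est.append(h[0])
        elif j == len(pos):
            est.append(h[-1])
        elif pos[j] == k:
            est.append(h[j])
        else:
            w = (k - pos[j - 1]) / (pos[j] - pos[j - 1])
            est.append((1 - w) * h[j - 1] + w * h[j])
    return est


def equalize(rx: list[complex], h_est: list[complex]) -> list[complex]:
    """Zero-forcing compensation: divide each received value by the channel estimate."""
    return [y / h for y, h in zip(rx, h_est)]
```

For a channel that varies linearly across the subcarriers, with e.g. pilots at subcarriers 0, 6 and 11 of a 12-subcarrier block, the interpolated estimate is exact and `equalize` recovers the transmitted symbols; for real channels the interpolation error depends on the pilot density.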

SLIDE 21

MIMO

  • Multiple antennas
  • Diversity techniques
  • Spatial multiplexing (send more than one data stream)

SLIDE 22

And there is more ...

  • Synchronization (time, frequency)
  • Cell search
  • Receiving system information
  • Power control
  • Uplink synchronization (timing advance)
  • FDD and TDD
  • Random access
  • Paging
  • HARQ

... and this is only L1. We also have to make a complete protocol stack, and it has to be mobile (handover etc.).

SLIDE 23

What speed do we get?

  • 20 MHz bandwidth, 1200 subcarriers
  • 14 OFDM symbols in one subframe (1 ms)
  • 64QAM: 6 bits per resource element
  • 14 × 6 × 1200 / 1 ms = 100 800 000 bit/s ≈ 100.8 Mbit/s (without coding and control information, but also without MIMO)
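The calculation above as a small Python function (the name is hypothetical). A 1 ms subframe means 1000 subframes per second; the `layers` parameter is an added assumption showing idealized spatial multiplexing, still ignoring coding, control channels and reference signals.

```python
def raw_peak_rate(subcarriers: int = 1200, symbols_per_subframe: int = 14,
                  bits_per_re: int = 6, layers: int = 1,
                  subframes_per_second: int = 1000) -> int:
    """Raw bit rate in bit/s: resource elements per subframe times bits per element.

    Ignores coding and control overhead, as on the slide.
    """
    bits_per_subframe = subcarriers * symbols_per_subframe * bits_per_re * layers
    return bits_per_subframe * subframes_per_second
```

`raw_peak_rate()` returns 100 800 000 bit/s, as on the slide; `raw_peak_rate(layers=2)` gives 201 600 000 bit/s for two ideal MIMO layers.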

SLIDE 25

Building blocks

  • DSP
  • ASIC
  • Control processors

SLIDE 26

And there is more ...

  • Application processors
  • Radio, radio interface
  • Interconnect, buses
  • Memory, caches
  • Power management, thermal management
  • Imaging, video, graphics, display
  • Storage, e.g. flash, memory card

SLIDE 27

Leocore

Information from [Coresonic, 2011, Anjum et al., 2011]
  • Leocore: an ASIP (application-specific instruction set processor) for baseband processing
  • Identify common operations in baseband processing: a domain-specific architecture
  • Coresonic Developer Studio
  • SIMT™: Single Instruction-flow Multiple Tasks
  • Units for complex calculations, a control unit (RISC), accelerators for FEC (Viterbi, Turbo)
  • DFE interface, MAC interface

SLIDE 28

Master thesis proposal - Parallel Simulation of Multicore DSP Systems for Software Defined Radio

  • Develop a parallel version of a simulation tool
  • Utilize multiple cores on the host
  • Threads: partitioning, synchronization, interaction
  • Static analysis, dynamic analysis
  • Requires competence in concurrent programming and hardware/software interaction. Knowledge of DSP hardware and software is beneficial, but not strictly required
  • C++, some Python

More info at [Computer Engineering, 2011]

SLIDE 29

EVP

Information from [ST-Ericsson, 2009]
  • EVP: vector processor (SIMD)
  • VLIW instructions: 6 parallel vector operations, 4 parallel scalar operations
  • C control structures
  • Code generator for scrambling and for generating channelization codes
  • < 0.5 mW/MHz

SLIDE 30

ePUMA

  • Research project at the Division of Computer Engineering, in cooperation with Information Coding (ISY) and IDA (parallel programming)
  • Overview
  • Master thesis proposals

SLIDE 31

ePUMA

  • Highly parallel processor for predictable DSP tasks
  • Heterogeneous design:
      ◦ 1 master control processor
      ◦ 8 slave processor cores
  • Exploited parallelism:
      ◦ Task parallelism (several processor cores)
      ◦ Data parallelism (SIMD instructions on the slave processors)

SLIDE 32

Applications

Some example applications:
  • Baseband processing
  • Media processing
  • Radar

These often run in constrained environments, such as phones. Ordinary processors often fail because of:
  • high power consumption
  • high cost
  • low performance

SLIDE 33

Properties of DSP algorithms

Most DSP algorithms share some common traits:
  • Predictable addressing, i.e. the addresses of the accessed values are not data dependent
  • Few branches other than back jumps in loops
  • Constant iteration counts

Application Specific Instruction set Processors (ASIPs) for DSP take advantage of this to solve the problems listed above.
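A direct-form FIR filter is a typical DSP kernel with exactly these traits. In the Python sketch below (function name hypothetical), the loop counts depend only on the signal and filter lengths, the only branches are the loop back-jumps, and every address (`n - k`) is computed from loop counters, never from the data. As a general observation rather than a description of a specific processor, this is the kind of code that DSP ASIPs typically map onto address generation units and hardware loop support.

```python
def fir(x: list[float], h: list[float]) -> list[float]:
    """Direct-form FIR filter, y[n] = sum_k h[k] * x[n - k], for n >= len(h) - 1."""
    taps = len(h)
    y = []
    for n in range(taps - 1, len(x)):   # trip count fixed by the lengths, not the data
        acc = 0.0
        for k in range(taps):           # constant iteration count: taps
            acc += h[k] * x[n - k]      # address n - k depends only on loop counters
        y.append(acc)
    return y
```

For example, `fir([1, 2, 3, 4], [1, -1])` returns `[1.0, 1.0, 1.0]` (a first difference).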

SLIDE 34

System overview

Figure: ePUMA system overview, with a Master core, a DMA unit, Main Memory, eight Sleipnir cores (Sleipnir 0-7), and on-chip network nodes N0-N7.

SLIDE 35

Memory hierarchy

Figure: memory hierarchy with three levels (Level 1 to Level 3), spanning the core registers, the local memories of the Master (LS, PM, DM 0, DM 1) and of each Sleipnir core 0-7 (LS, PM, CM, LVM 1-3), the on-chip interconnection, and off-chip main memory.

SLIDE 36

Sleipnir features

  • Scratchpad-memory-based programming: no data cache
  • Up to 16-way SIMD datapath (operates on 128-bit data vectors)
  • Up to 16 real or 4 complex multiplications per cycle (16-bit data)
  • Supported data types:
      ◦ Real fixed-point data: 8, 16, 32 bits
      ◦ Complex fixed-point data: 16- or 32-bit real and imaginary parts
      ◦ Single-precision floating point (32 bits)
  • Special-purpose instructions: DCT, butterflies, sort, ...
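Why 16 real multipliers give 4 complex multiplications: each complex product needs four real products. The Python sketch below shows this for 16-bit Q15 fixed-point data; the function names are hypothetical, and the rounding (add half, then shift) and saturation conventions are assumptions for illustration, not Sleipnir's documented behavior.

```python
Q15_MIN, Q15_MAX = -32768, 32767


def saturate16(v: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, v))


def to_q15(x: float) -> int:
    """Quantize a real value in [-1, 1) to Q15 (16 bits, 15 fractional bits)."""
    return saturate16(round(x * 32768))


def cmul_q15(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Multiply two complex Q15 numbers given as (re, im).

    Four real 16x16-bit products: (ar*br - ai*bi) + j(ar*bi + ai*br).
    The 32-bit (Q30) results are rounded back to Q15 and saturated.
    """
    ar, ai = a
    br, bi = b
    p_re = ar * br - ai * bi
    p_im = ar * bi + ai * br
    half = 1 << 14                       # rounding constant for the shift by 15
    return saturate16((p_re + half) >> 15), saturate16((p_im + half) >> 15)
```

For example, (0.5 + 0.5j)(0.5 - 0.5j) = 0.5 gives `(16384, 0)`, and (-1)(-1) saturates to `(32767, 0)` because +1 is not representable in Q15.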

SLIDE 37

Sleipnir customization

Many parameters can be customized:
  • Instruction set
  • Local memory and register file sizes
  • AGU (address generation unit) capabilities
  • Accelerators

Parameter                         Value
Local vector memory (LVM) size    Up to 8k 128-bit vectors (128 kB)
Register file size                8-32 vectors (0.125-0.5 kB)
Constant memory size              Up to 256 vectors (4 kB)
Program memory size               Typically 8-16 kB

SLIDE 38

Addressing

Normally, many cycles are wasted on rearranging data with shuffle instructions. This is often due to data alignment issues and bank conflicts.

SLIDE 39

Data access

Consider the following address layout in a single-bank memory. The only vectors of length four that can be accessed in one cycle are the row vectors {0, ..., 3}, {4, ..., 7}, {8, ..., 11} and {12, ..., 15}. Accessing one of the colored column vectors takes 4 cycles.

SLIDE 40

Multi-bank

By splitting the memory into different banks (which increases the area cost somewhat), the only constraint is that no two accessed elements reside in the same bank. So while we may now access e.g. vectors {x, ..., x+3} in one cycle, the columns still take four cycles to access.


SLIDE 41

Multi-bank and permutation

Given that the access patterns are known in advance, as is common in DSP algorithms, we may permute the mapping from logical to physical addresses. The slide shows an example of a permutation that allows single-cycle access to the columns: no two elements of any column reside in the same memory bank.

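The bank-conflict argument of the last three slides can be modeled in a few lines of Python. The sketch assumes the 4 × 4 layout from the slides (addresses 0-15 stored row by row) and four single-ported banks; all names are hypothetical, and `bank_skewed`, which rotates row r by r banks, is one classic permutation that gives conflict-free column access, not necessarily the permutation used in ePUMA.

```python
from collections import Counter
from collections.abc import Callable

COLS = 4     # 4 x 4 matrix, addresses 0..15 stored row by row (as on the slides)
BANKS = 4    # number of single-ported memory banks (assumed)


def cycles_single_bank(addrs: list[int]) -> int:
    """Single bank: one row (COLS consecutive aligned addresses) per cycle."""
    return len({a // COLS for a in addrs})


def cycles_multi_bank(addrs: list[int], bank_of: Callable[[int], int]) -> int:
    """Multi-bank: each bank serves one element per cycle, so conflicts serialize."""
    return max(Counter(bank_of(a) for a in addrs).values())


def bank_linear(addr: int) -> int:
    """Consecutive addresses go to consecutive banks."""
    return addr % BANKS


def bank_skewed(addr: int) -> int:
    """Example permutation: row r is rotated by r positions across the banks."""
    row, col = divmod(addr, COLS)
    return (row + col) % BANKS


row = [5, 6, 7, 8]        # unaligned vector {x, ..., x+3} with x = 5
column = [1, 5, 9, 13]    # second column
```

Here `cycles_single_bank(column)` and `cycles_multi_bank(column, bank_linear)` are both 4, while `cycles_multi_bank(column, bank_skewed)` is 1, and aligned rows such as [4, 5, 6, 7] stay single-cycle. The unaligned row [5, 6, 7, 8], however, now needs 2 cycles (addresses 5 and 8 both map to bank 2), which is why the permutation has to be chosen for the access patterns that are known in advance.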

SLIDE 42

ePUMA system benchmarks

Algorithm Sleipnir Cell GTX280 8 × 8 DCT/Q 5 ≈ 59 8 × 8 DCT 4 ≈ 66

Table: Average required clock cycles

Execution time for DCT/Quantization for ePUMA @ 300 MHz ≈ Cell @ 3.2 GHz.

SLIDE 43

Master thesis proposal - Sleipnir accelerator interface design

Some problems are not well handled by processors. One solution: design an accelerator.

Thesis proposal:
  • Design an accelerator interface to Sleipnir
  • Implement and evaluate accelerators for Sleipnir

SLIDE 44

Master thesis proposal - Memory architecture evaluation

Evaluate different memory configurations for our multicore architecture:
  • Single-bank
  • Multi-bank
  • Multi-bank with permutation
  • Multi-port
  • Cache
  • ...

Investigate the impact of the memory architecture for different applications. Interesting aspects:
  • Performance
  • Power consumption
  • Chip area

SLIDE 45

Master thesis proposal - FPGA Board Demo of ePUMA

Setting up an FPGA board demo of ePUMA to verify the hardware design. Goals:
  • Set up the demo environment
  • Test some of our existing demo applications (MotionJPEG and MPEG2 decoder) on real hardware
  • Possibility to set up your own demo!

More info at [Computer Engineering, 2011]

SLIDE 46

ASIC

  • Decide which blocks to implement in hardware
  • Decide on programmability
  • Power consumption

SLIDE 47

Control processor(s)

  • Modem control
  • Power control
  • ARM Cortex-R, Cortex-M (Cortex-A)
  • RTOS

SLIDE 49

Parallel development

  • Concurrent development of hardware and software
  • Hardware simulation for software development
  • Virtual platform

SLIDE 50

SystemC

  • Event-driven simulation framework
  • Handles time and parallel activities
  • Standardized by OSCI and IEEE
  • C++ class library
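To illustrate what an event-driven kernel does, here is a minimal discrete-event scheduler in Python. It is not SystemC's API: processes are Python generators that yield how long they want to wait, loosely similar to an SC_THREAD calling wait(), and SystemC features such as events, delta cycles and channels are left out. All names are hypothetical.

```python
import heapq
import itertools
from collections.abc import Generator

Process = Generator[int, None, None]   # a process yields the time it wants to wait


class Simulator:
    """Tiny discrete-event kernel: processes are generators that yield wait times."""

    def __init__(self) -> None:
        self.now = 0
        self._queue: list[tuple[int, int, Process]] = []   # (wake-up time, seq, process)
        self._seq = itertools.count()                      # FIFO order for equal times

    def spawn(self, process: Process) -> None:
        heapq.heappush(self._queue, (self.now, next(self._seq), process))

    def run(self) -> None:
        while self._queue:
            t, _, proc = heapq.heappop(self._queue)
            self.now = t                          # advance simulated time to the event
            try:
                delay = next(proc)                # resume the process until it waits again
            except StopIteration:
                continue                          # process finished
            heapq.heappush(self._queue, (self.now + delay, next(self._seq), proc))


def ticker(sim: Simulator, name: str, period: int, ticks: int, log: list) -> Process:
    """A periodic activity, e.g. a hypothetical DSP or control task."""
    for _ in range(ticks):
        log.append((sim.now, name))
        yield period
```

Spawning `ticker(sim, "dsp", 3, 3, log)` and `ticker(sim, "ctrl", 5, 2, log)` and calling `sim.run()` interleaves the two activities in simulated time: `log` becomes `[(0, 'dsp'), (0, 'ctrl'), (3, 'dsp'), (5, 'ctrl'), (6, 'dsp')]`.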

SLIDE 51

TLM

  • Transaction-level modeling
  • Function calls vs. pin-level simulation
  • Bit-accurate interfaces
  • Varying degrees of timing can be added (loosely timed, approximately timed)
  • Hardware modeling for software verification
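The difference between a function call and pin-level simulation can be sketched in Python. The method name below imitates TLM-2.0's blocking transport interface (in C++, `b_transport` taking a generic payload and an `sc_time` delay by reference), but this is a simplified mock: the payload has only command, address and data, and loose timing is modeled by returning the delay plus an assumed fixed latency instead of simulating bus signals cycle by cycle.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Much-simplified stand-in for a TLM-2.0 generic payload."""
    command: str        # "read" or "write"
    address: int
    data: bytearray     # bytes to write, or buffer to fill on read


class MemoryTarget:
    """Memory model serving a whole transaction through one function call."""

    def __init__(self, size: int, latency_ns: int = 10) -> None:
        self.mem = bytearray(size)
        self.latency_ns = latency_ns    # assumed per-access latency (loosely timed)

    def b_transport(self, trans: Transaction, delay_ns: int) -> int:
        """Perform the access and return the delay annotated with this target's latency."""
        end = trans.address + len(trans.data)
        if trans.address < 0 or end > len(self.mem):
            raise IndexError("access outside memory")
        if trans.command == "write":
            self.mem[trans.address:end] = trans.data
        elif trans.command == "read":
            trans.data[:] = self.mem[trans.address:end]
        else:
            raise ValueError(f"unknown command {trans.command!r}")
        return delay_ns + self.latency_ns
```

Writing four bytes at address 16 and reading them back through two `b_transport` calls returns the same bytes with an accumulated delay of 20 ns; no clock edges or bus signals are simulated, which is where the speed-up of TLM over pin-level models comes from.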

SLIDE 52

Virtual platform

  • A virtual representation of the system
  • SystemC and TLM
  • Processor models
  • Peripheral models
  • Commercial tools
  • Model handling: signal processing models, HW verification models, virtual platform models
  • Acceptance and usage, finding bugs, early SW development and verification, release of platform, supporting different RATs, software layer dependencies

SLIDE 54

Summarizing notes

  • LTE as an example of multicore digital signal processing
  • Digital signal processors and ASIC blocks
  • Control processors
  • 3GPP
  • Time-to-market
  • Hardware and software as parallel development tracks

SLIDE 55

3GPP (2011). 3GPP - specification numbering. http://www.3gpp.org/specification-numbering.

Agilent (2009). 3GPP Long Term Evolution: System overview, product development, and test challenges. http://cp.literature.agilent.com/litweb/pdf/5989-

Anjum, O., Ahonen, T., Garzia, F., Nurmi, J., Brunelli, C., and Berg, H. (2011). State of the art baseband DSP platforms for Software Defined Radio: A survey. EURASIP Journal on Wireless Communications and Networking.

Computer Engineering (2011). Master thesis proposals. http://www.da.isy.liu.se/undergrad/exjobb/open/en

Coresonic (2011). Coresonic - company website. http://www.coresonic.se/.

Corleto, J. (2009). Virtual platforms: Enablement challenges. http://www.ovpworld.org/view.php?doc=qualcomm_cor

Doulos (2011). SystemC TLM 2.0. http://www.doulos.com/knowhow/systemc/tlm2.

ePUMA (2011). ePUMA embedded parallel DSP platform with unique memory access. http://www.da.isy.liu.se/research/current/ePUMA/.

Gustafsson, K. (2011). The smartphone disruption. In ELLIIT Workshop, Lund. http://www.control.lth.se/media/ELLIIT/Workshop20

ST-Ericsson (2011). THOR M7400 LTE and HSPA+. http://stericsson.com/products/m7400-thor.jsp.

ST-Ericsson (2009). Low-power Embedded Vector DSP. http://www.stericsson.com/sales_marketing_resourc

Wikipedia (2011a). Constellation diagram. http://en.wikipedia.org/wiki/Constellation_diagra

Wikipedia (2011b). Fourier transform. http://en.wikipedia.org/wiki/Fourier_transform.

Wikipedia (2011c). OFDM. http://en.wikipedia.org/wiki/Orthogonal_frequency

Wikipedia (2011d). Turbo code. http://en.wikipedia.org/wiki/Turbo_code.