SLIDE 1

ACCESS IC LAB

Graduate Institute of Electronics Engineering, NTU

DSP Introduction

Instructor: Prof. An-Yeu Wu 2004/September

SLIDE 2

Outline

  • Digital Signal Processing Overview
  • Applications
  • Market Observation
  • DSP Processor Introduction
  • Fundamentals of Digital Signal Processor
  • Recent DSP-Related Topics

SLIDE 3

Digital Signal Processing Overview

SLIDE 4

What is Signal Processing?

SLIDE 5

Signals

SLIDE 6

Signals

Signal is: (Webster's Dictionary)

  1 : SIGN, INDICATION
  2a : an act, event, or watchword that has been agreed on as the occasion of concerted action
  2b : something that incites to action

A signal can be characterized in several ways:

  • Continuous time or discrete time
  • Continuous valued or discrete valued
  • 1-D or 2-D signals (different dimensions)
  • Real valued or complex valued
  • Scalar or vector
  • Deterministic or random

SLIDE 7

Characterize Signals

  • Continuous time & continuous valued: Analog signal
  • Discrete time & continuous valued: Sampled signal
  • Continuous time & discrete valued: Quantized signal
  • Discrete time & discrete valued: Digital signal

SLIDE 8

Characterize Signals

Different dimensional signals:

Speech vs. Image vs. Video

Real value & Complex value signals:

Residential electrical power vs. Industrial reactive power

Scalar & Vector signals:

Sea Surface Temperature vs. North Atlantic Current

Deterministic & Random signals:

Speech vs. Background noise

SLIDE 9

Processing

Process is: (Webster’s Dictionary)

  2b (1) : to subject to or handle through an established usually routine set of procedures
     (2) : to subject to examination or analysis

Processing is application-oriented:

  • Communication: Modulation, Demodulation
  • Signal enhancement: Filtering, Equalization…
  • Spectral analysis: Transform…
  • Image processing: Reconstruction, Watermarking...
  • Data compression: Transform, Quantization…
  • Security: Encryption, Decryption

SLIDE 10

Real World Signal Processing

Real-world signals:

  • Most signals are analog and continuous, e.g., sound, vision, pressure, radiation...

Traditional processing of real-world signals:

  • Modeling
      – Higher complexity: nonlinear, time-variant systems
  • Environment-sensitive
      – Temperature, pressure, gravity…

SLIDE 11

What is Digital Signal Processing?

Digital Signal Processing:

Digital signal processing is the processing of real-world signals (represented in discrete, quantized form, or naturally digital) using mathematical techniques or algorithmic manipulation to perform transformations or extract information.

SLIDE 12

Digital Signal Processing

  • Signals in a DSP system are sequences of quantized samples (discrete in both time and value).
  • Signals are obtained from physical signals via transducers (e.g., microphones) and then become electrical signals (e.g., voltage).
  • Electrical signals are converted to digital signals by the sampling and quantizing performed in analog-to-digital converters (ADCs).
  • Digital signals may be recorded, or converted into analog signals (e.g., voltage) through digital-to-analog converters (DACs).
  • Transducers (e.g., speakers) convert electrical signals back into physical signals.

SLIDE 13

Sample and Quantize

Sampling interval: T

Quantize: $Q(y) = \varepsilon \cdot \left\lfloor \frac{y}{\varepsilon} \right\rfloor$
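
A minimal sketch of the two steps on this slide, assuming a floor quantizer with step ε as in the formula. The names `sample` and `quantize` and the 1 kHz test tone are illustrative choices, not taken from the slides.

```python
import math

def sample(signal, T, count):
    """Evaluate a continuous-time function at t = 0, T, 2T, ..."""
    return [signal(n * T) for n in range(count)]

def quantize(y, eps):
    """Floor quantizer Q(y) = eps * floor(y / eps)."""
    return eps * math.floor(y / eps)

# Hypothetical input: a 1 kHz tone sampled at 8 kHz, quantized with step 0.25.
tone = lambda t: math.sin(2 * math.pi * 1000.0 * t)
samples = sample(tone, T=1.0 / 8000.0, count=8)
digital = [quantize(y, eps=0.25) for y in samples]
```

Each entry of `digital` is a multiple of ε, so the sequence is discrete in both time and value.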

SLIDE 14

Example

Communication system example: Cellular phone

SLIDE 15

Why Digital Signal Processing?

“Exactness”

  • Perfect reproduction without error, and perfect duplication of processing results.
  • Accuracy of digital signal representations can be controlled by changing the word length of the signal.

“Robustness”

  • Digital signals can be stored and recovered, transmitted and received, processed and manipulated, all virtually without error.

“Convenience”

  • Complicated or sophisticated DSP techniques can be easily applied to the target signal.
  • Faster system design and verification in every development cycle.

SLIDE 16

Applications

SLIDE 17

Common DSP Algorithms & Applications

Applications

  – Instrumentation and measurement
  – Communications
  – Audio and video processing
  – Graphics, image enhancement, rendering
  – Navigation, radar, GPS
  – Control - robotics, machine vision, guidance

Algorithms

  – Frequency domain filtering – FIR, IIR
  – Frequency-time transformations – FFT, DCT
  – Correlation

SLIDE 18

Image Compression

[Figure: JPEG encoding and JPEG decoding. Labels: spatial domain, frequency domain, quantize/dequantize.]

SLIDE 19

Voice Recognition

SLIDE 20

Audio Application

SLIDE 21

Market Observation

SLIDE 22

Semiconductor Market

  • Single processors (MPUs) and DRAMs used to drive the semiconductor industry because of the personal computing market. DSP has now become a major technology driver.
  • Increasing need for digital signal processing in embedded systems, e.g., communication and multimedia applications.

SLIDE 23

DSP Market

$2 billion market (1996), 30% growth rate

SLIDE 24

Wireless Trend

SLIDE 25

Example

  • Cellular phone penetration in Taiwan reached 110% in 2004.
  • Penetration is growing remarkably fast in China and Russia (millions of mobile phones per month).
  • The cellular phone is a product whose generations are retired and replaced quickly.

SLIDE 26

Today's DSP Market Split

SLIDE 27

DSP Processor Introduction

SLIDE 28

Review: Processor Classes

General purpose - high performance

  – Pentiums, Alphas, SPARC
  – Used for general purpose software
  – Heavyweight OS - UNIX, NT
  – Workstations, PCs

Embedded processors and processor cores

  – ARM, 486SX, Hitachi SH7000, NEC V800
  – Single program
  – Lightweight OS – eCos, uLinux, …
  – Need DSP processor support in applications of this kind
  – Cellular phones, consumer electronics (e.g., CD players)

Microcontrollers

  – Extremely cost sensitive
  – Single program; an OS is usually unnecessary
  – Small word size - 8 bit common
  – Automobiles, toasters, thermostats, ...

[Slide annotation: arrows labeled "Increasing Cost" and "Increasing Volume"]

SLIDE 29

Comparison

SLIDE 30

Realization

Digital signal processing algorithms can be realized through the following technologies:

  • Digital Signal Processor (DSP)
      – ADI Blackfin processor, TI TMS320CX processor…
  • General Purpose Microprocessor
      – Pentium CPU, ARM
  • Application Specific Integrated Circuit (ASIC)
      – FFT processor, Equalizer
  • Field-Programmable Gate Array (FPGA)

SLIDE 31

Digital Signal Processor

A digital signal processor (DSP) is a type of microprocessor.

  • Processes data in real time.
  • The real-time capability makes a DSP well suited to applications that cannot tolerate delays.
  • An essentially infinite stream of data needs to be processed.
  • Large amount of I/O with analog interfaces
      – ADC, DAC

SLIDE 32

DSP features

  • Single-cycle multiply-accumulate operations
  • Real-time performance, simulation, and emulation
  • Flexibility
  • Reliability
  • Reduced system cost
  • Reduced development cycle
  • Easy to modify the DSP algorithm or update the system by software reprogramming

SLIDE 33

Comparison

The FPGA Alternative:

  • Field-programmable gate arrays can be reconfigured within a system.
  • Fast prototyping and development.
  • Offer greater raw performance per specific operation because of the resulting dedicated logic circuit.
  • FPGAs are significantly more expensive and typically have much higher power dissipation than DSPs with similar functionality.
  • When FPGAs are the chosen performance technology in a design, DSPs are typically used in conjunction with them to provide greater flexibility, better price/performance ratios, and lower system power.

SLIDE 34

Comparison

The ASIC Alternative

  • Application-specific ICs provide extreme efficiency in both power consumption and processing power.
  • Unlike an FPGA or a programmable DSP, the functionality of an ASIC cannot be iteratively changed or updated during product development.
  • Usually chosen for extremely performance-sensitive cases.

SLIDE 35

Comparison

The General Purpose Processor (GPP) Alternative

  • In contrast to ASICs, which are optimized for specific functions, general-purpose microprocessors (GPPs) are best suited to performing a broad array of tasks.
  • High-performance GPPs, such as the CPU in a desktop PC, are usually too expensive for many DSP applications.
  • Low-cost GPPs are ruled out of DSP applications by their comparatively poor real-time performance and high power consumption.
  • In many systems, such as cell phones, GPPs now usually play the role of system controller rather than algorithm-computation unit.

SLIDE 36

Implementation Choices

Power for one tap computation

SLIDE 37

Fixed-Point & Floating-Point DSPs

Programmable DSPs come in two flavors, fixed point and floating point.

Floating-point DSP:

  • Expensive, longer instruction cycle
  • Large signal dynamic range
  • Adopted in very precision-sensitive cases
      – Communication infrastructure
      – Medical imaging systems
      – Military weapons

Fixed-point DSP:

  • Cheaper, shorter instruction cycle
  • Less signal dynamic range: constrained by word length
  • Possibility of overflow
  • Applications: multimedia

SLIDE 38

Market Separation: ADI Example

SLIDE 39

Fundamentals of DSP Processor

SLIDE 40

Von Neumann Machine

  • Single memory space for program and data
  • Shared global bus

SLIDE 41

Motivation: FIR Filtering

  • M most recent samples in the delay line: x(i)
  • A new sample moves the data down the delay line
  • A “tap” is a multiply-add
  • Each tap (M+1 taps total) nominally requires:
      – Two data fetches
      – Multiply
      – Accumulate
      – Memory write-back to update the delay line
  • Goal: 1 FIR tap / DSP instruction cycle

SLIDE 42

FIR Implementation

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

On a von Neumann machine, the expressions are executed row by row.
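
As a reference point for the processor-specific versions that follow, here is a plain sketch of the FIR sum with an explicit delay line. The function name `fir_filter` is an illustrative choice; samples before the start of the input are assumed to be zero.

```python
def fir_filter(c, x):
    """Direct-form FIR: y(n) = sum over i of c(i) * x(n - i)."""
    delay = [0.0] * len(c)              # delay[i] holds x(n - i)
    y = []
    for new_sample in x:
        # The new sample moves the data down the delay line.
        delay = [new_sample] + delay[:-1]
        acc = 0.0
        for ci, xi in zip(c, delay):    # one multiply-accumulate per tap
            acc += ci * xi
        y.append(acc)
    return y
```

Feeding a unit impulse returns the coefficients themselves, which is a quick way to check an implementation.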

SLIDE 43

FIR on Von Neumann Machine

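  • Bus/memory bandwidth is the bottleneck
  • Control code overhead
  • 11 instructions per tap

    loop: lw  x0, 0(r0)     ; fetch first operand
          lw  y0, 0(r1)     ; fetch second operand
          mul a, x0, y0     ; multiply
          add y0, a, b      ; add
          sw  y0, (r2)      ; write back to memory
          inc r0            ; advance the three pointers
          inc r1
          inc r2
          dec ctr           ; loop control
          tst ctr
          jnz loop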

SLIDE 44

FIR on Modified Von Neumann Machine

  • Assume the von Neumann machine has:
      – A multiply-and-accumulate (MAC) instruction
      – Pipelining, so that MAC and read/write instructions execute in parallel
  • Each FIR tap then still needs 4 cycles:
      1. Read the MAC instruction
      2. Read data value x from memory
      3. Read coefficient c from memory
      4. Write the data value to the next location in the delay line
  • MAC time is one of the most basic statistics for comparing the performance of programmable DSPs.

SLIDE 45

Basic Harvard Architecture

  • Separate program and data memory spaces
  • Usually refers to separate program and data buses

SLIDE 46

Example

First-generation DSP: Texas Instruments TMS320C10 (1982):

  • Harvard architecture
  • 16-bit fixed-point
  • Accumulator-based
  • 390 ns MAC time (160 ns today)
  • Load-accumulate instruction

[Figure: datapath. Labels: Mem, T-Register, Multiplier, P-Register, ALU, Accumulator.]

SLIDE 47

FIR on TMS320C10

X4 and H4 are direct (absolute) memory addresses:

    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2
    MPY H2
    ...

About 2 instructions per tap, but requires unrolling.
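
To make the register traffic in this listing concrete, the sketch below models the T register, P register, and accumulator in Python. It is a toy model under stated assumptions, not a cycle-accurate simulator: the class `C10Model`, the base addresses, and the final `apac` step (adding the last product, which the slide's listing elides with "...") are illustrative.

```python
class C10Model:
    """Toy model of the T, P and accumulator registers."""

    def __init__(self, mem):
        self.mem = mem        # data memory: address -> value
        self.t = 0
        self.p = 0
        self.acc = 0

    def lt(self, addr):       # LT: load T
        self.t = self.mem[addr]

    def mpy(self, addr):      # MPY: P = T * operand
        self.p = self.t * self.mem[addr]

    def ltd(self, addr):      # LTD: load T, shift delay line, Acc += P
        self.t = self.mem[addr]
        self.mem[addr + 1] = self.mem[addr]
        self.acc += self.p

    def apac(self):           # assumed final step: add the last product
        self.acc += self.p


def fir_5tap(x, h):
    """x[i] holds x(n-i); h[i] is the coefficient for that tap."""
    X, H = 0, 100             # hypothetical base addresses
    mem = {X + i: x[i] for i in range(5)}
    mem.update({H + i: h[i] for i in range(5)})
    cpu = C10Model(mem)
    cpu.lt(X + 4)
    cpu.mpy(H + 4)
    for i in (3, 2, 1, 0):    # the slide unrolls this loop by hand
        cpu.ltd(X + i)
        cpu.mpy(H + i)
    cpu.apac()
    return cpu.acc, [mem[X + i] for i in range(5)]
```

After one pass the accumulator holds the filter output and the delay line has already been shifted by the LTD instructions, ready for the next input sample.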

SLIDE 48

Modified Harvard Architecture

The program bus can be used to load coefficients for the MAC.

SLIDE 49

Example

  • The modified Harvard architecture was applied in the TMS320C25 in 1987
  • Simultaneous fetch of an instruction and 2 operands
  • Single-cycle MAC
  • Simultaneous ALU operation and multiplier operation
  • 100 ns instruction cycle time

SLIDE 50

FIR on TMS320C25

MACD = Multiply by Program MEM and Accumulate with delay

SLIDE 51

More about Harvard Architectures

The Harvard architecture has many modified versions:

  • Basic: separate program and data spaces
  • Mod. 1: program space contains read-only data
  • Mod. 2: use multi-port memory for the data space
  • Mod. 3: add a program cache to enhance the throughput of a shared program/data memory block
  • Mod. 4: use 2 separate data memories for simultaneous instruction/operand fetch
  • Mod. 5: use 4 separate data memories and add an I/O-specific memory

The programmer can ignore the Harvard architecture until it becomes necessary to optimize the code. When optimizing, the programmer must carefully arrange data in memory to take advantage of the multiple memories.

SLIDE 52

Block Diagram

SLIDE 53

Features of Most DSP Processors

  • Data path configured for DSP
  • Specialized instruction set
  • Multiple memory banks and buses
  • Specialized addressing modes
  • Specialized execution control
  • Specialized I/O for peripherals

SLIDE 54

Data Path

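DSPs deal with numbers representing the real world

  • Real numbers, fractions, …

DSPs deal with numbers for addresses

  • Integers

Support fixed-point (fractional) numbers as well as integers:

  • Fractional: sign bit S with the radix point immediately after it: $-1 \le x < 1$
  • Integer: sign bit S with the radix point after the last bit: $-2^{N-1} \le x < 2^{N-1}$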
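
A small sketch of the fractional format, assuming a 16-bit word with the radix point right after the sign bit (commonly called Q15; that name and the helper functions below are not from the slides).

```python
Q15_ONE = 1 << 15   # scale factor: 15 fractional bits

def to_q15(x):
    """Convert a real number in [-1, 1) to a 16-bit fractional integer."""
    if not -1.0 <= x < 1.0:
        raise ValueError("value outside the fractional range [-1, 1)")
    return min(int(round(x * Q15_ONE)), Q15_ONE - 1)   # clamp to 0x7FFF

def from_q15(q):
    """Convert a Q15 integer back to a real number."""
    return q / Q15_ONE

def q15_mul(a, b):
    """Multiply two Q15 values: the product has 30 fractional bits,
    so shift right by 15 to return to Q15."""
    return (a * b) >> 15
```

The same 16-bit pattern read as an integer covers -32768 to 32767; read as a fraction it covers -1 to just under 1. The one product that does not fit is (-1) × (-1), which is why fixed-point data paths need overflow handling.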

SLIDE 55

Fixed Point Arithmetic

Precision must be carefully conserved.

  • Precision is lost through quantization errors arising from A/D and D/A conversion and from multiplication.

Signal-to-quantization-noise ratio increases linearly with signal level.

  • Each additional bit yields about 6 dB improvement in SNR
  • Dynamic range can be extended by using more bits to represent numbers.

Overflow must be prevented.
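
The 6 dB per bit figure can be checked numerically. The sketch below quantizes a near-full-scale sine to a given word length and measures the ratio of signal power to quantization-noise power. The test tone, the sample count, and the function name `sqnr_db` are arbitrary choices made for illustration.

```python
import math

def sqnr_db(bits, count=20000):
    """Measured SQNR (dB) of a sine quantized to `bits` bits over [-1, 1)."""
    step = 2.0 / (1 << bits)                      # quantization step size
    signal_power = 0.0
    noise_power = 0.0
    for k in range(count):
        x = 0.99 * math.sin(0.1234567 * k)        # arbitrary test tone
        q = step * math.floor(x / step + 0.5)     # round to nearest level
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10.0 * math.log10(signal_power / noise_power)

gain_per_bit = sqnr_db(9) - sqnr_db(8)            # expected to be near 6 dB
```

This agrees with the usual approximation for a full-scale sinusoid, SQNR ≈ 6.02 N + 1.76 dB for an N-bit word.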

SLIDE 56

Overflow in Fixed Point DSP

DSPs are descended from analog systems: what should happen to the output when an input is “pegged”?

  • Modulo arithmetic

Overflow detection and prevention:

  • Saturation arithmetic:
      – Set to the most positive ($2^{N-1}-1$) or most negative ($-2^{N-1}$) value when overflow is detected.
  • Shifting the product:
      – Arithmetic shift right (shift down), with sign extension
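
The difference between modulo and saturation arithmetic is easy to see on a 16-bit sum. This is a minimal sketch with N = 16; the helper names `wrap16` and `sat16` are illustrative.

```python
def wrap16(value):
    """Modulo arithmetic: keep the low 16 bits as a two's-complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def sat16(value):
    """Saturation arithmetic: clamp to [-2**15, 2**15 - 1]."""
    return max(-0x8000, min(0x7FFF, value))

total = 30000 + 10000          # does not fit in 16 bits
wrapped = wrap16(total)        # flips sign: a large positive sum turns negative
saturated = sat16(total)       # pegs at the most positive value
```

Wraparound turns a slightly too large positive result into a large negative one, while saturation behaves like an analog stage that clips.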

SLIDE 57

DSP Data Path: Multiplier

  • Specialized hardware performs all key arithmetic operations in 1 cycle
  • 50% of instructions can involve the multiplier => single-cycle-latency multiplier
  • Designed to perform multiply-accumulate (MAC) in the processing core
  • n-bit multiplier => 2n-bit product
  • Often combined with a shifter to prevent overflow

SLIDE 58

DSP Data Path: Accumulator

  • Don't want overflow, or to have to scale the accumulator
  • Option 1: accumulator wider than product: guard bits
      – Motorola DSP: 24b x 24b => 48b product, 56b accumulator
  • Option 2: shift right and round the product before the adder

[Figure: two datapaths. Option 1: Multiplier, ALU, Accumulator with guard bits (G). Option 2: Multiplier, Shift, ALU, Accumulator.]

SLIDE 59

DSP Data Path: Rounding

Even with guard bits, rounding is needed when storing the accumulator into memory. 3 standard DSP options:

  • Truncation: chop results => biases results (downward for two's-complement values)
  • Round to nearest: < 1/2 round down, ≥ 1/2 round up (more positive) => smaller bias
  • Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
      – IEEE 754 calls this round to nearest even
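
The three options can be sketched on an integer accumulator whose low `shift` bits are discarded on store. The function names are illustrative, and `shift` is assumed to be at least 1.

```python
def truncate(acc, shift):
    """Chop the low bits; for two's complement this rounds toward -infinity."""
    return acc >> shift

def round_nearest(acc, shift):
    """Add half an lsb, then chop; exact ties go up (more positive)."""
    return (acc + (1 << (shift - 1))) >> shift

def round_convergent(acc, shift):
    """Like round_nearest, but exact ties go to the even result."""
    half = 1 << (shift - 1)
    result = (acc + half) >> shift
    if (acc & ((1 << shift) - 1)) == half:   # discarded bits are exactly 1/2
        result &= ~1                         # force the lsb to zero
    return result
```

With `shift=2`, the accumulator value 10 represents 2.5: truncation gives 2, round to nearest gives 3, and convergent rounding gives 2. The value 6 (1.5) rounds to 2 under both rounding modes.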

SLIDE 60

DSP Memory

  • In contrast with RISC microprocessors, DSP processors usually contain internal memories rather than caches.
  • Multi-ported memories and multiple independent memory banks are difficult and expensive to implement off chip.
      – The pin count of a DSP processor increases dramatically if multiple memory banks are implemented off chip.
      – More expensive package
      – Multi-port physical memory is also much more expensive.
  • DSP processors mix the strategies.
      – Adopt 1~2 additional external buses for off-chip memory space
      – Multiple internal memory banks with a software-controlled paging system and an external page pool.

SLIDE 61

Example

Motorola DSP56001

  • 32-bit word and three memory banks, each with a 32-bit address (64 pins for one memory space)
  • 192 pins would be required to implement 3 parallel memory banks entirely off chip.
  • A multiplexed bus (64 pins) is used on the 56001

Motorola DSP96002

  • Same processor core as the DSP56001
  • Brings 1 additional bus off chip, 64 more pins
      – It can access 2 memory spaces simultaneously.
  • The DSP96002 is a 200-pin version of the DSP56001

SLIDE 62

DSP Memory (cont.)

  • An FIR tap implies multiple memory accesses
  • Most DSP processors have multiple data ports
  • Some DSPs use ad hoc techniques to reduce memory bandwidth demand
      – Instruction repeat buffer: do 1 instruction 256 times
      – Often use maskable interrupts, thereby increasing interrupt response time
  • Some recent DSPs have instruction caches
      – Even then, may allow the programmer to “lock” instructions into the cache
      – Option to turn the cache into fast program memory or data memory

SLIDE 63

Memory Hierarchy

[Figure: memory hierarchies of a RISC processor and a DSP. Labels: Registers, Out-of-order, I/D Cache, TLB, Physical memory; Registers, I Cache, DMA Controller, Internal memories, External memories.]

TLB: Translation Lookaside Buffer
DMA: Direct Memory Access

SLIDE 64

Memory Addressing

  • Have the standard addressing modes: immediate, displacement, and register indirect.
  • Goal: keep the MAC data path as busy as possible
  • Assumption: any extra instructions for each tap imply more clock cycles of overhead in the inner loop
      – Complex addressing is a better choice
      – Avoid using the data path to calculate addresses
  • Auto-increment / auto-decrement register for indirect addressing:
      – lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
      – Option to do it before addressing, positive or negative

SLIDE 65

Addressing (cont.)

DSPs deal with continuous I/O streams

  • I/O buffer for data on delay lines
  • Use a circular buffer to save memory.
  • Use the modulo/circular addressing mode for the circular buffer

Also used in sliding-window algorithms:

  • Convolution
  • Correlation
  • FIR filters
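
A software sketch of what modulo addressing buys: instead of copying every sample down the delay line, only a pointer moves and wraps. The class `CircularDelayLine` and the function `fir_step` are illustrative names; on a DSP the wrap is done by the address generator at no instruction cost.

```python
class CircularDelayLine:
    """Fixed-size delay line addressed modulo its length."""

    def __init__(self, length):
        self.buf = [0.0] * length
        self.head = 0                      # index of the newest sample

    def push(self, new_sample):
        # The pointer wraps around; the stored data never moves.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = new_sample

    def tap(self, i):
        """Return x(n - i), the sample pushed i steps ago."""
        return self.buf[(self.head + i) % len(self.buf)]


def fir_step(line, coeffs, new_sample):
    """Consume one input sample and produce one FIR output sample."""
    line.push(new_sample)
    return sum(c * line.tap(i) for i, c in enumerate(coeffs))
```

Each `push` overwrites the oldest sample, so the buffer needs only as many words as the filter has taps.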

SLIDE 66

Addressing for FFT

FFTs start or end with data in bit-reversed order:

    0 (000) => 0 (000)
    1 (001) => 4 (100)
    2 (010) => 2 (010)
    3 (011) => 6 (110)
    4 (100) => 1 (001)
    5 (101) => 5 (101)
    6 (110) => 3 (011)
    7 (111) => 7 (111)

  • Avoid the overhead of address-checking instructions or bit-reversing operations for the FFT.
  • Have an optional bit-reverse addressing mode for use with auto-increment addressing.
  • Many DSPs have bit-reverse addressing for the radix-2 FFT
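
For comparison, this is the work a processor without bit-reverse addressing has to do in software for every index. The function name `bit_reverse` is an illustrative choice.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lsb of index into result
        index >>= 1
    return result

order = [bit_reverse(i, 3) for i in range(8)]  # mapping for an 8-point FFT
```

The resulting `order` reproduces the right-hand column of the table above.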

SLIDE 67

DSP Instructions

  • May specify multiple operations in a single instruction
  • Must support multiply-accumulate (MAC)
  • Support parallel moves of data in registers
  • Usually have special loop support to reduce branch overhead (such as pipeline stalls)
      – Loop an instruction or sequence by an iterator
      – No branch instruction is taken for looping
      – In many DSP processors, iterator = 0 usually means looping the maximum number of times (infinite looping).
  • Auto/manual arithmetic shift-left instructions for saturation
  • Conditional execution for branch reduction

SLIDE 68

Pipeline in DSP Processor

  • Pipelining effectively speeds up computation, but it can have a serious impact on programmability.
  • An instruction is fetched at the same time that operands for a previously fetched instruction are being fetched.
  • Three fundamental techniques are adopted:
      – Interlocking
      – Time-stationary coding
      – Data-stationary coding

SLIDE 69

Interlocking

  • If an instruction needs to access a shared resource that is still in use, the control hardware delays execution of the arithmetic operation.
  • Interlocking stalls the pipeline and therefore decreases performance.
  • The programmer is not aware of interlocking.
  • Some DSP manufacturers, such as TI, supply a simulator that gives the detailed timing of any sequence of instructions, including interlocking information, for code optimization.

SLIDE 70

Time-Stationary Coding

  • An instruction specifies the operations that occur simultaneously in one instruction cycle.
  • Parallelism rather than pipelining
  • e.g. AT&T DSP16:

        a0=a0+p p=x*y y=*r0++ x=*pt++

      – Simultaneously updates a0 (accumulate), p (multiply), y (operand through pointer dereference), and x (operand through pointer dereference)

Advantages:

  • The timing of a program is clear.
  • Very fast interrupts: the programmer has explicit control over the pipeline, and there is no need to flush the pipe before invoking the interrupt.
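
The sketch below mimics the DSP16 line in Python, using tuple assignment so that all four registers are updated from their old values in the same step, as in one instruction cycle. It is an illustration only: the function name, the zero-filled drain cycles, and the list-based operand streams are assumptions.

```python
def dot_time_stationary(xs, cs):
    """Dot product with the multiply and accumulate stages overlapped."""
    a0 = p = x = y = 0
    n = len(xs)
    # n cycles to load the operands plus 2 cycles to drain the pipeline.
    for k in range(n + 2):
        next_y = xs[k] if k < n else 0    # y = *r0++
        next_x = cs[k] if k < n else 0    # x = *pt++
        # One "instruction": every right-hand side reads the old values.
        a0, p, y, x = a0 + p, x * y, next_y, next_x
    return a0
```

In any given cycle the accumulate, the multiply, and the two loads belong to three different taps, which is why the programmer has to reason about the pipeline explicitly.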

SLIDE 71

Data-Stationary Coding

  • Instructions specify all of the operations performed on a set of operands from memory.
  • These instructions specify what happens to data, rather than what happens at a particular time in the hardware.
  • Operations specified by neighboring instructions proceed in parallel.
  • Data-stationary coding is no less efficient than time-stationary coding.
  • Fast interrupts are more difficult with data-stationary coding than with time-stationary coding.
  • e.g. AT&T DSP32:

        r5++ = a1 = a0+ *r7 * *r10 ++r17

      – Parallel write to the memory location specified by r5
      – Accumulate with value in memory
      – Dereference and multiply
      – Update pointer for operands

SLIDE 72

Branch in DSP

Several problems conspire to make efficient branching difficult.

  • If the program address space is large, the destination address may not fit in an instruction word. More fetches from instruction memory may be required. Alternatives are paging and PC-relative addressing.
  • In conditional branching, the fetch of the next instruction cannot occur before the condition codes in the ALU can be tested.

Solutions

  • Use delayed branches: fetch more instructions independent of the branch and execute them before the branch occurs; or place the data arithmetic instructions several cycles ahead of the test.
  • Design low-overhead looping instructions for tight inner loops, rather than using branch instructions for loops.
  • Design conditional instructions to substitute for conditional branches.

SLIDE 73

Recent DSP-Related Topics

SLIDE 74

VLIW Architectures for DSP

  • What is VLIW?
  • Superscalar vs. VLIW
  • Characteristics of VLIW Processor
  • Example of VLIW on DSP
  • Advantages and Disadvantages of VLIW

SLIDE 75

What is VLIW?

  • Abbreviation of Very Long Instruction Word
  • Until 1997, most DSP processors were similar:
      – Specialized execution units and instruction sets
      – Difficult to program in assembly
      – Unfriendly compiler targets
      – One instruction per instruction cycle, such as multiply-accumulate and store
  • VLIW architectures use a simple, regular instruction set and execute multiple instructions per cycle.
  • Strategy:
      – More parallelism creates higher performance
      – Better compiler targets

SLIDE 76

Superscalar vs. VLIW

SLIDE 77

Characteristics of VLIW Processor

  • Multiple independent instructions per cycle.
      – Packed into a single large instruction word / packet
      – Instructions may be positional or include routing information within each sub-instruction word
  • Independent execution units with complementary features
      – Each instruction packet may be dispatched to several execution units.
  • More regular, orthogonal, RISC-like instructions
  • Large, uniform register set
  • Wide program and data buses

SLIDE 78

Example: ADI TigerSHARC

  • It is a static superscalar DSP that can simultaneously execute from one to four 32-bit instructions encoded in a single instruction line.
  • Combines VLIW with SIMD (single instruction, multiple data)
      – The programmer has the option of directing both computation blocks to operate on the same data (broadcast distribution) or on different data (merged distribution).
      – Each computation block can execute four 16-bit or eight 8-bit SIMD computations in parallel.

SLIDE 79

Advantages of VLIW Architecture

  • Better performance
      – More regular execution units
      – More instructions executed in parallel than in traditional DSPs with time-stationary coding
  • Better compiler target:
      – The program sequence distinguishes independent and dependent instructions.
      – Dispatch is specified at compile time rather than in silicon
  • Potentially easier to program for DSP
  • Potentially scalable
      – More execution units can be added to the processor core, allowing more sub-instructions to be packed into one VLIW instruction

SLIDE 80

Disadvantages of VLIW Architecture

  • New type of programmer/compiler complexity:
      – The programmer or code-generation tool must keep track of instruction scheduling
      – Deep pipelines and long latencies can be confusing, and may make it hard to reach peak performance.
  • Increased memory complexity
      – Higher memory bandwidth is required
  • Higher power consumption
  • Confusing performance evaluation: MIPS/MFLOPS ratings are misleading.

SLIDE 81

DSP vs. General Purpose MPU

  • The “MIPS/MFLOPS” of DSPs is the speed of multiply-accumulate (MAC).
      – DSPs are judged by whether they can keep the multipliers busy 100% of the time.
  • The "SPEC" of DSPs is 4 algorithms:
      – Infinite impulse response (IIR) filters
      – Finite impulse response (FIR) filters
      – FFT
      – Convolution
  • Algorithm is everything for a DSP processor
  • Software compatibility is not a concern
      – Programmers often write in assembly language to minimize ROM requirements and optimize performance.

SLIDE 82

Summary

  • DSP system background knowledge and application overview
  • DSP market observation
  • DSP processor architecture and its evolution
  • Modern DSP processor architecture overview

Next time:

  • Processor-specific topics: Blackfin architecture

SLIDE 83

Reference

[1] http://www.webster.com/
[2] http://www.BDTI.com/
[3] Gregory K. Wallace, “The JPEG Still Picture Compression Standard”, Communications of the ACM, Volume 34, Issue 4 (April 1991), pp. 30-44, ISSN: 0001-0782, http://portal.acm.org/citation.cfm?id=103089&coll=portal&dl=ACM&CFID=26765382&CFTOKEN=77630149
[4] “TMS320C1X Digital Signal Processors Datasheet”, http://focus.ti.com/docs/prod/folders/print/tms320c10.html
[5] http://www.ee.ucla.edu/~schaum/ee201a_S02/
[6] “Quick Guide to Developing with ADI DSPs - DSP Selection”, http://www.analog.com/processors/resources/beginnersGuide/quickguide1.html
[7] Edward A. Lee, “Programmable DSP Architectures: Part I”, IEEE ASSP Magazine, pp. 4-19, October 1988
[8] Edward A. Lee, “Programmable DSP Architectures: Part II”, IEEE ASSP Magazine, pp. 4-19, January 1989