Nov. 12, 1997 Bob Brodersen (http://infopad.eecs.berkeley.edu)



slide-1
SLIDE 1

1

CS 152 Computer Architecture and Engineering Introduction to Architectures for Digital Signal Processing

  • Nov. 12, 1997

Bob Brodersen (http://infopad.eecs.berkeley.edu)

slide-2
SLIDE 2

2

Processor Applications

  • General Purpose - high performance

    – Pentiums, Alphas, SPARC
    – Used for general purpose software
    – Heavyweight OS - UNIX, NT
    – Workstations, PCs

  • Embedded processors and processor cores

    – ARM, 486SX, Hitachi SH7000, NEC V800
    – Single program
    – Lightweight, often realtime OS
    – DSP support
    – Cellular phones, consumer electronics (e.g. CD players)

  • Microcontrollers

    – Extremely cost sensitive
    – Small word size - 8 bit common
    – Highest volume processors by far
    – Automobiles, toasters, thermostats, ...

Increasing Cost Increasing volume

slide-3
SLIDE 3

3

The Processor Design Space

[Figure: design space with axes Cost and Performance. Regions: Microprocessors (performance is everything and software rules); Embedded processors; Microcontrollers (cost is everything); Application specific architectures for performance]

slide-4
SLIDE 4

World’s Cellular Subscribers

[Chart: cellular subscribers in millions (axis ticks 100 to 700) by year, 1993 to 2001, split into Digital and Analog]

Source: Ericsson Radio Systems, Inc.

Will provide a ubiquitous infrastructure for wireless data as well as voice

slide-5
SLIDE 5

5

Multimedia I/O Architecture

[Block diagram: a Low Power Bus connecting the Radio Modem, Embedded Processor, Fifo, Video Decomp, Video, Audio, FB, Graphics, Pen, Sched, ECC, Pact Interface and SRAM blocks, with the Data Flow marked]

slide-6
SLIDE 6

6

Embedded applications

  • Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O

[Figure: e.g. multimedia terminal electronics. Blocks: µP, DSP, Coms, Video Unit, custom, Memory. I/O: Uplink Radio, Downlink Radio, Graphics Out, Video I/O, Voice I/O, Pen In]

slide-7
SLIDE 7

7

Requirements of the Embedded Processors

  • Optimized for a single program - code often in on-chip ROM or off-chip EPROM
  • Minimum code size (one of the motivations initially for Java)

  • Performance obtained by optimizing datapath
  • Low cost

    – Lowest possible area
    – Technology behind the leading edge
    – High level of integration of peripherals (reduces system cost)

  • Fast time to market

    – Compatible architectures (e.g. ARM) allow reusable code
    – Customizable core

  • Low power if application requires portability
slide-8
SLIDE 8

8

Area of processor cores = Cost

[Chart annotations: Nintendo processor, Cellular phones]

slide-9
SLIDE 9

9

Another figure of merit: computation per unit area

[Chart annotations: Nintendo processor, Cellular phones, ???]

slide-10
SLIDE 10

10

National Semiconductor - Embedded Processor Family

  • Simple architecture
  • 3 stage pipeline - fetch - decode - execute
  • Minimum power and size

    – Short pipeline avoids branch prediction and bypass
    – Versions range from 8-64 bit - choose minimum that meets requirements

slide-11
SLIDE 11

11

Code size

  • If a majority of the chip is the program stored in ROM, then code size is a critical issue
  • The Piranha has three instruction sizes: a basic 2-byte instruction, and 2 bytes plus a 16- or 32-bit immediate

slide-12
SLIDE 12

12

Example application (single chip system)

slide-13
SLIDE 13

13

The DSP Module (DSPM)

  • Vector instructions directly supported
  • Pipelined datapath supports single cycle: Multiply, Add, Shift, Load/Store and Pointer adjustment
  • Operates in parallel to processor core
  • Saturation, overflow and rounding for ALU operations
  • Automatic support for cyclic buffers (modulo arithmetic)
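The saturation mentioned above can be illustrated with a minimal sketch. The function names and the 16-bit default word size are assumptions chosen for the example, not details of the DSPM: a saturating add clamps to the largest or smallest representable value, while an ordinary two's complement add wraps around.

```python
def saturating_add(a: int, b: int, bits: int = 16) -> int:
    """Add two signed integers and clamp the result to the signed range of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))


def wrapping_add(a: int, b: int, bits: int = 16) -> int:
    """Add two signed integers with two's complement wraparound, for comparison."""
    mask = (1 << bits) - 1
    s = (a + b) & mask
    # Reinterpret the low `bits` bits as a signed value.
    return s - (1 << bits) if s >= (1 << (bits - 1)) else s


print(saturating_add(30000, 10000))  # clamps at the 16-bit maximum
print(wrapping_add(30000, 10000))    # wraps to a large negative value
```

In a filter, wraparound turns a slightly too large sample into a full-scale error of the opposite sign, which is why saturation is usually preferred in signal paths.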

slide-14
SLIDE 14

14

The National DSP Module Architecture

[Block diagram annotations]

  • Single cycle MAC support is typical for DSP acceleration
  • Three simultaneous addresses
  • Zero overhead repeat
  • Labels: X, Y, Z

slide-15
SLIDE 15

15

The 486 “Embedded Processor” Look familiar???

slide-16
SLIDE 16

16

The “Embedded” Features of the 486 GX

  • Said to be designed “for embedded battery-operated and hand-held applications” (???)
  • Fully static design (clock can stop and all state is kept)
  • “Auto Clock Freeze” stops circuits which are not being used in a given instruction (gated clocks)
  • Stop Clock (60 µW), Stop Grant - clock runs but no program execution (40-85 mW)
  • Split power supply - 2.0-3.3 V core, 3.3 V I/O
slide-17
SLIDE 17

17

$\text{Power} = C V^2 f_{clock}$

[Chart of power; values shown: 130 mW, 350 mW, 430 mW, 290 mW, 190 mW, 540 mW, 490 mW, 730 mW, 17 mW, 23 mW, 30 mW, 20 mW]

Note the clock rates
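As a rough illustration of the relation Power = C V^2 f_clock, the sketch below compares the same circuit at two supply voltages. The capacitance and clock rate are made-up values for the example, not figures from the chart; only the ratio matters.

```python
def dynamic_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Switching power in watts: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz


# Hypothetical switched capacitance and clock rate.
C = 1e-9   # 1 nF
F = 25e6   # 25 MHz

p_high = dynamic_power(C, 3.3, F)
p_low = dynamic_power(C, 2.0, F)

print(f"3.3 V: {p_high * 1e3:.1f} mW")
print(f"2.0 V: {p_low * 1e3:.1f} mW")
print(f"ratio: {p_high / p_low:.2f}x")
```

Because voltage enters squared, lowering the supply from 3.3 V to 2.0 V cuts switching power by a factor of about 2.7 at the same clock rate, while halving the clock only halves it.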

slide-18
SLIDE 18

18

Characterizing programs for their energy consumption

Function                  Power
Process Subframe          330 µW
ComputeLag                107 µW
IFilterCodebook            63 µW
QuantizeGains              46 µW
CodebookSearch             44 µW
ComputeWeightedInput       22 µW
UpdateFilterState           8 µW
OrthogonalizeCodebook       6 µW
ThetaToCodeword             8 µW

ComputeLag(...) {
    R = dotprod(res, res);
    for (lag = 0..127) {
        lp = getLT(lt);
        G = dotprod(lp, lp);
    }
}

  • Top four functions account for 90% of the power
  • 65% of power dissipation in dot-vector products

(data obtained from profiling of C++-code, weighted with estimated instruction energy costs)
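The slide does not show the body of dotprod. A minimal sketch, assuming it is an ordinary dot product over two equal-length vectors, makes the multiply-accumulate (MAC) pattern that dominates the profile explicit:

```python
def dotprod(a, b):
    """Dot product written as an explicit multiply-accumulate loop."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # one multiply-accumulate per element pair
    return acc


print(dotprod([1, 2, 3], [4, 5, 6]))
```

Each loop iteration is one multiply and one add into a running sum, which is the operation a DSP datapath is built to issue once per cycle.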

slide-19
SLIDE 19

19

An architecture optimized for multiply-accumulate

Energy/Flexibility Tradeoffs

  • Arm 6 core (5 V, 20 MHz): 0.02 MIPS/mW
  • ZSP DSP Superscalar (3 V, 200 MHz): 0.3 MOPS/mW *
  • Reconfigurable Dot-Vector Processor (1.5 V, 30 MHz): 5.9 MIPS/mW

* MOPS = millions of operations/sec = millions of MACs/sec

[Block diagram: two AddressGen units, two Memory blocks, two MAC units and a Control Processor; labels L, C, G]

slide-20
SLIDE 20

20

DSP Application - equalization

  • The audio data streams from the source (computer) through the digital analysis and synthesis
  • Hard realtime requirement - the processing must be done at the sample rate

slide-21
SLIDE 21

21

Common DSP algorithms and applications

  • Applications

    – Instrumentation and measurement
    – Communications
    – Audio and video processing
    – Graphics, image enhancement, 3-D rendering
    – Navigation, radar, GPS
    – Control - robotics, machine vision, guidance

  • Algorithms

    – Frequency domain filtering - FIR and IIR
    – Frequency-time transformations - FFT
    – Correlation

slide-22
SLIDE 22

22

Sampled data processing

This RC low pass filter takes this time waveform (signal), Vin(t), and turns it into this filtered version, Vout(t).

[Circuit diagram: RC low pass filter (R, C) with input Vin(t) and output Vout(t)]

This analog circuit is really just a solution of the differential equation, calculated using the physics of electric fields and currents:

$$RC \frac{dV_{out}}{dt} + V_{out}(t) = V_{in}(t)$$

To implement this digitally we need to convert this expression to discrete time. First we need to convert from a continuous time representation of the signal to discrete time sequences: Vout(t) => Y1 Y2 Y3 … Yn and Vin(t) => X1 X2 X3 … Xn

slide-23
SLIDE 23

23

Discrete time representation

The sampled version of Vin(t) is a sequence of numbers 6, 8, 4, 12, … This then provides the input to the digital signal processing algorithm.

[Figure: X1 X2 X3 … => Digital signal processor => Y1 Y2 Y3 …]

Now what is the processing that goes on to implement the filtering? Using a discrete approximation to the derivative we obtain the discrete time equivalent of the continuous time differential equation:

$$RC \left( \frac{Y_n - Y_{n-1}}{\Delta t} \right) + Y_{n-1} = X_{n-1}$$

where $\Delta t = t_{sample} = 1/f_{sample}$.

slide-24
SLIDE 24

24

A computational structure

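This can be rewritten as:

$$Y_n = \left( 1 - \frac{\Delta t}{RC} \right) Y_{n-1} + \left( \frac{\Delta t}{RC} \right) X_{n-1} = \alpha Y_{n-1} + \beta X_{n-1}$$

Since the new sample is only a function of past samples, it can be computed using the following procedure:

[Block diagram: the input sequence Xn is multiplied by β, the output Yn is passed through a Delay and multiplied by α to give αYn-1, and an adder (Σ) sums the two products to produce Yn]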
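The recurrence Y_n = αY_{n-1} + βX_{n-1}, with α = 1 − Δt/RC and β = Δt/RC, can be sketched in a few lines. The function name, the zero initial condition and the component values below are assumptions chosen for the example.

```python
def rc_lowpass(x, rc, dt, y0=0.0):
    """First-order recursive filter: y[n] = alpha*y[n-1] + beta*x[n-1]."""
    alpha = 1.0 - dt / rc
    beta = dt / rc
    y = [y0]
    for n in range(1, len(x)):
        # Two multiplies and one add per output sample.
        y.append(alpha * y[n - 1] + beta * x[n - 1])
    return y


# Step input: the output should rise toward 1 like a charging capacitor.
step = [1.0] * 200
out = rc_lowpass(step, rc=1.0, dt=0.1)
print([round(v, 3) for v in out[:4]])
print(round(out[-1], 6))
```

With Δt/RC = 0.1 the step response follows 1 − 0.9^n, the sampled counterpart of the exponential charging curve of the analog circuit. The approximation is only reasonable when Δt is small compared with RC.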

slide-25
SLIDE 25

25

Direct mapping architecture

  • These calculations need to be finished within every sample period, since Yn depends on Yn-1 and new data is continuously coming => hard real time requirement
  • In each sample period there are two multiplies and one accumulate.
  • We could directly map this structure into hardware; the delay then becomes a pipeline register and we would need two multipliers and an adder. This is the most direct approach: almost no control, but also no flexibility

[Block diagram: the first-order filter structure, with the input Xn multiplied by β, the delayed output multiplied by α to give αYn-1, and an adder (Σ) producing Yn]

slide-26
SLIDE 26

26

Filter structures

slide-27
SLIDE 27

27

  • The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle
  • This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback

[Figure: Mapping of the filter onto a DSP execution unit. The operations of the filter structure (Xn, β, α, delay D, Σ, Yn, αYn-1) are numbered 1 to 6 and matched to the same numbers on the execution unit]

slide-28
SLIDE 28

28

IIR and FIR filters

  • Infinite Impulse Response (IIR) filter - has a feedback loop and the response to an impulse goes on forever
  • The impulse response completely characterizes the filter response, so a more direct (purely digital) approach is the finite impulse response filter or FIR.

[Figure: the recursive filter structure (β, α, delay D, Σ, Yn, αYn-1), and an impulse input 1 0 0 0 producing the impulse response h1 h2 h3 h4 h5]
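The difference between the two filter types shows up directly in their impulse responses. The sketch below is an illustration with made-up coefficients. For simplicity the recursive filter here uses the current input sample, y[n] = αy[n-1] + βx[n], which is an assumption of this example.

```python
def fir(x, h):
    """FIR filter: y[n] = sum over k of h[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]  # one MAC per tap
        y.append(acc)
    return y


def iir_first_order(x, alpha, beta):
    """Recursive filter with feedback: y[n] = alpha*y[n-1] + beta*x[n]."""
    y, prev = [], 0.0
    for xn in x:
        prev = alpha * prev + beta * xn
        y.append(prev)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
taps = [5.0, 4.0, 3.0, 2.0, 1.0]          # hypothetical h1..h5

fir_response = fir(impulse, taps)
iir_response = iir_first_order(impulse, alpha=0.5, beta=1.0)
print(fir_response)  # the taps themselves, then zero
print(iir_response)  # halves each step and never reaches zero exactly
```

The FIR output simply replays its coefficients and then stops, so the coefficients are the impulse response. The feedback path of the IIR filter keeps the response decaying indefinitely.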

slide-29
SLIDE 29

29

FIR filter frequency response

  • FIR filters are a very general structure and form the base of much more sophisticated processing, e.g. adaptive filters which make possible 56 kbit modems

[Plots: FIR filter frequency response with 15 stages and with 128 stages]

slide-30
SLIDE 30

30

Transformations result in different critical paths for direct map architectures

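[Figure: two direct-map structures for the same 5-tap FIR filter (input X, output Y, coefficients h1 to h5, four delays D, MAC computations).

  • Delays on the input samples, products summed through a chain of adders (Σ): critical path = 4 adders + multiply
  • Transformed structure with the delays placed between the adders: critical path = 1 adder + multiply]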
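The two structures compute the same output. A small software model, written here as an illustration with assumed function names and at least two taps, shows the difference in where the delays sit: the first keeps delayed inputs and sums all products in one chain, the second keeps partial sums in registers so each output needs only one multiply and one add after the register.

```python
def fir_direct(x, h):
    """Delays on the input; all products are summed through an adder chain."""
    delay = [0.0] * (len(h) - 1)          # delayed input samples
    out = []
    for xn in x:
        taps = [xn] + delay
        acc = 0.0
        for hk, tk in zip(h, taps):
            acc += hk * tk                # chain of adders
        out.append(acc)
        delay = taps[:-1]                 # shift the delay line
    return out


def fir_transposed(x, h):
    """Delays between the adders; registers hold partial sums (needs len(h) >= 2)."""
    state = [0.0] * (len(h) - 1)
    out = []
    for xn in x:
        out.append(h[0] * xn + state[0])  # one multiply and one add to the output
        new_state = []
        for k in range(1, len(h)):
            carry = state[k] if k < len(h) - 1 else 0.0
            new_state.append(h[k] * xn + carry)
        state = new_state
    return out


h = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [3.0, -1.0, 2.0, 0.0, 5.0, 1.0, 0.0, 0.0]
print(fir_direct(x, h))
print(fir_transposed(x, h))
```

In software the two loops cost the same. The distinction matters in a direct hardware mapping, where the longest combinational path between registers sets the clock period.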

slide-31
SLIDE 31

31

Delay Lines

  • Shift register
    – Very inefficient in area and power
        • since shift register cells are much larger than RAM
        • ALL data must move every cycle
  • Delay using circular buffers - use of modulo arithmetic

[Figure: a shift register holding X5 X4 X3 X2 X1, compared with a 5-entry circular buffer (addresses 0 to 4) shown at time indices N = 4, 5, 6, 7 as X5, X6 and X7 are written over the oldest entries. Write address = (N modulo 5) - 1; Read address = N modulo 5; N = time index]
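A circular buffer delay line can be sketched as follows. The class name and the indexing convention (a single pointer to the oldest entry) are choices made for this example and are not meant to reproduce the exact read and write address formulas on the slide.

```python
class CircularDelayLine:
    """Fixed-length delay line: only one entry is written per sample."""

    def __init__(self, length):
        self.buf = [0.0] * length
        self.pos = 0  # index of the oldest sample, which is overwritten next

    def push(self, sample):
        """Store a new sample and return the one written `length` samples ago."""
        oldest = self.buf[self.pos]
        self.buf[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.buf)  # modulo arithmetic wraps the pointer
        return oldest


line = CircularDelayLine(5)
outputs = [line.push(v) for v in range(1, 9)]
print(outputs)   # five initial zeros, then the inputs delayed by five samples
print(line.buf)  # the three newest samples have overwritten the three oldest
```

Unlike a shift register, no stored sample ever moves. Only the pointer advances, which is why DSP address units provide the modulo update in hardware.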

slide-32
SLIDE 32

32

FFT support

  • “Flow diagram” of FFT algorithm - again based on multiply adds

[Flow diagram of the FFT, with nodes labelled 1 to 7 on each side. Example node: A x0 x4 => A * (x0+x4)]

Bit reversed addressing - what is the pattern?
000 000 001 010 010 100 011 110
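Bit reversed addressing means that the address bits of each index are read in the opposite order. A minimal sketch, with an assumed function name:

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit into the result
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
for i, r in enumerate(order):
    print(f"{i:03b} -> {r:03b}")
```

For an 8-point transform this gives the order 0, 4, 2, 6, 1, 5, 3, 7. A DSP address unit produces the same sequence by propagating the carry of its address increment in the reverse direction, so no extra instructions are spent on reordering.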

slide-33
SLIDE 33

33

Address calculation unit for DSP

  • Supports modulo and bit reversal arithmetic
  • Often duplicated to calculate multiple addresses per cycle

slide-34
SLIDE 34

34

Let's look at an application - Supporting the Road Warrior of 1999

GSM IS-54 IS-96 DCS1800 PCS-1900 DECT PHS PDC etc

slide-35
SLIDE 35

35

Physical Layer Standards

Parameter: AMPS | IS54 | GSM | JCP | DECT | CT2 | PHP | 802.11FH

Origin: EIA/TIA | EIA/TIA | ETSI | ETSI | UK | Japan | IEEE

Access: FDD | FDM/FDD/TDM | FDM/FDD/TDM | FDM/TDM/TDD | FDM/TDD | TDM/TDD | FH/FDM

Modulation: FM | pi/4 QPSK | GMSK, diff | pi/4 DQPSK | GFSK | GFSK | pi/4-DQPSK | (G)FSK

Baseband filter: Root raised cosine | Root raised cos. beta=0.3 | Root raised cosine | Gaussian BT=0.5 | Gaussian BT=0.5 | Root Nyquist

slide-36
SLIDE 36

36

Software radio solution?

BATTERY (40+ lbs)

slide-37
SLIDE 37

37

How to implement a software radio

  • Convert to digital representation as close to the antenna as possible
  • Determine the best architecture to perform the DSP (FFTs, filters, correlators, …)

[Block diagram: LNA and AGC => A/D (sampled at fsamp) => Digital Signal Processing]

slide-38
SLIDE 38

38

Example of the digital processing - Direct sequence spread spectrum (CDMA)

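  • Modulator (transmit side)

[Block diagram: the data input (bit period tbit) is multiplied (X) by the spreading code (chip period tchip) to give the spread output data]

  • Demodulator (receive side) - a correlator is needed to decode the data

[Block diagram: the received signal is multiplied (X) by the spreading code (chip period tchip) and accumulated (Acc) over each bit period tbit]

Again we have a MAC requirement; accumulations are performed at the chip rate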
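A toy model of direct sequence spreading and correlation is sketched below. The 7-chip code and the data bits are arbitrary values chosen for the example, with bits and chips represented as +1 and -1.

```python
def spread(bits, code):
    """Modulator: multiply every data bit by each chip of the spreading code."""
    return [b * c for b in bits for c in code]


def despread(chips, code):
    """Demodulator: correlate each bit period with the code and take the sign."""
    n = len(code)
    decoded = []
    for start in range(0, len(chips), n):
        acc = 0
        for c, r in zip(code, chips[start:start + n]):
            acc += c * r  # one multiply-accumulate per chip
        decoded.append(1 if acc >= 0 else -1)
    return decoded


code = [1, -1, 1, 1, -1, -1, 1]   # hypothetical spreading code
bits = [1, -1, -1, 1]

tx = spread(bits, code)
rx = list(tx)
rx[3] = -rx[3]                    # corrupt one chip in the first bit period

print(despread(tx, code))
print(despread(rx, code))         # still decodes, the correlation sum only drops from 7 to 5
```

The accumulator runs once per chip and is read out once per bit, so the arithmetic rate is set by the chip rate rather than the data rate.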

slide-39
SLIDE 39

39

Efficiency of direct mapping - CDMA digital baseband architecture

[Block diagram of the digital baseband. Sections: Timing Recovery (Delay Locked Loop), Data Recovery, Adjacent Cell Scan. Blocks: Analog RF Section, Digital Clocks, Phase Control, Comparator, Correlator, Delay, Clock Mux, Loop Gain, Walsh Decode, P/N Descrambler, RAKE Combiner, Correlator (x3), Channel Estimator, Data Mux (Bits Out). Clocks: 256 MHz, 128 MHz, 64 MHz, 1 MHz]

~1000 Mops using 27 mW at 1.5 volts - 30Mops/mW

slide-40
SLIDE 40

40

Summary How is DSP different?

  • Essentially infinite streams of data which need to be processed in real time
  • Relatively small programs and data storage requirements
  • Intensive arithmetic processing with low amount of control and branching (in the critical loops)
  • High amount of I/O with analog interface
  • Loosely coupled multiprocessor operation
slide-41
SLIDE 41

41

Summary How are DSP µPs different?

  • Single cycle multiply accumulate (multiple busses and array multipliers)
  • Complex instructions for standard DSP functions (IIR and FIR filters, convolvers)
  • Specialized memory addressing
    – Bit reversal (FFT)
    – Modulo arithmetic for circular buffers (delay lines)
  • Zero overhead loops and repeat instructions
  • I/O support
    – Serial and parallel ports
    – DMA
    – A/D and D/A interface
  • Limited use of data and instruction caches
  • Compiler support for hazard elimination
slide-42
SLIDE 42

42

Tradeoff between high performance µPs and DSPs

  • Advantages of General Purpose µPs
    – High volume production advantages
    – High level language and tool support
    – Efficient implementation of non-DSP tasks
    – Higher clock rates and more advanced technology

  • Advantages of DSP µPs
    – Software and development support for signal processing applications (filters, FFTs, etc.)
    – Real Time OS and application libraries
    – Minimal support chips
    – Variety of versions allows cost/performance/power tradeoffs
    – Low cost