A Scalable Processor with Embedded Software for Large-Scale Scientific Applications

SLIDE 1

A Scalable Processor with Embedded Software for Large-Scale Scientific Applications

Daniel Alex Finkelstein and Haldun Hadimioglu
Polytechnic University, Brooklyn, NY, USA

SLIDE 2

Outline

  • 1. Motivation/Goals and Related Work
  • 2. Peripheral Context
  • 3. Experimental Platform
  • 4. Associative Streaming Memory Processor
  • 5. Conclusions

SLIDE 3

Motivation & Goals

  • 1. Address the memory wall problem
  • More intelligence in memory and storage usage
  • Target the CPU, main memory, and peripherals
  • Exploit regular, large-scale scientific applications
  • For high-speed processing, reconfigurable fabric is needed
  • Use embedded RISC cores for ‘slow’ tasks and algorithms too difficult to map directly onto FPGA fabric

SLIDE 4

Motivation & Goals

  • 2. RISC + FPGA fabric combination
  • Schedules large and fine-grained data movements
  • Associates data streams, associates data within streams, and mixes data streams on-chip or using latency-controlled peripherals (cache)
  • 3. Scalability to handle large problems: efficient multi-chip configurations
  • 4. Different mixtures of processor, memory, fabric, and peripheral composition

SLIDE 5

Related Work

1. Molen & Garp

  • Both use FPGA reconfigurable logic and processor cores, but for application acceleration. Garp also allows the FPGA direct access to main memory.

2. RAW

  • Programmable insofar as the instructions and data can be rerouted through the tiles via switching.

3. RAMP

  • Though a simulation environment, some of the (many) proposed features include dataflow architectures for programming languages, high-bandwidth peripherals, and reusable logic cores for the FPGA fabric.

4. RSVP

  • Decoupled operand prefetch, vector stream units, and detailed vector stream descriptors. Aimed at media-rich applications, but can be generalized.

SLIDE 6

Supercomputing Applications

Our criteria:

  • Large data sets
  • Floating point operations
  • Regular data structures

  • SPEC CPU 2000 FP suite
  • Stable, predictable memory profiles

SLIDE 7

Peripheral Bandwidth

Xilinx

  • Virtex 4: DDR DRAM, 400 Mbps
  • Virtex 4: DDR2 DRAM, 667 Mbps
  • Virtex-II Pro: Rocket I/O, 75 Gbps
  • Virtex-II Pro X: Infiniband, 10 Gbps

Intel

  • 975X Express Chipset: SATA, 3 Gbps
  • 975X Express Chipset: DDR2 DRAM, 85.6 Gbps
  • IXP2855 Network Processor: RDRAM, 57.6 Gbps
  • IXP2855 Network Processor: Ethernet, 10 Gbps

Peripherals are getting faster, but are usually separated from the processors by several levels of the memory hierarchy.
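To put the quoted figures in perspective, the following Python sketch estimates the ideal time each link would need to move a fixed payload at its peak rate. The 256 MiB payload size and the function name `transfer_time_s` are illustrative assumptions, not part of the presentation, and latency and protocol overhead are ignored.

```python
# Peak rates as quoted on the slide, in bits per second.
PEAK_RATE_BPS = {
    "DDR DRAM (Virtex 4)": 400e6,
    "DDR2 DRAM (Virtex 4)": 667e6,
    "Rocket I/O (Virtex-II Pro)": 75e9,
    "Infiniband (Virtex-II Pro X)": 10e9,
    "SATA (Intel 975X)": 3e9,
    "DDR2 DRAM (Intel 975X)": 85.6e9,
    "RDRAM (Intel IXP2855)": 57.6e9,
    "Ethernet (Intel IXP2855)": 10e9,
}

# Illustrative payload size (assumption): 256 MiB.
PAYLOAD_BYTES = 256 * 2**20


def transfer_time_s(num_bytes, rate_bps):
    """Ideal time to move num_bytes at rate_bps, ignoring latency and overhead."""
    return num_bytes * 8 / rate_bps


# Print the links from slowest to fastest.
for name, rate in sorted(PEAK_RATE_BPS.items(), key=lambda kv: kv[1]):
    millis = transfer_time_s(PAYLOAD_BYTES, rate) * 1e3
    print(f"{name:30s} {millis:10.1f} ms")
```

Even at peak rates the spread covers more than two orders of magnitude, which is why the placement of a peripheral in the memory hierarchy matters as much as its raw bandwidth.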

SLIDE 8

ML310 Development Board

  • XC2VP30 FPGA with 2 PowerPC 405 RISC Cores
  • 256 MB DDR DRAM
  • SanDisk
  • IDE interfaces
  • PCI
  • Ethernet

SLIDE 15

PowerPC - Custom IP Interface

[Diagram: inside the FPGA configurable fabric, the PowerPC core's instruction (I) and data (D) ports attach to the PLB, which connects through the PLB2OPB bridge to the OPB, where the custom IP is attached. Width labels shown: 32, 64, 64.]

SLIDE 18

PowerPC - Custom IP Interface: Memory Options

[Diagram: the PowerPC core (I and D ports), PLB, PLB2OPB bridge, OPB, and custom IP, with two memory options added: OCM (Block RAM), labeled 64 for I and 32 for D, and DDR DRAM, labeled 16, 32, 64, 32, 64. Width labels on the core buses: 32, 64, 64.]

SLIDE 21

PowerPC: A High Level Controller

  • The PowerPC (PPC) core (with some hardware additions) can execute user programs
  • The PPC core may also perform control and monitoring functions for the processor system

[Diagram: the PPC runs user code alongside the High Level Controller.]

SLIDE 24

High Level Controller Features

  • Peripheral latency monitoring: store delays between data request and data arrival for each peripheral
  • Peripheral data prefetch control: begin retrieving data streams and process immediately or buffer locally, based on latency values
  • Peripheral resource scheduling: control access to peripherals
  • Coarse-grained data association, mixing, and buffering: vectorized data streams allow the controller to associate streams with each other (dependencies), re-order data into memory buffers (useful in matrix-matrix multiplication and transposes), etc.
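The latency monitoring and prefetch control features can be pictured with a small software model. This is only a sketch under stated assumptions: the class name, the method names, the threshold-based policy, and all timing values are hypothetical, and the presentation does not specify how the controller actually makes its decision.

```python
from collections import defaultdict


class HighLevelController:
    """Toy model of latency monitoring and prefetch control (hypothetical API)."""

    def __init__(self, buffer_threshold_us):
        # Assumed policy: peripherals slower than this threshold are buffered.
        self.buffer_threshold_us = buffer_threshold_us
        self.latencies_us = defaultdict(list)

    def record_latency(self, peripheral, request_us, arrival_us):
        """Store the delay between a data request and the data arrival."""
        self.latencies_us[peripheral].append(arrival_us - request_us)

    def mean_latency_us(self, peripheral):
        """Average observed latency for one peripheral."""
        samples = self.latencies_us.get(peripheral)
        if not samples:
            raise ValueError(f"no latency samples for {peripheral!r}")
        return sum(samples) / len(samples)

    def prefetch_policy(self, peripheral):
        """Process fast streams immediately; buffer slow ones locally."""
        if self.mean_latency_us(peripheral) <= self.buffer_threshold_us:
            return "process_immediately"
        return "buffer_locally"


# Illustrative values only; none of these numbers come from the slides.
hlc = HighLevelController(buffer_threshold_us=5.0)
hlc.record_latency("ddr_dram", request_us=0.0, arrival_us=0.2)
hlc.record_latency("ide_disk", request_us=0.0, arrival_us=9000.0)
print(hlc.prefetch_policy("ddr_dram"))
print(hlc.prefetch_policy("ide_disk"))
```

Keeping a per-peripheral latency history lets the same policy code serve every device, which matches the slide's description of storing delays "for each peripheral".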

SLIDE 25

Low Level Controller Features

  • The LLC is the custom IP (shown earlier) connected to the OPB
  • The LLC resides within the FPGA’s configurable logic blocks (CLBs)
  • The LLC performs both computations on the data streams and fine-grained data mixing

  • The low level controller is unique for each application.
  • Reusable functional logic can be dynamically mapped onto CLBs at compile time or run time.
  • Control signals are exchanged with the HLC to indicate the status of buffers, dout_rdy, etc.
  • Data element arithmetic operations (int/FP) are performed at this level.
  • Elements in streams may be remixed to satisfy output constraints, dependent instructions, etc.
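As a rough software analogy for the two LLC roles named above, element arithmetic and fine-grained mixing, the Python generators below operate on two data streams. In the presented system this work is done in FPGA logic; the function names and the choice of a multiply-accumulate operation are assumptions made purely for illustration.

```python
def multiply_accumulate(stream_a, stream_b):
    """Element arithmetic: yield the running sum of pairwise products."""
    acc = 0.0
    for a, b in zip(stream_a, stream_b):
        acc += a * b
        yield acc


def interleave(stream_a, stream_b):
    """Fine-grained mixing: alternate elements from the two streams."""
    for a, b in zip(stream_a, stream_b):
        yield a
        yield b


print(list(multiply_accumulate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])))
# prints [4.0, 14.0, 32.0]
print(list(interleave([1, 2, 3], [10, 20, 30])))
# prints [1, 10, 2, 20, 3, 30]
```

Both functions consume their inputs one element at a time, so they never need the whole stream in memory, mirroring the streaming style the slides describe.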

SLIDE 26

HLC - LLC Interaction

[Diagram: the PPC, running user code and the High Level Controller, is linked through a 32 register interface to the Low Level Controller in the FPGA CLB fabric.]
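One way to picture the HLC - LLC handshake is a shared register file that the LLC writes and the HLC polls. The sketch below is a software model only. It assumes that "32 register interface" means 32 registers and that each register is 32 bits wide; the register indices, the bit position of `dout_rdy`, and all function names are hypothetical, since the slides name the `dout_rdy` signal but give no register map.

```python
NUM_REGS = 32        # assumption: the interface exposes 32 registers
REG_STATUS = 0       # hypothetical index: status flags written by the LLC
REG_DATA_OUT = 1     # hypothetical index: result word produced by the LLC
DOUT_RDY = 1 << 0    # hypothetical bit position of the dout_rdy flag


class RegisterInterface:
    """Software stand-in for the register file shared by the HLC and LLC."""

    def __init__(self):
        self.regs = [0] * NUM_REGS

    def write(self, index, value):
        # Assumption: registers are 32 bits wide, so mask the stored value.
        self.regs[index] = value & 0xFFFFFFFF

    def read(self, index):
        return self.regs[index]


def llc_publish(iface, value):
    """LLC side: place a result in the data register and raise dout_rdy."""
    iface.write(REG_DATA_OUT, value)
    iface.write(REG_STATUS, iface.read(REG_STATUS) | DOUT_RDY)


def hlc_poll(iface):
    """HLC side: return the result and clear dout_rdy, or None if not ready."""
    if iface.read(REG_STATUS) & DOUT_RDY:
        value = iface.read(REG_DATA_OUT)
        iface.write(REG_STATUS, iface.read(REG_STATUS) & ~DOUT_RDY)
        return value
    return None


iface = RegisterInterface()
print(hlc_poll(iface))   # nothing published yet
llc_publish(iface, 42)
print(hlc_poll(iface))   # result is available once dout_rdy is set
```

Clearing the flag on read keeps the HLC from consuming the same result twice; real hardware might instead use a separate acknowledge signal.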

SLIDE 27

Conclusions

  • Fast peripherals need not be limited by intermediate controllers, buses, and operating systems.
  • Controller logic can be integrated alongside traditional processor components in a single package.
  • Intelligent use of memory peripherals on local buses reduces processing latencies.
  • Embedded software saves resources better devoted to time-sensitive computation.

SLIDE 28

Acknowledgments

We wish to acknowledge the support of Xilinx in providing us with ML310 development boards, design tools, and technical support. This work was supported in part by a research fellowship from the U.S. Department of Education GAANN.

SLIDE 29

Thank you.

contact: dfinke01@cis.poly.edu
