
Nov 3rd, 2006 Dissertation Defense

Design and Implementation of an FPGA-Based Scalable Pipelined Associative SIMD Processor Array with Specialized Variations for Sequence Comparison and MSIMD Operation

Hong Wang
Department of Computer Science
Kent State University

Table of Contents

Chapter 1 – Introduction
Chapter 2 – First Prototypes of an Associative Computing (ASC) Processor
Chapter 3 – A Scalable Pipelined ASC Processor With Reconfigurable PE Interconnection Network
Chapter 4 – A Specialized ASC Processor with Reconfigurable 2D Mesh for Solving the Longest Common Subsequence (LCS) Problem
Chapter 5 – An ASC Processor to Support Multiple Instruction Stream Associative Computing (MASC)
Chapter 6 – Conclusions and Future Work


Associative Computing

Associative computing is particularly well suited to processing records of data in a tabular format.

As illustrated, each Processing Element (PE) of the SIMD associative computing array can store a record of this tabular data in its memory.

PE     Student Name    ID  Grade
PE0    John Smith      07    66
PE1    Gary Heath      05    95
PE2    Peter Smith     11    87
PE3    John Smith      04    78
PE4    Tarry Stanley   02   100
PE5    Will Hanson     01    84
PE6    Jane Antony     06    64
PE7    Mark Bloggs     13    88
PE8    Gill Pister     09    75
PE9    Min Lee         10    83
PE10   Goby Carmen     03    83
PE11   Gillian Roger   08    26

[Figure columns: Mask and RSPD (responder) bits after Search, STEP1, and STEP2]


Implementing Associative Computing in the ASC Processor

Associative search: the Control Unit broadcasts the search key to all PEs, and each PE compares it with its local memory. PEs whose search succeeds are designated responders; they set their Responder bit and the top of their Mask Stack to ‘1’.

Processing the responders sequentially: the STEP instruction uses the Responder Resolution Unit and the Mask Stack to process responding PEs one by one.

Searching for the maximum/minimum value in a field uses the Falkoff algorithm, which processes bit slices from left to right.
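The bit-slice maximum search can be imitated in software. The sketch below is a minimal Python simulation, assuming unsigned 8-bit fields; the function name `falkoff_max` and the list-based model of the PE array are illustrative assumptions, not part of the ASC Processor design.

```python
def falkoff_max(values, width=8):
    """Return (maximum, responder indices) by scanning bit slices MSB to LSB.

    Each list element models the field held by one PE. All candidates are
    examined together at every bit position, as a SIMD array would do.
    """
    candidates = [True] * len(values)            # every PE starts as a candidate
    for bit in range(width - 1, -1, -1):         # leftmost (most significant) slice first
        has_one = [c and ((v >> bit) & 1) == 1 for c, v in zip(candidates, values)]
        if any(has_one):                         # some candidate holds a 1 in this slice
            candidates = has_one                 # candidates holding a 0 drop out
    responders = [i for i, c in enumerate(candidates) if c]
    maximum = values[responders[0]] if responders else None
    return maximum, responders


grades = [66, 95, 87, 78, 100, 84, 64, 88, 75, 83, 83, 26]
print(falkoff_max(grades))   # (100, [4])
```

The loop runs once per bit of the field, so the number of steps depends on the field width rather than on the number of PEs. A minimum search would be symmetric, keeping the candidates that hold a 0 in each slice.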

Student Name    ID  Grade
John Smith      07    66
Gary Heath      05    95
Peter Smith     11    87
John Smith      04    78
Tarry Stanley   02   100

[Figure columns: Mask and RSPD bits after Search, STEP1, and STEP2; the set bits appear in the rows for Gary Heath and Tarry Stanley]
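The search and STEP operations can be sketched as a small Python simulation. This is a sketch under stated assumptions: the `PE` class, the `search` and `step` helpers, and the search condition (grade of at least 90) are hypothetical and chosen only so that the responders match the two rows marked in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    """One processing element holding a single record (illustrative model)."""
    name: str
    student_id: str
    grade: int
    mask_stack: list = field(default_factory=lambda: [1])  # 1 = PE is active
    responder: int = 0


def search(pes, predicate):
    """Model a broadcast search: every active PE tests its own record."""
    for pe in pes:
        hit = 1 if pe.mask_stack[-1] and predicate(pe) else 0
        pe.responder = hit
        pe.mask_stack.append(hit)


def step(pes):
    """Pick the first remaining responder, clear its bit, return its index."""
    for index, pe in enumerate(pes):
        if pe.responder:
            pe.responder = 0
            return index
    return None


records = [PE("John Smith", "07", 66), PE("Gary Heath", "05", 95),
           PE("Peter Smith", "11", 87), PE("John Smith", "04", 78),
           PE("Tarry Stanley", "02", 100)]
search(records, lambda pe: pe.grade >= 90)   # assumed search condition
visited = []
while (index := step(records)) is not None:
    visited.append(records[index].name)
print(visited)   # ['Gary Heath', 'Tarry Stanley']
```

The sequential loop in `step` stands in for the Responder Resolution Unit, which in hardware selects the first responder without scanning the PEs one at a time.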


Database Processing

In the following slides I present some applications of our processor.

Relational Database Processing: O(|B|)

Intersection, Union, Cartesian Product, and Join are basic operations in database processing. Using associative Search and STEP operations, we can achieve much faster processing times.

PE     Student ID  Class
PE7    04          239
PE8    11          111
PE9    07          239
PE10   07          124
PE11   05          124
PE12   04          111
PE13   05          111
PE14   04          111
PE15   07          124
PE16   11          124

[Figure: two panels, Intersection and Union, each showing Relation A and Relation B over the records above, a CR column, and Steps 1 through 4]
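One way to see where an O(|B|) bound can come from is to count broadcast searches. The following Python sketch is an assumption-laden illustration, not the slide's exact procedure: the function name `associative_intersection` is hypothetical, and the split of the ten records into Relation A and Relation B is invented for the example because the figure's grouping is not legible here.

```python
def associative_intersection(relation_a, relation_b):
    """Intersect two relations with one broadcast search per tuple of B."""
    in_result = [False] * len(relation_a)      # one flag per PE holding a tuple of A
    steps = 0
    for key in relation_b:                     # sequential loop in the control unit
        steps += 1
        # The comparison below models all PEs matching the key in the same step.
        responders = [tup == key for tup in relation_a]
        in_result = [f or r for f, r in zip(in_result, responders)]
    result = [tup for tup, f in zip(relation_a, in_result) if f]
    return result, steps


# Assumed split of the records: first five rows as A, last five as B.
relation_a = [("04", 239), ("11", 111), ("07", 239), ("07", 124), ("05", 124)]
relation_b = [("04", 111), ("05", 111), ("04", 111), ("07", 124), ("11", 124)]
print(associative_intersection(relation_a, relation_b))   # ([('07', 124)], 5)
```

The step count equals the number of tuples in B regardless of the size of A, because each broadcast is compared by every PE at once.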


Image Processing (Edge Detection Using Convolution)

[Figure: Input Image, Weight mask, and Output Image grids]

String Matching

[Figure: animation frames of associative string matching. The Associative Control Unit (CU) holds patt_string (AB), patt_length, patt_counter, and the index j; five APEs (1 through 5) hold the parallel variables text$ (@ A B A A), counter$, and match$, with responders marked R]


Scalable ASC (Associative Computing) Processor

[Block diagram: Control Unit with Common Registers; Instruction Bus and Data Bus from the Control Unit; PE Array of PE and Memory blocks; Network; Responder Resolution Unit; memory and supporting circuitry]

Implementing 1-D and 2-D PE Interconnection Network

The network is implemented as a large 8×N-bit-wide NWIN register (where N is the number of PEs) and an 8×N-bit NWOUT register.

Data enters the network through the NWIN register, which stores the data for PE j in bits 8j to 8j+7, and is then routed to the proper place in the NWOUT register.

[Figure: NWIN Register and NWOUT Register, each connected to PE0, PE1, PE2, ..., PE(n-3), PE(n-2), PE(n-1), and a Control Signal]
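The register layout can be illustrated with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the routing shown (each byte moves to the next higher-numbered PE, with zero fill at PE0) is an assumed example of one 1-D transfer, since the slide does not spell out the routing function.

```python
def pack_nwin(pe_bytes):
    """Pack one byte per PE into an integer; PE j occupies bits 8j..8j+7."""
    nwin = 0
    for j, value in enumerate(pe_bytes):
        nwin |= (value & 0xFF) << (8 * j)
    return nwin


def unpack(register, n):
    """Split an 8*n-bit register back into one byte per PE."""
    return [(register >> (8 * j)) & 0xFF for j in range(n)]


def route_to_next_pe(nwin, n):
    """Assumed 1-D routing: PE j's byte lands in PE j+1's slot of NWOUT."""
    width_mask = (1 << (8 * n)) - 1            # keep the register 8*n bits wide
    return (nwin << 8) & width_mask


nwin = pack_nwin([0x11, 0x22, 0x33, 0x44])
nwout = route_to_next_pe(nwin, 4)
print(hex(nwin), [hex(b) for b in unpack(nwout, 4)])
# 0x44332211 ['0x0', '0x11', '0x22', '0x33']
```

Treating the whole network as one wide register makes a transfer a single register-to-register move, at the cost of register width growing linearly with the number of PEs.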

Implementing 1-D and 2-D PE Interconnection Network

This version of the ASC Processor supports both a 1-D and a 2-D PE interconnection network for those applications that require a network.

[Figure: 1-D and 2-D PE interconnection layouts]


ASC Processor’s Pipelined Architecture

I have implemented a scalable pipelined SIMD Associative (ASC) Processor using Altera FPGAs.

Field Programmable Gate Arrays (FPGAs) are typically used for designs and can be thought of as programmable hardware.

Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs.

In the Control Unit:
    Instruction Fetch (IF)
    Part of Instruction Decode (ID)

In the Scalar PE (SPE) and in each Parallel PE (PPE):
    Rest of Instruction Decode (ID)
    Execute (EX)
    Memory Access (MEM)
    Data Write Back (WB)
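The effect of the five single-cycle stages on throughput can be sketched with a small schedule calculation. The Python below assumes an ideal pipeline with no stalls or hazards, which the slides do not claim; the names `STAGES`, `pipeline_schedule`, and `total_cycles` are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions):
    """Return {cycle: [(instruction, stage), ...]} for an ideal pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset + 1          # instruction i enters IF in cycle i+1
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule


def total_cycles(num_instructions):
    """Cycles until the last instruction leaves WB (0 for an empty program)."""
    return max(pipeline_schedule(num_instructions)) if num_instructions else 0


print(pipeline_schedule(3)[3])   # [(0, 'EX'), (1, 'ID'), (2, 'IF')]
print(total_cycles(10))          # 14
```

Under these assumptions, n instructions finish in n + 4 cycles, so the rate approaches one instruction per clock cycle once the pipeline is full.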

Pipelined ASC Processor with Reconfigurable Interconnection Network

[Block diagram: Control Unit (CU), Sequential PE (SPE), and Parallel PE (PPE) Array; labels include Instruction Memory, Decoder, IF/ID Latch, Register File, ID/EX Latch, EX/MEM Latch, MEM/WB Latch, Data Memory, Immediate Data, Broadcast, and Register Data]

Processing Element (PE)

[Block diagram: Register File, Data Switch, Comparator, Mask, MUX, ID/EX Latch, EX/MEM Latch, MEM/WB Latch, and Data Memory]

The Comparator implements associative search: it pushes ‘1’ onto the top of the mask stack for responders and ‘0’ otherwise.

A ‘0’ on top of the mask stack disables the ID/EX Latch.

Pipelined ASC Processor’s Performance

Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs.

Other 8-bit processor cores implemented on this FPGA and speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz.

Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors.

With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz.

Reconfigurable PE Interconnection Network

Our pipelined ASC Processor also has a reconfigurable PE interconnection network.

The reconfigurable PE network supports associative computing by allowing arbitrary PEs in the PE Array to be connected, without the restriction of physical adjacency, via
    a linear array (currently implemented), or
    a 2D mesh (shown in the next chapter).

Each PE in the PE Array can choose its own connectivity: responders choose to stay in the PE interconnection network, and non-responders choose to stay out of it, so that they are bypassed by any inter-PE communication.
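The bypass behavior on a linear array can be sketched as a neighbor computation. This Python fragment is a software illustration only; the function name `bypass_neighbors` and the use of `None` to mark the ends of the array are assumptions.

```python
def bypass_neighbors(responder_flags):
    """Map each responder PE to its (left, right) responder neighbors.

    Non-responders are bypassed; None marks the end of the linear array.
    """
    in_network = [i for i, flag in enumerate(responder_flags) if flag]
    neighbors = {}
    for pos, pe in enumerate(in_network):
        left = in_network[pos - 1] if pos > 0 else None
        right = in_network[pos + 1] if pos + 1 < len(in_network) else None
        neighbors[pe] = (left, right)
    return neighbors


flags = [1, 0, 0, 1, 1, 0, 1]            # PEs 0, 3, 4, and 6 are responders
print(bypass_neighbors(flags))
# {0: (None, 3), 3: (0, 4), 4: (3, 6), 6: (4, None)}
```

PE 0 and PE 3 become logical neighbors even though PEs 1 and 2 sit between them physically, which is the sense in which the network ignores physical adjacency.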

Reconfigurable Network Implementation

[Figure: Data Switch connected to the Register File, Register Data (from SPE), Immediate Data (from CU), Left Neighbor, Right Neighbor, Top of Mask Stack, and the Comparator & ID/EX Latch]

Data switch
    Passes register, broadcast, and immediate data to the PE and to its two neighbors
    Routes data from the PE’s neighbors to its EX stage

Reconfigurable network: supports Bypass Mode to remove non-responder PEs from the network
    Will be needed by the MASC Processor

ASC Processor’s Network Performance

The performance of the ASC Processor degrades as the number of PEs is increased with Bypass Mode present, due to the long path from the first PE to the last PE in the PE array.

A 4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present. When the number of PEs is increased to 50, the clock frequency drops to 22 MHz.

In the future, this delay may be reduced using a pipelined or other multi-hop architecture.

Overview of LCS Algorithm

Given two strings, find the longest common subsequence (LCS) of both strings.

Example:
    String 1: AGACTGAGGTA
    String 2: ACTGAG

List of possible alignments:
    AGACTGAGGTA
    --ACTGAG---
    --ACTGA-G--
    A--CTGA-G--
    A--CTGAG---

The time complexity of this algorithm is O(nm), where n and m are the lengths of the two strings.

Overview of LCS Algorithm

        A  G  A  C  T  G  A  G  G  T  A
     0  0  0  0  0  0  0  0  0  0  0  0
  A     1  1  1  1  1  1  1  1  1  1  1
  C     1  1  1  2  2  2  2  2  2  2  2
  T     1  1  1  2  3  3  3  3  3  3  3
  G     1  2  2  2  3  4  4  4  4  4  4
  A     1  2  3  3  3  4  5  5  5  5  5
  G     1  2  3  3  3  4  5  6  6  6  6
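The table for the two example strings can be reproduced with the standard sequential dynamic-programming recurrence, which is where the O(nm) cost comes from. The Python below is a minimal sketch of that textbook method, not of the parallel mesh implementation; the function names are illustrative.

```python
def lcs_table(s1, s2):
    """Build the (len(s2)+1) x (len(s1)+1) table of LCS lengths."""
    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]   # row 0 and column 0 stay zero
    for i in range(1, rows):
        for j in range(1, cols):
            if s2[i - 1] == s1[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs(s1, s2):
    """Recover one longest common subsequence by tracing back through the table."""
    table = lcs_table(s1, s2)
    i, j, out = len(s2), len(s1), []
    while i > 0 and j > 0:
        if s2[i - 1] == s1[j - 1]:
            out.append(s2[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = lcs_table("AGACTGAGGTA", "ACTGAG")
print(table[-1][-1], lcs("AGACTGAGGTA", "ACTGAG"))   # 6 ACTGAG
```

Each cell depends only on its upper, left, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other; that independence is what a parallel mesh can exploit.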

PEs Form Coteries

A 5 x 5 coterie network with switches shown in “arbitrary” settings. Shaded areas denote coteries (sets of PEs sharing the same circuit).

Reconfigurable Network in the ASC Processor

The key to reconfigurability is the Data Switch inside each PE:

The Data Switch is expanded to connect to its four neighbors (N-E-S-W) to form a 2D reconfigurable network.

The Data Switch has a bypass mode that allows PE communication to skip non-responders, so as to support associative computing.

[Figure: Data Switch with N, E, S, and W data communication ports]

LCS Algorithm on Reconfigurable 2D Mesh

[Figure: 2D mesh of PEs labeled by (row, column) position, from 1,1 through 1,11 along the first row and down to row 6, with the characters of the two sequences along the mesh edges]


MASC Architecture

MASC is an MSIMD (multiple SIMD) version of ASC that supports multiple Instruction Streams (ISs).

In our dynamic MASC Processor, tasks are assigned to available ISs from a common pool as those ISs become available.

[Figure: Task Manager (TM) and Instruction Stream (IS) Pools in the MASC Processor. Labels: TM POOL (TM0, TM1, TM2), IS POOL (IS0, IS1, IS2), Task Allocation Control Signal, Control and Instructions, Multiple Instruction stream, PE0, PE1, ..., PE(n-1), PE(n)]

[Figure: Task Manager and Instruction Stream state diagrams; state and transition labels include IDLE, TASK ALLOC, FETCH, JOIN, CALL TM, WAIT FOR IS, Wait for IS to Return, Wait For TM, and Signal TM For Return]

1. Initially, IS0 is executing the program
2. When a conditional operation is encountered, IS0 transfers program control to TM0 (first available TM in the TM pool)
3. TM0 allocates Task0 to IS0
4. TM0 allocates Task1 to IS1 (first available IS in the IS pool)
5. TM0 waits for Task0 and Task1 to finish
6. [ IS0 and IS1 perform lines 1 through 5 as necessary ]
8. After both Task0 and Task1 finish, TM0 transfers program control back to IS0
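The allocation steps above can be sketched as a small pool model in Python. Everything here is a hypothetical illustration: the class `MascPools`, its `fork` and `join` methods, and the use of integer IDs are assumptions, and the sketch does not model waiting when a pool is exhausted (it simply raises `ValueError`).

```python
class MascPools:
    """Illustrative model of the TM and IS pools (IDs are plain integers)."""

    def __init__(self, num_tms, num_iss):
        self.free_tms = set(range(num_tms))
        self.free_iss = set(range(1, num_iss))    # IS0 starts out running the program

    def fork(self, caller_is, tasks):
        """The caller IS hands control to a TM, which assigns one IS per task."""
        tm = min(self.free_tms)                   # first available TM in the pool
        self.free_tms.remove(tm)
        assigned = {tasks[0]: caller_is}          # the caller keeps the first task
        for task in tasks[1:]:
            instruction_stream = min(self.free_iss)   # first available IS
            self.free_iss.remove(instruction_stream)
            assigned[task] = instruction_stream
        return tm, assigned

    def join(self, tm, assigned, caller_is):
        """All tasks are done: release the borrowed ISs and the TM."""
        for instruction_stream in assigned.values():
            if instruction_stream != caller_is:
                self.free_iss.add(instruction_stream)
        self.free_tms.add(tm)
        return caller_is                          # control returns to the caller IS


pools = MascPools(num_tms=3, num_iss=3)
tm, assigned = pools.fork(0, ["Task0", "Task1"])
print(tm, assigned)                 # 0 {'Task0': 0, 'Task1': 1}
print(pools.join(tm, assigned, 0))  # 0
```

A nested conditional inside Task1 would call `fork` again with IS1 as the caller, taking the next free TM, which mirrors how the number of TMs bounds the nesting depth.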

PE Structure

A dedicated CU Selector unit is added to the IE stage of each Parallel PE.

The CU Selector is a large multiplexer choosing which TM or IS will control the PE and to which broadcast network the PE should listen.

[Figure: PE with CU Selector; labels include mask 000, CU ID, compare, TM signals, IS signals, and CU signals]

[Figure: nested Parallel Select example. An outer Parallel Select (Case 1, Case 2) and an inner Parallel Select (Case 1, Case 2) are executed by IS0, IS1, and IS2 under TM0 and TM1; after End Parallel Select the program continues]

Instruction Stream Tree Structure

The implementation can easily be scaled up as necessary. The number of TMs limits the degree of nesting, and the number of ISs limits the number of tasks executed in parallel.

[Figure: tree of Task Managers (TM0 through TM5) and Instruction Streams (IS)]


Contributions

My first contribution was to construct the first working ASC Processor and the first scalable ASC Processor [WMPP’04].

My second contribution was to develop and implement a scalable pipelined ASC Processor [PDCS’05].

My third contribution was to build on the reconfigurable PE interconnection network developed in conjunction with my pipelined ASC Processor to develop a specialized version of this processor, designed to support an innovative LCS algorithm that is useful in bioinformatics applications such as genome sequence comparison [PDCS’06 #1].

My fourth and final contribution was to develop a first prototype of the MASC Processor, an MSIMD version of ASC that better supports control parallelism and increases PE utilization [PDCS’06 #2].

Future Work

One area of future work would be to explore more computationally intensive associative computing algorithms that could benefit from this pipelined ASC Processor.

Another area for future research would be to further explore the use of the specialized ASC Processor for genome sequence comparison.

Yet another area for future research would be to explore the use of this first MASC Processor for various applications.