SLIDE 1

Faraydon Karim ST Microelectronics La Jolla, CA 92121 Faraydon.karim@st.com

Faraydon Karim MPSoC02


Outline

  • Motivation
  • Network Processor Complexity
  • Methodology and Architecture

SLIDE 2

Motivation

  • Speed Requirement
  • Communication Requirement


Need for Network Processor

[Figure labels: Performance; Configurability (evolving standards); Complexity of Network Functions; RISC Processor; Network Processor; ASIC; OC-12; OC-768]

SLIDE 3

MIPS Requirements for Network Processing

[Chart: MIPS required (axis from 5,000 to 50,000) at OC3, OC12, OC48, and OC192 line rates for L2 switching, L3 routing, QoS/CoS, monitoring, load balancing, firewall, VPN, intrusion detection, and virus scanning. Today's processors: 1-3K MIPS.*]

Need for highly concurrent SoC architectures

*Sterling Research Report, 2000


Why special-purpose NP?

Processing time budgets

Media            Cell/Packet size (bytes)  Packets/sec     Time/packet
10 Mb Ethernet   64 - 1518                 14.88k - 800    67.2 - 1,240 µs
100 Mb Ethernet  64 - 1518                 148k - 8k       6.72 - 124 µs
Gb Ethernet      64 - 1518                 1.48M - 80k     672 ns - 12.4 µs
OC-3             53                        ~300k           ~3.3 µs
OC-12            53                        ~1.2M           ~833 ns
OC-48            53                        ~4.8M           208 ns
OC-192           53                        ~19.2M          52 ns
OC-768           53                        ~76.8M          13 ns
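The Ethernet rows of the budget table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 20 bytes of per-frame wire overhead (8-byte preamble plus 12-byte inter-frame gap); that overhead and the function names are assumptions for illustration, not taken from the slide.

```python
# Assumption: each Ethernet frame costs 20 extra bytes on the wire
# (8-byte preamble + 12-byte inter-frame gap). Not stated on the slide.
ETHERNET_OVERHEAD_BYTES = 20


def ethernet_packets_per_sec(rate_bps, frame_bytes):
    """Maximum frame rate for back-to-back frames of one size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return rate_bps / bits_on_wire


def time_per_packet_s(packets_per_sec):
    """Processing budget per packet: the inverse of the packet rate."""
    return 1.0 / packets_per_sec


for name, rate in [("10 Mb Ethernet", 10e6),
                   ("100 Mb Ethernet", 100e6),
                   ("Gb Ethernet", 1e9)]:
    for size in (64, 1518):
        pps = ethernet_packets_per_sec(rate, size)
        budget_us = time_per_packet_s(pps) * 1e6
        print(f"{name:16s} {size:5d} B {pps:12.0f} pkt/s {budget_us:10.3f} us")
```

With these assumptions the 64-byte figures match the table; the 1518-byte figures come out slightly different from the slide's rounded values. The cell-based rows follow the same inverse relation between packets/sec and time/packet.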

SLIDE 4

Requirements for a Network Processor

Requirements for OC-768 network processing

  • 114 million packets/sec (44 bytes/packet)
  • Processing time < 9 ns/packet
  • Assumption: forwarding + classification = ~500 instructions
  • Requirement: 57 GIPS
  • Need for multiple GHz processors
  • Packet classification (Lakshman and Stiliadis, Proceedings of ACM SIGCOMM, Sept. 98): 50 memory accesses/packet
  • Requirement: 5.7 x 10^9 memory accesses/sec
  • Need for multiple memory components

Need for multi-processor/distributed memory architecture
Need for concurrent, high-speed on-chip communication
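The OC-768 figures above follow from simple arithmetic. The sketch below assumes the line rate is rounded to 40 Gb/s, as the deck does elsewhere; the function and dictionary key names are illustrative only.

```python
def requirements(line_rate_bps, packet_bytes, instr_per_packet, mem_per_packet):
    """Derive per-second processing and memory demands from the line rate."""
    packets_per_sec = line_rate_bps / (packet_bytes * 8)
    return {
        "packets_per_sec": packets_per_sec,
        "time_per_packet_ns": 1e9 / packets_per_sec,
        "gips": packets_per_sec * instr_per_packet / 1e9,
        "mem_accesses_per_sec": packets_per_sec * mem_per_packet,
    }


# Slide inputs: 44-byte packets, ~500 instructions and 50 memory accesses each.
# Assumption: OC-768 taken as 40 Gb/s.
oc768 = requirements(40e9, 44, 500, 50)
for key, value in oc768.items():
    print(f"{key:22s} {value:.3g}")
```

The result is about 114 million packets/sec, under 9 ns per packet, about 57 GIPS, and about 5.7 x 10^9 memory accesses/sec, which is why neither a single processor nor a single memory component is enough.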


Requirements (continued)

Requires huge computing power

~5.7GIPS for OC-192 . . . and getting worse

Requires huge memory bandwidth

data comes in at 10Gbps (OC-192) and 40Gbps (OC-768)

Inherently parallel

frame doesn’t depend on previous or next one

Data-driven

driven by data (operand) availability; asynchronous

SLIDE 5

Network Processor Complexity

  • Functional Complexity
  • Architecture Complexity
  • System Design Complexity
  • Verification Complexity


Functional Complexity

State-of-the-art functions of general-purpose processors:

  • Well-known properties
  • Existing processors are well defined
  • Simulation with established benchmarks

Network Processors are application-specific processors

Application space known ... However, very complex set of functions:

packet classification, forwarding, scheduling

  • Properties to verify not all known
  • Evolving standards

Can test suites be developed?

SLIDE 6

Functional Complexity

Segmentation and Reassembly (SAR)

Protocol Recognition and Classification

Identify frames based on information such as protocol, destination/source address, etc.

Queuing and Access Control

Queue frames awaiting further processing (prioritization)

Traffic Shaping and Engineering

Meet delay/jitter requirements

Quality of Service (QoS)

Tag frames for processing in subsequent devices

Source: Agere, Inc


Architectural Complexity

Network processing is a dataflow problem

  • Inter-packet locality is poor, so the microprocessor (uP) cache does not help.
  • There is a lot of pointer chasing, which leads to cache thrashing; the uP stalls during these indirections.
  • IPC drops dramatically because of memory latencies.
  • Caches exploit locality, but the data structures accessed per packet exhibit poor temporal locality.
  • The per-packet time budget is too tight for regular microprocessors.

SLIDE 7

Architectural Complexity

The faster the network port, the more likely it is to carry many unrelated streams. A lot of alignment issues.

Branch prediction ineffective

  • > 90% of branches taken for DSP
  • 50/50 for some network applications


Architectural Complexity

  • Networking has two conflicting requirements: programmability and speed.
  • Network processors must support both requirements where traditional microprocessors can't.
  • Current network processors have relied on duplicating/copying the ASIC paradigm on a chip:
      • either copying some off-the-shelf processors with a few additions and tying them together in the same old-fashioned way,
      • or making some minimal modifications for product differentiation purposes.
  • Besides, many of the current network processors are very difficult to program.
  • System houses demand platform solutions from manufacturers. They can no longer afford point product solutions.

SLIDE 8

Architectural Complexity Computations

  • Provide specialized network instructions to achieve more with fewer instructions.
  • Fuse several appropriate primitives to enhance performance, as is done with multiply-accumulate.
  • Add more predication to reduce branch penalties.


Architectural Complexity Computations

Use more computational processing units as needed, in:

  • pipeline fashion
  • parallel fashion
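The difference between the two arrangements can be sketched with a simplified throughput model. It ignores communication and memory contention, and the stage times are hypothetical values chosen only for illustration.

```python
def pipeline_throughput(stage_times_ns):
    """Packets/sec when each unit runs one stage: limited by the slowest stage."""
    return 1e9 / max(stage_times_ns)


def parallel_throughput(stage_times_ns, units):
    """Packets/sec when each unit runs every stage on its own packet."""
    return units * 1e9 / sum(stage_times_ns)


# Hypothetical per-packet stage times (ns) for three processing steps.
stages = [20, 30, 50]
print(pipeline_throughput(stages))     # three units, one per stage
print(parallel_throughput(stages, 3))  # three units, each doing all stages
```

Under this model an unbalanced pipeline is held back by its slowest stage, while the parallel arrangement requires every unit to hold the code and state for all stages.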

SLIDE 9

Network Processor Architecture

[Block diagram: Micro Processor; multiple Nano Processors; Host Bus Interface Unit; ST Network Interface Unit; Circular Buffer; Octagon Connection; IPA-TLC; Memory Controller & Buffers; 10Mb/100Mb/1Gb Ethernet MACs; ATM; SONET; PHYs; 128-bit CPIX Bus (166MHz)]

  • Multiple Nano-processors
  • Complex on-chip interconnects
  • High-speed memory components
  • High-speed interfaces


Nano-Processor Programming Model

[Block diagram: Register File; Control Store; System Registers; ALU; Decode Unit; Branch Processor; Load/Store; Search Engine; Multithread buffers; Data Buffer; Circular buffer Addressing; several Special Hardware blocks]

SLIDE 10

Octagon On-Chip Communication

Network Processor using Octagon

[Figure: eight-node Octagon; each node pairs a processor with a memory (P0/M0 through P7/M7)]

Octagon Node Model

[Figure: node with Request Generator, Processor, Memory, Arbiter, and Scheduler; ingress and egress MUX/DEMUX; ports labeled L, A, R]
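The slide shows the Octagon only as a figure. The sketch below assumes the wiring usually described for it: each of the eight nodes links to its two ring neighbours and to the node directly across. That wiring is an assumption here, not something the slide states, and the function names are illustrative.

```python
from collections import deque


def octagon_links(n=8):
    """Undirected links: ring neighbours plus the node directly across.

    Assumption: this is the Octagon wiring; the slide gives only a figure.
    """
    links = set()
    for i in range(n):
        links.add(frozenset((i, (i + 1) % n)))       # ring link
        links.add(frozenset((i, (i + n // 2) % n)))  # cross link
    return links


def hops(links, src, dst):
    """Shortest hop count between two nodes (breadth-first search)."""
    neighbours = {}
    for link in links:
        a, b = tuple(link)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


links = octagon_links()
max_hops = max(hops(links, s, d) for s in range(8) for d in range(8))
print(len(links), max_hops)
```

Under this assumption the network has 12 links, which agrees with the link count the deck lists for an 8-node Octagon, and any node reaches any other in at most two hops.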


System-level Design Complexity

[Figure: system-level design flow. System function; system H/W architecture (evaluation/partitioning) and system S/W architecture; H/W design (logic, MCU, DSP, DRAM, ADC, DAC, analog) and S/W design (application stacks, device drivers, RTOS, C compiler, source-level debug, performance profiling); interface design; architecture modelling (transaction -> cycle); HW/SW performance evaluation; domain-specific modelling tools; instruction-set simulation (function -> cycle); H/W-S/W co-simulation; H/W emulation; PLD board; cycle-based spec signoff; RTL-to-layout tools; system integration]

Verification needs to be performed at every step individually and collectively

SLIDE 11

Design Validation Challenges

Due to:

  • Functionality Complexity
  • Architecture Complexity
  • Embedded Application Software Complexity
  • Design Methodology Complexity


Functional Complexity

State-of-the-art verification/validation of general-purpose processors:

  • Property checking of well-established properties
  • Validation test suites of known processor functionalities
  • Simulation with established benchmarks

Network Processors are application-specific processors

Application space known ... However, very complex set of functions:

packet classification, forwarding, scheduling

  • Properties to verify not all known
  • Evolving standards

Can test suites be developed?

SLIDE 12

Architectural Complexity

State of the art in verification/validation:

  • Processor: formal and simulation-based techniques for a single processor
  • HW/SW co-designs: co-simulation of single-processor-based co-designs

However, network processors/ASICs are very complex hardware/software co-designs:

  • Multiple embedded processors
  • Multi-threading, parallel processing, pipelining
  • Mix of homogeneous and non-homogeneous processors (nano-processors and control processor)
  • Multiple co-processors/hardware accelerators for packet forwarding, packet classification, queue management


Software Complexity

  • Complex set of application, firmware, and development software
  • Need for a comprehensive set of software debugging tools
  • Need for real-time verification through hardware prototyping environments

[Figure: software environment. Cycle-accurate ISS/Network Simulator; API Library; Optimized Firmware Library; third-party routing applications; NanoPU compiler, assembler, and linker; NanoPU debugger; Instruction-set Simulator; Embedded RTOS; Architecture Performance Analysis; Network Models; H/W Prototyping Environment; NPU Programmer’s Model]

SLIDE 13

Test Challenges

Use of GHz clocks, multiple clock domains

  • To meet OC-192/OC-768 speeds
  • Need for at-speed test

Ultra-Deep Sub-micron technologies

Need test for noise problems

Deployment in optical networks

Need for mixed-signal/mixed-domain test


Need for At-Speed Test of GHz Chip

May lead to excessive test cost

  • Very high cost for a GHz external tester

Self-test is a possible alternative

  • LFSR-based self-test may not be well suited for multi-processor SoCs
  • Need to look at alternative self-test methodologies

SLIDE 14

Complex, Long On-Chip Interconnects

Octagon
Nodes                                  8         15         22 (shown)
Horizontal links, #/max length (mm)    12/8      24/8       36/8
Vertical links, #/max length (mm)      12/0.156  24/0.156   36/0.156

Crossbar
Nodes                                  8         15         22
Horizontal links, #/max length (mm)    8/8       15/16      22/22
Vertical links, #/max length (mm)      32/0.108  120/0.192  242/0.276

  • Very large number of wide buses
      • 24 32-bit buses for 8-node Octagon
      • 40 32-bit buses for 8-node Crossbar
  • Interconnects very long
      • proportional to # of nano-processors
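The bus counts quoted above equal the sum of horizontal and vertical links in the table (12 + 12 = 24 for the Octagon, 8 + 32 = 40 for the crossbar). The sketch below extends that to the other table columns. The two formulas are patterns fitted to the slide's numbers and are assumptions, not rules stated by the author.

```python
import math


def octagon_bus_count(nodes):
    """Fitted pattern: every 7 extra nodes add one octagon (12 + 12 links)."""
    octagons, remainder = divmod(nodes - 1, 7)
    if octagons < 1 or remainder:
        raise ValueError("pattern only covers 8, 15, 22, ... nodes")
    return 24 * octagons


def crossbar_bus_count(nodes):
    """Fitted pattern: N horizontal links plus N * ceil(N / 2) vertical links."""
    return nodes + nodes * math.ceil(nodes / 2)


for n in (8, 15, 22):
    print(n, octagon_bus_count(n), crossbar_bus_count(n))
```

If the fitted patterns hold, the Octagon's bus count grows linearly with node count while the crossbar's grows roughly quadratically.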


Significant DSM Noise Potential

Most affected:

Long interconnects in high-speed/low-voltage DSM SoCs

Network processors are susceptible since

They use GHz clocks and very long, nanometer-scale interconnects

[Plots: glitch height (V) versus interconnect length (mm) for 0.1 µm, 0.18 µm, 0.25 µm, and 0.35 µm technologies; delay time (ns) versus interconnect length (mm) for 0.1 µm, 0.13 µm, 0.18 µm, 0.25 µm, and 0.35 µm technologies]

SLIDE 15

Mixed-Signal/Mixed-Domain Testing

Optical Networks: Integration of digital, analog, and optical components in network chips


Methodology & Architecture

  • Abstraction Layers
  • New Pyramid of Design
  • Putting it all together

SLIDE 16

Abstraction Layers

Abstraction Layer   Modeling Type
Functional          Functional
Architecture        Processes
RTL / u-Arch        Cycle-based
Gate-Level          Boolean
Silicon             Circuit


The New Pyramid of Design

Application

  • Identify the application and analyze all its components

Scheduling

  • Find parallelizable functions
  • Find pipelinable functions

Architecture

  • Select the components for the desired cost/performance

SLIDE 17

Application

  • Understand the application
  • Draw the flow of the application


Edge Router IP Packet processing steps

Egress Module: Traffic manager block

IP packet flow within the egress module

Processing blocks:

  • Switch fabric interface: switch fabric scheduler protocol, packet re-assembly
  • Packet classifying: [CoS, OutPrt] → out queue
  • Congestion control: WRED / drop tail
  • Queue servicing (and shaping): DRR, MDRR

Memories:

  • Out queue lookup table
  • Status memory for WRED [Thr1, Thr2, AvgQsize, …]
  • Status memory for MDRR
  • Output queuing memory [packbuff ptr]
  • Output packet buffer memory

Flow labels: from the switch fabric (CSIX-L1 interface); header + payload; packetbuff ptr; packet to L2 output port interface
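The WRED status fields named above (Thr1, Thr2, AvgQsize) suggest the usual RED-style drop decision. The sketch below is a minimal illustration under that assumption: the averaging weight, the maximum drop probability, and the class and method names are hypothetical, and the slide does not specify the algorithm's parameters.

```python
import random


class WredQueueState:
    """Per-queue WRED status: two thresholds and an averaged queue size."""

    def __init__(self, thr1, thr2, max_drop_prob=0.1, weight=0.002):
        # max_drop_prob and weight are assumed tuning knobs, not slide values.
        self.thr1 = thr1
        self.thr2 = thr2
        self.max_drop_prob = max_drop_prob
        self.weight = weight
        self.avg_qsize = 0.0

    def update_average(self, current_qsize):
        """Exponentially weighted moving average of the queue size."""
        self.avg_qsize += self.weight * (current_qsize - self.avg_qsize)

    def drop_probability(self):
        """0 below Thr1, 1 at or above Thr2, linear ramp in between."""
        if self.avg_qsize < self.thr1:
            return 0.0
        if self.avg_qsize >= self.thr2:
            return 1.0
        span = self.thr2 - self.thr1
        return self.max_drop_prob * (self.avg_qsize - self.thr1) / span

    def should_drop(self, current_qsize, rng=random.random):
        """Update the average, then drop with the resulting probability."""
        self.update_average(current_qsize)
        return rng() < self.drop_probability()
```

The "weighted" part of WRED would come from keeping one such state per class of service or drop precedence, each with its own thresholds.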

SLIDE 18

Edge Router IP Packet processing steps

Ingress Module: Packet Processor

IP packet flow within the ingress packet processor/classifier block

Processing blocks:

  • Packet transfer to traffic manager engine
  • Forwarding packet preparation (FWH + updated header + payload)
  • Packet modification: header updating (CoS field, TTL decrementing, checksum calculation)
  • Forwarding header (FWH) preparation (CoS, drop prec., Out Module/Port, multicast, …)
  • IP traffic conditioning and statistics: policing/metering (token bucket, marking non-conforming packets, drop precedence value assigned)
  • IP packet classification/filtering (multifield, MF)
  • Packet parsing and lookup key preparation
  • IP packet/header validation: header length field check, packet length & min. length check, protocol version number, header checksum, IP packet lifetime control

Memories and engines:

  • Statistics memory
  • Metering: packet flows status memory [flow parameters, token counters status]
  • Lookup engine: lookup tables and ACLs
  • Packet buffer memory [packet header + payload]

Flow labels:

  • Packet from PHY (SPI-n interface): sequence of data chunks (64 bytes)
  • Packet header + payload; packet header
  • Entry: [Source/Dest addr., ToS field, Protocol Type, TCP/UDP Source/Dest port, …]
  • Flow identified (flow info record: QoS tag, egress port/line card, PTRs to metering and statistics mems)
  • Table lookup and ACLs update (PCI)
  • Local delivery
  • Pkt non-conforming
  • Packet to traffic manager: SPI-n modified or streaming interface (NP Forum), sequence of data chunks (64 bytes)
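The header validation and header updating steps listed above can be sketched for IPv4 as follows. This is an illustration only: it recomputes the checksum over the whole header, whereas hardware might update it incrementally, and the function names are assumptions rather than anything from the slides.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement of the one's-complement sum of 16-bit header words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def validate_header(packet: bytes) -> bool:
    """Version, header length, packet length, and header checksum checks."""
    if len(packet) < 20:
        return False
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    header_len = ihl * 4
    if version != 4 or ihl < 5 or len(packet) < header_len:
        return False
    total_length = struct.unpack("!H", packet[2:4])[0]
    if total_length < header_len or total_length > len(packet):
        return False
    # A header with a correct checksum field sums to zero.
    return ipv4_checksum(packet[:header_len]) == 0


def decrement_ttl(packet: bytes) -> bytes:
    """Lifetime control: decrement TTL and rewrite the header checksum."""
    header_len = (packet[0] & 0x0F) * 4
    header = bytearray(packet[:header_len])
    if header[8] <= 1:
        raise ValueError("TTL expired; packet must not be forwarded")
    header[8] -= 1
    header[10:12] = b"\x00\x00"  # checksum field is zero while summing
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + packet[header_len:]
```

A packet that passes `validate_header` can be handed to `decrement_ttl`, and the result should validate again with the TTL reduced by one.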


Scheduling

  • Define the processes
  • Divide each into sub-processes
  • Find the processes and sub-processes that can run independently (events)
  • Choose the components that can execute these events

SLIDE 19

Scheduling -2

  • An event is the smallest part of the application that cannot be subdivided to run in parallel or meaningfully in sequence
  • Name the components that can execute each event according to the cost and performance
  • Select the communication devices that can move data between components according to the desired cost and performance


Scheduling -3

  • Draw the final event flow
  • Measure the desired speeds
  • Find several flows that come closest to the desired goal
  • Select the most easily achievable one

SLIDE 20

Architecture

  • Select the component(s) for the events
  • Connect the components with a good communication architecture
  • Control the movement and dispatch of all events
  • Re-evaluate the final goals


NP Architecture

[Block diagram: Control Processor; Nano-Processor Bank; Special Purpose Processors; Voyagers; Octagon Connection; Circular Buffer; Pointer Logic & Lookup table; Checksum & Policy key; IP-TLC; RAMs; Boot ROM; System Registers; PCI Interface; Network interface; CSIX; MACs; DMAs; Interface cntl.]

SLIDE 21

Conclusions

  • Systems are getting too complex
  • Bottom-up design cannot satisfy the desired cost/performance
  • Design must proceed through several layers of abstraction
  • Applications must drive the component selection