Automated Task Distribution in Multicore Network Processors using Statistical Analysis
Arindam Mallik, Yu Zhang, Gokhan Memik
Electrical Engineering and Computer Science Dept.
Northwestern University
2008-1-9 2
Network Demand Gap
ANCS 2007
The gap increases over time [Intel]
The Path to ASIPs
Application-specific IC design
Costly
Unpredictable
Fuels the rise of programmable devices, or ASIPs
(Application-Specific Instruction-set Processors)
Networking
Multimedia
Graphics
ASIPs
Architectures have been explored in great depth
Modest progress on programming environments
But the success of users depends on their ability to program effectively
Why Network Processors ?
Traditional processors in networks
General-purpose CPU
Not fast enough to handle new link speeds
ASIC
Good performance, but lacks flexibility. New applications
or protocols make the old processor obsolete
Frequent new applications
Solution: Network Processors
Programmable processors optimized for
networking applications
Reusability of the same processor core for
different network applications
Overview
Chip Multiprocessors
Most current processor architectures
Ideal for networking applications
Data-level parallelism
Task-level parallelism
Dominant from the start (Intel IXP)
Low scalability of interconnect networks
Importance of local communication
Uniform task distribution
Outline
Introduction
Click Router Architecture
Statistical Task Allocation
Results
Conclusion
Modularity in Networking Apps
Presence of well-defined data segments (packets)
Independent packet processing
Overlooked modularity
Set of independent tasks performed on each packet: a module
Majority of networking applications: a collection of standard modules (TTL, checksum calculation)
Click Architecture
Unit of processing
‘element’ (From/ToDevice, GetHeader, Discard, Count, …)
An element encapsulates processing actions and state
Elements have input and output ports
Language-level composition of elements
Router configuration
Directed graph of elements (cycles OK), connected by ‘connections’ (at ports)
Each packet follows connections
Configuration string
Parameters and initial state used to instantiate an element
Click Configuration Example
Configuration checking the TTL value of a
packet
[Figure: Click element graph with FromDevice(eth0), DecIPTTL, ToDevice(eth1), and Discard]
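The element-and-port model can be illustrated with a small Python sketch. This is not Click code: the Element and DecTTL classes, the run helper, and the wiring (expired packets leaving DecIPTTL on a second output port that leads to Discard) are assumptions made for illustration, since the slide only names the four elements.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    ttl: int
    trace: list = field(default_factory=list)  # names of the elements visited


class Element:
    """Hypothetical stand-in for a Click element: a name plus numbered output ports."""

    def __init__(self, name):
        self.name = name
        self.outputs = {}  # output port number -> downstream Element

    def connect(self, port, downstream):
        self.outputs[port] = downstream

    def process(self, packet):
        """Return the output port the packet leaves on, or None if it stops here."""
        return 0 if self.outputs else None

    def push(self, packet):
        packet.trace.append(self.name)
        port = self.process(packet)
        if port is not None:
            self.outputs[port].push(packet)


class DecTTL(Element):
    """Decrement the TTL; send expired packets to port 1 (assumed wiring)."""

    def process(self, packet):
        if packet.ttl <= 1:
            return 1
        packet.ttl -= 1
        return 0


# Build the configuration as a directed graph of elements.
src = Element("FromDevice(eth0)")
dec = DecTTL("DecIPTTL")
out = Element("ToDevice(eth1)")
drop = Element("Discard")
src.connect(0, dec)
dec.connect(0, out)
dec.connect(1, drop)


def run(ttl):
    """Push one packet with the given TTL through the graph and return its path."""
    packet = Packet(ttl=ttl)
    src.push(packet)
    return packet.trace


print(run(64))
print(run(1))
```

Each packet simply follows the connection attached to the port it leaves on, which is the property the task distribution relies on later: the graph makes the per-packet sequence of elements explicit.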
IPv4 Router Example
[Figure: Click element graph of the IPv4 router. Elements shown: Packet Source (8 different sources), Strip(14), CheckIPHeader, StaticIPLookup, and, for the different destinations, DropBroadcast0 and DropBroadcast1, each with a DecIPTTL, plus Discard elements]
Statistical Task Allocation
Systolic Array Architecture
Execution cores arranged in a pipelined fashion
Global communication using a shared bus
Goal: Uniform Task Allocation
Automated
Each core sends its partially processed packet to the next one
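Why uniform allocation matters in a pipeline can be seen from a short calculation. The sketch below assumes an idealized pipeline with no buffering or communication cost, so the slowest stage sets the pace; the function names and the stage times are hypothetical and not taken from the talk.

```python
def pipeline_throughput(stage_times):
    """Packets per time unit in steady state: limited by the slowest stage."""
    return 1.0 / max(stage_times)


def utilization(stage_times):
    """Average fraction of time the cores are busy when the slowest stage sets the pace."""
    bottleneck = max(stage_times)
    return sum(stage_times) / (len(stage_times) * bottleneck)


# Two hypothetical allocations of the same total work (1600 time units per packet).
balanced = [400.0, 400.0, 400.0, 400.0]
skewed = [100.0, 900.0, 300.0, 300.0]

for name, stages in (("balanced", balanced), ("skewed", skewed)):
    print(name, pipeline_throughput(stages), round(utilization(stages), 3))
```

Under these assumptions the skewed allocation does the same work per packet but delivers less than half the throughput, with most cores idle while they wait for the bottleneck stage.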
Module Distribution Algorithm
Profiling
Statistical analysis of packet processing time
Streamlining
Find the total execution time of a packet
Use DFS on the element tree
Task Distribution
Assign elements to different stages/modules
Local optimization
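The streamlining and task distribution steps can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the greedy rule that closes a stage once adding another element would move its load further from the per-stage target is an assumption, the local optimization step is omitted, and the chain-shaped element tree is hypothetical. The mean times are the per-element means reported in the element table of this talk.

```python
def dfs_order(tree, root):
    """Linearize the element tree with an iterative depth-first traversal."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree.get(node, [])))
    return order


def distribute(order, mean_time, n_stages):
    """Greedily split the DFS order into at most n_stages contiguous stages."""
    target = sum(mean_time[e] for e in order) / n_stages
    stages, current, load = [], [], 0.0
    for elem in order:
        t = mean_time[elem]
        may_close = current and len(stages) < n_stages - 1
        # Close the stage if adding this element moves its load away from the target.
        if may_close and abs(load - target) <= abs(load + t - target):
            stages.append(current)
            current, load = [], 0.0
        current.append(elem)
        load += t
    stages.append(current)
    return stages


# Hypothetical element tree (a simple chain) with mean times from the element table.
tree = {"strip0": ["chkip0"], "chkip0": ["RtLkUp"], "RtLkUp": ["DBC0"], "DBC0": ["DcTTL0"]}
mean_time = {"strip0": 241.28, "chkip0": 713.01, "RtLkUp": 336.56,
             "DBC0": 212.30, "DcTTL0": 317.78}

order = dfs_order(tree, "strip0")
total = sum(mean_time[e] for e in order)
stages = distribute(order, mean_time, n_stages=4)
print(round(total, 2), stages)
```

Keeping stages contiguous in DFS order matches the pipelined arrangement, where each core hands its partially processed packet to the next one.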
Statistical Analysis of Packet Processing
Individual Elements
Executed for 5000 packets
Execution time recorded for each packet
Mean (μ) and standard deviation (σ) calculated from the statistics
The expression (μ + kσ) estimates the variation in utilization
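The profiling step can be sketched in a few lines of Python. The samples below are synthetic (drawn from a Gaussian with a fixed seed) rather than measured, and reading each threshold μ + kσ as "percentage of packets whose processing time exceeds it" is an assumption about how the statistics are used; tail_fractions is a hypothetical helper name.

```python
import random
import statistics


def tail_fractions(samples, ks=(0, 1, 2, 3, 4)):
    """Return mean, standard deviation, and the percentage of samples above mu + k*sigma."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    n = len(samples)
    tails = {k: 100.0 * sum(t > mu + k * sigma for t in samples) / n for k in ks}
    return mu, sigma, tails


# Synthetic per-packet execution times for one element, 5000 packets as in the talk.
rng = random.Random(0)
samples = [rng.gauss(300.0, 20.0) for _ in range(5000)]

mu, sigma, tails = tail_fractions(samples)
print(round(mu, 2), round(sigma, 2))
for k, pct in tails.items():
    print(f"mu+{k}sigma: {pct:.2f}% of packets exceed the threshold")
```

Measured processing times need not be Gaussian, so the tail percentages for a real element can differ considerably from this synthetic case.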
Probability Distribution of IPv4 Elements
                                 Processing time threshold
Elements   Mean (μ)   SD (σ)     μ       μ+σ     μ+2σ    μ+3σ    μ+4σ
strip0     241.28     29.31      50      0.64    0.64    0.64
chkip0     713.01     59.77      50      0.64    0.64    0.64    0.64
RtLkUp     336.56     266.88     20.03   20.03   10.01   0.03    0.03
DBC0       212.30     21.18      34.32   28.57   1.29    0.18    0.18
DcTTL0     317.78     20.34      26.45   12.98   2.09
Probability Distribution of IPv4 Router Stages
                                 Processing time threshold
Stages     Mean (μ)   SD (σ)     μ       μ+σ     μ+2σ    μ+3σ    μ+4σ
Stage0     227.38     24.14      35.06   20.00   3.64    0.00    0.00
Stage1     691.18     30.48      23.19   14.29   1.86    0.08    0.00
Stage2     500.43     29.52      27.18   24.31   5.66    0.11    0.11
Stage3     314.72     20.33      27.78   23.14   7.14    0.28    0.00
Optimized Strategies
Base Task Distribution - BTD
Uniform task allocation depending on the mean
execution time
Extended Task Distribution - ETD
Slack kσ added to estimated processing time
Selective Replication - SR
Replicate modules to parallelize packet processing
Extended Selective Replication - ESR
Select elements with longer execution time
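The difference between the mean-based and slack-based estimates can be shown with the element statistics reported in this talk. The sketch is illustrative only: the helper names are hypothetical, replication is assumed to give an ideal speedup, and the talk does not state how SR and ESR actually choose which modules to replicate.

```python
def estimated_time(mu, sigma, k=0.0):
    """BTD-style estimate uses k = 0 (mean only); ETD-style adds slack k*sigma."""
    return mu + k * sigma


def effective_time(estimate, copies):
    """Per-packet cost of a module replicated `copies` times, assuming ideal speedup."""
    return estimate / copies


# (mean, standard deviation) per element, from the element table.
stats = {
    "strip0": (241.28, 29.31),
    "chkip0": (713.01, 59.77),
    "RtLkUp": (336.56, 266.88),
    "DBC0": (212.30, 21.18),
    "DcTTL0": (317.78, 20.34),
}

btd = {e: estimated_time(m, s) for e, (m, s) in stats.items()}
etd = {e: estimated_time(m, s, k=2.0) for e, (m, s) in stats.items()}

slowest_btd = max(btd, key=btd.get)  # heaviest element by mean alone
slowest_etd = max(etd, key=etd.get)  # heaviest element once slack is included
print(slowest_btd, slowest_etd)

# Replicating the heaviest element twice halves its estimated per-packet cost.
print(effective_time(btd[slowest_btd], copies=2))
```

With k = 2 the route lookup element overtakes the header check as the heaviest estimate, because its standard deviation is large relative to its mean. This is the kind of case where accounting for variation changes the allocation.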
Module Distribution Illustration
[Figure: Element graph of the IPv4 router (8 different sources, Strip(14), CheckIPHeader, StaticIPLookup, DropBroadcast0, DropBroadcast1, DecIPTTL, Discard, different destinations) partitioned into 2 stages and into 4 stages]
Relative Throughput Analysis
[Figure: Processor throughput for DRR application. Relative processor throughput (axis ticks 1 to 8) of BTD, ETD, SR, and ESR for the 2, 4, and 8 configurations]
Resource Utilization Analysis
[Figure: Resource utilization in DRR application. Processor utilization (axis ticks 70 to 100) of BTD, ETD, SR, and ESR for the 2, 4, and 8 configurations]
Contributions
Analyzed modularity in networking applications using statistical methods
Proposed intelligent task allocation based on variation in processing time
The task allocation method is generic and applicable to CMP task distribution
Acknowledgements
Click Development Group
Anonymous reviewers
THANK YOU
yzh702@eecs.northwestern.edu