
IEEE EMBEDDED SYSTEMS LETTERS, VOL. 10, NO. 3, SEPTEMBER 2018 73

Fast Content Updating Algorithm for an SRAM-Based TCAM on FPGA

Farkhanda Syed, Zahid Ullah, and Manish K. Jaiswal

Abstract—Static random-access memory (SRAM)-based ternary content-addressable memory (TCAM) is an alternative to traditional TCAM in which the use of SRAM improves memory access speed, scalability, cost, and storage density compared to conventional TCAM. To use SRAM-based TCAMs confidently in applications, an update module (UM) is essential: the UM replaces old TCAM contents with fresh contents. This letter proposes a fast update mechanism for an SRAM-based TCAM and implements it on a Xilinx Virtex-6 field-programmable gate array. To the best of the authors' knowledge, this is the first proposal of a content update module for an SRAM-based TCAM that consumes the least possible number of clock cycles to update a TCAM word.

Index Terms—Field-programmable gate array (FPGA)-based content-addressable memory (CAM), SRAM, ternary content-addressable memory (TCAM), UE-TCAM, update module (UM).

I. INTRODUCTION

CONTENT-ADDRESSABLE MEMORY (CAM) is hardware used in a variety of search-based applications. A search key is provided as input; the CAM searches the entire memory against the search key concurrently and returns the address of the matched CAM word at the output. CAMs support binary as well as ternary logic values. CAMs that support binary logic are called binary CAMs (BiCAMs), while CAMs that support three-valued logic are called ternary CAMs (TCAMs). In a TCAM, more than one word may match the search key; in such a case, a priority encoder is required to choose the highest-priority address as the final matched address. Because of the parallel searching mechanism, each TCAM cell has its own matching circuitry; thus, TCAM has low bit-storage density. Besides, traditional TCAMs are implementable only on application-specific integrated circuits and have limited configurability. The demand for TCAMs that are denser, reconfigurable, and easy to integrate motivates the idea of implementing a field-programmable gate array (FPGA)-based TCAM, where static random-access memory (SRAM) is

Manuscript received September 4, 2017; accepted October 21, 2017. Date of publication November 6, 2017; date of current version September 7, 2018. This manuscript was recommended for publication by P. Panda. (Corresponding author: Farkhanda Syed.)

F. Syed and Z. Ullah are with the Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar 25000, Pakistan (e-mail: farkhandasyed15@gmail.com; zahidullah@cecos.edu.pk).

M. K. Jaiswal is with the Department of Electrical and Electronic Engineering, University of Hong Kong, Hong Kong (e-mail: manishkj@eee.hku.hk).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LES.2017.2770225

used to emulate the TCAM's function. Today's FPGAs have a high clock rate, a large amount of reconfigurable embedded memories called block RAMs (BRAMs), and low power consumption [1]. The FPGA is evolving rapidly because of its support for networking-based applications; processing cores and specific embedded designs are also available on it, which makes the FPGA faster and denser. Hence, the performance gap between native TCAMs and FPGAs is narrowing with the passage of time [2].

SRAM-based TCAMs [3]–[5] are better than conventional TCAMs where the lookup operation is concerned. However, when it comes to the updating algorithm, no comparison between conventional TCAM and SRAM-based TCAM has been provided until now. In a conventional TCAM, the stored entries are sorted in ascending order, which does not allow the worst-case updating latency to be reduced below O(N), where N is the total number of entries in the TCAM [6]. Search and update operations cannot be performed simultaneously; thus, slow updates retard the lookup performance in applications. For example, in IP networking, owing to slow updates, buffering of incoming packets is required to avoid packet loss during the update process. However, this may cause head-of-line blocking; thus, a large buffer space beyond the main packet buffer memory is required, which is undesirable for many applications [6].

The rest of this letter is arranged as follows. Section II discusses prior update algorithms for conventional TCAM and SRAM-based TCAM available in the literature. Section III presents the motivations and key contributions of the proposed work. Section IV explains hybrid partitioning. Section V explains the overall architecture of the modified update module (UM) integrated with the FPGA-based TCAM. Section VI discusses implementation results and performance evaluation. Section VII contains conclusions and directions for future work.

II. PRIOR WORKS AND DISCUSSION

Wang et al. [6] tried to keep the policy TCAM table consistent during the update process, so that the table remains unlocked for lookup operations while an update is in progress. However, rules in a policy are sorted so that high-priority rules are placed at lower memory locations; hence, the worst-case complexity of the updating algorithm still remains O(N), where N is the total number of rules. This costs design complexity and increases power consumption. Besides, 15% empty slots are required every time the system undergoes an update; hence, memory usage becomes inefficient.

1943-0663 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Shah and Gupta [7] tried to lower the worst-case updating latency by providing two algorithms based on prefix length and its ordering. The authors maintain ordered prefixes instead of ordering all addresses; this way, the average number of memory shifts in the worst case becomes lower than in the previous case. However, an empty pool of space is still needed to handle the update process, and multiple entry movements are still required for a single-entry update in the worst case.

Wang and Tzeng [8] successfully eliminate the need for sorted prefixes by using the leaf technique; however, this may lead to prefix duplication. During an update, all duplicated prefixes also need to be modified, and hence multiple write operations are still needed to update a single prefix.

Mishra et al. [9] proposed a new scheme to speed up the lookup operation. Here, two TCAMs are used, named leaf TCAM (LTCAM) and interior TCAM (ITCAM), for storing filters (IP prefixes). The LTCAM stores the highest-priority filters, all of which are nonoverlapping, while the ITCAM stores the remaining filters along with a priority graph to ensure the correctness of lookup results. A multidimensional trie is required to find the filters that need to be stored in the LTCAM. This scheme increases lookup speed at the cost of increased memory resources. To update a filter, the multidimensional trie and the priority graph also need updating, which increases the algorithm's complexity.

Syed and Ullah [10] provided an updating algorithm for an SRAM-based TCAM. Here, the complexity of the updating algorithm remains at its worst for all cases and is 2^w + 1 clock cycles. The design is in its simplest form, consuming worst-case latency. The problem arises when more than one TCAM word needs to be updated: a counter module is embedded to generate all possible TCAM words, and a matching logic in the UM matches the counter output with the required input words. If a match occurs, the TCAM word is updated; otherwise, it is not.
III. MOTIVATIONS AND CONTRIBUTIONS

A. Motivations

Comparing the lookup performance of conventional TCAM with SRAM-based TCAM, the latter shows better results in terms of power, speed, and scalability. However, to adopt SRAM-based TCAMs in applications, an updating algorithm is the major requirement. The memory access time of SRAMs is considerably smaller than that of conventional TCAMs; therefore, the updating process of an SRAM-based TCAM might be faster. Although [10] provides an updating algorithm for an SRAM-based TCAM, its logic is not sophisticated. What if an algorithm were smart enough to fetch and update only the required SRAM contents? This would not only reduce the algorithm's overall complexity but also reduce power consumption.

B. Contributions

The following are the key contributions of our proposed updating algorithm in comparison with the prior updating algorithm for FPGA-based TCAM.
1) The proposed updating algorithm for SRAM-based TCAM consumes the least possible number of clock cycles to update a ternary word. In previous work [10], 2^w clock cycles are consumed to update a single TCAM word, while the updating latency of the proposed algorithm varies with the number of don't-care bits in the TCAM word.
2) We also provide comprehensive trade-off details between the worst-case updating latency and memory resources: the lower the memory resources utilized, the higher the worst-case updating latency, and vice versa.
3) Since the power consumption of the proposed updating algorithm varies with the number of don't-care bits in the TCAM word, worst-case power consumption is not always incurred.

IV. HYBRID PARTITIONING

Hybrid partitioning is a scheme that dissects the original TCAM table into vertical as well as horizontal partitions of equal size [11]. Each block resulting from hybrid partitioning is known as a hybrid partition (HP). In total, N vertical partitions and L horizontal partitions (layers) are created, where L and N must be nonzero whole numbers; thus, N × L HPs are created.

Consider a D × W sized TCAM, where D is the total number of TCAM words and W is the width of a TCAM word. Hybrid partitioning divides the width of each TCAM word into N subwords, i.e., each subword carries w = W/N bits. Similarly, it divides the TCAM addresses into L layers, i.e., each layer carries K = D/L TCAM addresses. Thus, a resultant HP is of size K × w.
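The partitioning arithmetic above can be sanity-checked with a minimal Python sketch (the function name and dictionary layout are ours; the 512 × 36, N = 4, L = 8 configuration is the one used in Section VI, and the 2^w SRAM-unit depth follows from Section VI-B):

```python
def hybrid_partition(D, W, N, L):
    """Compute hybrid-partitioning parameters for a D x W TCAM
    split into N vertical and L horizontal partitions (layers)."""
    assert W % N == 0 and D % L == 0, "N must divide W and L must divide D"
    w = W // N          # bits per subword
    K = D // L          # TCAM addresses per layer
    return {
        "subword_bits": w,           # w = W/N
        "addresses_per_layer": K,    # K = D/L
        "hp_size": (K, w),           # each HP is K x w
        "num_hps": N * L,            # total hybrid partitions
        "sram_unit_depth": 2 ** w,   # each SRAM unit has 2^w rows
    }

# The configuration evaluated in Section VI: 512 x 36 TCAM, N = 4, L = 8.
print(hybrid_partition(512, 36, 4, 8))
```

With these values each subword is w = 9 bits wide, each layer holds K = 64 addresses, and 32 HPs result.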

V. OVERALL ARCHITECTURE OF PROPOSED UPDATING ALGORITHM

The proposed updating algorithm is a modified version of the updating algorithm discussed in [10]. The layer architecture of our proposed updating algorithm is shown in Fig. 1. The input INDEX is the original TCAM address that needs to be stored in SRAM. The considered SRAM-based TCAM follows hybrid partitioning [11], where each layer stores a subset of the original TCAM addresses in ascending order; thus, based on the value of INDEX, only the required layer is enabled. In Fig. 1, the enable 1 (en1) signal indicates that the SRAMs of layer 1 are updating.

To represent a ternary word, two input words, B_W and M_W, are required. M_W represents the don't-care bits, i.e., if the ith bit of the TCAM word is a don't-care bit, then the ith bit of M_W becomes 1, otherwise 0. B_W can be any binary combination consistent with the TCAM word. For example, with 11x as the input TCAM word, M_W becomes 001, while B_W can be either 111 or 110. The subword of B_W is denoted B_sw, while the subword of M_W is denoted M_sw.

The overall architecture of the proposed updating algorithm is composed of an address generator, a MUX, and a UM. The address generator generates write addresses (Wr_addr) and read addresses (R_addr) for the SRAM units. Due to hybrid partitioning, there is a separate address generator module and UM for each SRAM unit. The MUX forwards the outputs of the enabled SRAM units to the UM, which updates the contents of the corresponding SRAM units.
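The B_W/M_W encoding can be sketched in a few lines of Python (the function name and the list-based bit representation are ours; only the B_W/M_W roles and the 11x example come from the letter — don't-care positions of B_W are filled with 0 here, though either value is valid):

```python
def encode_ternary(word):
    """Encode a ternary string such as '11x' into (B_W, M_W) bit lists.
    M_W[i] is 1 where the TCAM word has a don't-care ('x'), else 0;
    B_W[i] is the binary bit there (don't-care positions hold 0 here).
    Index 0 corresponds to the rightmost character of the string."""
    bits = word[::-1]  # reverse so that bits[0] is bit 0
    B_W = [int(c) if c in "01" else 0 for c in bits]
    M_W = [1 if c == "x" else 0 for c in bits]
    return B_W, M_W

# The letter's example: for TCAM word '11x', M_W is 001 (bit 0 set).
B_W, M_W = encode_ternary("11x")
```

Here `M_W` comes out as `[1, 0, 0]` (bit 0 first), matching M_W = 001 in the text.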


Fig. 1. Layer architecture of modules representing the proposed updating algorithm.

Fig. 2. Tree diagram representing the logic of the address generator module.

A. Address Generator

The address generator module contains a counter that produces a new C_sw every clock cycle. The logic of the address generator is represented by the tree diagram shown in Fig. 2. At each decision point, one bit of R_addr is generated, starting from the least significant bit. The decision is based on M_sw because it marks the don't-care bits of the original TCAM word: if M_sw[n] is 1 at a decision point in the tree diagram, the right branch is followed, otherwise the left branch. Each branch emits the corresponding bit of R_addr. The beauty of this logic is that it generates only the required SRAM addresses; thus, only the required SRAM contents are fetched for updating. C_sw is incremented and a new read address is generated every clock cycle.

For example, suppose the address generator must generate addresses for the subword 11x. Here, B_sw = 111 and M_sw = 001, so M_sw[0] = 1, M_sw[1] = 0, M_sw[2] = 0 and B_sw[0] = 1, B_sw[1] = 1, B_sw[2] = 1. In the tree diagram, based on the bits of M_sw, the right, left, and left branches are followed, assigning C_sw[0], B_sw[1], and B_sw[2] to R_addr[0], R_addr[1], and R_addr[2], respectively. Hence, R_addr becomes 110 in the first clock cycle, when C_sw = 000. In the second clock cycle, C_sw is incremented to 001, the same path is followed, and the next R_addr, 111, is generated.
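The tree logic above can be expressed as a software sketch (the function name and loop structure are ours; the hardware walks the same addresses one per clock cycle rather than in a software loop):

```python
def generate_addresses(B_sw, M_sw):
    """Sketch of the address generator: bit n of R_addr takes the next
    counter bit where M_sw[n] = 1 (don't-care position) and B_sw[n]
    otherwise, so only the 2^X required SRAM addresses are produced,
    where X is the number of don't-care bits in the subword."""
    w = len(M_sw)
    X = sum(M_sw)                      # number of don't-care bits
    addresses = []
    for c in range(2 ** X):            # C_sw counts 0 .. 2^X - 1
        addr, k = 0, 0
        for n in range(w):             # build R_addr from bit 0 upward
            if M_sw[n]:                # don't-care: take next counter bit
                bit = (c >> k) & 1
                k += 1
            else:                      # fixed position: take the B_sw bit
                bit = B_sw[n]
            addr |= bit << n
        addresses.append(addr)
    return addresses

# The worked example: subword 11x (B_sw = 111, M_sw = 001, bit 0 first)
# yields R_addr = 110 then 111, i.e., addresses 6 and 7.
print(generate_addresses([1, 1, 1], [1, 0, 0]))  # → [6, 7]
```

Note that a subword with no don't-care bits produces exactly one address, so only a single SRAM row is fetched.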

B. Update Module

The UM takes the contents of the corresponding SRAM unit from the corresponding MUX. The input INDEX is also provided to it. The value of INDEX decides the bit position of the SRAM's content that needs alteration. Moreover, the UM requires a control signal to decide whether to add a certain TCAM address or delete it, by changing the corresponding bit of the SRAM to 1 or 0, respectively. Algorithm 1 shows the update process running for each SRAM unit.

TABLE I
FPGA IMPLEMENTATION RESULTS FOR TCAM OF SIZE 512×36

Algorithm 1 Update Process Running in Each UM
INPUT: K-bit output d from the corresponding MUX
INPUT: INDEX
INPUT: Operation control signal, CS (add/delete)
OUTPUT: Updated content of SRAM, din
1: for i ← 0 to 2^X − 1 do
2:   if CS = add then
3:     SRAM[i][INDEX] ← 1
4:   else
5:     SRAM[i][INDEX] ← 0
6:   end if
7: end for
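Algorithm 1 amounts to a single bit-set or bit-clear at each generated address. A minimal Python sketch (function name, toy dimensions, and the list-of-lists SRAM model are ours), using the two read addresses 110 and 111 (i.e., 6 and 7) derived in the Section V-A example:

```python
def update_rows(sram, addresses, index, add):
    """Sketch of Algorithm 1: for each read address produced by the
    address generator, set (add) or clear (delete) bit `index` of
    that SRAM row.  `sram` is a list of 2^w rows of K bits each."""
    for addr in addresses:
        sram[addr][index] = 1 if add else 0

# Toy SRAM unit of depth 2^3 = 8 and width K = 4: adding the ternary
# subword 11x at TCAM address INDEX = 2 touches only rows 6 and 7.
sram = [[0] * 4 for _ in range(8)]
update_rows(sram, [6, 7], 2, add=True)
```

A subsequent delete of the same subword would call `update_rows(sram, [6, 7], 2, add=False)`, clearing the same two bits.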

VI. FPGA IMPLEMENTATION RESULTS AND PERFORMANCE EVALUATION

We implemented the proposed updating algorithm for a TCAM of size 512 × 36 on a Xilinx Virtex-6 FPGA. For hybrid partitioning, we assumed N = 4 and L = 8. The implementation results for different parameters are shown in Table I.

A. Updating Latency

The parameter used for the evaluation of the updating algorithm is the updating latency, which is the number of iterations required to update a TCAM word. In the prior updating algorithm [10], the worst-case latency of a single-entry update remains intact no matter what the input is, because the counter module blindly generates read and write addresses to the SRAM unit. In the proposed updating algorithm, we replace the counter module with a sophisticated address generator, which generates only the required addresses to the SRAM unit. The updating latency of our proposed algorithm therefore changes from word to word, depending on the number of don't-care bits in the TCAM word. However, the architecture follows hybrid partitioning, which dissects the TCAM word into subwords. All subwords are treated in parallel; therefore, the decision is based on the number of don't-care bits per subword. Each subword can have a different number of don't-care bits; thus, the updating latency of each SRAM unit can be different. To calculate the updating latency of the SRAM-based TCAM, the subword with the maximum number of don't-care bits is selected. Let us call the selected subword SSW and the number of ternary bits per SSW X. The normal-case updating latency is generalized by

Latency = 2^X (iterations).    (1)
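Equation (1) can be evaluated directly from a ternary word string (a minimal sketch, assuming the string representation of a ternary word; the example word below is hypothetical, not from the letter):

```python
def updating_latency(word, N):
    """Latency = 2^X iterations, where X is the largest number of
    don't-care ('x') bits found in any of the N subwords of `word`."""
    W = len(word)
    assert W % N == 0, "N must divide the word width"
    w = W // N
    subwords = [word[i * w:(i + 1) * w] for i in range(N)]
    X = max(sw.count("x") for sw in subwords)  # the SSW's ternary-bit count
    return 2 ** X

# Hypothetical 8-bit word split into N = 2 subwords of w = 4 bits:
# '1x0x' has two don't-cares and '1100' none, so latency is 2^2 = 4.
print(updating_latency("1x0x1100", 2))  # → 4
```

When every bit of a subword is a don't-care, X = w and the function returns 2^w, matching the worst case of (2) below.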


Fig. 3. Trade-off between memory resources and worst-case updating latency.

Similarly, the worst case happens when all bits of the SSW are don't-care bits, or all bits of the TCAM word are. In both scenarios, X = w, where w is the total number of bits in a subword. Hence, the worst-case updating latency equals the depth of the SRAM unit and is generalized by

Latency = 2^w (iterations).    (2)

The complexity of our proposed updating algorithm also varies with the number of don't-care bits per TCAM word. In the normal case, the complexity is 2^X + 1; in the worst case, X = w and the complexity rises to 2^w + 1.

B. Trade-Off Between Worst-Case Updating Latency and Memory Resources

The worst-case updating latency depends on the total number of bits per subword, i.e., it is 2^w. We also know that after hybrid partitioning of a D × W sized TCAM, w becomes W/N. Thus, increasing N reduces w and thereby the worst-case updating latency 2^w; the updating latency of our proposed algorithm is therefore inversely related to the value of N.

Similarly, N and L are the important parameters of the hybrid partitioning scheme and determine the minimum or maximum number of BRAMs inferred on the FPGA. The size of each dissected SRAM unit is 2^w × K, which results from the specific N and L values; here K, the width of the SRAM unit, equals D/L. The memory resources available on the FPGA are BRAMs of either 18 Kb or 36 Kb, and a BRAM can only be inferred in fixed aspect ratios. Let A × B denote an aspect ratio available on the FPGA, and let m and n be natural numbers representing multiples of A and B, respectively. To find the total number of BRAMs required per SRAM unit, find the aspect ratio that best fits the SRAM such that 2^w ≤ mA and K ≤ nB. Let Z denote the total number of BRAMs required for each SRAM unit, generalized by

Z = m × n.    (3)

Since L × N is the total number of SRAM units, the total number of BRAMs consumed by the SRAM-based TCAM for specific values of L and N is generalized by

T = Z × (L × N).    (4)
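Equations (3) and (4) can be sketched as a small search over candidate aspect ratios. This is a sketch under stated assumptions: the 36 Kb ratio list below is our own illustration of typical depth × width modes and should be checked against the FPGA vendor's documentation, which the letter does not enumerate.

```python
import math

def brams_per_unit(depth, width, ratios):
    """Z = m * n, Eq. (3): pick the aspect ratio A x B from `ratios`
    minimizing the BRAM count, with depth <= m*A and width <= n*B."""
    best = None
    for A, B in ratios:
        m = math.ceil(depth / A)   # multiples of A to cover the depth
        n = math.ceil(width / B)   # multiples of B to cover the width
        best = m * n if best is None else min(best, m * n)
    return best

# Hypothetical 36 Kb BRAM aspect ratios (depth x width) for illustration.
RATIOS_36K = [(32768, 1), (16384, 2), (8192, 4), (4096, 9), (2048, 18), (1024, 36)]

# 512 x 36 TCAM with N = 4, L = 8: each SRAM unit is 2^9 x 64.
Z = brams_per_unit(2 ** 9, 64, RATIOS_36K)
T = Z * (8 * 4)   # T = Z x (L x N), Eq. (4)
```

With this assumed ratio list, the 1024 × 36 mode gives m = 1 and n = 2, so Z = 2 and T = 64; the actual Table I figures depend on the real Virtex-6 modes.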

Fig. 3 shows the trade-off between memory resources and worst-case updating latency for a 512 × 36 sized TCAM. We keep the value of L constant because we only need to show the effect of varying N on the total number of consumed BRAMs. We have calculated the worst-case updating latency and the number of inferred BRAMs for the possible values of N. Here, the best value of N in terms of BRAM utilization is 4, while the best value of N in terms of worst-case updating latency is 12.

VII. CONCLUSION

In this letter, we successfully implemented a fast updating algorithm for an FPGA-based TCAM on a Xilinx Virtex-6 FPGA. The updating latency of our proposed algorithm varies from word to word, i.e., the smaller the value of X per SSW, the smaller the updating latency. Similarly, since high latency consumes more power than low latency, the power consumption of our proposed updating algorithm also varies accordingly. Moreover, the worst-case updating latency of the proposed algorithm is controllable because it relates to the value of N. Besides, no extra memory resources are required to keep the table sorted during the update process, as a traditional TCAM requires. Our future work targets further improvement of the proposed updating algorithm for the SRAM-based TCAM.

REFERENCES

[1] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proc. Archit. Netw. Commun. Syst., San Jose, CA, USA, Oct. 2013, pp. 71–82.
[2] C. A. Zerbini and J. M. Finochietto, "Performance evaluation of packet classification on FPGA-based TCAM emulation architectures," in Proc. IEEE Glob. Commun. Conf. (GLOBECOM), Anaheim, CA, USA, Dec. 2012, pp. 2766–2771.
[3] Z. Ullah, M. K. Jaiswal, and R. C. Cheung, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits Syst. Signal Process., vol. 33, no. 10, pp. 3123–3144, 2014.
[4] Z. Ullah, M. K. Jaiswal, R. C. C. Cheung, and H. K. H. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in Proc. TENCON IEEE Region 10 Conf., Nov. 2015, pp. 1–6.
[5] G. P. Mullai and C. S. Joice, Implementation of Z-Ternary Content-Addressable Memory Using FPGA. New Delhi, India: Springer, 2016, pp. 855–863. [Online]. Available: https://link.springer.com/chapter/10.1007/978-81-322-2656-7_77
[6] Z. Wang, H. Che, M. Kumar, and S. K. Das, "CoPTUA: Consistent policy table update algorithm for TCAM without locking," IEEE Trans. Comput., vol. 53, no. 12, pp. 1602–1614, Dec. 2004.
[7] D. Shah and P. Gupta, "Fast updating algorithms for TCAM," IEEE Micro, vol. 21, no. 1, pp. 36–47, Jan./Feb. 2001.
[8] G. Wang and N.-F. Tzeng, "TCAM-based forwarding engine with minimum independent prefix set (MIPS) for fast updating," in Proc. IEEE Int. Conf. Commun., vol. 1, Istanbul, Turkey, Jun. 2006, pp. 103–109.
[9] T. Mishra, S. Sahni, and G. Seetharaman, "PC-DUOS: Fast TCAM lookup and update for packet classifiers," in Proc. IEEE Symp. Comput. Commun. (ISCC), Jun. 2011, pp. 265–270.
[10] F. Syed and Z. Ullah, "Updating algorithm for SRAM-based TCAM and its implementation on FPGA," Int. J. Comput. Sci. Inf. Security, vol. 15, no. 1, pp. 116–120, 2017.
[11] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.