A 2.0 Gb/s Throughput Decoder for QC-LDPC Convolutional Codes

This paper proposes a decoder architecture for low-density parity-check convolutional codes (LDPCCCs). Specifically, the LDPCCC is derived from a quasi-cyclic (QC) LDPC block code. By making use of the quasi-cyclic structure, the proposed LDPCCC decoder adopts a dynamic message storage in the memory and uses a simple address controller. The decoder efficiently combines the memories in the pipelining processors into a large memory block so as to take advantage of the data-width of the embedded memory in a modern field-programmable gate array (FPGA). A rate-5/6 QC-LDPCCC has been implemented on an Altera Stratix FPGA. It achieves up to 2.0 Gb/s throughput with a clock frequency of 100 MHz. Moreover, the decoder displays an excellent error performance, with a bit error rate (BER) lower than $10^{-13}$ at a bit-energy-to-noise-power-spectral-density ratio ($E_b/N_0$) of 3.55 dB.


I. INTRODUCTION
Low-density parity-check (LDPC) codes, first invented by Gallager in the 1960s [1], have been found to be capable of approaching the channel capacity. Later, LDPC convolutional codes (LDPCCCs) were shown to outperform LDPC block codes in terms of error performance (e.g., lower error floors and higher coding gains) under a similar decoding complexity [2]. Comparisons between LDPCCCs and LDPC block codes from the perspectives of hardware complexity, delay requirements and memory requirements have been discussed in [3] and [4].
An LDPCCC inherits the basic structure of a convolutional code and enables continuous encoding and decoding of messages of varying lengths. Such a property has made LDPCCCs a promising solution in many applications. When designing an LDPCCC for an application, many factors such as code rate, sub-block length, coding gain, throughput, error performance and the encoder/decoder complexity may have to be taken into consideration. High-data-rate optical communications require powerful error-correction codes with low redundancy to achieve an error floor lower than a bit error rate (BER) of $10^{-13}$, preferably $10^{-15}$ [5], [6]. Motivated by such applications, the goal of this work is to design and implement an efficient decoder architecture such that codes can achieve high throughput, high coding gain, high code rate and low error floor.
The design of high-throughput decoder architectures for LDPC block codes has been extensively studied. In [7], a high-throughput memory-efficient decoder architecture that jointly optimizes the code design, the decoding algorithm and the architecture level has been proposed. A practical coding system design approach has been presented in [8], whereby the LDPC codes are constructed subject to decoder hardware constraints. Simulation results have shown that the codes so constructed suffer only a minor performance loss compared with unconstrained ones. In [9], a quasi-cyclic LDPC (QC-LDPC) decoder architecture that achieves a throughput of 172 Mbps has been studied. The high throughput is achieved by reducing the critical path through modifying the decoding algorithm as well as the check-node and variable-node processor architectures. In [10], the throughput of a QC-LDPC decoder is further improved by parallelizing the processing of all layers in layered decoding. As a result, the decoder can achieve a maximum throughput of 2.2 Gbps with an operating frequency of 950 MHz and 10 min-sum decoding iterations. In [11], the authors have proposed a high-speed flexible shift-LDPC decoder that can adapt to different code lengths and code rates. The decoder employs the Benes network to handle the complicated interconnections for various code parameters. It adopts single-minimum min-sum decoding and achieves a throughput of 3.6 Gbps with an operating frequency of 290 MHz.
Although LDPCCC decoders may "borrow" some design techniques used in LDPC block decoder architectures, overall they are very different from their block-code counterparts due to the distinct code construction mechanism and unique characteristics of LDPCCCs. High-throughput LDPCCC decoder architectures based on parallelization have been studied in [12], [13]. Such architectures can achieve a throughput of over 1 Gbps with a clock frequency of 250 MHz. They are, however, confined to time-invariant LDPCCCs and cannot be easily applied to time-varying ones, which usually produce a better error performance. In [14], a register-based decoder architecture attaining up to 175 Mbps throughput has been proposed. This architecture has successfully implemented a pipeline decoder with 10 processing units. Nonetheless, its register-intensive architecture has limited its power efficiency. In [15], [16], a low-cost low-power memory-based decoder architecture that uses a single decoding processor has been proposed. On the one hand, the serial node operation uses a small portion of the field-programmable gate array (FPGA) resources. On the other hand, such a design poses a significant limitation on the achievable throughput. Subsequently, memory-based designs with parallel node operations have been proposed and have led to a substantial improvement in throughput [17]-[19]. The high throughput accomplished under these designs, however, is achieved at the cost of a complicated switch network.
To the best of the authors' knowledge, the previously proposed LDPCCC decoder architectures mainly handle random time-varying LDPCCCs. In this paper, we propose a decoder architecture for LDPCCCs with regular structures. In particular, the proposed decoder caters for a class of LDPCCCs that have a quasi-cyclic structure and can be derived from a QC-LDPC block code [20]. The motivation for considering codes with regular structures is twofold. First, LDPCCCs with regular structures have recently attracted much interest both theoretically and empirically [21], [22]. Second, following the insights from LDPC block codes, regular codes can make the decoder structure much simpler and at the same time achieve good error performance. Therefore, developing an efficient architecture for regular codes is of high importance in practice.
The contributions of our paper are distinct from previous works in many aspects, including complexity, throughput, reliability and scalability. First, we eliminate all switch networks, which are included in most of the previous implementations and are very complex for a high-rate LDPCCC. Instead, we propose the use of dedicated block processing units, with which we can provide higher throughput with similar decoder complexity. Second, the quantized sum-product algorithm (QSPA) applied in our LDPCCC decoder is more reliable than the min-sum-based LDPCCC decoder, i.e., QSPA outperforms the min-sum-based decoder in terms of error performance. Furthermore, our proposed QSPA implementation has a complexity only linearly proportional to the check-node degree. Third, it is known that more decoding iterations can enhance the error performance of the decoder. In our decoder design, each decoding iteration is accomplished by one processor and the processors are serially connected. Our decoder architecture also enables us to change the number of processors easily without re-designing the whole decoder. Thus, our decoder is scalable in terms of the number of processors. We have implemented our decoder architecture for a rate-5/6 LDPCCC in an Altera Stratix FPGA. The decoder has produced a throughput of 2.0 Gbps with a clock running at 100 MHz. Moreover, the LDPCCC has an excellent error performance, achieving an error rate lower than $10^{-13}$ at a bit-energy-to-noise-power-spectral-density ratio ($E_b/N_0$) of 3.55 dB.
The rest of the paper is organized as follows. Section II reviews the construction of QC-LDPCCCs and the decoding process for such codes. Section III describes the proposed decoder architecture and pipeline schedule. Section IV presents the implementation complexity of the decoder architecture. The FPGA simulation results are also presented in this section. Finally, Section V concludes the paper.

A. Structures of LDPCCC and QC-LDPCCC
The parity-check matrix of an unterminated time-varying periodic LDPCCC is given in (1), where $m_s$ is termed the memory of the parity-check matrix. Given a quasi-cyclic LDPC (QC-LDPC) block code with a base matrix of size $n_c \times n_v$ and an expansion factor of $z$ [23], we can construct a QC-LDPCCC as follows.
1) Expand the parity-check matrix of the QC-LDPC block code into a $zn_c \times zn_v$ matrix $H_b$.
2) Represent the $zn_c \times zn_v$ parity-check matrix $H_b$ as an $M \times M$ array of sub-matrices as in (2), where $M$ is the greatest common divisor of $n_c$ and $n_v$, i.e., $M = \gcd(n_c, n_v)$.
3) Split $H_b$ into $H_b^l$ and $H_b^u$, which correspond to the lower triangular part and the strictly upper triangular part of $H_b$, respectively.
4) Unwrap the parity-check matrix of the block code to obtain the parity-check matrix of a QC-LDPCCC in the form of (1), as given in (3).
The above construction process is illustrated in Fig. 1. By comparing (1) and (3), it can be observed that the period of the QC-LDPCCC is $T = M$ and the memory $m_s$ satisfies $M = m_s + 1$. It can also be observed that the relative positions between the variable nodes and the check nodes do not change. Hence the girth of the QC-LDPCCC is no less than that of the original QC-LDPC block code [24]. Therefore, we can construct a large-girth QC-LDPCCC by first designing the sub-matrices to obtain a large-girth QC-LDPC block code and then performing the unwrapping operation.
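The construction steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the shift values in `shifts` are hypothetical, every base-matrix entry is assumed to be a cyclically right-shifted identity, and the function names `expand_qc` and `unwrap` are ours. The sketch uses the fact that placing the lower triangular part on the diagonal and the strictly upper triangular part one period earlier is equivalent to copying sub-block $(t \bmod M, s \bmod M)$ of $H_b$ into block $(t, s)$ whenever $0 \le t - s \le M - 1$.

```python
import numpy as np
from math import gcd


def expand_qc(shifts, z):
    """Expand an n_c x n_v array of cyclic shifts into the binary matrix H_b.

    Assumption: every entry is a z x z identity cyclically right-shifted by
    shifts[i, j] (no all-zero blocks).
    """
    n_c, n_v = shifts.shape
    eye = np.eye(z, dtype=np.uint8)
    H_b = np.zeros((z * n_c, z * n_v), dtype=np.uint8)
    for i in range(n_c):
        for j in range(n_v):
            H_b[i * z:(i + 1) * z, j * z:(j + 1) * z] = np.roll(eye, int(shifts[i, j]), axis=1)
    return H_b


def unwrap(H_b, M, num_block_rows):
    """Return the first num_block_rows block rows of the unwrapped matrix.

    H_b is viewed as an M x M array of sub-blocks. Block (t, s) of the result
    is sub-block (t mod M, s mod M) of H_b when 0 <= t - s <= M - 1, else zero.
    """
    r, c = H_b.shape[0] // M, H_b.shape[1] // M  # sub-block size
    H_cc = np.zeros((num_block_rows * r, num_block_rows * c), dtype=np.uint8)
    for t in range(num_block_rows):
        for s in range(max(0, t - M + 1), t + 1):
            i, j = t % M, s % M
            H_cc[t * r:(t + 1) * r, s * c:(s + 1) * c] = H_b[i * r:(i + 1) * r, j * c:(j + 1) * c]
    return H_cc


# Toy parameters (hypothetical shift values): n_c = 2, n_v = 4, z = 4.
shifts = np.array([[0, 1, 2, 3], [1, 3, 0, 2]])
z = 4
n_c, n_v = shifts.shape
M = gcd(n_c, n_v)
H_b = expand_qc(shifts, z)
H_cc = unwrap(H_b, M, num_block_rows=6)
```

Away from the truncated start and end of the window, every row of `H_cc` has weight $n_v$ and every column has weight $n_c$, so the node degrees of the block code are preserved by the unwrapping.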

B. Decoding Algorithm for LDPCCC
An LDPCCC has an inherent pipeline decoding process [2]. The pipeline decoder consists of $I$ processors, separated by $c(m_s + 1)$ code symbols, with $I$ being the maximum number of decoding iterations. Throughout the decoding process, we assume that messages in log-likelihood-ratio (LLR) form are being used.
At the start of each decoding step (say at time $t_0$), the incoming channel messages associated with the $c$ new variable nodes $v_{t_0} = [v_{t_0,1}, v_{t_0,2}, \ldots, v_{t_0,c}]$ enter the first processor. Moreover, the corresponding variable-to-check messages for these variable nodes have the same values as the incoming channel messages. At the same time, the messages associated with the variable nodes $v_{t_0-i(m_s+1)}$ are shifted from the $i$-th processor to the $(i+1)$-th processor, where $i = 1, 2, \ldots, I-1$. Then the processors perform check-node updating using

$$\alpha_{mn} = 2\tanh^{-1}\Big(\prod_{n' \in N(m)\setminus n} \tanh\big(\beta_{mn'}/2\big)\Big) \tag{4}$$

where $\alpha_{mn}$ is the check-to-variable message from check node $m$ to variable node $n$; $\beta_{mn}$ is the variable-to-check message from variable node $n$ to check node $m$; $N(m)$ is the set of variable nodes connected to check node $m$; and $N(m)\setminus n$ is the set $N(m)$ excluding variable node $n$. Next, the processors perform variable-node updating for $v_{t_0-(i-1)(m_s+1)-m_s}$, $i = 1, 2, \ldots, I$, using

$$\beta_{mn} = \lambda_n + \sum_{m' \in M(n)\setminus m} \alpha_{m'n} \tag{5}$$

where $\lambda_n$ is the channel message for variable node $n$; $M(n)$ is the set of check nodes connected to variable node $n$; and $M(n)\setminus m$ is the set $M(n)$ excluding check node $m$. Finally, the a posteriori probabilities (APPs) for the $c$ variable nodes $v_{t_0-(I-1)(m_s+1)-m_s}$ leaving the last processor are computed using

$$\beta_n = \lambda_n + \sum_{m \in M(n)} \alpha_{mn} \tag{6}$$

based on which the binary value of each individual variable node is determined.
Thus, each decoding step consists of inputting new channel messages to the decoder, shifting messages, updating check-to-variable messages, updating variable-to-check messages, computing APPs and decoding the output bits. As a result, after an initial delay of $(m_s + 1)I$ decoding steps, there is a continuous output of the decoded bits.
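As a minimal floating-point sketch of the node updates described above, the following Python functions implement the standard sum-product rules in LLR form for a single check node and a single variable node. The function names, the clipping constant and the hard-decision convention (non-negative APP maps to bit 0) are assumptions made for illustration; the hardware decoder described later uses 4-bit quantized messages rather than floating point.

```python
import numpy as np


def check_node_update(beta):
    """Check-to-variable messages for one check node.

    beta: variable-to-check LLRs on the edges of the check node.
    Each output excludes the message that arrived on its own edge.
    """
    t = np.tanh(np.asarray(beta, dtype=float) / 2.0)
    alpha = np.empty_like(t)
    for k in range(len(t)):
        prod = np.prod(np.delete(t, k))          # product over all other edges
        prod = np.clip(prod, -1 + 1e-12, 1 - 1e-12)  # keep arctanh finite
        alpha[k] = 2.0 * np.arctanh(prod)
    return alpha


def variable_node_update(lam, alpha):
    """Variable-to-check messages and APP for one variable node.

    lam: channel LLR; alpha: check-to-variable LLRs on the node's edges.
    """
    alpha = np.asarray(alpha, dtype=float)
    app = lam + alpha.sum()      # a posteriori LLR
    beta = app - alpha           # exclude each edge's own incoming message
    return beta, app


def hard_decision(app):
    """Assumed convention: a non-negative LLR is decoded as bit 0."""
    return 0 if app >= 0 else 1


alpha = check_node_update([1.0, 2.0, -3.0])
beta, app = variable_node_update(0.5, [1.0, -2.0, 0.25])
```

In the pipeline decoder, each of the $I$ processors applies these two updates once per decoding step to the nodes it currently holds, and the last processor additionally makes the hard decision.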

III. DECODER ARCHITECTURE
In the hardware design of an LDPCCC decoder, the processor complexity, memory requirement, throughput and error performance are closely related. It is worthwhile to study their tradeoffs so as to design a decoder meeting the application requirements. Following the notation presented in the construction of a QC-LDPCCC, and denoting the code rate by $R$, we can roughly characterize the factors affecting the decoder as follows. Suppose the decoding process is divided into $G$ stages. A smaller $G$ provides a higher level of parallelism that the decoder can achieve. The error performance of an LDPCCC improves as $z$ increases and/or $I$ increases and/or $R$ decreases. Furthermore, the information throughput is proportional to $zR/G$ while the memory usage is proportional to $zIn_v^2(1 - R)$. Also, the processor complexity in terms of combinational logic is proportional to $zIn_v^2(1 - R)/G$. More details about the complexity of memory usage are shown in Section III-B.
It can be seen that the error performance of an LDPCCC can generally be improved at the cost of a higher processor complexity, more memory usage or a lower throughput. For instance, with the sub-matrix size $z \times z$ fixed, as the code rate $R$ decreases, the error performance becomes better at the cost of a lower information throughput. Furthermore, both the processor complexity and the memory requirement become higher due to an increase in the number of check nodes. With the code rate and the throughput fixed, as the sub-matrix size increases, the error performance improves with the same processor complexity but more memory usage. The experimental results presented in Section IV provide a rough guideline on how to choose the parameters in order to achieve a targeted error performance, processor complexity and memory usage.
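The proportionalities above can be checked with a small Python sketch. The function below returns relative figures only (the proportionality constants are unknown and set to one), it assumes $R = 1 - n_c/n_v$, and the values of $G$ used in the example are hypothetical. It illustrates the last tradeoff mentioned: doubling $z$ while also doubling $G$ keeps the throughput and the logic cost unchanged but doubles the memory.

```python
def relative_cost(z, I, G, n_c, n_v):
    """Relative throughput, memory and logic figures (unit constants assumed)."""
    R = 1 - n_c / n_v                       # assumed code rate of the base matrix
    throughput = z * R / G                  # proportional to z R / G
    memory = z * I * n_v ** 2 * (1 - R)     # proportional to z I n_v^2 (1 - R)
    logic = memory / G                      # proportional to z I n_v^2 (1 - R) / G
    return throughput, memory, logic


# A 4 x 24 base matrix gives R = 5/6; G = 8 and G = 16 are hypothetical choices.
small = relative_cost(z=512, I=18, G=8, n_c=4, n_v=24)
large = relative_cost(z=1024, I=18, G=16, n_c=4, n_v=24)
memory_ratio = large[1] / small[1]
```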
In most of the previous works, a generic processing unit such as that shown in Fig. 2(a) is applied in the LDPCCC decoder. For this type of design, a switch network and some corresponding control logic are required. The complexity overhead of the switch network is not a concern in the previous works, mainly because the number of edges between the check nodes and the variable nodes is small. When the number of edges between the check nodes and the variable nodes is large, e.g., for a high-throughput and high-code-rate LDPCCC, the routing and hardware complexity of the switch network becomes a critical issue.
In our proposed decoder, we use dedicated Block Processing Units (BPUs) instead of generic processing units. Consequently, routing and switching of the messages are no longer required, i.e., the complex switch network is eliminated. As shown in Fig. 2(b), we use $M$ BPUs in one processor. One BPU is used during each decoding step of one codeword, and $M$ BPUs are used to facilitate the pipelining of $M$ distinct codewords simultaneously. In general, our approach can obtain an $M$-fold speed-up in throughput by pipelining $M$ distinct codewords. Details will be described in Section III-C.

A. Architecture Design
A high-throughput decoder requires parallel processing of the LDPCCC. We propose a partially parallel decoder architecture that utilizes parallelization on both the node level and the iteration level. The number of rows and the number of columns of the sub-matrices $H^b_{i,j}$ in (2) (corresponding to $H_i(t)$ in (1)) are $c - b = zn_c/M$ and $c = zn_v/M$, respectively. Our proposed decoder architecture is illustrated in Fig. 3. The decoder consists of $I$ processors, where $I$ is the maximum number of decoding iterations. Since the memory of a QC-LDPCCC constructed using the method in Section II is $m_s = M - 1$, the variable nodes and the check nodes in each processor are separated by a maximum of $M - 1$ time instants. Denote the $c - b$ check nodes and the $c$ variable nodes that enter a particular processor by $u_{t_0}$ and $v_{t_0}$, respectively. Then the check nodes and the variable nodes that are about to leave the processor are given by $u_{t_0-M+1}$ and $v_{t_0-M+1}$, respectively. At each decoding step, a BPU is responsible for processing the check nodes that enter the processor (i.e., $u_{t_0}$) and the variable nodes that are about to leave the processor (i.e., $v_{t_0-M+1}$).
At the start of each decoding step, $c - b$ check nodes are to be processed. We divide them into $G$ groups and consequently we divide a complete decoding step into $G$ stages. At the $i$-th stage, the $(c - b)/G$ check nodes of the $i$-th group are processed in parallel. The variable-to-check messages expressed in the sign-and-magnitude format are input to a group of $(c - b)/G$ check-node processors (CNPs). Among the resulting check-to-variable messages, those between the check nodes in $u_{t_0}$ and the variable nodes not in the set $v_{t_0-M+1}$ will be written to the local RAMs, waiting to be further processed by other BPUs. On the other hand, the updated check-to-variable messages between the check nodes in $u_{t_0}$ and the variable nodes in $v_{t_0-M+1}$ are converted to the 2's complement format before being processed by the variable-node processor (VNP). Since each check node is connected to a total of $c/z$ variable nodes in $v_{t_0-M+1}$, $c(c-b)/(Gz)$ variable nodes in $v_{t_0-M+1}$ are connected to the newly updated check nodes and hence $c(c-b)/(Gz)$ VNPs are needed in one BPU. Finally, the updated variable-to-check messages are converted back to the sign-and-magnitude format and they will be shifted to the next processor together with their associated channel messages in the next decoding step.
In the BPUs, the CNPs update the check nodes according to (4). However, in practical implementations we need to quantize the messages to reduce the complexity. In our implementation, we adopt a four-bit quantization, where the quantization step is derived based on density evolution [25] and differential evolution [26]. Empirical results show that its error performance is only 0.1 dB worse than the floating-point sum-product algorithm (SPA).
We consider a check node with degree $d$. For a full quantized-SPA (QSPA) implementation, there should be $d$ inputs, each 4 bits long. Consequently, the size of the look-up table (LUT) becomes $2^{4d}$, which equals $2^{96}$ (as we use $d_c = 24$) in our design. It is clearly impractical to implement such an enormous LUT. Here, we propose to implement the CNP with quantization (QSPA) by first pairing up the input messages and then calculating the extrinsic messages excluding the input itself. More specifically, suppose the variable nodes connected to check node $m$ are listed as $[n_1, n_2, \ldots, n_d]$ and the corresponding input messages are denoted by $[s_1, s_2, \ldots, s_d]$. The updated check-to-variable message to variable node $n_i$ is then calculated according to (7), which can be implemented based on a simple LUT tree, as shown in Fig. 4. In fact, it can be easily verified that each LUT is of size $2^8 = 256$ and the total number of units required is always $2d = 48$. Thus, our proposed tree-structured implementation ensures that the CNP complexity remains low, namely in $O(d_c)$. Moreover, the VNP is basically an adding operation which can be implemented using an adder tree.
Fig. 4. Implementation of a CNP using a tree of look-up tables.
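To illustrate why combining the inputs two at a time keeps the check-node cost linear in the degree, the following Python sketch computes all $d$ extrinsic outputs using only two-input operations. It is a sketch under assumptions and not the LUT tree of Fig. 4: the forward/backward arrangement, the function names and the floating-point `boxplus` (standing in for one 256-entry, two-input LUT) are ours, and the resulting operation count of $3(d-2)$ is a property of this particular arrangement only.

```python
import math


def boxplus(a, b):
    """Two-input check-node combination in LLR form.

    Floating-point stand-in (assumption) for one two-input look-up table.
    """
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def extrinsic_outputs(s, op=boxplus):
    """Return the d extrinsic outputs and the number of two-input operations used."""
    d = len(s)
    assert d >= 3
    count = 0
    fwd = [None] * d            # fwd[k] combines s[0..k]
    bwd = [None] * d            # bwd[k] combines s[k..d-1]
    fwd[0] = s[0]
    for k in range(1, d - 1):
        fwd[k] = op(fwd[k - 1], s[k])
        count += 1
    bwd[d - 1] = s[d - 1]
    for k in range(d - 2, 0, -1):
        bwd[k] = op(s[k], bwd[k + 1])
        count += 1
    out = [bwd[1]]
    for k in range(1, d - 1):
        out.append(op(fwd[k - 1], bwd[k + 1]))   # everything except s[k]
        count += 1
    out.append(fwd[d - 2])
    return out, count


full_lut_entries = 2 ** (4 * 24)   # one LUT addressed by 24 four-bit inputs
pair_lut_entries = 2 ** (4 * 2)    # one LUT addressed by two four-bit inputs
s = [0.8, -1.2, 2.0, 0.5, -0.3]
out, count = extrinsic_outputs(s)
```

The point of the comparison is the growth rate: a single table addressed by all inputs grows exponentially with $d$, whereas any arrangement built from two-input units grows only linearly.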

B. Memory storage
For clarity of presentation, we first assume $M = n_c$. Hence we have $c - b = z$ and $c = zn_v/n_c$. As mentioned earlier, we divide the decoding step into $G$ stages with $z/G$ check nodes being processed in parallel. We consider the $t_0$-th block row of $H^{cc}_{[0,\infty]}$ shown in Fig. 1. This block row consists of $1 \times (n_v/n_c)$ sub-matrices, each having a size of $z \times z$. Thus, this block row corresponds to $z$ check nodes and $zn_v/n_c$ variable nodes in the Tanner graph. We also assume that the $1 \times (n_v/n_c)$ sub-matrices are either the identity matrix or cyclic-right-shifted identity matrices. Suppose $u_{t_0}$ and $v_{t_0}$ have just entered a particular processor and $u_{t_0-M+1}$ and $v_{t_0-M+1}$ are about to be shifted out of the same processor. The memory requirement is explained as follows.

1) Storage of check-to-variable and variable-to-check messages:
We denote the check nodes by $u_{t_0} = [u_{t_0,1}, u_{t_0,2}, \ldots, u_{t_0,z}]$. We further divide them into $G$ groups, with the $i$-th group being denoted by $[u_{t_0,1+(i-1)z/G}, u_{t_0,2+(i-1)z/G}, \ldots, u_{t_0,z/G+(i-1)z/G}]$ ($i = 1, 2, \ldots, G$). As explained previously, in processing $u_{t_0}$, the check nodes $[u_{t_0,1+(i-1)z/G}, u_{t_0,2+(i-1)z/G}, \ldots, u_{t_0,z/G+(i-1)z/G}]$ are processed in parallel at the $i$-th stage of a decoding step. Therefore, in order to avoid collisions of memory access, $z/G$ different RAMs would be needed for storing the $z/G$ messages on the edges if each of the $z/G$ check nodes were connected to only one variable node. From the construction of the QC-LDPCCC, moreover, each check node has a regular degree of $n_v$, i.e., each check node is connected to $n_v$ variable nodes.
Consequently, a total of $zn_v/G$ RAMs are needed for storing the edge-messages passing between the check nodes in $u_{t_0}$ and their connected variable nodes to avoid collisions of memory access. Further, each processor has $M$ sets of such check nodes, i.e., $u_{t_0}, u_{t_0-1}, \ldots, u_{t_0-M+1}$. As a result, $zn_vM/G$ RAMs are allocated in one processor to store the edge-messages, i.e., check-to-variable or variable-to-check messages. In addition, the data-depth and the data-width of the RAMs are equal to $G$ and the number of quantization bits, respectively.
2) Storage of channel messages: For the channel messages, the memory storage mechanism is similar. The set of $z$ variable nodes corresponding to every $z \times z$ sub-matrix is first divided into $G$ groups. Then $z/G$ RAMs, each having $G$ entries, are allocated to store the channel messages. Moreover, the variable nodes in $v_{t_0}$ correspond to $n_v/n_c$ sub-matrices and each processor contains $M$ variable-node sets denoted by $v_{t_0}, v_{t_0-1}, \ldots, v_{t_0-M+1}$. Consequently, a total of $zn_vM/(n_cG) = zn_v/G$ RAMs are allocated to store the channel messages in one processor. The data-depth and the data-width of the RAMs are equal to $G$ and the number of quantization bits, respectively.
For the general case where $M$ is not necessarily equal to $n_c$, $zn_cn_v/G$ RAMs are needed to store the edge-messages and $zn_vM/(n_cG)$ RAMs are required to store the channel messages in one processor. In modern FPGAs, the total number of internal memory bits is usually sufficient for storing the messages of codes with a reasonable length and with a reasonable number of decoding iterations. However, the number of RAM blocks is usually insufficient. Note that the operations of the pipeline processors are identical, the connections between the RAMs and the BPUs are the same and the addresses for accessing the RAMs are the same. By taking advantage of the homogeneity of the processors, we can combine the RAMs in different processors into one large RAM block. In particular, for the RAMs handling edge-messages, we can combine the $I$ sets of $zn_cn_v/G$ RAM blocks distributed in the $I$ processors into one set of $zn_cn_v/G$ RAM blocks. Similarly, for the RAMs storing the channel messages, $I$ sets of $zn_vM/(n_cG)$ RAM blocks are combined into one set of $zn_vM/(n_cG)$ RAM blocks. The data-depth of the RAMs remains the same while the data-width becomes $I$ times wider. Note that this memory combination is a unique feature of LDPCCCs and is not shared by LDPC block codes.
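The RAM bookkeeping described above can be written as a short Python helper. This is a sketch only: the function name `ram_plan` and the returned field names are ours, and the example value $G = 8$ is hypothetical (the 4 × 24 base matrix, $z = 512$, $I = 18$ and 4-bit quantization are taken from elsewhere in the paper). The formulas are the general-case counts stated in the text.

```python
from math import gcd


def ram_plan(z, n_c, n_v, G, I, q_bits):
    """RAM counts and geometry after combining the I processors' memories."""
    M = gcd(n_c, n_v)
    edge_rams = z * n_c * n_v // G              # edge-message RAMs
    channel_rams = z * n_v * M // (n_c * G)     # channel-message RAMs
    return {
        "edge_rams": edge_rams,
        "channel_rams": channel_rams,
        "depth": G,                             # entries per RAM
        "width_per_processor": q_bits,          # bits per entry before combining
        "width_combined": q_bits * I,           # bits per entry after combining
        "total_bits": (edge_rams + channel_rams) * G * q_bits * I,
    }


# G = 8 is a hypothetical choice for illustration.
plan = ram_plan(z=512, n_c=4, n_v=24, G=8, I=18, q_bits=4)
```

Combining leaves the number of RAM blocks and their depth unchanged and only widens each word by a factor of $I$, which is why it suits FPGA embedded memories that offer wide data ports but a limited number of blocks.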
Another advantage of such a memory storage mechanism is that the address controller is a simple counter incrementing by one at every cycle, thanks to the quasi-cyclic structure. Specifically, at the start of each decoding step, the addresses for accessing the RAMs are initialized based on the parity-check matrix $H^{cc}_{[0,\infty]}$. As the decoding process proceeds, the addresses are incremented by one after every stage, until all $G$ stages are completed.
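The counter-based addressing can be illustrated with a small Python sketch for a single cyclically right-shifted $z \times z$ identity sub-matrix. The storage layout is an assumption made for illustration (variable node $n$ is kept in RAM $n \bmod P$ at address $\lfloor n/P \rfloor$, with $P = z/G$), as are the function name and the wrap-around of the address modulo $G$; the paper does not spell out its exact mapping. Under this layout, each parallel lane always talks to the same RAM, and its address simply advances by one per stage.

```python
def access_pattern(z, G, shift):
    """List, for every stage, the (ram, address) pair accessed by each parallel lane.

    Assumed layout: variable node n lives in RAM n % P at address n // P, P = z // G.
    Stage i processes check nodes i*P ... i*P + P - 1, and check node k is
    connected to variable node (k + shift) % z.
    """
    P = z // G
    stages = []
    for i in range(G):
        lanes = []
        for j in range(P):
            n = (i * P + j + shift) % z
            lanes.append((n % P, n // P))
        stages.append(lanes)
    return stages


stages = access_pattern(z=8, G=4, shift=3)
```

Because the RAM index of every lane is fixed, the connection between processing units and RAMs can be hard-wired, and only the initial address depends on the shift value of the sub-matrix.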

C. Pipeline scheduling
Conventional LDPCCC decoder architectures [12]-[14] adopt the pipeline design shown in Fig. 5. Each processor sequentially does the following: shift the messages in, update the check nodes, write the data to memories, input the messages to the VNP and update the variable nodes. This pipeline schedule only utilizes pipelining on the iteration level following the standard decoding process. In this paper, we propose a more efficient pipeline schedule based on our dynamic memory storage structure.
We first describe the pipeline schedule for a single codeword. Instead of writing the updated messages from the CNP and those from the VNP in two separate stages, we combine them with the shifting operation. The updated messages from the VNP and the channel messages associated with the updating variable nodes are directly output to the next processor, which completes the writing and shifting operations at the same time. Since some of the updated messages from the CNP need not be processed by the VNP, they are written to the local memories at the same time. Note that the memory locations into which the messages are shifted are exactly those storing the original messages loaded by the BPU. Therefore, no memory collisions occur during the process.
It can also be inferred from this process that the types of messages stored in the memories are dynamically changing. The messages associated with $u_{t_0}$ are all variable-to-check messages by the time $u_{t_0}$ first enters a processor and is ready to be processed by the CNP. After each decoding step, some of the messages are substituted by the updated variable-to-check messages from the previous processor. When $M$ decoding steps are completed, all the check-to-variable messages originally associated with $u_{t_0}$ will be completely substituted by variable-to-check messages. Yet, they are now messages for $u_{t_0+M}$ and are ready for the CNP in a new round of decoding.
Figure 6(a) describes the pipeline for a single codeword assuming $G = 3$ and $M = 4$. Comparing Fig. 5 and Fig. 6(a), it can be observed that decoding a group of check nodes using the proposed pipeline schedule only takes 4/7 of the time required by conventional scheduling. The homogeneity of the pipeline processors also facilitates pipeline processing of multiple codewords. As shown in Fig. 6(a), where a single codeword is being decoded, the processing times of different BPUs are separated in the sense that while one BPU is processing, the other BPUs remain idle. To further increase the throughput, we can schedule other BPUs to process other codewords. Since the total number of blocks in a processor is $M$, we can incorporate a maximum of $M$ different codewords in one processor, i.e., allowing BPU $i$ to process Codeword-$i$, for $i = 1, 2, \ldots, M$. Depending on the number of codewords incorporated, the throughput can be increased by a factor of up to $M$ at the cost of additional memory storage and additional hardware complexity of the BPUs. Figure 6(b) illustrates the pipeline schedule for four codewords with $G = 3$ and $M = 4$.
Using our proposed pipeline schedule, the throughput of the decoder is $(n_v - n_c)z/M$ information bits for every $G + d$ cycles, where $d$ is the time delay for each pipeline stage such that $G + d$ cycles are used by one BPU. As the number of decoding stages grows, i.e., as $G$ increases, the throughput tends to $(n_v - n_c)zf/(MG)$ bits/s with a running clock of $f$ Hz.
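The throughput expression above translates directly into a short Python sketch. The function names are ours and the numbers in the example are toy values chosen only to exercise the formula (in particular the delay $d = 1$ is hypothetical); they are not the parameters of the implemented decoder.

```python
def info_throughput_bps(z, n_c, n_v, M, G, d, f_hz):
    """Information throughput: (n_v - n_c) z / M bits every G + d clock cycles."""
    bits_per_step = (n_v - n_c) * z / M
    return bits_per_step * f_hz / (G + d)


def asymptotic_throughput_bps(z, n_c, n_v, M, G, f_hz):
    """Limit of the expression above when the delay d is negligible relative to G."""
    return (n_v - n_c) * z * f_hz / (M * G)


# Toy example: z = 4, n_c = 2, n_v = 4, M = 2, G = 2, f = 100 MHz, d = 1 (hypothetical).
actual = info_throughput_bps(z=4, n_c=2, n_v=4, M=2, G=2, d=1, f_hz=100e6)
limit = asymptotic_throughput_bps(z=4, n_c=2, n_v=4, M=2, G=2, f_hz=100e6)
efficiency = actual / limit   # equals G / (G + d)
```

The ratio between the two figures is $G/(G+d)$, which makes explicit how the fixed pipeline delay is amortized as the number of stages grows.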

An illustrative example of the RAM storage and decoding process
Example: we consider a QC-LDPCCC with $G = 2$, $z = 4$, $n_c = 2$ and $n_v = 4$. Since $M = \gcd(n_c, n_v) = 2$, each processor has $M = 2$ BPUs. In each processor, $zn_cn_v/(MG) = 8$ RAMs are dedicated to storing edge-messages and $zn_v/(n_cG) = 4$ RAMs are dedicated to storing channel messages. Assume that the check nodes $u_{t_0} = [u_{t_0,1}, u_{t_0,2}, \ldots, u_{t_0,4}]$ have just entered a processor and the variable nodes $v_{t_0-1}$ are about to leave it. Fig. 7 shows the dynamic storage of the edge-messages in the RAMs at different time instances.
Step 1) This step shows the RAM storage at the start of processing $u_{t_0}$ and $v_{t_0-1}$ by BPU 1. It can be seen that RAMs 1 to 8 store the variable-to-check messages for $u_{t_0}$, which is ready to be processed. RAMs 13 to 16 store the latest check-to-variable messages for $u_{t_0-1}$, which were updated in the previous decoding step by BPU 2. RAMs 9 to 12 store the variable-to-check messages that were newly updated in the previous decoding step and shifted from the previous processor.
Step 2) This step shows the RAM storage after the first stage of BPU 1 processing. At the first stage, BPU 1 processes $u_{t_0,1}$ and $u_{t_0,2}$ and their connected variable nodes in $v_{t_0-1}$. The CNP reads the variable-to-check messages from the first set of entries located in RAMs 1 to 8. The newly updated check-to-variable messages between $u_{t_0}$ and $v_{t_0}$ from the CNP are written to the first set of entries in RAMs 1 to 4 (i.e., the locations from which the corresponding variable-to-check messages were read), while the newly updated check-to-variable messages between $u_{t_0}$ and $v_{t_0-1}$ are input to the VNP and the resulting variable-to-check messages are shifted to the next processor. As a result, the updated variable-to-check messages between $v_{t_0+1}$ and $u_{t_0+2}$ are written to RAMs 5 to 8 and those between $v_{t_0+1}$ and $u_{t_0+1}$ are written to RAMs 13 to 16.
Step 3) This step shows the RAMs after the second stage of BPU 1 processing. At the second stage, BPU 1 processes $u_{t_0,3}$ and $u_{t_0,4}$ and their connected variable nodes in $v_{t_0-1}$. The CNP reads the variable-to-check messages from the second set of entries located in RAMs 1 to 8. The newly updated check-to-variable messages between $u_{t_0}$ and $v_{t_0}$ from the CNP are written to the second set of entries in RAMs 1 to 4 (i.e., the locations from which the corresponding variable-to-check messages were read), while the newly updated check-to-variable messages between $u_{t_0}$ and $v_{t_0-1}$ are input to the VNP and the resulting variable-to-check messages are shifted to the next processor. As a result, the updated variable-to-check messages between $v_{t_0+1}$ and $u_{t_0+2}$ are written to RAMs 5 to 8 and those between $v_{t_0+1}$ and $u_{t_0+1}$ are written to RAMs 13 to 16.
The RAM updating at the decoding step of BPU 2 is analogous to Steps 2) and 3) above. After the second stage of BPU 2, RAMs 1 to 8 will hold the variable-to-check messages ready for $u_{t_0+2}$ and their connected variable nodes in $v_{t_0+1}$. The RAM storage is similar to that in Step 1) with the time instances incremented by $M = 2$. A new round of BPU 1 updating will follow according to Steps 2) and 3). Also note that once the address controller is initialized at the start of the $G$ stages, the read/write addresses for accessing the RAMs are simply incremented by 1.
Based on the above results, the following guidelines can be used in designing an LDPCCC decoder.
• To increase the decoder throughput while maintaining a similar BER performance and the same number of memory bits, we can reduce the memory depth $G$ at the cost of more combinational logic.
• To reduce the cost of combinational logic while maintaining a similar BER performance and throughput, we can increase $z$ and use a smaller number of processors $I$. Under such circumstances, the total memory bits may increase.
• To reduce the memory bits while maintaining a similar BER performance and throughput, we can use a smaller $z$ and a larger $I$ at the cost of more combinational logic.
In addition, we attempt to compare our implementation results with those found in the literature. Since the objective of our work is to achieve high throughput and good error performance, the code length and code rate of the codes used in our experiments are relatively large. While we can find quite a number of decoders in the literature, none of them considers codes with lengths comparable to the ones we use. All of them assume lengths which are relatively short and consequently they have high error floors and small coding gains. The "closest" one we can find is the QC-LDPC block decoder described by Wang and Cui [9], who target a high-speed decoder and adopt a length-8176 QC-LDPC code in the experiment. In Table I, we add the implementation results of the decoder in [9]. Although the decoder in [9] seems to be less complex than our designs, its throughput (0.2 Gbps) is only 1/10 of ours (2 Gbps). If 10 decoders of the type in [9] are put together in order to achieve the same throughput as our decoders, the total complexity of the decoders will become larger than ours. Furthermore, the decoder in [9] displays an error floor at a BER of $10^{-10}$ while our decoder does not. In fact, at a BER of $10^{-10}$, our decoders can achieve an extra coding gain of 0.8 dB to 1 dB over the decoder in [9]. Thus, our proposed decoder is superior in achieving high throughput, high coding gain and low error floor.
We also compare the BER performance of LDPCCCs and their block-code counterparts under similar processor complexity and throughput. Compared with a single-processor decoder of an LDPC block code with the same iteration number $I$, the LDPCCC decoder with $I$ processors, where the length of the coded bits stored in each processor equals the code length of the block code, incurs $I$ times more complexity but achieves $I$ times higher throughput. In order for the LDPC block decoder to attain the same throughput, $I$ times more processors are needed to decode in parallel. Under such circumstances, the overall complexity of the LDPC block decoder will increase by $I$ times and become the same as that of the LDPCCC counterpart. Therefore, comparing an LDPCCC with the block code from which it is derived is fair from the perspective of processor complexity and throughput.
Figure 9 shows the BER performance of LDPCCCs and their block-code counterparts. The results of the LDPC block codes are obtained from computer simulations (using C programming) based on 4-bit quantized messages. It can be seen that the BER performance of the LDPCCCs is generally superior. For instance, the LDPCCC with $z = 422$ and $I = 18$ has a gain of 0.2 dB at a BER of $2 \times 10^{-5}$ over its block-code counterpart. Another observation is that the advantage of an LDPCCC over its block-code counterpart becomes more obvious as the number of decoding iterations increases. For example, the LDPCCC with $z = 1024$ and $I = 10$ performs similarly to its block-code counterpart at a BER of $2 \times 10^{-5}$; and it outperforms its block-code counterpart by 0.1 dB at a BER of $\times 10^{-6}$ when the number of decoding iterations increases to 12, i.e., $I = 12$. As a result, when the number of decoding iterations is large, the LDPCCC is considered to be a better choice in terms of error performance.

Fig. 2. Generic Processing Unit and Dedicated Block Processing Unit.

Fig. 6. Proposed pipeline. Bi: processing of block i; S-W: shift messages and write messages to the next processor; R: input messages to the block processing unit; CNP: check-node processing; VNP: variable-node processing.
(a) $z = 422$ and $I = 18$; (b) $z = 512$ and $I = 18$; (c) $z = 1024$ and $I = 12$; (d) $z = 1024$ and $I = 10$. Recall that $z \times z$ represents the sub-matrix size of each entry in the $4 \times 24$ base matrix while $I$ denotes the number of iterations (i.e., processors) used in the LDPCCC decoders.

Fig. 8. Bit-error-rate (BER) results for the LDPCCCs with different sizes. The results are obtained from FPGA experiments under AWGN channels and 4-bit quantization.

Fig. 9. Comparison of BER results between LDPCCCs and LDPC block-code counterparts under AWGN channels. The results of the LDPCCCs and the LDPC block codes are represented by solid lines and dashed lines, respectively. The results of the LDPCCCs are obtained from FPGA experiments under 4-bit quantization, while those of the LDPC block codes are obtained from computer simulations (using C programming) based on 4-bit quantized messages.
IMPLEMENTATION COMPLEXITY OF THE QC-LDPC BLOCK DECODER IN [9] IS SHOWN FOR COMPARISON.