## Implementation of Decoders for LDPC Block Codes and LDPC Convolutional Codes Based on GPUs

Yue Zhao and Francis C.M. Lau, Senior Member, IEEE

Abstract—In this paper, efficient LDPC block-code decoders/simulators which run on graphics processing units (GPUs) are proposed. We also implement the decoder for the LDPC convolutional code (LDPCCC). The LDPCCC is derived from a pre-designed quasicyclic LDPC block code with good error performance. Compared to the decoder based on the randomly constructed LDPCCC code, the complexity of the proposed LDPCCC decoder is reduced due to the periodicity of the derived LDPCCC and the properties of the quasi-cyclic structure. In our proposed decoder architecture,  $\Gamma$  ( $\Gamma$  is a multiple of a warp) codewords are decoded together and hence the messages of  $\Gamma$  codewords are also processed together. Since all the  $\Gamma$  codewords share the same Tanner graph, messages of the  $\Gamma$  distinct codewords corresponding to the same edge can be grouped into one package and stored linearly. By optimizing the data structures of the messages used in the decoding process, both the read *and* write processes can be performed in a highly parallel manner by the GPUs. In addition, a thread hierarchy minimizing the divergence of the threads is deployed, and it can maximize the efficiency of the parallel execution. With the use of a large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder can obtain hundreds of times speedup compared with a serial CPU-based simulator.

Index Terms—LDPC, LDPC convolutional code, CUDA, graphics processing unit (GPU), OpenMP, parallel computing, LDPC decoder, LDPCCC decoder

#### **1** INTRODUCTION

L OW-DENSITY parity-check (LDPC) codes were invented by Robert Gallager [1] but had been ignored for years until Mackay rediscovered them [2]. They have attracted much attention recently because they can achieve excellent error correcting performance based on the belief propagation (BP) decoding algorithm.

However, the BP decoding algorithm requires intensive computations. For applications like optical communication [3], [4] which requires BERs down to  $10^{-15}$ , using CPU-based programs to simulate the LDPC decoder is impractical. Fortunately, the decoding algorithm possesses a high data-parallelism feature, i.e., the data used in the decoding process are manipulated in a very similar manner and can be processed separately from one another. Thus, practical decoders with lowlatency and high-throughput can be implemented with dedicated hardware such as field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) [5], [6], [7], [8], [9], [10], [11], [12]. However, high performance FPGAs and ASICs are very expensive and are non-affordable by most researchers. Such hardware solutions also cost a long time to develop.

• The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU 519011E).

In addition, the hardware control and interconnection frame are always associated with a specific LDPC code. If one parameter of an LDPC code/decoder changes, the corresponding hardware design has to be changed accordingly, rendering the hardware-based solutions nonflexible and non-scalable.

Recently, graphics processing units (GPUs) used to process graphics only have been applied to support general purpose computations [13]. In fact, GPUs are highly parallel structures with many processing units. They support floating point arithmetics and can hence conduct computations with the same precision as CPUs. GPUs are particularly efficient in carrying out the same operations to a large amount of (different) data. Compared with modern CPUs, GPUs can also provide much higher data-parallelism and bandwidth. Consequently, GPUs can provide a cheap, flexible and efficient solution of simulating an LDPC decoder. Potentially, the simulation time can be reduced from months to weeks or days when GPUs, instead of CPUs, are used. In addition, the GPU programming codes can be re-used without much modification should more advanced GPUs be produced by manufacturers.

In [14], [15], a compressed parity-check matrix has been proposed to store the indices of the passing messages in a cyclic or quasi-cyclic LDPC code. Further, the matrix is stored in the constant cache memory on the GPU for fast access. The messages are stored in a compressed manner such that the global memory can be accessed in a coalesced way frequently. However,

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

<sup>•</sup> Yue Zhao and Francis Lau are with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University.

the coalesced memory access occurs only during the data-read process and is not always guaranteed due to a lack of data alignment. In [13], [16], [17], the sumproduct LDPC decoder and the min-sum decoder have been implemented with GPUs. Moreover, by combining sixteen fixed-point 8-bit data to form one 128-bit data, the LDPC decoder in [13] decodes sixteen codewords simultaneously and achieves a high throughput. Although the method in [13] allows coalesced memory access in *either* the read *and* write processes is yet to be achieved.

Furthermore, the LDPC convolutional codes (LDPC-CCs), first proposed in [18], have been shown to achieve a better error performance than the LDPC block code counterpart of similar decoding complexity. There are many features of LDPCCC that make it suitable for real applications. First, the LDPCCC inherits the structure of the convolutional code, which allows continuous encoding and decoding of variable-length codes. Thus the transmission of codewords with varying code length is possible. Second, the LDPCCC adopts a pipelined decoding architecture — in the iterative decoding procedure, each iteration is processed by a separate processor and the procedure can be performed in parallel. So a high-throughput decoder architecture is possible. In [19], [20], the concepts and realization of highly parallelized decoder architectures have been presented and discussed. To the author's best knowledge, there is not any GPU-based implementation of the LDPCCC decoder vet. The reason may lie in the complexity structure of the LDPCCC compared to the LDPC block code, particularly the random time-varying LDPCCC.

As will be discussed in this paper, an LDPCCC derived from a well designed QC-LDPC code possesses not only the good BER performance, but also the regular structure that results in many advantages in practical implementations. Due to the structure inherited from the QC-LDPC code, the LDPCCC decoder enables an efficient and compact memory storage of the messages with a simple address controller.

In this paper, we develop flexible and highly parallel GPU-based decoders for the LDPC codes. We improve the efficiency by making (i) the threads of a warp follow the same execution path (except when deciding whether a bit is a "0" or a "1") and (ii) the memory accessed by a warp be of a certain size and be aligned. The results show that the decoders based on the GPUs achieve remarkable speed-up improvement — more than 100 times faster than the serial CPU-based decoder.

We also develop a GPU-based decoder for the LDPC convolutional codes. We propose a decoder architecture for LDPCCC derived from QC-LDPC block-code. By taking advantage of the homogeneous operations of the pipeline processors, we compress the index information of different processors into one lookup table. Combined with an efficient thread layout, the decoder is optimized in terms of thread execution and memory access. Simulation results show that compared with the serial CPU-based decoder, the GPU-based one can achieve as many as 200 times speed-up. The GPU-based decoder, moreover, outperforms a quad-core CPU-based decoder by almost 40 times in terms of simulation time.

#### 2 REVIEW OF LDPC CODES AND LDPC CONVOLUTIONAL CODES

#### 2.1 Structure of LDPC Codes and QC-LDPC Codes

A binary (N, K) LDPC code is a linear block code specified by a sparse  $M \times N$  parity-check matrix **H**, where M = N - K. The code rate of such an LDPC code is  $R \ge K/N = 1 - M/N$ . The equality holds when **H** is full rank.

The **H** matrix contains mostly 0's and relatively a small number of 1's. Such a sparsity structure is the key characteristic that guarantees good performance of LDPC codes. A *regular* LDPC code is a linear block code with **H** containing a constant number  $w_c$  of 1's in each column and a constant number  $w_r$  of 1's in each row. Moreover,  $w_r$  and  $w_c$  satisfy the equation  $w_r = w_c \times \frac{N}{M}$ . Otherwise the code is defined as an *irregular* LDPC code.

A bipartite graph called Tanner graph [21] can be used to represent the codes and to visualize the messagepassing algorithm. In the Appendix, Figure **??** is the underlying Tanner graph of the **H** in (**??**). The *N* upper nodes are called the message nodes or the variable nodes and the *M* nodes in the lower part of Fig. **??** are called the check nodes. An edge in the Tanner graph represents the adjacency of the variable node *i* and the check node *j*. It corresponds to a nonzero (i, j)-th entry in the **H** matrix.

QC-LDPC codes form a subclass of LDPC codes with the parity-check matrix consisting of circulant permutation matrices [22], [23]. The parity-check matrix of a regular (J, L) QC-LDPC code is represented by

$$\mathbf{H} = \begin{bmatrix} \mathbf{P}^{a_{1,1}} & \mathbf{P}^{a_{1,2}} & \cdots & \mathbf{P}^{a_{1,L}} \\ \mathbf{P}^{a_{2,1}} & \mathbf{P}^{a_{2,2}} & \cdots & \mathbf{P}^{a_{2,L}} \\ \vdots & \ddots & \ddots & \vdots \\ \mathbf{P}^{a_{J,1}} & \mathbf{P}^{a_{J,2}} & \cdots & \mathbf{P}^{a_{J,L}} \end{bmatrix},$$
(1)

where J denotes the number of block rows, L is the number of block columns,  $\mathbf{P}$  is the identity matrix of size  $p \times p$ , and  $\mathbf{P}^{a_{j,l}}$   $(1 \leq j \leq J; 1 \leq l \leq L)$  is a circulant matrix formed by shifting the columns of  $\mathbf{P}$  cyclically to the right  $a_{j,l}$  times with  $a_{j,l}$ 's being nonnegative integers less than p. The code rate R of  $\mathbf{H}$  is lower bounded by  $R \geq 1-J/L$ . If one or more of the submatrix(matrices) is/are substituted by the zero matrix rendering non-uniform distributions of the check-node degrees or variable-node degrees, the QC-LDPC code becomes an irregular code.

## 2.2 Belief Propagation Decoding Algorithm for LPDC Codes

LDPC codes are most commonly decoded using the belief propagation (BP) algorithm [24], [25]. Referring to

the Tanner graph shown in Fig. **??**, the variable nodes and the check nodes exchange soft messages iteratively based on the connections and according to a two-phase schedule.

Given a binary (*N*, *K*) LDPC code with a parity-check matrix **H**, we define *C* as the set of binary codewords **c** that satisfy the equation  $\mathbf{c}\mathbf{H}^{\mathrm{T}} = \mathbf{0}$ . At the transmitter side, a binary codeword  $\mathbf{c} = (c_0, c_1, \ldots, c_{N-1})$  is mapped into the sequence  $\mathbf{x} = (x_0, x_1, \ldots, x_{N-1})$  according to  $x_n = 1 - 2c_n$ . We assume that **x** is then transmitted over an additive white Gaussian noise (AWGN) channel and the received signal vector is then given by  $\mathbf{y} = (y_0, y_1, \ldots, y_{N-1}) = \mathbf{x} + \mathbf{g}$ , where  $\mathbf{g} = (g_0, g_1, \ldots, g_{N-1})$  consists of independent Gaussian random variables with zero mean and variance  $\sigma^2 = N_0/2$ .

Let  $\mu_n$  be the initial log-likelihood ratio (LLR) that the variable node *n* is a "0" to that it is a "1", i.e.,

$$\mu_n = \ln\left(\frac{\Pr(c_n = 0|y_n)}{\Pr(c_n = 1|y_n)}\right).$$
(2)

Initially,  $\mu_n$  is calculated by  $\mu_n = (4/N_0) \cdot y_n = \frac{2y_n}{\sigma^2}$  [26]. Define  $\mathcal{N}(m)$  as the set of variable nodes that participate in check node m and  $\mathcal{M}(n)$  as the set of check nodes connected to variable node n. At iteration l, let  $\beta_{mn}^{(l)}$  be the LLR messages passed from variable node n to check node m;  $\alpha_{mn}^{(l)}$  be the LLR messages passed from check node m to variable node n; and  $\beta_n^{(l)}$  be the a posteriori LLR of variable node n. Then the standard BP algorithm can be described in Algorithm **??** in the Appendix [2], [27].

Note that the decoding algorithm consists of 4 main procedures: initialization, horizontal step, vertical step and making hard decisions. For each of these procedures, multiple threads can be used in executing the computations in parallel and all the threads will follow the same instructions with no divergence occurring, except when making hard decisions.

#### 2.3 Structure of LDPC Convolutional Codes

A (time-varying) semi-infinite LDPC convolutional code can be represented by its parity check matrix in (3). where  $m_s$  is referred to as the syndrome former memory of the parity-check matrix. Besides, the sub-matrices  $\mathbf{H}_i(t), i = 0, 1, ..., m_s$ , are binary  $(c-b) \times c$  matrices given by

$$\mathbf{H}_{i}(t) = \begin{vmatrix} h_{i}^{(1,1)}(t) & \cdots & h_{i}^{(1,c)}(t) \\ \vdots & \ddots & \vdots \\ \vdots & \ddots & \vdots \\ h_{i}^{(c-b,1)}(t) & \cdots & h_{i}^{(c-b,c)}(t) \end{vmatrix}.$$

If  $\mathbf{H}_i(t)$  are full rank for all time instant t, the matrix  $\mathbf{H}$  in (3) defines a rate R = b/c convolutional code ignoring the irregularity at the beginning.

**Definition 1.** A LDPC convolutional code is called a regular  $(m_s, J, K)$ -LDPC convolutional code if the parity-check matrix  $\mathbf{H}_{[0,\infty]}$  has exactly K ones in each row and J ones in

each column starting from the  $(m_s \cdot (c-b) + 1)$ -th row and  $(m_s \cdot c + 1)$ -th column.

**Definition 2.** An  $(m_s, J, K)$ -LDPC convolutional code is periodic with period T if  $\mathbf{H}_i(t), i \in \mathbb{Z}^+$  is periodic, i.e.,  $\mathbf{H}_i(t) = \mathbf{H}_i(t+T), \forall i, t.$ 

A code sequence  $\mathbf{v}_{[0,\infty]} = [\mathbf{v}_0, \mathbf{v}_1, ..., \mathbf{v}_\infty]$  is "valid" if it satisfies the equation

$$\mathbf{v}_{[0,\infty]}\mathbf{H}_{[0,\infty]}^T = \mathbf{0}$$
(3)

where  $\mathbf{v}_i = (v_i^{(1)}, v_i^{(2)}, ..., v_i^{(c)})$  and  $\mathbf{H}_{[0,\infty]}^T$  is the syndrome-former (transposed parity-check) matrix of  $\mathbf{H}_{[0,\infty]}$ .

#### 2.4 Deriving LDPC Convolutional codes from QC-LDPC block codes

There are several methods to construct LDPC convolutional codes from LDPC block codes. One method is to derive time-varying LDPCCC by unwrapping randomly constructed LDPC block codes [18] and another is by unwrapping the QC-LDPC codes [28], [29]. We consider a construction method by unwrapping a class of QC-LDPC block code. Details of the method are shown in the Appendix.

**Example 1.** Consider a QC-LDPC code with 4 block rows and 24 block columns, i.e., J = 4 and L = 24. It is first divided into  $4 \times 4$  equally sized sub-blocks<sup>1</sup>, i.e.,  $\Lambda = 4$ . Then the parity-check matrix of LDPCCC is derived. The construction process is shown in Fig. 1.

#### 2.5 Decoding Algorithm for LDPCCC

In  $\mathbf{H}_{[0,\infty]}$ , two different variable nodes connected to the same check node cannot be distant from each other more than  $m_s$  time units. This allows a decoding window that operates on a fixed number of nodes at one time. Since any two variable nodes that are at least  $m_s + 1$  units apart can be decoded independently, parallel implementation is feasible. The LDPCCC can therefore be decoded with pipelined BP decoding algorithm [18]. Specifically, for a maximum iteration number of *I*, *I* independent processors will be employed working on different variable nodes corresponding to different time. In each processor, the variable nodes and the check nodes exchange soft messages iteratively based on the connections and according to a two-phase schedule.

Fig. 2 shows a decoder on the Tanner graph. It is based on the LDPCCC structure shown in Example 1. The code has a rate of R = 5/6 and a syndrome former memory of  $m_s = 3$ . We refer the *c* incoming variable nodes (bits) as a frame. Note that every *c* bits form a frame and every  $m_s + 1$  frames are involved in the same constraints. The *I* processors can operate concurrently. At every iteration, every processor first updates the (c - b) neighboring

1. Here we use sub-block to denote the  $(pJ/\Lambda) \times (pL/\Lambda)$  matrix as to distinguish it with the sub-matrix within it, i.e., the  $p \times p$  matrix.

$$\mathbf{H}_{[0,\infty]} = \begin{bmatrix} \mathbf{H}_{0}(0) \\ \mathbf{H}_{1}(1) & \mathbf{H}_{0}(1) \\ \vdots & \vdots & \ddots \\ \mathbf{H}_{m_{s}}(m_{s}) & \mathbf{H}_{m_{s}-1}(m_{s}) & \cdots & \mathbf{H}_{0}(m_{s}) \\ & \mathbf{H}_{m_{s}}(m_{s}+1) & \mathbf{H}_{m_{s}-1}(m_{s}+1) & \cdots & \mathbf{H}_{0}(m_{s}+1) \\ & \ddots & \ddots & \ddots \\ & \mathbf{H}_{m_{s}}(t) & \mathbf{H}_{m_{s}-1}(t) & \cdots & \mathbf{H}_{0}(t) \\ & & \ddots & \ddots & \ddots \\ \end{bmatrix},$$
(3)



Fig. 1: Illustration of constructing a LPDCCC from a QC-LDPC block code.



Fig. 2: Continuous decoding of LDPC convolutional code with *I* processors. Each circle denotes a group of *c* variable nodes and each square denotes a group of (c - b) check nodes. Each edge represents the connection between the *c* variable node and the (c - b) check nodes.

check nodes of the c variable nodes that just come into this processor. Then every processor will update the cvariables which are leaving this processor.

The computations of the check-node updating and variable-node updating are based on the standard BP algorithm Suppose  $\mathbf{v}_{[0,\infty]} = [\mathbf{v}_0, \mathbf{v}_1, \dots, \mathbf{v}_{\infty}]$ , where  $\mathbf{v}_t = (v_t^{(1)}, v_t^{(2)}, \dots, v_t^{(c)})$  is the *t*th transmitted code-word. Then the codeword  $\mathbf{v}_{[0,\infty]}$  is mapped into the sequence  $\mathbf{x}_{[0,\infty]} = [\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_{\infty}]$  according to  $\mathbf{x}_t = (x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(c)})$  and  $x_t^{(j)} = 1 - 2v_t^{(j)}$   $(j = 1, 2, \dots, c)$ .

Assuming an AWGN channel, the received signal  $\mathbf{y}_{[0,\infty]} = [\mathbf{y}_0, \mathbf{y}_1, \dots, \mathbf{y}_\infty]$  is further given by  $\mathbf{y}_t = (y_t^{(1)}, y_t^{(2)}, \dots, y_t^{(c)})$  where  $y_t^{(j)} = x_t^{(j)} + g_t^{(j)}$  and  $g_t^{(j)}$  is an AWGN with zero mean and variance  $\sigma^2 = N_0/2$ .

Using the same notation as in Sect. 2.2, the pipelined BP decoding algorithm applying to LDPCCC is illustrated in Algorithm ?? in the Appendix. Same as the LDPC decoding algorithm, the LDPCCC decoding algorithm consists of 4 main procedures: initialization, horizontal step, vertical step and making hard decisions.

Moreover, for each of these procedures, multiple threads can be used in executing the computations in parallel and all the threads will follow the same instructions with no divergence occurring, except when making hard decisions.

### 3 IMPLEMENTATION OF DECODERS FOR LDPC CODES AND LDPCCCS

#### 3.1 GPU-based LDPC Decoder

(Please refer to the Appendix for a brief description of graphics processing unit and CUDA programming.) We implement our decoders using the standard BP decoding algorithm. According to the CUDA programming model, the granularity of a thread execution and a coalesced memory access is a warp. Full efficiency is realized when all threads in a warp take the same execution path and the coalesced memory access requirement is satisfied. Thus, we propose to decode  $\Gamma$  codewords simultaneously, where  $\Gamma$  is an integer multiple of a warp (i.e., multiple of 32). For each decoding cycle,  $\Gamma$  codewords will be input, decoded, and ouput together and in parallel.

Recall that an LDPC code can be represented by its parity-check matrix or a Tanner graph. A non-zero element in the parity-check matrix corresponds to an edge in the Tanner graph.

In the LDPC decoder, messages are bound to the edges in the Tanner graph (or the 1's in the parity-check matrix **H**). So we store the messages according to the positions of 1's. Besides, the channel messages corresponding to the variable nodes are required. To reuse the notation, we denote the data structure storing the messages between the variable nodes and the check nodes as **H** while the the data structure storing the channel messages as **V**. The difficulty of the CUDA memory arrangement lies on the fact that for practical LDPC codes with good performance, the positions of the 1's are scattered in the parity-check matrix.

First, in the BP decoding procedure, although there are two kinds of messages, namely, the variable-to-check messages and the check-to-variable messages, at every step of the iteration, only one kind of message is needed to be stored, i.e., after the check-node updating step, only the check-to-variable messages  $\alpha$ 's are stored in the H and after the variable-node updating step, only the variable-to-check messages  $\beta$ 's are stored in the **H**. Second, in our new decoder architecture,  $\Gamma$  ( $\Gamma$  is a multiple of a warp) codewords are decoded together and hence the messages of  $\Gamma$  codewords are also processed together. We number the distinct codewords as  $0, 1, ..., \Gamma - 1$  and we use the same notations for the messages as before, i.e.,  $\beta_{mn}(\gamma)$  is the message from variable node *n* to check node *m* corresponding to the  $\gamma$ -th codeword and  $\alpha_{mn}(\gamma)$ is the message from check node m to variable node *n* corresponding to the  $\gamma$ -th codeword. Since all the  $\Gamma$ codewords messages share the same Tanner graph, messages of the  $\Gamma$  distinct codewords corresponding to the

packages  $\mathfrak{p}_{mn}$ 's are aligned linearly according to their corresponding positions in the parity-check matrix — row-by-row, and left to right for each row. That implies the messages associated to one check node are stored contiguously.

# **Remark.** To be consistent with the use of memory locations in computer programming, all the indices of the data structures in this paper starts from 0.

The advantage of this arrangement is obvious. Since  $\Gamma$  is a multiple of 32, the memory segment for every package is naturally aligned when the data type belongs to one of the required data types (i.e., with word size of 1-, 2-, 4-, or 8-byte). In addition, the structure of the parity-check matrix **H** is shared by the  $\Gamma$  codewords. As these  $\Gamma$  data elements are processed together, they can be accessed by  $\Gamma$  contiguous threads and hence the global memory is always accessed in a coalesced way. We also ensure that the threads within a warp always follow the same execution path with no divergence occurring (except when making hard decisions on the received bits). Then both the memory access and the thread execution are optimal and efficient.

We also need to store the details of the parity-check matrix. Two lookup tables denoted by  $LUT_c$  and  $LUT_v$  will be kept.  $LUT_c$  is used in the check-node updating process and  $LUT_v$  is used in the variable-node updating process. The two tables store the indices of the data accessed in the two updating processes and both are two-dimensional. The first dimension is to distinguish different check nodes, i.e.,  $LUT_c[m]$  is associated with the *m*-th check node or the *m*-th row. Each  $LUT_c[m]$  records the indices of the messages related to the *m*-th check node. The two lookup tables are shared by all  $\Gamma$  codewords. An example is illustrated in the Appendix. The  $LUT_c$  and  $LUT_v$  lookup tables are stored in the constant or texture memory in the CUDA device so as to be cached to reduce the access time.

A separate thread is assigned to process each check node or each variable node in the updating kernel. Hence,  $\Gamma$  threads can be assigned to process the data of  $\Gamma$  codewords simultaneously. So, a two dimensional thread hierarchy is launched. The first dimension is for identifying the different codewords while the second dimension is for processing different check nodes or variable nodes. The thread layout is illustrated in Fig. 3. For each thread block, we allocate  $\Gamma$  threads in the threadIdx.x dimension<sup>2</sup>, and  $BL_y$  threads in the threadIdx.y dimension. Each thread-block contains  $BL_y \times \Gamma$  threads, which should be within the thread-block size limit (1024

2. In CUDA, threads are linear in the threadIdx.x dimension.



Fig. 3: Two dimensional thread layout of the checknode/variable-node updating kernel.

for the current device). The total number of threadblocks is determined by the number of check nodes Mor the number of variable nodes N. We denote  $BL_y$  in the check-node updating kernel as  $BL_{y,cnu}$  and the one in the variable-node updating kernel as  $BL_{y,vnu}$ . Then the numbers of thread blocks are given by  $\lceil M/BL_{y,cnu} \rceil$ and  $\lceil N/BL_{y,vnu} \rceil$ , respectively. In Fig. 3, the threads marked by the vertical rectangular are processing the same codeword. (See the Appendix on the selection of the size of the thread-block.)

#### 3.2 GPU-based LDPCCC Decoder

The decoding algorithm and the pipelined LDPCCC decoder architecture have been introduced in Section 2.5. The LDPCCCs studied in our work are derived from QC-LDPC codes as described in Section 2.4. So our LDPCCC decoder is confined to the LDPCCCs with the parity-check matrix  $\mathbf{H}_{[0,\infty]}$  of this kind of structure.

#### 3.2.1 Data Structure

The LDPC convolutional codes are decoded continuously. We will thus refer to an LDPCC code sequence  $\mathbf{v}_{[0,\infty]} = [\mathbf{v}_0, \mathbf{v}_1, \dots, \mathbf{v}_\infty]$  as a *code stream* and  $\mathbf{v}_i$ ,  $i = 0, 1, \dots, \infty$  as a *code frame* or *variable frame*. A code stream is constrained with the parity-check matrix  $\mathbf{H}_{[0,\infty]}$  by

$$\mathbf{v}_{[0,\infty]}\mathbf{H}_{[0,\infty]}^T = \mathbf{0}$$

The parity-check matrix of the LDPCCC is shown in Figure 4. It is seen that the check nodes are grouped into layers. Each variable-node frame is connected to  $m_s + 1$  (4 here) check layers in the parity-check matrix. Let c



Fig. 4: The periodic structure of the parity-check matrix of the LDPCCCs.

denote the size of  $\mathbf{v}_i$ ,  $i = 0, 1, ..., \infty$  and c - b denote the size of each check layer. Thus the code rate is b/c.

We will use the same notations as in Section 2.4. The LDPCCC is derived from a (J, L) QC-LDPC base code  $\mathbf{H}_{QC}$  which has  $J \times L$  sub-matrices and the size of each sub-matrix is  $p \times p$ .  $\mathbf{H}_{QC}$  is first divided into  $\Lambda \times \Lambda$  sub-blocks<sup>3</sup> ( $\Lambda = 4$  in Figure 4) and each sub-block contains several sub-matrices. We have  $c = L/\Lambda \times p$  and  $c - b = J/\Lambda \times p$ . Referring to Section 2.4, we denote the unwrapped parity-check matrix of the QC-LDPC code as

$$\mathbf{H}_{base} = \begin{bmatrix} \mathbf{H}_{QC}^{L} \\ \mathbf{H}_{QC}^{U} \end{bmatrix}.$$

The  $\mathbf{H}_{[0,\infty]}$  of the derived LDPCCC is a repetition of  $\mathbf{H}_{base}$ . Denoting the number of edges in  $\mathbf{H}_{base}$  by E, we have  $E = J \times L \times p$ .

In designing the LDPCCC decoder, the first thing to consider is the amount of memory required to store the messages. Like the LDPC decoder, we store the messages according to the edges in the parity-check matrix. Let I denote the number of iterations in the LDPCCC decoding. Then I processors are required in the pipelined decoder. Although the parity-check matrix of the LDPCCC is semi-infinite, the decoder only needs to allocate memory for I processors. Hence the total size of the memory required for storing the messages passing between the variable nodes and check nodes is  $I \times E$  units. And the total size of the memory required for storing these channel messages is  $I \times c$ .

Next, we will describe the hierarchical data structure for the LDPCCC decoder memory space. To reuse the notation, we use **H** to denote the memory space for the

3. Note that a "sub-block" is different from a "sub-matrix".

messages on the edges and V to denote the memory space for the channel messages. The H is a multidimensional array with two hierarchies. First, we divide the entire memory space into I groups corresponding to the I processors and we use the first hierarchy of **H** as the data structure for each group. That is  $\mathbf{H}[i]$ , i = 0, 1, ..., I - 1 denote the data structure for the I processors, respectively. Second, recall that the paritycheck matrix in Figure 4 is derived from  $H_{base}$  which is divided into 16 non-zero sub-blocks and each sub-block has a size of  $(pJ/\Lambda) \times (pL/\Lambda)$ . Thus in each group,  $\mathbf{H}[i]$  is also divided into 16 sub-blocks, denoted by the second hierarchy of **H**, namely,  $\mathbf{H}[i][j]$ , where j = 0, 1, ..., 15. Every  $\mathbf{H}[i][j]$  stores the messages associated with one subblock. On the other hand, the memory for the channel messages is simpler:  $\mathbf{V}[i], i = 0, 1, ..., I \cdot (m_s + 1) - 1$  will be allocated. Finally, to optimize the thread execution and memory access,  $\Gamma$  LDPC convolutional code streams are decoded simultaneously, where  $\Gamma$  is a multiple of a warp. Thus every  $\Gamma$  data are combined into one package and take up one memory unit.

An LDPCCC decoder uses the BP algorithm to update the check nodes and variable nodes. The BP decoding procedures are based on the parity-check matrix  $\mathbf{H}_{[0,\infty]}$ . With the data structure to store the messages, the decoder also needs the structure information of  $\mathbf{H}_{[0,\infty]}$ for understanding the connections between the check nodes and the variable nodes. This information can be used to calculate the index of the data being accessed during the updating. Due to the periodic property of the constructed LDPCCC, the structure of  $\mathbf{H}_{base}$  is shared by all the processors. We label the 16 sub-blocks in  $\mathbf{H}_{base}$ with the numbers  $0, 1, \ldots, 15$ .

In addition, in the decoder, the *I* check-node layers or *I* variable-node frames being updated simultaneously in the *I* processors are separated by an interval of  $m_s + 1$ . Since  $\mathbf{H}_{[0,\infty]}$  also has a period of  $T = m_s + 1$ , at any time slot, the *I* processors require the same structure information in updating the check nodes or the variable nodes, as seen in Figure 4. The lookup tables used in check-node updating and variable-node updating are denoted as  $LUT_c$  and  $LUT_v$ , respectively. The two lookup tables will then store the labels of the sub-blocks in  $\mathbf{H}_{base}$  that are involved in the updating process. Besides, another lookup table  $LUT_{sub}$  will be used to store the "shift numbers<sup>4</sup>" of the sub-matrices in each sub-block.

**Example 2.** The  $LUT_c$  and  $LUT_v$  for the LDPCCC in Figure 4 are

$$LUT_c = \begin{bmatrix} 1 & 2 & 3 & 0 \\ 6 & 7 & 4 & 5 \\ 11 & 8 & 9 & 10 \\ 12 & 13 & 14 & 15 \end{bmatrix}$$
(4)

4. For a QC-LDPC base matrix, the information is the "shift number" of each  $p \times p$  sub-matrix (-1 represents the all-zero matrix, 0 represents the identity matrix, l represents cyclically right-shifting the identity matrix l times).



0

2 3

Fig. 5: Illustration of the procedures of a LPDCCC decoder. The horizontal line denotes the updating of the row. The vertical line denotes the updating of a column.

and

$$LUT_v = \begin{vmatrix} 0 & 4 & 8 & 12 \\ 5 & 9 & 13 & 1 \\ 10 & 14 & 2 & 6 \\ 15 & 3 & 7 & 11 \end{vmatrix} .$$
 (5)

#### 3.2.2 Decoding Procedures

Based on the discussion in Section 2.5, the detailed decoding procedures are shown in the Appendix.

#### 3.2.3 Parallel Thread Hierarchy

As described in Sect. 3.2.1, the memory associated with each entry in the H matrix is a message package containing  $\Gamma$  messages from  $\Gamma$  code streams. So there is a straightforward mapping between the thread hierarchy and the data structure. In the check-node-updating kernel (or variable-updating-kernel), a two dimensional thread hierarchy of size  $I \cdot (c - b) \times \Gamma$  (or  $I \cdot c \times \Gamma$ ) is launched, where (c - b) (or c) is mapped to the total number of check nodes (or variable nodes) being updated in *I* processors. The size of one of the dimensions (i.e.,  $\Gamma$ ) is mapped to the number of code streams. Like in LDPC decoder,  $\Gamma$  will be configured as the *threadIdx.x* dimension and (c - b) (or c) will be the *threadIdx.y* dimension in the CUDA thread hierarchy. The  $\Gamma$  threads in the *threadIdx.x* dimension is contiguous and will access the  $\Gamma$  data in each message package for coalesced access.

#### 3.3 CPU-based LDPC and LDPCCC Decoders

We implement both the serial CPU-based LDPC decoder and LDPCCC decoder using the C language. As CPUs with multiple cores are very common nowadays, we further implement a multi-thread CPU-based LDPCCC decoder using OpenMP. OpenMP [30] is a portable, scalable programming interface for shared-memory parallel computers. It can be used to explicitly direct multithreaded, shared memory parallelism. A straightforward application of the OpenMP is to parallize the intensive

|                                 | CPU           | GPU                                                |
|---------------------------------|---------------|----------------------------------------------------|
| Platform                        | Intel Xeon    | Nvidia GTX460                                      |
| Number of cores                 | 4             | 7 × 48 = 336                                       |
| Clock rate                      | 2.26 GHz      | 0.81 GHz                                           |
| Memory                          | 8 GB DDR3 RAM | 768 MB global<br>memory and 48 KB<br>shared memory |
| Maximum<br>number of<br>threads | 8             | _                                                  |
| Maximum<br>thread-block<br>size | _             | 1024 threads                                       |
| Programming<br>language         | C/OpenMP      | CUDA C                                             |

TABLE 1: Simulation environments.

| Code | $J \times L$  | p    | $c \times (c - b)$ | Number of<br>Edges |
|------|---------------|------|--------------------|--------------------|
| А    | $4 \times 24$ | 422  | $2532 \times 422$  | 40512              |
| В    | $4 \times 24$ | 632  | $3792 \times 632$  | 60672              |
| С    | $4 \times 24$ | 768  | $4608\times768$    | 73728              |
| D    | $4 \times 24$ | 1024 | $6144 \times 1024$ | 98304              |

TABLE 2: Parity-check matrices of the QC-LDPC codes used in the LDPC decoder. They are also used to derive the LDPCCCs A' to D'.

loop-based code with the *#pragma omp parallel for* directive. Then the executing threads will be automatically allocated to different cores on a multi-core CPU.

The horizontal step and the vertical step in Algorithm **??** involve intensive computing. On a single-core CPU, the updating of the different nodes are processed with a serial *for* loop. Since the updating of different nodes can be performed independent of one another, it is ideal to parallelize the *for* loop with the *#pragma omp* parallel for directive in the OpenMP execution on a multicore CPU. Hence, in our implementation, we issue multiple threads to both the updating of the check nodes (**??**) and the updating of the variable nodes (**??**) in the multi-thread CPU-based LDPCCC decoder.

#### 4 RESULTS AND DISCUSSION

#### 4.1 The Experimental Environment

The CPU being used is an Intel Xeon containing 4 cores. Moreover, it can handle up to 8 threads at a time. The serial CPU-based decoders are developed using C and the multi-threaded CPU-based LDPCCC decoder is developed using OpenMP. Note that for the serial CPUbased decoders, only one of the 4 cores in the CPU will be utilized. The GPU used in this paper is a GTX460 containing 336 cores and the GPU-based decoders are

| A 2832 6 2.12 4058 1270 313 14           | 8 |
|------------------------------------------|---|
| B   12768 37 2.9   11664 5350 458   15   | 8 |
| C   21664 74 3.4   20046 10950 546   16  | 1 |
| D   82624 371 4.5   70843 51580 728   16 | 2 |

TABLE 3: Decoding time for the GPU-based LDPC decoder and the serial CPU-based decoder at  $E_b/N_0=3.2$ dB. 30 iterations are used. *C* represents the total number of decoded codewords; *T* denotes the total simulation time and *t* is the average simulation time per codeword.

developed using CUDA C. Furthermore, in our simulations, 32 codewords are decoded simultaneously in the GPU decoders, i.e.,  $\Gamma = 32$ . Details of the CPU and GPU used in our simulations are presented in Table 1.

Table 2 shows the characteristics of the QC-LDPC codes under test. For Code A to code D, J = 4 and L = 24 thus giving the same code rate of (24 - 4)/24 = 5/6. These codes are further used to derived regular LDPCCCs. In order to avoid confusion, we denote the derived LDPCCCs as Code A' to Code D'. It can be readily shown that the (3, 4, 24)-LDPCCCs A' to D' have the same code rate of 5/6.

**Remark.** Note that although QC-LDPC codes are adopted in the simulation, the new GPU-based LDPC decoder is able to decode other LDPC codes like randomly-constructed regular or irregular codes.

#### 4.2 The Decoding Time Comparison

In order to optimize the speed and to minimize the data transfer between the CPU (host) and the GPU (device), we generate and process the data, including the codeword and the AWGN noise, directly on the GPU. After hard decisions have been made on the received bits, the number of error bits are counted at the GPU using a "reduce program". Subsequently, the number is transferred to the CPU. Since the data transfer occurs only at the end of the iterative decoding process, the transfer time (overhead) is very small (less than 2%) compared with time spent in the whole decoding process.

In the following, we fix the number of decoding iterations and the simulation terminates after 100 block/frame errors are received. By recording the total number of blocks/frames decoded and the total time taken<sup>5</sup>, we can compute the average time taken to decode one block/frame.

<sup>5.</sup> In the case of the GPU-based decoders, the total time taken includes the GPU computation time, the time spent in transferring data between the CPU and GPU, etc. However, as explained above, the GPU computation time dominates the total time while the overhead is very small.

| Code | Number of threads used |    |    |    |    |  |
|------|------------------------|----|----|----|----|--|
|      | 1                      | 2  | 4  | 6  | 8  |  |
| A′   | 39                     | 20 | 11 | 10 | 9  |  |
| C′   | 73                     | 38 | 21 | 19 | 17 |  |

TABLE 4: Average LDPCCC decoding time (ms) per code frame for the quad-core CPU-based decoder when different numbers of threads are used.

#### 4.2.1 LDPC decoders

The GPU-based decoder and the serial CPU-based decoder are tested with 30 iterations at a  $E_b/N_0$  of 3.2 dB. Table 3 shows the number of transmitted codewords and the simulation times for different codes.

We consider the average time for decoding one codeword for the serial CPU-based decoder, i.e.,  $t_{CPU}$ . We observe that  $t_{CPU}$  increases from Code A to Code D due to an increasing number of edges in the codeword. Further, we consider the average time for decoding one codeword for the GPU-based decoder, i.e.,  $t_{GPU}$ . Similar to the serial CPU-based decoder,  $t_{GPU}$  increases from Code A to Code D.

Finally, we compare the simulation times of the serial CPU-based decoder and the GPU-based decoders by taking the ratio  $t_{\text{CPU}}/t_{\text{GPU}}$ . The results in Table 3 indicate that the GPU-based decoder accomplishes speedup improvements from 148 times to 162 times compared with the serial CPU-based decoder.

#### 4.2.2 LDPCCC decoders

We decode the LDPC convolutional codes A' to D' at a  $E_b/N_0$  of 3.1 dB with I = 20. First, we show the average decoding times for Code A' and Code C' when different numbers of threads are used in the CPU-based decoders. The results are shown in Table 4. The serial CPU-based decoder corresponds to the case with a single thread. We observe that the decoding time is approximately inversely proportional to the number of threads used up to 4 threads. However, the time does not improve much when the number of threads increases to 6 or 8. The reason is as follows. The CPU being used has 4 cores, which can execute up to 4 tasks in fully parallel. Hence, compared with using a single thread, there is an almost 4 times improvement when 4 threads are used. As the number of threads increases beyond 4, however, the tasks of the threads will be scheduled. But a maximum of 4 threads can be executed on the 4 processors at the same time. Consequently, further time improvement is small when more than 4 threads are used.

Next, we compare the decoding times of the LDPCCC decoders when GPU-based and CPU-based decoders are used to decode Code A' to Code D'. For the CPU-based decoders, we consider the cases where a single thread and 8 threads are used, respectively. Table 5 shows the results. As explained above, limited by the number of cores (4 only) in the CPU, the CPU-based decoder can

only improve the speed by about 4 times even when the number of threads increases from 1 to 8. We also observe that compared with the serial CPU-based decoder, the GPU-based LDPCCC decoder can achieve 170 to 200 times speedup improvement. Compared with the 8-thread CPU-based decoder, the GPU-based LDPCCC decoder can also accomplish 39 to 46 times speedup improvement.

#### 5 CONCLUSION

In this paper, efficient decoders for LDPC codes and LDPC convolutional codes based on the GPU parallel architecture are implemented. By using efficient data structure and thread layout, the thread divergence is minimized and the memory can be accessed in a coalesced way. All decoders are flexible and scalable. First, they can decode different codes by changing the parameters. Hence, the programs need very little modification. Second, they should be to run on the latest or even future generations of GPUs which possess more hardware resources. For example, if there are more cores/memory in the GPU, we can readily decode more codes, say  $\Gamma = 64$ codes as compared with  $\Gamma = 32$  codes used in this paper, at the same time. These are actually advantages of GPU parallel architecture compared to other parallel solutions including FPGA or VLSI. We will report our results in the future when we have the opportunity to run our proposed mechanism in other GPU families.

Compared with the traditional serial CPU-based decoders, results show that the proposed GPU-based decoders can achieve  $100 \times$  to  $200 \times$  speedup. The actual time depends on the particular codes being simulated. When compared with the 8-thread CPU-based decoder, the GPU-based decoder can also accomplish 39 to 46 times speedup improvement. Thus the simulation time can be reduced from months to weeks or days when a GPU-based decoder is used. In summary, our results show that the proposed GPU-based LDPC/LDPCCC decoder has obvious advantages in the decoding time compared with CPU-based decoders.

#### REFERENCES

- R. G. Gallager, Low-Density Parity-Check Codes. The MIT Press, Sep. 1963.
- [2] D. MacKay, "Good error-correcting codes based on very sparse matrices," *Information Theory, IEEE Transactions on*, vol. 45, no. 2, pp. 399–431, 1999.
  [3] I. Djordjevic, M. Cvijetic, L. Xu, and T. Wang, "Using LDPC-
- [3] Î. Djordjevic, M. Cvijetic, L. Xu, and T. Wang, "Using LDPC-Coded modulation and coherent detection for ultra highspeed optical transmission," *Lightwave Technology, Journal of*, vol. 25, no. 11, pp. 3619–3625, 2007.
- [4] Y. Miyata, K. Sugihara, W. Matsumoto, K. Onohara, T. Sugihara, K. Kubo, H. Yoshida, and T. Mizuochi, "A triple-concatenated FEC using soft-decision decoding for 100 Gb/s optical transmission," in Optical Fiber Communication (OFC), collocated National Fiber Optic Engineers Conference, 2010 Conference on (OFC/NFOEC), 2010, pp. 1–3.
- [5] Y. Chen and D. Hocevar, "A FPGA and ASIC implementation of rate 1/2, 8088-b irregular low density parity check decoder," in *Global Telecommunications Conference*, 2003. GLOBECOM '03. IEEE, vol. 1, 2003, pp. 113–117 Vol.1.

| Code | $C_{\rm GPU}$ | $T_{ m GPU}$ (s) | $t_{ m GPU}$ (ms) | $C_{\rm CPU}$ | $T_{\rm CPU-1}$ (s) | $t_{ m CPU-1}$ (ms) | $T_{\rm CPU-8}$ (s) | $t_{ m CPU-8} \ (ms)$ | $\frac{t_{\rm CPU-1}}{t_{\rm CPU-8}}$ | $\frac{t_{\rm CPU-1}}{t_{\rm GPU}}$ | $\frac{t_{\rm CPU-8}}{t_{\rm GPU}}$ |
|------|---------------|------------------|-------------------|---------------|---------------------|---------------------|---------------------|-----------------------|---------------------------------------|-------------------------------------|-------------------------------------|
| А    | 3136          | 0.73             | 0.23              | 2846          | 112                 | 39                  | 28                  | 9                     | 4.3                                   | 170                                 | 39                                  |
| В    | 6272          | 1.95             | 0.31              | 5716          | 345                 | 60                  | 79                  | 14                    | 4.3                                   | 194                                 | 45                                  |
| С    | 14400         | 5.4              | 0.38              | 13303         | 976                 | 73                  | 230                 | 17                    | 4.3                                   | 192                                 | 45                                  |
| D    | 43680         | 21.0             | 0.48              | 37451         | 3590                | 96                  | 834                 | 22                    | 4.4                                   | 200                                 | 46                                  |

TABLE 5: Decoding time for the GPU-based LDPCCC decoder and the CPU-based decoders at  $E_b/N_0$ =3.1 dB. I = 20 processors are used. C represents the total number of decoded frames; T denotes the total simulation time and t is the average simulation time per frame. CPU-1 and CPU-8 denote the use of 1 thread and 8 threads, respectively, in the CPU-based decoder.

- [6] I. B. Djordjevic, M. Arabaci, and L. L. Minkov, "Next generation FEC for High-Capacity communication in optical transport networks," *Journal of Lightwave Technology*, vol. 27, no. 16, pp. 3518– 3530, 2009.
- [7] B. Levine, R. R. Taylor, and H. Schmit, "Implementation of near Shannon limit error-correcting codes using reconfigurable hardware," 2000.
- [8] A. Pusane, A. Feltstrom, A. Sridharan, M. Lentmaier, K. Zigangirov, and D. Costello, "Implementation aspects of LDPC convolutional codes," *Communications, IEEE Transactions on*, vol. 56, no. 7, pp. 1060–1069, 2008.
- [9] S. Bates, Z. Chen, L. Gunthorpe, A. Pusane, K. Zigangirov, and D. Costello, "A low-cost serial decoder architecture for lowdensity parity-check convolutional codes," *Circuits and Systems I: Regular Papers, IEEE Transactions on*, vol. 55, no. 7, pp. 1967–1976, Aug. 2008.
- [10] Z. Chen, S. Bates, and W. Krzymien, "High throughput parallel decoder design for LDPC convolutional codes," in *Circuits and Systems for Communications*, 2008. ICCSC 2008. 4th IEEE International Conference on, May 2008, pp. 35–39.
- [11] R. Swamy, S. Bates, and T. Brandon, "Architectures for ASIC implementations of low-density parity-check convolutional encoders and decoders," in *Proc. IEEE Int. Symp. Circuits and Systems ISCAS* 2005, 2005, pp. 4513–4516.
- [12] C. W. Sham, X. Chen, F. C. M. Lau, Y. Zhao, and W. M. Tam, "A 2.0 Gb/s Throughput Decoder for QC-LDPC Convolutional Codes," *IEEE Transactions on Circuits and Systems I*, to appear.
- [13] G. Falcao, V. Silva, and L. Sousa, "How GPUs can outperform ASICs for fast LDPC decoding," in *Proceedings of the 23rd international conference on Supercomputing*. Yorktown Heights, NY, USA: ACM, 2009, pp. 390–399.
- [14] H. Ji, J. Cho, and W. Sung, "Massively parallel implementation of cyclic LDPC codes on a general purpose graphics processing unit," in *Signal Processing Systems*, 2009. SiPS 2009. IEEE Workshop on. IEEE, 2009, pp. 285–290.
- [15] —, "Memory access optimized implementation of cyclic and quasi-cyclic LDPC codes on a GPGPU," *Journal of Signal Processing Systems*, pp. 1–11, 2010.
- [16] G. Falcao, L. Sousa, and V. Silva, "Massive parallel LDPC decoding on GPU," in *Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming*. Salt Lake City, UT, USA: ACM, 2008, pp. 83–90.
  [17] —, "Massively LDPC decoding on multicore architectures,"
- [17] —, "Massively LDPC decoding on multicore architectures," *IEEE Transactions on Parallel and Distributed Systems*, vol. 22, no. 2, pp. 309–322, Feb. 2011.
- [18] A. J. Felstrom and K. Zigangirov, "Time-varying periodic convolutional codes with low-density parity-check matrix," *Information Theory, IEEE Transactions on*, vol. 45, no. 6, pp. 2181–2191, 1999.
- [19] M. Tavares, E. Matus, S. Kunze, and G. Fettweis, "A dual-core programmable decoder for LDPC convolutional codes," in *Circuits and Systems*, 2008. ISCAS 2008. IEEE International Symposium on, May 2008, pp. 532 –535.
  [20] E. Matus, M. Tavares, M. Bimberg, and G. Fettweis, "Towards a
- [20] E. Matus, M. Tavares, M. Bimberg, and G. Fettweis, "Towards a GBit/s programmable decoder for LDPC convolutional codes," in *Circuits and Systems*, 2007. ISCAS 2007. IEEE International Symposium on, May 2007, pp. 1657 –1660.
- [21] R. Tanner, "A recursive approach to low complexity codes," Information Theory, IEEE Transactions on, vol. 27, no. 5, pp. 533– 547, 1981.

- [22] M. Fossorier, "Quasicyclic low-density parity-check codes from circulant permutation matrices," *Information Theory, IEEE Transactions on*, vol. 50, no. 8, pp. 1788–1793, 2004.
- [23] W. M. Tam, F. C. M. Lau, and C. K. Tse, "A class of QC-LDPC codes with low encoding complexity and good error performance," *Communications Letters, IEEE*, vol. 14, no. 2, pp. 169–171, 2010.
- [24] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," *Information Theory, IEEE Transactions on*, vol. 47, no. 2, pp. 619– 637, 2001.
- [25] T. Richardson and R. Urbanke, "Efficient encoding of low-density parity-check codes," *Information Theory, IEEE Transactions on*, vol. 47, no. 2, pp. 638–656, 2001.
- [26] X. Hu, E. Eleftheriou, D. Arnold, and A. Dholakia, "Efficient implementations of the sum-product algorithm for decoding LDPC codes," in *Global Telecommunications Conference*, 2001. GLOBECOM '01. IEEE, vol. 2, 2001, pp. 1036–1036E vol.2.
- [27] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Hu, "Reduced-Complexity decoding of LDPC codes," *Communications, IEEE Transactions on*, vol. 53, no. 8, pp. 1288–1299, 2005.
- [28] R. Tanner, D. Sridhara, A. Sridharan, T. Fuja, and D. Costello, "LDPC block and convolutional codes based on circulant matrices," *Information Theory, IEEE Transactions on*, vol. 50, no. 12, pp. 2966–2984, 2004.
- [29] A. E. Pusane, R. Smarandache, P. O. Vontobel, and D. J. Costello, "On deriving good LDPC convolutional codes from QC-LDPC block codes," in *Proc. IEEE Int. Symp. Information Theory ISIT 2007*, 2007, pp. 1221–1225.
- [30] R. Chandra, Parallel programming in OpenMP. Morgan Kaufmann, 2001.
- [31] M. Lentmaier, D. G. M. Mitchell, G. P. Fettweis, and D. J. Costello, "Asymptotically regular LDPC codes with linear distance growth and thresholds close to capacity," in *Proc. Information Theory and Applications Workshop (ITA)*, 2010, pp. 1–8.
- [32] C. Nvidia, "Compute Unified Device Architecture Programming Guide Version 4.0," NVIDIA Corporation, Tech. Rep., 2011.
- [33] W. Nvidia, N. Generation, and C. Compute, "Whitepaper nvidia's next generation cuda compute architecture," *ReVision*, pp. 1–22, 2009.
- [34] F. Kschischang, B. Frey, and H. Loeliger, "Factor graphs and the sum-product algorithm," *Information Theory, IEEE Transactions on*, vol. 47, no. 2, pp. 498–519, 2001.



Yue Zhao received the BE degree in information Engineering from Shanghai Jiaotong University, China in 2009. He was a postgraduate student and research assistant at the Hong Kong Polytechnic University, Hong Kong, from 2009 to 2012, where he was working on algorithms and implementations for the LDPC decoding. He is currently working at the Qualcomm research center, Beijing, China.



Francis C.M. Lau (M'93–SM'03) received the BEng (Hons) degree in electrical and electronic engineering and the PhD degree from King's College London, University of London, UK, in 1989 and 1993, respectively.

He is a Professor and Associate Head at the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong. He is also a fellow of IET and a senior member of IEEE. He is the co-author of *Chaos-Based Digital Communication Systems* 

(Heidelberg: Springer-Verlag, 2003) and *Digital Communications with Chaos: Multiple Access Techniques and Performance Evaluation* (Oxford: Elsevier, 2007). He is also a co-holder of three US patents and one pending US patent. He has published over 230 papers. His main research interests include channel coding, cooperative networks, wireless sensor networks, chaos-based digital communications, applications of complex-network theories, and wireless communications.

He served as an associate editor for *IEEE Transactionson Circuits* and Systems II in 2004–2005 and *IEEE Transactions on Circuits* and Systems I in 2006–2007. He was also an associate editor of *Dynamics* of *Continuous*, *Discrete and Impulsive Systems*, *Series B* from 2004 to 2007, a co-guest editor of *Circuits*, *Systems and Signal Processing* for the special issue "Applications of Chaos in Communications" in 2005, and an associate editor for IEICE Transactions (Special Section on Recent Progress in Nonlinear Theory and Its Applications) in 2011. He has been a guest associate editor of *International Journal and Bifurcation and Chaos* since 2010 and an associate editor of *IEEE Circuits and Systems Magazine* since 2012.