THE INTEREST OF PARALLELISM IN THE TURBO DECODING PROCESS
The decoding of turbo codes is carried out through an iterative process where two SISO decoders, operating in natural and interleaved order, exchange extrinsic information. The turbo decoding approach originally proposed in  is sequential: a half iteration in the interleaved (natural) order is performed only when the natural (interleaved) half iteration is completed. This sequential decoding approach is illustrated in Figure 2.1, where SISO1 and SISO2 decoders are assigned to the natural and the interleaved constituent convo-lutional codes, respectively1 . In this figure, the iterative process starts by performing a half iteration in the natural order. However, as no preference exists, the interleaved half iteration can be also executed in the first place. For the first half iteration, the SISO1 decoder a-priori information is set to a constant value such as zero. Afterwards, the first natural order half iteration is performed. Then, the interleaved order half iteration starts. During this half iteration the SISO2 decoder exploits the extrinsic information previously produced by the SISO1 decoder to improve its error correcting performance. The decoding process continues in this way for a fixed number of half iterations, or until the conditions established by a stop criterion are met. In Figure 2.1, the time required to transmit the extrinsic information is explicitly considered. This time depends on the turbo decoder architecture. Moreover, the transmission of extrinsic information can start during the operation of each SISO decoder, i.e, it does not necessary starts once each half iteration is completed.
This sequential decoding approach presents two main drawbacks when high through-put is required. First of all, since the half iterations cannot be executed simultaneously, due to the data dependency imposed by the exchange of extrinsic information, the de-coding of an information frame is limited by the time required to execute a half iteration. Secondly, the SISO decoder algorithms contain, as presented in section 1.3, recursive operations that prevent a parallel execution of each half iteration. To overcome these constraints, several parallel decoding techniques have been investigated in the literature.
The benefits provided by a parallel turbo decoding technique cannot be only estab-lished by considering the gain in terms of throughput. Some others aspects such as the flexibility, the architecture hardware complexity and the power consumption, have also to be taken into account. Consequently, suitable metrics should be defined in order to correctly assess the parallelism techniques.
PARALLEL ARCHITECTURE EVALUATION
The main objective of our work is to design high speed turbo decoders. Hence, the turbo decoder throughput is regarded as the main characteristic to optimize. Additionally, to validate an architectural solution, two other criteria are also considered: error decoding performance and hardware complexity. Flexibility and power consumption are not con-sidered in our work. However, the cost in terms of power consumption is related to the hardware complexity.
ALGORITHMIC COMPLEXITY ANALYSIS
With regard to the hardware implementation, there is not an unique measure of the algorithmic complexity. Its definition has to be adapted depending on the application domain where the system designed will be applied. The most common approach to evaluate the algorithmic complexity consists in determining the number of elementary operations performed by the algorithm. In , this approach is followed in order to establish the complexity of the Max-Log-MAP algorithm. In this case, additions and comparison-selections correspond to the elementary operations. In , in the con-text of iterative processing, a normalization technique was applied where each addition, subtraction and multiplication is converted into the equivalent number of one-bit full additions. Therefore, the complexity of the receiver is expressed as a multiple of the complexity of a one-bit full adder block. In , the algorithmic complexity is deter-mined by the number of algorithmic operations that should be performed per time unit. It is then expressed in Giga Operations Per second (GOPs). In , two metrics are proposed to compare the algorithmic complexity of diﬀerent Forward Error Correction (FEC) schemes: one based on the arithmetic operations and the other on the storage requirements per decoded bit.
ARCHITECTURAL COMPLEXITY ANALYSIS
The architectural complexity can be defined as the number of hardware resources, related to a specific target technology, required to implement a given algorithm. For instance, if an Application Specific Integrated Circuit (ASIC) target is considered, the area (in mm2 ) is a direct measure of the architecture complexity. This area can also be given in terms of the number of equivalent logic gates of the overall circuit. If a Field Programmable Gate Array (FPGA) target is used, the complexity is therefore measured as the number of functional units in the device that are occupied (RAM blocks, LUTs, Flip-Flops, etc).
It is possible to estimate the hardware complexity from the algorithmic complexity. In this case, the complexity of each elementary operation is first established. Then, the equivalent hardware complexity is computed by multiplying the number of operations by their respective complexities. However, this approach is inaccurate because it assumes that one hardware resource is assigned to each operation. This inaccuracy is specially true for applications dominated by data transfer and with high storage requirements, where the arithmetic computations play a secondary role in the overall complexity. In this case, the hardware complexity should be obtained after logic synthesis.
ERROR DECODING PERFORMANCE
As presented in section 1.1.4, the performance of a digital communication system is expressed in terms of its BER and FER for diﬀerent SNR values. A turbo decoder improves its performance during the iterative process. Classically, if no stop criterion is applied, 6 to 8 iterations are performed by a turbo decoder, providing a good tradeoﬀ between the error correction capabilities and the time required to decode a frame. When parallel turbo decoding process is considered, decoding performance degradation may appear. To limit this degradation, additional iterations may be required, impacting negatively the turbo decoder throughput.
In the context of high throughput architectures, a high parallelism level is neces-sary. The designer should therefore especially consider the adverse impact on the error decoding performance. In the literature, as presented hereinafter, some techniques are proposed to palliate the performance degradation that may appear, keeping as low as possible the number of additional iterations. However, in some cases, if high throughput is the most important constraint to achieve, some degradations with respect to a non parallel architecture may be accepted. It depends on the design specifications in accord with the application domain. In our study, we have defined a reference turbo decoder architecture that does not exhibit any parallelism degree. Its decoding performance is established for a fixed number of iterations. Then, the performance of the parallel architectures is compared to that of this reference architecture.
METRICS ASSOCIATED TO THE ARCHITECTURE DESIGN
The use of some metrics helps to perform a fair comparison between diﬀerent archi-tectures so that the most appropriate solution can be selected. The use of algorithmic complexity metrics for this purpose is not suﬃcient because important implementation issues are not considered.
As stated in , the comparison of diﬀerent decoder architectures is a challenging task. This is specially true when a comparison of published decoder architectures is made, since all the design information is usually not available. We introduce some useful definitions to describe the turbo decoder architectures as follows. Let I denote the number of iterations executed by a turbo decoder, and t the time necessary to decode a frame. The architecture throughput, defined as the number of decoded bits per time unit, depends on t. It can be expressed as: T hroughput = T h = m · L [M bit/s] (2.1) t where m · L corresponds to the number of bits in a frame2 . Let us consider a reference turbo decoder architecture with T hr and tr the throughput and the frame decoding time, respectively. This reference architecture does not have any parallelism degree. Now, consider a parallel architecture with values T hp and tp. For these two architectures, we denote the relative acceleration as: A = tr (2.2) tp Let Cr and Cp be the architectural hardware complexities for the reference and the parallel architecture, respectively. In order to achieve an increase in the turbo decoder throughput (A > 1), the parallel architecture presents a higher hardware complexity. We define the relative cost in terms of the hardware complexity between both architectures as follows:
Table of contents :
List of Figures
List of Tables
1 Context of Channel Coding
1.1 Channel Coding generalities
1.1.2 Channel Model
1.1.3 Channel Decoding Process
1.1.4 Channel Code Performance
1.1.5 Soft Decoding
1.1.6 Block Codes
1.2 Convolutional Codes
1.2.2 Polynomial Representation
1.2.3 Trellis Diagram Representation
1.2.5 Convolutional Code Termination
1.3 Decoding of Convolutional Codes
1.3.1 Viterbi Based Decoding
1.3.2 MAP Based Decoding
1.4 Convolutional Turbo Codes
1.4.2 Turbo Encoding
1.4.3 Turbo Decoding
2 Parallel Processing Exploration for Turbo Decoding
2.1.1 The Interest of Parallelism in the Turbo Decoding Process .
2.1.2 Parallel Architecture Evaluation
2.2 Parallel Processing in Turbo Decoding
2.2.1 Parallelism at the Turbo Decoder Level
2.2.2 Parallelism at the SISO Decoder Level
2.2.3 Parallelism at the Metric Level
2.2.4 Gathering Parallel Turbo Decoding Techniques
2.3 Convergence of Parallel Turbo Decoders
2.3.1 The EXIT Charts
2.3.2 EXIT Chart Diagram Extension
2.3.3 EXIT Chart Based Analysis
2.4 Exploring SOVA Based Turbo Decoders
2.4.2 Improving the SOVA Based Turbo Decoder Performance .
3 High Throughput SISO Decoder Architectures
3.2 Radix-2 SISO Decoder Architecture
3.2.1 Branch Metric Unit
3.2.2 Add Compare Select Unit
3.2.3 Soft Output Unit
3.2.4 Implementation Results for the Radix-2 SISO Decoder .
3.3 Exploration of High Radix SISO Decoders
3.3.1 High Radix BMU and SOU
3.3.2 High Radix ACS Units
3.4 High Radix Architectures Complexity Reduction
3.4.1 Low Complexity Radix-16 SISO Decoder
3.4.2 Low Hardware Complexity Radix-16 SISO Decoder Performance .
3.4.3 Implementation Results
4 High Throughput Turbo Decoder Architectures
4.1 Generic Parallel Turbo Decoder Architecture
4.2 Memory Access Conflicts
4.2.1 Conflict-Free Interleavers
4.2.2 Memory Organization for Conflict-Free Interleavers
4.2.3 Shuffled Turbo Decoders Memory Conflict Problems
4.3 LTE High Throughput Turbo Decoder Architecture
4.3.1 SISO Decoder Architecture
4.3.2 Extrinsic Information Memory Access
4.3.3 FPGA Prototyping of the Turbo Decoder
4.4 Turbo Decoder Design Space Exploration
4.4.1 Turbo Decoder Architecture Model
4.4.2 A Dedicated Approach to Explore the Design Space
4.4.3 Case study: Turbo Decoder for LTE Standard
5 Conclusion and Perspectives