Bitslicing vs. vslicing vs. hslicing
Terminology. Whenever the slicing direction (i.e., vertical or horizontal) is unimportant, we talk about mslicing (assuming m > 1) and we call slicing the technique encompassing both bitslicing and mslicing.
Figure 1.14 provides a visual summary of all slicing forms using RECTANGLE as an example. All these slicing forms share the same basic properties: they are constant-time, they rely on data-parallelism to increase throughput, and they use some sort of transposition to convert the data into a layout suitable to their model. However, each slicing form has its own strengths and weaknesses:
• transposing data to hslicing or vslicing usually has a negligible cost compared to a full cipher run, while a bitslice transposition is much more expensive.
• bitslicing introduces a lot of register spilling, which reduces its potential performance; hslicing and vslicing are largely unaffected by this issue.
• only vslicing is a viable option on ciphers using arithmetic operations, since it can use SIMD arithmetic instructions. As shown in Section 1.2.5, trying to implement those instructions in bitslicing (or hslicing) is suboptimal.
• bitslicing provides zero-cost permutations, unlike hslicing and vslicing, since any permutation can be performed at compile time. In hslicing, intra-register permutations can still be done at run time with a single SIMD shuffle instruction.
• bitslicing requires much more parallel data to reach full throughput than hslicing and vslicing. On RECTANGLE, for instance, 8 independent inputs are required to maximize the throughput on SSE registers using mslicing, while 128 inputs are required when using bitslicing.
• both vslicing and hslicing rely on vector instructions, and are thus only available on CPUs with SIMD extensions, while bitslicing does not require any hardware-specific instructions. On general-purpose registers, bitslicing is thus usually more efficient than mslicing.
Single Native Packed Addition (native single)
According to Intel’s manual, additions (SSE, AVX and GP) should execute in one cycle. There should be no overhead from the loop: looping requires incrementing a counter and doing a conditional jump. Those two instructions get macro-fused together and execute on port 0 or 6. The SSE and AVX additions can execute on either port 0, 1 or 5, and can thus be done at the same time as the loop increment/jump.
Experimentally, we achieve a latency of 1 cycle/iteration, regardless of the size of the addition (8-bit, 16-bit or 32-bit). On GP registers, this means a throughput of 1 addition per cycle. On SSE (resp., AVX) registers, however, a single packed padd (resp., vpadd) instruction performs multiple additions, thus reducing the cost per addition: 0.25 (resp., 0.13) cycles for 32-bit, 0.13 (resp., 0.06) cycles for 16-bit and 0.06 (resp., 0.03) cycles for 8-bit.
Parallel Native Packed Additions (native parallel)
Since SSE and AVX addition can execute on either port 0, 1 or 5, three of them could be done in a single cycle, provided that they do not suffer from data dependencies (i.e., none uses the output of another one). The increment and jump of the loop would fuse and be executed on port 6, independently of the additions.
Once again, this corresponds to the numbers we observe experimentally: 1 cycle/iteration regardless of the size of the addition. On GP, SSE and AVX registers, 3 additions are executed per cycle, which translates into three times the throughput of native single.
Bitsliced Addition (bitslice)
Recall that the n-bit ripple-carry adder contains n full-adders. In practice, 3 operations can be omitted from the first full-adder, since it always receives a 0 as input carry. Likewise, 3 more operations can be saved from the last full-adder, since its output carry is never used and does not need to be computed. We did not have to implement those optimizations by hand since Clang is able to find them by itself. The total number of operations of this n-bit ripple-carry adder is therefore 5n − 6. Those operations are solely bitwise instructions, which can be executed on ports 0, 1 and 5 on SSE and AVX registers (and on port 6 as well on GP registers). This adder is therefore bound to run in at least (5n − 6)/3 cycles on SSE and AVX registers, and (5n − 6)/4 cycles on GP registers.
In practice, this number can never be reached because a full-adder contains inner data dependencies that prevent it from executing at a rate of 3 instructions per cycle. However, when computing consecutive full-adders, a full-adder can start executing before the previous one is done, thus bypassing this dependency issue. The limiting factor then becomes the CPU scheduler, which cannot hold arbitrarily many µops, and limits the out-of-order execution. The larger the adder, the more full-adders (and thus µops) it contains, and the more the scheduler will be saturated.
Translation to PseudoC
We formalize an imperative language close to C as our target, which we shall call PseudoC. PseudoC is inspired by the Clight [82, 81] language, a subset of C, used as source language for the CompCert compiler. However, PseudoC is a stripped down variant of Clight: we do not include for loops, pointer arithmetic, structures or arrays. The semantics of operators in PseudoC also differs from Clight since we do not fix the underlying architecture.
Figure 3.8 describes the syntax of PseudoC. A PseudoC program is a list of function declarations. Functions are of type void (returned values are always passed back through pointers). Functions contain two kinds of typed variable declarations: parameters and local variables. Similarly to Clight, we prevent function calls from appearing within expressions. Expressions are thus free of side effects (since no primitive performs any side effect). Primitives are annotated with an integer n representing the size of their argument (e.g., 128 for a 128-bit SSE vector, 256 for a 256-bit AVX vector), as well as an integer m containing the size of the atoms they manipulate. For instance, the primitive + annotated with n = 128 and m = 8 denotes a vector addition performing 16 8-bit additions.
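The following fragment sketches what such a program could look like (the concrete syntax, the `u128` type, and the function name `add_state` are assumptions of ours for illustration; Figure 3.8 gives the actual grammar):

```
// Hypothetical PseudoC fragment: a void function returning its result
// through a pointer, using the primitive + annotated with n = 128 and
// m = 8, i.e., 16 independent 8-bit additions in one 128-bit vector.
void add_state (u128 x, u128 y, u128 *r) {
  u128 t;            // local variable declaration
  t = x +_{128,8} y; // annotated primitive; no side effects
  *r = t;            // result passed back through a pointer
}
```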