Leakage Characterization and Pre-Processing
So far, we have seen that to emulate a sound attacker A in order to evaluate a target, one must estimate the conditional p.m.f. Pr (Z | X). This can be done either manually, thanks to the attacker's preliminary knowledge of the target's behavior, or from leakages measured during a preliminary characterization phase. Estimating the latter conditional p.m.f. Pr (Z | X = x) from measured data is challenging for two reasons.
On the one hand, SCA traces depict physical quantities varying over time. By definition, these can be seen as continuous functions of time, so they cannot be perfectly measured through the acquisition. A high-sampling-rate oscilloscope can nevertheless discretize this signal with high fidelity, but at the cost of high dimensionality in the traces, ranging from several hundred to several million samples, depending on the target implementation.
This feature yields technical challenges concerning the use of some estimation algorithms. As an example, naively applying Gaussian Template Attacks (TA) on raw traces of length D would require estimating the O(D^2) coefficients of the covariance matrices. Hence, such a leakage model scales poorly as the dimensionality of the traces increases.
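To make the scaling issue concrete, the following sketch fits one Gaussian template per class value, in the usual TA fashion (per-class mean vector and covariance matrix). The function name and array layout are illustrative, not taken from a specific library; the point is that each covariance matrix alone holds D × D coefficients.

```python
import numpy as np

def fit_gaussian_templates(traces, labels, n_classes):
    """Fit one Gaussian template (mean, covariance) per class value.

    traces : (N, D) array of leakage traces
    labels : (N,) array of intermediate-variable values in [0, n_classes)
    """
    D = traces.shape[1]
    means = np.zeros((n_classes, D))
    covs = np.zeros((n_classes, D, D))   # O(D^2) coefficients per class
    for z in range(n_classes):
        cls = traces[labels == z]
        means[z] = cls.mean(axis=0)
        covs[z] = np.cov(cls, rowvar=False)
    return means, covs

# With D = 700 samples, each covariance matrix already holds
# 700 * 700 = 490,000 coefficients -- and this per class value.
```

For D in the hundreds of thousands, as raw traces can be, storing and inverting such matrices becomes intractable, which motivates the dimensionality reduction discussed next.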
On the other hand, in our cryptographic contexts, the relevant informative leakage is empirically known to be localized in only a few time samples, the so-called P.o.Is. By relevant, we mean that those P.o.Is statistically depend – marginally or jointly – on a sensitive variable, as formally stated by the following assumption.
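A classical way to locate such P.o.Is during characterization is the signal-to-noise ratio (SNR): for each time sample, the variance of the class-conditional means divided by the mean of the class-conditional variances. The sketch below is a minimal illustration of this idea; the function name and interface are hypothetical.

```python
import numpy as np

def snr_poi(traces, labels, n_classes, n_pois):
    """Rank time samples by SNR = Var_z[E[X|Z=z]] / E_z[Var[X|Z=z]].

    Returns the indices of the n_pois samples with the highest SNR,
    i.e., the candidate P.o.Is.
    """
    cond_means = np.stack([traces[labels == z].mean(axis=0)
                           for z in range(n_classes)])
    cond_vars = np.stack([traces[labels == z].var(axis=0)
                          for z in range(n_classes)])
    snr = cond_means.var(axis=0) / cond_vars.mean(axis=0)
    return np.argsort(snr)[::-1][:n_pois]
```

Samples that carry no sensitive dependency have class-conditional means close to each other, hence an SNR close to zero, while true P.o.Is stand out.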
Randomizing the Primitive Code
In Subsection 3.7.1, we have presented a way to protect an implementation by randomizing the encoding of the sensitive data processed throughout the execution. Though the latter approach may be theoretically sound, it is practically limited by the performance overhead incurred by the counter-measure.
An alternative approach consists in designing counter-measures that are sound against a specific type of attacker – not necessarily the optimal one, but the one most likely to occur in practice. The resulting counter-measures, although not theoretically sound, can represent a particularly efficient alternative from a runtime and memory performance point of view. This is the idea behind randomizing the operations that process the intermediate computations.
In a nutshell, without randomized code, all the traces follow the same pattern and the sensitive intermediate computations leak at the same time coordinates, which makes simple statistical tools particularly effective for an attacker. Introducing randomness into the execution of the elementary operations instead prevents the attacker from perfectly knowing the expected behavior of the traces, unless specific strategies are adopted to mitigate the effects of this randomness. In the following, we review how to implement randomization of the primitive code.
The implementation of a cryptographic primitive can be described at different levels, from the source code – if the target is a software device – to the hardware architecture. This implies that a developer has a large spectrum of scopes at which randomization may be applied to the operations.
At the software or hardware level, round-based cryptographic primitives – such as the AES – may be modified to incorporate dummy rounds. Inside those rounds, elementary operations may themselves be augmented with dummy operations which are not necessary to perform the encryption/decryption. Likewise, for any function looping over independent elementary operations, the latter may be executed in a random order. At a finer granularity, the transcription of the source code into machine code may be randomized thanks to a code polymorphism approach. Additional scopes of randomization are available at the hardware level. The use of an asynchronous architecture [Ren00] makes the device prone to a jitter effect, causing misalignment in the traces. Likewise, the use of dual-rail logic [SS06] enables the target device to smoothen the power consumption or the EM emanations through the execution time, so that the influence of the sensitive intermediate computation is removed. In the remainder of this section, we further describe some of those approaches, which will be investigated in this thesis. The interested reader can also refer to Chapters 7 and 8 of the DPA book by Mangard et al. [MOP07].
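Among these software approaches, shuffling independent operations is the easiest to illustrate: the 16 Sbox lookups of an AES SubBytes step operate on independent state bytes, so they may be executed in any order without changing the result. The sketch below uses a toy 256-entry substitution table standing in for the actual AES Sbox; only the shuffling mechanism is the point.

```python
import random

# Toy 256-entry bijective substitution table standing in for the AES Sbox.
SBOX = [(x * 7 + 3) % 256 for x in range(256)]

def sub_bytes_shuffled(state):
    """Apply the Sbox to all 16 state bytes in a random order.

    A fresh permutation is drawn at every call, so the time coordinate
    at which a given byte leaks changes from one execution to the next,
    while the final state is unaffected.
    """
    out = list(state)
    order = list(range(16))
    random.shuffle(order)
    for i in order:
        out[i] = SBOX[out[i]]
    return out
```

An attacker averaging traces at a fixed time coordinate now sees a mixture over the 16 bytes rather than a single sensitive value, which degrades first-order statistics.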
The ASCAD Dataset
The ANSSI SCA Database (ASCAD) dataset was introduced in 2018 by Benadjila et al. [BPS+19] to provide the SCA community with a benchmark to investigate and compare DL-based attacks. In particular, the aim is to assess to what extent deep learning is relevant to mount attacks against protected implementations.
The Target. The target is a protected software AES-128 implementation running on an ATMega8515, which has an 8-bit AVR architecture. The software aims at protecting against first-order SCA by using a Boolean secret-sharing scheme based on the table re-computation method (cf. Algorithm 3), although the first two bytes of the AES state are left unprotected, for the sake of comparison. Typically, the targeted variable on this dataset is the third byte of the state at the output of the Sbox in the first round, i.e., Z = Sbox(p ⊕ k). The dataset provides 60,000 traces, of which Np = 50,000 are used for profiling and Nv = 10,000 for validation, i.e., for emulating attack phases – see Subsection 3.2.3. For both sets, the same fixed key has been used, while the plaintexts and the shares have been randomly drawn. Whereas the full traces, focused on the first AES round, are made of 100,000 time samples, this thesis will focus on the chunk corresponding to the interval ⟦45,400; 46,100⟧, i.e., D = 700. This window corresponds to the joint leakage of Z ⊕ rout and rout. Three versions of this dataset are available: the first one provides the traces as is, whereas the second and third ones provide the same traces on which an artificial shift of maximum amplitude of respectively 50 and 100 points has been applied. The traces are publicly available at https://github.com/ANSSI-FR/ASCAD.
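The window extraction described above can be sketched as follows. The boundary indices come from the text; the way the artificial shift is drawn is an assumption made for illustration, not the exact procedure used to build the desynchronized dataset versions.

```python
import numpy as np

WINDOW_START, WINDOW_END = 45_400, 46_100   # D = 700 samples

def extract_window(full_trace, max_shift=0, rng=None):
    """Cut the studied 700-sample chunk out of a full 100,000-sample trace.

    max_shift emulates the desynchronized versions of the dataset
    (50 or 100 points) by drawing a random offset for the window start.
    """
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(0, max_shift + 1)) if max_shift else 0
    return full_trace[WINDOW_START + shift : WINDOW_END + shift]
```

Whatever the shift, the extracted chunk keeps the same length D = 700, so downstream models see fixed-size inputs while the leakage position moves inside the window.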
The Empirical Risk Minimization Paradigm
The definition of learnability states whether or not it is possible to find a learning algorithm. However, it does not give any clue about how to find one. An intuitive and generic approach is to rephrase the problem as finding a model which, rather than minimizing the loss over the whole unknown joint distribution of (X, Z), minimizes the loss ℓ(·) over the samples of the profiling set only, a.k.a. the training loss, denoted by L_{S_p}(·). That is:

$$\mathcal{L}_{\mathcal{S}_p}(F) \triangleq \frac{1}{|\mathcal{S}_p|} \sum_{i=1}^{|\mathcal{S}_p|} \ell\left(F(\mathbf{x}_i), z_i\right).$$
This principle is known under the name of Empirical Risk Minimization (ERM), and covers many situations such as linear regression or maximum likelihood estimation. It translates the learning problem into a functional optimization problem – i.e., finding the model F from H minimizing the training loss – that the attacker may directly address.

Soundness of the ERM Principle. The question arising when substituting Problem 2 with ERM is whether the latter is actually a learning algorithm, according to Definition 6. In other words, does one have the guarantee that the more profiling traces, the higher the performance metric of the obtained solution A(Sp)? And if so, what size of profiling set is required in order to reach a satisfying performance, according to Definition 7?
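Concretely, when F outputs a p.m.f. over the candidate values of Z and ℓ is the negative log-likelihood, the training loss above is just the average negative log-probability that the model assigns to the true labels. The following sketch, with an illustrative function name, makes this explicit.

```python
import numpy as np

def empirical_risk(probs, labels):
    """Training loss L_{Sp}(F): mean NLL of the predicted p.m.f.s.

    probs  : (N, K) array, probs[i] = F(x_i), a p.m.f. over the K classes
    labels : (N,) array of true intermediate values z_i
    """
    n = len(labels)
    return -np.log(probs[np.arange(n), labels]).mean()

# Sanity check: a model that always outputs the uniform p.m.f. over
# K = 256 classes attains a training loss of log(256), regardless of
# the labels.
K = 256
uniform = np.full((10, K), 1.0 / K)
z = np.zeros(10, dtype=int)
loss = empirical_risk(uniform, z)   # -> log(256) ≈ 5.545
```

Minimizing this quantity over F ∈ H is precisely the functional optimization problem that ERM hands to the attacker.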
The Fundamental Theorem of Learning, that we present hereafter, aims at addressing this issue. It gives a necessary and sufficient condition on the hypothesis class H for the ERM to be a learning algorithm. This condition relies on the so-called Vapnik-Chervonenkis (VC)-dimension of the considered hypothesis class H, a way to characterize its size. The formal definition of the VC-dimension is beyond the scope of this thesis, but will be briefly discussed after we introduce the following fundamental theorem.
Table of contents:
1 Context, Objectives and Contributions
1.1 Frame of the Thesis
1.2 From the Attacks towards the Evaluation
1.3 Deep Learning based Attacks
1.4 Contributions of this Thesis
2.1 Notations and Conventions
2.2 Recalls in Probability and Statistics
2.3 Recalls in Discrete Mathematics
2.4 Recalls on AES
2.5 Recalls on Vectorial Calculus
3 Side-Channel Attacks
3.1 Definition of a Side-Channel Attack
3.2 Assessing an Attack
3.3 Conditions of an Optimal Attack
3.4 Profiled Attacks
3.5 Unprofiled Attacks
3.6 Leakage Characterization and Pre-Processing
3.8 Overview of the Used Datasets
4 Deep Learning for Side-Channel Analysis
4.1 The Statistical Learning Theory
4.2 The Neural Networks Class Hypothesis
4.3 Implementing the ERM with Neural Networks
4.4 An Overview of the Literature
5 Theoretical Aspects of Deep Learning Based Side-Channel Analysis
5.2 Model Training for Leakage Assessment
5.3 NLL Minimization is PI Maximization
5.4 Study on Simulated Data
5.5 Application on Experimental Data
6 DL-based SCA on Large-Scale Traces: A Case Study on a Polymorphic AES
6.2 Evaluation Methodology
6.3 Results
7 Gradient Visualization for General Characterization in Profiling Attack
7.2 Study of an Optimal Model
7.3 Proposal for a Characterization Method
7.4 Experimental Verification
7.5 Results
8 Conclusion & Perspectives
8.1 Summary of the Contributions
8.2 New Tracks of Research in Deep Learning (DL)-based Side-Channel Analysis (SCA)
A Noise Amplification of Secret-Sharing
A.1 The Link between Noise Amplification and Convolution
A.2 A Fixed-Point-Like Proof
B List of acronyms
D List of symbols
List of Figures
List of Tables