In a multi-core processor, all cores are manufactured on the same silicon integrated circuit, called a die. For CPUs whose cores sit in the same package but on different dies, the dedicated term is multi-chip module. A multi-core processor can be classified as many-core when the core count is particularly high (tens to thousands). In a multi-CPU system, multiple physically distinct CPU packages communicate with each other to form a single system with even more cores. All of these techniques share a common goal: increasing Thread Level
Parallelism (TLP). Threads are streams of instructions that can be executed and managed independently by the operating system scheduler. Threads can be executed concurrently, taking advantage of the multiple cores of the CPU, and can share the same resources, such as peripherals or memory. Shared memory allows threads to communicate and cooperate. Multiple threads are grouped into processes, each representing a single application. Unlike threads, processes do not share resources, but they can still communicate in other ways, for instance through network protocols.
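As a minimal sketch of TLP through shared memory (the function and variable names are illustrative, not from the text), two threads can cooperate on a sum by writing to disjoint slots of a shared array, so no synchronisation is needed:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sum an array by splitting the work across two threads that share
// the same memory; each thread writes only to its own slot of
// `partial`, so the threads cooperate without locks.
long long parallel_sum(const std::vector<int>& data) {
    long long partial[2] = {0, 0};
    auto worker = [&](int id, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) partial[id] += data[i];
    };
    const std::size_t mid = data.size() / 2;
    std::thread t0(worker, 0, std::size_t{0}, mid);
    std::thread t1(worker, 1, mid, data.size());
    t0.join();
    t1.join();
    return partial[0] + partial[1];
}
```

A process would instead duplicate the data, since processes do not share an address space by default.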
Cache Level Hierarchy
While the shrinking of transistors has allowed CPU performance to improve over the years, memory access speed has not progressed at the same pace. This growing gap between processor and memory performance, shown in figure 2.4, has become a bottleneck.
Thankfully, while it is hard to manufacture large amounts of fast memory, smaller-capacity memory can be manufactured with processes similar to CPU logic and embedded directly on the CPU die. This embedded, faster memory is used as a cache for the main memory and comes in multiple levels. The Level 1 (L1) cache is usually divided into two functional parts: the instruction cache and the data cache. On modern hardware, each core usually has its own L1 cache holding around 64 KB. The next levels, L2, L3 and sometimes L4, are increasingly bigger but slower, and are usually shared between multiple cores. In an inclusive hierarchy, each cache level contains all the data of the lower levels. When the CPU tries to access data not yet in the cache, a cache miss occurs and the data is fetched from the higher cache levels or from the main memory. Caches can implement different policies depending on the architectural choices. Cache policies are based on the principle of locality, of which two types are exploited: temporal and spatial. Following temporal locality, new data replaces the least recently used entries in the cache, which are less likely to be used again. To also exploit spatial locality, data is fetched as a cache line, usually 64 bytes wide, which is the unit of data transfer between the cache levels and main memory. Figure 2.5 shows the pyramidal organisation of the cache level hierarchy.
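The effect of spatial locality can be sketched with a classic example (illustrative names and sizes, not from the text): traversing a row-major matrix row by row uses every element of each fetched 64-byte cache line, while traversing it column by column touches only one element per fetched line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;  // illustrative matrix dimension

// Row-major traversal: consecutive addresses, cache friendly.
long long sum_row_major(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += m[i * N + j];  // next access is in the same cache line
    return s;
}

// Column-major traversal: stride of N * sizeof(int) bytes, so each
// access typically lands in a different cache line.
long long sum_col_major(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += m[i * N + j];
    return s;
}
```

Both functions compute the same result, but on typical hardware the row-major version runs markedly faster once the matrix exceeds the cache size.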
Single Instruction Multiple Data
The Single Instruction Multiple Data (SIMD) paradigm of modern CPUs allows them to increase ILP through special instructions that take the same number of clock cycles as scalar instructions but operate on multiple elements at a time. Each architecture family has its own instruction set that operates on special fixed-size SIMD registers, usually 128 or 256 bits wide. The size of SIMD registers tends to increase in modern Instruction Set Architectures (ISA): as of 2021, some ISAs support registers of up to 2048 bits, and hardware implementations exist for up to 512 bits. A typical SIMD ISA can be subdivided into three categories: memory access (loading or storing data from and to memory), arithmetic and logic operations (register-to-register mathematical or logical operations) and permute operations (changing the organisation of elements within a single vector register or mixing elements of multiple registers). As the ISA depends on the CPU family, the developer must be aware of architectural details when using SIMD and must often change the structure of the algorithm to utilize it efficiently [61, 101]. There exist many solutions for using SIMD instructions:
• Assembly language: the most straightforward way of using SIMD instructions is to write them directly in assembly, either in an assembly program or using inline assembly from C/C++. The downside is that it produces hard-to-read code and sacrifices portability across architectures.
• C/C++ compiler intrinsics: compilers expose higher-level functions that map directly to the corresponding assembly instructions. The code remains hard to read, non-portable, and very verbose.
• SIMD libraries: wrap the intrinsics to give them standard names, abstracting the architecture to make the program portable. In C++, wrapper libraries can use operator overloading and template meta-programming to improve readability.
• Domain Specific Languages (DSL) for parallel programming: usually require a different compiler or some extensions. Not limited to SIMD, DSLs can also be used for other source-code optimisations [59, 89, 96] or to target specialized hardware [17, 64]. DSLs are easier to use but lose fine-grained control; it is therefore not always possible to reach the best performance with this solution.
• Automatic vectorization by the compiler provides free performance gains, but not all patterns are eligible for auto-vectorization, and it is usually hard to obtain deterministic results.
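To make the intrinsics option above concrete, here is a minimal sketch, not taken from any of the libraries discussed, that adds two float arrays four elements at a time with x86 SSE intrinsics; the function name and loop structure are illustrative assumptions, and the code assumes an x86-64 target where SSE is always available.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // x86 intrinsics (SSE/AVX)

// Element-wise addition of two float arrays, four lanes at a time.
// Each _mm_* call maps to one SSE instruction on a 128-bit register,
// illustrating the three ISA categories: memory access, arithmetic,
// and (not needed here) permutes.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // memory access: load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);   // arithmetic: 4 additions at once
        _mm_storeu_ps(out + i, vc);       // memory access: store 4 floats
    }
    for (; i < n; ++i)                    // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}
```

A wrapper library would hide the `_mm_*` names behind portable vector types, so the same source could target SSE, AVX or SVE backends.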
SIMD speedup and frequency scaling
As clock frequencies of modern processors are expected to stay near their current levels, or even to decrease, the primary method to improve the computation power of a chip is to increase either the number of processing units (cores) or the intrinsic parallelism of a core (SIMD). The speedup that can be achieved for a particular application depends on the amount of code that can be vectorized. Amdahl's law gives a theoretical bound for the speedup: speedup(c) = 1 / ((1 - p) + p/c), where p is the fraction of the execution time that can be vectorized and c is the number of SIMD lanes.
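The bound can be evaluated numerically; the helper name below is mine, not from the text:

```cpp
// Amdahl's law for vectorization: with a fraction p of the run time
// vectorizable and c SIMD lanes, the overall speedup is bounded by
// 1 / ((1 - p) + p / c). Even with arbitrarily many lanes the
// speedup can never exceed 1 / (1 - p).
double amdahl_speedup(double p, double c) {
    return 1.0 / ((1.0 - p) + p / c);
}
```

For example, with 90% vectorizable code and 8 lanes the bound is only about 4.7x, and even a half-vectorizable program can never run more than twice as fast, however wide the registers become.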
Comparison with other SIMD libraries
This section reviews some of the other SIMD libraries and explains the features unique to SIMDWrappers. The libraries listed are, to my knowledge, the main libraries still maintained or in active development. Only open-source libraries are discussed here.
SIMD libraries can be classified according to the instruction sets they support and the data types they expose. Table 2.3 summarizes the backends and data types of each library. SIMDWrappers is the only library that provides distinct, interchangeable backends for AVX-512, allowing a systematic study of the impact of AVX-induced frequency scaling on Intel architectures. The 512-bit SVE backend was developed and tested on the Fujitsu A64FX. The vector types were restricted to 32-bit integers and floating-point numbers to simplify the user interface. Having every element be the same size allows SIMD code to be written much like scalar code, because every vector register contains the same number of elements. This choice may restrict the algorithms one can implement with the library but provides a simple interface with the SOACollection library.

Some libraries, like Boost.SIMD or simdpp, use C++ expression templates. This technique makes the compiler rewrite arithmetic expressions into dedicated SIMD instructions. A typical example is the Fused Multiply-Add (FMA) instructions, which compute expressions such as a·b + c in a single instruction. The drawback of this technique is the complexity it adds to the library and the large, cryptic errors the compiler can produce. Fortunately, modern compilers already reliably contract such expressions into FMA instructions. For this reason, SIMDWrappers does not use expression templates and instead relies on the compiler to infer FMA from addition and multiplication SIMD instructions.
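The point about relying on the compiler for FMA can be sketched as follows (illustrative function names, not from SIMDWrappers): `std::fma` requests the fused operation explicitly, while the plain expression is typically contracted into an FMA instruction when the target supports it and contraction is enabled (e.g. `-mfma` with GCC's default `-ffp-contract=fast`).

```cpp
#include <cassert>
#include <cmath>

// The plain expression is eligible for FMA contraction: with FMA
// enabled, the compiler emits a single fused instruction with no
// expression-template machinery in the source.
double axpy(double a, double x, double y) {
    return a * x + y;
}

// std::fma guarantees a single fused multiply-add with one rounding,
// regardless of compiler flags.
double axpy_fused(double a, double x, double y) {
    return std::fma(a, x, y);
}
```

Both return the same value here; they can differ in the last bit when the intermediate product would be rounded in the unfused form.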
Table of contents:
1 The LHCb experiment
1.2 The Large Hadron Collider
1.3 The LHCb detector
1.3.1 Vertex Locator
1.3.2 Upstream Tracker
1.3.3 Scintillating Fibre Tracker
1.3.4 Ring Imaging Cherenkov Detectors
1.3.6 Muon stations
1.4 The High Level Trigger
1.4.1 Event Building
1.4.4 LHCb Software Framework
2 Parallelism on CPU
2.2 CPU Architecture
2.2.1 Multi-core architectures
2.2.2 Cache Level Hierarchy
2.2.3 Data Layout
2.3 Single Instruction Multiple Data
2.3.1 Instruction sets
2.3.2 SIMD speedup and frequency scaling
2.4 The SIMDWrappers library
2.4.1 Design objectives
2.4.2 Comparison with other SIMD libraries
2.4.3 Instruction emulation
3 Parallelism on GPU
3.2 From arcade video games to HPC
3.3 CUDA programming model
3.4 Grid-stride loops
3.5 Shared memory optimisations
3.6 Warp-level programming
4 Connected Component Analysis
4.2 Connected Component Labeling and Analysis
4.2.1 One component at a time
4.2.2 Multi-pass iterative algorithms
4.2.3 Direct two-pass algorithms
4.2.4 Mask topology: blocks and segments
4.3 HA4: Hybrid pixel/segment CCL for GPU
4.3.1 Strip labeling
4.3.2 Border Merging
4.3.3 CCL – Final labeling
4.3.4 CCA and Feature Computation
4.3.5 Processing two pixels per thread
4.3.6 Experimental Evaluation
4.4 FLSL: Faster LSL for GPU
4.4.1 Full segments (FLSL)
4.4.2 On-The-Fly feature merge (OTF)
4.4.3 Conflict detection (CD)
4.4.4 Number of updates and conflicts
4.4.5 Experimental Evaluation
4.5 SIMD Rosenfeld
4.5.1 SIMD Union-Find
4.5.2 SIMD Rosenfeld pixel algorithm
4.5.3 SIMD Rosenfeld sub-segment algorithm
4.5.4 Multi-thread SIMD algorithms
4.5.5 Experimental Evaluation
4.6.1 General parameterizable ordered SparseCCL
4.6.2 Acceleration structure for un-ordered pixels
4.6.3 Case study: specialization for LHCb VELO Upgrade
4.6.4 Experimental Evaluation
5 VELO reconstruction algorithm
5.2 Tracking algorithms
5.3 Evolution of the VELO detector and algorithms
5.3.1 Reconstruction in the Run 1 and 2 VELO detector
5.3.2 Reconstruction of the upgraded VELO detector
5.4 SIMD Velo reconstruction
5.4.1 Structure of the algorithm
5.4.2 Seeding tracks
5.4.3 Extending tracks
5.4.4 Numerical precision
5.5.2 Reconstruction physics efficiency
6 Scalability of the LHCb software
6.2 Evaluation of HLT1 on CPUs
6.3 Evaluation of HLT2 on CPUs