Multi-thread SIMD algorithms

Get Complete Project Material File(s) Now! »

Upstream Tracker

The Upstream Tracker (UT) is the second tracking detector. Its goal is to get an estimate of the charged particle’s momentum before it goes through the magnet. To do so, it is placed inside the fringe of the magnetic eld where it is weak enough that the particle doesn’t deviate too much from a straight line estimate, but strong enough to have a measurable curvature. Inside this detector, the track model used is the 5 parameters LHCb track model (x0, y0, Tx, Ty, q=p), with q=p being the ratio between the charge of the particle and its momentum, also called curvature. The resolution of the momentum measured by the UT is only about 15% but this measurement is essential to reduce the search time in later detectors and limit the amount of false matching of VELO tracks and SciFi hits.
The UT consists of four planes of silicon micro-strip sensors of various sizes and granularities, as shown in Figure 1.7. In the outer region (green), sensors are 10 cm long with a pitch of 190 m between vertical strips. This design ensures a good resolution in the x direction, crucial for the momentum measurement, while allowing the sensor to cover larger distances in y. The middle and inner-most regions (yel-low and red) use a 95 m pitch to cope with the increased occupancy closer to the beam line. Sensors directly surrounding the beam pipe are also divided vertically, to increase the y resolution by a factor of two.

Ring Imaging Cherenkov Detectors

Ring Imaging Cherenkov (RICH) Detectors are a class of detectors that exploit the emitted Cherenkov light of a particle traversing a material at a speed higher than the speed of light in that material. When a particle traverses such material, photons are emitted in a cone around the particle’s trajectory with an opening angle c depending on the mass m and momentum p of the particle and the refractive index of the material n: cos( c) = m (1.2).
As the momentum is measured with high precision by the tracking detectors and the refractive index of the material is known, the mass of the particle can be de-duced by measuring the opening angle. The choice of the radiator material is crucial to di erentiating between particle types of interest within the desired momentum range.
The LHCb detector contains two RICH detectors. RICH1, placed between the Velo and the UT, uses the uorocarbon gas C4F10 as its radiator material in order to maximize the pion-Kaon ( -K) separation in the 10-40 GeV momentum range. RICH2, placed after the SciFi, employs the uorocarbon gas CF4 as its radiator, providing -K separation up to 100 GeV momenta.

The High Level Trigger

In the previous sections we have seen how events recorded at a rate of 30 MHz by the LHC are observed by the LHCb detector. On average, about 100 KB of data are produced for each event, resulting in a raw data production rate of 3 TB/s. Saving every event’s data to long term storage would be impossible: even if the storage media could sustain such a writing speed, it would saturate our resources very quickly. The solution is to lter the events as they come and only save events containing valuable data according to the physics program. This decision is made by the trigger system.

LHCb Software Framework

LHCb uses many di erent software packages for its simulation, high level trigger or o ine analysis. All of these are built upon the same underlying framework called GAUDI, originally developed by the LHCb collaboration, and now used and maintained by the LHCb and ATLAS collaborations. The modern version of GAUDI is a scheduler capable of executing a functional data ow algorithm, dispatched on many parallel execution units. The core of the framework is written in C++ but the algorithms can be con gured and composed using a high-level API in Python. This con gurability is essential to allow the creation of new physics scenarios without the need of software experts. The framework has been in constant development and evolution for 20 years, adapting to the important changes in both hardware and software. Some of the major changes of the hardware landscape will be presented in the next chapter. The following list presents the main software and libraries used by LHCb:
• Gauss is a package used to simulate how particles produced by proton-proton collisions interact with the detector. The simulation employs the technique of Monte-Carlo (MC) simulation, and the produced data is often referred to as MC data. It uses third party event generator libraries like PYTHIA [154] or EVTGEN [106] and the GEANT4 [14] library for handling particle propagation in the detector.
• Boole is the software that simulates the e ect of sensors and front-end elec-tronics. It processes the output of Gauss to turn it into raw data similar to what is expected from the real detector.
• LHCb is a package containing the de nitions for LHCb object types and low level libraries.
• Rec is a library of algorithms to perform event reconstruction, including track-ing and particle identi cation used for both the HLT and o ine analysis.
• Allen is the HLT1 GPU implementation which will be used in Run 3. It is written in CUDA and uses the same con guration framework as the CPU HLT1 and HLT2 implementations.
• Moore is the con guration framework for the online HLT. It is a collection of Python scripts that helps to de ne data- ows and selection lines.
The LHCb software framework is open-source and distributed under the GPLv3 license at As part of this thesis, my main contributions to the LHCb software framework were the development of new optimized recon-struction algorithms in Rec, as well as some improvements to the Event Model and low level libraries in LHCb. This work will be presented in the next chapters.

CPU Architecture

The evolution of CPU performance over the last decades can be described by Moore’s law and the two equations 2.1 and 2.2 [51]. Moore’s law is an empirical law stating that the number of transistors in an integrated circuit doubles every N months. The rst equation is the time taken by a CPU to execute a program, a direct measure of its performance on a given task: TCPU = IC (2.1). where IC is the instruction count of the program, IPC is the instruction per CPU clock cycle, and F is the clock frequency.
The second important equation is the power dissipation of CMOS circuits, ex-pressed as the sum of the static power dissipation, the power dissipated when the circuit is idle, and dynamic power dissipation depending on the circuit’s activity: PCP U = Pstatic + X Ci Vdd2 F (2.2).
where F is the clock frequency, Vdd is the power supply voltage, Ci is the sum of transistor gate and interconnection capacitance, and is the average percentage of P switching capacitance in the circuit.

READ  Extraction of the propagating medium through the use of the transfer matrices

Single Instruction Multiple Data

The Single Instruction Multiple Data (SIMD) paradigm of modern CPUs allow them to increase ILP by having special instructions, which take the same number of clock cycles as scalar instructions but operate on multiple elements at a time. Each architecture family has its own instruction set that operates on special xed size SIMD registers, usually 128- or 256-bits wide. The size of SIMD registers tends to increase in modern Instruction Set Architectures (ISA), as of 2021, some ISA support  up to 2048 bits registers and hardware implementations exists for up to 512 bits. A  typical SIMD ISA can be subdivided into three categories: memory access (loading or storing data from and to memory), arithmetic and logic operations (registers-toregisters mathematical or logical operations) and permute operations (changing the organisation of elements within a single vector register or mixing elements of multiple registers). As the ISA is dependent on the family of CPU used, the developer must be aware of architectural details when using SIMD and often change the structure of the algorithm to utilize it eciently [61, 101]. There exist many solutions to use SIMD instructions:
• Assembly language: the most straightforward way of using SIMD instructions is to write them directly in assembly, either in an assembly program or using inline assembly from C/C++. The downside is that it creates hard to read code and sacrices portability across architectures.
• C/C++ compiler intrinsics: compilers expose higher-level functions that map directly to the corresponding assembly instruction. But it is still hard to read, not portable, and very verbose.
• SIMD libraries: wrap the intrinsic to give them a standard name, abstracting architecture to make the program portable. In C++ wrapper libraries are able to use operator overloads and template meta-programming to improve readability.
• Domain Specic Languages (DSL) for parallel programming: usually require a dierent compiler or some extensions. Not only limited to SIMD, DSLs can also be used for other source code optimisation [59, 89, 96] or to target specialized hardware [17, 64]. DSLs are easier to use but lose ne-grained control; it is therefore not always possible to reach the best performance using this solution.
• Automatic Vectorization by the compiler provides free performance gains, but not all patterns are eligible for auto-vectorization. It is usually hard to get deterministic results.

SIMD speedup and frequency scaling

As clock frequencies of modern processors are expected to stay near their current levels, or even to reduce, the primary method to improve the computation power of a chip is to increase either the number of processing units (cores) or the intrinsic parallelism of a core (SIMD). The speedup that can be achieved for a particular application depends on the amount of code that can be vectorized. Amdahl’s law [16] gives a theoretical bound for the speedup: speedup(c) = 1 1 􀀀 + c where c is the SIMD cardinality, and is the fraction of vectorized code.

Table of contents :

1 The LHCb experiment 
1.1 Introduction
1.2 The Large Hadron Collider
1.3 The LHCb detector
1.3.1 Vertex Locator
1.3.2 Upstream Tracker
1.3.3 Scintillating Fibre Tracker
1.3.4 Ring Imaging Cherenkov Detectors
1.3.5 Calorimeters
1.3.6 Muon stations
1.4 The High Level Trigger
1.4.1 Event Building
1.4.2 HLT1
1.4.3 HLT2
1.4.4 LHCb Software Framework
1.5 Conclusion
2 Parallelism on CPU 
2.1 Introduction
2.2 CPU Architecture
2.2.1 Multi-core architectures
2.2.2 Cache Level Hierarchy
2.2.3 Data Layout
2.3 Single Instruction Multiple Data
2.3.1 Instruction sets
2.3.2 SIMD speedup and frequency scaling
2.4 The SIMDWrappers library
2.4.1 Design objectives
2.4.2 Comparison with other SIMD libraries
2.4.3 Instruction emulation
2.5 Conclusion
3 Parallelism on GPU 
3.1 Introduction
3.2 From arcade video games to HPC
3.3 CUDA programming model
3.4 Grid-stride loops
3.5 Shared memory optimisations
3.6 Warp-level programming
3.7 Conclusion
4 Connected Component Analysis 
4.1 Introduction
4.2 Connected Component Labeling and Analysis
4.2.1 One component at a time
4.2.2 Multi-pass iterative algorithms
4.2.3 Direct two-pass algorithms
4.2.4 Mask topology: blocks and segments
4.3 HA4: Hybrid pixel/segment CCL for GPU
4.3.1 Strip labeling
4.3.2 Border Merging
4.3.3 CCL – Final labeling
4.3.4 CCA and Feature Computation
4.3.5 Processing two pixels per thread
4.3.6 Experimental Evaluation
4.4 FLSL: Faster LSL for GPU
4.4.1 Full segments (FLSL)
4.4.2 On-The-Fly feature merge (OTF)
4.4.3 Conict detection (CD)
4.4.4 Number of updates and conicts
4.4.5 Experimental Evaluation
4.5 SIMD Rosenfeld
4.5.1 SIMD Union-Find
4.5.2 SIMD Rosenfeld pixel algorithm
4.5.3 SIMD Rosenfeld sub-segment algorithm
4.5.4 Multi-thread SIMD algorithms
4.5.5 Experimental Evaluation
4.6 SparseCCL
4.6.1 General parameterizable ordered SparseCCL
4.6.2 Acceleration structure for un-ordered pixels
4.6.3 Case study: specialization for LHCb VELO Upgrade
4.6.4 Experimental Evaluation
4.7 Conclusion
5 VELO reconstruction algorithm 
5.1 Introduction
5.2 Tracking algorithms
5.3 Evolution of the VELO detector and algorithms
5.3.1 Reconstruction in the Run 1 and 2 VELO detector
5.3.2 Reconstruction of the upgraded VELO detector
5.4 SIMD Velo reconstruction
5.4.1 Structure of the algorithm
5.4.2 Seeding tracks
5.4.3 Extending tracks
5.4.4 Numerical precision
5.5 Benchmarks
5.5.1 Throughput
5.5.2 Reconstruction physics eciency
5.6 Conclusion
6 Scalability of the LHCb software 
6.1 Introduction
6.2 Evaluation of HLT1 on CPUs
6.3 Evaluation of HLT2 on CPUs
6.4 Conclusion


Related Posts