Monte Carlo Neutron Simulations – Project topics materials

Sandy Bridge and Broadwell

The Intel Sandy Bridge micro-architecture [28], built on a 32 nm process, was first released in early 2011. It implements many new hardware features, among which the AVX (Advanced Vector Extensions) instructions brought about a renaissance of SIMD techniques. Intel Broadwell [29], with 14 nm transistors and the more powerful AVX2 instruction set, brings many improvements, such as a larger out-of-order scheduler, reduced instruction latencies, and a larger L2 Translation Lookaside Buffer (TLB). Note that throughout this thesis we always compare one MIC with a dual-socket CPU, because the two have nearly the same price and power consumption.

Many Integrated Cores

The performance improvement of processors comes from the increasing number of computing units packed into a small chip area. Thanks to advanced semiconductor processes, more transistors can be built into one shared-memory system to do several things at once. From the programmer's point of view, this can be realized in two ways: different data or tasks execute either on multiple independent computing units (multi-threading) or in the lanes of wide vector units driven by a single instruction (vectorization). To meet the need for parallelism and vector capabilities, Intel first introduced wide vector units (512 bits) in an x86-based GPGPU chip codenamed Larrabee [30]. This project was terminated in 2010 because of its poor early performance, but its techniques were largely inherited by the later high-performance computing architecture, Many Integrated Cores (MIC). The first MIC board, codenamed Knights Ferry, was announced just after the termination of Larrabee. Inheriting the ring structure of the initial design, Knights Ferry has 32 in-order cores built on a 45 nm process, with four hardware threads per core [11]. The card supports only single-precision floating-point instructions and can achieve 750 GFLOPS. Like GPGPUs, it works as an accelerator and is connected to the host via a PCI (Peripheral Component Interconnect) bus.
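The two levels of parallelism described above can be illustrated with a small conceptual model. This is only a sketch: it counts work items and does not execute real SIMD instructions. The function names `lanes` and `decompose` are hypothetical, and the 256-bit width used for AVX is an assumption that is not stated in the text.

```python
import math


def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def decompose(n_elements: int, n_threads: int,
              vector_bits: int, element_bits: int):
    """Split the work across threads, then split each share into vector chunks.

    Returns one (elements, vector_operations) pair per thread.
    """
    width = lanes(vector_bits, element_bits)
    base, extra = divmod(n_elements, n_threads)
    plan = []
    for t in range(n_threads):
        # First level: each thread receives a near-equal share of the data.
        share = base + (1 if t < extra else 0)
        # Second level: the share is processed `width` elements at a time;
        # the last chunk may be only partially filled.
        vector_ops = math.ceil(share / width)
        plan.append((share, vector_ops))
    return plan


# 512-bit vector units with 32-bit single-precision elements.
print(lanes(512, 32))  # 16

# Assumed 256-bit AVX registers with 64-bit double-precision elements.
print(lanes(256, 64))  # 4

# 1000 elements on 4 threads with 512-bit single-precision vectors.
plan = decompose(1000, 4, 512, 32)
print(plan)
```

In this model each thread handles 250 elements in 16 vector operations instead of 250 scalar ones, which shows why both thread count and vector width matter for the architectures discussed here.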

Effects of Temperature on Cross-sections

The cross-sections given in the Nuclear Data Libraries are valid for a neutron colliding with an isotope at rest; they are thus called cross-sections at 0 Kelvin. The thermal motion of atoms can influence the interaction between target nuclei and free neutrons. If the neutron is fast enough, with a kinetic energy above 1 MeV, we can neglect the thermal motion of the nuclides and use the 0 K cross-sections. To obtain accurate cross-sections that account for temperature effects, one must know the distribution of target velocities. The thermal velocity of the target system follows the Maxwell-Boltzmann distribution [77], which can be expressed as a function of the absolute temperature (T) and the target mass (M). Thermal motion changes the shape of cross-section resonances by broadening the resonance width and lowering the resonance peak. This widening of resonances is referred to as Doppler broadening [78]. This effect, caused by the wider distribution of target motion, results in a significant increase of the absorption probability in the resonance region as neutrons scatter down to thermal energies. It is of great importance for reactor safety, since it establishes a negative feedback mechanism against temperature increase and therefore helps prevent a meltdown of the system. In order to conserve the real reaction rate, the effective collision probability must be obtained by taking the material temperature into account, that is, by averaging the 0 K cross-section over the distribution of target velocities.
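In standard textbook notation, which may differ from the notation used in the thesis, the Maxwell-Boltzmann distribution of target velocities and the reaction-rate-conserving effective cross-section are commonly written as shown below. The symbols $k_B$ (Boltzmann constant), $\vec{V}$ (target velocity), $\vec{v}$ (neutron velocity), $\sigma_0$ (0 K cross-section) and $\sigma_{\mathrm{eff}}$ (effective cross-section at temperature $T$) are assumptions made for this sketch.

```latex
% Maxwell-Boltzmann distribution of target velocities
\[
P(\vec{V})\, d\vec{V}
  = \left( \frac{M}{2 \pi k_B T} \right)^{3/2}
    \exp\!\left( -\frac{M V^{2}}{2 k_B T} \right) d\vec{V}
\]

% Effective (Doppler-broadened) cross-section conserving the reaction rate
\[
v\, \sigma_{\mathrm{eff}}(v, T)
  = \int \left| \vec{v} - \vec{V} \right| \,
    \sigma_0\!\left( \left| \vec{v} - \vec{V} \right| \right)
    P(\vec{V})\, d\vec{V}
\]
```

As $T$ increases, $P(\vec{V})$ widens, so the integral averages $\sigma_0$ over a larger range of relative speeds. This is what lowers and broadens the resonance peaks.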

Table of contents:

Résumé
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
1.1 Monte Carlo Neutronics
1.2 HPC and Hardware Evolution
1.3 Motivations and Goals
1.4 Outline
2 Modern Parallel Computing
2.1 Computing Architectures
2.1.1 CPU Architectures
2.1.1.1 Memory Access
2.1.1.2 Hardware Parallelism
2.1.1.3 Sandy Bridge and Broadwell
2.1.2 Many Integrated Cores
2.1.2.1 Knights Corner
2.1.2.2 Knights Landing
2.1.3 GPGPU
2.1.3.1 Tesla P100
2.2 Programming Models
2.2.1 OpenMP
2.2.2 Threading Building Blocks
2.2.3 OpenACC
2.3 Vectorization
2.3.1 Methodology
2.3.1.1 Intrinsics
2.3.1.2 Directives
2.3.1.3 Libraries
2.3.1.4 Auto-Vectorization
2.4 Useful Tools for Performance Analysis
2.4.1 TAU
2.4.2 Intel VTune
2.4.3 Intel Optimization Report
2.4.4 Intel Advisor
3 Monte Carlo Neutron Simulations
3.1 Nuclear Reactors
3.2 Nuclear Reactions
3.2.1 Cross-Section
3.3 Effects of Temperature on Cross-sections
3.4 Neutron Transport
3.4.1 Neutron Transport Equation
3.4.2 Monte Carlo Simulation
3.5 Simulation Codes
3.5.1 MCNP
3.5.2 TRIPOLI
3.5.3 PATMOS
3.6 HPC and Monte Carlo Transports
3.6.1 History-Based and Event-Based
3.6.2 Accelerators
3.6.3 Cross-Section Computations
3.6.4 Shared-Memory Model
4 Energy Lookup Algorithms
4.1 Working Environments
4.1.1 Machines
4.1.2 PointKernel Benchmark
4.2 Porting and Profiling
4.2.1 Adaptations to KNC
4.2.2 Code Profiling
4.2.2.1 Profiling on Intel Sandy Bridge
4.2.2.2 Profiling on KNC
4.3 Binary Search and Alternative Search Methods
4.3.1 Manual Binary Search
4.3.2 Vectorized N-ary Search
4.3.3 Vectorized Linear Search
4.3.3.1 Data Alignment for C++ Member Variables
4.3.4 Comparison of Different Search Algorithms
4.4 Unionized Energy Grid
4.4.1 Optimizations for the Unionized Method
4.4.1.1 Initialization
4.4.1.2 Data Structure
4.5 Fractional Cascading
4.5.1 Reordered Fractional Cascading
4.6 Hash Map
4.6.1 Isotope Hashing
4.6.2 Material Hashing
4.6.3 Efficient Hashing Strategies
4.6.3.1 Hashing Size
4.6.3.2 Logarithmic vs. Linear Hashing
4.6.3.3 Search Efficiency within Hashing Bins
4.7 N-ary Map
4.8 Full Simulation Results
4.8.1 Performance
4.8.2 Memory Optimization
4.8.3 Scalability
4.8.4 Results on Latest Architectures
5 Cross-section Reconstruction
5.1 Working Environments
5.1.1 Machines
5.1.2 PointKernel Benchmark
5.2 Algorithm
5.2.1 Resolved Resonance Region Formula
5.2.1.1 Single-Level Breit-Wigner
5.2.1.2 Multi-Level Breit-Wigner
5.2.1.3 Doppler Broadening
5.2.2 Faddeeva Function
5.3 Implementations and Optimizations
5.3.1 Faddeeva Implementations
5.3.1.1 ACM680 W
5.3.1.2 MIT W
5.3.2 Scalar Tuning
5.3.2.1 Algorithm Simplification
5.3.2.2 Code Reorganization
5.3.2.3 Strength Reduction
5.3.2.4 STL Functions
5.3.3 Vectorization
5.3.3.1 Collapse
5.3.3.2 No Algorithm Branch
5.3.3.3 Loop Splitting
5.3.3.4 Declare SIMD Directives
5.3.3.5 Float
5.3.3.6 SoA
5.3.3.7 Data Alignment and Data Padding
5.3.4 Threading
5.4 Tests and Results
5.4.1 Unit Test of Faddeeva Functions
5.4.1.1 Preliminary Numerical Evaluation
5.4.2 Reconstruction in PATMOS
5.4.2.1 Unit Test of Cross Section Calculation
5.4.2.2 Performance in PointKernel
5.4.2.3 Memory Requirement
5.4.2.4 Roofline Analysis
5.4.3 Energy Lookups vs. On-the-fly Reconstruction
6 Conclusion and Perspective
6.1 Conclusion
6.2 Future Work
Bibliography