Cone Beam Computed Tomography
CBCT is a modern medical imaging method. During a CBCT scan, the scanner revolves around the patient’s head taking a large set of X-ray images. Using these images and computational methods, a 3D volume of the x-ray attenuation coeﬃ-cients are obtained. CBCT is most popular in implant dentistry and radiotherapy, but has proved useful for other applications. Two of the appealing characteristics are the fast scan time and relatively small scanner size. The design of the CBCT used is shown in figure 1.2, while a schematic view showing the relevant parts is shown in Figure 1.3.
X-rays are used because they have an attenuation length of 2-6 cm in biological matter, which is suitable for clinical applications. The X-ray photons are generated in an X-ray tube (source), and are sent along a path towards the detector. On their way, they may interact with matter by being scattered or absorbed. Some materials, such as bone, have a higher chance of interacting with the photons, and thus fewer photons will be transmitted through them. The detector detects the transmitted (primary) and scattered photon and produces an image.
The most common algorithm used to reconstruct the 3D volume is the Filtered Back-Projection algorithm, developed by Feldkamp in 1984. The algorithm works by projecting the filtered signal backwards from the detector through the volume to estimate the attenuation at each point in the volume. A full description of the algorithm is available in .
Scatter Artefacts in CBCT Images
If the detector only detected the primary photons, the image would be an accurate representation of how much radiation is attenuated along diﬀerent lines through the phantom. However, due to the scattered photons, this image loses contrast. The problems caused by this are illustrated in Figure 1.4.
The main goal of the work in this thesis was to write a code and related methods that can be used to remove the scattered photons from the X-ray images. The chosen method is the method developed by Wiegert  around 2006. In this method, a first reconstruction of the volume is performed. Using this volume, a large number of photons are simulated using the Monte Carlo (MC) method, and the resulting scatter contribution is calculated. This calculated scatter is then subtracted from the measured images.
The MC method is a method were many samples, in this case individual photons, are taken from a distribution. The sought after quantities, such as the distribution of the scattered photons on the detector, can then be calculated by averaging. This method is regarded as accurate, but it converges relatively slowly.
In Wiegerts work, the simulation of the scatter took 20 days for one scan. Since then, significant technological improvements have been made, and the goal of this thesis is to find a method to perform this simulation in less than five minutes, thus allowing scatter correction for clinical applications.
Scientific Computation on GPUs
General-Purpose computing on Graphics Processing Units (GPGPU) is a relatively new phenomenon oﬀering developers very high computational power at moderate costs. The idea is to use the hardware developed for commercial graphics applica-tions, such as games, and use it for scientific computations.
Computer graphics has a parallel structure with each pixel calculated being largely independent from the other pixels. This allows a very large number of processing cores to work in parallel. To facilitate more processors on the chip, Graphics Pro-cessing Unit (GPU)s sacrifice the ability for each thread to work independently. Instead GPUs use a Single Instruction, Multiple Data (SIMD) architecture. In this architecture, multiple processor performs the same set of instructions, but on diﬀerent data.
This very parallel processing structure has applications in scientific computing, since many problems are highly parallel. The computational problem in this thesis, MC photon transport, is an example of such a problem.
There are two main frameworks used for GPU computing today, Open Computing Language (OpenCL) and Compute Unified Device Architecture (CUDA). OpenCL is an open source framework, while CUDA is proprietary and owned by Nvidia, the largest supplier of GPUs today. In this work, we have used CUDA, primarily because it is already used in other company products.
Layout of thesis
In the following chapter, we will delve deeper into the problems related to scatter artifacts in CBCT and present a solution.
In chapter 2, we will discuss the tools used, starting with introducing the CUDA GPU framework. The physics of interest to the problem will also be introduced, and we will discuss what parts are the most relevant to account for. Finally, we will investigate the methods used by other authors to solve these problems and methods they have used to make the program faster without losing accuracy.
In chapter 3 we will look at the implementation in detail. We will begin with discussing how to generate and store the geometry and how to assign properties to the various materials present. Then, we will investigate the implementation of the physical model, with emphasis on the modelling of photon interactions. At the end, some coding details and tricks will be discussed.
In chapter 4 the results of the work will be presented. We first compare the code with another well known code called PENELOPE, then we test the code by simulating a real experiment. Finally, we test the eﬀect of removing the calculated scatter from real images and see how the reconstruction is improved.
In chapter 5 we discuss the results by and problemize what may have gone worse than expected.
In the final chapter, we conclude the findings of the thesis with a summary, and discuss interesting directions of further research in this area.
In this chapter the background of the thesis will be presented. First, a brief in-troduction to the methods used will be given, with a brief overview of scientific computing on GPUs using Nvidia’s CUDA tools. Some of the specific issues related to this approach will also be discussed. The MC method will also be presented and its rate of convergence discussed. The problem of photon transport in matter will also be presented.
Finally, the MC method for photon transport in matter will be presented in depth, with discussions on other relevant work in the area. Many of the more complex sub-problems will also be discussed. Various methods to speed up the method such as variance reduction methods, pre-calculations and filtering methods will be introduced.
The CUDA computational model is a multi-level model. Each processing thread belongs to a warp, which is a set of threads sharing an instruction queue. A set of warps forms a block, and the blocks are laid out in a grid. To enumerate the threads, each thread has a three dimensional blockId, and a three dimensional threadId within the block.
The main diﬀerence between CUDA programming and Central Processing Unit (CPU) programming is that outer loops are replaced by calls to device (GPU) side functions. For example, if we want to perform some work on a 2D array of data on a CPU, the code would look like Algorithm 1.
Since the CUDA framework uses a SIMD architecture, the code is very sensitive to branching. Branching is any occasion of if or switch or other statements where the code may take diﬀerent paths. If a single thread in a warp needs to take a diﬀerent path from the other threads, all threads have to wait for that one thread to finish before they can continue.
The memory structure on an CUDA GPU is also optimized for parallel execution. This is in contrast to CPU memory which is optimized for serial code. Like a CPU, the GPU has a large global memory. This memory is often in the order of one to several GB. The global memory has three main subdivisions: the general memory (usually denoted global memory), where any kind of data is stored; the texture memory, which is optimized for typical texture access patterns; and the constant memory, which is constant in time and can therefore be heavily cache optimized. Access to the global memory is optimized for parallel access, and to achieve optimal performance, continuous blocks of 64 bytes should be read.
Each block of threads also shares an on chip memory. This memory is usually 64 kB and has two subdivisions, L1 cache which is used to cache global memory accesses, and a general shared memory for general storage. The shared memory is significantly faster to access than global memory. Finally, each thread has its own registers, access to the registers is extremely fast.
This memory structure is illustrated in Figure 2.1, which shows the main compo-nents of the CUDA architecture. Utilizing these diﬀerent memory levels optimally is key to a successful GPU algorithm.
Table of contents :
1.1 Leksell Gamma Knife
1.2 Cone Beam Computed Tomography
1.2.1 Scatter Artefacts in CBCT Images
1.3 Scientific Computation on GPUs
1.4 Layout of thesis
2.2 Monte Carlo Method
2.3 Photon Transport in Matter
2.4 The Monte Carlo Method for Photon Transport
2.4.1 Related Work
2.4.2 Variance Reduction Techniques
2.4.3 Pre- and Post-Processsing
2.4.4 Random Number Generation
3.1 CBCT Volume Reconstruction with Scatter Reduction
3.2.1 Material Model
3.3 Simulating Photons
3.3.1 Generating Photons
3.3.2 Advance Photon
3.3.3 Score Photon
3.3.4 Simulating Interactions
3.3.5 Energy Cut-off
3.4 Variance Reduction
3.4.2 Russian Roulette
3.4.3 Forced Detection
3.5 Filtering Methods
3.6 Scatter Removal
3.7 Code Optimizations and Details
3.7.1 Random Number Generation
3.7.2 Memory Use and Accesses
3.7.3 Built-in Function Calls
3.7.4 Thread Coherence
3.7.5 Numerical Precision
4.1.1 Variance Reduction Methods
4.2 Physical Accuracy
4.2.1 PENELOPE comparison
4.2.2 Head phantom
4.3 Effect on reconstruction
5.1.1 Variance Reduction
6.1 Further work
6.1.1 Sources of error
6.1.2 Possible Speed-ups
A Rejection Sampling
B Estimation of Memory Limitations