Capturing User Input


The proposed algorithm works in a semi-automated one- or k-shot fashion, meaning that the segmentation of a volume is initialised by a manual segmentation of a slice that shows the network the object of interest. From the initialisation point the network performs automated segmentation. In between, the results are inspected manually to find slices that are incorrect. These are then corrected by hand and the algorithm is reinitialised with the corrected slice as a starting point, attempting to propagate the manual correction forward through the subsequent slices. Since this interaction is not part of the actual network inference pass, making the network utilise the user-provided information is not trivial. The approach used is to force the network to rely on the provided initial slice by heavily augmenting the image slices in terms of shape, rotation and intensity, as further described in Section 2.3. Rather than learning to segment the image directly, the network should learn to rely on the concatenated prior.
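The propagation loop described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `propagate_forward` and `dummy_predict` are hypothetical names, and `dummy_predict` is only a stand-in for the trained network so that the example runs on its own.

```python
import numpy as np

def propagate_forward(volume, corrections, predict_slice):
    """Propagate masks forward through a volume, slice by slice.

    volume:        array of shape (slices, height, width)
    corrections:   dict mapping slice index -> manually provided mask
    predict_slice: callable (image_slice, prior_mask) -> predicted mask
    """
    start = min(corrections)  # begin at the first manually segmented slice
    masks = {}
    prior = None
    for z in range(start, volume.shape[0]):
        if z in corrections:
            # A manual slice overrides the prediction and re-initialises the prior.
            prior = corrections[z]
        else:
            prior = predict_slice(volume[z], prior)
        masks[z] = prior
    return masks

def dummy_predict(image_slice, prior_mask):
    # Placeholder for the network (assumption): keep prior pixels that are bright.
    return prior_mask & (image_slice > 0.5)

volume = np.ones((4, 2, 2))
volume[2] = 0.0  # a slice where the stand-in predictor loses the object
full = np.ones((2, 2), dtype=bool)
corrections = {0: full, 3: full}  # initial slice plus one later correction
masks = propagate_forward(volume, corrections, dummy_predict)
```

In this toy run the prediction fails at slice 2, and the correction at slice 3 restores the prior for everything that follows it.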

Data Pre-Processing

The data used comes from the Visceral3 [9] database, from which 20 whole-body CT scans of patients were chosen with dense labels for liver, spleen, left kidney, left lung and pancreas. The intensity values of the CT scans were normalised by clipping the intensity range to 1024 and then dividing by 500 (empirically chosen). To provide natural non-linear deformation, the “previous masks” were sampled randomly at a distance of 2 from the current slice, which is far enough away to provide the network with non-linear deformation compared to the current slice but close enough to avoid overlap with nearby structures that might otherwise be labelled incorrectly.
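A minimal sketch of the intensity normalisation is given below. The text does not state the lower clipping bound, so the symmetric range [-1024, 1024] used here is an assumption, as is the function name `normalise_ct`.

```python
import numpy as np

def normalise_ct(volume, clip_value=1024.0, scale=500.0):
    """Clip CT intensities and rescale them.

    Assumption: values are clipped symmetrically to [-clip_value, clip_value];
    the source only states that the range is clipped "to 1024".
    """
    clipped = np.clip(volume, -clip_value, clip_value)
    return clipped.astype(np.float32) / scale

example = normalise_ct(np.array([-3000, 0, 500, 3000]))
```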

Data Augmentation

The online data augmentation applied during training is unusually heavy, consisting of large amounts of affine scaling (up to 20%) and rotation (up to 180°) as well as flipping left/right and up/down. To improve the robustness to intensity variation, random contrast and brightness augmentation is applied as well. The heavy image augmentation is intended to reduce the network's dependence on image intensity and to prevent the network from making decisions that deviate too much from the given prior. Instead, the network is pushed towards relying on the information provided in the additional image channel containing the prediction of the previous slice. This way the network is able to include and propagate changes provided by a user without being explicitly trained with a “user in the loop” providing changes to the data online.
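The flipping and intensity parts of this augmentation can be sketched as below. The contrast and brightness ranges are hypothetical values chosen for illustration (the text does not give them), and the affine scaling and rotation are left out to keep the example dependent on NumPy only. Geometric changes are applied to image and label together, whereas intensity changes touch the image alone.

```python
import numpy as np

def augment(image, label, rng,
            contrast_range=(0.8, 1.2),      # hypothetical range
            brightness_range=(-0.1, 0.1)):  # hypothetical range
    """Randomly flip an image/label pair and perturb the image intensity."""
    if rng.random() < 0.5:  # left/right flip, applied to both arrays
        image, label = np.fliplr(image), np.fliplr(label)
    if rng.random() < 0.5:  # up/down flip, applied to both arrays
        image, label = np.flipud(image), np.flipud(label)
    contrast = rng.uniform(*contrast_range)
    brightness = rng.uniform(*brightness_range)
    image = image * contrast + brightness  # the label is left untouched
    return image, label

rng = np.random.default_rng(0)
image = rng.random((4, 4))
label = (image > 0.5).astype(np.uint8)
aug_image, aug_label = augment(image, label, rng)
```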

To improve the robustness to bad predictions from the network, the previous slices are further augmented with affine rotation and scaling. Various combinations of rotation in the ranges {1, 5, 10}° and scaling in the ranges {1, 5, 10}% are tested to evaluate how this affects robustness and reliance on auxiliary input. The results are presented in Section 3.1. No explicit affine translation is included, because the affine transformation is applied about the centre of the image rather than the centre of mass of the concatenated mask, so some translation is inherently applied when the mask is rotated. Overall, this approach gave better results than rotating, scaling and translating the mask around a pivot point at the centre of mass.
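The implicit translation can be illustrated with a small calculation. The coordinates below are made-up values (an assumed 512×512 image with a mask centroid 100 pixels from the image centre), and `rotate_point` is a hypothetical helper.

```python
import numpy as np

def rotate_point(point, pivot, angle_deg):
    """Rotate a 2D point about a pivot by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ (np.asarray(point, dtype=float) - pivot) + pivot

image_centre = np.array([256.0, 256.0])   # assumed 512x512 image
mask_centroid = np.array([156.0, 256.0])  # assumed centroid, 100 px off centre

# Rotating about the image centre moves the mask centroid ...
shift_image_pivot = rotate_point(mask_centroid, image_centre, 10) - mask_centroid
# ... whereas rotating about the centroid itself leaves it in place.
shift_centroid_pivot = rotate_point(mask_centroid, mask_centroid, 10) - mask_centroid
```

For a rotation by an angle θ the centroid moves by 2·d·sin(θ/2), where d is its distance from the pivot, so masks further from the image centre receive a larger implicit translation.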

Network Architectures

The first of the two proposed architectures is the traditional U-Net, originally proposed by Ronneberger et al. [2], consisting of an encoding and a decoding path. In the encoding path, each layer consists of two “blocks” of a 3×3 convolution, followed by a batch normalisation layer [10], followed by a rectified linear unit. The two convolutional blocks are then followed by a max pooling layer, reducing the spatial size of the feature maps. The decoding path is identical to the encoder, with the one difference of swapping the max pooling layer for a transposed convolutional layer (up-sampling convolution) that up-samples the encoded feature maps such that the final size of the segmentation is the same as the size of the input image. The encoding and decoding layers are connected by skip connections to retain high-resolution features that provide information about spatial localisation. Finally, at the end of the decoding path a 1×1 convolution is applied to reduce the number of feature maps to two.

The second architecture shares the same principal idea of having an encoding and a decoding path and is very similar to FusionNet, proposed by Quan et al. [11]. The architecture used in this project differs in that it uses strided convolution rather than max pooling for reducing spatial size, similar to what is proposed in [12]. The reasoning behind strided convolution is to retain more localisation and spatial information than a normal pooling operation. Pooling operations enhance a network's ability to be invariant to location, a sought-after attribute for classification. In the case of segmenting a specific organ at a specific location in the body, the network should instead learn to pay attention to the locality of different objects [12]. The residual connections are added to avoid problems with vanishing gradients as the network becomes deeper when strided convolutions are used [13]. The input of both networks differs from the original implementations in that the input layer consists of both the input image slice and the segmentation prediction of the previous slice. The two features are concatenated together as separate “colour” channels.
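The two-channel input can be sketched as below. The channels-last layout and the name `build_input` are assumptions made for illustration.

```python
import numpy as np

def build_input(image_slice, prior_mask):
    """Stack an image slice and the previous-slice mask as two channels.

    Assumption: channels-last layout, giving shape (height, width, 2).
    """
    return np.stack(
        [image_slice.astype(np.float32), prior_mask.astype(np.float32)],
        axis=-1,
    )

network_input = build_input(np.zeros((8, 8)), np.ones((8, 8), dtype=bool))
```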

Variants of both the U-Net and the residual network were trained at layer depths ranging from two to five, to see how the complexity of the networks affects the reliance on auxiliary user-provided corrections. The network weights are initialised with Xavier initialisation [14] and biases are zero-initialised. Both architectures are implemented using TensorFlow [15].

Since the method is network agnostic, any type of architecture could be used; the two architectures in this project were used for their relative shallowness and ease of training with limited resources, while still providing competitive results.

Network Training

The networks are trained with randomly shuffled 2D slices from the 20 image volumes. Each 2D image is concatenated with a previous or subsequent slice taken from the manual annotations, providing each sample with a strong prior that is incorporated in the encoding part of the network. The network prediction and ground truth are compared using a standard softmax cross-entropy function. Online augmentation provides the networks with an extremely large number of training samples and thereby with a good amount of variation in the data, promoting robustness. The networks are trained using the Adam optimiser [16] with an initial learning rate of 0.0001 and standard values for first and second moment decay. The learning rate is multiplied by 0.1 at the empirically chosen epochs 10, 20 and 25 to stabilise the training. The networks are trained for a total of 30 epochs. For robustness and correctness of results, all training and testing is done in a five-fold cross-validation setup.
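The step-wise learning rate schedule can be written as a small function. Whether each drop takes effect at the start or the end of the listed epoch is not stated, so applying it from the start of epochs 10, 20 and 25 (counting from zero) is an assumption; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(10, 20, 25), factor=0.1):
    """Return the learning rate for a given epoch under a step schedule."""
    # Count how many milestones have been reached (assumed inclusive).
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** drops

schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under this reading the final five epochs run at a learning rate of 1e-7.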

Evaluating Results

While the proposed method is intended to be used interactively, with a user providing corrections, this is not well suited to evaluation. Manual corrections introduce bias to the evaluation and would also be very tedious work. Instead, a reproducible approach is used where slices are picked at a fixed step length from the middle of the volume, moving towards the edges. The step length is chosen as a percentage of the total object volume such that a smaller organ will result in a smaller step length. This way fair comparisons between different configurations can be made in an automated fashion while still showing the effect of correction propagation.
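One way to read this selection scheme is sketched below. It assumes that the "total object volume" is measured as the number of slices the object spans, and the function name and the example fraction are hypothetical.

```python
def correction_slices(first, last, step_fraction):
    """Pick slices at a fixed step from the middle of an object outwards.

    first, last:   first and last slice index containing the object
    step_fraction: step length as a fraction of the object's extent (assumption)
    """
    extent = last - first + 1
    step = max(1, round(step_fraction * extent))
    middle = (first + last) // 2
    picked = [middle]
    offset = step
    while middle - offset >= first or middle + offset <= last:
        if middle + offset <= last:
            picked.append(middle + offset)
        if middle - offset >= first:
            picked.append(middle - offset)
        offset += step
    return picked

large_organ = correction_slices(0, 40, 0.25)  # object spanning 41 slices
small_organ = correction_slices(10, 14, 0.25)  # object spanning 5 slices
```

With the same fraction, the smaller object gets a step of one slice while the larger one gets a step of ten.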

Impact of Mask Augmentation

Comparing the results of the best augmentation configuration in Table 2 with the baseline results in Table 1, the Dice coefficient scores of the proposed configuration are comparable to those of the standard implementations that are trained separately for each class. Worth noting is that the proposed configuration performs better for the spleen, kidney and pancreas, where there is a large variation in shape and where the intensity contrast might be sub-par. This makes sense because these are cases that naturally would confuse a network due to a lack of variation in, and a limited amount of, training data. Thus, it would be harder for the network to build a good representation of the task from the image data alone. Instead, the provided prior is used to a greater degree.
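For reference, the Dice coefficient between a predicted and a ground-truth binary mask can be computed as in the sketch below. Returning 1.0 when both masks are empty is a convention assumed here, not something stated in the text.

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / total

score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```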

Perhaps less expected but still understandable is that the separately trained networks perform better for organs such as the liver and lung, which have consistent shape and intensity between different patients. In these cases the concatenated mask acts as noise by propagating error into the segmentation, pushing the network in a faulty direction. From Table 2 we can also see that the U-Net architecture used favours a large variation in scaling and a small variation in rotation. This could be explained by the way most organs change in shape when moving from slice to slice. In most cases there is a change in size of the object, at first growing until it reaches a peak circumference and then shrinking. By applying additional scaling to the concatenated mask, the network can be trained to transform the concatenated mask accordingly. Likewise, there is usually very little rotational change to the shape of an organ between slices, which could explain why a small rotational variation is favourable. Comparing columns two and three of Table 5, the added scaling and rotation of the concatenated mask increase the performance across all cases. This is in line with the results reported by Khoreva et al. [5], which indicate that an increased variation in the masks being propagated improves the robustness of the networks.

Network Design

The results shown in Table 4 compare the two architectures at the augmentation configuration that gave the best results for the U-Net architecture. The residual network performs worse at that configuration and also worse than the baseline. Whether this is due to the residual setup actually performing worse or due to it requiring different hyperparameters is hard to conclude. However, it is at least possible to see that the choice of network architecture does impact the results in some way. More evaluation with greater variation in network architecture would be required for further discussion.

In Table 3 the best configuration from Table 2 is tested at varying network depths. Looking at the Dice coefficients, the shallower networks clearly perform at a sub-par level. Increasing the depth further could, however, prove to be a problem due to the risk of vanishing gradients, making it impossible to update the weights of some of the earlier layers [11]. This could on the other hand be counteracted by the use of residual connections [13]; whether this is the case or not has not been evaluated in this project.

The impact of various network depths for the multiple correction case is shown in Figure 5. Network depths ranging from two to five were evaluated. The shallower networks show consistently worse segmentation accuracy, even as the number of corrections is increased, suggesting that a deeper architecture is favoured. This is interesting, as it would be expected that a shallow network would rely more on the provided prior due to being unable to extract enough information from the image alone because of the shallow depth [18]. Had the shallow networks performed better with an increasing amount of annotated prior data, they could have been a useful architecture for practical use where the segmentations are expected to be corrected. This is however not the case.

Propagation of Multiple Corrections

In Figure 4 the Dice coefficient for various organs with an increasing number of corrected (replaced with ground truth) slices is shown. As expected, the results improve with more and more corrections. As described in the last paragraph of Section 2.3, the evaluation method for multiple corrections was not as optimal as it could be. In an actual use case it would be preferable to propagate both backwards and forwards from each correction and then, rather than strictly overwriting the old segmentation with the new one, do a comparison and keep the best. However, even without doing that, Figure 4 still shows that the closer the prior is to the current slice, the less error will have time to accumulate. These results are in line with what was reported by Khoreva et al.

Generalisation to New Cases

The fact that it is at all possible to generalise to entirely new cases is the most interesting and promising result of this project. In Table 5 and Figure 6 both quantitative and qualitative results show that, while not perfect, it is very possible to teach a network to segment a specific organ from a provided prior. This shows that it is possible to teach a network to distinguish between different organs even though every organ is mapped to the same binary segmentation problem. The network performed better on organs with shapes and intensity similar to what is represented by the training data. The results for all of the organs except for the aorta are comparable to the results on the Visceral3 [9] leaderboard.

Even in problematic cases such as the aorta and adrenal gland, the network manages to stay on target without straying too much to nearby organs, see Figure 6. However, it also failed to a large degree to segment the object of interest, possibly because the intensity contrast was too low to “anchor” the provided prior to a structure in the image. In Table 5 it is possible to see that both added image and mask augmentation affected the generalisability. Possibly, the image augmentation causes the network to be more reliant on the concatenated mask during training. During inference this is helpful since the prior holds a lot of information about the unseen organ. The mask augmentation should help with robustness to new shapes and resilience to error propagation by training the network with a large variation of mask shapes.

Table of contents:

1. Introduction
2. Method
2.1. Spatial Propagation
2.2. Capturing User Input
2.3. Implementation Details
2.3.1. Data Pre-Processing
2.3.2. Data Augmentation
2.3.3. Network Architectures
2.3.4. Network Training
2.3.5. Evaluating Results
3. Results
3.1. Ablation Study
3.2. Multiple Corrections
3.3. Generalisation to New Cases
4. Discussion
4.1. Impact of Mask Augmentation
4.2. Network Design
4.3. Propagation of Multiple Corrections
4.4. Generalisation to New Cases
4.5. Future Work
5. Conclusion
Appendix A. State of the Art
References