Fully Convolutional Network for Semantic Segmentation
In this section, we detail the Fully Convolutional Network (FCN) model (Long et al. 2015), which is widely considered the cornerstone of deep learning applied to semantic segmentation. It is one of the first works to demonstrate that Convolutional Neural Networks (CNNs) can effectively learn to make dense predictions for semantic segmentation in an end-to-end fashion. Many recent state-of-the-art methods can be seen as extensions and improvements of the FCN. Thus, a good understanding of the advantages and limitations of the FCN helps us understand other works and provides insight into future research.
Principle and Model
After the huge success of CNNs on image classification, applying CNNs to semantic segmentation is a natural step. However, due to the differences between the two tasks, the CNN methods used for classification cannot be applied directly to semantic segmentation. For image classification, the goal is to recognize the object in the image regardless of its location or boundaries. As a result, spatial information is generally discarded in classification CNNs, both to speed up computation and to obtain more robust representations. In contrast, spatial information is key for semantic segmentation, since we need to classify each pixel in the image. The challenge, therefore, is how to retain more spatial information while maintaining discriminative features for classification.
Precisely, consider an input image $I \in [0, 1]^{H \times W \times 3}$ and a pre-defined label set $L = \{1, 2, \dots, C\}$, where $H$ (resp. $W$) is the height (resp. width) of the image and $C$ is the total number of classes. A classification model is a function from $[0, 1]^{H \times W \times 3}$ to $\mathbb{R}^{C}$, while a semantic segmentation model is a function from the same space to $\mathbb{R}^{H \times W \times C}$. As can be seen, the classification problem requires the model to convert the volume of the 2d image into a single vector. To do this, a key operation in classification CNNs is the flatten operation, which transforms the 2d activation map into a 1d vector and allows the execution of fully connected operations. However, this kind of reshaping operation discards the spatial relations among pixels in the image. In addition, the fully connected layer requires a fixed input size, and therefore the network can only be applied to images of a fixed size. With this in mind, the first contribution of the FCN is to remove the flatten operation and replace the fully connected layers with $1 \times 1$ convolution layers. In this manner, the modified network maintains 2d activation maps to produce dense predictions, and can be applied to an image of any size (see Figure 2.3 (a)).
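To make the first contribution concrete, the following minimal sketch shows a $1 \times 1$ convolution as one fully connected layer applied independently at every spatial position. It is written in plain Python for readability; the function names `linear` and `conv1x1`, the weights and the toy feature maps are illustrative assumptions and are not taken from the FCN implementation.

```python
# Sketch: a fully connected classifier reused as a 1x1 convolution.
# All names, weights and sizes below are illustrative assumptions.

def linear(weights, bias, features):
    """Apply a fully connected layer to a single feature vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def conv1x1(weights, bias, feature_map):
    """Apply the same layer at every position of an H x W x D feature map."""
    return [[linear(weights, bias, pixel) for pixel in row]
            for row in feature_map]

# Two classes, three input channels.
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]

small_map = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]        # 1 x 2 x 3
large_map = [[[1.0, 2.0, 3.0]] * 4 for _ in range(3)]    # 3 x 4 x 3

small_scores = conv1x1(weights, bias, small_map)         # 1 x 2 x 2
large_scores = conv1x1(weights, bias, large_map)         # 3 x 4 x 2
```

The same weights process both the 1 x 2 and the 3 x 4 map, and the output keeps the spatial layout of the input, which is what allows dense predictions on images of any size.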
However, this is not enough to produce high-quality segmentation masks. Apart from the flatten operation, classification CNNs progressively reduce the spatial resolution through subsampling operations such as striding, and the last feature map is usually 32 times smaller than the input. For example, a $224 \times 224$ image becomes a $7 \times 7$ feature map after all convolution layers. A segmentation mask predicted from this $7 \times 7$ feature map is necessarily very coarse and inaccurate. Simply removing the subsampling operations, however, involves a trade-off: the filters see finer information, but have smaller receptive fields and take longer to compute. The second contribution of the FCN is to solve this problem by retaining the receptive field and adding skip connections and deconvolutions to refine the predictions (see Figure 2.3 (b)). Skip connections incorporate finer information (low-level features) into the final prediction, and the deconvolution is used as an upsampling method to produce higher-resolution outputs.
Thanks to these two major contributions, the FCN can transform most popular classification CNNs such as VGG (Simonyan et al. 2015) and ResNet (K. He et al. 2016) into a fully convolutional version and then use them for semantic segmentation. In the following section, we describe the learning strategy proposed by FCN.
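The resolution arithmetic and the role of a skip connection can be sketched as follows. This is only an illustration under stated assumptions: the factor of 32 is modeled as five stride-2 stages, and nearest-neighbour upsampling stands in for the learned deconvolution. The helper names `output_size`, `upsample2x` and `fuse` are hypothetical.

```python
# Sketch: subsampling arithmetic and a skip connection.
# Assumptions: five stride-2 stages give the factor of 32, and
# nearest-neighbour upsampling replaces the learned deconvolution.

def output_size(input_size, num_stages, stride=2):
    """Spatial size after a number of subsampling stages."""
    size = input_size
    for _ in range(num_stages):
        size //= stride
    return size

def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of a 2d score map."""
    out = []
    for row in scores:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    """Skip connection: add upsampled coarse scores to finer scores."""
    upsampled = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(upsampled, fine)]

coarse = [[1.0, 2.0]]                        # 1 x 2 scores from the last layer
fine = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]               # 2 x 4 scores from an earlier layer
fused = fuse(coarse, fine)                   # 2 x 4 refined scores
last_map_size = output_size(224, 5)          # 224 / 2**5
```

Here `last_map_size` reproduces the $7 \times 7$ resolution mentioned above, and `fused` has the resolution of the finer map while still carrying the coarse prediction.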
A supervised machine learning algorithm generally consists of three parts: the model (or the hypothesis space), the objective function and the optimization method over a dataset. We have already introduced the FCN model in Section 2.2.1, and in this section, we present the training strategy of FCN.
Classification models usually use the cross-entropy loss as the objective function. This loss is computed between the predicted distribution $\hat{y} \in [0, 1]^{C}$ and the ground truth distribution $y \in [0, 1]^{C}$ by the formula:

$$\mathcal{L}_{CrossEntropy}(\hat{y}, y) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \qquad (2.2)$$
where $C$ is the total number of classes. The loss is then minimized by stochastic gradient descent methods and generally performs well for classification tasks. However, this objective function is not directly applicable to semantic segmentation, since the predicted mask $\hat{S}$ and the ground truth mask $S$ are not distributions. These masks belong to the space $[0, 1]^{H \times W \times C}$, where for each position $(i, j)$ the vectors $\hat{S}_{i,j}$ and $S_{i,j}$ are distributions. Therefore, FCN proposes to use the Per Pixel Cross Entropy (PPCE) loss, which is a normalized sum over the spatial dimensions of the masks:

$$\mathcal{L}_{PPCE}(\hat{S}, S) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathcal{L}_{CrossEntropy}(\hat{S}_{i,j}, S_{i,j}) \qquad (2.3)$$

The gradient of this function is the (normalized) sum of the gradients of its spatial components. Thus, stochastic gradient descent on $\mathcal{L}_{PPCE}$ computed on whole images is the same as stochastic gradient descent on $\mathcal{L}_{CrossEntropy}$, taking all of the final layer receptive fields as a mini-batch. With this loss function, both feed-forward computation and back-propagation can be efficiently computed over an entire image.
As a result, a classification CNN can be transformed into an FCN version and then efficiently trained in an end-to-end fashion with stochastic gradient descent methods. With minor modifications, a high-performance classification network can be easily used for semantic segmentation tasks. These advantages make the FCN a milestone in the application of deep learning methods to semantic segmentation.
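As an illustration of Equations (2.2) and (2.3), the sketch below computes the cross-entropy of one distribution and the PPCE loss of a small mask in plain Python. The function names and the toy masks are assumptions made for the example; terms with a zero target are skipped, which leaves the value unchanged and avoids evaluating $\log(0)$.

```python
import math

def cross_entropy(pred, target):
    """Cross-entropy between a predicted and a ground truth distribution (Eq. 2.2)."""
    # Terms with target == 0 contribute nothing and are skipped.
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0)

def ppce(pred_mask, target_mask):
    """Per Pixel Cross Entropy: mean of the pixel-wise cross-entropies (Eq. 2.3)."""
    height, width = len(pred_mask), len(pred_mask[0])
    total = sum(cross_entropy(pred_mask[i][j], target_mask[i][j])
                for i in range(height) for j in range(width))
    return total / (height * width)

# Toy 1 x 2 masks with two classes (assumed data).
pred = [[[0.5, 0.5], [0.25, 0.75]]]
target = [[[1.0, 0.0], [0.0, 1.0]]]
loss = ppce(pred, target)   # (-log 0.5 - log 0.75) / 2
```

Each pixel contributes one cross-entropy term, so the loss over a whole image behaves like a mini-batch of $H \times W$ classification examples.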
Optimizing Deep Architectures for Segmentation
Recently, in the spirit of Automated Machine Learning (AutoML), there has been significant interest in designing neural network architectures automatically, instead of relying heavily on expert experience and knowledge. Neural Architecture Search (NAS) has successfully identified architectures that outperform human-designed architectures on large-scale image classification problems (Zoph et al. 2018; Chenxi Liu et al. 2018; Real et al. 2019), and recent efforts to apply NAS to semantic segmentation (L.-C. Chen et al. 2018; Chenxi Liu et al. 2019) have shown strong experimental results.
However, NAS is usually time-consuming, and the resulting architectures are less intuitive and offer less insight. To better understand the challenges of segmentation, in this section, we present the recent improvements of human-designed networks based on the two challenges discussed above: spatial information and contextual information. It is worth noting that many works design their networks by taking into account both spatial and contextual information, since both are key to segmentation.
Weakly-Supervised Learning (WSL) methods for semantic segmentation have attracted a lot of interest due to the lack of fully annotated data for segmentation. WSL approaches encompass a variety of training annotations that are less informative than pixel-level labels, such as (in order of decreasing informativeness): bounding boxes (Dai et al. 2015; Papandreou et al. 2015), scribbles (D. Lin et al. 2016), points (A. L. Bearman et al. 2016), and image labels (Papandreou et al. 2015; Xu et al. 2015) (see Figure 2.10). Image-level labels are the cheapest to provide and can be obtained easily from many sources, thanks to datasets for image recognition. However, due to the complete absence of spatial information, image-level labels are also the most challenging to use. In what follows, we review the literature on weakly-supervised semantic segmentation from image-level annotations.
Fully-supervised methods can be considered the “upper bound” in performance for WSL because they are trained with theoretically the most informative supervisory data possible. Although, for the moment, the best fully supervised segmentation model far outperforms the best WSL method, the quality of WSL methods is impressive, especially considering that they learn to segment without any location-specific supervision. To learn segmentation from such weak supervision, different WSL methods have been proposed, and they can be broadly categorized into two approaches. The first type of approach is based on Multiple Instance Learning (MIL). More precisely, in MIL, the training set consists of labeled “bags”, each of which is a collection of unlabeled instances. A bag is positively labeled if at least one instance in it is positive, and negatively labeled if all instances in it are negative. The goal of MIL is to predict the labels of new, unseen bags. In the context of WSL for segmentation, a bag is an image with image-level labels and each pixel is an unlabeled instance. That is, for a given image, an image-level label tells us that at least one pixel of that class is present. In practice, this usually means training a CNN with an image-level loss and inferring the pixels responsible for each predicted class. MIL-FCN (Pathak et al. 2014) trains an FCN with a global max-pooling loss, which selects the most informative region for the MIL prediction. WILDCAT (Durand et al. 2017) proposes a weighted spatial average operation, which also takes negative evidence into account. To better locate objects, Papandreou et al. (2015) incorporate an additional prior in the MIL framework in the form of an adaptive foreground/background bias. The notion of objectness priors is further developed by Wei et al. (2016) and A. L. Bearman et al. (2016), who provide each pixel with a probability of being an object.
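The MIL formulation above can be sketched in a few lines of Python. This is only an illustration of the bag labeling rule and of global max pooling over a per-class score map; it is not the MIL-FCN implementation, and the function names and the toy score map are assumptions.

```python
# Sketch of the MIL view of image-level supervision.
# Names and data are illustrative assumptions, not the MIL-FCN code.

def bag_label(instance_labels):
    """MIL rule: a bag is positive iff at least one instance is positive."""
    return any(instance_labels)

def image_score(score_map):
    """Global max pooling: the image-level score is the highest pixel score."""
    return max(max(row) for row in score_map)

def most_informative_pixel(score_map):
    """Position (i, j) of the pixel selected by global max pooling."""
    _, i, j = max((value, i, j)
                  for i, row in enumerate(score_map)
                  for j, value in enumerate(row))
    return i, j

# Toy 2 x 3 score map for one class (assumed data).
score_map = [[0.1, 0.2, 0.1],
             [0.3, 0.2, 0.9]]
score = image_score(score_map)
position = most_informative_pixel(score_map)
```

Only the selected pixel receives a gradient from an image-level loss built on `image_score`, which is one reason why later methods pool over more than a single location.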
Positioning and Framework
In the previous part of this chapter, we reviewed the main methods of semantic segmentation, and highlighted the state-of-the-art methods that are based on DL. After a detailed analysis of the seminal approach FCN, we summarized the three major challenges of semantic segmentation algorithms based on DL and presented the corresponding improvements. We now go over some interesting questions that we address in this thesis to produce finer segmentation results and to alleviate the demand for annotated data.
The dependency between pixels plays an important role in semantic segmentation, as it defines which pixels in an image belong to the same object. As we introduced in Section 2.1, traditional segmentation methods, such as region-based and edge-based methods, employ human-designed features to determine whether pixels belong to the same segment. These features are mainly calculated using simple pixel variations and cannot capture the dependency at a semantic level. In the era of DL, thanks to the advances in hardware and algorithms, CNN-based models are able to learn more complicated dependencies from a large amount of annotated data. In order to better model and capture the pixel dependencies, most work has focused on improving the network architecture, such as the encoder-decoder architecture, spatial pyramid pooling and attention mechanisms that we presented in Section 2.3. However, the PPCE loss function in DL-based methods (see Section 2.4) does not guarantee that the modules added to the network will learn the pixel dependencies. Since the PPCE loss evaluates the pixels independently, it does not impose any constraint on the consistency of dependent pixel predictions. This means that the prediction of a pixel has no effect on the related pixels, and the predictions of highly dependent pixels need not be dependent as well. We are particularly interested in this important issue and attempt to address it by defining a new loss function, called the SEMEDA loss, that requires predicted masks to have the same spatial dependency as the ground truth. More specifically, when a pixel is inside a large object, its prediction should be uniform with its neighbors, and if the pixel is on the boundary of the object, its prediction should also contain information about the object next to it and retain the boundary information along with the neighboring pixels.
In this manner, the mask produced by segmentation models can better conform to the boundary shape of objects and avoid holes inside the object.
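The independence of pixels under the PPCE loss can be checked on a toy example: if the pixels of the prediction and of the ground truth are rearranged with the same permutation, the loss does not change, so the loss carries no information about the spatial arrangement of the pixels. The sketch below is a minimal illustration of this property only, not of the SEMEDA loss; the flattened masks and the permutation are assumed data.

```python
import math

def ppce(pred, target):
    """Mean pixel-wise cross-entropy over flat lists of per-pixel distributions."""
    losses = [-sum(t * math.log(p) for p, t in zip(pred_px, target_px) if t > 0)
              for pred_px, target_px in zip(pred, target)]
    return sum(losses) / len(losses)

# Flattened masks of four pixels with two classes (assumed toy data).
target = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
pred = [[0.1, 0.9], [0.6, 0.4], [0.8, 0.2], [0.7, 0.3]]

# Apply the same fixed permutation of pixel positions to both masks.
perm = [2, 0, 3, 1]
shuffled_target = [target[k] for k in perm]
shuffled_pred = [pred[k] for k in perm]

original_loss = ppce(pred, target)
shuffled_loss = ppce(shuffled_pred, shuffled_target)
```

Up to floating-point rounding, `original_loss` and `shuffled_loss` are equal, even though the shuffled masks no longer describe contiguous objects.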
Table of contents:
1.3 Contributions and Outline
1.4 Related Publications
2 Related Works
2.1 Traditional Methods
2.2 Fully Convolutional Network for Semantic Segmentation
2.2.1 Principle and Model
2.2.2 Learning Strategy
2.2.3 Dataset and Evaluation
2.3 Optimizing Deep Architectures for Segmentation
2.3.1 Spatial Information
2.3.2 Contextual Information
2.4 Objective Function
2.5 Weakly-Supervised Segmentation
2.6 Positioning and Framework
3 SEMEDA: Enhancing Segmentation Precision with Semantic Edge Aware Loss
3.2 PPCE Loss Extension
3.3 SEMEDA Model
3.3.1 The Pitfalls of Naive Segmentation Approaches
3.3.2 Structure Learning Through Edge Embeddings
3.3.3 Learning to Detect Semantic Edges
3.4.1 Implementation Details
3.4.2 Experimental Setup
3.4.3 SEMEDA Parametrization
3.4.4 Quantitative Validation
3.4.5 Comparison with State-of-the-art Approaches
3.4.6 Qualitative Assessment
3.5.1 Face Parsing
3.5.2 Lesion Boundary Segmentation
4 Weakly Supervised Segmentation with Attribution Methods
4.2 Motivation and Context
4.3 Attribution for Convolutional Neural Networks
4.3.1 Related Works
4.3.2 Delving Deep into Interpreting Neural Nets with Piece-Wise Affine Representation
4.3.3 Experiment for Visual Explanation
4.4 Weakly Supervised Semantic Segmentation Using Attribution
4.4.1 Experiment Setup
4.4.2 Quantitative Results
4.4.3 Ablation Study
5.1 Summary of Contributions
5.2 Perspectives for Future Work