Denoising and Skeletonization
With the increasing use of images in domains such as medicine and satellite imagery, image denoising has become a significant research area in computer vision; images are frequently contaminated by noise introduced during transmission or compression. In the engineering drawing domain, the noise usually originates from the scanning process that digitizes paper drawings, which typically produces salt-and-pepper noise. In scanned engineering drawings, the most harmful noisy pixels are those connected to graphical elements, because they can fragment a primitive by destroying important pixels or merge two or more primitives together.
Image denoising is now essential in image processing: it aims to reduce noisy pixels and make images easier to interpret. Several methods have been proposed to solve this problem for different types of images [37, 38, 39]. More specifically, many denoising methods in the literature target salt-and-pepper noise in technical drawings and documents.
One of the best-known filters used in previous works is the median filter, a basic filter for removing salt-and-pepper noise. It slides an n × n window over the input image and replaces the value at the window center with the median of the values inside the window. Its main drawback is the heavy distortion of thin lines and corners: because such pixels are surrounded by white pixels, the filter erases important information, which degrades the subsequent vectorization and 3D reconstruction, as detailed below. The Center Weighted Median (CWM) filter is inspired by the median filter; to retain details such as thin lines, it gives the center pixel more weight than its neighborhood. Another family is morphological filtering, which applies combinations of dilation and erosion to remove salt-and-pepper noise. Its main disadvantages are, again, the distortion of thin lines and a high sensitivity when primitives are close (nearby primitives may be merged). The KFill [43, 44] and EnhKfill algorithms slide a k × k window over the image and decide whether or not to flip the center pixel based on its neighborhood. Their drawback is the choice of k: if k is small, one-pixel-wide lines are shortened from their endpoints; if k is large, some graphical elements are extensively eroded.
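The thin-line distortion discussed above is easy to reproduce. The sketch below applies SciPy's `median_filter` to a toy binary drawing (the array and sizes are made up for illustration): a one-pixel-wide line contributes at most three black pixels to any 3 × 3 window, so the median is always white and the entire line is erased along with the noise.

```python
import numpy as np
from scipy.ndimage import median_filter

# Toy binary drawing: 0 = ink, 255 = background.
img = np.full((9, 9), 255, dtype=np.uint8)
img[4, 1:8] = 0          # a horizontal one-pixel-wide line
img[1, 1] = 0            # an isolated "pepper" noise pixel
img[7, 7] = 0            # another isolated noise pixel

den = median_filter(img, size=3)

# The isolated noise pixels are removed, but every pixel of the
# one-pixel-wide line sees at most 3 black values in its 3x3 window
# (median = white), so the line itself is erased as well.
```

This is exactly the information loss that would break the later vectorization stage, which motivates the detail-preserving filters discussed above.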
U-net Validation for Denoising
In this experiment, we train a U-net network to denoise technical drawings using the described dataset. The model is trained with the MSE loss function and the Adam optimizer; the learning rate is set to 10⁻⁴, the batch size to two, the steps per epoch to 2000, and the number of epochs to 30. The input must be a binary image of size 1024 × 1024. If the image is not binary, the Otsu algorithm is used to binarize it, and if it is smaller than 1024 × 1024, white borders are added around it. None of the tested images exceeds 1024 × 1024. The trained model reaches 99.97% accuracy.
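The input preparation described above (Otsu binarization, then white borders up to 1024 × 1024) can be sketched in plain NumPy. The function names are ours, and a real pipeline would more likely call OpenCV's `cv2.threshold` with the `THRESH_OTSU` flag; this version just makes the two steps explicit:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu threshold of an 8-bit grayscale image:
    the gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # class-0 probability
    mu = np.cumsum(prob * np.arange(256))     # class-0 cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def prepare_input(gray, size=1024):
    """Binarize with Otsu, then center on a white size x size canvas."""
    t = otsu_threshold(gray)
    binary = np.where(gray > t, 255, 0).astype(np.uint8)
    h, w = binary.shape
    out = np.full((size, size), 255, dtype=np.uint8)  # white borders
    top, left = (size - h) // 2, (size - w) // 2
    out[top:top + h, left:left + w] = binary
    return out
```

The padded result is what the U-net would consume; the network itself (encoder-decoder with skip connections) is standard and omitted here.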
The algorithms are evaluated on images from the International Association for Pattern Recognition (IAPR) contests on Graphics RECognition (GREC), namely GREC 2003 and GREC 2005, where:
1. GREC 2003 contains four scanned drawings (named ’1’, ’2’, ’3’, ’4’) in 256 gray levels, binarized with moderate thresholds. Images ’1’ and ’4’ are binarized again with a higher threshold to generate images ’1_230’ and ’4_230’ with thicker lines, while images ’2’ and ’3’ are binarized with a smaller threshold to generate images ’2_100’ and ’3_100’. Moreover, images ’1_4’, ’2_4’, ’3_4’ and ’4_4’ are generated by adding synthesized noise.
2. GREC 2005 contains six real scanned drawings (named ’5’ to ’10’), each used to generate two images: one with random noise (’5_rn’, …, ’10_rn’) and one with salt-and-pepper noise (’5_sp’, …, ’10_sp’).
We use only the clean images of GREC 2003/2005 and the thinner/thicker variants of GREC 2003. To each image we add uniform salt-and-pepper noise at different noise levels (5%, 10%, 15%, 20%, 30%, and 40%), then add white borders so that each image reaches 1024 × 1024. The trained model is compared with previous methods using a qualitative metric (visual inspection) and quantitative metrics (PSNR and DRD, detailed in the previous section).
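The noise-injection step can be sketched as follows. This is a minimal NumPy version; the function name and the equal salt/pepper split are our assumptions about what "uniform salt-and-pepper noise at level p" means here:

```python
import numpy as np

def add_salt_pepper(binary, level, seed=None):
    """Corrupt a fraction `level` of the pixels, chosen uniformly
    without replacement: half become salt (255), half pepper (0).
    `binary` is expected to hold values 0 / 255."""
    rng = np.random.default_rng(seed)
    noisy = binary.copy()
    n = int(level * binary.size)
    idx = rng.choice(binary.size, size=n, replace=False)
    flat = noisy.ravel()                 # view into `noisy`
    flat[idx[: n // 2]] = 255            # salt
    flat[idx[n // 2:]] = 0               # pepper
    return noisy
```

For a 10% level on an all-white image, exactly 5% of the pixels turn black, since the salt half leaves white pixels unchanged.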
Bag of Visual Words
Bag of Visual Words (BoVW) is a well-known image classification method whose idea parallels the Bag of Words model in natural language processing: an image is represented as a set of features.
The method extracts local features from input images, quantizes the feature space with a clustering algorithm, and takes the cluster centers as the visual words that describe an image. After extracting the visual words, a classifier is trained on them to classify new test images. Building a BoVW model therefore requires generating a dataset, selecting a feature detector and descriptor, and selecting a classifier to train on the extracted features. All these steps are explained in detail in the following sections.
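The quantization step can be sketched as follows, assuming local descriptors (e.g., from SIFT) and a codebook of cluster centers are already available. In practice the codebook comes from k-means over the training descriptors, and the resulting histograms feed the classifier; the function name and toy data below are illustrative:

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word
    (codebook row) and return the normalized word-frequency histogram."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()

# Toy example: a 2-word codebook and three 2-D descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.0, 1.0], [9.0, 9.0], [10.0, 11.0]])
h = bovw_histogram(desc, codebook)   # one third / two thirds
```

Each image is thus reduced to a fixed-length vector regardless of how many key points it contains, which is what makes a standard classifier applicable.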
We generate a dataset consisting of two groups: arrow images and non-arrow images. It contains 3312 arrow images and 3341 non-arrow images (synthetic and real). The real samples are cropped from real images and classified manually; they represent only about 10% of the dataset. The synthetic samples are generated by a Python script that produces arrow images and random-line (non-arrow) images of random sizes between 16 × 16 and 64 × 64. All real and synthetic images are then resized to 128 × 128 using nearest-neighbor interpolation and binarized with the Otsu algorithm. Figure 2.4.12 shows samples from the dataset.
After generating the dataset, the visual words are built by detecting features, extracting descriptors from the images, clustering the descriptors, using the center of each cluster as a vocabulary entry of the visual dictionary, and building a frequency histogram that counts the occurrences of each vocabulary entry in the image. We use SIFT and KAZE as feature detectors and descriptors, based on their performance in literature reviews. The two feature extraction algorithms are explained in detail in the following paragraphs:
Lowe presented the Scale Invariant Feature Transform (SIFT) algorithm, one of the best-known feature extraction algorithms. SIFT is robust to changes in scale, rotation, and illumination. It consists of four main steps, shown in Figure 2.4.13 and described in the following paragraphs.
• Scale-space extrema detection: As Figure 2.4.14 shows, key points cannot be identified reliably at a single scale. This stage therefore searches over all scales and image locations to detect key points that are invariant to scale and orientation.
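The idea behind this stage can be illustrated with a simplified difference-of-Gaussians (DoG) sketch. The sigma values, level count, and contrast threshold below are illustrative; real SIFT additionally uses multiple octaves, subpixel refinement, and edge-response rejection:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_pyramid(img, sigma0=1.6, k=2 ** 0.5, levels=4):
    """One octave of progressively blurred images and their
    differences (DoG), stacked along a scale axis."""
    blurred = [gaussian_filter(img.astype(np.float64), sigma0 * k ** i)
               for i in range(levels)]
    return np.stack([blurred[i + 1] - blurred[i]
                     for i in range(levels - 1)])

def dog_extrema(dog, thresh=0.01):
    """Keypoint candidates: pixels that are extrema of their 3x3x3
    (scale, y, x) neighborhood and pass a contrast threshold."""
    mx = maximum_filter(dog, size=3)
    mn = minimum_filter(dog, size=3)
    cand = ((dog == mx) | (dog == mn)) & (np.abs(dog) > thresh)
    cand[0] = cand[-1] = False   # need a scale level above and below
    return np.argwhere(cand)     # rows of (level, y, x)
```

A blob whose size matches one of the intermediate scales produces an extremum at that level, which is how the detector assigns each key point its characteristic scale.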
Table of contents:
List of Figures
List of Tables
1.1 Thinning Approaches
1.1.1 Morphological Approach
1.1.2 Distance Transform Approach
1.2 Supervised Learning
1.2.1 Learning algorithm
1.2.1.1 Forward Propagation
1.2.2 Basic CNN Components
1.2.2.1 Convolutional Layer
1.2.2.2 Pooling Layer
1.2.2.3 Activation Function
1.2.2.4 Batch Normalization
1.2.2.5 Fully Connected Layer
2 Preprocessing of Engineering Drawings
2.2 Views Extraction
2.3 Denoising and Skeletonization
2.3.2 Dataset Description
2.3.3 Distortion Measurements
2.3.4 U-net Validation for Denoising
2.4 Arrow Head Detection
2.4.1 Bag of Visual Words
2.4.1.1 Dataset Description
2.4.1.2 Local Feature
2.4.2 Cross Validation
2.4.3 Confusion Matrix
3 Unsupervised Vectorization
3.2 Proposed Method
3.2.1 Population Generation
3.2.2 Fitness Function
3.2.3 Selection and Crossover
3.2.4 Post Processing
3.3 Results and Discussion
3.3.1 Comparison with Previous Methods
3.3.2 Algorithm Robustness and Hyper-parameters Effects
3.3.2.1 Number of Chromosomes and Generation Effects
3.3.2.2 Width (𝛼) Effects
3.3.2.3 Rotation, Scale and Noise Effects
3.3.2.4 3D Reconstruction
4 Hybrid Vectorization
4.2 Literature Review
4.3 Semantic Segmentation
4.3.1 Visual Geometry Group Networks
4.3.2 Residual Networks
4.4 Instance Segmentation
4.4.1 Mask Regional Convolutional Neural Network
4.5 Performance Metrics
4.6 Hybrid Vectorization Algorithm
4.6.1 Segmentation Phase
4.6.1.1 Dataset Description
4.6.1.2 Segmentation Phase
4.6.1.3 Segmentation Results and Discussion
4.6.2 Detection Phase
4.6.3 Hybrid Vectorization Results
4.7 3D reconstruction
4.7.1 Pix2Vox Model
Conclusions and perspectives