Web Page Characteristics
In this section we describe the characteristics of Web pages. These characteristics are key elements for Web page segmentation algorithms. They relate either to the rendering (1.2.1) or to the website (1.2.2). A glossary of terms is presented at the end of the section (1.2.3).
Web page characteristics from the rendered DOM
Bottom-up strategies start by selecting the leaf nodes of the DOM tree, which are considered atomic units. These are aggregated with other nodes, in an iterative manner, until the stop condition is met.
One example of this approach is the Gomory-HuPS algorithm [LLT11]. The block graph is represented as an edge-weighted planar graph. First, the graph is constructed. Each graph vertex (block) represents content information based on the DOM structure and the visual information. DOM nodes are selected as blocks if they contain text, links and pictures as children. In most cases these elements are leaves of the DOM tree.
Edges are added based on the location of blocks in the geometric model, i.e. an edge is added between two blocks if they are closest neighbors. The weight of an edge is the normalized path similarity of the two DOM nodes, computed with the tree edit distance between both paths. This means that DOM nodes in the same subtree receive a lower weight than nodes located in different branches of the tree. Figure 1.2a shows the example Web page and the selected DOM nodes. Figure 1.2b shows the initial block graph with its weights. The algorithm iteratively evaluates pairs of blocks i and j: if the weight of the edge between these two vertices is below a parameter, both blocks are grouped. Figures 1.2c and 1.2d show the final segmentation and the block graph, assuming a parameter value of 0.1.
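The grouping step described above can be sketched as follows. This is an illustrative simplification, not the implementation of [LLT11]: blocks are vertices of an edge-weighted graph, and any two blocks joined by an edge whose weight falls below the parameter t are placed in the same group (using a union-find structure); the names merge_blocks, DisjointSet and t are ours.

```python
# Sketch of threshold-based block grouping over an edge-weighted block graph.
# Assumption: edges with weight below the parameter t indicate blocks that
# belong together (low weight = same DOM subtree, per the path similarity).

class DisjointSet:
    def __init__(self, items):
        self.parent = {i: i for i in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_blocks(blocks, edges, t=0.1):
    """blocks: iterable of block ids; edges: {(i, j): weight}."""
    ds = DisjointSet(list(blocks))
    for (i, j), w in sorted(edges.items(), key=lambda e: e[1]):
        if w < t:                      # weight below the parameter: group them
            ds.union(i, j)
    groups = {}
    for b in blocks:
        groups.setdefault(ds.find(b), []).append(b)
    return list(groups.values())
```

For instance, with three blocks where only the A-B edge weight (0.05) is below t = 0.1, the call merge_blocks(["A", "B", "C"], {("A", "B"): 0.05, ("B", "C"): 0.4}) groups A with B and leaves C alone.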
Document processing and Web page segmentation
We observe that there is a clear relationship between page segmentation and the field of computer vision. Segmenting and understanding scanned document images is a very well studied subject in the Document Processing domain [TS94, TCL+99]. In Document Processing systems, the concepts of objects, zones or blocks are applied to define the geometric and logical structure of a document, where each block is associated with a category, its geometric information and a label. Processing a document comprises two phases: document analysis and document understanding. Document images are analyzed to detect blocks based on pixel information. Blocks are categorized, their geometric information is extracted and, based on both features, a label is assigned. Moreover, by understanding what blocks contain (label) and where they are located (geometric model), it is possible to merge them [FM81] and give them a reading order, or flow [Meu05]. For example, assume we have a title block and two paragraph blocks: the two paragraph blocks should be merged, and the title block should appear first, followed by the merged paragraph block.
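The title-and-paragraphs example can be sketched in a few lines. This is a toy illustration of the understanding phase under our own assumptions (a Block type with a label and a vertical position, top-to-bottom reading order, consecutive paragraphs folded into one); it is not taken from [FM81] or [Meu05].

```python
# Toy sketch of document understanding: order blocks top-to-bottom and
# fold consecutive "paragraph" blocks into a single block, so a title
# followed by two paragraphs yields [title, merged paragraph].

from dataclasses import dataclass

@dataclass
class Block:
    label: str            # e.g. "title", "paragraph" (our own label set)
    top: int              # vertical position on the page

def reading_order(blocks):
    ordered = sorted(blocks, key=lambda b: b.top)   # simple top-to-bottom flow
    merged = []
    for b in ordered:
        if merged and merged[-1].label == b.label == "paragraph":
            continue                  # fold consecutive paragraphs together
        merged.append(b)
    return [b.label for b in merged]
```

Given a title at the top and two paragraphs below it, the function returns ["title", "paragraph"]: the title comes first and the two paragraphs are represented by one merged block.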
There are algorithms for Web page segmentation that include aspects of the Document Processing approach. The first to do so was VIPS [CYWM03]. Its segmentation model is based on the recursive geometric model proposed by Tang [TS94] for recognizing regions in a document image. VIPS itself focuses mainly on the content and geometric structure of a Web page. Although it does not explicitly include a logical structure, it understands the document by extracting the blocks and grouping them based on separators. Kohlschütter [KN08] relates the pixel concept to text elements in the Web page domain: the pixel is transferred to HTML text as character data (the atomic text portion), and an image region is translated to a sequence of atomic text portions (a block). They measure the density of each block and merge those below a threshold tolerance using the BlockFusion algorithm. [NWM09] uses segmentation, structure labeling, and text segmentation and labeling to define a random field that guides the extraction.
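Density-based fusion can be sketched as follows, in the spirit of BlockFusion [KN08] but simplified: the density formula (tokens per wrapped line), the tolerance and the fusion rule here are our own illustrative choices, not the exact definitions of the paper.

```python
# Simplified density-based block fusion: each block is a text string, its
# density approximates tokens per wrapped line, and adjacent blocks with
# similar density are fused. Assumed parameters: line_width, tolerance.

def text_density(text, line_width=80):
    tokens = text.split()
    lines = max(1, -(-len(text) // line_width))   # ceiling division
    return len(tokens) / lines

def fuse_blocks(blocks, tolerance=0.3):
    """blocks: list of text strings; returns the fused list."""
    fused = [blocks[0]]
    for block in blocks[1:]:
        d1, d2 = text_density(fused[-1]), text_density(block)
        if abs(d1 - d2) <= tolerance * max(d1, d2, 1):
            fused[-1] = fused[-1] + " " + block    # similar density: fuse
        else:
            fused.append(block)
    return fused
```

With this sketch, two long paragraphs of similar density fuse into one block, while a short navigation-like fragment such as "Home" stays separate because its density differs too much.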
The Web page segmentation algorithms presented in this section adapt methods from the document processing domain. Indeed, there is vast knowledge in this domain that can be exploited to define new methods and techniques adapted to Web pages. We think that adapting existing document processing methods and techniques to Web page segmentation allows increasing the quality of the results that can be obtained from the segmentation itself. Vision-based approaches, such as the VIPS algorithm, exploit this relationship. There, the classification and labeling of blocks are defined as posterior tasks, not included in the segmentation. In our work we also use document processing concepts, but within the bottom-up strategy (VIPS follows a top-down strategy). We are inspired by the classification method described in [LPH01]: every block corresponds to an item in a predefined taxonomy based on the most basic categories. For scanned documents these include paragraphs, titles and figures. In Web pages their equivalents are the HTML5 content categories: phrasing, heading, embedded.
Correctness measures in scanned document segmentation
The evaluation of image segmentation is a very well studied area. For instance, [CCR05] describe several metrics to measure the quality of a segmentation based on the distance between partitions of an image segmentation. However, this method is designed for general images, not specifically for scanned documents; its adaptation to Web pages is thus not straightforward. [LPH01] measure the accuracy of a segmentation by determining the label assigned to entities (paragraphs, titles, tables, etc.) by an image segmentation algorithm and their correspondence to a predefined taxonomy. This approach gets close to what the evaluation of Web segmentation needs, but it does not consider the content and the geometric aspect of a scanned document as part of the evaluation. [SKB08] present a vectorial score that identifies the common classes of segmentation errors using a ground truth of annotated scanned pages. We think that this approach covers all the needs of Web page segmentation (with some adaptations and modifications), since it considers the content, geometry and visual aspect of a scanned document in its evaluation model.
The correctness measures proposed by Shafait et al. [SKB08] evaluate to what extent a set of text lines is equal to those of the ground truth: which ones are missing, which lie inside the bounding boxes and which ones are horizontally merged. They define a text line as significant if the amount of foreground pixels in the line is greater than two threshold parameters, one relative and one absolute. They build a bipartite graph whose nodes represent, on one side, the text lines of the ground truth and, on the other, the text lines of the proposed segmentation. They assign a weight to the edges of the graph according to the significance of the ground truth vertices.
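The construction of that weighted bipartite graph can be sketched as below. This is a hedged simplification of [SKB08]: we approximate foreground pixels by bounding-box overlap, and the two thresholds (abs_thr, rel_thr) and the weight normalization are our own illustrative choices.

```python
# Sketch of the bipartite matching behind line-based correctness measures:
# ground-truth lines on one side, segmentation lines on the other, with a
# weighted edge whenever the overlap passes an absolute and a relative
# threshold (a stand-in for the significance test on foreground pixels).

def overlap(a, b):
    """Overlapping area between two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def match_lines(gt_lines, seg_lines, abs_thr=10, rel_thr=0.1):
    """Return weighted edges (gt_index, seg_index, normalized_weight)."""
    edges = []
    for i, g in enumerate(gt_lines):
        area = (g[2] - g[0]) * (g[3] - g[1])
        for j, s in enumerate(seg_lines):
            o = overlap(g, s)
            if o > abs_thr and o > rel_thr * area:   # both thresholds pass
                edges.append((i, j, o / area))       # weight by coverage
    return edges
```

Error classes then fall out of the graph structure: a ground-truth line with no edge is missed, one with several edges is split, and a segmentation line connected to several ground-truth lines is a horizontal merge.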
Fine-grained segmentation construction
The idea of the fine-grained segmentation is to find coherent blocks as small as possible. It serves as a starting point for the whole process by creating a first version of the block graph W'BoM and the geometric model GMBoM. The condition C that a DOM element must satisfy to be considered a block is that it does not belong to any of the following content categories: text, phrasing, embedded, interactive or form-associated elements. Appendix A has a complete description of the HTML5 content categories, elements and their exceptions. The value of the label (B:label) is the most inclusive content category of its elements (B:elements). For instance, if the block has one element whose content category is flow, the label of the block is the same. If the block is associated with two elements, one in the embedded category and the other in the heading category, the most inclusive category is flow. Figure 2.4 shows which content categories include other content categories.
The process begins at the leaves of the DOM tree and moves towards the W:root. If an element is found that meets the condition C defined above, the process stops for this branch. Figure 2.2 shows how an element is selected as a block. Element li is the first element that does not belong to the categories listed above, so it is marked as a block and the label flow is assigned. From the information obtained during this sub-process, a geometric model (cf. Section 1.3.1) and a first version of the block graph are built (cf. Section 1.3.1).
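The leaf-to-root walk can be sketched on a toy DOM. The Node class and its category attribute are our own stand-ins for a rendered DOM with HTML5 content categories; only the climbing logic mirrors the description above.

```python
# Sketch of the fine-grained pass: from each leaf, climb towards the root
# and mark the first ancestor NOT in the excluded content categories as a
# block (condition C); the branch is then done.

EXCLUDED = {"text", "phrasing", "embedded", "interactive", "form-associated"}

class Node:
    def __init__(self, tag, category, children=()):
        self.tag, self.category = tag, category
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    if not node.children:
        yield node
    for c in node.children:
        yield from leaves(c)

def fine_grained_blocks(root):
    blocks = []
    for leaf in leaves(root):
        n = leaf
        while n is not None and n.category in EXCLUDED:
            n = n.parent                 # climb until condition C holds
        if n is not None and n not in blocks:
            blocks.append(n)             # first qualifying ancestor = block
    return blocks
```

For a text leaf inside a span (phrasing) inside an li (flow), the walk skips the text node and the span and marks the li as the block, matching the li example of Figure 2.2.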
Absolute representation of a segmentation
Each block B is associated with its rectangle (B:rect), its label (B:label) and its weight (B:weight) (cf. Section 1.3.2). To each B we add three values: the number of elements it covers (B:ec), the text associated with the block (B:text) in the original page W, and its importance (B:importance). Note that B:ec = |B:elements|.
The importance of a block depends on the area covered by its rectangle; Section 3.2.3 explains how it is computed. An absolute segmentation of the rendered DOM W, using the algorithm A and a set of stop conditions SC, is defined by the following function (cf. Section 1.3.2): A(W, SC) → (W'A, GMA).
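One possible data layout for this absolute representation is sketched below. The field names mirror the B:* notation above; the classes themselves (Rect, Block) and their exact types are our own assumptions, not part of the model definition.

```python
# Possible record layout for a block in the absolute representation:
# rectangle, label, weight, covered elements, text and importance.

from dataclasses import dataclass, field

@dataclass
class Rect:
    x: int
    y: int
    w: int
    h: int

    def area(self):
        return self.w * self.h        # area used to compute importance

@dataclass
class Block:
    rect: Rect                         # B:rect
    label: str                         # B:label
    weight: float                      # B:weight
    elements: list = field(default_factory=list)   # B:elements
    text: str = ""                     # B:text
    importance: int = 0                # B:importance

    @property
    def ec(self):
        return len(self.elements)      # B:ec = |B:elements|
```

A block covering two elements then reports ec = 2 automatically, keeping the invariant B:ec = |B:elements| by construction.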
Table of contents:
List of Figures
List of Tables
1 Web Page Segmentation and Evaluation
1.1.1 Web applications
1.1.3 Rendered DOM
1.1.4 Element positioning
1.2 Web Page Characteristics
1.2.1 Web page characteristics from the rendered DOM
1.2.2 Characteristics related to the website
1.3 Web page segmentation
1.3.3 Top-down versus bottom-up
1.3.4 Basic Approaches
1.3.5 Hybrid Approaches
1.3.6 Conclusion on Web page segmentation algorithms
1.3.7 Document processing and Web page segmentation
1.3.8 Summary Table
1.4 Segmentation evaluation
1.4.1 Classification of evaluation methods
1.4.2 Segmentation correctness evaluation
1.4.3 Correctness measures in scanned document segmentation
1.4.4 State of the art on evaluating Web page segmentation
1.4.5 Summary table
2 Block-o-Matic (BoM): a New Web Page Segmenter
2.3 Fine-grained segmentation construction
2.4 Composite block and flow detection
2.5 Merging blocks
3 Segmentation evaluation model
3.1 Model adaptation
3.2 Representation of segmentation
3.2.1 Absolute representation of a segmentation
3.2.2 Normalized Segmentation Representation
3.2.3 Block importance
3.3 Representation of the evaluation
3.3.1 Measuring text coverage
3.3.2 Measuring block correspondence
3.4.1 Computing the importance
3.4.2 Computing text coverage
3.4.3 Computing block correspondence
4.2 Block descriptors
4.3 Tested segmentation algorithms
4.3.1 BF (BlockFusion)
4.3.2 BoM (Block-o-Matic)
4.3.3 VIPS (Vision-based Web Page Segmentation)
4.3.4 jVIPS (Java VIPS)
4.4 Dataset construction
4.4.1 Dataset organization
4.4.2 Ground truth construction
4.5 Experiments and results
4.5.1 Setting the stop condition parameters
4.5.2 Setting the thresholds
4.5.3 Computing block correspondence
4.5.4 Computing text coverage
5.1.1 How does it work?
5.1.3 Practical application
5.1.4 Perspectives and outlook
5.2 Block-based migration of HTML4 standard to HTML5 standard
5.2.2 Proposed solution
5.2.5 Perspectives and outlook
A HTML5 Content Categories
B Semantic HTML5 elements
C Web page segmentation evaluation metrics
C.1 Adjusted Rand Index
C.2 Normalized Mutual Information
C.3 Dunn Index
C.4 Nested Earth Mover’s Distance
C.5 Precision, Recall and F1 score
D Web Segmentation approaches details