The DOM structure provides useful information about the components of a web page,
but it is complex and mixes logical components with layout components.
In Figure 3.18, for example, the content of the article is buried in a table cell (TD tag) in a row (TR tag) of an HTML table (TABLE tag).
In this case, the table is used to control layout rather than to represent semantically related data.
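
To make the nesting concrete, the following is a minimal sketch that records the chain of open tags above each run of text and reports the chain for the longest run. It uses Python's standard html.parser module. The sample page, the class name TextPathFinder, and the function longest_text_path are invented for illustration, and treating the longest text run as the article is only a rough heuristic, not a method described in the text.

```python
from html.parser import HTMLParser


class TextPathFinder(HTMLParser):
    """Record the chain of open tags above each non-empty run of text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.runs = []   # (text, tag path) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((text, tuple(self.stack)))


def longest_text_path(html):
    """Return the longest text run and the tag path that encloses it."""
    finder = TextPathFinder()
    finder.feed(html)
    finder.close()
    return max(finder.runs, key=lambda run: len(run[0]))


# Hypothetical page: a layout table holding a navigation row and an article row.
PAGE = """<html><body><table>
<tr><td>Home | News | Sports</td></tr>
<tr><td>The article text is the longest run of words on this page.</td></tr>
</table></body></html>"""

text, path = longest_text_path(PAGE)
print(" > ".join(path))  # html > body > table > tr > td
```

The printed path shows the article text sitting five levels down, inside tags that say nothing about its role as content.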
Another approach to identifying the content blocks in a page focuses on the layout and presentation of the web page.
In other words, visual features—such as the position of the block, the size of the font used, the background and font colors,
and the presence of separators (such as lines and spaces)—are used to define blocks of information that would be apparent to the user in the displayed web page.
Yu et al. (2003) describe an algorithm that constructs a hierarchy of visual blocks from the DOM tree and visual features.
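
As a rough illustration of the idea, and not the algorithm of Yu et al., the sketch below builds a small hierarchy of blocks using only one visual feature: vertical white space acting as a separator. It assumes each leaf block already has a rendered position (top and bottom pixel coordinates), which in practice would have to come from a layout engine. The Block class, the function names, the sample coordinates, and the gap thresholds are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    top: int      # y coordinate of the upper edge, in pixels
    bottom: int   # y coordinate of the lower edge, in pixels
    label: str = "group"
    children: list = field(default_factory=list)


def split_on_gaps(blocks, min_gap):
    """Split top-sorted blocks into groups wherever the gap is >= min_gap."""
    groups = [[blocks[0]]]
    for prev, cur in zip(blocks, blocks[1:]):
        if cur.top - prev.bottom >= min_gap:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups


def build_hierarchy(blocks, thresholds):
    """Nest blocks by separator width; thresholds run from widest to narrowest."""
    blocks = sorted(blocks, key=lambda b: b.top)
    if len(blocks) == 1:
        return blocks[0]
    children = blocks  # default: no separator is wide enough to split further
    for i, min_gap in enumerate(thresholds):
        groups = split_on_gaps(blocks, min_gap)
        if len(groups) > 1:
            children = [build_hierarchy(g, thresholds[i + 1:]) for g in groups]
            break
    return Block(blocks[0].top, max(b.bottom for b in blocks), "group", children)


def outline(block, depth=0):
    """Return the hierarchy as indented lines of text."""
    lines = ["  " * depth + f"{block.label} [{block.top}-{block.bottom}]"]
    for child in block.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical leaf blocks with made-up pixel positions.
LEAVES = [
    Block(0, 40, "header"),
    Block(45, 70, "nav"),
    Block(120, 150, "headline"),
    Block(155, 600, "body"),
    Block(660, 690, "footer"),
]

root = build_hierarchy(LEAVES, thresholds=[40, 10])
print("\n".join(outline(root)))
```

With these values, the wide gaps separate the page into a header group, an article group, and the footer, while the narrow gaps inside each group are too small to split it further. A fuller treatment would also use the DOM tree, horizontal position, font size, and colors, as the text describes.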