In this example, the text in the blue circle contains a
hyperlink. Traditional DOM tree methods use the text-link
ratio (the ratio of the length of the text in a node to the
length of the hyperlink text in that node) to judge whether
a node is a text node. As a result, the text in the blue
circle is always treated as a pure link that carries no
content, and it is wrongly discarded as well.
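
To make the failure mode concrete, the following is a minimal sketch of a text-link ratio test. The names `LinkTextCounter`, `text_link_ratio`, and `is_text_node`, and the threshold of 2.0, are illustrative assumptions and are not taken from any particular extraction system. The sketch only shows that a sentence consisting entirely of anchor text gets a ratio of 1 and is therefore rejected.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts all text characters and those that sit inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while the parser is inside an <a> element
        self.text_len = 0      # length of all text in the fragment
        self.link_len = 0      # length of the hyperlink text only

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_len += n
        if self.anchor_depth > 0:
            self.link_len += n


def text_link_ratio(html_fragment):
    """Return len(all text) / len(hyperlink text) for one node's HTML."""
    counter = LinkTextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.link_len == 0:
        return float("inf")  # no links at all: clearly a text node
    return counter.text_len / counter.link_len


def is_text_node(html_fragment, threshold=2.0):
    """Hypothetical decision rule: keep nodes whose ratio reaches the threshold."""
    return text_link_ratio(html_fragment) >= threshold


# A full sentence that happens to be one hyperlink, as in the blue circle.
linked_sentence = '<p><a href="#">Read the full report here</a></p>'
# An ordinary paragraph without links.
plain_sentence = "<p>Plain text only</p>"

print(text_link_ratio(linked_sentence), is_text_node(linked_sentence))
print(text_link_ratio(plain_sentence), is_text_node(plain_sentence))
```

Under this assumed rule the linked sentence is dropped regardless of how meaningful it is, because its ratio cannot exceed 1.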
In our method, we use the VIPS algorithm to overcome this
problem and improve the performance of webpage content
extraction. Because VIPS divides the webpage into semantic
blocks, it provides a global view of the webpage together
with the position information of each block. To recall the
sentences that were wrongly discarded, we keep the DOM tree
node tags when extracting the content with the traditional
method. The steps are as follows.
1. Use VIPS to divide the webpage into several blocks,
and keep the coordinate information of each block
and the node tags in each block.
2. Use the traditional method to extract the content of
the webpage, and keep the HTML tag information of
each content node.
3. Use the coordinate information of each block to
determine which blocks are content blocks.
4. Map the extracted content node tag sequence to the
content blocks according to the node tags and the
content itself. If a node tag in a content block
does not appear in the extracted content node tag
sequence, we recall that node and the text in it.
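
Steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the `Node` and `Block` classes stand in for the output of VIPS and of the traditional extractor, the tag path strings are hypothetical node identifiers, and the rule in `is_content_block` (a wide block whose horizontal center lies in the middle of the page) is an assumed placeholder, since the text above does not fix how coordinates decide content blocks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    tag_path: str  # hypothetical identifier, e.g. "/html/body/div[2]/p[1]"
    text: str


@dataclass
class Block:
    x: float       # left edge of the block on the rendered page
    y: float       # top edge of the block
    width: float
    height: float
    nodes: list = field(default_factory=list)  # DOM nodes inside the block


def is_content_block(block, page_width, min_width_share=0.4):
    """Assumed rule for step 3: wide blocks centered on the page are content."""
    center = block.x + block.width / 2
    centered = 0.25 * page_width <= center <= 0.75 * page_width
    wide = block.width >= min_width_share * page_width
    return centered and wide


def recall_missing_nodes(blocks, extracted, page_width):
    """Step 4: return nodes in content blocks that the extractor dropped."""
    # Match on both the node tag and the content itself.
    extracted_keys = {(n.tag_path, n.text) for n in extracted}
    recalled = []
    for block in blocks:
        if not is_content_block(block, page_width):
            continue
        for node in block.nodes:
            if (node.tag_path, node.text) not in extracted_keys:
                recalled.append(node)
    return recalled


# Toy page: a narrow sidebar and a wide main column (made-up coordinates).
nav = Node("/html/body/div[1]/a[1]", "Home")
p1 = Node("/html/body/div[2]/p[1]", "First paragraph of the article.")
p2 = Node("/html/body/div[2]/p[2]", "A sentence that is entirely a link.")
p3 = Node("/html/body/div[2]/p[3]", "Last paragraph of the article.")

blocks = [
    Block(x=0, y=0, width=200, height=800, nodes=[nav]),
    Block(x=200, y=0, width=600, height=800, nodes=[p1, p2, p3]),
]
extracted = [p1, p3]  # the traditional method discarded p2 as a pure link

recalled = recall_missing_nodes(blocks, extracted, page_width=1000)
print([n.text for n in recalled])
```

In this toy page the sidebar link is not recalled because its block is not a content block, while the discarded sentence in the main column is.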