ABSTRACT: In this paper, we present a method that uses a webpage
segmentation algorithm to improve the performance of
webpage content extraction. Traditional methods often
depend on parsing the DOM tree of the webpage and judging
each node to determine whether it is a text node. This
approach has a potential problem: because each node is judged
locally, it sometimes discards part of the content. Our
method, which is based on the VIPS (Vision-based Page
Segmentation) algorithm, addresses this problem. It extracts
content according to the coordinate information of each
visual block and helps the traditional method recall the
lost part of the content.
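
As an illustration only, the coordinate-based recall idea can be sketched in a few lines of Python. This is not the implementation described in the paper. It assumes that a segmentation step (such as VIPS) has already produced rectangular blocks, and that a DOM-based extractor has already marked each text node as kept or rejected. The names Box, TextNode, and recall_lost_nodes, and the rule "a block counts as content if it contains at least one kept node", are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (
            self.x <= other.x
            and self.y <= other.y
            and other.x + other.width <= self.x + self.width
            and other.y + other.height <= self.y + self.height
        )


@dataclass(frozen=True)
class TextNode:
    text: str
    box: Box
    selected: bool  # assumed output of a DOM-based, node-by-node extractor


def recall_lost_nodes(blocks, nodes):
    """Keep every selected node, plus rejected nodes sharing a block with one.

    Assumption of this sketch: a visual block is treated as a content block
    if it fully contains at least one node the DOM-based pass selected.
    """
    content_blocks = [
        block
        for block in blocks
        if any(node.selected and block.contains(node.box) for node in nodes)
    ]
    return [
        node
        for node in nodes
        if node.selected
        or any(block.contains(node.box) for block in content_blocks)
    ]


# Usage: a header block and a main block; the short caption was rejected
# by the local judgement but sits in the same visual block as the article text.
blocks = [Box(0, 0, 800, 100), Box(0, 100, 600, 500)]
nodes = [
    TextNode("Home | About", Box(10, 10, 200, 20), selected=False),
    TextNode("Long article paragraph", Box(10, 120, 500, 200), selected=True),
    TextNode("Fig. 1", Box(10, 330, 80, 20), selected=False),
]
recovered = [node.text for node in recall_lost_nodes(blocks, nodes)]
```

In this toy example the caption is recalled because it shares a visual block with text the DOM-based pass accepted, while the navigation text in the header block stays excluded.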