Finn et al. (2001) describe a relatively simple technique based on the observation that there are fewer HTML tags in the text of the main content of typical web pages than there are in the additional material.
Figure 3.17 (also known as a document slope curve) shows the cumulative distribution of tags in the example web page from Figure 3.16,
as a function of the total number of tokens (words or other non-tag strings) in the page.
The main text content of the page corresponds to the “plateau” in the middle of the distribution.
This flat area is relatively small because of the large amount of formatting and presentation information in the HTML source for the page.
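
As a rough illustration of the curve described above, the following Python sketch computes the cumulative tag count at each token position and then looks for the longest flat stretch. It is a simplified sketch under stated assumptions, not the procedure from Finn et al.: the regular-expression tokenizer and the function names `document_slope_curve` and `longest_plateau` are hypothetical, a "plateau" is taken to mean a run of tokens containing no tags at all, and the tokenizer does not handle comments, scripts, or attribute values that contain `>`.

```python
import re

# Assumption: a token is either an HTML tag (<...>) or a run of
# non-whitespace text between tags. Real HTML needs a proper parser.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def document_slope_curve(html):
    """Return a list whose i-th entry is the number of tags seen
    among the first i + 1 tokens of the page."""
    curve = []
    tag_count = 0
    for match in TOKEN_RE.finditer(html):
        if match.group().startswith("<"):
            tag_count += 1
        curve.append(tag_count)
    return curve


def longest_plateau(curve):
    """Return (start, end) token indices, end exclusive, of the
    longest run of consecutive non-tag tokens (a flat stretch)."""
    best = (0, 0)
    start = None
    prev = 0
    for i, count in enumerate(curve):
        if count == prev:  # the curve did not rise, so token i is text
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:  # token i is a tag, which ends any current flat run
            start = None
        prev = count
    return best


html = (
    "<html><body><div><a href='x'>Home</a></div>"
    "<p>The quick brown fox jumps over the lazy dog</p>"
    "<div><a>About</a></div></body></html>"
)
curve = document_slope_curve(html)
start, end = longest_plateau(curve)
print(len(curve), curve[-1], (start, end))
```

In this toy page the flat stretch covers the nine words of the paragraph, while the navigation links before and after it sit on the steep parts of the curve. Real pages contain some tags inside the main text, so a practical method would have to tolerate a low but nonzero slope rather than require a perfectly flat run.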