describe an approach that navigates the DOM tree recursively,
applying a variety of filters that remove or modify nodes in the tree so that only the content remains.
HTML elements such as images and scripts are removed by simple filters.
More complex filters remove advertisements,
lists of links, and tables that do not have “substantive” content.
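
The recursive filtering described above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the tag set, the helper names (`filter_tree`, `is_link_list`, `is_empty_table`, `remove_keep_tail`), and the two thresholds are assumptions chosen for illustration, and the input is assumed to be well-formed markup so that Python's standard `xml.etree.ElementTree` parser can build the tree. Advertisement filtering is omitted because the text does not say how advertisements are detected.

```python
import xml.etree.ElementTree as ET

# Tags dropped outright by the "simple" filters (illustrative choice).
SIMPLE_REMOVE = {"script", "style", "img"}
# Hypothetical thresholds; the source text does not give any values.
MAX_LINK_RATIO = 0.5    # share of list text allowed to sit inside links
MIN_TABLE_CHARS = 40    # tables with less text count as non-substantive


def text_len(node):
    """Length of all text under node, with outer whitespace stripped."""
    return len("".join(node.itertext()).strip())


def link_text_len(node):
    """Length of the text that sits inside <a> elements under node."""
    return sum(text_len(a) for a in node.iter("a"))


def is_link_list(node):
    """True for a <ul>/<ol> whose text is mostly link text."""
    if node.tag not in ("ul", "ol"):
        return False
    total = text_len(node)
    return total > 0 and link_text_len(node) / total > MAX_LINK_RATIO


def is_empty_table(node):
    """True for a <table> holding too little text to count as content."""
    return node.tag == "table" and text_len(node) < MIN_TABLE_CHARS


def remove_keep_tail(parent, child):
    """Remove child but keep the text that follows it (its tail)."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx > 0:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def filter_tree(node):
    """Walk the tree recursively, removing filtered children in place."""
    for child in list(node):  # copy: the child list changes during the loop
        if (child.tag in SIMPLE_REMOVE
                or is_link_list(child)
                or is_empty_table(child)):
            remove_keep_tail(node, child)
        else:
            filter_tree(child)


PAGE = """<html><body>
<script>var x = 1;</script>
<p>Main article text about DOM filtering.</p>
<img src="banner.png"/>
<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>
<table><tr><td>ad</td></tr></table>
</body></html>"""

root = ET.fromstring(PAGE)
filter_tree(root)
remaining = " ".join("".join(root.itertext()).split())
print(remaining)
```

In this sketch the simple filters match on tag name alone, whereas the more complex ones inspect the subtree (link-text ratio, amount of text) before deciding. Real pages are rarely well-formed, so a practical version would need a tolerant HTML parser instead of an XML one.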