We implement the webpage classification algorithm by
combining the three techniques described previously: 1)
Segmenting Visual Boundaries, 2) Breadth First Search, and 3)
Ontology. First, we identify the visual boundaries
of HTML tags using information provided by the browser
rendering engine. We parse the HTML page and traverse it
using the Breadth First Search algorithm. If a level
of the tree contains at least five HTML tags with sufficiently
large visual boundaries (e.g. an area of more than 500), we
take these HTML tags as regions. Once the segmentation
is done, we tokenize the TextNodes into words. We then
select the first two regions, merge them, and group
identical words together. When a word matches another, the
first word forms a cluster of size one.
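
The segmentation and merging steps can be illustrated with a minimal sketch. The Node class, the function names, and the choice to stop at the first qualifying tree level are assumptions made for illustration; the text above fixes only the thresholds (at least five tags, area of more than 500), and it does not state the unit of the area.

```python
import re
from collections import Counter, deque
from dataclasses import dataclass, field

MIN_TAGS = 5    # at least five tags on one level (from the text)
MIN_AREA = 500  # area threshold (from the text; unit not stated)


@dataclass
class Node:
    """Hypothetical DOM node carrying the area reported by the renderer."""
    tag: str
    area: float = 0.0
    text: str = ""
    children: list = field(default_factory=list)


def segment_regions(root):
    """Walk the tree level by level (breadth first) and return the large
    tags of the first level that has at least MIN_TAGS of them.
    Stopping at the first such level is an assumption."""
    level = [root]
    while level:
        large = [n for n in level if n.area > MIN_AREA]
        if len(large) >= MIN_TAGS:
            return large
        level = [child for n in level for child in n.children]
    return []


def region_words(region):
    """Collect lowercase word tokens from all text inside a region."""
    words = []
    queue = deque([region])
    while queue:
        node = queue.popleft()
        words.extend(re.findall(r"[a-z]+", node.text.lower()))
        queue.extend(node.children)
    return words


def merge_first_two(regions):
    """Merge the first two regions and group identical words;
    each distinct word becomes a cluster with an occurrence counter."""
    return Counter(region_words(regions[0]) + region_words(regions[1]))


# Toy page: five sufficiently large <div> tags on the same level.
texts = ["Cheap cars", "Cars for sale", "Contact", "About", "News"]
page = Node("html", 5000, children=[
    Node("body", 4000, children=[Node("div", 600, t) for t in texts])
])
regions = segment_regions(page)
clusters = merge_first_two(regions)
```

In this toy page the five div tags are taken as regions, and merging the first two groups the two occurrences of "cars" into one cluster.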
After the first two regions have been segmented and merged,
we tokenize the TextNodes of each of the remaining
regions and obtain the root word of each token. For
example, the root word of “oxen” is “ox”, the root word
of “fishes” is “fish”, and so on. After that, we measure
the semantic similarity of each word in the remaining
regions with the words in the merged region using Lin’s
algorithm. If a pair of words obtains a semantic similarity
score of more than 0.7 on a scale of 0.0 to 1.0, the
word is added to the corresponding cluster. The
counter of that cluster is increased by one each
time a match is found. A pair of words that scores
less than 0.7 is ignored. Finally, we obtain
a list of clusters, each with its own words. We then
compare these keywords with the predefined keywords.
The keyword with the closest match is
taken as the label for that page.
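
The clustering and labelling steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation used here: toy_similarity is a hypothetical hand-made score table standing in for Lin's measure, the input words are assumed to be reduced to root words already, and scoring a label by the counter-weighted similarity to each cluster is only one possible reading of "closest match".

```python
from collections import Counter

THRESHOLD = 0.7  # similarity cut-off on a 0.0 to 1.0 scale (from the text)

# Hypothetical scores standing in for Lin's semantic similarity.
TOY_SCORES = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "vehicle")): 0.8,
    frozenset(("car", "banana")): 0.1,
}


def toy_similarity(a, b):
    """Stand-in similarity: 1.0 for identical words, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


def update_clusters(clusters, words, similarity, threshold=THRESHOLD):
    """Increase a cluster counter by one for every remaining-region word
    whose similarity to the cluster word exceeds the threshold."""
    for word in words:
        for key in clusters:
            if similarity(key, word) > threshold:
                clusters[key] += 1  # only values change, keys stay fixed
    return clusters


def pick_label(clusters, labels, similarity):
    """Return the predefined keyword closest to the clusters.
    Closeness here is counter-weighted similarity (an assumption)."""
    def score(label):
        return sum(n * similarity(label, w) for w, n in clusters.items())
    return max(labels, key=score)


# Clusters from the merged region, then words from the remaining regions.
clusters = Counter({"car": 2, "price": 1})
update_clusters(clusters, ["automobile", "banana", "car"], toy_similarity)
label = pick_label(clusters, ["vehicle", "fruit"], toy_similarity)
```

Here "automobile" and "car" both exceed the threshold against the "car" cluster, so its counter grows from 2 to 4, while "banana" is ignored. The page is then labelled "vehicle", the predefined keyword with the highest score.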