Compare Parse Trees
Besides the parser being language dependent, the comparison process is also
language dependent. Therefore, checking a new language requires a new set of
programs because different programming languages have their own grammar
structure and characteristics. This section introduces the approach that can be
generally used in checking for plagiarism with parse trees.
After all programs have been converted into parse trees, all parse trees will
be compared with one another in pairs. The following steps will be performed to
check the similarity of a pair of parse trees:
Step 1: Break down the parse tree into the sub-trees and classify those sub-trees
into different groups according to their types (e.g. methods and
variables). Each sub-tree stands for one type of programs in a program
segment.
Step 2: If the sub-tree consists of other structures, then repeat step 1. Otherwise, if all nodes in two sub-trees are of the same type or same group of
222 S. C. Ng, S. O. Choy and R. Kwan
program structures, they will be compared with each other. A score will
be given for the similarity of a pair of sub-trees.
Step 3: A matrix of similarity for the members in two groups will then be
formed. The next step is to find the most similar pair of members and
calculate the similarity using the weights for the group.
Step 4: Sum up the scores for all sub-trees. The final score, which indicates
how similar the two sub-trees are, is displayed.
After making comparisons of a set of programs, the result is produced in
a tabular form, as shown in Figure 2. From Figure 2, it can be shown that
each program will be compared with every other program in the program
pool. For example, the similarity between the LogonMenu.java and
Registration.java programs is 18.94%, and the similarity between the
LogonMenu.java and ShopHistory.java programs is 30.36%.
In addition to our new algorithm, some enhancements on flexibility and
accuracy in detecting plagiarism are also introduced in our system. The
following sections describe these enhancements.