Previous Plagiarism Detection Methods
Many systems for detecting plagiarism have been built since the 1970s. This
section summarizes some of the well-known systems.
The earliest system, counting-metric system (Berghel and Sallach 1984;
Faidhi and Robinson 1987), detects plagiarism based on the feature vector. First,
each program is converted into a number of software metrics and then mapped
into an n-dimensional Cartesian space. After converting all programs into
metrics, the system calculates the distance between points. Two programs
having very similar points will be considered as being plagiarized. However, this
method for detecting plagiarism can be easily thwarted by simple program
transformations. Also, no matter how many dimensions are used in the Cartesian
space, converting a program into system-metrics will lose much structural
information.
More accurate and reliable systems were developed after more powerful
computers were produced. The new methodology relies on the structural
comparison rather than just looking at the frequency at which keywords appear
in the program. Some systems adopted hybrids between structure and metric
comparison. The well-known structural comparison systems include YAP3 (Wise
1993; Wise 1996), MOSS (Aiken 1997) and JPlag (Malpohl 2002). YAP3 and
JPlag use the same basic comparison algorithm ó the ìGreedy String Tilingî
algorithm. JPlag applies a set of optimizations for improving the overall
efficiency of the system. MOSS, however, adopts a slightly different approach to
check for plagiarism. MOSS maintains a database that stores an internal
representation of programs and then looks for similarities between them.
The basic approach of these systems is to convert the program into a stream
of tokens and then compare these token streams to find common segments.
However, a program consists of both structure and data, where the data refer to
expressions in the decision statements or values of the variables and formal
parameters. Although the conventional approach can capture and analyze the
structure of the programs to determine the similarity of the programs, it ignores
the ëdataí ingredient of the program. Ignoring the structure or data of the
program may consequently misinterpret the meaning of the program.
In this chapter, our new approach uses the parse tree to detect plagiarism in a
pair of programs. Parse tree checking is also a kind of structural comparison.
Using parse tree not only explicitly reflects the structure of the program but also
220 S. C. Ng, S. O. Choy and R. Kwan
stores and examines the program data on the parse tree. Checking plagiarism
using parse tree is a more accurate and more suitable approach as both the
program structure and data are considered.