Performance
We implemented the TVSM on a PostgreSQL version
7.2 relational database. To improve performance, only
entries with a scalar-product larger than the scalar-threshold
of 0.5 are stored in the “scalarproduct” table (this
is equivalent to setting all scalar-products that do not exceed the
scalar-threshold to zero). For our tests, we used
7184 news documents from the German Heise-Ticker
website. From these documents, 96887 terms were extracted
and stored in the “term” table. From
this data, term-weights and term-angles were
derived as described in section 2.5 (subject to the
scalar-threshold). The “scalarproduct” table
contained 97509 entries. The calculation of the similarity
between a general document (having 164 different terms)
and all 7184 documents (including sorting by descending
similarity) took approximately five seconds on our
generic PC (Athlon XP 1600+ processor with 768 MByte
RAM and the FreeBSD operating system). Initial performance
tests showed that the calculation speed depends strongly on
the number of entries in the “scalarproduct” table and
only slightly on the number of terms or
documents. This means that the scalar-threshold is a suitable
parameter for trading calculation speed against the quality
of the similarity calculation.
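
The effect of the scalar-threshold can be illustrated with a minimal in-memory sketch. It is not the SQL implementation described above: the dictionary `table` merely stands in for the “scalarproduct” table, documents are assumed to be given as term-to-count mappings, and all function names, term names and scalar-product values are hypothetical. Term pairs that are missing from the table contribute zero, which is why the work per comparison is governed by the number of stored entries.

```python
from math import sqrt

SCALAR_THRESHOLD = 0.5  # threshold value reported in the text


def build_scalarproduct_table(products, threshold=SCALAR_THRESHOLD):
    """Keep only term pairs whose scalar product is larger than the threshold."""
    return {pair: value for pair, value in products.items() if value > threshold}


def dot(doc_a, doc_b, table):
    """Scalar product of two documents given as {term: occurrence count}.

    Pairs that are not stored in the table are treated as zero.
    """
    total = 0.0
    for term_a, count_a in doc_a.items():
        for term_b, count_b in doc_b.items():
            # Assumption: each unordered pair is stored once, in sorted order.
            key = (term_a, term_b) if term_a <= term_b else (term_b, term_a)
            total += count_a * count_b * table.get(key, 0.0)
    return total


def similarity(doc_a, doc_b, table):
    """Cosine of the angle between the two document vectors."""
    norm = sqrt(dot(doc_a, doc_a, table) * dot(doc_b, doc_b, table))
    return dot(doc_a, doc_b, table) / norm if norm else 0.0


# Hypothetical scalar products between three terms (illustrative values only).
products = {
    ("auto", "auto"): 1.0, ("car", "car"): 1.0, ("tree", "tree"): 1.0,
    ("auto", "car"): 0.9, ("auto", "tree"): 0.1, ("car", "tree"): 0.1,
}
table = build_scalarproduct_table(products)

sim_related = similarity({"car": 1}, {"auto": 1}, table)
sim_unrelated = similarity({"car": 1}, {"tree": 1}, table)
```

Raising the threshold shrinks `table` and therefore the number of non-zero contributions, at the cost of treating weakly related terms such as “car” and “tree” as unrelated.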
4. Comparison with other vector-based approaches
Both the Vector Space Model (VSM) [Salton 1968;
Baeza-Yates 1999, pp. 27-30] and the TVSM assign a
document-vector to each document. In contrast to the
TVSM, the VSM assumes that all terms are independent of
(orthogonal to) each other. This leads to relatively high
performance. However, the assumption of orthogonal terms is
incorrect for natural languages, which causes
problems with synonyms or strongly related terms. To
reduce these problems, messages are usually passed
through a stopword-list and through stemming- and thesaurus-algorithms
before they are forwarded to the VSM. This
relaxes the assumption of term independence only
partially, because two terms can only be treated as either
equivalent or not equivalent. Similarity levels between
these two extremes are not possible. From a theoretical
point of view, the TVSM has the advantage of not
assuming independence of terms, which allows a full
integration of stopword-list, stemming and thesaurus into
the model. Similarity between terms can be defined
gradually from “not equivalent” (term-angle: 90°) to
“equivalent” (term-angle: 0°).
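
The difference between the two extremes and a graded term-angle can be made concrete with a small sketch. It assumes unit-length term vectors (term-weight 1), documents given as term-to-count mappings, and a hypothetical term-angle of 20° between “car” and “automobile”; none of these values come from the experiments above.

```python
from math import cos, radians, sqrt


def doc_similarity(doc_a, doc_b, angle_deg):
    """Cosine similarity of two documents given as {term: occurrence count}.

    angle_deg(t1, t2) returns the term-angle in degrees; term vectors are
    assumed to have unit length, so their scalar product is cos(angle).
    """
    def dot(x, y):
        return sum(cx * cy * cos(radians(angle_deg(tx, ty)))
                   for tx, cx in x.items() for ty, cy in y.items())

    return dot(doc_a, doc_b) / sqrt(dot(doc_a, doc_a) * dot(doc_b, doc_b))


def vsm_angle(t1, t2):
    """VSM-style assumption: distinct terms are always orthogonal."""
    return 0.0 if t1 == t2 else 90.0


# Hypothetical graded term-angle, chosen only for illustration.
HYPOTHETICAL_ANGLES = {frozenset(("car", "automobile")): 20.0}


def graded_angle(t1, t2):
    """Graded term-angles; unknown pairs default to 90 degrees."""
    if t1 == t2:
        return 0.0
    return HYPOTHETICAL_ANGLES.get(frozenset((t1, t2)), 90.0)


doc_a = {"car": 1, "engine": 1}
doc_b = {"automobile": 1, "engine": 1}

sim_orthogonal = doc_similarity(doc_a, doc_b, vsm_angle)
sim_graded = doc_similarity(doc_a, doc_b, graded_angle)
```

With orthogonal terms only the shared term “engine” contributes to the similarity, whereas the graded term-angle lets the near-synonyms contribute as well, so `sim_graded` is larger than `sim_orthogonal`.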
The Generalized Vector Space Model (GVSM) [Wong
1987; Baeza-Yates 1999, pp. 41-44] assigns a document-vector
to each document without the assumption of
orthogonal terms. In contrast to the TVSM, the GVSM
allows no flexibility regarding the computation of term-angles:
in the GVSM, term-angles are based on the
co-occurrence of terms. Because of this limitation,
messages have to be pre-processed in a similar way
as for the VSM: messages are passed through a
stopword-list and stemming-algorithms before they are
forwarded to the GVSM. In contrast to the GVSM, the
TVSM specifies only ideal properties of term-angles
(see section 2.4). Therefore, the TVSM allows more
flexibility regarding the calculation of term-angles. Term-angles
can be computed using different statistical methods