A lot of information stored in databases consists of sequences.Our query model is very simple: We assume that the user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence. Similarity search is different from `normal' queries in that we are not only interested in sequences that match the query sequence exactly, but also in sequences that differ only slightly from the query sequence.
A data sequence X is a series of numbers X = (x 1 .....x k ). Sometimes X is also called a time series. We call k the length of the sequence. A subsequence Z = ( z 1 ...z j ) is obtained from another sequence X = (x 1 ;:::;x k )by deleting numbers from the front and back of the sequence X. Formally, Z is a subsequence of X if z1 = x i ;z2 = x i 1 .... zj = z i j ? 1 for some .Given two sequences X =( x 1 ....x k ) and Y = ( y 1 .... yk ) , we can dene the Euclidean norm as the distance between the two sequences as follows:-
Similarity queries over sequences can be classified into two types.
Complete sequence matching: The query sequence and the sequences in the database have the same length. Given a user-specified threshold parameter ,our goal is to retrieve all sequences in the database that are within -distance to the query sequence.
Subsequence matching: The query sequence is shorter than the sequences in the database.
An Algorithm to Find Similar Sequences
Given a collection of data sequences, a query sequence, and a distance threshold how can we efficiently find all sequences that are within -distance from the query sequence?
One possibility is to scan the database, retrieve each data sequence, and compute its distance to the query sequence. Even though this algorithm is very simple, it always retrieves every data sequence.
Because we consider the complete sequence matching problem, all data sequences and the query sequence have the same length. We can think of this similarity search as a high-dimensional indexing problem. Each data sequence and the query sequence can be represented as a point in a k-dimensional space. Thus, if we insert all data sequences into a multidimensional index, we can retrieve data sequences that exactly match the query sequence by querying the index. But since we want to retrieve not only data sequences that match the query exactly, but also all sequences that are within -distance from the query sequence, we do not use a point query as defined by the query sequence. Instead, we query the index with a hyper-rectangle that has side-length 2 and the query sequence as center, and we retrieve all sequences that fall within this hyper-rectangle. We then discard sequences that are actually further than only a distance of away from the query sequence.
Using the index allows us to greatly reduce the number of sequences that we consider and decreases the time to evaluate the similarity query significantly. The references at the end of the chapter provide pointers to further improvements.