1. Content Detection: For images, this method means that the individual objects in
the image are detected, possibly segmented, and recognized; the image is then labeled with the detected objects. For music, this method may include recognizing
the instruments that are played as well as the words that are said/sung, and even
determining the artists. Of the three approaches, this is the furthest from being adequately realized and the one that involves the most signal processing.
2. Content Similarity Assessment: In this approach, we do not attempt to recognize
the content of the images (or audio clips). Instead, we attempt to find images
(audio tracks) that are similar to the query items. For example, the user may
provide an example image (or audio snippet) of the type of results they are interested in finding, and similar objects are returned based on low-level similarity measures such as (spatial) color histograms or audio frequency histograms; a small sketch of this idea follows the list. Systems such as these have often been used to find images of sunsets, blue skies, etc. [15], and have also been applied to the task of finding similar music genres [16].
3. Using Surrounding Textual Information: A common method of assigning
labels to non-textual objects is to use information that surrounds these objects in the documents in which they are found. When images are found in web documents, for example, there is a wealth of information that can be used as evidence of the image contents: the site on which the image appears (e.g., an adult site or a site about music groups, TV shows, etc.), how the image is referred to, the image's filename, and even the surrounding text all provide potentially relevant information about the image; a sketch of gathering such evidence also follows the list.
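The following is a minimal sketch of the similarity-based approach (item 2) for images, assuming Pillow and NumPy are available. The bin count and the use of histogram intersection are illustrative choices, not a description of the systems cited in [15] or [16].

```python
# Sketch: content similarity via coarse RGB color histograms.
# Bin count and the intersection measure are illustrative assumptions.
from PIL import Image
import numpy as np

def color_histogram(path, bins_per_channel=8):
    """Build a normalized RGB color histogram for one image."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

def rank_by_similarity(query_path, candidate_paths):
    """Order candidate images by low-level color similarity to the query."""
    query = color_histogram(query_path)
    scored = [(p, histogram_intersection(query, color_histogram(p)))
              for p in candidate_paths]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

An audio analogue would replace the color histogram with a frequency histogram computed over short windows of the signal; the ranking step stays the same.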
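A similarly minimal sketch of the third approach, using only the Python standard library: it collects the kinds of evidence mentioned above (site, filename, alt text, and page text) for each image in an HTML document. The record fields and the page-level text granularity are illustrative assumptions.

```python
# Sketch: gathering textual evidence for images found in a web page.
# Field names and the page-level text context are illustrative assumptions.
from html.parser import HTMLParser
from urllib.parse import urlparse

class ImageEvidenceParser(HTMLParser):
    """Collect each <img> tag's filename and alt text, plus the page's text."""
    def __init__(self):
        super().__init__()
        self.images = []      # one evidence record per <img> tag
        self.page_text = []   # surrounding text, kept at page granularity here

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("src") or ""
            self.images.append({
                "filename": urlparse(src).path.rsplit("/", 1)[-1],
                "alt_text": attrs.get("alt") or "",
            })

    def handle_data(self, data):
        if data.strip():
            self.page_text.append(data.strip())

def image_evidence(html, site_url):
    """Attach site and surrounding-text evidence to every image on the page."""
    parser = ImageEvidenceParser()
    parser.feed(html)
    context = " ".join(parser.page_text)
    return [{"site": urlparse(site_url).netloc, "surrounding_text": context, **img}
            for img in parser.images]
```

In practice the surrounding text would typically be restricted to a window around the image rather than the whole page, with the site and how the image is referred to used as additional evidence of the image contents.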