AN ARCHITECTURAL OVERVIEW
Our goal has been to design a dictionary application comprising a rich set of multimedia information elements and offering a high degree of user interactivity. In this section we discuss the architectural design, focusing on the multimedia data model. Network issues are only touched upon here; they are treated in detail in the next section, building on the preliminaries given in this one.
In the dictionary framework, we define a multimedia object (MO) as a composition of visual, audio, and textual data. Within a set of MOs, we distinguish composite objects (COs) and primitive objects (POs), where the former are built from the latter. Conceptually, we are not restricted to a flat structure in which each CO is merely a composition of a certain number of POs. Rather, we consider a hierarchical structure of COs, such that each CO may consist of both POs and other COs. The data can therefore be conveniently represented as a hierarchical tree, where each MO is a node; this is illustrated in Fig. 1. It is convenient to refer to a given MO within the tree as a parent MO with respect to the objects below it in the tree, which we call its child objects. In this view, the data structure is built upward from the set of POs (the leaves) and the COs on the lower hierarchy layers. Ultimately, each MO is physically represented as a collection of POs, but the underlying hierarchical structure of the data has to be respected. The data structure of concern is thus related to object-oriented modeling, where the relationship among the MOs corresponds to weak composition.
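The hierarchical MO model described above can be sketched in code as a variant of the composite pattern. The following Python fragment is a minimal illustration, not part of the actual system; all class names are hypothetical. It shows POs as leaves and COs as inner nodes, with each MO ultimately reducible to a collection of POs.

```python
class MultimediaObject:
    """Base class for a node (MO) in the dictionary tree."""
    def __init__(self, name):
        self.name = name
        self.parent = None  # set when composed into a CO

    def primitives(self):
        raise NotImplementedError


class PrimitiveObject(MultimediaObject):
    """A leaf (PO): visual data plus textual and audio descriptions."""
    def __init__(self, name, visual=None, text=None, audio=None):
        super().__init__(name)
        self.visual, self.text, self.audio = visual, text, audio

    def primitives(self):
        return [self]


class CompositeObject(MultimediaObject):
    """A CO built from POs and/or other COs (weak composition)."""
    def __init__(self, name, children=()):
        super().__init__(name)
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def primitives(self):
        # Each MO is physically a collection of POs (the leaves below it).
        return [po for child in self.children for po in child.primitives()]


# The example of Fig. 1: "computer" built from three COs and one PO,
# itself composed into the MO "LAN".
connector = PrimitiveObject("connector")
computer = CompositeObject("computer", [
    CompositeObject("motherboard"),
    CompositeObject("network interface card"),
    CompositeObject("hard disk drive"),
    connector,
])
lan = CompositeObject("LAN", [computer])

print([po.name for po in lan.primitives()])
```

The parent references recorded at composition time are what later allow upward as well as downward browsing along the tree.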
The benefit of the data structure explained above is manifold. For instance, a user might point to one particular MO within the tree, e.g., through textual querying, and obtain a representation of that MO. Subsequently, s/he might proceed either downward or upward along the tree to further browse for data of interest. We illustrate this concept with the example given in Fig. 1. In the top part, an exemplary visual representation of the term “computer” is depicted. With this term an MO is associated, which appears in the hierarchical tree at the bottom of Fig. 1. The MO “computer” is a composite built from three COs, “motherboard,” “network interface card,” and “hard disk drive,” and a single PO, “connector.” Furthermore, the MO “computer” is a component of the MO “LAN,” which appears one layer above in the hierarchical tree. Each edge in the tree represents a potential browsing direction for the user. Also, several instances of the MO “computer” might exist in the tree, related by an association relationship. It is noteworthy that data reusability is inherent in the scheme, as a large number of COs can be built from a finite number of POs.
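The browsing behavior just described can be made concrete with a small self-contained sketch. The tree below is a hypothetical, simplified encoding of the Fig. 1 example; from any node reached through a textual query, the user may descend to child objects or ascend to the parent.

```python
# Downward edges of the Fig. 1 example: parent name -> child names.
tree = {
    "LAN": ["computer"],
    "computer": ["motherboard", "network interface card",
                 "hard disk drive", "connector"],
}

# Invert the edges once to support upward browsing.
parent = {child: p for p, children in tree.items() for child in children}


def browse_down(node):
    """Child MOs of `node`; POs (leaves) have none."""
    return tree.get(node, [])


def browse_up(node):
    """Parent MO of `node`, or None at the root."""
    return parent.get(node)


# A textual query resolves to the MO "computer"; the user then browses.
print(browse_down("computer"))  # descend to the components
print(browse_up("computer"))    # ascend to the enclosing MO "LAN"
```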
Specifically, we build each PO from visual data accompanied by textual and audio descriptions of the given visual content. In principle, the audio may be considered the recorded speech of the given text. MOs are then created from these primitive entities and from previously created MOs. It is worthwhile to relate the above model to the concepts advocated in MPEG-4 [6]. Indeed, there is some resemblance to the MPEG-4 concept of encoding audio/visual objects by means of primitive audio/visual objects. Within the MPEG-4 framework, the basic element of semantic coding is the primitive audio/visual object (AVO), which may be of either natural or synthetic origin. Several AVOs can be composed to create an audio/visual scene; they are multiplexed and synchronized, and they also provide for interaction with the user [2]. Nevertheless, MPEG-4 is more general and offers additional features; for instance, we regard the text-to-speech and facial animation capabilities considered in [2] as being of particular interest in our context.
An interesting issue is the applicability of streamed media, i.e., audio and video streams, in the context of the multimedia dictionary. We consider multimedia streams quite valuable, particularly for language practice. One of the main issues here is the indexing of audio/video, which is expected to be addressed by the upcoming MPEG-7 standard. In our application we have considered the following setting: with each audio/video stream we associate a stream of textual data representing what is spoken in the given audio/video scene. Note that this textual data is not necessarily plain text, but might also embed links to other media; for instance, the authors of [8] associate links to Hypertext Markup Language documents.
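One simple way to realize this association is a list of timed text segments, each covering an interval of the stream and optionally embedding links to other media. The segment format and link scheme below are purely illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical timed-text track accompanying an audio/video stream.
# Each segment: playback interval, spoken text, and optional links
# from words to other media objects (cf. the HTML links of [8]).
segments = [
    {"start": 0.0, "end": 3.5,
     "text": "The computer boots from the hard disk drive.",
     "links": {"hard disk drive": "mo://hard-disk-drive"}},
    {"start": 3.5, "end": 7.0,
     "text": "It then joins the LAN.",
     "links": {"LAN": "mo://lan"}},
]


def text_at(t):
    """Return the segment spoken at playback time t (seconds), if any."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg
    return None


# When the user pauses playback at t = 5 s, the player can display
# the current text together with its embedded links.
seg = text_at(5.0)
print(seg["text"], seg["links"])
```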
This concept should facilitate the following two scenarios. First, the user might listen to/watch the audio/video and at some point pause the playback to examine the presented text. Potentially, s/he might look up in the dictionary a certain word appearing in the text. Subsequently, the user might either resume listening/watching or continue browsing through the dictionary. Second, an indexed audio/video sequence could be searched through textual querying; the user could then be presented with the audio/video sequence related to the text of concern. Note that the stream concept described above is particularly suitable for describing actions (verbs). An additional benefit is that authoring would not be limited to the recorded voice of the authors themselves, but could also draw on the vast body of existing recorded material (provided, of course, that copyright restrictions are respected). Later, when dealing with implementation aspects, we briefly comment on how this concept could be realized with existing technology.
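The second scenario, textual querying of an indexed sequence, amounts to searching the timed text and seeking playback to a matching interval. The sketch below assumes a simple substring match over hypothetical (start, end, text) segments; a real system would use proper full-text indexing.

```python
# Hypothetical index of an audio/video sequence: (start, end, text) tuples.
segments = [
    (0.0, 3.5, "The computer boots from the hard disk drive."),
    (3.5, 7.0, "It then joins the LAN."),
]


def find_times(query):
    """Start times (seconds) of all segments whose text contains `query`."""
    return [start for start, _end, text in segments
            if query.lower() in text.lower()]


# A textual query for "LAN" yields the playback positions to seek to.
print(find_times("LAN"))
```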