Some variables naturally take multiple values and are therefore inherently symbolic, while others become symbolic only after
data processing such as aggregation. An example of a naturally occurring symbolic-valued variable is the color of a bird species.
Some species have more than one color, so the value for each observation is a finite list of colors, resulting in a multivalued
variable. Another example of a naturally occurring symbolic-valued variable is blood pressure, which changes
throughout the day and from day to day, resulting in a range of values. Aggregated symbolic data, on the other hand,
may arise from hospital medical records. A hospital’s database may contain information on millions of admissions. It
may be too difficult to extract knowledge from such a large database. If the hospital administration wants to understand
physicians’ performance trends, for example, a database query can be issued to aggregate records by provider and year. The
resulting dataset will then contain variables with multiple values, hence producing symbolic data. For more examples of
natural and aggregated symbolic data, see Billard and Diday (2006a).
Symbolic data analysis offers a solution to massive-data problems, especially when the entities of interest are groups or
classes of observations. Recent advances in technology enable storage of extremely large databases. Larger datasets provide more
information about the subjects of interest; however, performing even simple exploratory procedures on these datasets requires
considerable computing power. As a result, much research effort in recent years has been steered toward finding more efficient methods
to accommodate large datasets. In many situations, characteristics of classes of observations are of higher interest to an analyst
than those of individual observations. In these cases, individual observations can be aggregated into classes of observations.
Aggregating individual observations into groups of interest turns an enormous dataset into a more manageable one. Traditionally,
when data are aggregated, either the mean or the median has been used to represent the entire group. However, some information
is lost in this process, which may produce misleading results when the aggregated dataset is analyzed. With symbolic
data, much of this information can be retained by including all observed values in each aggregated group.
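The contrast between classical and symbolic aggregation can be sketched in a few lines of Python. This is only an illustrative sketch: the admission records, the field names, and the `aggregate` helper are hypothetical and are not taken from any real hospital database. Each group of records is summarized once by its mean and once by an interval-valued and a multivalued symbolic variable.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical admission records: (provider, year, length_of_stay_days, diagnosis).
# These values are invented for illustration only.
records = [
    ("A", 2005, 2, "asthma"),
    ("A", 2005, 9, "pneumonia"),
    ("A", 2005, 4, "asthma"),
    ("B", 2005, 5, "fracture"),
    ("B", 2005, 5, "fracture"),
]


def aggregate(records):
    """Group records by (provider, year) and summarize each group."""
    groups = defaultdict(list)
    for provider, year, stay, diagnosis in records:
        groups[(provider, year)].append((stay, diagnosis))

    summary = {}
    for key, rows in groups.items():
        stays = [stay for stay, _ in rows]
        summary[key] = {
            # Classical summary: a single number represents the group.
            "mean_stay": mean(stays),
            # Interval-valued symbolic variable: observed range of stays.
            "stay_interval": (min(stays), max(stays)),
            # Multivalued symbolic variable: distinct diagnoses observed.
            "diagnoses": sorted({d for _, d in rows}),
        }
    return summary


symbolic = aggregate(records)
for key, row in symbolic.items():
    print(key, row)
```

In this toy dataset both groups have a mean stay of 5 days, yet their intervals, (2, 9) and (5, 5), differ. The mean alone cannot distinguish the two groups, whereas the symbolic description can.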
The likelihood function is well studied in the classical setting and lays the framework for many statistical
methodologies from estimation to regression and beyond. To extend these classical methods to symbolic data, a likelihood
function for symbolic data must first be introduced. The focus of this paper is to propose a likelihood function for symbolic data
and to illustrate some of its applications. In the following section, we give a brief introduction to symbolic data. In Section 3,
we propose an approach to finding the likelihood function of symbolic data. Then, in Section 4, we derive the maximum
likelihood estimators for some common types of symbolic data.