Behavioral Data to be Modeled
Before we start, we need to be clear about the data to be used in the behavioral modeling application. At first this may seem trivial, but there are important issues to be decided. Considering our example application, there are N=30 participants and each participant produced T=100 choices on the IGT. The goal is to estimate parameter values from the data that provide evidence about underlying cognitive processes in individual decision makers. Individual data, however, contain the true effect perturbed by experimental error.
To reduce experimental error, a common approach is to aggregate the data by analyzing the choice proportions pooled across all 30 participants, and then to estimate a single set of four parameters from the group data. However, this approach implicitly assumes that there are no important individual differences, e.g., that all individuals have exactly the same recency learning parameter and the same degree of intrinsic variability in their choices. If individual differences are strong, and they usually are, then fitting the model to the aggregate data can be very misleading (Estes and Maddox, 2005). Consider the following well-known example from early learning theorists, who were concerned with comparing all-or-none learning models against incremental strength learning models, in which learning occurs gradually. Assume that each individual's learning curve can be described by a step function, i.e., a series of failures followed by a series of successes: learning actually occurs in each individual as an all-or-none process. The learning rate (the trial at which that step occurs) may, however, vary from individual to individual. Hence, if we generate data from the all-or-none model with individual differences in learning rate, the learning curve averaged across individuals begins to look smooth and gradual, supporting the predictions of the (incorrect) incremental strength learning models.
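This averaging artifact is easy to demonstrate by simulation. The sketch below (a minimal illustration, not taken from the source; subject counts and switch-trial range are arbitrary assumptions) generates step-function learning curves with individual differences in where the step occurs, and shows that the averaged curve rises gradually even though no individual curve does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 30, 100

# Each hypothetical learner follows an all-or-none step function:
# errors until a subject-specific switch trial, then correct thereafter.
switch_trials = rng.integers(10, 80, size=n_subjects)
curves = np.array([(np.arange(n_trials) >= t).astype(float)
                   for t in switch_trials])

# Average across subjects, as aggregate analyses do.
mean_curve = curves.mean(axis=0)

# Every individual curve jumps from 0 to 1 in a single trial, yet the
# averaged curve climbs smoothly, mimicking incremental learning.
print(np.round(mean_curve[::10], 2))
```

Each row of `curves` contains only 0s and 1s with a single jump, while `mean_curve` takes many intermediate values between 0 and 1.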
A better approach is thus to estimate the parameters separately for each individual from that individual's T=100 choice trials, resulting in N=30 sets of four parameter estimates, one set per participant. Obviously, this allows for any type of individual differences in parameters, and it also allows us to determine which of a set of competing models best fits each person. Using this approach, we can estimate the distribution of parameters across individuals, from which we can compute the mean and standard deviation. The drawback, however, is that this approach requires a relatively large amount of data from each individual, because parameter estimation performs poorly on small amounts of noisy data (Cohen et al., 2010).
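The logic of the individual-fitting approach can be sketched with a deliberately simplified stand-in model: instead of the four PVL parameters, each hypothetical individual here has a single choice-bias parameter, whose maximum-likelihood estimate from that person's trials is just the sample mean. The structure, fit each person separately and then summarize the resulting distribution of estimates, is the same; all names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_trials = 30, 100

# Hypothetical one-parameter stand-in for a cognitive model: each
# individual has their own choice-bias parameter (true individual
# differences drawn from a population distribution).
true_params = rng.beta(4, 4, size=n_subjects)
choices = rng.random((n_subjects, n_trials)) < true_params[:, None]

# Fit each individual separately from their own 100 trials.
# (For a Bernoulli rate, the maximum-likelihood estimate is the mean.)
estimates = choices.mean(axis=1)

# Summarize the estimated distribution of parameters across individuals.
print(f"mean = {estimates.mean():.3f}, sd = {estimates.std(ddof=1):.3f}")
```

With 100 trials per person the individual estimates track the true parameters closely; with far fewer trials they become noisy, which is exactly the drawback noted above.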
A third approach is to use a hierarchical approach (Shiffrin et al., 2008), which is a compromise between the first two. The model is fitted neither to aggregated nor to individual data; instead, the model itself incorporates a structure that accounts for individual differences within the group. This type of model, called a mixture model in psychology, is a probabilistic model that permits us to identify sub-groups within the entire group. A single probability mixture model is then fitted to all the data from all the participants. A specific class of these probability mixture models is also called hierarchical models. “Hierarchical” refers to the dependence among the parameters, not to the structure of the data: the parameters of hierarchical models are themselves given a model whose parameters are also estimated from the data (Gelman, 2005). In a Bayesian setting (Box 4.1), the parameters are themselves random variables, each with a prior distribution called a hyper-prior distribution, whose parameters are called hyper-parameters. This process may involve several levels (see Hierarchical Bayesian Analysis). That is, the mixture model incorporates an extra, higher-level set of assumptions regarding the distribution of the parameters across individuals. For example, a hierarchical PVL model would require us to postulate a joint distribution function for the four PVL parameters, and then to estimate a single set of higher-level parameters that specify that joint distribution. This approach requires a relatively large number of participants to obtain accurate estimates of the mixture density. The hierarchical modeling approach has an advantage over the aggregate modeling approach because it allows for a distribution of parameters across individuals; it also has an advantage over the individual modeling approach because it avoids fitting a separate set of parameters to each person.
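The core hierarchical idea, individual parameters drawn from a population distribution whose higher-level parameters are themselves estimated, can be sketched with an empirical-Bayes shortcut rather than a full Bayesian fit. Continuing the same hypothetical one-parameter stand-in model (not the actual PVL model), the code estimates Beta hyper-parameters by the method of moments and then shrinks each individual's estimate toward the group; the simple moment estimator ignores binomial sampling noise, so this is only an illustrative approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_trials = 30, 100

# Hypothetical population (hyper-level) distribution of individual
# choice-rate parameters, and each subject's observed successes.
true_rates = rng.beta(4, 4, size=n_subjects)
successes = rng.binomial(n_trials, true_rates)

# No-pooling (individual) and complete-pooling (aggregate) estimates.
mle = successes / n_trials
pooled = successes.sum() / (n_subjects * n_trials)

# Empirical-Bayes sketch of the hierarchical compromise: estimate
# Beta(a, b) hyper-parameters by the method of moments (sampling noise
# in the individual estimates is ignored here for simplicity), then
# compute each subject's posterior-mean estimate.
m, v = mle.mean(), mle.var(ddof=1)
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common
shrunk = (a + successes) / (a + b + n_trials)

# Each hierarchical estimate is pulled from the individual estimate
# toward the group mean ("shrinkage").
print(np.round(shrunk[:5], 3))
```

Because every subject contributes the same number of trials here, each shrunken estimate lies between that subject's individual estimate and the group mean, which is the compromise between the aggregate and individual approaches described above.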
However, the drawback of the hierarchical approach is that it requires an accurate assumption about the distribution of individual differences: if this assumption is wrong, the hierarchical modeling approach can produce poorer estimates of the distribution of parameters than the individual modeling approach.