The spider linear support vector machine (SVM) algorithm (version
1.71; www.kyb.tuebingen.mpg.de/bs/people/spider) was used for MVP
analyses. To compare the classification performance for delayed recall
trials to that of long-term memory and immediate recall trials, we first
created run-wise folds of the data and cross-validated by training on
two runs and testing on the remaining run. This run-wise cross-validation
is more conservative than trial-wise cross-validation. For
each fold a separate recursive feature elimination (RFE) analysis with an
initial feature reduction step was performed. The initial feature reduction
step consisted of selecting the 5000 voxels that discriminated most strongly
between the training classes. This selection was based on the unsigned t-statistic
computed for each feature separately. Importantly, this selection
step only included trials belonging to the training set at the particular
fold. The training set within a fold was further split 10 times, each
time leaving out another 1/10th of the available trials to avoid
overfitting. The SVM classifier was trained on each split and the average
absolute discriminative weights (|w|) of the features over the ten splits
were computed. These averaged weights were used in the recursive feature
elimination step to discard the 10% least discriminating features.
The recursive feature elimination step was performed fifty times, each
time discarding 10% of the features. At each iteration, the average accuracy
(and classifier certainty) over the 10 splits of the training set was
computed. As described above, this entire procedure was repeated three
times (once for each fold), changing the training and testing runs. Lastly,
for each recursive feature elimination level the classification accuracy
(and classifier certainty) averaged over the three folds at that level
was obtained. The best recursive feature elimination level is the level
with the highest average classification accuracy.
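The nested procedure above (univariate t-based reduction on the training data, inner splits for averaging |w|, and iterative discarding of the 10% least discriminative features) can be sketched as follows. This is an illustrative toy implementation using scikit-learn's LinearSVC rather than the spider toolbox, with much smaller default dimensions than the 5000 voxels, 50 iterations, and three runs of the actual analysis; all function and variable names are our own.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def tstat_select(X, y, n_keep):
    """Rank features by the unsigned two-sample t-statistic (training data only)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    t = np.abs(a.mean(0) - b.mean(0)) / (se + 1e-12)
    return np.argsort(t)[::-1][:n_keep]

def rfe_run_wise(X, y, runs, n_init=100, n_iters=10, drop=0.10,
                 n_splits=10, seed=0):
    """Run-wise cross-validated recursive feature elimination.
    Returns an (n_runs, n_iters) array of held-out-run accuracies."""
    run_ids = np.unique(runs)
    accs = np.zeros((len(run_ids), n_iters))
    for f, test_run in enumerate(run_ids):
        tr = runs != test_run
        Xtr, ytr = X[tr], y[tr]
        Xte, yte = X[~tr], y[~tr]
        feats = tstat_select(Xtr, ytr, n_init)   # initial univariate reduction
        for it in range(n_iters):
            # Average |w| over inner splits, each leaving out 1/n_splits of trials.
            w_sum = np.zeros(len(feats))
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
            for sub_idx, _ in skf.split(Xtr, ytr):
                clf = LinearSVC(max_iter=5000).fit(Xtr[sub_idx][:, feats], ytr[sub_idx])
                w_sum += np.abs(clf.coef_[0])
            # Accuracy at this RFE level on the held-out run.
            clf = LinearSVC(max_iter=5000).fit(Xtr[:, feats], ytr)
            accs[f, it] = clf.score(Xte[:, feats], yte)
            # Discard the fraction of features with the smallest average |w|.
            n_drop = max(1, int(round(drop * len(feats))))
            feats = feats[np.argsort(w_sum)[n_drop:]]
    return accs
```

Averaging the accuracies over the first axis (folds) and taking the argmax then yields the best RFE level, as in the procedure described above.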
When testing whether training to discriminate immediate from long-term
retrieval examples generalized to the delayed retrieval trials, the test
set included only delayed trials that occurred in the run not used for
training. This ensured the same level of independence between training
and test data for both the reference categories (immediate and long-term
retrieval) and the target category (delayed retrieval).
In the standard MVP generalization approach, the best iteration in
the RFE approach is selected based on the accuracy obtained on the
the test data. Consequently, when testing on the delayed retrieval target examples,
the iteration is chosen that gives the highest classification of
these target trials. However, to maximize the rigour of testing the reverse
inference hypothesis, we also report classification accuracies for
target examples at the iteration that was optimal for discriminating
the reference examples, i.e., the iteration that gave the highest accuracy
in classifying immediate versus long term retrieval examples.
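The difference between the two reporting schemes reduces to which accuracy curve fixes the RFE iteration. A minimal sketch with hypothetical per-iteration accuracies (the values below are invented for illustration):

```python
import numpy as np

# Hypothetical per-iteration accuracies, averaged over the three folds.
ref_acc = np.array([0.55, 0.70, 0.65, 0.60])     # immediate vs long-term (reference)
target_acc = np.array([0.52, 0.58, 0.66, 0.61])  # delayed trials (target)

# Standard approach: best iteration chosen on the target examples themselves.
lenient = target_acc.max()

# Stricter reverse-inference test: iteration fixed by the reference accuracy,
# then the target accuracy is read off at that iteration.
strict = target_acc[np.argmax(ref_acc)]
```

The stricter value can only be lower than or equal to the lenient one, which is why it provides the more rigorous test of generalization.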
Statistical significance was based on 250 re-analyses of the data following
the exact same steps as described above (initial reduction, RFE
and best iteration selection), but with random redistribution of the
training labels. The label re-assignment mimics the null hypothesis
that there is no systematic association between feature values and classes,
so that the class labels are interchangeable. Random accuracies
were on average higher than 50.0% (range 47.0 to 56.2%) due to the selection
of the best iteration. The individual significance level was 0.05. The
significance of individual participant results was determined by the centile
position of the observed accuracy amongst the 250 accuracies obtained with the repeated
randomization procedure. Significance at the group level was
established with the cumulative binomial coefficient, to take into account
the accumulation of chance.
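The two significance computations described above can be sketched as follows; this is one common way to implement a permutation-based centile position and a cumulative binomial group test, not the authors' exact code.

```python
import numpy as np
from math import comb

def permutation_p(observed_acc, null_accs):
    """Position of the observed accuracy among the permutation accuracies,
    expressed as the proportion of permuted analyses reaching at least the
    observed accuracy (with the conventional +1 correction)."""
    null_accs = np.asarray(null_accs)
    return (np.sum(null_accs >= observed_acc) + 1) / (len(null_accs) + 1)

def binomial_group_p(n_significant, n_participants, alpha=0.05):
    """Probability of observing at least n_significant individually
    significant participants by chance, from the cumulative binomial
    distribution with per-participant false-positive rate alpha."""
    return sum(comb(n_participants, k) * alpha**k * (1 - alpha)**(n_participants - k)
               for k in range(n_significant, n_participants + 1))
```

Comparing the observed accuracy against the permutation distribution (rather than against a nominal 50% chance level) automatically accounts for the optimistic bias introduced by selecting the best RFE iteration, which is why the null accuracies above 50% do not invalidate the test.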
Parameter dependency of classification performance
Because there are many parameters to select in the entire MVP procedure,
the influence of several of these parameter settings on the result
was investigated. With respect to the initial univariate feature reduction
step we investigated the effect of the number of features selected (All,
1000, 5000 or 2500 voxels) and the score used for the selection. De
Martino et al. (2008) obtained the best results in a simple auditory discrimination
task with a score highlighting the most active voxels
within each of the training classes. Although we did not expect that, in our
more complex and effortful cognitive task, the most active voxels
would also be the most important for distinguishing between the
two types of memory retrieval, we nonetheless investigated the effect
of using this selection score. A second potentially influential factor is
the algorithm used. While most of the analyses were done with the spider
implementation of the SVM because of its speed, we verified the results
o