Finally, the aligned and row profile scaled data set, comprising 50 samples and 866 variables, was normalized for the pattern recognition analysis. Chemometric tools were applied to extract the main information from the multivariate data and to develop classification models according to variety in the 1st case and geographical origin in the second case. In the 1st attempt, principal component analysis (PCA), an unsupervised technique, was performed to reduce the dimensionality of the original data matrix and provide a new set of variables obtained as linear combinations of the original features. The resulting PCA scores matrix was the input for linear discriminant analysis (LDA). In a second attempt, stepwise linear discriminant analysis (SLDA) was applied as a classification method to optimize the discrimination. In this method, an LDA classification model is constructed by applying a stepwise variable selection procedure, so that the most significant variables involved in sample differentiation are selected using Wilks' Lambda as the selection criterion and an F-statistic to determine the significance of the changes in Lambda when evaluating the influence of each new variable. Before choosing a new variable to include, the procedure checks whether all of the previously selected variables remain significant. If a variable selected earlier is no longer useful, it is removed. The procedure stops when no other variables meet the criteria for entry or when the variable to be included next is one that was just removed. The leave-one-out method was used as the cross-validation procedure to evaluate the classification performance. The preprocessing was implemented in MATLAB version 7.0.1.24704 (R14) Service Pack 1. The MATLAB code for icoshift is available for download at www.models.life.ku.dk/source. PCA and SLDA were performed using SPSS 17.0 software (SPSS Inc., Chicago, Ill., U.S.A.).
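The entry step of the stepwise selection described above can be illustrated with a short sketch. It is written in Python with NumPy rather than SPSS, and it is not the software's implementation: the function names `wilks_lambda` and `forward_step` are hypothetical, only the entry step is shown (the removal check and the entry thresholds are omitted), and the partial F-to-enter formula is the standard textbook form, assumed here to correspond to the criterion used.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T) for the columns of X grouped by y.

    W is the pooled within-group sum-of-squares matrix, T the total one.
    """
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered
    W = np.zeros_like(T)
    for label in np.unique(y):
        Xg = X[y == label]
        d = Xg - Xg.mean(axis=0)
        W += d.T @ d
    return np.linalg.det(W) / np.linalg.det(T)

def forward_step(X, y, selected):
    """Return (variable index, partial F) for the best candidate to enter.

    `selected` lists the column indices already in the model (may be empty).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, g, p = X.shape[0], len(np.unique(y)), len(selected)
    lam_current = wilks_lambda(X[:, selected], y) if selected else 1.0
    best = None
    for j in range(X.shape[1]):
        if j in selected:
            continue
        lam_new = wilks_lambda(X[:, selected + [j]], y)
        partial = lam_new / lam_current          # partial Lambda for variable j
        f_enter = (n - g - p) / (g - 1) * (1.0 - partial) / partial
        if best is None or f_enter > best[1]:
            best = (j, f_enter)
    return best

# Synthetic illustration only: 3 groups of 10 samples, 5 variables,
# with column 2 made to differ between groups.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 10)
X_demo = rng.normal(size=(30, 5))
X_demo[:, 2] += 3.0 * y_demo
first_var, first_f = forward_step(X_demo, y_demo, [])
```

In a full stepwise run this step would be repeated, with each candidate's F compared against an entry threshold and the variables already in the model re-tested for removal, as the text describes.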
Results and Discussion
Data preprocessing
Figure 1A reports the original overlapped chromatograms after the removal of the leading and trailing sections and piecewise linear baseline correction. It is evident that a pretreatment is necessary to correct peak shifts. Figure 1B shows the profiles pretreated by the icoshift method, as indicated in the "Materials and Methods" section.
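To make the idea of peak-shift correction concrete, the following Python/NumPy sketch aligns one profile to a reference by trying integer shifts and keeping the one with the highest inner product. It is a deliberately simplified, single-interval illustration and not the icoshift algorithm itself: the function names and the `max_shift` parameter are hypothetical, and `np.roll` wraps the signal around at the edges, which a practical tool would handle more carefully.

```python
import numpy as np

def best_shift(reference, signal, max_shift):
    """Integer shift (in points) that maximizes the inner product with the reference."""
    best, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        score = float(np.dot(reference, np.roll(signal, s)))
        if score > best_score:
            best, best_score = s, score
    return best

def align(reference, signal, max_shift):
    """Return the shifted signal and the shift that was applied."""
    s = best_shift(reference, signal, max_shift)
    return np.roll(signal, s), s

# Synthetic illustration only: a Gaussian peak displaced by 4 points.
x = np.arange(100)
reference_demo = np.exp(-0.5 * ((x - 50) / 3.0) ** 2)
shifted_demo = np.roll(reference_demo, 4)
aligned_demo, applied_shift = align(reference_demo, shifted_demo, max_shift=10)
```

Applying such a correction separately within intervals of the chromatogram, rather than to the whole profile at once, allows different peaks to move by different amounts.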
PCA data compression
After preprocessing the raw data, PCA was performed as a preliminary step in the multivariate analysis in order to extract the main information in the data. Seventeen principal components (with eigenvalues greater than 1), which together accounted for 89.35% of the total variance, were considered significant. When the sample scores were represented in the two-dimensional space defined by the 1st 2 principal components (accounting for 40.58% of the total variance), no clear separation between samples could be obtained according to either variety or geographical origin (data not shown). Therefore, supervised pattern recognition methods were applied to achieve a higher level of group separation.
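The component-retention step can be sketched as follows in Python with NumPy. This is an illustration under stated assumptions rather than the SPSS procedure: it assumes PCA on the correlation matrix of autoscaled variables and a retention rule of eigenvalues greater than 1, and the function name `pca_kaiser` and its `threshold` parameter are hypothetical. Columns with zero variance would need to be removed before use.

```python
import numpy as np

def pca_kaiser(X, threshold=1.0):
    """PCA on the correlation matrix, keeping components with eigenvalue > threshold.

    Returns (scores, retained eigenvalues, fraction of total variance retained).
    """
    X = np.asarray(X, dtype=float)
    # Autoscale: zero mean and unit variance per variable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)       # returned in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > threshold
    explained = eigvals / eigvals.sum()
    scores = Z @ eigvecs[:, keep]                 # samples x retained components
    return scores, eigvals[keep], float(explained[keep].sum())

# Synthetic illustration only: 50 samples, 8 variables, two of them correlated.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 8))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=50)
scores_demo, eig_demo, cum_var_demo = pca_kaiser(X_demo)
```

The first two columns of the returned scores correspond to the two-dimensional score plot mentioned in the text, and the full scores matrix is what would be passed on to LDA.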