All CNNs in this work have been trained by optimizing the cross-entropy objective function with mini-batch Nesterov accelerated gradient descent [19]. Backpropagation of the gradient has been performed with an initial learning rate of 0.01 and a momentum of 0.9. In contrast to some recent works where CNNs are used only as feature extractors followed by other classification methods (such as SVM [21] or ELM [33]), in our experiments the CNNs serve both as feature extractors and as classifiers. Input RGB images have been normalized before being processed by the CNNs. In every epoch, each face is randomly replaced by its mirrored copy with probability 0.5 (i.e., either the face or its mirrored copy participates in each epoch). The mini-batch size has been set to 128. To prevent the CNNs from overfitting, we have employed dropout regularization [26] on the activations of the convolutional layers and the fully-connected layer. The dropout ratio depends on the size of the particular convolutional or fully-connected layer and varies from 0 (i.e., no dropout) to 0.5. Training has been stopped once the validation accuracy stops improving, which corresponds to a training accuracy between 98.0% and 98.1% (depending on the particular CNN architecture). Training has taken about 30 epochs, with slight variations depending on the particular CNN, corresponding to about 27 h of training for the Starting CNN and 2.5 h for CNN I on a contemporary GPU. All experiments in this work have been performed using the Theano deep learning library.
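A minimal sketch of this training setup in Theano is given below. The single softmax-layer stand-in model, its input and output dimensions, and the variable names are illustrative assumptions only; the paper's actual CNN architectures are not reproduced here. The sketch shows the cross-entropy objective and a standard formulation of the Nesterov momentum update with the stated learning rate of 0.01 and momentum of 0.9.

```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

# Illustrative stand-in model (a single softmax layer); the paper's CNNs are not shown.
n_in, n_classes = 32 * 32 * 3, 7            # assumed input/output sizes for illustration
W = theano.shared(np.zeros((n_in, n_classes), dtype=floatX), name='W')
b = theano.shared(np.zeros(n_classes, dtype=floatX), name='b')
params = [W, b]

x = T.matrix('x')                            # mini-batch of flattened, normalized RGB images
y = T.ivector('y')                           # integer class labels
p_y = T.nnet.softmax(T.dot(x, W) + b)

# Cross-entropy objective averaged over the mini-batch (batch size 128 in the paper)
loss = T.mean(T.nnet.categorical_crossentropy(p_y, y))

# Nesterov's accelerated gradient: learning rate 0.01, momentum 0.9
lr, mu = 0.01, 0.9
updates = []
for p in params:
    v = theano.shared(np.zeros(p.get_value().shape, dtype=floatX))
    g = T.grad(loss, p)
    v_new = mu * v - lr * g                  # velocity update
    updates.append((v, v_new))
    updates.append((p, p + mu * v_new - lr * g))   # Nesterov look-ahead parameter step

train_step = theano.function([x, y], loss, updates=updates)
```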
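The per-epoch mirroring augmentation and the dropout masking described above can be sketched as follows; the NCHW batch layout, the helper names, and the random seeds are assumptions for illustration rather than the paper's implementation.

```python
import numpy as np
import theano
from theano.tensor.shared_randomstreams import RandomStreams

def mirror_augment(batch, rng=np.random):
    """Replace each face by its mirrored copy with probability 0.5 (applied every epoch)."""
    flip = rng.rand(batch.shape[0]) < 0.5
    batch = batch.copy()
    batch[flip] = batch[flip, :, :, ::-1]    # horizontal flip along width; NCHW layout assumed
    return batch

srng = RandomStreams(seed=1234)

def dropout(h, p_drop):
    """Randomly zero activations with probability p_drop (0 disables dropout) and rescale."""
    if p_drop == 0.0:
        return h
    keep = 1.0 - p_drop
    mask = srng.binomial(n=1, p=keep, size=h.shape, dtype=theano.config.floatX)
    return h * mask / keep
```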