Since most recent studies in sentiment classification have been performed on a limited number of languages (usually English), labelled sentiment data are scarce in other languages [23]. The challenge therefore arises of how to utilise labelled sentiment resources in one language (the source language) for sentiment classification in another language (the target language). This challenge has led to an interesting research area called cross-lingual sentiment classification (CLSC). Most existing studies have employed automatic machine translation to directly translate the test data from the target language into the source language [20,25,32,33]; a classifier trained in the source language is then used to classify the translated test data.
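As a concrete illustration, this translate-then-classify baseline can be sketched as follows. The sketch is only illustrative and is not the pipeline of any specific cited work: `translate_to_source` is a hypothetical placeholder for whatever machine-translation system is used, and the features and classifier are ordinary scikit-learn components.

```python
# Sketch of the translate-then-classify CLSC baseline.
# `translate_to_source` is a hypothetical placeholder for an MT system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def translate_to_source(doc: str) -> str:
    """Placeholder: call a machine-translation service here."""
    raise NotImplementedError

def translate_then_classify(source_docs, source_labels, target_test_docs):
    # 1. Train a sentiment classifier on labelled source-language reviews.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(source_docs)
    clf = LinearSVC().fit(X_train, source_labels)

    # 2. Machine-translate the target-language test documents into the
    #    source language, then classify the translations.
    translated = [translate_to_source(d) for d in target_test_docs]
    return clf.predict(vectorizer.transform(translated))
```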
However, the term distributions of the original and the translated documents differ, owing to the variety of writing styles and linguistic expressions across languages. This means that a term frequently used to express an opinion in one language may have a translation that is rarely used in the other language. Hence, these methods cannot reach the performance level of monolingual sentiment classification. To alleviate this problem, making use of unlabelled data from the target language can be helpful, since such data are usually easy to obtain and naturally follow the term distribution of the target language. Employing unlabelled data from the target language in the learning process is therefore expected to result in better classification performance in CLSC.
Semi-supervised learning [24] is a well-known technique that makes use of unlabelled data to improve classification performance. One of the most commonly used semi-supervised learning algorithms is self-training, an iterative process: in each learning cycle, it automatically labels examples from the unlabelled data and adds them to the initial training set, usually selecting the examples that the current classifier labels with the highest confidence. However, if the initial classifier is not accurate enough, the probability of adding incorrectly labelled examples to the training set increases. The addition of such ‘‘noisy’’ examples not only fails to improve the learning model, but also gradually degrades the classifier’s performance. Moreover, the most confident examples that self-training selects are not necessarily the most informative instances for improving the classifier, especially for discriminative classifiers such as SVM [16]. To address these problems, we combine self-training with active learning in order to enrich the initial training set with selected examples from the unlabelled pool during the learning process. Active learning iteratively selects as few of the most informative examples as possible from the unlabelled pool and has them labelled by a human expert before adding them to the training set. These two techniques, self-training and active learning, complement each other: together they increase the performance of CLSC while reducing the human labelling effort.
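To make this interplay concrete, the following sketch shows one way such a combined loop can be organised, assuming a probabilistic classifier so that prediction confidence is available; the `oracle` callable, the thresholds and the logistic-regression base learner are illustrative assumptions rather than the exact configuration proposed in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_active_learn(X_lab, y_lab, X_unlab, oracle,
                            conf_threshold=0.9, n_query=5, n_iters=10):
    """Sketch of a combined self-training / active-learning loop.

    X_lab, X_unlab : dense feature matrices (numpy arrays).
    oracle         : hypothetical callable returning the human label
                     for unlabelled example i (the annotator).
    """
    mask = np.ones(len(X_unlab), dtype=bool)          # still-unlabelled pool
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(n_iters):
        idx = np.where(mask)[0]
        if len(idx) == 0:
            break
        proba = clf.predict_proba(X_unlab[idx])
        conf = proba.max(axis=1)
        pred = clf.classes_[proba.argmax(axis=1)]

        # Self-training step: pseudo-label the high-confidence examples.
        confident = idx[conf >= conf_threshold]
        # Active-learning step: send the least confident (most
        # informative) examples to the human annotator.
        queried = np.setdiff1d(idx[np.argsort(conf)[:n_query]], confident)

        X_new = X_unlab[np.concatenate([confident, queried])]
        y_new = np.concatenate([pred[np.isin(idx, confident)],
                                np.array([oracle(i) for i in queried],
                                         dtype=y_lab.dtype)])
        X_lab = np.vstack([X_lab, X_new])
        y_lab = np.concatenate([y_lab, y_new])
        mask[confident] = False
        mask[queried] = False
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return clf
```

In each cycle the classifier thus grows its training set cheaply with pseudo-labels while spending a small annotation budget where it is most uncertain.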
In this paper, we propose a new model based on the combination of active learning and semi-supervised self-training in order to incorporate unlabelled data from the target language into the learning process. Because active learning tends to select the most informative examples (in most cases, the most uncertain ones), the selected examples may be outliers, especially in sentiment classification of user reviews. To avoid selecting outliers in the active learning step, the proposed method considers the density of the candidate examples, choosing those informative examples that have the maximum average similarity to the unlabelled data (i.e., the most representative ones).
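For example, one common way to realise such a density criterion, shown below under the assumption of a vector-space representation with cosine similarity (an illustrative choice, not necessarily the exact similarity measure used in our experiments), is to weight each candidate's uncertainty by its average similarity to the rest of the unlabelled pool:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_scores(uncertainty, X_unlab):
    """Combine uncertainty with density so that uncertain outliers
    are not selected for human labelling.

    uncertainty : 1-D array, higher = more informative to label.
    X_unlab     : feature matrix of the unlabelled pool.
    """
    sim = cosine_similarity(X_unlab)                 # pairwise similarities
    np.fill_diagonal(sim, 0.0)                       # ignore self-similarity
    density = sim.sum(axis=1) / (sim.shape[0] - 1)   # average similarity
    return uncertainty * density                     # dense and uncertain wins

# Usage sketch: query the top-k examples that are both uncertain
# and representative of the unlabelled data.
# scores = density_weighted_scores(1.0 - conf, X_unlab)
# to_query = np.argsort(scores)[-n_query:]
```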
The proposed method was then applied to book review datasets in three different languages. The experimental results showed that our method effectively increased performance while reducing the human labelling effort for CLSC, in comparison with several existing and baseline methods.
This paper is an extended version of work published in [9]. We extend our previous work in four directions. First, we add a more detailed description of the problem and the corresponding solutions, together with further discussion of the experimental results; new findings from additional experiments are also presented in this version. Secondly, more evaluation datasets in new languages are used in the evaluation section to demonstrate the generality of the proposed model across languages. Thirdly, the scope of the comparison is extended by adding more baseline methods and one of the best-performing previous methods in CLSC in order to demonstrate the effectiveness of the proposed model. Finally, in order to assess whether there are
signi