4.1. Datasets
Three evaluation datasets, each pairing English with a different target language, were used in the research reported in
this paper. They are detailed as follows:
1. English–French book review dataset (En–Fr): This dataset contains Amazon book review documents in both English
and French languages. This dataset was used by Prettenhofer and Stein [28]. In this dataset, the English language was
treated as the source language and the French language was treated as the target language. Documents in the English
language containing 2000 (1000 positive and 1000 negative) book reviews were used as the labelled data. A total of
4000 review documents (2000 positive and 2000 negative) were selected from the French dataset and treated as the unlabelled
data.
2. English–Chinese book review dataset (En–Ch): This dataset was selected from the Pan reviews dataset [25]. It contains
book review documents in the English and Chinese languages. As for the previous dataset, documents in the English language
containing 2000 (1000 positive and 1000 negative) book reviews were used as the labelled data. Documents in the
Chinese language containing 4000 (2000 positive and 2000 negative) book reviews were treated as the unlabelled data.
3. English–Japanese book review dataset (En–Jp): This dataset contains Amazon book review documents in the English and
Japanese languages. This dataset was also used by Prettenhofer and Stein [28]. In this dataset, the English language was
treated as the source language and the Japanese language was treated as the target language. Documents in the English
language containing 2000 (1000 positive and 1000 negative) book reviews were used as the labelled data. A total of 4000
review documents (2000 positive and 2000 negative) were selected from the Japanese dataset and treated as the unlabelled
data.
All review documents are labelled as being either positive or negative based on their sentiment polarities. Each Amazon
review has a polarity rating from zero to five stars. Zero stars indicate the most negative review and five stars indicate the most
positive review. All reviews with a rating greater than three stars are labelled as positive and those with a rating less than three
stars are labelled as negative. Reviews with three stars are discarded because their polarities are ambiguous. All the review
documents in the target languages were translated into the source language (English) using the Google Translate engine.1
Table 1 shows the properties of these three evaluation datasets.
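The labelling rule described above can be illustrated with a minimal sketch. The function name `label_from_stars` and the sample reviews are hypothetical and are not taken from the datasets; only the thresholds (greater than three stars, less than three stars, three stars discarded) come from the text.

```python
from typing import Optional


def label_from_stars(stars: float) -> Optional[str]:
    """Map a star rating to a polarity label (hypothetical helper).

    Ratings above three stars are positive, ratings below three stars are
    negative, and three-star reviews return None so they can be discarded.
    """
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # ambiguous polarity


# Invented sample reviews, for illustration only.
reviews = [("Loved it", 5), ("Average read", 3), ("Waste of time", 1)]

# Keep only the reviews whose polarity is unambiguous.
labelled = [
    (text, label_from_stars(stars))
    for text, stars in reviews
    if label_from_stars(stars) is not None
]
```

In this sketch the three-star review is dropped and the remaining two reviews keep their positive and negative labels.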
In the pre-processing step, all the English language reviews were converted to lowercase. Special symbols, one-character
words and other unnecessary characters were removed from each review document. In the feature extraction
step, unigram and bi-gram patterns were extracted as sentiment patterns. To reduce the computational complexity,
especially in density estimation, we performed feature selection using the information gain technique [37]. We selected
the 5000 highest-scoring unigrams and bi-grams as the final features. Each document was represented by a feature vector, and each entry
of a feature vector contained a feature weight. We used term presence as the feature weight since this method has been confirmed as the most effective feature weighting method in sentiment classification [26,36].
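The pre-processing, feature extraction, feature selection and weighting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the regular expression used to strip special symbols, the choice to form bi-grams after one-character words are removed, the alphabetical tie-breaking and the four toy reviews are all assumptions. Information gain is computed here in its standard form, H(class) - H(class | feature present or absent), and the toy example keeps k = 2 features where the paper keeps 5000.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase, replace non-alphanumeric symbols with spaces, drop 1-char words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [t for t in tokens if len(t) > 1]


def ngrams(tokens):
    """Return the set of unigrams and bi-grams present in a token list."""
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, feature):
    """IG = H(class) - H(class | feature present/absent)."""
    present = Counter(l for d, l in zip(docs, labels) if feature in d)
    absent = Counter(l for d, l in zip(docs, labels) if feature not in d)
    h_class = entropy(list(Counter(labels).values()))
    h_cond = sum(
        sum(part.values()) / len(docs) * entropy(list(part.values()))
        for part in (present, absent)
    )
    return h_class - h_cond


def select_features(docs, labels, k):
    """Keep the k features with the highest information gain (ties: alphabetical)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]


def presence_vector(doc, features):
    """Term-presence weighting: 1 if the feature occurs in the document, else 0."""
    return [1 if f in doc else 0 for f in features]


# Invented toy reviews (1 = positive, 0 = negative), for illustration only.
texts = [
    "A great book, I loved it!",
    "Great story and great characters.",
    "A boring book; I hated it.",
    "Boring plot, bad writing.",
]
labels = [1, 1, 0, 0]

docs = [ngrams(preprocess(t)) for t in texts]
features = select_features(docs, labels, k=2)
vectors = [presence_vector(d, features) for d in docs]
```

Term presence records only whether a feature occurs, so the repeated word "great" in the second toy review contributes the same weight as a single occurrence.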