5. Other research topics in CLIR
5.1. Pivot language approach
So many languages are spoken in the world that it is not always possible to obtain the bilingual resources needed for a particular pair of languages. A promising technique for circumventing this limited availability of linguistic resources is the pivot language approach, in which an intermediate language acts as a mediator between two languages for which no bilingual resource is available. Suppose
that a CLIR task between Japanese and Dutch is requested by a user. In this case, machine-readable
resources of Japanese–Dutch pairs may be unavailable, and it would be easier to find Japanese–English and
Dutch–English resources since English is such a widely used language. Thus CLIR between Japanese
and Dutch can be performed via English (as an intermediary) without direct bilingual resources of Japanese
and Dutch.
The pivot language approach may also alleviate the problem of the explosive number of language combinations: if we have to perform CLIR between every pair of n languages, O(n²) resources are needed, whereas the pivot language approach allows us to handle the same task with only O(n) resources (Gey, 2001).
A basic way of using the pivot language approach would be a transitive translation of a query using two
bilingual dictionaries (Ballesteros, 2000). In the case of search from Japanese to Dutch via English, if Japanese–
English and English–Dutch dictionaries are available, CLIR can be performed by replacing Japanese
query terms with the corresponding English equivalents and successively substituting the English equivalents
with the Dutch equivalents. Of course, if Japanese–English and English–Dutch MT systems can be
used, a similar transitive translation is also feasible.
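To make the procedure concrete, the following is a minimal Python sketch of dictionary-based transitive query translation (Japanese to Dutch via English); the toy dictionaries, query terms, and function name are invented for illustration and are not taken from any actual lexical resource.

```python
# A minimal sketch of dictionary-based transitive query translation
# (Japanese -> English -> Dutch). All dictionary entries are toy placeholders.

ja_en = {"情報": ["information", "intelligence"], "検索": ["retrieval", "search"]}
en_nl = {"information": ["informatie"], "intelligence": ["inlichtingen", "intelligentie"],
         "retrieval": ["terugvinden"], "search": ["zoeken", "zoektocht"]}

def transitive_translate(query_terms, src_to_pivot, pivot_to_tgt):
    """Replace each source term with its pivot equivalents, then each pivot
    equivalent with its target-language equivalents."""
    target_terms = []
    for term in query_terms:
        for pivot_term in src_to_pivot.get(term, []):
            target_terms.extend(pivot_to_tgt.get(pivot_term, []))
    return target_terms

print(transitive_translate(["情報", "検索"], ja_en, en_nl))
# Every pivot equivalent fans out again at the second stage, so the number of
# target terms can grow multiplicatively -- the ambiguity problem discussed below.
```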
In the case of dictionary-based transitive translation, translation ambiguity can become an even more serious problem. The resulting translations may be doubly ambiguous, since each replacement stage introduces ambiguity: (1) from the source language to the intermediate language and (2) from the intermediate language to the target language. Suppose, for example, that a Japanese source query consists of four words, and every word has four English equivalents. If, in addition, every English equivalent has four Dutch equivalents, simple replacement produces 64 (= 4³) search terms in total from only 4 source terms, which will inevitably include some irrelevant translations. To address this problem, Ballesteros (2000) applied the disambiguation methods mentioned above (the co-occurrence-based method, query expansion, etc.) to transitive translation and attained a substantial improvement in search performance.
Gollins and Sanderson (2001) also proposed a technique called "lexical triangulation" to alleviate the translation ambiguity problem: two pivot languages are used independently, and erroneous translations are removed by keeping only the translations that the two transitive translation routes have in common.
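As a rough illustration of this idea, the following sketch intersects the candidate sets obtained through two independent pivot routes; the function names and the shape of the dictionaries are assumptions made for illustration, not details from the cited study.

```python
# A minimal sketch of the intuition behind lexical triangulation: translate a
# source term through two pivot languages independently and keep only the
# target-language translations produced by both routes.

def pivot_translate(term, src_to_pivot, pivot_to_tgt):
    """Set of target-language candidates obtained via a single pivot language."""
    candidates = set()
    for pivot_term in src_to_pivot.get(term, []):
        candidates.update(pivot_to_tgt.get(pivot_term, []))
    return candidates

def triangulate(term, route_a, route_b):
    """Intersect the candidate sets produced by two independent pivot routes,
    e.g. route_a = (source-to-English, English-to-target) dictionaries and
    route_b = (source-to-French, French-to-target) dictionaries."""
    return pivot_translate(term, *route_a) & pivot_translate(term, *route_b)
```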
The pivot language approach has been utilized in TREC, NTCIR, and CLEF when bilingual resources for a particular pair of languages were unavailable. For example, the following transitive combinations of languages have been explored:
• English → French → German (Franz et al., 1999)
• French → English → German, etc. (Gey, Jiang, Chen, & Larson, 1999)
• German → English → Italian (Hiemstra & Kraaij, 1999)
• Japanese → English → Chinese (Lin & Chen, 2003)
• Chinese → English → Japanese (Chen & Gey, 2003)
In particular, Franz et al. (1999) proposed some interesting techniques for searching German documents
with English queries:
(1) Convolution of translation probability: Estimating the translation probability from an English term $e$ to a German term $g$ through French terms $f$ such that
$$P(g \mid e) = \sum_{f} P(g \mid f)\, P(f \mid e)$$
(see the sketch after this list).
(2) Automatic query generation from the intermediate-language corpus: Generating a French query automatically by simply merging all non-stopwords in the top-ranked French documents retrieved by the English–French CLIR system, and then feeding that French query into the French–German CLIR system.
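The convolution in technique (1) can be sketched in a few lines; the probability tables below are invented toy values, whereas Franz et al. (1999) estimated such probabilities from actual bilingual resources.

```python
# A minimal sketch of convolving translation probabilities through a pivot
# language: P(g|e) = sum over f of P(g|f) * P(f|e). All values are toy data.

p_f_given_e = {"cat": {"chat": 0.9, "félin": 0.1}}             # P(f|e): English -> French
p_g_given_f = {"chat": {"Katze": 0.8, "Chat": 0.2},             # P(g|f): French -> German
               "félin": {"Katze": 0.6, "Raubkatze": 0.4}}

def convolve(e):
    """Estimate P(g|e) by summing over all pivot-language terms f."""
    p_g_given_e = {}
    for f, p_fe in p_f_given_e.get(e, {}).items():
        for g, p_gf in p_g_given_f.get(f, {}).items():
            p_g_given_e[g] = p_g_given_e.get(g, 0.0) + p_gf * p_fe
    return p_g_given_e

print(convolve("cat"))
# approximately {'Katze': 0.78, 'Chat': 0.18, 'Raubkatze': 0.04}
```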
5.2. Merging strategy for multilingual information retrieval
Suppose that we have a multilingual document collection in which two or more languages are mixed (not a parallel corpus), and a user wishes to search the collection with a query expressed in a single language. This
task is more complicated than simple bilingual CLIR. In CLEF and NTCIR, multilingual CLIR has been
adopted as a research task, and many research groups have worked on the issue.
Basically, there are two approaches to multilingual IR (Lin & Chen, 2003):
• Distributed architecture in which the document collection is separated by language, and each part is
indexed and retrieved independently.
• Centralized architecture in which the document collection in various languages is viewed as a single document
collection and is indexed in one huge index file.
In the distributed architecture, a standard bilingual search is performed separately on each language sub-collection, and each run generates its own ranked document list. The problem then becomes how to merge the results of these runs into a single ranked list in which the relevant documents in every language are ranked highly. Merging is essentially a general research issue in IR when searching distributed resources (i.e., distributed IR), where the ranked lists obtained from different resources inevitably have to be merged. In CLIR, the following merging strategies have been investigated:
• Raw score: straightforwardly using document scores estimated in each run.
• Round robin: interleaving each document list in a round robin fashion by assuming that distribution of
relevant documents is identical among the lists.
• Normalized score: normalizing the document scores within each run in order to remove the effect of collection-dependent statistics on the scores.
• Rank-based score: mathematically converting ranks in each run into scores by assuming a relationship
between the rank and probability of relevance.
• Modified score: modifying raw scores in each run so as to reduce effects of collection-size dependency,
translation ambiguity, etc.
If the retrieval model employed in each run can estimate the relevance probability of each document correctly, it would be reasonable to re-rank all documents together according to these probabilities (i.e., the raw scores). For example, Chen and Gey (2003) simply merged the results from the Chinese, Japanese and English collections according to the probabilities of relevance estimated by a logistic regression model. In most cases, however, it is difficult to regard each document score as a pure probability of relevance even if a probabilistic retrieval model is actually used. In this case, if we can assume that relevant documents are distributed in the same way across the separate language sub-collections, a simple strategy is round-robin merging, in which only the rank of each document is taken into account.
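A round-robin merge can be sketched in a few lines; the per-language result lists below are placeholders for the ranked lists produced by each run.

```python
# A minimal sketch of round-robin merging: take the next document from each
# per-language ranked list in turn until every list is exhausted.
from itertools import zip_longest

def round_robin_merge(*ranked_lists):
    """Interleave several ranked lists into one, preserving within-list order."""
    merged = []
    for tier in zip_longest(*ranked_lists):        # one document from each list per round
        merged.extend(doc for doc in tier if doc is not None)
    return merged

print(round_robin_merge(["de1", "de2", "de3"], ["fr1", "fr2"], ["it1", "it2", "it3"]))
# ['de1', 'fr1', 'it1', 'de2', 'fr2', 'it2', 'de3', 'it3']
```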
Otherwise, an alternative method is to use normalized document scores such that
$$v' = \frac{v - v_{\min}}{v_{\max} - v_{\min}},$$
where $v$ is a raw score, and $v_{\min}$ and $v_{\max}$ are the minimum and maximum scores in each run respectively (Powell, French, Callan, Connell, & Viles, 2000). Savoy (2002) has empirically compared search performance
among the four strategies of round robin, raw score, normalized score and the CORI approach (see Callan et al., 1995 for details) using the CLEF test collection, and reported that the normalized score performed best among them. Similarly, Moulinier and Molina-Salgado (2002) compared round robin, raw score, CORI, normalized score and a collection-weighted normalized score (a variation of the normalized score), and reported that the collection-weighted normalized score yielded higher mean average precision.
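The min-max normalization above can be turned into a merging procedure as sketched below; run names, document identifiers and raw scores are purely illustrative.

```python
# A minimal sketch of normalized-score merging: rescale each run's raw scores to
# [0, 1] using (v - v_min) / (v_max - v_min), pool all runs, and re-sort.

def normalize_and_merge(runs):
    """runs: {run_name: [(doc_id, raw_score), ...]} -> one list sorted by normalized score."""
    pooled = []
    for results in runs.values():
        scores = [score for _, score in results]
        v_min, v_max = min(scores), max(scores)
        span = (v_max - v_min) or 1.0              # guard against identical scores
        pooled.extend((doc, (score - v_min) / span) for doc, score in results)
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)

runs = {"german": [("de1", 12.4), ("de2", 7.1)], "french": [("fr1", 0.92), ("fr2", 0.35)]}
print(normalize_and_merge(runs))   # [('de1', 1.0), ('fr1', 1.0), ('de2', 0.0), ('fr2', 0.0)]
```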
Other techniques for estimating optimal scores for merging ranked lists have been proposed. Franz et al. (2000) empirically found a linear relationship between the log of the rank and the precision at that rank, and used scores converted according to this relationship to merge the results of each run. Similarly, the strategy of rank-based scoring was investigated by Kraaij et al. (2000). Hiemstra et al. (2001) also examined the effectiveness of modifying raw scores so as to remove the effect of collection-size dependency in the process of estimating them. Meanwhile, Lin and Chen (2003) proposed a method of modifying raw scores based on the degree of ambiguity arising when each source query is translated, under the assumption that a better translation is likely to retrieve more relevant documents. Savoy (2003a) tested a logistic regression formula for predicting the relevance probability of a document from its rank and score.
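The rank-based strategy can be illustrated with a conversion of the kind used by Franz et al. (2000), in which the score decreases linearly with the log of the rank; the intercept and slope below are invented placeholders, whereas in practice such coefficients would be fitted on training data.

```python
# A minimal sketch of rank-based scoring: map rank r (1-based) to a
# precision-like score a - b * log(r) and merge all runs by that score.
import math

def rank_based_scores(ranked_docs, intercept=1.0, slope=0.2):
    """Convert a ranked list into (doc, score) pairs using a - b * log(rank)."""
    return [(doc, intercept - slope * math.log(rank))
            for rank, doc in enumerate(ranked_docs, start=1)]

merged = sorted(rank_based_scores(["de1", "de2"]) + rank_based_scores(["fr1", "fr2", "fr3"]),
                key=lambda pair: pair[1], reverse=True)
print([doc for doc, _ in merged])   # ['de1', 'fr1', 'de2', 'fr2', 'fr3']
```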
On the other hand, for the centralized architecture, the set of multilingual documents is not divided into
sub-collections for each language. In order to search such a heterogeneous collection, we need either
(1) to translate the source query into all languages included in the document collection and to merge all
translations into a single query, or
(2) to translate the documents into the single language used in the query.
Gey et al. (1999), Chen (2002) and Nie and Jin (2003) employed the first method for searching the CLEF test collection. With this method, it may be necessary to adjust the idf factors, because documents written in a language that contributes fewer documents to the collection may gain an advantage from document-frequency-based weighting (Lin & Chen, 2003).
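A minimal sketch of option (1), building a single multilingual query from translations into every document language, is given below; the translation tables and language names are toy placeholders.

```python
# A minimal sketch of the first centralized-architecture option: translate the
# source query into each document language and merge the translations into one
# multilingual query posed against a single index. All tables are toy data.

translations = {
    "french":  {"economy": ["économie"], "policy": ["politique"]},
    "german":  {"economy": ["Wirtschaft"], "policy": ["Politik"]},
    "italian": {"economy": ["economia"], "policy": ["politica"]},
}

def build_multilingual_query(source_terms):
    """Union of all target-language equivalents of the source query terms."""
    query = []
    for table in translations.values():
        for term in source_terms:
            query.extend(table.get(term, []))
    return query

print(build_multilingual_query(["economy", "policy"]))
# ['économie', 'politique', 'Wirtschaft', 'Politik', 'economia', 'politica']
```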
5.3. Combination of some language resources
Needless to say, the quality and coverage of the language resources used for translation significantly affect the search performance of CLIR. Specifically, in the case of searches between two unrelated la