Real-time, high-precision retrieval over massive image collections has recently become a key technical challenge.
The vocabulary tree is an efficient method for addressing this challenge owing to its high precision
and fast retrieval time. Most of the existing construction methods for the vocabulary
tree are centralized. However, under a centralized scheme with limited memory, it is almost impossible to train a
vocabulary tree large enough to retrieve similar images with high precision. In
this paper, a new distributed in-memory vocabulary tree scheme based on the MapReduce
model is proposed for massive image training and retrieval. Firstly, a distributed image feature
extraction mechanism is presented to preprocess massive images. Secondly, a distributed
K-means algorithm based on the MapReduce model is proposed to build the first level of the vocabulary
tree concurrently. Thirdly, the big vocabulary tree is divided into many subtrees. The
entire training task for computing the vocabulary tree is divided into many subtasks. These
training subtasks are performed in parallel in the memory of the cluster nodes. This distributed
vocabulary tree strategy can support massive image training in memory. Similar
images can then be retrieved in a distributed manner based on the MapReduce model. In addition,
the training time and memory overhead of our proposed scheme are analyzed in detail. The
experimental results demonstrate that, as the number of compute nodes increases, the training time
and memory overhead on each node decrease linearly, and the retrieval time is
reduced compared with the centralized scheme without loss of retrieval precision.
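
As an illustration of the second step, one iteration of K-means can be expressed in a MapReduce style roughly as follows. This is a minimal single-process sketch in Python, not the implementation evaluated in this paper: the function names (map_partition, reduce_partials, kmeans_mapreduce), the representation of features as tuples of floats, and the policy of keeping the previous center for an empty cluster are all assumptions made for illustration. Each map task assigns its local feature vectors to the nearest center and emits per-center partial sums and counts; the reduce step merges these partial results into new centers.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
# Per-center partial result: (coordinate-wise sum, number of points).
Partial = Dict[int, Tuple[List[float], int]]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point: Vector, centers: Sequence[Vector]) -> int:
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def map_partition(points: Sequence[Vector], centers: Sequence[Vector]) -> Partial:
    """Map step (hypothetical): summarize one data partition.

    Only sums and counts leave the partition, so the data sent to the
    reducer is proportional to the number of centers, not the number of points.
    """
    partial: Partial = {}
    for point in points:
        k = nearest_center(point, centers)
        sums, count = partial.get(k, ([0.0] * len(point), 0))
        for d, value in enumerate(point):
            sums[d] += value
        partial[k] = (sums, count + 1)
    return partial


def reduce_partials(partials: Sequence[Partial],
                    centers: Sequence[Vector]) -> List[Vector]:
    """Reduce step (hypothetical): merge partial sums into new centers."""
    totals: Partial = {}
    for partial in partials:
        for k, (sums, count) in partial.items():
            acc_sums, acc_count = totals.get(k, ([0.0] * len(sums), 0))
            merged = [a + s for a, s in zip(acc_sums, sums)]
            totals[k] = (merged, acc_count + count)
    new_centers: List[Vector] = []
    for k, old in enumerate(centers):
        if k in totals:
            sums, count = totals[k]
            new_centers.append(tuple(s / count for s in sums))
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(tuple(old))
    return new_centers


def kmeans_mapreduce(partitions: Sequence[Sequence[Vector]],
                     centers: Sequence[Vector],
                     iterations: int = 10) -> List[Vector]:
    """Run a fixed number of map/reduce rounds over the partitions."""
    current = [tuple(c) for c in centers]
    for _ in range(iterations):
        # On a cluster, each map_partition call would run on a separate node.
        partials = [map_partition(part, current) for part in partitions]
        current = reduce_partials(partials, current)
    return current
```

In this sketch the partitions are processed sequentially in one process; on a cluster, each call to map_partition would correspond to a map task on a separate node, and the resulting centers would serve as the first-level nodes of the vocabulary tree.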