Network Architecture
A triplet-based network architecture is proposed for the
ranking loss function (4), illustrated in Fig. 2.

Figure 2. The network architecture of the deep ranking model.

This network takes image triplets as input. One image triplet contains a query image p_i, a positive image p_i^+ and a negative image p_i^-, which are fed independently into three identical deep neural networks f(.) with shared architecture and parameters. A triplet characterizes the relative similarity relationship among the three images. The deep neural network f(.) computes the embedding of an image p_i: f(p_i) ∈ R^d, where d is the dimension of the feature embedding.
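The triplet setup above can be sketched in a few lines. This is a minimal illustration, not the paper's actual network: the shared weight matrix W stands in for the full deep network f(.), and the 8-dimensional inputs and 4-dimensional embedding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # shared parameters of f(.)

def f(p):
    """Embed an image vector p into R^d (here d = 4); a linear
    stand-in for the shared deep network."""
    return W @ p

p_i  = rng.standard_normal(8)  # query image
p_ip = rng.standard_normal(8)  # positive image
p_in = rng.standard_normal(8)  # negative image

# The three images are fed independently through identical networks
# with shared parameters, yielding one embedding per image.
e_q, e_pos, e_neg = f(p_i), f(p_ip), f(p_in)
assert e_q.shape == e_pos.shape == e_neg.shape == (4,)
```

Because the parameters are shared, gradients from a triplet's loss accumulate into the single set of weights W regardless of which of the three branches produced them.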
A ranking layer on top evaluates the hinge loss (3) of a triplet. The ranking layer has no parameters. During learning, it evaluates the model's violation of the ranking order and back-propagates the gradients to the lower layers, so that the lower layers can adjust their parameters to minimize the ranking loss (3).
We design a novel multiscale deep neural network architecture that employs different levels of invariance at different scales, inspired by [8], shown in Fig. 3. The ConvNet in this figure has the same architecture as the convolutional deep neural network in [15]. The ConvNet encodes strong invariance and captures the image semantics. The other two parts of the network take down-sampled images and use shallower network architectures. These two parts have less invariance and capture the visual appearance. Finally, we normalize the embeddings from the three parts and combine them with a linear embedding layer. In this paper, the dimension of the embedding is 4096.
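The normalize-and-combine step can be sketched as below. The per-part dimensions (16 each) and the 8-dimensional output are illustrative stand-ins chosen to keep the example small; the paper's final embedding dimension is 4096, and the normalization here is assumed to be L2.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed normalization)."""
    return x / (np.linalg.norm(x) + eps)

# Stand-in outputs of the three sub-networks for one image: the full
# ConvNet and the two shallower parts on down-sampled inputs.
e_conv, e_mid, e_low = (rng.standard_normal(16) for _ in range(3))

# Linear embedding layer over the concatenated, normalized parts.
W_embed = rng.standard_normal((8, 48))
combined = np.concatenate([l2_normalize(e) for e in (e_conv, e_mid, e_low)])
embedding = W_embed @ combined
assert embedding.shape == (8,)
```

Normalizing each part before the linear layer keeps one sub-network from dominating the combined embedding purely through the scale of its activations.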
We start with a convolutional network (ConvNet) architecture for each individual network, motivated by the recent
success of ConvNet in terms of scalability and generalizability for image classification [15]. The ConvNet contains
stacked convolutional layers, max-pooling layers, local normalization layers, and fully-connected layers. Readers can refer to [15] or the supplemental material for more details.
A convolutional layer takes an image or the feature maps of another layer as input, convolves it with a set of k learnable kernels, and passes the result through the activation function to