Our CNN G architecture is shown in Figure 3 and is trained to predict a probability distribution Z given a grayscale image X. Our network has 8 blocks of convolutional layers. The first 5 convolutional blocks are initialized from the VGG convolutional layers, with some architectural modifications. We remove the pooling layers, place the stride in the preceding convolutional layer, and add batch normalization after every block of convolutions. Since the input is single-channel lightness rather than a three-channel RGB image, the weights of the first convolutional layer are averaged across the RGB input channels. The VGG network downsamples the spatial feature maps by a factor of 2 after each block of convolutional layers, and doubles the number of output channels. We remove the stride in the final conv4 layer and dilate the conv5 layer kernels by a factor of 2 to compensate for the increased input spatial resolution. This allows us to produce 4 times the amount of spatial information in the network bottleneck, with a small 10.3% increase in memory per image.
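As a concrete illustration of the first-layer adaptation described above, the following PyTorch sketch averages the pretrained VGG conv1_1 weights across the RGB input dimension to initialize a single-channel lightness input layer. PyTorch/torchvision and the specific layer indices are assumptions for illustration, not details given in the text.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Load pretrained VGG16; features[0] is conv1_1 with weight shape (64, 3, 3, 3).
vgg = vgg16(weights="IMAGENET1K_V1")
rgb_weight = vgg.features[0].weight.data

# New first layer takes 1-channel lightness input instead of 3-channel RGB.
conv1 = nn.Conv2d(1, 64, kernel_size=3, padding=1)
with torch.no_grad():
    # Average the pretrained weights over the RGB channel dimension: (64, 1, 3, 3).
    conv1.weight.copy_(rgb_weight.mean(dim=1, keepdim=True))
    conv1.bias.copy_(vgg.features[0].bias.data)
```

A dilated conv5 kernel as mentioned above would likewise be expressed as, e.g., nn.Conv2d(512, 512, 3, padding=2, dilation=2), which preserves the receptive field after the removed conv4 stride.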
We found that performing a simple bilinear upsampling from a reduced feature map produced results that were sufficiently detailed. We use deep supervision to help guide learning. To supervise an intermediate layer in the network, we add a readout layer on top, consisting of a 1×1 convolution to convert from feature to output space, followed by a softmax. The output is supervised using a cross-entropy loss with respect to the downsampled ground truth. We add these readout layers on conv6, conv7, and conv8. The losses are normalized by the number of spatial locations evaluated, with each deeper layer's loss weighted twice as much as the previous layer's.
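A minimal PyTorch sketch of this readout supervision, under stated assumptions: the number of output bins, the nearest-neighbor downsampling of the ground truth, and the names ReadoutHead and deep_supervision_loss are ours for illustration. The softmax is folded into the cross-entropy call, and the default 'mean' reduction provides the per-spatial-location normalization.

```python
import torch.nn as nn
import torch.nn.functional as F

class ReadoutHead(nn.Module):
    """1x1 convolution mapping feature space to output (color-bin) logits."""
    def __init__(self, in_channels, num_bins):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, num_bins, kernel_size=1)

    def forward(self, features):
        return self.proj(features)  # logits; softmax is applied inside the loss

def deep_supervision_loss(logits_per_layer, target):
    """Cross-entropy at each supervised layer (ordered shallow -> deep, e.g.
    conv6, conv7, conv8) against ground truth downsampled to that layer's
    resolution; each deeper layer's loss is weighted twice the previous one."""
    total, weight = 0.0, 1.0
    for logits in logits_per_layer:
        h, w = logits.shape[-2:]
        # Downsample integer class targets (N, H, W) to the layer's resolution.
        tgt = F.interpolate(target.unsqueeze(1).float(), size=(h, w),
                            mode="nearest").squeeze(1).long()
        total = total + weight * F.cross_entropy(logits, tgt)
        weight *= 2.0
    return total
```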