Our CNN architecture G, shown in Figure 3, is trained to predict a probability distribution Z given a grayscale image X. The network has 8 blocks of convolutional layers. The first 5 blocks are initialized from the VGG convolutional layers, with some architectural modifications: we remove the pooling layers, place the stride in the preceding convolutional layer instead, and add batch normalization after every block of convolutions. Since the input is single-channel lightness rather than a three-channel RGB image, the weights of the first convolutional layer are averaged across the input channels. The VGG network downsamples the spatial feature maps by a factor of 2 after each block of convolutional layers and doubles the number of output channels. We remove the stride in the final conv4 layer and dilate the conv5 layer kernels by a factor of 2 to compensate for the increased input spatial resolution. This produces 4 times the amount of spatial information in the network bottleneck, at the cost of a small 10.3% increase in memory per image.
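The arithmetic behind these two modifications can be sketched with a small calculation. Assuming a 224x224 input (as in standard VGG) and the usual factor-of-2 downsampling after each of the first four blocks, removing the conv4 stride doubles each spatial side of the bottleneck, i.e. 4 times as many spatial locations, while dilating a 3x3 kernel by 2 expands its receptive field to match the coarser sampling it replaces. The input size and per-block stride lists below are illustrative assumptions, not taken verbatim from the architecture specification:

```python
from functools import reduce

def spatial_size(input_side, strides):
    """Side length of the feature map after a chain of strided blocks."""
    return reduce(lambda side, s: side // s, strides, input_side)

def effective_kernel(k, dilation):
    """Receptive field of a k x k kernel with the given dilation."""
    return dilation * (k - 1) + 1

INPUT = 224  # assumed input resolution, as in VGG

# Strides applied at the end of conv blocks 1-4 (pooling folded into convs).
standard = spatial_size(INPUT, [2, 2, 2, 2])  # -> 14 (conv5 input, vanilla VGG)
modified = spatial_size(INPUT, [2, 2, 2, 1])  # -> 28 (conv4 stride removed)

print(standard, modified)            # 14 28
print((modified / standard) ** 2)    # 4.0 -> 4x spatial locations at bottleneck
print(effective_kernel(3, 2))        # 5   -> dilated 3x3 conv5 kernel spans the
                                     #        extent a 3x3 kernel would have seen
                                     #        at the original (strided) resolution
```

Doubling the dilation rather than the kernel size keeps the parameter count of conv5 unchanged; only the activation maps grow, which is consistent with the modest 10.3% memory increase.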