CNNs have a large number of hyperparameters, such as input
image patch size, number of layers, number of filters, size of
filters, and parameters for training. The tuning of these hyperparameters
is essential to obtain good performance for a specific
task. The input patch size of the CNN, the filter size and pooling stride of the first convolutional layer, and the sizes of the remaining filters were tuned to maximize performance in this study. An exhaustive search over all combinations of hyperparameters would be very time consuming. In this experiment, the hyperparameters are therefore tuned in two steps. First, the input patch size and the filter sizes are tuned using only three images, one each for training, validation, and testing. Then, the numbers of filters and layers are tuned on the full data set using the input patch size and filter sizes selected in the first step.
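As a rough illustration, the two-step search can be organized as in the following sketch. The candidate grids, the network layout, the dual-pol two-channel input, and the placeholder evaluate() routine are assumptions made for illustration only; they do not reproduce the exact configuration used in this study.

```python
# Sketch of a two-step hyperparameter search of the kind described above.
# Grids, architecture, and evaluate() are illustrative assumptions.
import itertools
import torch
import torch.nn as nn


def build_cnn(filter_size, n_filters=64, n_layers=3):
    """Small patch-regression CNN: conv/ReLU/pool blocks and a linear output."""
    layers, in_ch = [], 2  # assumed dual-pol SAR input (2 channels)
    for _ in range(n_layers):
        layers += [nn.Conv2d(in_ch, n_filters, filter_size, padding=filter_size // 2),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]
        in_ch = n_filters
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(n_filters, 1)]
    return nn.Sequential(*layers)


def evaluate(model, patch_size, subset):
    """Placeholder: train on `subset` with the given patch size and return a
    validation error, e.g. mean absolute error of ice concentration."""
    return torch.rand(1).item()


# Step 1: tune patch size and filter size on three images only
# (one each for training, validation, and testing).
_, best_patch, best_filter = min(
    (evaluate(build_cnn(f), p, "three_images"), p, f)
    for p, f in itertools.product([25, 33, 41, 49], [3, 5, 7])
)

# Step 2: fix those sizes and tune network width and depth on the full data set.
for n_filters, n_layers in itertools.product([32, 64, 128], [2, 3, 4]):
    model = build_cnn(best_filter, n_filters=n_filters, n_layers=n_layers)
    score = evaluate(model, best_patch, "full_dataset")
```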
The input patch size and the filter size are related to the intrinsic scale and complexity of the problem.
In the tuning experiments, different patch sizes were tested. The patch size was found to have a clear impact on the results, with most of the differences arising from the banding effect and melt surfaces. A patch size of 41 showed the least banding effect and the least underestimation of ice concentration (which might be caused by melt conditions) and was therefore adopted. CNN models with smaller patch sizes tend to underestimate ice concentration. Intuitively, the strength and the small-scale texture of the backscatter from a melt pond are very similar to those of calm water [33]; thus, the correct identification of melting ice requires more information from its neighborhood in the image. A small patch size causes confusion between melting ice and water because of this lack of supporting neighborhood information. Increasing the patch size beyond 41 does not lead to improved performance.
Similarly, a larger patch size also benefits the recognition of
wind-roughened water. If the banding effect were to be totally
removed from the image, the optimal patch size might be
different. Our tuning experiments suggest that the model is not
very sensitive to the selection of other parameters as long as the
model is large enough (sufficient number of filters and layers).
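For concreteness, the sketch below shows how a fixed-size patch centred on a labeled pixel might be extracted, which is where the patch size enters the pipeline. The array names, the dual-pol two-channel layout, and the edge padding are assumptions made for illustration; only the 41 × 41 window is taken from the discussion above.

```python
# Minimal sketch of centre-pixel patch extraction, assuming a dual-pol SAR
# scene stored as a (2, H, W) array and per-pixel ice-concentration labels.
import numpy as np


def extract_patch(scene, row, col, patch_size=41):
    """Return the patch_size x patch_size neighbourhood centred on (row, col).

    A larger window supplies more surrounding texture, which is what helps
    separate melt surfaces and wind-roughened water from calm water.
    """
    half = patch_size // 2
    # Pad with edge values so patches near the scene border stay full-sized.
    padded = np.pad(scene, ((0, 0), (half, half), (half, half)), mode="edge")
    return padded[:, row:row + patch_size, col:col + patch_size]


# Example: a random dual-pol scene and one sample location.
scene = np.random.rand(2, 500, 500).astype(np.float32)
patch = extract_patch(scene, row=120, col=300)
assert patch.shape == (2, 41, 41)
```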
The image analyses are subsampled because their spatial resolution is much coarser than that of the SAR images, which introduces
representation errors. It would be beneficial to model the errors
explicitly [35], although the CNN is relatively robust to training
sample errors [17]. Another benefit of the CNN is that it is