We chose HRNet32 as the backbone network and performed wildlife object detection in the manner of Cascade R-CNN [36,44]. HRNet obtains both strong semantic information and precise location information by maintaining parallel branches at multiple resolutions and repeatedly exchanging information between the branches [44]. Overall, Cascade R-CNN has four stages: one Region Proposal Network (RPN) and three detection stages with IoU thresholds of {0.5, 0.6, 0.7}. Sampling in the first detection stage follows Faster R-CNN [45]. In each subsequent stage, resampling is achieved by simply using the regression output of the previous stage. The model structure is shown in Figure 4.
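
To make the cascaded resampling concrete, the following minimal Python sketch counts how many proposals qualify as positives at each of the three IoU thresholds when every stage receives the boxes regressed by the previous one. It is an illustration only, not the implementation used in this work: the function names are hypothetical, a single ground-truth box is assumed, and `toy_regress` is a hand-written stand-in for the learned box regressor that simply moves each box halfway toward the ground truth.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# One IoU threshold per detection stage, as stated in the text.
IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_regress(box: Box, gt: Box, step: float = 0.5) -> Box:
    """Hypothetical stand-in for a learned regressor (assumption):
    moves each coordinate a fraction `step` of the way toward `gt`."""
    return tuple(b + step * (g - b) for b, g in zip(box, gt))


def cascade_positive_counts(proposals: List[Box], gt: Box) -> List[int]:
    """Count positives per stage; each stage is fed the boxes
    regressed by the previous stage (the resampling step)."""
    counts = []
    boxes = list(proposals)
    for threshold in IOU_THRESHOLDS:
        # A box is a positive sample if its IoU with gt meets the threshold.
        counts.append(sum(iou(b, gt) >= threshold for b in boxes))
        # The regressed boxes become the input of the next stage.
        boxes = [toy_regress(b, gt) for b in boxes]
    return counts


gt_box = (0.0, 0.0, 10.0, 10.0)
rpn_like_proposals = [
    (2.0, 0.0, 12.0, 10.0),
    (4.0, 0.0, 14.0, 10.0),
    (8.0, 0.0, 18.0, 10.0),
]
print(cascade_positive_counts(rpn_like_proposals, gt_box))  # [1, 2, 2]
```

In this toy setting the number of positives does not collapse as the threshold rises, because the boxes entering the later stages have already been refined. This is the motivation for resampling from the previous stage's regression output rather than reusing the original proposals.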