We first explore the potential parallelism of the recurrent neural network and propose a fine-grained two-stage pipeline implementation. The experimental results are shown in Table I. They show that the proposed GPU implementation achieves a 2× to 11× speed-up over the basic CPU implementation that uses the Intel Math Kernel Library. We then use the proposed GPU implementation