Underwater target classification with passive sonar remains a challenging task owing to the constantly changing ocean environment. Convolutional Neural Networks (CNNs) have proven successful at learning invariant features through local filtering and max pooling. In this paper, we propose a novel classification framework that combines the CNN architecture with second-order pooling (SOP) to capture the temporal correlations in the time-frequency (T-F) representation of radiated acoustic signals. The convolutional layers learn local features with a set of kernel filters from the T-F inputs, which are extracted by the constant-Q transform (CQT). Instead of max pooling, the proposed SOP operator learns the co-occurrences of different CNN filters along the temporal feature trajectory of the CNN features within each frequency subband. To preserve frequency distinctions, the correlated features of each frequency subband are retained. The pooling results are normalized with signed square-root and l2 normalization, and then fed into the softmax classifier. The whole network can be trained in an end-to-end fashion. To assess generalization to unseen conditions, the proposed CNN model is evaluated on real radiated acoustic signals recorded at new sea depths. The experimental results demonstrate that the proposed method yields an 8% improvement in classification accuracy over state-of-the-art deep learning methods.
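The per-subband second-order pooling with signed square-root and l2 normalization described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature-map layout `(C, F, T)`, the use of the full outer-product average, and keeping only the upper triangle of each symmetric co-occurrence matrix are assumptions for the sake of the example.

```python
import numpy as np

def second_order_pool(feat):
    """Second-order pooling over the temporal axis, per frequency subband.

    feat: CNN feature map of shape (C, F, T) -- C filters, F frequency
    subbands, T time steps (illustrative layout, an assumption here).
    For each subband f, the outer products of the C-dimensional filter
    responses are averaged over time, giving a (C, C) co-occurrence
    matrix; its upper triangle is kept, and subbands are concatenated
    so frequency distinctions are preserved.
    """
    C, F, T = feat.shape
    iu = np.triu_indices(C)                 # upper triangle incl. diagonal
    pooled = []
    for f in range(F):
        X = feat[:, f, :]                   # (C, T) temporal trajectory
        G = X @ X.T / T                     # (C, C) filter co-occurrences
        pooled.append(G[iu])                # retain each subband separately
    v = np.concatenate(pooled)
    v = np.sign(v) * np.sqrt(np.abs(v))     # signed square-root
    v = v / (np.linalg.norm(v) + 1e-12)     # l2 normalization
    return v

# Example: 8 filters, 4 subbands, 16 time steps
vec = second_order_pool(np.random.randn(8, 4, 16))
print(vec.shape)  # (144,) = 4 subbands * 8*9/2 upper-triangle entries
```

In an end-to-end network, this pooling layer would sit between the last convolutional layer and the softmax classifier, with gradients flowing through the outer products and normalizations.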