Abstract: |
Video action recognition has recently received considerable attention, and deep-learning-based methods have achieved promising performance. Most existing methods focus on encoding spatiotemporal information to learn video representations but ignore the relevance among channels. In this paper, a novel attention model, the Channel-wise Temporal Attention Network (CTAN), is proposed to exploit fine-grained key information for action recognition. First, a channel-wise attention generation module is proposed to emphasize fine-grained informative features in each frame. Then, a temporal information aggregation module is introduced before attention generation to exploit the interaction between different frames. Finally, a discriminative video-level representation for action recognition is produced through end-to-end training. |
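The two modules named in the abstract — temporal information aggregation followed by channel-wise attention generation — resemble squeeze-and-excitation-style channel gating applied across frames. The following is a minimal NumPy sketch under that assumption; the function name, bottleneck weights, and all shapes are illustrative and not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_temporal_attention(feats, w1, w2):
    """Reweight the channels of per-frame features (hypothetical sketch).

    feats: (T, C) array, one C-dimensional channel descriptor per frame.
    w1: (C//r, C) and w2: (C, C//r) are illustrative bottleneck weights.
    """
    # Temporal information aggregation: pool channel statistics over frames
    pooled = feats.mean(axis=0)                        # shape (C,)
    # Channel-wise attention generation: bottleneck MLP with a sigmoid gate
    gate = sigmoid(w2 @ np.maximum(w1 @ pooled, 0.0))  # shape (C,), values in (0, 1)
    # Emphasize informative channels in every frame
    return feats * gate[None, :]

T, C, r = 8, 16, 4
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, C))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_temporal_attention(feats, w1, w2)
print(out.shape)  # (8, 16)
```

Because the gate lies in (0, 1), the module can only attenuate channels, letting end-to-end training learn which fine-grained channel responses to emphasize in each frame.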