Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

September 30, 2019 · Declared Dead · 🏛 International Conference on Machine Learning and Applications

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Alexandros Stergiou, Ronald Poppe
arXiv ID: 1909.13474
Category: cs.CV (Computer Vision)
Citations: 20
Venue: International Conference on Machine Learning and Applications
Last Checked: 4 months ago

Abstract

Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either the horizontal or the vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical directions, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on the benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently higher performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on Two-stream networks or on 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of the decoupled spatio-temporal convolutions.
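
To make the decomposition described in the abstract more concrete, the following is a minimal sketch that counts weights for a regular 3D convolution and for a sequential FAST-style block. The abstract does not give kernel shapes, so the shapes used here are assumptions: with kernels written as (time, height, width), the spatial step is taken to be 1×k×k, the horizontal spatio-temporal step k×1×k, and the vertical spatio-temporal step k×k×1. The function names and the choice to keep the output channel width through all three steps are likewise hypothetical and not taken from the paper.

```python
def conv3d_params(c_in, c_out, kernel):
    """Weight count of a bias-free 3D convolution with kernel (T, H, W)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


def regular_3d_params(c_in, c_out, k=3):
    """A regular 3D convolution with a k x k x k kernel."""
    return conv3d_params(c_in, c_out, (k, k, k))


def fast_3d_params(c_in, c_out, k=3):
    """Sequential FAST-style block under ASSUMED kernel shapes (T, H, W).

    The shapes below are illustrative guesses consistent with the abstract,
    not values confirmed by the paper.
    """
    spatial = conv3d_params(c_in, c_out, (1, k, k))      # 2D spatial (H, W)
    horizontal = conv3d_params(c_out, c_out, (k, 1, k))  # time + width
    vertical = conv3d_params(c_out, c_out, (k, k, 1))    # time + height
    return spatial + horizontal + vertical


if __name__ == "__main__":
    for c_in, c_out in [(64, 64), (64, 128)]:
        print(c_in, c_out,
              regular_3d_params(c_in, c_out),
              fast_3d_params(c_in, c_out))
```

Under these assumptions the two counts coincide when the input and output channel widths are equal and k = 3, since three 9-weight kernels replace one 27-weight kernel. This is arithmetic on the assumed shapes only and says nothing about the memory figures reported in the paper.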