ActionVLAD: Learning spatio-temporal aggregation for action classification

April 10, 2017 · Entered Twilight · 🏛 Computer Vision and Pattern Recognition

🌅 TWILIGHT: Old Age
Predates the code-sharing era, a pioneer of its time

"No code URL or promise found in abstract"
"Derived repo from GitHub Pages (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, LICENSE, LICENSE-tensorflow-models, README.md, combine_streams.py, convert_first_layer_for_flow.py, data, datasets, demo, deployment, docker, eval, eval_image_classifier.py, experiments, get_models.sh, nets, preprocessing, restore, train_image_classifier.py, vlad_utils

Authors: Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
arXiv ID: 1704.02895
Category: cs.CV (Computer Vision)
Citations: 463
Venue: Computer Vision and Pattern Recognition
Repository: https://github.com/rohitgirdhar/ActionVLAD (⭐ 219)
Last Checked: 1 month ago
Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
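
To make the pooling idea in the abstract concrete, here is a minimal NumPy sketch of VLAD-style soft-assignment aggregation applied jointly over all frames and spatial positions, with one descriptor per stream. It is an illustration under stated assumptions, not the code from the linked repository: the function name `soft_assign_vlad`, the `alpha` sharpness parameter, the fixed random cluster centers, and the toy feature shapes are all hypothetical, and in the trained model the centers and assignments would be learned end to end rather than fixed.

```python
import numpy as np

def soft_assign_vlad(features, centers, alpha=1.0):
    """Pool (T, H, W, D) local features into one (K*D,) descriptor.

    Hypothetical helper: soft-assigns every local feature to K centers
    and sums the residuals over time AND space in a single pass.
    """
    T, H, W, D = features.shape
    x = features.reshape(-1, D)                      # (N, D) with N = T*H*W

    # Squared distance from every local feature to every center: (N, K)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    # Softmax over centers (closer center -> larger weight)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                # (N, K)

    # V[k] = sum_n a[n, k] * (x[n] - centers[k])
    v = a.T @ x - a.sum(axis=0)[:, None] * centers   # (K, D)

    # Per-center (intra) normalization, then global L2 normalization
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

# Toy inputs (assumed shapes): 4 frames, 3x3 feature map, 8 channels, 5 centers
rng = np.random.default_rng(0)
rgb_feats = rng.normal(size=(4, 3, 3, 8))    # appearance-stream features
flow_feats = rng.normal(size=(4, 3, 3, 8))   # motion-stream features
rgb_centers = rng.normal(size=(5, 8))
flow_centers = rng.normal(size=(5, 8))

# Each stream is aggregated into its own representation, as in finding (ii)
rgb_desc = soft_assign_vlad(rgb_feats, rgb_centers)
flow_desc = soft_assign_vlad(flow_feats, flow_centers)
print(rgb_desc.shape, flow_desc.shape)
```

Each descriptor would then feed its own classifier, with the per-stream scores combined afterwards; the combination weights are not stated in the abstract and are left out here.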

📜 Similar Papers

In the same crypt: Computer Vision