A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference

June 26, 2023 · Entered Twilight · 🏛 arXiv.org

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: .gitignore, README.md, beam.py, configs, data.py, encoder, ensemble.py, ensemble_b.py, eval_for_loveu_cvpr2022.py, inference.py, model.py, pretrain, sh, train.py

Authors: Chao Zhang, Shiwei Wu, Sirui Zhao, Tong Xu, Enhong Chen
arXiv ID: 2306.14412
Category: cs.CV (Computer Vision)
Cross-listed: cs.MM
Citations: 0
Venue: arXiv.org
Repository: https://github.com/zcfinal/LOVEU-CVPR23-AQTC (⭐ 1)
Last Checked: 3 months ago

Abstract
Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a groundbreaking scenario in which AI assistants learn from instructional videos to give users step-by-step guidance on operating devices. In this paper, we present a solution that enhances video alignment to improve multi-step inference. Specifically, we first use VideoCLIP to generate video-script alignment features. We then ground the question-relevant content in the instructional videos and reweight the multimodal context to emphasize prominent features. Finally, we adopt a GRU to conduct multi-step inference. Through comprehensive experiments, we demonstrate the effectiveness and superiority of our method, which secured 2nd place in the CVPR'2023 AQTC challenge. Our code is available at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.
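
The abstract's pipeline (ground question-relevant content, reweight the context, then run recurrent multi-step inference) can be illustrated with a toy sketch. This is not the authors' implementation: the names `ground`, `TinyGRU`, and `infer` are hypothetical, the small hand-written vectors stand in for VideoCLIP features, the GRU weights are random and untrained, and choosing the answer by dot product with the hidden state is an assumed scoring rule.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ground(question, segments, temperature=0.1):
    """Reweight segments by similarity to the question (assumed scheme).

    Returns the attention weights and the weighted-sum context vector.
    """
    weights = softmax([cosine(question, s) / temperature for s in segments])
    context = [sum(w * s[i] for w, s in zip(weights, segments))
               for i in range(len(question))]
    return weights, context

class TinyGRU:
    """A single GRU cell without biases; weights are random, not trained."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)]
                    for _ in range(dim)]
        self.Wz, self.Wr, self.Wh = mat(), mat(), mat()

    def step(self, x, h):
        xh = x + h  # concatenate input and previous hidden state
        z = [sigmoid(dot(row, xh)) for row in self.Wz]  # update gate
        r = [sigmoid(dot(row, xh)) for row in self.Wr]  # reset gate
        gated = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(row, gated)) for row in self.Wh]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def infer(context, steps, gru):
    """At each step pick the candidate closest to the state, then update it."""
    h = list(context)
    chosen = []
    for candidates in steps:
        scores = [dot(h, c) for c in candidates]
        best = scores.index(max(scores))
        chosen.append(best)
        h = gru.step(list(candidates[best]), h)
    return chosen

# Toy data: one question vector, three video/script segment vectors,
# and two inference steps with two candidate answers each.
question = [1.0, 0.0, 0.0]
segments = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, context = ground(question, segments)
steps = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
predictions = infer(context, steps, TinyGRU(dim=3))
```

In this sketch the hidden state carries the choice made at one step into the next, which is the role the abstract assigns to the GRU. A trained model would learn the gate weights and the scoring function rather than using random values.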

📜 Similar Papers

In the same crypt: Computer Vision

🌅 Old Age

Fast R-CNN

Ross Girshick

cs.CV ๐Ÿ› ICCV ๐Ÿ“š 27.7K cites 11 years ago