Neptune: The Long Orbit to Benchmarking Long Video Understanding

December 12, 2024 ยท Declared Dead ยท ๐Ÿ› arXiv.org

๐Ÿฆด CAUSE OF DEATH: Skeleton Repo
Boilerplate only, no real code

Repo contents: .gitignore, CONTRIBUTING.md, LICENSE, README.md, combined_lengths.png, examples.png, minerva_examples.png, stats.png

Authors Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand arXiv ID 2412.09582 Category cs.LG: Machine Learning Cross-listed cs.AI, cs.CV Citations 16 Venue arXiv.org Repository https://github.com/google-deepmind/neptune โญ 83 Last Checked 2 months ago
Abstract
We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Machine Learning

Died the same way โ€” ๐Ÿฆด Skeleton Repo

R.I.P. ๐Ÿฆด Skeleton Repo

Neural Style Transfer: A Review

Yongcheng Jing, Yezhou Yang, ... (+4 more)

cs.CV ๐Ÿ› IEEE TVCG ๐Ÿ“š 828 cites 8 years ago