Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning
December 02, 2023 · Entered Twilight · International Conference on Machine Vision
Repo contents: .gitignore, ILSVRC2012_val_info, ILSVRC2012_validation_ground_truth.txt, LICENSE, README.md, ViT, ViT_feature_modeling, ViT_orig, author_README.md, author_RUN_AMAX.md, author_RUN_DGX.md, create_ilsvrc2012_val_folders.py, docker, files_to_replace, freezeout, full_pretrain_out_freezeout, image, untar_val.sh, vis_tool
Authors
Utku Mert Topcuoglu, Erdem Akagündüz
arXiv ID
2312.02194
Category
cs.CV: Computer Vision
Citations
2
Venue
International Conference on Machine Vision
Repository
https://github.com/utkutpcgl/ViTFreeze
★ 5
Last Checked
2 months ago
Abstract
In this paper, we present an innovative approach to self-supervised learning for Vision Transformers (ViTs), integrating local masked image modeling with progressive layer freezing. This method focuses on enhancing the efficiency and speed of initial layer training in ViTs. By systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining or improving learning capabilities. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers and enhances semantic comprehension across scales. The results demonstrate a substantial reduction in training time (~12.5%) with a minimal impact on model accuracy (decrease in top-1 accuracy by 0.6%). Our method achieves top-1 and top-5 accuracies of 82.6% and 96.2%, respectively, underscoring its potential in scenarios where computational resources and time are critical. This work marks an advancement in the field of self-supervised learning for computer vision. The implementation of our approach is available at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.
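The abstract describes freezing layers at chosen points during training, starting from the layers closest to the input. The sketch below illustrates that general idea only; it is not taken from the ViTFreeze repository. The function names, the `first_freeze_frac` parameter, and the linear spacing of freeze points are all assumptions made for illustration.

```python
def freeze_iterations(num_layers, total_iters, first_freeze_frac=0.5):
    """Return, for each layer, the iteration at which it stops training.

    Layer 0 (closest to the input) freezes first, at
    first_freeze_frac * total_iters; the last layer trains until the end.
    The linear spacing in between is an illustrative assumption, not the
    schedule used in the paper.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < first_freeze_frac <= 1.0:
        raise ValueError("first_freeze_frac must be in (0, 1]")
    if num_layers == 1:
        return [total_iters]
    step = (1.0 - first_freeze_frac) / (num_layers - 1)
    return [
        round((first_freeze_frac + i * step) * total_iters)
        for i in range(num_layers)
    ]


def trainable_layers(schedule, iteration):
    """Indices of layers that still receive updates at `iteration`."""
    return [i for i, stop in enumerate(schedule) if iteration < stop]


# Hypothetical 4-layer model trained for 1000 iterations.
schedule = freeze_iterations(num_layers=4, total_iters=1000)
for it in (0, 700, 1000):
    print(it, trainable_layers(schedule, it))
```

In a real training loop, a layer whose index drops out of `trainable_layers` would have its gradients disabled (in PyTorch, for example, by calling `requires_grad_(False)` on that module), so the backward pass no longer has to reach it. That is where the compute saving comes from.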
Similar Papers
In the same crypt: Computer Vision
Old Age
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted