Masked Autoencoder (MAE) is a self-supervised approach for representation learning that is widely applicable to a variety of downstream tasks in computer vision. Despite its success, exactly what MAE learns, and how, has not been fully uncovered. In this paper, through an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Building on this understanding, we propose the self-guided masked autoencoder, which internally generates informed masks by exploiting its own progress in patch clustering, replacing the naive random masking of the vanilla MAE. Our approach significantly boosts the learning process without relying on any external models or supplementary information, keeping the benefit of MAE's self-supervised nature intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the proposed method.
We introduce the Self-Guided Masked Autoencoder (SelfMAE), a novel variant of the Masked Autoencoder (MAE) framework for self-supervised visual representation learning. Through an in-depth analysis, we discover that MAE inherently performs pattern-based patch-level clustering from the very early stages of pretraining. Using similarity and attention metrics, we show that the MAE encoder learns to group patches by visual characteristics such as texture and color, even without external labels or supervision. This behavior produces a highly structured embedding space, confirmed quantitatively through increased feature and similarity variance relative to baselines such as MoCo and ViT. Our analysis further reveals that the decoder exploits high-level shared representations from the encoder to reconstruct masked patches, validating the role of patch clustering in MAE's learning dynamics.
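As a rough illustration of how such patch-level clustering can be probed, the sketch below computes pairwise cosine similarities between encoder patch embeddings and summarizes their spread; the tensor shapes and the variance-based summary are illustrative assumptions rather than the paper's exact metric.

```python
# Minimal sketch (not the authors' code): probe patch-level clustering by
# measuring the spread of pairwise cosine similarities among patch embeddings.
import torch
import torch.nn.functional as F

def patch_similarity_stats(patch_embeddings: torch.Tensor):
    """patch_embeddings: (num_patches, dim) encoder outputs for one image (assumed shape)."""
    z = F.normalize(patch_embeddings, dim=-1)          # unit-norm patch features
    sim = z @ z.t()                                    # (N, N) cosine-similarity matrix
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    # Higher variance of off-diagonal similarities indicates that patches are
    # being pulled into distinct groups rather than spread uniformly.
    return off_diag.mean().item(), off_diag.var().item()

# Random features stand in for real encoder outputs (196 patches of a ViT-B/16).
mean_sim, var_sim = patch_similarity_stats(torch.randn(196, 768))
print(f"mean similarity: {mean_sim:.3f}, similarity variance: {var_sim:.4f}")
```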
Building on these insights, we propose a self-guided informed masking strategy that replaces MAE's random masking with masks derived from its own evolving patch-clustering ability. We define a novel metric, the exploitation rate, to detect the point in training at which mask tokens begin to meaningfully capture global image structure. From that point on, we apply informed masking: we bi-partition the image into semantically distinct patch clusters using a graph-based algorithm (Normalized Cut) and intensively mask one cluster, typically the foreground object, to accelerate learning of less distinguishable patterns. The method retains a minimal set of "hint" tokens to keep reconstruction viable and is fully self-contained, requiring no external models or annotations. This enables MAE to focus its computational resources on learning finer semantic details.
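The following is a hedged sketch of the informed-masking idea under stated assumptions: a spectral approximation to Normalized Cut bi-partitions the patches of an affinity matrix, and the smaller cluster is masked while a few hint patches stay visible. The affinity source, the median split, the foreground heuristic, and the hint count are illustrative choices, not the paper's exact procedure.

```python
# Illustrative sketch of self-guided informed masking via a spectral
# Normalized-Cut bipartition; parameters and heuristics are assumptions.
import torch
import torch.nn.functional as F

def ncut_bipartition(affinity: torch.Tensor) -> torch.Tensor:
    """affinity: (N, N) symmetric, non-negative patch affinities. Returns a boolean label per patch."""
    d = affinity.sum(dim=1)
    d_inv_sqrt = torch.diag(d.clamp(min=1e-8).rsqrt())
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = torch.eye(affinity.size(0)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    _, eigvecs = torch.linalg.eigh(lap)                # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                            # second-smallest eigenvector
    return fiedler > fiedler.median()                  # split patches into two clusters

def informed_mask(affinity: torch.Tensor, mask_ratio: float = 0.75, num_hints: int = 4) -> torch.Tensor:
    """Returns a boolean mask over patches (True = masked), concentrated on one cluster."""
    labels = ncut_bipartition(affinity)
    # Heuristic: treat the smaller cluster as the foreground-like region.
    target = labels if labels.sum() <= (~labels).sum() else ~labels
    masked = target.clone()
    hint_idx = torch.nonzero(target).flatten()[:num_hints]
    masked[hint_idx] = False                           # leave a few hint patches visible
    # Top up with random background patches to reach the overall mask ratio.
    budget = int(mask_ratio * affinity.size(0)) - int(masked.sum())
    if budget > 0:
        background = torch.nonzero(~target & ~masked).flatten()
        pick = background[torch.randperm(background.numel())[:budget]]
        masked[pick] = True
    return masked

# Example: affinity built from (stand-in) normalized patch features.
feats = F.normalize(torch.randn(196, 768), dim=-1)
affinity = (feats @ feats.t()).clamp(min=0)
mask = informed_mask(affinity)
print("masked patches:", int(mask.sum()), "of", mask.numel())
```

In the actual method, the affinity would come from the model's own patch representations once the exploitation rate signals that mask tokens carry global structure; the random stand-in features above only demonstrate the masking mechanics.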
Our approach consistently improves performance across diverse downstream tasks, including image classification (ImageNet-1K, CIFAR-100, iNaturalist), object detection (COCO), and semantic segmentation (ADE20K), outperforming both the original MAE and other informed masking approaches such as AMT. We show that SelfMAE broadens attention spread, increases the utilization of high-frequency visual information, and diversifies the semantic representation of mask tokens. Extensive ablations confirm the importance of key design choices, including the informed masking schedule, attention-layer selection, and the hint strategy. Overall, SelfMAE significantly accelerates and strengthens MAE's learning process while retaining the advantages of full self-supervision, offering both theoretical insights and practical benefits for vision transformer training.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
L. Kong, M. Q. Ma, G. Chen, E. P. Xing, Y. Chi, L.-P. Morency, and K. Zhang. Understanding masked autoencoders via hierarchical latent variable models. In CVPR, 2023.
N. Park, W. Kim, B. Heo, T. Kim, and S. Yun. What do self-supervised vision transformers learn? In ICLR, 2023.