Masked Autoencoder (MAE) is a self-supervised approach for representation learning that is widely applicable to a variety of downstream tasks in computer vision. Despite its success, exactly what MAE learns, and how, has not been fully uncovered. In this paper, through an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Building on this understanding, we propose the self-guided masked autoencoder, which internally generates informed masks by exploiting its own progress in patch clustering, replacing the naive random masking of the vanilla MAE. Our approach significantly boosts the learning process without relying on any external models or supplementary information, keeping the benefit of MAE's self-supervised nature intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the proposed method.
We introduce the Self-Guided Masked Autoencoder (SelfMAE), a novel variant of the Masked Autoencoder (MAE) framework for self-supervised visual representation learning. Through an in-depth analysis, we discover that MAE inherently performs pattern-based patch-level clustering from the very early stages of pretraining. Using similarity and attention metrics, we show that the MAE encoder learns to group patches by visual characteristics such as texture and color, even without external labels or supervision. This behavior produces a highly structured embedding space, confirmed quantitatively through increased feature and similarity variance relative to baselines such as MoCo and ViT. Our analysis further reveals that the decoder exploits high-level shared representations from the encoder to reconstruct masked patches, validating the role of patch clustering in MAE's learning dynamics.
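As a rough illustration of how such patch-level clustering can be probed, the sketch below computes pairwise cosine similarities between encoder patch embeddings and summarizes their spread; the tensor shapes and the variance-based summary are illustrative assumptions rather than the paper's exact metric.

```python
# Minimal sketch (not the authors' code): probe patch-level clustering by
# measuring the spread of pairwise cosine similarities among patch embeddings.
import torch
import torch.nn.functional as F

def patch_similarity_stats(patch_embeddings: torch.Tensor):
    """patch_embeddings: (num_patches, dim) encoder outputs for one image (assumed shape)."""
    z = F.normalize(patch_embeddings, dim=-1)          # unit-norm patch features
    sim = z @ z.t()                                    # (N, N) cosine-similarity matrix
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    # Higher variance of off-diagonal similarities indicates that patches are
    # being pulled into distinct groups rather than spread uniformly.
    return off_diag.mean().item(), off_diag.var().item()

# Random features stand in for real encoder outputs (196 patches of a ViT-B/16).
mean_sim, var_sim = patch_similarity_stats(torch.randn(196, 768))
print(f"mean similarity: {mean_sim:.3f}, similarity variance: {var_sim:.4f}")
```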
Building on these insights, we propose a self-guided informed masking strategy that replaces MAE's random masking with masks derived from its own evolving patch-clustering ability. We define a novel metric, the exploitation rate, to detect the point in training at which mask tokens begin to meaningfully capture global image structure. From that point on, we apply informed masking: we bi-partition the image into semantically distinct patch clusters using a graph-based algorithm (Normalized Cut) and intensively mask one cluster, typically the foreground object, to accelerate learning of less distinguishable patterns. The method retains a minimal set of "hint" tokens to keep reconstruction viable and is fully self-contained, requiring no external models or annotations. This enables MAE to focus its computational resources on learning finer semantic details.
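The following is a hedged sketch of the informed-masking idea under stated assumptions: a spectral approximation to Normalized Cut bi-partitions the patches of an affinity matrix, and the smaller cluster is masked while a few hint patches stay visible. The affinity source, the median split, the foreground heuristic, and the hint count are illustrative choices, not the paper's exact procedure.

```python
# Illustrative sketch of self-guided informed masking via a spectral
# Normalized-Cut bipartition; parameters and heuristics are assumptions.
import torch
import torch.nn.functional as F

def ncut_bipartition(affinity: torch.Tensor) -> torch.Tensor:
    """affinity: (N, N) symmetric, non-negative patch affinities. Returns a boolean label per patch."""
    d = affinity.sum(dim=1)
    d_inv_sqrt = torch.diag(d.clamp(min=1e-8).rsqrt())
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = torch.eye(affinity.size(0)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    _, eigvecs = torch.linalg.eigh(lap)                # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                            # second-smallest eigenvector
    return fiedler > fiedler.median()                  # split patches into two clusters

def informed_mask(affinity: torch.Tensor, mask_ratio: float = 0.75, num_hints: int = 4) -> torch.Tensor:
    """Returns a boolean mask over patches (True = masked), concentrated on one cluster."""
    labels = ncut_bipartition(affinity)
    # Heuristic: treat the smaller cluster as the foreground-like region.
    target = labels if labels.sum() <= (~labels).sum() else ~labels
    masked = target.clone()
    hint_idx = torch.nonzero(target).flatten()[:num_hints]
    masked[hint_idx] = False                           # leave a few hint patches visible
    # Top up with random background patches to reach the overall mask ratio.
    budget = int(mask_ratio * affinity.size(0)) - int(masked.sum())
    if budget > 0:
        background = torch.nonzero(~target & ~masked).flatten()
        pick = background[torch.randperm(background.numel())[:budget]]
        masked[pick] = True
    return masked

# Example: affinity built from (stand-in) normalized patch features.
feats = F.normalize(torch.randn(196, 768), dim=-1)
affinity = (feats @ feats.t()).clamp(min=0)
mask = informed_mask(affinity)
print("masked patches:", int(mask.sum()), "of", mask.numel())
```

In the actual method, the affinity would come from the model's own patch representations once the exploitation rate signals that mask tokens carry global structure; the random stand-in features above only demonstrate the masking mechanics.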
Our approach consistently improves performance across diverse downstream tasks, including image classification (ImageNet-1K, CIFAR-100, iNaturalist), object detection (COCO), and semantic segmentation (ADE20K), outperforming both the original MAE and other informed masking approaches such as AMT. We show that SelfMAE broadens attention spread, increases the utilization of high-frequency visual information, and diversifies the semantic representation of mask tokens. Extensive ablations confirm the importance of key design choices, including the informed masking schedule, attention-layer selection, and the hint strategy. Overall, SelfMAE significantly accelerates and strengthens MAE's learning process while retaining the advantages of full self-supervision, offering both theoretical insights and practical benefits for vision transformer training.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
L. Kong, M. Q. Ma, G. Chen, E. P. Xing, Y. Chi, L.-P. Morency, and K. Zhang. Understanding masked autoencoders via hierarchical latent variable models. In CVPR, 2023.
N. Park, W. Kim, B. Heo, T. Kim, and S. Yun. What do self-supervised vision transformers learn? In ICLR, 2023.