Text-guided image editing—modifying images based on natural language prompts—has rapidly evolved with the rise of diffusion-based generative models. These models are now capable of performing sophisticated edits such as object insertion, removal, and attribute modification. However, evaluating their performance remains a major challenge. Existing benchmarks often rely on subjective user studies or small datasets, limiting reproducibility and consistency.

To address these limitations, the authors present HATIE—a large-scale, fully-automated, and perceptually grounded benchmark for fair and scalable evaluation. A central contribution of HATIE lies in its end-to-end automation pipeline, which spans three core components: (1) a novel filtering process to select editable images and objects, (2) a data-driven and context-aware editing query generation system, and (3) a human-aligned evaluation framework combining multiple perceptual metrics.

The automated filtering pipeline identifies suitable image regions for editing by excluding objects that are too small, heavily occluded, ambiguous, or undetectable by segmentation models. It also removes images containing duplicate object classes, so that every editing target can be referred to unambiguously. The result is a clean, high-quality dataset of 18,226 images with 19,933 editable objects spanning 76 categories, ready for structured editing.
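To make these filtering criteria concrete, here is a minimal Python sketch of how such a filter might look. The field names, thresholds, and segmentation interface are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the object-filtering step described above.
# Thresholds and fields are assumptions, not HATIE's actual values.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    category: str          # e.g. "dog", "car"
    area_ratio: float      # object area / image area
    occlusion: float       # fraction of the object hidden by others (0..1)
    has_mask: bool         # whether the segmentation model produced a mask

MIN_AREA_RATIO = 0.02      # assumed: drop objects that are too small
MAX_OCCLUSION = 0.5        # assumed: drop heavily occluded objects

def editable_objects(objects: list[DetectedObject]) -> list[DetectedObject]:
    """Keep only objects that are large enough, visible, and segmentable,
    and whose category appears exactly once in the image (no ambiguity)."""
    counts: dict[str, int] = {}
    for obj in objects:
        counts[obj.category] = counts.get(obj.category, 0) + 1

    return [
        obj for obj in objects
        if obj.has_mask
        and obj.area_ratio >= MIN_AREA_RATIO
        and obj.occlusion <= MAX_OCCLUSION
        and counts[obj.category] == 1   # duplicate classes are excluded
    ]
```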

Building on this, HATIE generates 49,840 editing queries that are not only diverse but also grounded in visual plausibility. The system automatically synthesizes instructions using template-based language generation, while performing feasibility checks based on object co-occurrence patterns and spatial relationships. For instance, it avoids implausible edits like placing a car on a bookshelf, ensuring each query is realistic and relevant to the image context.
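The paper describes template-based instruction synthesis gated by plausibility checks; the sketch below illustrates that idea for object-insertion queries. The template, the co-occurrence table, and the threshold are hypothetical placeholders rather than HATIE's actual statistics.

```python
# Illustrative sketch of template-based query generation with a
# co-occurrence plausibility check. All values are placeholders.
ADD_TEMPLATE = "Add a {new_object} {relation} the {anchor_object}."

# Assumed: co-occurrence statistics mined from the dataset,
# e.g. how often new_object appears together with anchor_object.
CO_OCCURRENCE = {
    ("laptop", "desk"): 0.62,
    ("car", "bookshelf"): 0.001,
}
MIN_CO_OCCURRENCE = 0.05   # assumed plausibility threshold

def generate_add_query(new_object: str, anchor_object: str,
                       relation: str = "on") -> str | None:
    """Return an insertion instruction only if the object pair is plausible."""
    if CO_OCCURRENCE.get((new_object, anchor_object), 0.0) < MIN_CO_OCCURRENCE:
        return None   # rejects implausible edits, e.g. a car on a bookshelf
    return ADD_TEMPLATE.format(new_object=new_object,
                               relation=relation,
                               anchor_object=anchor_object)

print(generate_add_query("laptop", "desk"))    # -> "Add a laptop on the desk."
print(generate_add_query("car", "bookshelf"))  # -> None (implausible)
```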

To evaluate the results, HATIE introduces a multi-dimensional scoring system that captures five key aspects of editing: Object Fidelity, Background Fidelity, Object Consistency, Background Consistency, and Image Quality. Each metric is computed using a combination of CLIP similarity, LPIPS distance, DINO features, L2 distance, FID, and segmentation analysis. Crucially, these scores are aggregated using weights learned from a user study, ensuring the final Total Score aligns closely with human perceptual judgment.
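As a rough illustration of the aggregation step, the sketch below combines the five sub-scores into a Total Score via a weighted sum. The weight values here are placeholders; in HATIE the weights are learned from user-study data, which are not reproduced in this summary.

```python
# Minimal sketch of aggregating the five sub-scores into a Total Score.
# The metric names follow the description above; the weights are
# illustrative, not the values learned from the paper's user study.
SUB_SCORES = ["object_fidelity", "background_fidelity",
              "object_consistency", "background_consistency", "image_quality"]

WEIGHTS = {                      # assumed example weights
    "object_fidelity": 0.30,
    "background_fidelity": 0.20,
    "object_consistency": 0.20,
    "background_consistency": 0.15,
    "image_quality": 0.15,
}

def total_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five perceptual sub-scores (each assumed in [0, 1])."""
    return sum(WEIGHTS[k] * scores[k] for k in SUB_SCORES)

example = {"object_fidelity": 0.82, "background_fidelity": 0.91,
           "object_consistency": 0.77, "background_consistency": 0.88,
           "image_quality": 0.85}
print(f"Total Score: {total_score(example):.3f}")
```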

Extensive experiments demonstrate that HATIE enables precise, fine-grained numerical evaluation of image editing models. It provides a single total score for overall model comparison while also reporting detailed scores for each sub-aspect, allowing researchers to analyze model behavior from multiple angles. The benchmark is sensitive enough to detect even subtle differences in output quality, with small statistical error margins, making it suitable for rigorous model comparison and performance validation.

In summary, HATIE offers a comprehensive, automated, and human-aligned evaluation framework for text-guided image editing. By integrating novel object filtering, context-aware query generation, and perceptually grounded evaluation, HATIE establishes a new standard for scalable benchmarking in the field. The benchmark is publicly available and is poised to accelerate research in controllable and trustworthy image editing models.


Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, Joonseok Lee

https://openaccess.thecvf.com/content/CVPR2025/html/Ryu_Towards_Scalable_Human-aligned_Benchmark_for_Text-guided_Image_Editing_CVPR_2025_paper.html