Context-Robust Knowledge Editing for Language Models
Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge itself, without any preceding context. In real-world applications, however, preceding contexts often trigger the recall of the original knowledge and undermine the intended edit. To address this issue, we develop CHED, a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that current KE methods often fail when a preceding context is present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing the context-induced variance in the model's hidden states for the edited knowledge. CoRE not only improves the editing success rate when a preceding context is present but also preserves the overall capabilities of the model. We also provide an in-depth analysis of the differing impacts of preceding contexts introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
Large language models (LLMs) exhibit emergent capabilities by absorbing extensive knowledge during pretraining. However, some of this knowledge may become outdated or require correction [1, 2]. To address this, knowledge editing modifies a small subset of model parameters so that the model generates the edited knowledge [3, 4]. Yet models edited by many existing methods often fail to recall the edited knowledge when a preceding context is present during text generation (Figure 1). In particular, words in the preceding context that are semantically related to the original knowledge tend to receive disproportionately high attention scores, disrupting recall of the edited knowledge. Moreover, there is currently no benchmark for evaluating the robustness of knowledge editing methods to such contextual interference.
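To make the failure mode concrete, the following minimal sketch probes whether an already-edited model still produces the edited object once a distracting sentence about the original knowledge is prepended. The model checkpoint, the example fact (Eiffel Tower: Paris edited to Rome), and the distractor sentence are hypothetical placeholders, not artifacts from the paper.

```python
# Minimal sketch of context-interference probing (illustrative only).
# Assumes an already-edited Hugging Face causal LM; the model name,
# the example fact, and the distractor sentence are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # placeholder for an edited checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

edit_prompt = "The Eiffel Tower is located in"   # edited fact: Paris -> Rome
edited_object = "Rome"
distractor = "The Louvre and the Seine are famous landmarks of France. "

def completes_with(prompt: str, target: str, max_new_tokens: int = 5) -> bool:
    """Return True if greedy decoding of `prompt` begins with `target`."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return completion.strip().startswith(target)

# An edited model may succeed on the bare prompt but revert to the
# original knowledge when the distracting context precedes it.
print("no context :", completes_with(edit_prompt, edited_object))
print("with context:", completes_with(distractor + edit_prompt, edited_object))
```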
To address these limitations, we introduce CHED (Contextual Hop Editing Dataset), a new benchmark for evaluating the contextual robustness of knowledge editing methods. In CHED, each piece of edited knowledge is accompanied by several context texts containing words related to the original knowledge; such contexts may interfere with the model's ability to recall the edited knowledge, causing it to revert to the original knowledge. We further propose a new knowledge editing method, CoRE (Context-Robust Editing), which improves contextual robustness by prepending prefix contexts during editing and minimizing the variance of the model's hidden states across these contexts (Figure 2). This simple regularization ensures that only the necessary parameter modifications are applied and prevents overfitting to any single context. In our evaluation, CHED effectively reveals the vulnerability of many knowledge editing methods to preceding context, whereas CoRE demonstrates strong contextual robustness. We also find that models are more easily distracted when the prefix context is provided as a user utterance (as in a chat setting) rather than as the model's own utterance. Further analysis of the model's attention patterns shows that CoRE reduces attention to distracting words in the preceding context and increases attention to words that facilitate recall of the edited knowledge. This work will appear in Findings of ACL 2025.
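One way to formalize the regularization described above, under assumed notation (the paper's exact formulation may differ): given $K$ prefix contexts $c_1, \dots, c_K$ prepended to the edit prompt $p$, let $h_\ell(c_k \oplus p)$ denote the relevant hidden state at layer $\ell$ (e.g., at the subject's last token). A context-variance penalty can then be added to the base editing objective with weight $\lambda$; the symbols $\mathcal{L}_{\text{edit}}$, $\mathcal{L}$ (the set of edited layers), and $\lambda$ are assumptions for this sketch.

```latex
% Hedged sketch of a context-variance regularizer (notation assumed,
% not taken verbatim from the paper). L_edit is the base editing loss;
% lambda weights the penalty on cross-context variance of h_l.
\[
\begin{aligned}
\bar{h}_\ell &= \frac{1}{K} \sum_{k=1}^{K} h_\ell\!\left(c_k \oplus p\right), \\
\mathcal{L}_{\text{reg}} &= \sum_{\ell \in \mathcal{L}} \frac{1}{K} \sum_{k=1}^{K}
    \bigl\lVert h_\ell\!\left(c_k \oplus p\right) - \bar{h}_\ell \bigr\rVert_2^2, \\
\mathcal{L} &= \mathcal{L}_{\text{edit}} + \lambda \, \mathcal{L}_{\text{reg}}.
\end{aligned}
\]
```

Keeping the hidden states for the edited fact close to their mean across diverse prefixes discourages the edit from overfitting to any single context, which is the intuition behind the contextual robustness gains reported above.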
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2024. A survey of large language models. Preprint, arXiv:2303.18223.
Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, Singapore. Association for Computational Linguistics.
Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad, and Jun Wang. 2023. How do large language models capture the ever-changing world knowledge? A review of recent advances. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8289–8311, Singapore. Association for Computational Linguistics.