Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
The importance of benchmarks for assessing the values of language models has grown with the increasing need for more authentic, human-aligned responses. However, existing benchmarks rely on annotations that are vulnerable to value-related biases, and the scenarios they test often diverge from real-world contexts. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs' value orientations with two key characteristics: (1) items reflect real-life user-LLM interactions, and (2) items are rated by human subjects based on how similar each item is to their own thoughts, and correlations between these ratings and the subjects' actual value scores are derived. Evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values over Tradition, Power, and Achievement. Our analysis also reveals biases in how LLMs perceive various demographic groups, deviating from real human data.
Understanding and incorporating human values into large language models (LLMs) has become increasingly important, particularly as these models are integrated into our daily lives. Researchers have developed various approaches to assess LLMs' values, using psychometric inventories [1,2,3] and large-scale benchmarks annotated by crowdworkers [4,5] or auto-labeled by LLMs [6,7,8]. However, existing methods rely on identifying values perceived in text rather than collecting assessments from the individuals who actually hold those values [6,8]. Moreover, existing works either rely heavily on standardized psychometric questionnaires or focus narrowly on safety scenarios, resulting in a significant discrepancy between the tested scenarios and the diverse real-world settings in which these models are most commonly used to generate text and express values.
To address these limitations, we adopt a more psychometrically rigorous approach and introduce Value Portrait, a more reliable benchmark for understanding LLMs' value orientations across diverse real-world scenarios. Value Portrait has two key characteristics. First, each item is a query-response pair reflecting a realistic interaction, sourced from both human-LLM conversations (ShareGPT, LMSYS) and human-human advisory contexts (Reddit, Dear Abby). Second, each query-response pair is tagged with the values it strongly correlates with. To establish these correlations, a large number of human annotators rated each query-response pair based on how similar the response was to their own thoughts, and we measured the correlations between these ratings and the annotators' actual scores on each psychological dimension (values and personality traits) obtained through official questionnaires.
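To make the tagging step concrete, here is a minimal sketch assuming Likert similarity ratings and Schwartz value scores from a validated questionnaire; the function name, the Spearman statistic, and the 0.3 threshold are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# A minimal sketch of the item-tagging step. The Spearman statistic and the
# 0.3 threshold are illustrative assumptions, not the paper's exact choices.
import numpy as np
from scipy.stats import spearmanr

SCHWARTZ_VALUES = [
    "Self-Direction", "Stimulation", "Hedonism", "Achievement", "Power",
    "Security", "Conformity", "Tradition", "Benevolence", "Universalism",
]

def tag_items_with_values(ratings, value_scores, threshold=0.3):
    """Tag each item with the value dimensions its ratings track.

    ratings:      (n_annotators, n_items) Likert similarity ratings
                  ("how similar is this response to my own thoughts?").
    value_scores: (n_annotators, 10) the same annotators' Schwartz value
                  scores from an official questionnaire (e.g., the PVQ).
    Returns, for each item, the values whose scores correlate with its ratings.
    """
    ratings, value_scores = np.asarray(ratings), np.asarray(value_scores)
    tags = []
    for i in range(ratings.shape[1]):  # iterate over items
        item_tags = []
        for v, name in enumerate(SCHWARTZ_VALUES):
            rho, p = spearmanr(ratings[:, i], value_scores[:, v])
            if p < 0.05 and rho >= threshold:  # keep only strong correlations
                item_tags.append((name, round(rho, 2)))
        tags.append(item_tags)
    return tags
```

The same loop extends naturally to personality dimensions by appending their scores as extra columns of `value_scores`.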
Evaluating 44 LLMs with our benchmark, we found that LLMs prioritize Benevolence, Security, and Self-Direction while placing less emphasis on Tradition, Power, and Achievement. We also found that reasoning models exhibit consistently higher Benevolence scores across multiple model families, and that larger models exhibit greater variability across value dimensions within the same model family, showing distinct preferences for different values. Furthermore, LLMs produce biased predictions about the values associated with various demographic groups, deviating from actual human data: for example, GPT-4o perceives males as having higher scores on Conformity and Tradition than females, whereas human data shows minimal gender differences. Overall, our work highlights the importance of grounding value assessments in human data and provides a more psychometrically and ecologically valid framework for evaluating LLMs' values.
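For illustration, a hedged sketch of how a benchmark of this kind can be administered to a model: present each tagged query-response item, elicit a 1-6 similarity rating, and average the ratings over each item's tagged value dimensions. The `ask_model` callable and the prompt wording are hypothetical stand-ins, not the paper's exact protocol.

```python
# A hedged sketch of administering tagged items to a model. `ask_model` and
# the prompt wording are hypothetical stand-ins for the actual protocol.
from collections import defaultdict
from statistics import mean

def score_model(items, ask_model):
    """items: dicts with 'query', 'response', and 'values' (tagged dimensions).
    ask_model: callable mapping a prompt string to an int rating in [1, 6]."""
    per_value = defaultdict(list)
    for item in items:
        prompt = (
            f"Query: {item['query']}\n"
            f"Response: {item['response']}\n"
            "On a scale from 1 (not like me at all) to 6 (very much like me), "
            "how similar is this response to your own thoughts? "
            "Answer with a single number."
        )
        rating = ask_model(prompt)
        for value in item["values"]:
            per_value[value].append(rating)  # an item may carry several tags
    return {value: mean(ratings) for value, ratings in per_value.items()}
```

Prefixing the prompt with a persona instruction (e.g., "Answer as a 30-year-old woman"; the wording is again an assumption) would support the kind of demographic comparison described above.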
This work will be published at ACL 2025.
Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, Yohan Jo
Preprint: https://arxiv.org/abs/2505.01015
References
[1] Marilù Miotto, Nicola Rossberg, and Bennett Kleinberg. 2022. Who is GPT-3? An exploration of personality, values and demographics. In Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS), pages 218–227, Abu Dhabi, UAE. Association for Computational Linguistics.
[2] Dorith Hadar Shoval, Kfir Asraf, Yonathan Mizrachi, Yuval Haber, and Zohar Elyoseph. 2024. Assessing the alignment of large language models with human values for mental health integration: Cross-sectional study using Schwartz's theory of basic values. JMIR Mental Health.
[3] Jen-tse Huang, Wenxiang Jiao, Man Ho Lam, Eric John Li, Wenxuan Wang, and Michael Lyu. 2024. On the reliability of psychological scales on large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6152–6173, Miami, Florida, USA. Association for Computational Linguistics.
[4] Nailia Mirzakhmedova, Johannes Kiesel, Milad Alshomary, Maximilian Heinrich, Nicolas Handke, Xiaoni Cai, Valentin Barriere, Doratossadat Dastgheib, Omid Ghahroodi, MohammadAli SadraeiJavaheri, Ehsaneddin Asgari, Lea Kawaletz, Henning Wachsmuth, and Benno Stein. 2024. The Touché23-ValueEval dataset for identifying human values behind arguments. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16121–16134, Torino, Italia. ELRA and ICCL.
[5] Liang Qiu, Yizhou Zhao, Jinchao Li, Pan Lu, Baolin Peng, Jianfeng Gao, and Song-Chun Zhu. 2022. ValueNet: A new dataset for human value driven dialogue system. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11183–11191.
[6] Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. 2024. ValueBench: Towards comprehensively evaluating value orientations and understanding of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2015–2040, Bangkok, Thailand. Association for Computational Linguistics.
[7] Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, et al. 2024. Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics. arXiv preprint arXiv:2406.14703.
[8] Jing Yao, Xiaoyuan Yi, and Xing Xie. 2024. CLAVE: An adaptive framework for evaluating values of LLM-generated responses. In Advances in Neural Information Processing Systems, volume 37, pages 58868–58900. Curran Associates, Inc.