Tool-Augmented Language Models (TALMs) fulfill user requests by selecting and calling APIs to resolve complex queries. While recent benchmarks have focused on enhancing tool selection and reasoning in single-turn settings, they fall short of representing real-world multi-turn interactions, where TALMs must ask users for missing information or clarify vague queries. Furthermore, existing multi-turn studies tend to simulate only simple exchanges, lacking the richness of real-world dialogue dynamics. This gap makes it difficult to evaluate TALMs' abilities in scenarios that require diverse interactions with users, such as identifying missing inputs, navigating failures, or clarifying vague requests.

To fill this gap, we introduce ToolDial, a dataset of 11,111 multi-turn dialogues built around APIs from RapidAPI. ToolDial is crafted to capture the complexity of real conversations by simulating scenarios where the TALM requests missing inputs, reasons about dialogue flow, and decides when to invoke additional APIs. We define 16 distinct user and system actions and construct 23 plausible action sequences to reflect real-world interaction patterns. A graph-based API chaining framework ensures that one API's output can serve as another's input, enabling more realistic multi-step API usage in the dialogues.
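The graph-based chaining idea can be sketched as follows: an edge is drawn from API A to API B whenever some output parameter of A matches an input parameter of B, so that B can be called with A's result. This is a minimal illustration, not the actual ToolDial implementation; the API names and schema fields below are hypothetical.

```python
# Illustrative sketch of graph-based API chaining: connect API A -> API B
# when an output of A can supply an input of B. The API specs here are
# invented examples, not real RapidAPI endpoints.
from itertools import permutations

# Hypothetical API specs: each API lists its input and output parameter names.
apis = {
    "search_song": {"inputs": {"song_title"}, "outputs": {"song_id"}},
    "get_lyrics": {"inputs": {"song_id"}, "outputs": {"lyrics"}},
    "translate": {"inputs": {"lyrics"}, "outputs": {"translated_lyrics"}},
}

def build_chain_graph(apis):
    """Return edges A -> (B, shared_params) whenever some output of A
    matches an input of B, i.e. A's result can feed B's call."""
    edges = {}
    for a, b in permutations(apis, 2):
        shared = apis[a]["outputs"] & apis[b]["inputs"]
        if shared:
            edges.setdefault(a, []).append((b, sorted(shared)))
    return edges

graph = build_chain_graph(apis)
# Chains such as search_song -> get_lyrics -> translate emerge from the graph,
# which is the kind of multi-step API usage the dialogues simulate.
```

Walking paths in such a graph yields plausible multi-API scenarios (e.g., resolve a song title to an ID, fetch its lyrics, then translate them) around which multi-turn dialogues can be generated.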
Our experimental results show that modern language models struggle to extract correct input parameters from dialogue history and to choose plausible next actions as the conversation unfolds. In particular, they often fail to select clarification actions when faced with vague requests, and they tend to skip asking the user for input parameter values, instead rushing to call the API with hallucinated values. We also found that training language models on ToolDial significantly alleviates both issues.


Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo

https://openreview.net/forum?id=J1J5eGJsKZ
