ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models
Tool-Augmented Language Models (TALMs) leverage external APIs to answer user queries across various domains. However, existing benchmark datasets for TALM research often feature simplistic dialogues that do not reflect real-world scenarios, such as the need for models to ask clarifying questions or proactively call additional APIs when essential information is missing. To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. ToolDial has two key characteristics. First, the dialogues incorporate 16 user and system actions to capture the rich dynamics of real-world interactions. Second, we simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information. To facilitate this process, we introduce a method for generating an API graph that represents input and output compatibility between APIs. Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement.
Tool-Augmented Language Models (TALMs) are designed to solve user requests by selecting and calling APIs to fulfill complex queries. While recent benchmarks have focused on enhancing tool selection and reasoning in single-turn settings, they fall short of representing real-world multi-turn interactions, where TALMs must ask users for missing information or clarify vague queries. Furthermore, existing multi-turn studies tend to simulate only simple exchanges, lacking the richness of real-world dialogue dynamics. This gap makes it difficult to evaluate TALMs' abilities in scenarios that require diverse interactions with users to identify missing inputs, navigate failures, or clarify vague requests.
To fill this gap, we introduce ToolDial, a dataset consisting of 11,111 multi-turn dialogues built around APIs from RapidAPI. ToolDial is crafted to capture the complexity of real conversations by simulating scenarios where the TALM requests missing inputs, reasons about dialogue flow, and decides when to invoke additional APIs. We define 16 distinct user and system actions and construct 23 plausible action sequences to reflect real-world interaction patterns. A graph-based API chaining framework is used to ensure that one API's output can serve as another's input, enabling more realistic multi-step API usage in the dialogues. Our experimental results show that modern language models struggle to extract correct input parameters from dialogue history and to choose appropriate next actions during a conversation. In particular, they often fail to select clarification actions when faced with vague requests, and they tend to skip asking the user for input parameter values, instead rushing to call the API with hallucinated values. In contrast, we find that when language models are trained on ToolDial, both of these issues are significantly alleviated.
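The API-graph idea, in which an edge connects two APIs whenever one API's output can supply another's input, can be sketched as follows. This is a minimal illustration using naive exact-name matching between output fields and input parameters; the API specs here are hypothetical, and the paper's actual compatibility method may use richer criteria than field-name equality.

```python
def build_api_graph(apis):
    """Return directed edges (producer, consumer, field) where an output
    field of one API matches a required input parameter of another."""
    edges = []
    for producer in apis:
        for consumer in apis:
            if producer["name"] == consumer["name"]:
                continue  # no self-loops
            for field in producer["outputs"]:
                if field in consumer["inputs"]:
                    edges.append((producer["name"], consumer["name"], field))
    return edges

# Hypothetical API documentation entries (illustrative, not from the dataset).
apis = [
    {"name": "search_city", "inputs": ["city_name"], "outputs": ["city_id"]},
    {"name": "get_weather", "inputs": ["city_id"], "outputs": ["forecast"]},
]

print(build_api_graph(apis))
# search_city -> get_weather is chainable via the shared "city_id" field.
```

A dialogue simulator can traverse such edges when the user cannot supply a required value directly: if `city_id` is missing, the system can call `search_city` first and feed its output into `get_weather`.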
Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo