How Tool Complexity Impacts AI Agents Selection Accuracy

Allen Chan
9 min read · Feb 20, 2025

Co-author: Sebastian Carbajales

In our previous discussion, we explored how the number of tools (“Quantity”) an AI Agent can handle affects token consumption. Now, we turn our attention to selection accuracy — how well an AI Agent picks the right tool for a given task and generates the correct call specification. As the number and complexity of tools grow, selection accuracy can degrade, leading to inefficiencies, errors, and poor user experience.

In this article, we will examine the “Quality” of the tools:

  • How the description of tools can impact accuracy.
  • The role of interface complexity in tool selection.
  • Architectural strategies to mitigate selection challenges.
  • Best practices for writing effective tool descriptions and parameters.

For this discussion, we focus on single-turn tool selection — cases where the AI must select the correct tool in a single interaction, without relying on prior conversation history.

[This is a rather long article; if you just want to read our conclusions and recommendations, jump to the “Any Guideline on Tool Complexity?” section.]

[This article is part of the AI-first Enterprise Automation with GenAI and AI Agents series.]

Factors Affecting Tool Selection Accuracy

1. The Impact of the Number of Tools on Accuracy

The more tools an AI Agent has access to, the harder it becomes to select the correct one. Two key research benchmarks highlight this challenge:

Nexus Function Calling Leaderboard

The Nexus Function Calling Leaderboard (NFCL) evaluates function selection in AI models. Comparing results from different categories sheds light on how tool count affects selection accuracy.

  • VirusTotal Category → Consists of 12 simple APIs (so 12 tools) for analyzing suspicious files and URLs. Since each request requires only one function call, selection is relatively easy.
  • OTX Category → Comprises 9 APIs interacting with Open Threat Exchange. It is also a simple benchmark, and models generally achieve high accuracy scores.

Conclusion: Models perform better on OTX than on VirusTotal, even though both have simple APIs. The increased number of tools in VirusTotal likely contributes to the performance drop.

Recent ReAct Agent Benchmarking (Feb 2025)

Another experiment, recently published by the LangChain team, evaluated the performance of a ReAct agent, testing how an AI Agent’s performance changes as more tools and instructions are added.

  • The experiment found that as the number of domains (tool categories) increases, accuracy declines.
  • Tasks requiring multiple function calls saw an even greater accuracy drop.

Experiment setup:

  • 5 models tested: claude-3.5-sonnet, gpt-4o, o1, o3-mini, llama-3.3-70B
  • 30 tasks, each run 3 times, for two domains: Calendar Scheduling (4 tools) and Customer Support (9 tools)
  • The same tasks re-run with unrelated domains added to the original one, up to a final test with 14 domains and 117 tools

Conclusion: Accuracy dropped dramatically as the number of domains grew from 1 to 14.

  • Calendar Scheduling tasks:
    - gpt-4o accuracy dropped from 43% with a single domain and 4 tools to just 2% with 7 domains and 51 tools.
    - llama-3.3-70b failed in all cases.
  • Customer Support tasks:
    - gpt-4o accuracy dropped from 58% with a single domain and 9 tools to just 26% with 7 domains and 51 tools.
    - llama-3.3-70b dropped from 21% with 1 domain to 0% with 7 domains.

These findings reinforce that having too many tools — especially when combined with irrelevant ones and lacking a pre-selection strategy — significantly degrades performance.

Composio Function-Calling Benchmark

A common thread in the performance evaluations mentioned above is that accuracy can be improved through better prompts and tool descriptions. For example, the Composio Function-Calling Benchmark tests 50 function-calling problems, each requiring the model to select from 8 function schemas. The benchmark assesses several schema optimization techniques and measures the resulting accuracy, which ranges from approximately 33% (with no optimization) to approximately 74% (with multiple optimizations applied), a significant improvement. A detailed description of all optimization techniques can be found on the benchmark’s webpage.

In a similar experiment, Gentoro created equivalent functions that were more closely tailored to the specific prompts, and then ran the same benchmarking tests. The results showed 100% accuracy across all test cases, demonstrating the value of designing tools that are aligned with user needs rather than simply mirroring API interfaces. By defining an interface closer to how users interact with the system, the experiment achieved superior performance and accuracy. However, this approach requires additional implementation effort, as it involves a more thorough understanding of the business use cases behind the APIs and adjusting the API signatures to suit those use cases.

Conclusion: Tool design is as important as tool description and optimization for selection accuracy. The ability to balance user-centered design with computational efficiency will be crucial to developing tools that consistently deliver high accuracy across a wide range of use cases.

2. The Impact of Interface Complexity on Selection

Selecting the correct function is just the first step. The next challenge is generating the correct inputs to call the function. Three key factors affect accuracy here:

  • The number of tool parameters (more parameters increase confusion).
  • The complexity of data structures used in the tool interface.
  • How descriptive the tool specifications are.

NVD Library vs. VirusTotal and OTX

An analysis of the NVD Library category in NFCL found that models performed worse with the NVD Library than with VirusTotal or OTX.

  • The NVD Library category includes only two APIs, so selection should be straightforward.
  • However, each API has 30 parameters, making it harder for the AI to correctly map inputs.
  • Conclusion: This suggests that even a small toolset can cause selection errors if the parameter design is too complex.

3. How to Write Effective Tool Descriptions and Parameters

The Berkeley Function-Calling Leaderboard (BFCL) provides valuable insights into how the number and complexity of parameters in function calls affect AI model performance. In its second version, functions have four parameters on average, with the most complex function having 28 parameters. This variation in parameter count directly influences a model’s ability to accurately select and execute functions.

  • Moreover, BFCL V2 highlighted that scenarios requiring models to choose the appropriate function from multiple options are more prevalent in real-world applications. This underscores the importance of designing functions with a manageable number of parameters to enhance selection accuracy and overall performance.
  • Conclusion: A well-defined tool description can significantly improve selection accuracy. Let’s look at some examples.

Example 1: Ambiguous/Unclear Data Definition

If a tool requires multiple inputs, using a structured data definition based on JSON schema can make it easier for the AI to select and use correctly.

Bad Example:

{
  "name": "stock_info",
  "description": "Get stock market data.",
  "parameters": {
    "ticker": "Stock symbol",
    "date": "Date"
  }
}

🔴 Why is this bad?

  • The function name is too general.
  • The description does not clarify what stock data is provided.
  • No format instructions specified for the date or ticker input.

Good Example:

{
  "name": "get_stock_data",
  "description": "Retrieves real-time or historical stock market data, including price, volume, and trends.",
  "parameters": {
    "ticker": {
      "type": "string",
      "description": "Stock symbol of the company (e.g., 'IBM' for IBM)."
    },
    "date": {
      "type": "string",
      "format": "YYYY-MM-DD",
      "description": "Date for historical stock data retrieval (leave blank for real-time data)."
    }
  }
}

🟢 Why is this better?

  • Precise function name (“get_stock_data” clarifies retrieval) following the “{operation}_{entity}_{data}” convention.
    — While there are no fixed rules on how we should name a function, we can follow the simple format of “{operation}_{entity}_{data}”, e.g. “get_stock_quote”, “update_product_price”, “query_flight_schedule”, “update_account_address”.
  • Expanded description (Specifies real-time or historical data).
  • Structured parameters (Ensures AI correctly interprets input).

Example 2: Poor vs. Effective Tool Descriptions

Bad Tool Description:

{
  "name": "weather_info",
  "description": "Get weather details for a location.",
  "parameters": {
    "location": "User's location",
    "unit": "Unit of temperature"
  }
}

🔴 Why is this bad?

  • Vague function name (“weather_info” does not specify what information is retrieved).
  • Unclear description (What kind of weather details? Current weather or forecast?).
  • Poor parameter descriptions (Does “unit” refer to Celsius/Fahrenheit?).

Good Tool Description:

{
  "name": "get_weather_data",
  "description": "Retrieves the current temperature, humidity, and forecast for a given location.",
  "parameters": {
    "location": {
      "type": "string",
      "description": "City name or geographic coordinates for which weather data is requested."
    },
    "unit": {
      "type": "string",
      "enum": ["Celsius", "Fahrenheit"],
      "description": "Preferred unit for temperature display."
    }
  }
}

🟢 Why is this better?

  • Clear and descriptive function name (“get_weather_data” specifies it retrieves weather data).
  • Detailed description (Clarifies what information is provided).
  • Well-defined parameters (Includes type, valid values, and detailed descriptions).

Example 3: Simple vs. Complex Parameter Design (Procurement Use Case)

Let’s take an example where an AI Agent is selecting a tool to find a product in a procurement system.

❌ Overly Complex Parameter Structure (Harder for AI to Handle)

{
  "name": "search_product_catalog",
  "description": "Finds products in the procurement catalog based on various filters.",
  "parameters": {
    "category": {
      "type": "object",
      "properties": {
        "department": { "type": "string" },
        "subcategory": { "type": "string" }
      },
      "description": "Department and subcategory of the product."
    },
    "specifications": {
      "type": "object",
      "properties": {
        "brand": { "type": "string" },
        "material": { "type": "string" },
        "size": { "type": "string" }
      },
      "description": "Product specifications."
    },
    "price_range": {
      "type": "object",
      "properties": {
        "min_price": { "type": "number" },
        "max_price": { "type": "number" }
      },
      "description": "Minimum and maximum price range."
    },
    "availability": {
      "type": "string",
      "enum": ["in_stock", "out_of_stock", "backorder"],
      "description": "Availability status."
    }
  }
}

🔴 Why is this bad?

  • Too many nested fields (e.g., category → subcategory, specifications → [brand, material, size]).
  • Forces AI to reason about deep object structures, increasing error risk.
  • Higher token usage due to deep nesting.

✅ Simplified, AI-Friendly Structure (Better for Selection Accuracy)

{
  "name": "search_product_catalog_by_brand",
  "description": "Finds a product in the procurement catalog based on simple filters.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Keywords for searching the product (e.g., 'office chair ergonomic')."
    },
    "category": {
      "type": "string",
      "description": "Main product category (e.g., 'Office Supplies', 'Furniture')."
    },
    "brand": {
      "type": "string",
      "description": "Preferred brand name (optional)."
    },
    "price_min": {
      "type": "number",
      "description": "Minimum price filter."
    },
    "price_max": {
      "type": "number",
      "description": "Maximum price filter."
    },
    "availability": {
      "type": "string",
      "enum": ["in_stock", "out_of_stock", "backorder"],
      "description": "Filter by product availability."
    }
  }
}

🟢 Why is this better?

  • Flattens nested fields into simple parameters (no complex object structures).
  • Uses a flexible “query” parameter instead of forcing structured inputs.
  • Narrows the selection (e.g., “by brand”) instead of supporting all possible selection criteria.
  • Minimizes decision complexity by reducing deep nesting.

Any Guideline on Tool Complexity?

There is no hard limit on how many parameters a tool can take or how deeply its inputs can be nested, but here are some general guidelines (a small sketch after the list shows how to check a tool against them):

  • 1–5 parameters and no nesting (1 level) → Easy for models to manage.
  • 6–10 parameters or up to 2 levels of nesting → Increased likelihood of input errors, especially without structured guidance (i.e., no JSON schema used).
  • 10+ parameters or more than 2 levels of nesting → High risk of incorrect parameter mapping, especially without structured guidance.
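
To make the guideline easier to apply, here is a minimal Python sketch (the helper name and the trimmed schema are ours, not from any benchmark) that walks a JSON-schema-style parameters block and reports how many leaf parameters it defines and how deeply objects are nested:

def schema_stats(params: dict, depth: int = 1) -> tuple[int, int]:
    """Return (leaf_parameter_count, max_nesting_depth) for a JSON-schema-style dict."""
    count, max_depth = 0, depth
    for spec in params.values():
        if isinstance(spec, dict) and spec.get("type") == "object":
            # Nested object: recurse into its properties one level deeper.
            sub_count, sub_depth = schema_stats(spec.get("properties", {}), depth + 1)
            count += sub_count
            max_depth = max(max_depth, sub_depth)
        else:
            count += 1  # Leaf parameter (string, number, enum, ...)
    return count, max_depth

# Trimmed version of the complex procurement schema from Example 3.
complex_params = {
    "category": {
        "type": "object",
        "properties": {"department": {"type": "string"}, "subcategory": {"type": "string"}},
    },
    "price_range": {
        "type": "object",
        "properties": {"min_price": {"type": "number"}, "max_price": {"type": "number"}},
    },
    "availability": {"type": "string", "enum": ["in_stock", "out_of_stock", "backorder"]},
}

print(schema_stats(complex_params))  # (5, 2) -> 5 leaf parameters, 2 levels of nesting

A quick check like this during tool design can flag schemas that drift past the thresholds above before they ever reach the model.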

Strategies to Improve Tool Selection Accuracy

✅ 1. Improving Accuracy through Better Prompts and Tool Descriptions

Reduce the Number of Parameters

  • Only include essential inputs. Remove redundant or optional fields.
  • If a parameter is rarely needed, handle it as an optional default instead of requiring it every time.
  • Split up a complex operation with many parameters into multiple tools with fewer parameters, and make the purpose of each clear in the description or name of the new tool set (see the sketch after this list).
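
As a hedged illustration of the splitting strategy (the order-management tool names below are hypothetical, not taken from any of the benchmarks above), a single wide tool can be replaced by narrower tools whose names and descriptions already signal when each one applies:

# One wide tool: several parameters only matter for some requests.
wide_tool = {
    "name": "manage_order",
    "description": "Create, cancel, or track an order.",
    "parameters": {
        "action": {"type": "string", "enum": ["create", "cancel", "track"]},
        "order_id": {"type": "string", "description": "Existing order ID (unused for create)."},
        "items": {"type": "string", "description": "Items to order (only used for create)."},
        "cancel_reason": {"type": "string", "description": "Reason (only used for cancel)."},
    },
}

# Two narrower tools: each carries only the parameters relevant to its operation.
narrow_tools = [
    {
        "name": "create_order",
        "description": "Creates a new order from a list of items.",
        "parameters": {"items": {"type": "string", "description": "Items to order."}},
    },
    {
        "name": "cancel_order",
        "description": "Cancels an existing order.",
        "parameters": {
            "order_id": {"type": "string", "description": "Order to cancel."},
            "cancel_reason": {"type": "string", "description": "Reason for cancellation."},
        },
    },
]

Each narrow tool now exposes only the inputs that matter for its operation, which is exactly what the guidelines above aim for.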

Flatten Nested Structures

  • Instead of deeply nested objects, use string-based representations where possible.
  • Combine multiple related parameters into a single field (e.g., category instead of department and subcategory).

Use Enumerations to Reduce Free-Form Inputs

  • If a parameter has limited valid values, use an enum list rather than letting the AI generate free text.
  • Example:
{
  "availability": {
    "type": "string",
    "enum": ["in_stock", "out_of_stock", "backorder"],
    "description": "Filter by product availability."
  }
}
  • Why? The AI won’t have to infer acceptable values.

Pre-Format Complex Data to Reduce Cognitive Load

  • Use pre-structured formats based on JSON schema instead of requiring the AI to compose complex inputs.
  • Example: Instead of requiring a structured JSON object for product specifications, allow a single formatted string ("brand: Logitech, color: black, size: medium") that the tool parses on its own side, as sketched below.
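
A minimal sketch of that idea (the function name and field format are ours): the model fills in one flat string, and the tool implementation parses it back into a structured object before calling the underlying API.

def parse_specifications(spec_string: str) -> dict:
    """Parse 'brand: Logitech, color: black, size: medium' into a dict."""
    specs = {}
    for pair in spec_string.split(","):
        if ":" in pair:
            key, value = pair.split(":", 1)
            specs[key.strip()] = value.strip()
    return specs

print(parse_specifications("brand: Logitech, color: black, size: medium"))
# {'brand': 'Logitech', 'color': 'black', 'size': 'medium'}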

✅ 2. Hierarchical Tool Selection: Sub-Agents and Supervisors

Instead of a single AI Agent handling all tools, use sub-agents for specific domains with a supervisor agent routing requests.

🔹 Example:

  1. User requests financial data.
  2. Supervisor agent routes to the Finance Agent.
  3. Finance Agent selects the correct API (e.g., stock prices vs. budget analysis).
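
Here is a minimal, hedged sketch of that routing pattern in Python. All agent and tool names are hypothetical; a production supervisor would typically ask an LLM (or a trained classifier) to pick the domain rather than matching keywords.

# Each sub-agent owns a small, domain-specific tool set.
SUB_AGENT_TOOLS = {
    "finance": ["get_stock_data", "get_budget_report"],
    "weather": ["get_weather_data", "get_weather_forecast"],
}

def route_request(user_query: str) -> tuple[str, list[str]]:
    """Supervisor step: pick a domain and return the small tool set its sub-agent owns."""
    query = user_query.lower()
    if any(word in query for word in ("stock", "price", "budget", "finance")):
        domain = "finance"
    elif any(word in query for word in ("weather", "forecast", "temperature")):
        domain = "weather"
    else:
        domain = "finance"  # in practice: ask the user to clarify or fall back to a default agent
    return domain, SUB_AGENT_TOOLS[domain]

domain, tools = route_request("What is the stock price of IBM today?")
print(domain, tools)
# finance ['get_stock_data', 'get_budget_report']
# Only these tools are passed to the Finance Agent's LLM call, not the full catalog.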

✅ 3. Dynamic Tool Activation

Instead of sending all tools in every request, pre-filter tools based on the user query.

🔹 Example:

  • If the user asks about weather, only weather-related tools should be sent.
  • If the query is about finance, omit weather tools to improve selection accuracy.
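
The same idea can be sketched as a pre-filter over a flat tool list (the tool names are illustrative). A real implementation would usually rank tools by embedding similarity to the query; plain word overlap is used here only to keep the sketch self-contained.

import re

TOOLS = [
    {"name": "get_weather_data", "description": "Retrieves current temperature, humidity, and forecast."},
    {"name": "get_stock_data", "description": "Retrieves real-time or historical stock market data."},
    {"name": "search_product_catalog", "description": "Finds products in the procurement catalog."},
]

def _words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def select_tools(user_query: str, tools: list, top_k: int = 2) -> list:
    """Keep only the top_k tools whose name and description best match the query."""
    query_words = _words(user_query)
    def score(tool: dict) -> int:
        return len(query_words & _words(tool["name"] + " " + tool["description"]))
    return sorted(tools, key=score, reverse=True)[:top_k]

selected = select_tools("What is the weather forecast for Toronto?", TOOLS, top_k=1)
print([t["name"] for t in selected])
# ['get_weather_data'] -> only this tool is included in the model request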

✅ 4. Monitor and Fine-Tune Model Performance

  • Track misclassification rates (which tools are selected incorrectly, and why?); a logging sketch follows this list.
  • Refine tool descriptions and parameters based on failure patterns.
  • Experiment with different LLMs (some models perform better at selection).
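
As a minimal sketch of the monitoring step (the logged tool names are hypothetical), record which tool the model selected versus which tool the task actually required, then report a per-tool error rate so the worst offenders get their descriptions refined first.

from collections import Counter, defaultdict

# (expected_tool, selected_tool) pairs collected from evaluation runs.
selection_log = [
    ("get_weather_data", "get_weather_data"),
    ("get_stock_data", "get_weather_data"),
    ("get_stock_data", "get_stock_data"),
    ("search_product_catalog", "get_stock_data"),
]

totals, confusions = Counter(), defaultdict(Counter)
for expected, selected in selection_log:
    totals[expected] += 1
    if selected != expected:
        confusions[expected][selected] += 1

for tool, total in totals.items():
    wrong = sum(confusions[tool].values())
    print(f"{tool}: {wrong}/{total} misclassified, confused with {dict(confusions[tool])}")
# get_weather_data: 0/1 misclassified, confused with {}
# get_stock_data: 1/2 misclassified, confused with {'get_weather_data': 1}
# search_product_catalog: 1/1 misclassified, confused with {'get_stock_data': 1}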

Final Thoughts

Tool selection accuracy is crucial for building reliable AI Agents. Key takeaways:

  • Too many tools reduce accuracy — use hierarchical selection when possible.
  • Vague tool descriptions lead to selection errors — ensure clarity in naming, descriptions, and parameters.
  • Complex tool interfaces increase confusion — optimize parameter structure and use dynamic activation.

In the next article, we’ll explore other aspects, including calling multiple tools in a single turn, multi-turn tool selection, and how AI Agents refine their choices over time. Stay tuned!

Written by Allen Chan

Allen Chan is an IBM Distinguished Engineer and CTO for Business Automation, building products to get work done better and faster with Automation and AI.
