
Tool-Use Optimization in Autonomous Agents: From ReAct to Reflection

Production autonomous agents on benchmarks like GAIA and WebArena report success rates of 40-60%, with tool-selection errors and hallucinated arguments among the most common preventable failure modes.

AGEIUM Research · April 19, 2026 · 11 min read

Note: This post is a paper-style blog article published by AGEIUM Research. The reported experimental figures are an **illustrative benchmark** demonstration of the proposed architecture; the external papers cited in the references (arXiv, Nature, Science, etc.) are real, verified sources.

1. Introduction

Modern autonomous agents powered by large language models (LLMs) have demonstrated remarkable capabilities in reasoning, planning, and goal-directed action across diverse domains. Yet deployment in production settings reveals a persistent gap between theoretical promise and empirical performance. Contemporary benchmarks including GAIA (Mialon et al., 2023; arXiv:2311.12983) and WebArena (Zhou et al., 2023; arXiv:2307.13854) document task-completion rates well below human baselines even with frontier models, and failure-mode analyses of large trajectory corpora consistently surface a recurring cluster of preventable errors: incorrect tool selection (choosing the wrong API from available options), hallucinated arguments (fabricating parameter names that do not exist in the target schema), and missed recovery opportunities (failing to recognize and correct a failed invocation before committing to downstream actions). These errors represent not fundamental reasoning deficits but systematic breakdowns in the coupling between the agent's deliberative reasoning substrate and the tool-invocation substrate — a gap that existing frameworks treat as separable engineering concerns.

The dominant agent architectures — ReAct (Yao et al., 2023; arXiv:2210.03629), Toolformer (Schick et al., 2023; arXiv:2302.04761), and Reflexion (Shinn et al., 2023; arXiv:2303.11366) — instantiate a modular pipeline: the LLM reasons, selects a tool, invokes it with arguments, observes the result, and (in Reflexion's case) critiques and revises its trajectory. This decomposition offers interpretability and enables independent testing of individual modules. However, the separation also prevents the framework from quantifying how each cognitive primitive — reasoning depth, tool selection quality, reflection frequency — contributes to end-to-end task success, leaving practitioners without principled guidance on where optimization effort yields the greatest return. More critically, as tool catalogs scale to thousands of callable APIs — a regime increasingly common in production multi-tenant platforms — the naive enumeration paradigm (serializing all tools into the context window) degrades non-linearly. Tool selection accuracy falls with catalog size due to semantic ambiguity among similar-sounding functions, context-window saturation, and token-budget constraints. Existing retrieval-augmented solutions apply two-stage filtering or embedding-based nearest-neighbor search, but treat tool selection as an isolated retrieval problem rather than as a dynamic decision under uncertainty conditioned on the agent's evolving reasoning state.

This paper introduces ReflAct, a unified agent framework that formalizes tool selection, tool invocation, and reflective revision as a single partially-observable Markov decision process (POMDP), in which all three phases are jointly optimized rather than treated as independent modules. Within ReflAct, we propose the Tool Selection Score (TSS): a learned ranking function trained on four axes — semantic relevance, historical invocation success rate, estimated execution cost, and trust-level annotations — that scales sub-linearly with catalog size and is designed to operate within the latency constraints of interactive production environments. Typed function-call validation with schema-guided constrained decoding is integrated directly into the generation loop, eliminating a class of hallucinated argument errors at the decoding level rather than as a post-hoc filter. Integrated reflection is modeled as a meta-action within the same POMDP, allowing the agent to query, revise, or escalate within a unified optimization objective.
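To make the unified loop concrete, the following is a minimal sketch (not the authors' reference implementation) of one ReflAct-style step: rank and select a tool, validate arguments against the tool's typed schema before execution, invoke it, and record reflections as meta-actions when validation fails. All names here (`Tool`, `run_step`, the `state` dictionary fields) are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    schema: dict            # parameter name -> expected Python type
    fn: Callable

def validate_args(tool, args):
    """Typed check before invocation, standing in for schema-guided decoding."""
    for k, v in args.items():
        if k not in tool.schema:
            return False, f"unknown parameter '{k}'"
        if not isinstance(v, tool.schema[k]):
            return False, f"wrong type for '{k}'"
    return True, ""

def run_step(tools, proposal, state):
    """One step of the loop: select -> validate -> invoke -> reflect."""
    name, args = proposal
    tool = next(t for t in tools if t.name == name)
    ok, err = validate_args(tool, args)
    if not ok:
        # Reflection as a meta-action: record the failure so the next
        # step can revise instead of committing to a bad invocation.
        state["reflections"].append(err)
        return None
    result = tool.fn(**args)
    state["history"].append((name, args, result))
    return result

tools = [Tool("add", {"a": int, "b": int}, lambda a, b: a + b)]
state = {"history": [], "reflections": []}
run_step(tools, ("add", {"a": 2, "b": 3}), state)      # valid call
run_step(tools, ("add", {"a": 2, "sum": 3}), state)    # hallucinated argument
```

The point of the sketch is the control flow: a hallucinated parameter never reaches execution, and the failure is preserved in shared state rather than silently dropped.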

We evaluate ReflAct on ToolHub-Eval, a new benchmark comprising task instances drawn from a diverse corpus of real-world tools — APIs, plugins, and local functions — annotated with ground-truth invocation schemas and human-assigned trust levels. ToolHub-Eval is designed to control for tool-catalog size and trust heterogeneity, enabling reproducible comparison across tool-density regimes that existing benchmarks do not support. We compare ReflAct against ReAct and Reflexion baselines under matched compute budgets, and report ablations that isolate the contribution of each framework component.

The remainder of this paper is structured as follows. Section 2 reviews recent advances in agent architectures, tool-use benchmarking, and retrieval-augmented function selection. Section 3 introduces the ReflAct POMDP formulation and derives the joint optimization objective. Section 4 presents the Tool Selection Score and its four-axis learned ranking algorithm. Section 5 describes the ToolHub-Eval benchmark design, annotation protocol, and dataset statistics. Section 6 reports experimental results across tool-density regimes and includes ablation studies isolating each component. Section 7 discusses implications for production deployment, latency constraints, and the emerging role of trust annotations. Section 8 concludes with open challenges and directions for future work.

2. Related Work

Agentic systems that combine language model reasoning with external tool execution have emerged as a promising direction for enhancing LLM capabilities beyond text generation. ReAct establishes a foundational paradigm in which language models generate reasoning traces and action traces in an interleaved manner, allowing the model to leverage external tools to ground its reasoning in observable outcomes. This approach demonstrates substantial improvements on knowledge-intensive and interactive tasks compared to chain-of-thought alone, though it requires explicit prompting to maintain the reasoning-acting loop. Building on this foundation, Reflexion introduces verbal self-reflection mechanisms that allow agents to learn from failure trajectories across episodes. By maintaining explicit memory of past failures and success patterns, Reflexion-based agents achieve improved performance on complex reasoning benchmarks; the verbal reflection component is interpretable and can be examined post-hoc, providing insights into the agent's error recovery strategies.

The challenge of enabling language models to autonomously learn when and how to invoke tools has been addressed through self-supervised approaches. Toolformer demonstrates that models can be fine-tuned to predict and insert tool calls at appropriate points in their generation, without requiring dense annotations for every potential tool invocation. This self-supervision paradigm reduces annotation overhead while maintaining the model's ability to decide whether tool use is necessary for a particular query. However, Toolformer operates primarily in synthetic or relatively constrained domains; scaling to real-world API ecosystems introduces distinct challenges around tool selection, error handling, and API composition.

At larger scale, function-calling systems must address combinatorial API spaces. Gorilla tackles the problem of connecting language models to massive API repositories by leveraging retrieval-augmented tool selection, enabling a single model to handle thousands of distinct APIs without exhaustive per-API training. ToolLLM extends this scalability further, demonstrating practical integration of over 16,000 real-world APIs, and provides extensive benchmarking on API invocation accuracy, error recovery, and multi-step API orchestration. Both systems emphasize the importance of retrieval-based filtering to avoid overwhelming the LLM context with irrelevant API specifications, and both show that appropriate ranking of candidate APIs directly impacts performance.

Beyond single-episode planning, the paradigm of lifelong skill acquisition has been explored in embodied and simulated environments. Voyager demonstrates that language models can continually acquire new skills in open-ended environments by generating executable code (in this case, for Minecraft), maintaining a persistent skill library, and using that library to ground future problem-solving. This work underscores the value of cumulative learning: agents that can build upon previously solved tasks show non-trivial improvements in sample efficiency and exploration coverage compared to agents trained from scratch on each new task.

The common thread across these works is the tension between generality and specialization. Models must be capable enough to reason about when tools are needed, yet mechanisms must be in place to prevent hallucinated or incorrect tool invocations. Reflection and error recovery are increasingly recognized as essential components of robust agentic systems, yet implementing these mechanisms in ways that scale to real-world API ecosystems remains an open problem. The present work addresses this gap by integrating structured error feedback, dynamic tool ranking, and reflective planning into a unified agentic architecture optimized for large-scale tool-use scenarios.

3. Background

Tool use represents a foundational capability for autonomous agents, enabling systems to interact with external environments beyond language generation. Early large language models relied primarily on in-context prompting for tool invocation, with human-designed templates guiding function calls through structured text. The introduction of function-calling APIs—standardized protocols for transmitting structured tool requests and responses—marked a significant transition from template-based approaches toward more systematic interaction patterns. This shift enabled models to request external computations, retrieve information, and execute domain-specific operations without explicitly generating code or managing system calls through natural language fallbacks. Contemporary systems leverage function calling as a core primitive, yet this capability alone proves insufficient for robust agentic behavior. Agents operating in real-world settings encounter tool mismatches (selecting unsuitable tools for given tasks), incorrect parameter bindings, and execution failures requiring strategic recovery mechanisms.
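As a concrete illustration of the function-calling protocols described above, the snippet below declares a tool in the JSON-Schema style that most function-calling APIs use, and shows a structured call being checked against the declaration. The tool name and fields are examples of the convention, not any specific vendor's API:

```python
import json

# Illustrative tool declaration in JSON-Schema style (example, not a real API).
get_weather = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Instead of free text, the model emits a structured request:
call = {"name": "get_weather", "arguments": json.dumps({"city": "Seoul"})}

# The runtime parses arguments and checks required parameters before dispatch.
args = json.loads(call["arguments"])
missing = [p for p in get_weather["parameters"]["required"] if p not in args]
```

This standardization is what makes downstream checks possible at all: because the request is structured data rather than prose, mismatched parameters can be detected mechanically.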

Chain-of-thought (CoT) reasoning has emerged as a complementary technique, prompting models to articulate intermediate reasoning steps before committing to actions. Extensions such as tree-of-thought and graph-of-thought architectures expand this paradigm by exploring multiple reasoning branches, offering pathways toward more comprehensive solution spaces. However, CoT concentrates on decision transparency and intermediate state articulation rather than runtime adaptation following failures. Reflection—the systematic analysis of prior decisions and outcomes—has been proposed as a mechanism for error correction and iterative refinement. Existing reflection frameworks tend to operate post-hoc, analyzing completed trajectories or failure sequences after execution completion, thereby limiting opportunities for mid-trajectory correction when detection of misaligned choices remains possible.

Tool discovery and selection constitute underexplored challenges in agentic AI. As tool repositories scale—from dozens of functions to thousands of API endpoints and specialized models—naive enumeration becomes computationally prohibitive. Semantic similarity matching, the dominant current approach, relies on embedding-based retrieval without accounting for historical tool performance patterns, domain-specific cost structures, or reliability metadata that should inform selection decisions. Tool selection systems that incorporate multiple signals—success rates, latency characteristics, failure modes—remain scarce in published agentic architectures. Schema validation for function calls has typically been treated as a post-hoc enforcement mechanism or delegated to runtime error handling, rather than integrated into the generation process itself. Schema-guided constrained decoding offers potential for preventing invalid parameter bindings during generation, yet its integration with large language models remains technically and operationally complex.
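The difference between post-hoc validation and schema-guided constrained decoding can be shown with a deliberately toy sketch: rather than letting the model emit arbitrary parameter names and filtering afterward, the decoder masks each candidate so that only names permitted by the schema can ever be emitted. Real constrained decoding operates on token logits; this version operates on whole keys purely for illustration:

```python
def constrained_keys(schema_props, candidates):
    """Toy 'constrained decoding': only parameter names permitted by the
    schema may be emitted, so an invalid name can never appear in the
    output -- as opposed to being caught by a post-hoc validator."""
    return [k for k in candidates if k in schema_props]

schema_props = {"city": str, "unit": str}

# The model proposes parameter names; the decoder masks out anything
# outside the schema's vocabulary ('location' is a hallucination).
proposed = ["city", "location", "unit"]
emitted = constrained_keys(schema_props, proposed)
```

In an actual implementation the mask would be applied per token against a grammar compiled from the JSON schema, but the guarantee is the same: the invalid binding is unreachable by construction.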

The gap between current practice and the requirements of production agentic systems suggests opportunities for unified frameworks that address tool selection, schema-aware generation, and adaptive error recovery as integrated concerns rather than isolated components. Such frameworks should balance multiple objectives: selecting tools aligned with task intent, ensuring generated calls conform to tool specifications rather than relying on downstream error correction, and recovering gracefully from execution failures through mechanisms that go beyond simply replaying the failed step toward counterfactual path exploration.

4. Methodology

ReflAct is constructed as a unified agent loop comprising four tightly coupled primitives that operate in sequence within each reasoning step, sharing internal state across the full trajectory rather than treating each tool invocation as an isolated transaction. This architectural decision is deliberate: the most common failure mode in prior agentic systems arises from the absence of longitudinal state coupling, which causes each step to reason from scratch and thereby repeat mistakes or redundantly re-evaluate tool options that have already been disqualified by earlier evidence. By threading a persistent trace context through all four primitives, ReflAct preserves causal provenance across the full execution horizon.
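The persistent trace context threaded through the four primitives can be sketched as a simple shared record; the field names below (`tool_outputs`, `reflections`, `disqualified`) are my assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceContext:
    """Longitudinal state shared by all primitives across a trajectory,
    so later steps can see what earlier steps already ruled out."""
    instruction: str
    tool_outputs: list = field(default_factory=list)   # prior observations
    reflections: list = field(default_factory=list)    # per-step critiques
    disqualified: set = field(default_factory=set)     # tools ruled out

    def eligible(self, catalog):
        # Earlier evidence prunes the candidate set before ranking,
        # preventing redundant re-evaluation of disqualified tools.
        return [t for t in catalog if t not in self.disqualified]

ctx = TraceContext("book a flight")
ctx.disqualified.add("weather_api")
cands = ctx.eligible(["flight_search", "weather_api", "calendar"])
```

The design choice the sketch captures is that disqualification is monotone within a trajectory: once a tool is ruled out by evidence, no later step pays the cost of reconsidering it.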

The first primitive, the Tool Selection Score (TSS), replaces the standard top-k semantic retrieval paradigm with a multi-factor learned ranker. Given a task state embedding derived from the concatenation of the current instruction, prior tool outputs, and the reflection summary from the previous step, the TSS ranker scores each candidate tool as a weighted combination of four signals. The semantic similarity component measures cosine distance in a shared embedding space between the task state and the tool's description and example invocations. The historical success rate component is maintained per tool per task-type cluster and is updated online using exponential moving average over binary outcome signals. The cost signal encodes normalized latency and token expenditure estimates for each tool, derived from empirical profiling over the training distribution. The trust level signal reflects a manually curated and periodically audited reliability score that accounts for API deprecation risk, version stability, and observed hallucination rate in prior invocations. The tool index itself is sharded by domain category to allow sublinear retrieval at scale — a critical design consideration given that ToolHub-Eval spans over twelve hundred distinct tools with heterogeneous schemas. The ranker weights are learned end-to-end during fine-tuning using a contrastive ranking objective, where positive examples are tool invocations that led to successful task completion and negative examples are sampled from incorrect selections recorded in trajectory logs.
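A minimal sketch of the TSS scoring function follows, assuming hand-set weights rather than the learned ones, with cost entering negatively and the historical success rate maintained as an online exponential moving average over binary outcomes, as described above:

```python
def update_success_ema(prev, outcome, alpha=0.1):
    """Online EMA over binary invocation outcomes (1 = success)."""
    return (1 - alpha) * prev + alpha * outcome

def tss(semantic, success, cost, trust, w=(0.4, 0.3, 0.2, 0.1)):
    """Tool Selection Score: weighted combination of the four signals.
    Cost enters negatively (cheaper tools rank higher). The weights
    here are illustrative placeholders, not the learned values."""
    return w[0] * semantic + w[1] * success - w[2] * cost + w[3] * trust

# Candidate tools as (semantic_sim, success_ema, normalized_cost, trust):
candidates = {
    "search_api":  (0.82, 0.90, 0.10, 0.9),
    "scrape_tool": (0.85, 0.55, 0.60, 0.5),
}
ranked = sorted(candidates, key=lambda t: tss(*candidates[t]), reverse=True)
```

Note how the ranking can invert pure semantic similarity: `scrape_tool` matches the task slightly better semantically, but its poor success history, higher cost, and lower trust push it below `search_api`.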

