by Yuanqing Yu, Zhefan Wang, Weizhi Ma, Zhicheng Guo, Jingtao Zhan, Shuai Wang, Chuhan Wu, Zhiqiang Guo, and Min Zhang from the Department of Computer Science and Technology at Tsinghua University and Huawei Noah's Ark Lab
https://arxiv.org/abs/2410.07745
LLMs and Tool Learning:
- LLMs: Large Language Models with powerful reasoning capabilities
- Need external tools for information retrieval or domain expertise
- Challenges with existing training: imitating static expert trajectories limits adaptability and can yield suboptimal solutions
StepTool Framework:
- Introduced to improve tool learning in LLMs
- Consists of two components:
- Step-grained Reward Shaping: assigns rewards at each tool interaction based on success and contribution to the task
- Step-grained Optimization: uses policy gradient methods for multi-step optimization
Benefits:
- Significantly outperforms existing methods in multi-step, tool-based tasks
- Robust solution for complex task environments
Access:
- Codes available at: https://github.com/yuyq18/StepTool
Tool Learning and Large Language Models (LLMs)
Large Language Models (LLMs):
- Remarkable abilities in reasoning and inference
- Impressive performance across a wide range of tasks
Limitations of LLMs:
- Complex tasks requiring real-time information or domain-specific knowledge exceed their capacities
Tool Learning as a Solution:
- Emerging solution to augment LLMs with external tools (APIs)
- Dynamically select, invoke, and interact with tools for real-time responses
Supervised Fine-Tuning (SFT):
- Most common approach for tool learning enhancement
- Imitates expert-generated trajectories in text generation manner
Limitations of SFT:
- Imitating static predefined tool sequences limits model's adaptability to new tasks or environments
- Blindly imitating trajectories may lead to suboptimal task solving performance
Proposed Approach: Reinforcement Learning (RL)
- Offers a more dynamic perspective by treating tool learning as sequential decision making process
- Models tool interactions as actions leading to state transitions
Challenges in Applying RL to Tool Learning:
- Multiple decision steps and real-time feedback from external tools and environments
- Complex rewards considering accuracy of tool invocation and contribution to overall task completion
Introducing StepTool: A Novel Framework for Tool Learning:
- Models tool learning as sequential decision making process
- Consists of two core components: Step-grained Reward Shaping and Step-grained Optimization
Step-grained Reward Shaping:
- Designed rewards at each step based on accuracy of tool invocation and contribution to overall task completion
- Richer signals for tool learning, guiding decision making process
Step-grained Optimization:
- Proposed optimization method based on policy gradient theory
- Ensures adaptability to dynamic, multi-step interactions
Contributions:
- Identification of limitations of SFT and classic RLHF for tool learning, and introduction of StepTool
- Design of step-grained rewards tailored to tool learning scenarios
- Proposal of a step-grained optimization method based on policy gradients
- Demonstration of effectiveness through comprehensive experiments with three open-sourced models.
Recent Advancements in LLMs:
- Expanded ability to utilize external tools for complex tasks (Chen et al., 2023; Shen et al., 2024; Schick et al., 2024)
- Interact with diverse external tools: program executors, search engines, QA systems
- Subsequent models focused on extensive interactions with real-world APIs and tools (Qin et al., 2023; Patil et al., 2023)
- Incorporated vast APIs from platforms like RapidAPI and TorchHub
- Trained LLaMA model for tool-based tasks in a Supervised Fine-Tuning (SFT) manner
- Research efforts on constructing verifiable and diverse datasets for SFT training (Tang et al., 2023; Abdelaziz et al., 2024; Liu et al., 2024)
Tool Learning Research:
- Exploration of Direct Preference Optimization (DPO) (Rafailov et al., 2024) for Tool Learning (Chen et al., 2024)
- Constructs preference data pairs based on task completion
- Without accounting for the quality of intermediate steps
- Our work shapes step-grained rewards and leverages them for step-grained reinforced optimization.
Previous Research on Process Supervision for LLMs:
- Lightman et al., 2023: explored effectiveness of process supervision on enhancing long-chain reasoning abilities in LLMs.
- Uesato et al., 2022: optimized reasoning using Reinforcement Learning from Human Feedback (RLHF).
- Ma et al., 2023: considered correctness of each reasoning step and applied DPO using step-level preference pairs.
Differences in Approach:
- Definition of a "step":
- In mathematical reasoning: text segments generated by LLMs (Lightman et al., 2023)
- In our work: real-time interactions with external tools and environments
- Focus on rewards:
- Mathematical rewards: correctness relative to ground truth
- Tool learning rewards: account for both tool invocation success and its contribution to task completion
Recent Work in Agent-Based Context:
- Xiong et al., 2024: proposed step-level refinement to enhance LLM agent performance, estimating step-level rewards with a Monte Carlo method and providing step-by-step guidance.
Our Approach:
- Shapes step-grained rewards tailored to tool learning scenarios
- Performs step-grained optimization using policy gradient methods
- Focuses on exploration-based learning unlike previous work.
Tool Learning Model: Multi-step Decision Making Problem
Modeling Tool Learning Process:
- Formulated as a Markov Decision Process (MDP)
- Tuple representation: M = (S, A, P, R, γ)
- State space: S (current context or environment responses at time step t)
- Action space: A (external tool API calls or terminal responses at time t)
- State transition dynamics: P (probability of transitioning to a new state given current state and action)
- Reward function: R (effectiveness of tool-calling step based on current state and action)
- Discount factor: γ (balances immediate rewards with long-term performance)
Tool Selection Strategy:
- Decision-making policy πθ (parameterized by θ)
- Governs selection of actions based on the current state
Trajectory Representation:
- Sequence of states and actions over time: τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T}
Expected Reward:
- R̄_θ = E_{τ∼π_θ(τ)}[R(τ)], maximized to optimize final task-solving performance
Parameter Updates:
- Gradient of expected reward: ∇R̄_θ = E_{τ∼π_θ(τ)}[ R(τ) ∇ log π_θ(τ) ]
Training Efficiency and Stability:
- Replace R(τ) with the advantage function Â(s_t, a_t) = G_t^n − V(s_t) for efficient learning and stabilization:
- G_t^n: estimated future (discounted) return from step t
- V(s_t): expected return when starting from state s_t and following the current policy thereafter
Proposal for Enhanced LLM Task Solving
Introducing StepTool: A novel reinforcement learning framework designed to support complex problem-solving tasks. It operates based on the advantage function (Equation 3) and policy gradient formulation (Equation 2). As presented in Figure 2, StepTool consists of two main components:
- Step-grained Reward Shaping: Assigns a reward at each step that reflects both the success of the tool invocation and its contribution to overall task completion.
- Step-grained Optimization: Uses policy gradient methods to optimize the model across multiple interaction steps.
These components offer step-by-step feedback, enabling more effective decision-making in complex environments.
Step-Grained Reward Shaping
- Provides step-level reward signals for intermediate steps, guiding model decision-making
- Effective in tool learning scenarios with well-defined formats and explicit goals
- Overcomes limitations of delayed rewards by offering explicit feedback
Step-Grained Reward Design
- Two key factors: SuccCalling and Contribution
- SuccCalling: evaluates successful execution of tool call (format, content)
- Contribution: assesses extent to which tool action facilitates overall task solution
- Final step reward linked to task completion (IsSolved)
Figure 2: Overview of the StepTool framework and its two components: Step-grained Reward Shaping (scoring each step by SuccCalling and Contribution, and the final step by IsSolved) and Step-grained Optimization via policy gradients.
SuccCalling Metric
- Evaluates successful tool call execution
- Formal representation: r̂_t^SC = SuccCalling(a_t, s_{t+1})
Contribution Metric
- Assesses tool action's contribution to overall task solution
- Actions that contribute little to the final solution receive lower rewards
- Formal definition: r̂_t^Con = Contribution(a_t, a_T)
IsSolved Metric (Final Step)
- Reward directly linked to task completion
- Evaluates final answer based on initial user query
- Formal representation: r̂_T^IS = IsSolved(q, a_T)
Reward Definition
- Intermediate steps (t < T): r̂_t = α · r̂_t^SC + r̂_t^Con
- Final step: r̂_T = r̂_T^IS = IsSolved(q, a_T)
- Scaling factor: α
- Rewards are normalized for consistency
Step-Grained Reward Acquisition
- Collect trajectories from model's interactions with tools/environments
- Assign step-grained rewards through rule-based models, human annotations, or advanced models like GPT-4
- Use annotated data for offline reinforcement learning optimization or train a reward model for online training.
Step-Grained Reinforcement Learning Strategy
Limitations of Single-Step Approaches:
- Standard RLHF (Ouyang et al., 2022) assigns a single trajectory-level reward to the whole response, which is ill-suited to multi-step tool interactions
Proposed Step-Grained Optimization Strategy:
- Extends the gradient of expected reward to a token-level consideration
- Represents each action a_t as a sequence of L_t tokens
- Gradient of the expected return at step level: ∇R̄_θ = E_{τ∼π_θ(τ)}[ Σ_{t=1}^{T} Â(s_t, a_t) Σ_{i=1}^{L_t} ∇ log π_θ(a_t^i | s_t, a_t^{1:i−1}) ]
- Advantage function: Â(s_t, a_t) = G_t^n − V(s_t), where G_t^n = Σ_{i=t}^{T} γ^{i−t} r̂_i is the estimated discounted future return from step t and V(s_t) is the expected return from s_t under the current policy
- Optimization objective: L(θ) = E_{τ∼π_θ(τ)}[ Σ_{t=1}^{T} Σ_{i=1}^{L_t} Â(s_t, a_t) log π_θ(a_t^i | s_t, a_t^{1:i−1}) ]
- Addresses the complex case of multi-step interactions with external environments
Practical Instantiation with PPO:
- Compatible with any policy gradient-based RL algorithm
- Estimates advantage function using Generalized Advantage Estimation (GAE)
- Employs clipped PPO objective to prevent large updates during optimization
- Introduces per-token KL divergence penalty from the old policy at each token
Benchmark & Evaluation Metrics
- ToolBench: widely recognized benchmark for tool learning, evaluating model performance on solving complex tasks
- StableToolBench: improved and more stable version of ToolBench, features 765 solvable tasks across six subsets
- Each subset varies in tool category and instruction complexity
- Subset statistics: I1 Instruction (163 tasks), I1 Category (153), I1 Tool (158), I2 Instruction (106), I2 Category (124), I3 Instruction (61), totaling 765 solvable tasks
- Evaluation metrics: pass rate, win rate
- Baselines: supervised fine-tuning (SFT) and PPO, using three open-source base models: ToolLlama, Llama3.1, Qwen2
- Each model adopts two different strategies: Chain of Thought (CoT) and Depth-First Search Decision Tree (DFSDT)
- Training setting: SFT uses static expert paths from GPT-4 in the ToolBench dataset; PPO and StepTool use responses and interaction paths generated by each model on 5,000 training tasks. All models are optimized with default settings in the same experimental environment.
Performance Comparison Results
- Comparison covers three base models (ToolLlama, Llama3.1, Qwen2) under two strategies (CoT and DFSDT), against SFT and PPO baselines, with GPT-3.5 included as a reference
- StepTool outperforms the baselines on most subsets (e.g., I1 Ins., I1 Cat., I2 Ins., I2 Cat., I3 Ins.) for the majority of settings
- Consistent improvement in performance over baselines for specific subsets: 'I1 Tool': 1%-4%, 'I3 Ins.' 5%-13%
- Superior task-solving capabilities with better solution paths as evidenced by win rates (Figure 3)
- StepTool vs. PPO (DFSDT): win rate range 50%-65.8% in I1 tool, I2 cat., and I3 ins.
- Effectiveness in tool-based task solving demonstrated through consistent outperformance of baselines across different settings (CoT and DFSDT).
Comparative Performance Results Summary:
- StepTool: consistently outperforms SFT and PPO for most subsets in the same base model and strategy
- Advantages of StepTool: enhances performance, especially when using DFS strategy in complex scenarios
- Performance improvements vary across different subsets: significant improvement on more complex tasks like 'I3 Ins.' (5%-13%) compared to simpler tasks like 'I1 Tool.' (1%-4%)
- Win rates: StepTool has a win rate over 50% against all baselines across three randomly selected subsets (I1 tool, I2 cat., I3 ins.).
Ablation Study on StepTool Components
Ablated variants:
- w/o Step-grained Reward: intermediate step-grained rewards are set to 0, leaving only the final reward
- w/o Step-grained Optimization: the full trajectory is divided into sub-trajectories at intermediate actions and models are trained with standard PPO
Findings:
- Removing step-grained reward reduced the average pass rate to 48.1%
- Removing step-grained optimization further decreased the pass rate to 46.9%
- These findings suggest that:
- Setting intermediate rewards to zero fails to provide informative signals, leading to reduced performance
- Traditional PPO optimizes steps in isolation without leveraging dependencies across multiple steps
- Importance of both components in StepTool for solving complex multi-step tasks
Table 3: Ablation study on the two components of StepTool (pass rate, %).
| Component | I1 Ins. | I1 Cat. | I1 Tool | I2 Cat. | I2 Ins. | I3 Ins. | Average |
|---|---|---|---|---|---|---|---|
| StepTool | 58.7±1.8 | 57.8±1.7 | 57.2±0.7 | 52.7±0.8 | 52.7±1.0 | 42.1±1.5 | 53.5 |
| w/o Step-grained Reward | 57.2±2.6 | 50.5±0.4 | 45.1±0.8 | 44.9±1.5 | 51.1±1.5 | 39.9±0.8 | 48.1 |
| w/o Step-grained Opt | 57.7±1.5 | 52.2±1.3 | 43.0±1.4 | 45.3±0.8 | 41.8±1.1 | 41.5±1.5 | 46.9 |
| ToolLlama | 54.2±0.5 | 50.3±0.8 | 56.5±1.5 | 52.0±0.6 | 45.4±0.6 | 37.2±1.0 | 49.3 |
Tool Invocation Success Rates Comparison
To assess our method's impact on intermediate tool invocations, we compute the average tool-invocation success rate across the test sets for the ToolLlama and Qwen2 models under both CoT and DFSDT settings. Figure 4 shows that StepTool consistently raises the success rate of intermediate tool invocations in both scenarios, indicating that it improves multi-step task solving by making individual tool calls more accurate.
Case Study: Correcting Wrong Tool Selection using StepTool
- User Query: I'm planning a movie night and need recommendations. Get channel info for "Paramount Pictures", video comments for ID "123456", and streaming/downloading information for the movie with ID "UxxajLWwzqY".
- Step 1: Tool - getchannelinfo: Retrieves channel info for "Paramount Pictures"
- Step 2: Tool - getvideoscomment: Retrieves video comments for ID "123456"
- Step 3: Wrong Tool Selection: the model incorrectly calls "getvideoscomment" again instead of switching to "download_stream".
- Issue: before optimization, the model mishandles this intermediate step.
- Step 4: Finish: After applying StepTool, the model correctly uses "download_stream" and provides a streaming link, fulfilling the user's request.
- Importance of Optimizing Intermediate Steps: Demonstrates effectiveness in improving decision-making in complex tasks.
StepTool Framework:
- Components: Step-grained Reward Shaping and Step-grained Optimization
- Step-grained Reward Shaping: Provides feedback at each tool interaction by evaluating success and contribution to the task.
- Step-grained Optimization: Uses policy gradient methods to optimize decision-making at each step.
- Experiments: Demonstrate superiority in enhancing performance of solving complex tasks.
Limitations:
- Training Instability: PPO training process exhibits instability, but experimental setups and parameter settings are provided for reference and reproducibility.
- Further Improvement: Potential for greater precision in reward design and applicability to a wider range of tasks.
Reproducibility: All necessary implementation code, experimental setups, model configurations, and scripts to reproduce results are available on GitHub: https://github.com/yuyq18/StepTool.