Tool Correctness¶
Tool Agent Single Turn
At a Glance¶
Score Range
0.0–1.0 · Recall of expected tools
Default Threshold
0.5 · Pass/fail cutoff
Required Inputs
`tools_called`, `expected_tools` · Tool call lists
What It Measures
Tool Correctness evaluates whether an AI agent called the correct tools by comparing actual tool calls against expected ones. It supports name-only matching, parameter validation, and strict ordering requirements.
| Score | Interpretation |
|---|---|
| 1.0 | All expected tools called correctly |
| 0.5–0.9 | Partial match: some tools missing |
| 0.0 | No expected tools called correctly |
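The score is effectively a recall over the expected tools. A minimal sketch of that computation (a hypothetical `tool_recall` helper, not the library's code):

```python
def tool_recall(tools_called: list[str], expected_tools: list[str]) -> float:
    """Fraction of expected tool names that appear in the called list."""
    if not expected_tools:
        return 0.0
    remaining = list(tools_called)
    matched = 0
    for name in expected_tools:
        if name in remaining:
            remaining.remove(name)  # consume so duplicates aren't double-counted
            matched += 1
    return matched / len(expected_tools)

print(tool_recall(["fetch", "transform"], ["fetch", "transform", "store"]))  # 0.666...
```

With two of three expected tools called, the score lands in the 0.5–0.9 partial-match band.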
**When to use:**

- Evaluating AI agents
- Testing function calling
- Validating tool selection
- Checking parameter passing

**When not to use:**

- Order doesn't matter (consider disabling `strict_order` instead)
- Tool output quality matters more than tool selection
- Parameters have valid variations
- No expected tools are defined
How It Works¶
The metric compares called tools against expected tools with configurable matching strategies.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Tools Called]
        B[Expected Tools]
    end
    subgraph CONFIG["Configuration"]
        C{Strict Order?}
        D{Check Parameters?}
    end
    subgraph MATCH["Matching"]
        E[Compare names]
        F[Validate parameters]
        G[Check sequence]
    end
    subgraph SCORE["Score"]
        H["matched / expected"]
    end
    A & B --> C
    C -->|No| E
    C -->|Yes| G
    E --> D
    G --> D
    D -->|Yes| F
    D -->|No| H
    F --> H
    style INPUT stroke:#8b5cf6,stroke-width:2px
    style CONFIG stroke:#3b82f6,stroke-width:2px
    style MATCH stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
```
- Name only (default): just check whether the tool name matches; parameters are ignored.
- `exact`: parameters must match exactly.
- `subset`: called args must contain all expected args (extras OK).
- `fuzzy`: similarity-based matching against a configurable threshold.
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `check_parameters` | `bool` | `False` | Also validate tool parameters |
| `strict_order` | `bool` | `False` | Tools must be called in exact order |
| `parameter_matching_strategy` | `str` | `'exact'` | `'exact'`, `'subset'`, or `'fuzzy'` |
| `fuzzy_threshold` | `float` | `0.8` | Similarity threshold for fuzzy matching |
Parameter Matching Strategies
- exact: Parameters must match exactly (default)
- subset: Called args must contain all expected args (extras allowed)
- fuzzy: Similarity-based matching using SequenceMatcher
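The three strategies can be sketched as plain comparisons (an illustrative `params_match` helper; the fuzzy branch uses `difflib.SequenceMatcher` per the description above, but the library's internals may differ):

```python
from difflib import SequenceMatcher

def params_match(called: dict, expected: dict, strategy: str = "exact",
                 fuzzy_threshold: float = 0.8) -> bool:
    """Illustrative parameter check for the three matching strategies."""
    if strategy == "exact":
        return called == expected
    if strategy == "subset":
        # every expected key/value must appear in the called args; extras OK
        return all(called.get(k) == v for k, v in expected.items())
    if strategy == "fuzzy":
        # same keys required; values compared by string similarity
        if set(called) != set(expected):
            return False
        return all(
            SequenceMatcher(None, str(called[k]), str(v)).ratio() >= fuzzy_threshold
            for k, v in expected.items()
        )
    raise ValueError(f"unknown strategy: {strategy}")

print(params_match({"a": 5, "b": 3}, {"a": 5}, "subset"))  # True: extras allowed
print(params_match({"a": 5}, {"a": 5, "b": 3}, "subset"))  # False: expected key missing
```

Note the asymmetry of `subset`: extra called args are tolerated, but every expected arg must be present with an equal value.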
Code Examples¶
```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness()

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "weather"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={}),
        ToolCall(name="format", args={}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - both tools called (params not checked)
```
```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
    expected_tools=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - params match exactly
```
```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(strict_order=True)

# Correct order
item_correct = DatasetItem(
    tools_called=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 1.0

# Wrong order
item_wrong = DatasetItem(
    tools_called=[
        ToolCall(name="process", args={}),  # Should be second
        ToolCall(name="fetch", args={}),    # Should be first
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 0.0 - order mismatch at position 0
```
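The strict-order behavior above can be sketched as a positional walk that fails on the first mismatch (a hypothetical `strict_order_score` helper, not the library's implementation):

```python
def strict_order_score(called: list[str], expected: list[str]) -> float:
    """With strict ordering, any positional mismatch fails the whole sequence."""
    for pos, name in enumerate(expected):
        if pos >= len(called) or called[pos] != name:
            return 0.0  # mismatch at this position; sequence is wrong
    return 1.0

print(strict_order_score(["fetch", "process", "store"],
                         ["fetch", "process", "store"]))  # 1.0
print(strict_order_score(["process", "fetch", "store"],
                         ["fetch", "process", "store"]))  # 0.0
```

This is why swapping the first two calls scores 0.0 even though all three expected tools were called.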
```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='fuzzy',
    fuzzy_threshold=0.8
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "what is machine learning"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={"query": "what is ML"}),
    ],
)

result = await metric.execute(item)
# Score depends on string similarity of query values
```
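You can inspect the similarity directly: per the strategy description, fuzzy matching uses Python's `difflib.SequenceMatcher` (shown standalone here; exact scoring internals may differ):

```python
from difflib import SequenceMatcher

# Character-level similarity of the two query values from the example above.
ratio = SequenceMatcher(None, "what is machine learning", "what is ML").ratio()
print(round(ratio, 2))  # ~0.47, below fuzzy_threshold=0.8, so this pair would not match
```

Only the shared prefix "what is " matches; "machine learning" and "ML" contribute nothing, so the pair falls well short of the 0.8 threshold.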
```python
from axion.metrics import ToolCorrectness
from axion.runners import MetricRunner

metric = ToolCorrectness(check_parameters=True)
runner = MetricRunner(metrics=[metric])

results = await runner.run(dataset)
for item_result in results:
    print(f"Score: {item_result.score:.2f}")
    print(f"Explanation: {item_result.explanation}")
```
Example Scenarios¶
✅ Scenario 1: Perfect Match (Score: 1.0)
All Tools Correct
Expected Tools:
search(query="weather"), parse(format="json")
Called Tools:
search(query="weather"), parse(format="json")
Result: 1.0
All expected tools called with correct parameters.
⚠️ Scenario 2: Partial Match (Score: 0.67)
Missing Tool
Expected Tools:
fetch, transform, store
Called Tools:
fetch, transform (store not called)
Result: 2/3 = 0.67
Explanation: "Correctly called: ['fetch', 'transform']; Missing tools: ['store']"
❌ Scenario 3: Wrong Tool (Score: 0.0)
Incorrect Tool Called
Expected Tools:
calculate
Called Tools:
search
Result: 0.0
Explanation: "Missing tools: ['calculate']; Unexpected tools: ['search']"
⚠️ Scenario 4: Parameter Mismatch
Wrong Parameters
Config: check_parameters=True, strategy='exact'
Expected:
search(query="Python tutorials")
Called:
search(query="python tutorial") (different text)
Result: 0.0
Exact matching fails on parameter difference.
Fix: Use strategy='fuzzy' for flexibility. Note that strategy='subset' would still fail here, because the expected value itself differs from the called one.
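To see why fuzzy matching rescues this case, compute the `difflib.SequenceMatcher` similarity of the two query strings directly:

```python
from difflib import SequenceMatcher

# The queries differ only in case and a trailing "s", so their
# character-level similarity clears the default 0.8 threshold.
ratio = SequenceMatcher(None, "Python tutorials", "python tutorial").ratio()
print(round(ratio, 2))  # ~0.9, above fuzzy_threshold=0.8
```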
Why It Matters¶
Verify AI agents select and call the right tools for tasks.
Test LLM function calling capabilities and parameter handling.
Ensure multi-step agent workflows execute correctly.
Quick Reference¶
TL;DR
Tool Correctness = Did the agent call the expected tools correctly?
- Use it when: Evaluating AI agents or function calling
- Score interpretation: Fraction of expected tools called correctly
- Key configs:
  `check_parameters`, `strict_order`, `parameter_matching_strategy`
- API Reference
- Related Concepts: Agent Evaluation · Function Calling · Tool Use