
Tool Metrics

Evaluate AI agent tool calling correctness and effectiveness

Tool metrics evaluate the correctness and effectiveness of tool usage in AI agent workflows. These metrics assess whether agents correctly invoke the right tools with appropriate parameters.


Available Metrics

Tool Correctness

Evaluate if expected tools were correctly called

Required fields: tools_called, expected_tools

Quick Reference

Metric | Score Range | Threshold | Key Question
Tool Correctness | 0.0 – 1.0 | 0.5 | Were the right tools called correctly?

Usage Example

from axion.metrics import ToolCorrectness
from axion.runners import MetricRunner
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

# Create evaluation item
item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "weather in Paris"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={"query": "weather in Paris"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
)

# Initialize metric
metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)

# Run evaluation (await must be called inside an async function / event loop)
runner = MetricRunner(metrics=[metric])
results = await runner.run([item])

print(f"Tool Correctness: {results[0].score:.2f}")
# Output: Tool Correctness: 1.00

Evaluation Modes

Tool Correctness supports multiple evaluation strategies:

Name Only (Default)

Verify only that the correct tools were called; parameter values are ignored.

metric = ToolCorrectness()
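
The exact scoring formula is not shown on this page; as a rough mental model (an illustrative sketch, not the library's internals), name-only mode can be thought of as set overlap over tool names:

```python
# Illustrative sketch only: name-only tool correctness as the fraction
# of expected tool names that appear among the called tools.
def tool_correctness_name_only(tools_called, expected_tools):
    """Score = |expected names ∩ called names| / |expected names|."""
    called = {t["name"] for t in tools_called}
    expected = {t["name"] for t in expected_tools}
    if not expected:
        return 1.0
    return len(expected & called) / len(expected)

score = tool_correctness_name_only(
    [{"name": "search"}, {"name": "format"}],
    [{"name": "search"}, {"name": "summarize"}],
)
print(score)  # 0.5: "search" matched, "summarize" was never called
```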
With Parameters

Validate both tool names and their arguments.

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)
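
Conceptually (again as a sketch, not the ToolCorrectness implementation), exact parameter matching means a call only counts when both the tool name and its arguments match:

```python
# Illustrative sketch only: with exact parameter matching, a call counts
# only if name AND args are both identical to an expected call.
def tool_correctness_exact_params(tools_called, expected_tools):
    """Fraction of expected (name, args) pairs found among the calls."""
    def key(tool):
        # Sort args so dict ordering doesn't affect comparison;
        # assumes arg values are hashable (strings, numbers, ...).
        return (tool["name"], tuple(sorted(tool["args"].items())))

    called = {key(t) for t in tools_called}
    if not expected_tools:
        return 1.0
    return sum(key(t) in called for t in expected_tools) / len(expected_tools)

score = tool_correctness_exact_params(
    [{"name": "search", "args": {"query": "weather in Paris"}}],
    [{"name": "search", "args": {"query": "weather in Berlin"}}],
)
print(score)  # 0.0: right tool, wrong query
```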
Strict Order

Tools must be called in the exact expected sequence.

metric = ToolCorrectness(strict_order=True)
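
One plausible interpretation of strict ordering (a sketch, not necessarily the library's exact rule): each expected tool must appear in the call sequence after the previous match, so swapped calls lose credit:

```python
# Illustrative sketch only: greedy in-order matching of the expected
# tool sequence against the actual call sequence.
def tool_correctness_strict_order(called_names, expected_names):
    """Each expected tool must appear after the previous match."""
    if not expected_names:
        return 1.0
    i = 0
    matched = 0
    for name in expected_names:
        while i < len(called_names) and called_names[i] != name:
            i += 1
        if i < len(called_names):
            matched += 1
            i += 1  # later expected tools must match after this position
    return matched / len(expected_names)

score = tool_correctness_strict_order(
    ["format", "search"],          # actual order
    ["search", "format"],          # expected order
)
print(score)  # 0.5: both tools were called, but "format" came too early
```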
Fuzzy Parameters

Allow similar (but not identical) parameter values.

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='fuzzy',
    fuzzy_threshold=0.8
)
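
The page does not specify which similarity measure fuzzy matching uses; as a stand-in, here is a sketch using Python's difflib, where each expected argument must clear the threshold:

```python
# Illustrative sketch of fuzzy parameter matching; difflib's character-level
# ratio stands in for whatever similarity measure the library actually uses.
from difflib import SequenceMatcher

def params_match_fuzzy(actual_args, expected_args, threshold=0.8):
    """True if every expected arg value is at least `threshold` similar
    to the corresponding actual value."""
    for name, expected in expected_args.items():
        actual = actual_args.get(name)
        if actual is None:
            return False
        similarity = SequenceMatcher(None, str(actual), str(expected)).ratio()
        if similarity < threshold:
            return False
    return True

# A case-only difference clears a 0.8 threshold; an unrelated value does not.
print(params_match_fuzzy({"query": "weather in Paris"},
                         {"query": "weather in paris"}))  # True
print(params_match_fuzzy({"query": "weather in Paris"},
                         {"query": "stock prices"}))      # False
```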

Why Tool Metrics?

🤖 Agent Evaluation

Verify AI agents select the right tools for tasks.

🔧 Function Calling

Test LLM function calling capabilities.

📊 Workflow Validation

Ensure multi-step workflows execute correctly.

🧪 Regression Testing

Catch breaking changes in agent behavior.