
Tool Metrics

Evaluate AI agent tool calling correctness and effectiveness

Tool metrics evaluate the correctness and effectiveness of tool usage in AI agent workflows. These metrics assess whether agents correctly invoke the right tools with appropriate parameters.


Available Metrics

Tool Correctness

Evaluate if expected tools were correctly called

Required fields: tools_called, expected_tools

Quick Reference

Metric | Score Range | Threshold | Key Question
Tool Correctness | 0.0 – 1.0 | 0.5 | Were the right tools called correctly?

Usage Example

from axion.metrics import ToolCorrectness
from axion.runners import MetricRunner
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

# Create evaluation item
item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "weather in Paris"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={"query": "weather in Paris"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
)

# Initialize metric
metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)

# Run evaluation (await must be called inside an async function / event loop)
runner = MetricRunner(metrics=[metric])
results = await runner.run([item])

print(f"Tool Correctness: {results[0].score:.2f}")
# Output: Tool Correctness: 1.00

Evaluation Modes

Tool Correctness supports multiple evaluation strategies:

Name Only (Default)

Verify only that the correct tools were called; parameter values are ignored.

metric = ToolCorrectness()
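
The exact scoring formula is not shown on this page; as a rough mental model (an illustrative sketch, not the library's internals), name-only mode can be thought of as set overlap over tool names:

```python
# Illustrative sketch only: name-only tool correctness as the fraction
# of expected tool names that appear among the called tools.
def tool_correctness_name_only(tools_called, expected_tools):
    """Score = |expected names ∩ called names| / |expected names|."""
    called = {t["name"] for t in tools_called}
    expected = {t["name"] for t in expected_tools}
    if not expected:
        return 1.0
    return len(expected & called) / len(expected)

score = tool_correctness_name_only(
    [{"name": "search"}, {"name": "format"}],
    [{"name": "search"}, {"name": "summarize"}],
)
print(score)  # 0.5: "search" matched, "summarize" was never called
```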
With Parameters

Validate both tool names and their arguments.

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)
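
Conceptually (again as a sketch, not the ToolCorrectness implementation), exact parameter matching means a call only counts when both the tool name and its arguments match:

```python
# Illustrative sketch only: with exact parameter matching, a call counts
# only if name AND args are both identical to an expected call.
def tool_correctness_exact_params(tools_called, expected_tools):
    """Fraction of expected (name, args) pairs found among the calls."""
    def key(tool):
        # Sort args so dict ordering doesn't affect comparison;
        # assumes arg values are hashable (strings, numbers, ...).
        return (tool["name"], tuple(sorted(tool["args"].items())))

    called = {key(t) for t in tools_called}
    if not expected_tools:
        return 1.0
    return sum(key(t) in called for t in expected_tools) / len(expected_tools)

score = tool_correctness_exact_params(
    [{"name": "search", "args": {"query": "weather in Paris"}}],
    [{"name": "search", "args": {"query": "weather in Berlin"}}],
)
print(score)  # 0.0: right tool, wrong query
```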
Strict Order

Tools must be called in the exact expected sequence.

metric = ToolCorrectness(strict_order=True)
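
One plausible interpretation of strict ordering (a sketch, not necessarily the library's exact rule): each expected tool must appear in the call sequence after the previous match, so swapped calls lose credit:

```python
# Illustrative sketch only: greedy in-order matching of the expected
# tool sequence against the actual call sequence.
def tool_correctness_strict_order(called_names, expected_names):
    """Each expected tool must appear after the previous match."""
    if not expected_names:
        return 1.0
    i = 0
    matched = 0
    for name in expected_names:
        while i < len(called_names) and called_names[i] != name:
            i += 1
        if i < len(called_names):
            matched += 1
            i += 1  # later expected tools must match after this position
    return matched / len(expected_names)

score = tool_correctness_strict_order(
    ["format", "search"],          # actual order
    ["search", "format"],          # expected order
)
print(score)  # 0.5: both tools were called, but "format" came too early
```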
Fuzzy Parameters

Allow similar (but not identical) parameter values.

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='fuzzy',
    fuzzy_threshold=0.8
)
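
The page does not specify which similarity measure fuzzy matching uses; as a stand-in, here is a sketch using Python's difflib, where each expected argument must clear the threshold:

```python
# Illustrative sketch of fuzzy parameter matching; difflib's character-level
# ratio stands in for whatever similarity measure the library actually uses.
from difflib import SequenceMatcher

def params_match_fuzzy(actual_args, expected_args, threshold=0.8):
    """True if every expected arg value is at least `threshold` similar
    to the corresponding actual value."""
    for name, expected in expected_args.items():
        actual = actual_args.get(name)
        if actual is None:
            return False
        similarity = SequenceMatcher(None, str(actual), str(expected)).ratio()
        if similarity < threshold:
            return False
    return True

# A case-only difference clears a 0.8 threshold; an unrelated value does not.
print(params_match_fuzzy({"query": "weather in Paris"},
                         {"query": "weather in paris"}))  # True
print(params_match_fuzzy({"query": "weather in Paris"},
                         {"query": "stock prices"}))      # False
```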

Why Tool Metrics?

🤖 Agent Evaluation

Verify AI agents select the right tools for tasks.

🔧 Function Calling

Test LLM function calling capabilities.

📊 Workflow Validation

Ensure multi-step workflows execute correctly.

🧪 Regression Testing

Catch breaking changes in agent behavior.