Tool Correctness

Evaluate whether AI agents call the correct tools with proper parameters
Tool Agent Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Recall of expected tools

⚡
Default Threshold
0.5
Pass/fail cutoff

📋
Required Inputs
`tools_called`, `expected_tools`
Tool call lists

What It Measures

Tool Correctness evaluates whether an AI agent called the correct tools by comparing actual tool calls against expected ones. It supports name-only matching, parameter validation, and strict ordering requirements.

Score Interpretation
  • 1.0: All expected tools called correctly
  • 0.5–0.9: Partial match (some tools missing)
  • 0.0: No expected tools called correctly
✅ Use When
  • Evaluating AI agents
  • Testing function calling
  • Validating tool selection
  • Checking parameter passing
❌ Don't Use When
  • Order doesn't matter (disable strict_order rather than skipping the metric)
  • Tool output quality matters more
  • Parameters have valid variations
  • No expected tools defined

How It Works

The metric compares called tools against expected tools with configurable matching strategies.

Step-by-Step Process

```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Tools Called]
        B[Expected Tools]
    end

    subgraph CONFIG["⚙️ Configuration"]
        C{Strict Order?}
        D{Check Parameters?}
    end

    subgraph MATCH["🔍 Matching"]
        E[Compare names]
        F[Validate parameters]
        G[Check sequence]
    end

    subgraph SCORE["📊 Score"]
        H["matched / expected"]
    end

    A & B --> C
    C -->|No| E
    C -->|Yes| G
    E --> D
    G --> D
    D -->|Yes| F
    D -->|No| H
    F --> H

    style INPUT stroke:#8b5cf6,stroke-width:2px
    style CONFIG stroke:#3b82f6,stroke-width:2px
    style MATCH stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
```

Name Only (default)
Only the tool name must match; parameters are ignored.

Exact Parameters
Parameters must match exactly.

Subset Parameters
Called args must contain all expected args (extras OK).

Fuzzy Parameters
Similarity-based matching with threshold.

score = matched_tools / total_expected_tools

Example:

  • Expected: [search, calculate, format]
  • Called: [search, format]
  • Score: 2/3 ≈ 0.67
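Under name-only matching, the formula above is simple recall over expected tool names. A minimal sketch of that computation (`name_only_score` is a hypothetical helper for illustration, not the library's internal code):

```python
from collections import Counter

def name_only_score(called: list[str], expected: list[str]) -> float:
    """Recall of expected tool names: matched / total expected.

    Multiset counting means a tool expected twice must be called
    twice to count as fully matched.
    """
    if not expected:
        return 0.0
    called_counts = Counter(called)
    matched = 0
    for name, need in Counter(expected).items():
        matched += min(need, called_counts.get(name, 0))
    return matched / len(expected)

# Expected [search, calculate, format], called [search, format]:
print(name_only_score(["search", "format"], ["search", "calculate", "format"]))
# ≈ 0.67 (2 of 3 expected tools called)
```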


Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `check_parameters` | bool | False | Also validate tool parameters |
| `strict_order` | bool | False | Tools must be called in exact order |
| `parameter_matching_strategy` | str | exact | `exact`, `subset`, or `fuzzy` |
| `fuzzy_threshold` | float | 0.8 | Similarity threshold for fuzzy matching |

Parameter Matching Strategies

  • exact: Parameters must match exactly (default)
  • subset: Called args must contain all expected args (extras allowed)
  • fuzzy: Similarity-based matching using SequenceMatcher
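To make the three strategies concrete, here is an illustrative sketch of how a pair of argument dicts might be compared under each one. `params_match` is a hypothetical helper, not the library's implementation; `SequenceMatcher` is Python's standard difflib class:

```python
from difflib import SequenceMatcher

def params_match(called: dict, expected: dict, strategy: str = "exact",
                 fuzzy_threshold: float = 0.8) -> bool:
    if strategy == "exact":
        # Every key and value must be identical on both sides.
        return called == expected
    if strategy == "subset":
        # Every expected key/value must appear in the called args; extras are OK.
        return all(called.get(k) == v for k, v in expected.items())
    if strategy == "fuzzy":
        # Compare the string form of each expected value against the called value.
        for k, v in expected.items():
            ratio = SequenceMatcher(None, str(called.get(k, "")), str(v)).ratio()
            if ratio < fuzzy_threshold:
                return False
        return True
    raise ValueError(f"unknown strategy: {strategy}")

# Subset tolerates the extra "page" argument:
print(params_match({"q": "py", "page": 1}, {"q": "py"}, strategy="subset"))  # True
# Exact does not:
print(params_match({"q": "py", "page": 1}, {"q": "py"}, strategy="exact"))  # False
```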

Code Examples

Basic Usage

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness()

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "weather"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={}),
        ToolCall(name="format", args={}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - both tools called (params not checked)
```
Exact Parameter Matching

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
    expected_tools=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - params match exactly
```
Strict Ordering

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(strict_order=True)

# Correct order
item_correct = DatasetItem(
    tools_called=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 1.0

# Wrong order
item_wrong = DatasetItem(
    tools_called=[
        ToolCall(name="process", args={}),  # Should be second
        ToolCall(name="fetch", args={}),    # Should be first
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 0.0 - order mismatch at position 0
```
Fuzzy Parameter Matching

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='fuzzy',
    fuzzy_threshold=0.8
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "what is machine learning"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={"query": "what is ML"}),
    ],
)

result = await metric.execute(item)
# Score depends on string similarity of the query values
```
Batch Evaluation

```python
from axion.metrics import ToolCorrectness
from axion.runners import MetricRunner

metric = ToolCorrectness(check_parameters=True)
runner = MetricRunner(metrics=[metric])

# `dataset` is a previously built collection of DatasetItem objects
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score:.2f}")
    print(f"Explanation: {item_result.explanation}")
```

Example Scenarios

✅ Scenario 1: Perfect Match (Score: 1.0)

All Tools Correct

Expected Tools:

  1. search(query="weather")
  2. parse(format="json")

Called Tools:

  1. search(query="weather")
  2. parse(format="json")

Result: 1.0

All expected tools called with correct parameters.

⚠️ Scenario 2: Partial Match (Score: 0.67)

Missing Tool

Expected Tools:

  1. fetch
  2. transform
  3. store

Called Tools:

  1. fetch
  2. transform (store not called)

Result: 2/3 = 0.67

Explanation: "Correctly called: ['fetch', 'transform']; Missing tools: ['store']"

❌ Scenario 3: Wrong Tool (Score: 0.0)

Incorrect Tool Called

Expected Tools:

  1. calculate

Called Tools:

  1. search

Result: 0.0

Explanation: "Missing tools: ['calculate']; Unexpected tools: ['search']"
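The explanation strings in these scenarios follow directly from set differences between the expected and called tool names. A hypothetical sketch of that bookkeeping (`diff_tools` is an illustrative helper, not the library's code):

```python
def diff_tools(called: list[str], expected: list[str]) -> str:
    """Summarize correct, missing, and unexpected tool names."""
    correct = [t for t in expected if t in called]
    missing = [t for t in expected if t not in called]
    unexpected = [t for t in called if t not in expected]
    parts = []
    if correct:
        parts.append(f"Correctly called: {correct}")
    if missing:
        parts.append(f"Missing tools: {missing}")
    if unexpected:
        parts.append(f"Unexpected tools: {unexpected}")
    return "; ".join(parts)

print(diff_tools(["search"], ["calculate"]))
# Missing tools: ['calculate']; Unexpected tools: ['search']
```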

⚠️ Scenario 4: Parameter Mismatch

Wrong Parameters

Config: check_parameters=True, strategy='exact'

Expected:

search(query="Python tutorials")

Called:

search(query="python tutorial") (different text)

Result: 0.0

Exact matching fails on parameter difference.

Fix: Use strategy='fuzzy' to tolerate minor wording differences (subset still requires exact values for expected keys, so it would not help here).


Why It Matters

🤖 Agent Evaluation

Verify AI agents select and call the right tools for tasks.

🔧 Function Calling

Test LLM function calling capabilities and parameter handling.

📊 Workflow Validation

Ensure multi-step agent workflows execute correctly.


Quick Reference

TL;DR

Tool Correctness = Did the agent call the expected tools correctly?

  • Use it when: Evaluating AI agents or function calling
  • Score interpretation: Fraction of expected tools called correctly
  • Key configs: check_parameters, strict_order, parameter_matching_strategy