Tool Correctness

Evaluate whether AI agents call the correct tools with proper parameters
Tool Agent Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Recall of expected tools

⚡
Default Threshold
0.5
Pass/fail cutoff

📋
Required Inputs
`tools_called`, `expected_tools`
Tool call lists

What It Measures

Tool Correctness evaluates whether an AI agent called the correct tools by comparing actual tool calls against expected ones. It supports name-only matching, parameter validation, and strict ordering requirements.

Score Interpretation
  • 1.0: All expected tools called correctly
  • 0.5–0.9: Partial match (some tools missing)
  • 0.0: No expected tools called correctly
✅ Use When
  • Evaluating AI agents
  • Testing function calling
  • Validating tool selection
  • Checking parameter passing
❌ Don't Use When
  • Order doesn't matter (disable strict_order rather than skipping the metric)
  • Tool output quality matters more
  • Parameters have valid variations
  • No expected tools defined

How It Works

The metric compares called tools against expected tools with configurable matching strategies.

Step-by-Step Process

```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Tools Called]
        B[Expected Tools]
    end

    subgraph CONFIG["⚙️ Configuration"]
        C{Strict Order?}
        D{Check Parameters?}
    end

    subgraph MATCH["🔍 Matching"]
        E[Compare names]
        F[Validate parameters]
        G[Check sequence]
    end

    subgraph SCORE["📊 Score"]
        H["matched / expected"]
    end

    A & B --> C
    C -->|No| E
    C -->|Yes| G
    E --> D
    G --> D
    D -->|Yes| F
    D -->|No| H
    F --> H

    style INPUT stroke:#8b5cf6,stroke-width:2px
    style CONFIG stroke:#3b82f6,stroke-width:2px
    style MATCH stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
```

Name Only (default)
Only the tool name must match; parameters are ignored.

Exact Parameters
Parameters must match exactly.

Subset Parameters
Called args must contain all expected args (extras OK).

Fuzzy Parameters
Similarity-based matching with threshold.

score = matched_tools / total_expected_tools

Example:

  • Expected: [search, calculate, format]
  • Called: [search, format]
  • Score: 2/3 ≈ 0.67
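Under name-only matching, the formula above is simple recall over expected tool names. A minimal sketch of that computation (`name_only_score` is a hypothetical helper for illustration, not the library's internal code):

```python
from collections import Counter

def name_only_score(called: list[str], expected: list[str]) -> float:
    """Recall of expected tool names: matched / total expected.

    Multiset counting means a tool expected twice must be called
    twice to count as fully matched.
    """
    if not expected:
        return 0.0
    called_counts = Counter(called)
    matched = 0
    for name, need in Counter(expected).items():
        matched += min(need, called_counts.get(name, 0))
    return matched / len(expected)

# Expected [search, calculate, format], called [search, format]:
print(name_only_score(["search", "format"], ["search", "calculate", "format"]))
# ≈ 0.67 (2 of 3 expected tools called)
```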


Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `check_parameters` | bool | False | Also validate tool parameters |
| `strict_order` | bool | False | Tools must be called in exact order |
| `parameter_matching_strategy` | str | exact | `exact`, `subset`, or `fuzzy` |
| `fuzzy_threshold` | float | 0.8 | Similarity threshold for fuzzy matching |

Parameter Matching Strategies

  • exact: Parameters must match exactly (default)
  • subset: Called args must contain all expected args (extras allowed)
  • fuzzy: Similarity-based matching using SequenceMatcher
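To make the three strategies concrete, here is an illustrative sketch of how a pair of argument dicts might be compared under each one. `params_match` is a hypothetical helper, not the library's implementation; `SequenceMatcher` is Python's standard difflib class:

```python
from difflib import SequenceMatcher

def params_match(called: dict, expected: dict, strategy: str = "exact",
                 fuzzy_threshold: float = 0.8) -> bool:
    if strategy == "exact":
        # Every key and value must be identical on both sides.
        return called == expected
    if strategy == "subset":
        # Every expected key/value must appear in the called args; extras are OK.
        return all(called.get(k) == v for k, v in expected.items())
    if strategy == "fuzzy":
        # Compare the string form of each expected value against the called value.
        for k, v in expected.items():
            ratio = SequenceMatcher(None, str(called.get(k, "")), str(v)).ratio()
            if ratio < fuzzy_threshold:
                return False
        return True
    raise ValueError(f"unknown strategy: {strategy}")

# Subset tolerates the extra "page" argument:
print(params_match({"q": "py", "page": 1}, {"q": "py"}, strategy="subset"))  # True
# Exact does not:
print(params_match({"q": "py", "page": 1}, {"q": "py"}, strategy="exact"))  # False
```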

Code Examples

Basic Usage

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness()

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "weather"}),
        ToolCall(name="format", args={"style": "brief"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={}),
        ToolCall(name="format", args={}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - both tools called (params not checked)
```
Exact Parameter Matching

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='exact'
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
    expected_tools=[
        ToolCall(name="calculate", args={"a": 5, "b": 3}),
    ],
)

result = await metric.execute(item)
print(result.score)  # 1.0 - params match exactly
```
Strict Ordering

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(strict_order=True)

# Correct order
item_correct = DatasetItem(
    tools_called=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 1.0

# Wrong order
item_wrong = DatasetItem(
    tools_called=[
        ToolCall(name="process", args={}),  # Should be second
        ToolCall(name="fetch", args={}),    # Should be first
        ToolCall(name="store", args={}),
    ],
    expected_tools=[
        ToolCall(name="fetch", args={}),
        ToolCall(name="process", args={}),
        ToolCall(name="store", args={}),
    ],
)
# Score: 0.0 - order mismatch at position 0
```
Fuzzy Parameter Matching

```python
from axion.metrics import ToolCorrectness
from axion.dataset import DatasetItem
from axion._core.schema import ToolCall

metric = ToolCorrectness(
    check_parameters=True,
    parameter_matching_strategy='fuzzy',
    fuzzy_threshold=0.8
)

item = DatasetItem(
    tools_called=[
        ToolCall(name="search", args={"query": "what is machine learning"}),
    ],
    expected_tools=[
        ToolCall(name="search", args={"query": "what is ML"}),
    ],
)

result = await metric.execute(item)
# Score depends on string similarity of the query values
```
Batch Evaluation

```python
from axion.metrics import ToolCorrectness
from axion.runners import MetricRunner

metric = ToolCorrectness(check_parameters=True)
runner = MetricRunner(metrics=[metric])

# `dataset` is a previously built collection of DatasetItem objects
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score:.2f}")
    print(f"Explanation: {item_result.explanation}")
```

Example Scenarios

✅ Scenario 1: Perfect Match (Score: 1.0)

All Tools Correct

Expected Tools:

  1. search(query="weather")
  2. parse(format="json")

Called Tools:

  1. search(query="weather")
  2. parse(format="json")

Result: 1.0

All expected tools called with correct parameters.

⚠️ Scenario 2: Partial Match (Score: 0.67)

Missing Tool

Expected Tools:

  1. fetch
  2. transform
  3. store

Called Tools:

  1. fetch
  2. transform (store not called)

Result: 2/3 = 0.67

Explanation: "Correctly called: ['fetch', 'transform']; Missing tools: ['store']"

❌ Scenario 3: Wrong Tool (Score: 0.0)

Incorrect Tool Called

Expected Tools:

  1. calculate

Called Tools:

  1. search

Result: 0.0

Explanation: "Missing tools: ['calculate']; Unexpected tools: ['search']"
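The explanation strings in these scenarios follow directly from set differences between the expected and called tool names. A hypothetical sketch of that bookkeeping (`diff_tools` is an illustrative helper, not the library's code):

```python
def diff_tools(called: list[str], expected: list[str]) -> str:
    """Summarize correct, missing, and unexpected tool names."""
    correct = [t for t in expected if t in called]
    missing = [t for t in expected if t not in called]
    unexpected = [t for t in called if t not in expected]
    parts = []
    if correct:
        parts.append(f"Correctly called: {correct}")
    if missing:
        parts.append(f"Missing tools: {missing}")
    if unexpected:
        parts.append(f"Unexpected tools: {unexpected}")
    return "; ".join(parts)

print(diff_tools(["search"], ["calculate"]))
# Missing tools: ['calculate']; Unexpected tools: ['search']
```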

⚠️ Scenario 4: Parameter Mismatch

Wrong Parameters

Config: check_parameters=True, strategy='exact'

Expected:

search(query="Python tutorials")

Called:

search(query="python tutorial") (different text)

Result: 0.0

Exact matching fails on parameter difference.

Fix: Use strategy='fuzzy' to tolerate minor wording differences (subset still requires exact values for expected keys, so it would not help here).


Why It Matters

🤖 Agent Evaluation

Verify AI agents select and call the right tools for tasks.

🔧 Function Calling

Test LLM function calling capabilities and parameter handling.

📊 Workflow Validation

Ensure multi-step agent workflows execute correctly.


Quick Reference

TL;DR

Tool Correctness = Did the agent call the expected tools correctly?

  • Use it when: Evaluating AI agents or function calling
  • Score interpretation: Fraction of expected tools called correctly
  • Key configs: check_parameters, strict_order, parameter_matching_strategy