Langfuse Integration¶

Integrate Axion evaluation with Langfuse observability to close the feedback loop between production LLM operations and quality metrics.

Why Use Langfuse with Axion?¶

Close the feedback loop: Connect evaluation results directly to production traces
Production-grade observability: Track LLM performance, costs, and quality metrics in one place
Continuous evaluation: Automate evaluation pipelines on real production data

Evaluation Workflows¶

Axion supports three distinct evaluation workflows, each designed for different use cases. Understanding when to use each workflow is critical for effective LLM evaluation.

Workflow Overview¶

Workflow	Testing Style	Primary Use Case	Publishing Method
API-Driven	Black-box	Regression testing, CI/CD	`publish_as_experiment()`
Trace-Based	White-box	Historical analysis, debugging	`publish_as_experiment()` or `publish_to_observability()`
Online Production	Continuous	Real-time quality monitoring	`publish_to_observability()`

Offline: API-Driven¶

Best for: Regression testing, CI/CD pipelines, comparing API versions

This workflow treats your agent as a black-box. You provide inputs from a golden dataset, call your API endpoint, and evaluate the responses. You don't need access to internal traces or spans.

flowchart LR
    subgraph Source["Source"]
        direction TB
        A[(Langfuse)] -->|read_from_langfuse| B[/Golden Dataset/]
    end

    subgraph Execute["Execute"]
        direction TB
        B -->|execute_dataset_items_from_api| C[[APIRunner]]
        C -->|calls| D([Agent API])
        D -->|responses| B
    end

    subgraph Evaluate["Evaluate"]
        direction TB
        B --> E[[evaluation_runner]]
        E --> F{{EvaluationResult}}
    end

    subgraph Publish["Publish"]
        direction TB
        F -->|new experiment| G[publish_as_experiment]
        G --> A
    end

When to use:

You want to test API behavior without trace instrumentation
Running automated regression tests in CI/CD
Comparing responses between different API versions or models
The agent is deployed as a service and you only care about input/output behavior

Example:

from axion import Dataset
from axion.metrics import AnswerRelevancy, Faithfulness
from axion.runners import evaluation_runner

# 1. Load golden dataset from Langfuse
dataset = Dataset.read_from_langfuse(golden, name='my_eval_dataset')

# 2. Execute API calls to populate actual_output
dataset.execute_dataset_items_from_api('Agent API', 'config.yaml')

# 3. Run evaluation
result = await evaluation_runner(
    evaluation_inputs=dataset,
    scoring_metrics=[AnswerRelevancy(), Faithfulness()],
    evaluation_name='API Regression Test',
)

# 4. Publish as experiment for version comparison
result.publish_as_experiment(
    dataset_name='api-regression-tests',
    run_name='v2.1-release',
)

Offline: Trace-Based¶

Best for: Historical analysis, A/B experiments, debugging with span-level insights

This workflow evaluates agent runs where you have access to internal traces. You trigger agent runs from a golden dataset, collect the traces (with span-level details), then evaluate. This gives you white-box visibility into agent internals.

flowchart LR
    subgraph LF["Langfuse"]
        direction TB
        A[/Golden Dataset/]
        C[(Traces)]
    end

    subgraph Run["Run Agent"]
        direction TB
        B([Agent])
    end

    subgraph Fetch["Fetch"]
        direction TB
        D[LangfuseTraceLoader]
        E[/Dataset with Actuals/]
        D --> E
    end

    subgraph Evaluate["Evaluate"]
        direction TB
        F[[evaluation_runner]]
        G{{EvaluationResult}}
        F --> G
    end

    subgraph Publish["Publish"]
        direction TB
        H[publish_to_observability]
        I[publish_as_experiment]
    end

    A -->|for each item| B
    B -->|traces| C
    C -->|fetch_traces| D
    E --> F
    G -->|attach to traces| H
    G -->|new experiment| I
    H --> C
    I --> C

When to use:

You need span-level insights (LLM calls, tool usage, retrieval steps)
Debugging why specific responses failed
Running A/B experiments with different agent configurations
Evaluating historical production runs

Key difference from API-Driven: The actual_output is extracted from traces, giving you access to intermediate steps, not just final responses.

Example:

from axion.tracing import LangfuseTraceLoader
from axion.metrics import AnswerRelevancy, Faithfulness
from axion.runners import evaluation_runner
from axion import Dataset, DatasetItem

# 1. Run your agent against golden dataset items (externally)
# This creates traces in Langfuse with tags ['experiment-v1']

# 2. Fetch traces with outputs
loader = LangfuseTraceLoader()
traces = loader.fetch_traces(tags=['experiment-v1'])

# 3. Convert to Dataset (preserving trace_id for linking)
items = [
    DatasetItem(
        id=t.id,
        query=t.input.get('query', ''),
        actual_output=t.output.get('response', ''),
        trace_id=t.id,  # Preserves link to original trace
    )
    for t in traces if t.input and t.output
]
dataset = Dataset(items=items)

# 4. Run evaluation
result = await evaluation_runner(
    evaluation_inputs=dataset,
    scoring_metrics=[AnswerRelevancy(), Faithfulness()],
    evaluation_name='Trace-Based Evaluation',
)

# 5. Publish - choose based on your needs:
# Option A: Attach scores to existing traces (for debugging)
result.publish_to_observability()

# Option B: Create experiment (for comparison UI)
result.publish_as_experiment(
    dataset_name='agent-experiments',
    run_name='config-v2',
    link_to_traces=True,  # Links experiment runs to original traces
)

Online: Production¶

Best for: Continuous quality monitoring, real-time alerts, production health tracking

This workflow evaluates live production traces. You fetch recent traces from Langfuse, run evaluation metrics, and attach scores back to those traces for monitoring dashboards.

flowchart LR
    subgraph Production["Production"]
        direction TB
        A([Agent]) -->|traces| B[(Langfuse)]
    end

    subgraph Fetch["Fetch"]
        direction TB
        B -->|fetch_traces| C[LangfuseTraceLoader]
        C --> D[/Dataset/]
    end

    subgraph Evaluate["Evaluate"]
        direction TB
        D --> E[[evaluation_runner]]
        E --> F{{EvaluationResult}}
    end

    subgraph Publish["Publish"]
        direction TB
        F -->|attach scores| G[publish_to_observability]
        G --> B
    end

When to use:

Monitoring production quality in real-time
Setting up quality alerts and dashboards
Tracking quality drift over time
Evaluating a sample of production traffic

Example:

from axion.tracing import LangfuseTraceLoader
from axion.metrics import AnswerRelevancy, Toxicity
from axion.runners import evaluation_runner
from axion import Dataset, DatasetItem

async def evaluate_production_traces():
    # 1. Fetch recent production traces
    loader = LangfuseTraceLoader()
    traces = loader.fetch_traces(
        limit=100,
        tags=['production'],
    )

    # 2. Convert to Dataset
    items = [
        DatasetItem(
            id=t.id,
            query=t.input.get('query', ''),
            actual_output=t.output.get('response', ''),
            trace_id=t.id,
        )
        for t in traces if t.input and t.output
    ]
    dataset = Dataset(items=items)

    # 3. Run lightweight evaluation metrics
    result = await evaluation_runner(
        evaluation_inputs=dataset,
        scoring_metrics=[AnswerRelevancy(), Toxicity()],
        evaluation_name='Production Monitoring',
    )

    # 4. Attach scores to production traces
    stats = result.publish_to_observability(tags=['automated-eval'])
    print(f"Evaluated {stats['uploaded']} production traces")

# Run periodically (e.g., every hour via cron)

Cost Considerations for Online Evaluation

For high-volume production systems, consider:

Sampling: Evaluate a random sample (e.g., 1-5%) of traces instead of all
Lightweight metrics: Use heuristic metrics instead of LLM-based metrics for high-frequency evaluation
Batching: Aggregate traces and evaluate in batches during off-peak hours

Disabling Evaluation Tracing¶

By default, evaluation_runner creates traces for each metric execution. If you don't need these evaluation traces (most publishing workflows only use source traces from DatasetItem.trace_id), you can disable them to reduce overhead.

Important: Configure tracing to NOOP before creating metric instances, since tracers are cached at instantiation time.

from axion.tracing import configure_tracing
from axion.runners import evaluation_runner
from axion.metrics import AnswerRelevancy, ExactStringMatch

# 1. Disable tracing BEFORE creating metrics
configure_tracing('noop')

# 2. Create metrics (they will use NOOP tracers)
config = {
    'metric': {
        'Relevance': AnswerRelevancy(model_name='gpt-4o'),
        'ExactStringMatch': ExactStringMatch(),
    },
    'model': {
        'ANSWER_QUALITY': {
            'Relevance': 1.0,
            'ExactStringMatch': 1.0,
        },
    },
    'weights': {
        'ANSWER_QUALITY': 1.0,
    }
}

# 3. Run evaluation (no evaluation traces created)
results = evaluation_runner(
    evaluation_inputs=dataset,
    scoring_config=config,
    evaluation_name='My Evaluation',
)

# 4. Restore tracing for publishing
configure_tracing('langfuse')

# 5. Publish results (uses source traces from DatasetItem.trace_id)
results.publish_to_observability()

When to keep evaluation tracing enabled

Keep tracing enabled (default) when:

Using publish_as_experiment(score_on_runtime_traces=True)
You need to debug metric execution in Langfuse
You want visibility into LLM calls made by metrics

Choosing the Right Workflow¶

Use this decision tree to select the appropriate workflow:

Do you have existing traces in Langfuse?
├── No → Do you need span-level insights?
│         ├── No → Use API-Driven (black-box testing)
│         └── Yes → Run agent with tracing, then use Trace-Based
└── Yes → Are these production traces?
          ├── Yes → Use Online Production (monitoring)
          └── No → Use Trace-Based (historical analysis)

Question	API-Driven	Trace-Based	Online
Need span-level debugging?	No	Yes	Depends
Requires trace instrumentation?	No	Yes	Yes
Creates new traces?	Optional	Yes	No
Best for CI/CD?	Yes	Possible	No
Best for monitoring?	No	No	Yes

Quick Summary¶

flowchart LR
    A[Agent] -->|traces| B[Langfuse]
    B -->|fetch| C[LangfuseTraceLoader]
    C -->|convert| D[Dataset]
    D -->|evaluate| E[evaluation_runner]
    E -->|scores| F[EvaluationResult]
    F -->|publish| B

Detailed Workflow

flowchart LR
    subgraph Production["Production"]
        direction TB
        A([Agent]) -->|traces| B[(Langfuse)]
    end

    subgraph Fetch["Fetch"]
        direction TB
        B -->|fetch_traces| C[LangfuseTraceLoader]
        C --> D[/Dataset/]
        E[/Local Data/] --> D
    end

    subgraph Evaluate["Evaluate"]
        direction TB
        D --> F[[evaluation_runner]]
        F --> G{{EvaluationResult}}
    end

    subgraph Analyze["Analyze"]
        direction TB
        G --> H[summary]
        G --> I[to_dataframe]
        G --> J[to_scorecard]
    end

    subgraph Publish["Publish"]
        direction TB
        G -->|existing traces| K[publish_to_observability]
        G -->|new experiment| L[publish_as_experiment]
        K --> B
        L --> B
    end

Complete Example¶

This example demonstrates the full workflow: fetching traces, running evaluation, viewing results, and publishing back to Langfuse.

import asyncio
from axion.tracing import LangfuseTraceLoader
from axion.metrics import AnswerRelevancy, AnswerCompleteness
from axion.runners import evaluation_runner
from axion import Dataset, DatasetItem

async def main():
    # 1. Fetch traces from Langfuse
    loader = LangfuseTraceLoader()
    traces = loader.fetch_traces(limit=50, tags=['production'])

    # 2. Convert to Dataset
    items = [
        DatasetItem(
            id=t.id,
            query=t.input.get('query', ''),
            actual_output=t.output.get('response', ''),
            trace_id=t.id,
        )
        for t in traces if t.input and t.output
    ]
    dataset = Dataset(items=items)

    # 3. Run evaluation
    result = await evaluation_runner(
        evaluation_inputs=dataset,
        scoring_metrics=[AnswerRelevancy(), AnswerCompleteness()],
        evaluation_name='Production Evaluation',
    )

    # 4. View results
    from axion.runners.summary import MetricSummary
    MetricSummary().execute(result.results, total_time=100)
    result.to_scorecard(display_in_notebook=True)

    # 5. Publish back to Langfuse
    stats = result.publish_as_experiment(
        dataset_name='my-eval-dataset',
        run_name='experiment-v1',
        tags=['production']
    )
    print(f"Published {stats['scores_uploaded']} scores")

asyncio.run(main())

Understanding Results¶

After running evaluation_runner, use these methods to analyze results:

Method	Description
`MetricSummary().execute(result.results, total_time)`	Generate detailed metric analysis report
`result.to_dataframe()`	Convert results to a pandas DataFrame for analysis
`result.to_scorecard(display_in_notebook=True)`	Display an interactive scorecard visualization
`result.to_latency_plot()`	Visualize metric latency distributions

Quick Analysis¶

from axion.runners.summary import MetricSummary

# Generate detailed summary report
MetricSummary().execute(result.results, total_time=100)

# Export to DataFrame for custom analysis
df = result.to_dataframe()
print(df.describe())

# Visual scorecard (in Jupyter notebooks)
result.to_scorecard(display_in_notebook=True)

Workflow Steps¶

Configure Langfuse: Set up credentials and environment
Create Traces: Instrument your LLM operations with @trace decorator
Explore Traces: Navigate and filter traces with TraceCollection
Fetch & Evaluate: Retrieve traces and run metrics
Publish Results: Send scores back to Langfuse

Next Steps¶

Configuration: Set up Langfuse credentials and options
Tracing: Learn how to create and manage traces
Trace Collection: Rich trace exploration, filtering, and dataset conversion
Publishing: Publish scores to existing traces or create experiments
Metrics Guide: Available metrics and customization options