Experiments

Experiments are the core abstraction for running evaluations in VoltAgent. They define how to test your agents, what data to use, and how to measure success.

Creating Experiments

Use createExperiment from @voltagent/evals to define an evaluation experiment:

import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";

export default createExperiment({
  id: "customer-support-quality",
  label: "Customer Support Quality",
  description: "Evaluate customer support agent responses",

  // Reference a dataset by name
  dataset: {
    name: "support-qa-dataset",
  },

  // Define the runner function to evaluate
  runner: async ({ item, index, total }) => {
    // Access the dataset item
    const input = item.input;
    const expected = item.expected; // expected answer, available for prompt construction or comparison

    // Run your evaluation logic
    const startedAt = Date.now();
    const response = await myAgent.generateText(input);

    // Return the output
    return {
      output: response.text,
      metadata: {
        processingTimeMs: Date.now() - startedAt,
        modelUsed: "gpt-4o-mini",
      },
    };
  },

  // Configure scorers
  scorers: [
    scorers.exactMatch,
    {
      scorer: scorers.levenshtein,
      threshold: 0.8,
    },
  ],

  // Pass criteria
  passCriteria: {
    type: "meanScore",
    min: 0.7,
  },
});

Experiment Configuration

Required Fields

interface ExperimentConfig {
  // Unique identifier for the experiment
  id: string;

  // The runner function that executes for each dataset item
  runner: ExperimentRunner;

  // Optional but recommended
  label?: string;
  description?: string;
}
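
For orientation, a minimal experiment needs little more than these required fields. The sketch below is illustrative, not copied from the library: it assumes a small inline dataset and a trivial runner, and the id and item values are placeholders.

import { createExperiment } from "@voltagent/evals";

// Minimal sketch: required fields plus a small inline dataset.
export default createExperiment({
  id: "minimal-echo-check",
  runner: async ({ item }) => {
    // Replace with your own evaluation logic.
    return { output: String(item.input).toUpperCase() };
  },
  dataset: {
    items: [{ id: "1", input: "hello", expected: "HELLO" }],
  },
});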

Runner Function

The runner function is what you're evaluating. It receives a context object and produces output:

type ExperimentRunner = (context: ExperimentRunnerContext) => Promise<ExperimentRunnerReturn>;

interface ExperimentRunnerContext {
  item: ExperimentDatasetItem; // Current dataset item
  index: number; // Item index
  total?: number; // Total items (if known)
  signal?: AbortSignal; // For cancellation
  voltOpsClient?: any; // VoltOps client if configured
  runtime?: {
    runId?: string;
    startedAt?: number;
    tags?: readonly string[];
  };
}
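
The ExperimentRunnerReturn type is not reproduced on this page; judging from the examples below it carries at least an output plus optional metadata. A rough, non-authoritative sketch:

// Inferred from the examples on this page, not copied from @voltagent/evals;
// the real ExperimentRunnerReturn may define additional fields.
interface ExperimentRunnerReturnSketch {
  output: unknown; // What scorers compare against item.expected
  metadata?: Record<string, unknown>; // Free-form data stored with the result
}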

Example runners:

// Simple text generation
runner: async ({ item }) => {
  const result = await processInput(item.input);
  return {
    output: result,
    metadata: {
      confidence: 0.95,
    },
  };
};

// Using expected value for comparison
runner: async ({ item }) => {
  const prompt = `Question: ${item.input}\nExpected answer format: ${item.expected}`;
  const result = await generateResponse(prompt);
  return { output: result };
};

// With error handling
runner: async ({ item, signal }) => {
  try {
    const result = await processWithTimeout(item.input, signal);
    return { output: result };
  } catch (error) {
    return {
      output: null,
      metadata: {
        error: error.message,
        failed: true,
      },
    };
  }
};

// Accessing runtime context
runner: async ({ item, index, total, runtime }) => {
  console.log(`Processing item ${index + 1}/${total}`);
  console.log(`Run ID: ${runtime?.runId}`);

  const result = await process(item.input);
  return {
    output: result,
  };
};

Dataset Configuration

Experiments can use datasets in multiple ways:

// Reference registered dataset by name
dataset: {
  name: "my-dataset"
}

// Reference by ID
dataset: {
  id: "dataset-uuid",
  versionId: "version-uuid" // Optional specific version
}

// Limit number of items
dataset: {
  name: "large-dataset",
  limit: 100 // Only use first 100 items
}

// Inline items
dataset: {
  items: [
    {
      id: "1",
      input: { prompt: "What is 2+2?" },
      expected: "4"
    },
    {
      id: "2",
      input: { prompt: "Capital of France?" },
      expected: "Paris"
    }
  ]
}

// Dynamic resolver
dataset: {
  resolve: async ({ limit, signal }) => {
    const items = await fetchDatasetItems(limit);
    return {
      items,
      total: items.length,
      dataset: {
        name: "Dynamic Dataset",
        description: "Fetched at runtime"
      }
    };
  }
}

Dataset Item Structure

interface ExperimentDatasetItem {
  id: string; // Unique item ID
  label?: string; // Optional display name
  input: any; // Input data (your format)
  expected?: any; // Expected output (optional)
  extra?: Record<string, any>; // Additional data
  metadata?: Record<string, any>; // Item metadata

  // Automatically added if from registered dataset
  datasetId?: string;
  datasetVersionId?: string;
  datasetName?: string;
}
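
For example, an inline item that fits this shape might look like the following (all values are illustrative):

const item: ExperimentDatasetItem = {
  id: "faq-42",
  label: "Refund policy question",
  input: { prompt: "How do I request a refund?" },
  expected: "Refunds can be requested within 30 days of purchase.",
  metadata: { category: "billing" },
};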

Scorers Configuration

Configure how experiments use scorers:

import { scorers } from "@voltagent/scorers";

scorers: [
  // Use prebuilt scorer directly
  scorers.exactMatch,

  // Configure scorer with threshold
  {
    scorer: scorers.levenshtein,
    threshold: 0.9,
    name: "String Similarity",
  },

  // Custom scorer with metadata
  {
    scorer: myCustomScorer,
    threshold: 0.7,
    metadata: {
      category: "custom",
      version: "1.0.0",
    },
  },
];

Pass Criteria

Define success conditions for your experiments:

// Single criterion - mean score
passCriteria: {
  type: "meanScore",
  min: 0.8,
  label: "Average Quality",
  scorerId: "exact-match" // Optional: specific scorer
}

// Single criterion - pass rate
passCriteria: {
  type: "passRate",
  min: 0.9,
  label: "90% Pass Rate",
  severity: "error" // "error" or "warn"
}

// Multiple criteria (all must pass)
passCriteria: [
  {
    type: "meanScore",
    min: 0.7,
    label: "Overall Quality"
  },
  {
    type: "passRate",
    min: 0.95,
    label: "Consistency Check",
    scorerId: "exact-match"
  }
]

VoltOps Integration

Configure VoltOps for cloud-based tracking:

voltOps: {
  client: voltOpsClient, // VoltOps client instance
  triggerSource: "ci", // Source identifier
  autoCreateRun: true, // Auto-create eval runs
  autoCreateScorers: true, // Auto-register scorers
  tags: ["nightly", "regression"] // Tags for filtering
}
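
If you don't already have a client instance, one can typically be constructed from API keys. The sketch below is an assumption about setup, not part of this page's original example: it assumes VoltOpsClient is imported from @voltagent/core and that keys are supplied via environment variables; adjust to however your project configures VoltOps.

import { VoltOpsClient } from "@voltagent/core";

// Assumption: keys are provided through environment variables.
const voltOpsClient = new VoltOpsClient({
  publicKey: process.env.VOLTOPS_PUBLIC_KEY,
  secretKey: process.env.VOLTOPS_SECRET_KEY,
});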

Experiment Binding

Link your local experiment definition to a VoltOps experiment:

experiment: {
  name: "production-quality-check", // VoltOps experiment name
  id: "exp-uuid", // Or reference an existing experiment by ID
  autoCreate: true // Create the experiment if it doesn't exist
}

Running Experiments

Via CLI

Save your experiment to a file:

// experiments/support-quality.ts
import { createExperiment } from "@voltagent/evals";

export default createExperiment({
  id: "support-quality",
  dataset: { name: "support-dataset" },
  runner: async ({ item }) => {
    // evaluation logic
    return { output: "response" };
  },
});

Run with:

npm run volt eval run --experiment ./experiments/support-quality.ts

Programmatically

import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";

const summary = await runExperiment(experiment, {
  concurrency: 5, // Run 5 items in parallel

  onItemComplete: (event) => {
    console.log(`Completed item ${event.index}/${event.total}`);
    console.log(`Score: ${event.result.scores[0]?.score}`);
  },

  onComplete: (summary) => {
    console.log(`Experiment completed: ${summary.passed ? "PASSED" : "FAILED"}`);
    console.log(`Mean score: ${summary.meanScore}`);
  },
});

Complete Example

Here's a complete example from the codebase:

import { createExperiment } from "@voltagent/evals";
import { scorers } from "@voltagent/scorers";
import { Agent } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";

const supportAgent = new Agent({
  name: "Support Agent",
  instructions: "You are a helpful customer support agent.",
  model: openai("gpt-4o-mini"),
});

export default createExperiment({
  id: "support-agent-eval",
  label: "Support Agent Evaluation",
  description: "Evaluates support agent response quality",

  dataset: {
    name: "support-qa-v2",
    limit: 100, // Test on first 100 items
  },

  runner: async ({ item, index, total }) => {
    console.log(`Processing ${index + 1}/${total}`);

    try {
      const response = await supportAgent.generateText({
        messages: [{ role: "user", content: item.input.prompt }],
      });

      return {
        output: response.text,
        metadata: {
          model: "gpt-4o-mini",
          tokenUsage: response.usage,
        },
      };
    } catch (error) {
      return {
        output: null,
        metadata: {
          error: error.message,
          failed: true,
        },
      };
    }
  },

  scorers: [
    {
      scorer: scorers.exactMatch,
      threshold: 1.0,
    },
    {
      scorer: scorers.levenshtein,
      threshold: 0.8,
      name: "String Similarity",
    },
  ],

  passCriteria: [
    {
      type: "meanScore",
      min: 0.75,
      label: "Overall Quality",
    },
    {
      type: "passRate",
      min: 0.9,
      scorerId: "exact-match",
      label: "Exact Match Rate",
    },
  ],

  experiment: {
    name: "support-agent-regression",
    autoCreate: true,
  },

  voltOps: {
    autoCreateRun: true,
    tags: ["regression", "support"],
  },
});

Result Structure

When running experiments, you get a summary with this structure:

interface ExperimentSummary {
  experimentId: string;
  runId: string;
  status: "completed" | "failed" | "cancelled";
  passed: boolean;

  startedAt: number;
  completedAt: number;
  durationMs: number;

  results: ExperimentItemResult[];

  // Aggregate metrics
  totalItems: number;
  completedItems: number;
  meanScore: number;
  passRate: number;

  // Pass criteria results
  criteriaResults?: {
    label?: string;
    passed: boolean;
    value: number;
    threshold: number;
  }[];

  metadata?: Record<string, unknown>;
}
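
As a sketch of how you might consume this summary in a script (field names taken from the interface above; the CI exit behavior is just one possible choice):

import { runExperiment } from "@voltagent/evals";
import experiment from "./experiments/support-quality";

const summary = await runExperiment(experiment, { concurrency: 5 });

console.log(`Status: ${summary.status}, mean score: ${summary.meanScore}`);
for (const criterion of summary.criteriaResults ?? []) {
  const state = criterion.passed ? "PASS" : "FAIL";
  console.log(`${state} ${criterion.label ?? "criterion"}: ${criterion.value} (min ${criterion.threshold})`);
}

// Fail a CI job when the pass criteria are not met.
if (!summary.passed) {
  process.exit(1);
}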

Best Practices

1. Use Descriptive IDs

id: "gpt4-customer-support-accuracy-v2"; // Good
id: "test1"; // Bad

2. Handle Errors Gracefully

runner: async ({ item }) => {
  try {
    const result = await process(item.input);
    return { output: result };
  } catch (error) {
    // Return error info for analysis
    return {
      output: null,
      metadata: {
        error: error.message,
        errorType: error.constructor.name,
      },
    };
  }
};

3. Add Meaningful Metadata

runner: async ({ item, runtime }) => {
  const startTime = Date.now();
  const result = await process(item.input);

  return {
    output: result,
    metadata: {
      processingTimeMs: Date.now() - startTime,
      runId: runtime?.runId,
      itemCategory: item.metadata?.category,
    },
  };
};

4. Use Appropriate Concurrency

// For rate-limited APIs
await runExperiment(experiment, {
  concurrency: 2, // Low concurrency
});

// For local processing
await runExperiment(experiment, {
  concurrency: 10, // Higher concurrency
});

5. Tag Experiments Properly

voltOps: {
  tags: ["model:gpt-4", "version:2.1.0", "type:regression", "priority:high"],
}
