
Live Evaluations

Live evaluations run scorers against real-time agent interactions. Attach scorers to agents during initialization to sample production traffic, enforce safety guardrails, and monitor conversation quality without running separate evaluation jobs.

Configuring Live Scorers

Define scorers in the eval config when creating an agent:

import { Agent, VoltAgentObservability } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const observability = new VoltAgentObservability();

const agent = new Agent({
  name: "support-agent",
  instructions: "Answer customer questions about products.",
  model: openai("gpt-4o"),
  eval: {
    triggerSource: "production",
    environment: "prod-us-east",
    sampling: { type: "ratio", rate: 0.1 },
    scorers: {
      moderation: {
        scorer: createModerationScorer({
          model: openai("gpt-4o-mini"),
          threshold: 0.5,
        }),
      },
    },
  },
});

Scorers execute asynchronously after the agent response is generated. Scoring does not block the user-facing response.

Eval Configuration

Required Fields

None. All fields are optional; if no scorers are defined, evaluation is disabled.

Optional Fields

triggerSource

Tags the evaluation run with a trigger identifier. Use it to distinguish between environments or traffic sources.

triggerSource: "production"; // live traffic
triggerSource: "staging"; // pre-production
triggerSource: "manual"; // manual testing

Default: "live" when unspecified.

environment

Labels the evaluation with an environment tag. Appears in telemetry and VoltOps dashboards.

environment: "prod-us-east";
environment: "local-dev";

sampling

Controls which interactions are scored. Use sampling to reduce latency and LLM costs on high-volume agents.

Ratio-based:

sampling: {
  type: "ratio",
  rate: 0.1, // score 10% of interactions
}

Count-based:

sampling: {
  type: "count",
  rate: 100, // score every 100th interaction
}

Always sample:

sampling: { type: "ratio", rate: 1 }  // 100%

When unspecified, sampling defaults to scoring every interaction (rate: 1).

Sampling decisions are made independently for each scorer. Set sampling at the eval level (applies to all scorers) or per-scorer to override.

scorers

Map of scorer configurations. Each key identifies a scorer instance, and the value defines the scorer function and parameters.

scorers: {
  moderation: {
    scorer: createModerationScorer({ model, threshold: 0.5 }),
  },
  keyword: {
    scorer: keywordMatchScorer,
    params: { keyword: "refund" },
  },
}

redact

Function to remove sensitive data from evaluation payloads before storage. Called synchronously before scoring.

redact: (payload) => ({
  ...payload,
  input: payload.input?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
  output: payload.output?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
});

The redacted payload is stored in observability but scoring uses the original unredacted version.

Scorer Configuration

Each entry in the scorers map has this structure:

{
  scorer: LocalScorerDefinition | (() => Promise<LocalScorerDefinition>),
  params?: Record<string, unknown> | ((payload: AgentEvalContext) => Record<string, unknown>),
  sampling?: SamplingPolicy,
  id?: string,
  onResult?: (result: AgentEvalResult) => void | Promise<void>,
}

Fields

scorer (required)

The scoring function. Use prebuilt scorers from @voltagent/scorers or custom implementations via buildScorer.

Prebuilt scorer:

import { createModerationScorer } from "@voltagent/scorers";

scorer: createModerationScorer({ model, threshold: 0.5 });

Custom scorer:

import { buildScorer } from "@voltagent/core";

const customScorer = buildScorer({
  id: "length-check",
  type: "agent",
  label: "Response Length",
})
  .score(({ payload }) => {
    const length = payload.output?.length ?? 0;
    return { score: length > 50 ? 1 : 0 };
  })
  .build();

params

Static or dynamic parameters passed to the scorer.

Static:

params: {
  keyword: "refund",
  threshold: 0.8,
}

Dynamic:

params: (payload) => ({
  keyword: extractKeyword(payload.input),
  threshold: 0.8,
});

Dynamic params are resolved before each scorer invocation.
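
Because the callback receives the full AgentEvalContext, parameters can also be derived from fields such as userId or metadata. A sketch; isPremiumUser is a hypothetical helper:

params: (payload) => ({
  // Hypothetical helper: tighten the threshold for premium users.
  threshold: isPremiumUser(payload.userId) ? 0.9 : 0.7,
  keyword: (payload.metadata?.topic as string) ?? "refund",
});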

sampling

Override the global sampling policy for this scorer.

sampling: { type: "ratio", rate: 0.05 }  // 5% for this scorer only

id

Override the scorer's default ID. Useful when using the same scorer multiple times with different params.

scorers: {
  keywordRefund: {
    scorer: keywordScorer,
    id: "keyword-refund",
    params: { keyword: "refund" },
  },
  keywordReturn: {
    scorer: keywordScorer,
    id: "keyword-return",
    params: { keyword: "return" },
  },
}

onResult

Callback invoked after scoring completes. Use for custom logging, alerting, or side effects.

onResult: async (result) => {
  if (result.score !== null && result.score < 0.5) {
    await alertingService.send({
      message: `Low score: ${result.scorerName} = ${result.score}`,
    });
  }
};

Scorer Context

Scorers receive an AgentEvalContext object with these properties:

interface AgentEvalContext {
  agentId: string;
  agentName: string;
  operationId: string;
  operationType: "generateText" | "streamText" | string;
  input: string | null; // normalized string
  output: string | null; // normalized string
  rawInput: unknown; // original input value
  rawOutput: unknown; // original output value
  userId?: string;
  conversationId?: string;
  traceId: string;
  spanId: string;
  timestamp: string;
  metadata?: Record<string, unknown>;
  rawPayload: AgentEvalPayload;
}

Use input and output for text-based scorers. Access rawInput and rawOutput for structured data.
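
For example, a scorer can inspect rawOutput when the agent returns structured data rather than plain text. A minimal sketch; the assumed shape of rawOutput (an object with an answer field) is illustrative:

import { buildScorer } from "@voltagent/core";

const structuredAnswerScorer = buildScorer({
  id: "has-structured-answer",
  type: "agent",
})
  .score(({ payload }) => {
    // Assumption: structured outputs expose an `answer` field.
    const raw = payload.rawOutput as { answer?: unknown } | null;
    const hasAnswer = typeof raw === "object" && raw !== null && "answer" in raw;
    return { score: hasAnswer ? 1 : 0 };
  })
  .build();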

Building Custom Scorers

Use buildScorer to create scorers with custom logic:

import { buildScorer } from "@voltagent/core";

const lengthScorer = buildScorer({
  id: "response-length",
  type: "agent",
  label: "Response Length Check",
})
  .score(({ payload, params }) => {
    const minLength = (params.minLength as number) ?? 50;
    const length = payload.output?.length ?? 0;
    return {
      score: length >= minLength ? 1 : 0,
      metadata: { actualLength: length, minLength },
    };
  })
  .reason(({ score, params }) => {
    const minLength = (params.minLength as number) ?? 50;
    return {
      reason:
        score >= 1
          ? `Response meets minimum length of ${minLength} characters.`
          : `Response is shorter than ${minLength} characters.`,
    };
  })
  .build();

Builder Methods

.score(fn)

Defines the scoring function. Return { score, metadata? } or just the numeric score.

.score(({ payload, params, results }) => {
  const match = payload.output?.includes(params.keyword);
  return {
    score: match ? 1 : 0,
    metadata: { keyword: params.keyword, matched: match },
  };
})
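
When no metadata is needed, the callback can simply return the number:

// Bare numeric return: 1 if the agent produced any output, 0 otherwise.
.score(({ payload }) => (payload.output ? 1 : 0))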

Context properties:

  • payload - AgentEvalContext with input/output
  • params - Resolved parameters
  • results - Shared results object for multi-stage scoring

.reason(fn) (optional)

Generates human-readable explanations. Return { reason: string }.

.reason(({ score, params }) => ({
  reason: score >= 1 ? "Match found" : "No match",
}))

.build()

Returns the LocalScorerDefinition object.

LLM Judge Scorers

Use AI SDK's generateObject to build LLM-based evaluators:

import { buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";

const JUDGE_SCHEMA = z.object({
  score: z.number().min(0).max(1).describe("Score from 0 to 1"),
  reason: z.string().describe("Detailed explanation"),
});

const helpfulnessScorer = buildScorer({
  id: "helpfulness",
  label: "Helpfulness Judge",
})
  .score(async ({ payload }) => {
    const prompt = `Rate the response for clarity and helpfulness.

User Input: ${payload.input}
Assistant Response: ${payload.output}

Provide a score from 0 to 1 with an explanation.`;

    const response = await generateObject({
      model: openai("gpt-4o-mini"),
      schema: JUDGE_SCHEMA,
      prompt,
      maxTokens: 200,
    });

    return {
      score: response.object.score,
      metadata: {
        reason: response.object.reason,
      },
    };
  })
  .build();

The judge calls the LLM with a structured schema, ensuring consistent scoring output.

Prebuilt Scorers

Moderation

import { createModerationScorer } from "@voltagent/scorers";

createModerationScorer({
  model: openai("gpt-4o-mini"),
  threshold: 0.5, // fail if score < 0.5
});

Flags unsafe content (toxicity, bias, etc.) using LLM-based classification.

Answer Correctness

import { createAnswerCorrectnessScorer } from "@voltagent/scorers";

const scorer = createAnswerCorrectnessScorer({
  buildPayload: ({ payload, params }) => ({
    input: payload.input,
    output: payload.output,
    expected: params.expectedAnswer,
  }),
});

Evaluates factual accuracy against a reference answer. The expected value comes from params and is mapped into the scorer's payload through buildPayload.
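
Attached to an agent, the expected answer is usually supplied per interaction through dynamic params; a sketch reusing the scorer above (lookupExpectedAnswer is a hypothetical lookup):

scorers: {
  correctness: {
    scorer, // the createAnswerCorrectnessScorer instance defined above
    // Hypothetical lookup from the user question to a reference answer.
    params: (payload) => ({ expectedAnswer: lookupExpectedAnswer(payload.input) }),
    sampling: { type: "ratio", rate: 0.05 }, // keep the LLM-based check cheap
  },
}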

Answer Relevancy

import { createAnswerRelevancyScorer } from "@voltagent/scorers";

const scorer = createAnswerRelevancyScorer({
  strictness: 3,
  buildPayload: ({ payload, params }) => ({
    input: payload.input,
    output: payload.output,
    context: params.referenceContext,
  }),
});

Checks whether the output addresses the input. The strictness option controls how demanding the relevancy judgment is.
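
A sketch of attaching it, where the reference context is resolved per interaction (lookupContext is a hypothetical retrieval helper):

scorers: {
  relevancy: {
    scorer, // the createAnswerRelevancyScorer instance defined above
    // Hypothetical helper that returns retrieval context for the question.
    params: (payload) => ({ referenceContext: lookupContext(payload.input) }),
    sampling: { type: "ratio", rate: 0.1 }, // sample the LLM-based judge
  },
}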

Keyword Match

import { buildScorer } from "@voltagent/core";

const keywordScorer = buildScorer({
  id: "keyword-match",
  type: "agent",
})
  .score(({ payload, params }) => {
    const keyword = params.keyword as string;
    const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
    return { score: matched ? 1 : 0 };
  })
  .build();

// Usage:
scorers: {
  keyword: {
    scorer: keywordScorer,
    params: { keyword: "refund" },
  },
}

VoltOps Integration

When a VoltOps client is configured globally, live scorer results are forwarded automatically:

import VoltAgent, { Agent, VoltAgentObservability } from "@voltagent/core";
import { VoltOpsClient } from "@voltagent/sdk";

const voltOpsClient = new VoltOpsClient({
  publicKey: process.env.VOLTAGENT_PUBLIC_KEY,
  secretKey: process.env.VOLTAGENT_SECRET_KEY,
});

const observability = new VoltAgentObservability();

new VoltAgent({
  agents: { support: agent },
  observability,
  voltOpsClient, // enables automatic forwarding
});

The framework creates evaluation runs, registers scorers, appends results, and finalizes summaries. Each batch of scores (per agent interaction) becomes a separate run in VoltOps.

Sampling Strategies

Ratio Sampling

Sample a percentage of interactions:

sampling: { type: "ratio", rate: 0.1 }  // 10% of traffic

Use for high-volume agents where scoring every interaction is expensive.

Count Sampling

Sample every Nth interaction:

sampling: { type: "count", rate: 100 }  // every 100th interaction

Use when you need predictable sampling intervals or rate-limiting.

Per-Scorer Sampling

Override sampling for specific scorers:

eval: {
  sampling: { type: "ratio", rate: 1 }, // default: score all
  scorers: {
    moderation: {
      scorer: moderationScorer,
      sampling: { type: "ratio", rate: 1 }, // always run moderation
    },
    helpfulness: {
      scorer: helpfulnessScorer,
      sampling: { type: "ratio", rate: 0.05 }, // 5% for expensive LLM judge
    },
  },
}

Error Handling

If a scorer throws an exception, the result is marked status: "error" and the error message is captured in errorMessage. Other scorers continue executing.

.score(({ payload, params }) => {
  if (!params.keyword) {
    throw new Error("keyword parameter is required");
  }
  // ...
})

The error appears in observability storage and VoltOps telemetry.
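
Errors can also be surfaced in code through onResult; a sketch, assuming the result exposes the status and errorMessage fields described above:

onResult: async (result) => {
  if (result.status === "error") {
    // Swap console.warn for your own logging or alerting.
    console.warn(`Scorer ${result.scorerName} failed: ${result.errorMessage}`);
  }
};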

Best Practices

Use Sampling for Expensive Scorers

LLM judges and embedding-based scorers consume tokens and add latency. Sample aggressively:

sampling: { type: "ratio", rate: 0.05 }  // 5% for LLM judges

Combine Fast and Slow Scorers

Run lightweight scorers (keyword match, length checks) on all interactions. Sample LLM judges at lower rates.

scorers: {
  keyword: {
    scorer: keywordScorer,
    sampling: { type: "ratio", rate: 1 }, // 100%
  },
  helpfulness: {
    scorer: helpfulnessScorer,
    sampling: { type: "ratio", rate: 0.1 }, // 10%
  },
}

Use Redaction for PII

Strip sensitive data before storage:

redact: (payload) => ({
  ...payload,
  input: payload.input?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
  output: payload.output?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
});

Scorers receive unredacted data. Only storage and telemetry are redacted.

Use Thresholds for Alerts

Set thresholds and trigger alerts on failures:

scorers: {
  moderation: {
    scorer: createModerationScorer({ model, threshold: 0.7 }),
    onResult: async (result) => {
      if (result.score !== null && result.score < 0.7) {
        await alertingService.send({
          severity: "high",
          message: `Moderation failed: ${result.score}`,
        });
      }
    },
  },
}

Tag Environments

Use environment to distinguish between deployments:

environment: process.env.NODE_ENV === "production" ? "prod" : "staging";

Filter telemetry by environment in VoltOps dashboards.

Examples

Moderation + Keyword Matching

import { Agent, VoltAgentObservability, buildScorer } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const moderationModel = openai("gpt-4o-mini");

const keywordScorer = buildScorer({
  id: "keyword-match",
  type: "agent",
})
  .score(({ payload, params }) => {
    const keyword = params.keyword as string;
    const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
    return { score: matched ? 1 : 0, metadata: { keyword, matched } };
  })
  .build();

const agent = new Agent({
  name: "support",
  model: openai("gpt-4o"),
  eval: {
    triggerSource: "production",
    sampling: { type: "ratio", rate: 1 },
    scorers: {
      moderation: {
        scorer: createModerationScorer({ model: moderationModel, threshold: 0.5 }),
      },
      keyword: {
        scorer: keywordScorer,
        params: { keyword: "refund" },
      },
    },
  },
});

LLM Judge for Helpfulness

import { Agent, buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const HELPFULNESS_SCHEMA = z.object({
  score: z.number().min(0).max(1),
  reason: z.string(),
});

const helpfulnessScorer = buildScorer({
  id: "helpfulness",
  label: "Helpfulness",
})
  .score(async ({ payload, results }) => {
    const judgeAgent = new Agent({
      name: "helpfulness-judge",
      model: openai("gpt-4o-mini"),
      instructions: "You rate responses for helpfulness",
    });

    const prompt = `Rate the response for clarity, accuracy, and helpfulness.

User Input: ${payload.input}
Assistant Response: ${payload.output}

Provide a score from 0 to 1 with an explanation.`;

    const response = await judgeAgent.generateObject(prompt, HELPFULNESS_SCHEMA);

    // Share the raw judge output with the .reason() step via the shared results object.
    results.raw = { ...results.raw, helpfulnessJudge: response.object };

    return {
      score: response.object.score,
      metadata: { reason: response.object.reason },
    };
  })
  .reason(({ results }) => {
    const judge = results.raw?.helpfulnessJudge as { reason?: string };
    return { reason: judge?.reason ?? "No explanation provided." };
  })
  .build();

const agent = new Agent({
  name: "support",
  model: openai("gpt-4o"),
  eval: {
    sampling: { type: "ratio", rate: 0.1 }, // 10% sampling
    scorers: {
      helpfulness: { scorer: helpfulnessScorer },
    },
  },
});

Multiple Scorers with Different Sampling

const agent = new Agent({
  name: "support",
  model: openai("gpt-4o"),
  eval: {
    triggerSource: "production",
    environment: "prod-us-east",
    sampling: { type: "ratio", rate: 1 }, // default: score everything
    scorers: {
      moderation: {
        scorer: createModerationScorer({ model, threshold: 0.5 }),
        sampling: { type: "ratio", rate: 1 }, // always run
      },
      answerCorrectness: {
        scorer: createAnswerCorrectnessScorer(),
        sampling: { type: "ratio", rate: 0.05 }, // 5% (expensive)
        params: (payload) => ({
          expectedAnswer: lookupExpectedAnswer(payload.input),
        }),
      },
      keyword: {
        scorer: keywordScorer,
        params: { keyword: "refund" },
        sampling: { type: "ratio", rate: 1 }, // cheap, always run
      },
    },
  },
});

Combining Offline and Live Evaluations

Use live evals for real-time monitoring and offline evals for regression testing:

  • Live: Sample 5-10% of production traffic with fast scorers (moderation, keyword match)
  • Offline: Run comprehensive LLM judges on curated datasets nightly

Both share the same scorer definitions. Move scorers between eval types as needed.
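
One way to share definitions is to keep scorers in their own module and import them wherever they are used; a sketch with an illustrative file layout:

// scorers.ts - shared scorer definitions
import { buildScorer } from "@voltagent/core";

export const keywordScorer = buildScorer({
  id: "keyword-match",
  type: "agent",
})
  .score(({ payload, params }) => {
    const keyword = params.keyword as string;
    return { score: payload.output?.toLowerCase().includes(keyword.toLowerCase()) ? 1 : 0 };
  })
  .build();

// agent.ts - attach the shared scorer to live traffic; offline runs can import the same export
import { Agent } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { keywordScorer } from "./scorers";

export const agent = new Agent({
  name: "support",
  model: openai("gpt-4o"),
  eval: {
    sampling: { type: "ratio", rate: 0.1 },
    scorers: {
      keyword: { scorer: keywordScorer, params: { keyword: "refund" } },
    },
  },
});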
