Live Evaluations
Live evaluations run scorers against real-time agent interactions. Attach scorers to agents during initialization to sample production traffic, enforce safety guardrails, and monitor conversation quality without running separate evaluation jobs.
Configuring Live Scorers
Define scorers in the eval config when creating an agent:
import { Agent, VoltAgentObservability } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";
const observability = new VoltAgentObservability();
const agent = new Agent({
name: "support-agent",
instructions: "Answer customer questions about products.",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
environment: "prod-us-east",
sampling: { type: "ratio", rate: 0.1 },
scorers: {
moderation: {
scorer: createModerationScorer({
model: openai("gpt-4o-mini"),
threshold: 0.5,
}),
},
},
},
});
Scorers execute asynchronously after the agent response is generated. Scoring does not block the user-facing response.
Eval Configuration
Required Fields
None - all fields are optional. If no scorers are defined, evaluation is disabled.
Optional Fields
triggerSource
Tags the evaluation run with a trigger identifier. Use to distinguish between environments or traffic sources.
triggerSource: "production"; // live traffic
triggerSource: "staging"; // pre-production
triggerSource: "manual"; // manual testing
Default: "live" when unspecified.
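Like environment below, triggerSource can be derived from the runtime environment (a sketch; adapt the condition to your deployment setup):
triggerSource: process.env.NODE_ENV === "production" ? "production" : "staging";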
environment
Labels the evaluation with an environment tag. Appears in telemetry and VoltOps dashboards.
environment: "prod-us-east";
environment: "local-dev";
sampling
Controls what percentage of interactions are scored. Use sampling to reduce latency and LLM costs on high-volume agents.
Ratio-based:
sampling: {
type: "ratio",
rate: 0.1, // score 10% of interactions
}
Count-based:
sampling: {
type: "count",
rate: 100, // score every 100th interaction
}
Always sample:
sampling: { type: "ratio", rate: 1 } // 100%
When unspecified, sampling defaults to scoring every interaction (rate: 1).
Sampling decisions are made independently for each scorer. Set sampling at the eval level to apply one policy to all scorers, or per-scorer to override it (see Per-Scorer Sampling below).
scorers
Map of scorer configurations. Each key identifies a scorer instance, and the value defines the scorer function and parameters.
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.5 }),
},
keyword: {
scorer: keywordMatchScorer,
params: { keyword: "refund" },
},
}
redact
Function to remove sensitive data from evaluation payloads before storage. Called synchronously before scoring.
redact: (payload) => ({
...payload,
input: payload.input?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
output: payload.output?.replace(/\b\d{4}-\d{4}-\d{4}-\d{4}\b/g, "[CARD]"),
});
The redacted payload is stored in observability but scoring uses the original unredacted version.
Scorer Configuration
Each entry in the scorers map has this structure:
{
scorer: LocalScorerDefinition | (() => Promise<LocalScorerDefinition>),
params?: Record<string, unknown> | ((payload: AgentEvalContext) => Record<string, unknown>),
sampling?: SamplingPolicy,
id?: string,
onResult?: (result: AgentEvalResult) => void | Promise<void>,
}
Fields
scorer (required)
The scoring function. Use prebuilt scorers from @voltagent/scorers or custom implementations via buildScorer.
Prebuilt scorer:
import { createModerationScorer } from "@voltagent/scorers";
scorer: createModerationScorer({ model, threshold: 0.5 });
Custom scorer:
import { buildScorer } from "@voltagent/core";
const customScorer = buildScorer({
id: "length-check",
type: "agent",
label: "Response Length",
})
.score(({ payload }) => {
const length = payload.output?.length ?? 0;
return { score: length > 50 ? 1 : 0 };
})
.build();
params
Static or dynamic parameters passed to the scorer.
Static:
params: {
keyword: "refund",
threshold: 0.8,
}
Dynamic:
params: (payload) => ({
keyword: extractKeyword(payload.input),
threshold: 0.8,
});
Dynamic params are resolved before each scorer invocation.
sampling
Override the global sampling policy for this scorer.
sampling: { type: "ratio", rate: 0.05 } // 5% for this scorer only
id
Override the scorer's default ID. Useful when using the same scorer multiple times with different params.
scorers: {
keywordRefund: {
scorer: keywordScorer,
id: "keyword-refund",
params: { keyword: "refund" },
},
keywordReturn: {
scorer: keywordScorer,
id: "keyword-return",
params: { keyword: "return" },
},
}
onResult
Callback invoked after scoring completes. Use for custom logging, alerting, or side effects.
onResult: async (result) => {
if (result.score !== null && result.score < 0.5) {
await alertingService.send({
message: `Low score: ${result.scorerName} = ${result.score}`,
});
}
};
Scorer Context
Scorers receive an AgentEvalContext object with these properties:
interface AgentEvalContext {
agentId: string;
agentName: string;
operationId: string;
operationType: "generateText" | "streamText" | string;
input: string | null; // normalized string
output: string | null; // normalized string
rawInput: unknown; // original input value
rawOutput: unknown; // original output value
userId?: string;
conversationId?: string;
traceId: string;
spanId: string;
timestamp: string;
metadata?: Record<string, unknown>;
rawPayload: AgentEvalPayload;
}
Use input and output for text-based scorers. Access rawInput and rawOutput for structured data.
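For structured outputs, a minimal sketch of a scorer that reads rawOutput (the items field is a hypothetical shape; rawOutput depends on what your agent actually returns):
import { buildScorer } from "@voltagent/core";
const structuredItemsScorer = buildScorer({
  id: "structured-items",
  type: "agent",
})
  .score(({ payload }) => {
    // Hypothetical shape: score 1 when the structured output has a non-empty items array.
    const items = (payload.rawOutput as { items?: unknown[] } | null)?.items;
    return { score: Array.isArray(items) && items.length > 0 ? 1 : 0 };
  })
  .build();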
Building Custom Scorers
Use buildScorer to create scorers with custom logic:
import { buildScorer } from "@voltagent/core";
const lengthScorer = buildScorer({
id: "response-length",
type: "agent",
label: "Response Length Check",
})
.score(({ payload, params }) => {
const minLength = (params.minLength as number) ?? 50;
const length = payload.output?.length ?? 0;
return {
score: length >= minLength ? 1 : 0,
metadata: { actualLength: length, minLength },
};
})
.reason(({ score, params }) => {
const minLength = (params.minLength as number) ?? 50;
return {
reason:
score >= 1
? `Response meets minimum length of ${minLength} characters.`
: `Response is shorter than ${minLength} characters.`,
};
})
.build();
Builder Methods
.score(fn)
Defines the scoring function. Return { score, metadata? } or just the numeric score.
.score(({ payload, params, results }) => {
const match = payload.output?.includes(params.keyword);
return {
score: match ? 1 : 0,
metadata: { keyword: params.keyword, matched: match },
};
})
Context properties:
- payload - AgentEvalContext with input/output
- params - Resolved parameters
- results - Shared results object for multi-stage scoring
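As noted above, returning a bare number is shorthand for returning { score } with no metadata:
.score(({ payload }) => (payload.output ? 1 : 0))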
.reason(fn) (optional)
Generates human-readable explanations. Return { reason: string }.
.reason(({ score, params }) => ({
reason: score >= 1 ? "Match found" : "No match",
}))
.build()
Returns the LocalScorerDefinition object.
LLM Judge Scorers
Use AI SDK's generateObject to build LLM-based evaluators:
import { buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { z } from "zod";
const JUDGE_SCHEMA = z.object({
score: z.number().min(0).max(1).describe("Score from 0 to 1"),
reason: z.string().describe("Detailed explanation"),
});
const helpfulnessScorer = buildScorer({
id: "helpfulness",
label: "Helpfulness Judge",
})
.score(async ({ payload }) => {
const prompt = `Rate the response for clarity and helpfulness.
User Input: ${payload.input}
Assistant Response: ${payload.output}
Provide a score from 0 to 1 with an explanation.`;
const response = await generateObject({
model: openai("gpt-4o-mini"),
schema: JUDGE_SCHEMA,
prompt,
maxTokens: 200,
});
return {
score: response.object.score,
metadata: {
reason: response.object.reason,
},
};
})
.build();
The judge calls the LLM with a structured schema, ensuring consistent scoring output.
Prebuilt Scorers
Moderation
import { createModerationScorer } from "@voltagent/scorers";
createModerationScorer({
model: openai("gpt-4o-mini"),
threshold: 0.5, // fail if score < 0.5
});
Flags unsafe content (toxicity, bias, etc.) using LLM-based classification.
Answer Correctness
import { createAnswerCorrectnessScorer } from "@voltagent/scorers";
const scorer = createAnswerCorrectnessScorer({
buildPayload: ({ payload, params }) => ({
input: payload.input,
output: payload.output,
expected: params.expectedAnswer,
}),
});
Evaluates factual accuracy against a reference answer. Requires an expected value supplied through params; you implement buildPayload to map the evaluation context onto the fields the scorer expects.
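A usage sketch wiring the scorer above into an agent's eval config (lookupExpectedAnswer is a hypothetical helper that returns a reference answer for the given input):
scorers: {
  correctness: {
    scorer,
    params: (payload) => ({
      expectedAnswer: lookupExpectedAnswer(payload.input),
    }),
  },
}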
Answer Relevancy
import { createAnswerRelevancyScorer } from "@voltagent/scorers";
const scorer = createAnswerRelevancyScorer({
strictness: 3,
buildPayload: ({ payload, params }) => ({
input: payload.input,
output: payload.output,
context: params.referenceContext,
}),
});
Checks whether the output addresses the input. The strictness option controls how strictly relevancy is judged.
Keyword Match
import { buildScorer } from "@voltagent/core";
const keywordScorer = buildScorer({
id: "keyword-match",
type: "agent",
})
.score(({ payload, params }) => {
const keyword = params.keyword as string;
const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
return { score: matched ? 1 : 0 };
})
.build();
// Usage:
scorers: {
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
},
}
VoltOps Integration
When a VoltOps client is configured globally, live scorer results are forwarded automatically:
import VoltAgent, { Agent, VoltAgentObservability } from "@voltagent/core";
import { VoltOpsClient } from "@voltagent/sdk";
const voltOpsClient = new VoltOpsClient({
publicKey: process.env.VOLTAGENT_PUBLIC_KEY,
secretKey: process.env.VOLTAGENT_SECRET_KEY,
});
const observability = new VoltAgentObservability();
new VoltAgent({
agents: { support: agent },
observability,
voltOpsClient, // enables automatic forwarding
});
The framework creates evaluation runs, registers scorers, appends results, and finalizes summaries. Each batch of scores (per agent interaction) becomes a separate run in VoltOps.
Sampling Strategies
Ratio Sampling
Sample a percentage of interactions:
sampling: { type: "ratio", rate: 0.1 } // 10% of traffic
Use for high-volume agents where scoring every interaction is expensive.
Count Sampling
Sample every Nth interaction:
sampling: { type: "count", rate: 100 } // every 100th interaction
Use when you need predictable sampling intervals or rate-limiting.
Per-Scorer Sampling
Override sampling for specific scorers:
eval: {
sampling: { type: "ratio", rate: 1 }, // default: score all
scorers: {
moderation: {
scorer: moderationScorer,
sampling: { type: "ratio", rate: 1 }, // always run moderation
},
helpfulness: {
scorer: helpfulnessScorer,
sampling: { type: "ratio", rate: 0.05 }, // 5% for expensive LLM judge
},
},
}
Error Handling
If a scorer throws an exception, the result is marked status: "error" and the error message is captured in errorMessage. Other scorers continue executing.
.score(({ payload, params }) => {
if (!params.keyword) {
throw new Error("keyword parameter is required");
}
// ...
})
The error appears in observability storage and VoltOps telemetry.
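To route failures into your own logging or alerting, a sketch that checks these fields in onResult (assuming the status and errorMessage fields described above are present on the result):
onResult: async (result) => {
  if (result.status === "error") {
    console.error(`Scorer ${result.scorerName} failed: ${result.errorMessage}`);
  }
};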
Best Practices
Use Sampling for Expensive Scorers
LLM judges and embedding-based scorers consume tokens and add latency. Sample aggressively:
sampling: { type: "ratio", rate: 0.05 } // 5% for LLM judges
Combine Fast and Slow Scorers
Run lightweight scorers (keyword match, length checks) on all interactions. Sample LLM judges at lower rates.
scorers: {
keyword: {
scorer: keywordScorer,
sampling: { type: "ratio", rate: 1 }, // 100%
},
helpfulness: {
scorer: helpfulnessScorer,
sampling: { type: "ratio", rate: 0.1 }, // 10%
},
}
Use Redaction for PII
Strip sensitive data before storage:
redact: (payload) => ({
...payload,
input: payload.input?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
output: payload.output?.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"),
});
Scorers receive unredacted data. Only storage and telemetry are redacted.
Use Thresholds for Alerts
Set thresholds and trigger alerts on failures:
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.7 }),
onResult: async (result) => {
if (result.score !== null && result.score < 0.7) {
await alertingService.send({
severity: "high",
message: `Moderation failed: ${result.score}`,
});
}
},
},
}
Tag Environments
Use environment to distinguish between deployments:
environment: process.env.NODE_ENV === "production" ? "prod" : "staging";
Filter telemetry by environment in VoltOps dashboards.
Examples
Moderation + Keyword Matching
import { Agent, VoltAgentObservability, buildScorer } from "@voltagent/core";
import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";
const moderationModel = openai("gpt-4o-mini");
const keywordScorer = buildScorer({
id: "keyword-match",
type: "agent",
})
.score(({ payload, params }) => {
const keyword = params.keyword as string;
const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
return { score: matched ? 1 : 0, metadata: { keyword, matched } };
})
.build();
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
sampling: { type: "ratio", rate: 1 },
scorers: {
moderation: {
scorer: createModerationScorer({ model: moderationModel, threshold: 0.5 }),
},
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
},
},
},
});
LLM Judge for Helpfulness
import { Agent, buildScorer } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const HELPFULNESS_SCHEMA = z.object({
score: z.number().min(0).max(1),
reason: z.string(),
});
const helpfulnessScorer = buildScorer({
id: "helpfulness",
label: "Helpfulness",
})
.score(async ({ payload, results }) => {
const agent = new Agent({
name: "helpfulness-judge",
model: openai("gpt-4o-mini"),
instructions: "You rate responses for helpfulness",
});
const prompt = `Rate the response for clarity, accuracy, and helpfulness.
User Input: ${payload.input}
Assistant Response: ${payload.output}
Provide a score from 0 to 1 with an explanation.`;
const response = await agent.generateObject(prompt, HELPFULNESS_SCHEMA);
// Stash the judge output on the shared results object so .reason() can read it.
const raw = (results.raw ?? {}) as Record<string, unknown>;
raw.helpfulnessJudge = response.object;
results.raw = raw;
return {
score: response.object.score,
metadata: { reason: response.object.reason },
};
})
.reason(({ results }) => {
const judge = results.raw?.helpfulnessJudge as { reason?: string };
return { reason: judge?.reason ?? "No explanation provided." };
})
.build();
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
sampling: { type: "ratio", rate: 0.1 }, // 10% sampling
scorers: {
helpfulness: { scorer: helpfulnessScorer },
},
},
});
Multiple Scorers with Different Sampling
const agent = new Agent({
name: "support",
model: openai("gpt-4o"),
eval: {
triggerSource: "production",
environment: "prod-us-east",
sampling: { type: "ratio", rate: 1 }, // default: score everything
scorers: {
moderation: {
scorer: createModerationScorer({ model, threshold: 0.5 }),
sampling: { type: "ratio", rate: 1 }, // always run
},
answerCorrectness: {
scorer: createAnswerCorrectnessScorer(),
sampling: { type: "ratio", rate: 0.05 }, // 5% (expensive)
params: (payload) => ({
expectedAnswer: lookupExpectedAnswer(payload.input),
}),
},
keyword: {
scorer: keywordScorer,
params: { keyword: "refund" },
sampling: { type: "ratio", rate: 1 }, // cheap, always run
},
},
},
});
Combining Offline and Live Evaluations
Use live evals for real-time monitoring and offline evals for regression testing:
- Live: Sample 5-10% of production traffic with fast scorers (moderation, keyword match)
- Offline: Run comprehensive LLM judges on curated datasets nightly
Both share the same scorer definitions, so you can move scorers between eval types as needed; a sharing sketch follows.
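A minimal sharing sketch (the file layout is illustrative; the offline side is covered in Offline Evaluations):
// scorers/keyword.ts - one definition used by both live and offline evals
import { buildScorer } from "@voltagent/core";
export const keywordScorer = buildScorer({
  id: "keyword-match",
  type: "agent",
})
  .score(({ payload, params }) => {
    const keyword = params.keyword as string;
    const matched = payload.output?.toLowerCase().includes(keyword.toLowerCase());
    return { score: matched ? 1 : 0 };
  })
  .build();
// Live: reference it in the agent's eval.scorers (as shown throughout this page).
// Offline: import the same export from your offline evaluation setup.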
Next Steps
- Offline Evaluations - Regression testing and CI integration
- Prebuilt Scorers - Full catalog of prebuilt scorers
- Building Custom Scorers - Create your own evaluation scorers