Overview
Building a voice AI agent is only half the challenge. You also need to know that it handles real conversations reliably. Evaluations give you that confidence through two complementary approaches:- Simulations: Automated test conversations that exercise your agent’s behavior across scenarios, edge cases, and failure modes before they reach users.
- Observability: Continuous evaluation of live calls so you can catch regressions, track quality over time, and close the loop between what you test and what users experience.
Bluejay
Bluejay is a simulation, observability, and evaluation platform purpose-built for voice AI agents. It provides no-code simulation testing and production call monitoring that integrate directly with Pipecat, whether you’re running on Pipecat Cloud or self-hosting.- Run automated simulations that call your agent and evaluate its responses
- Define test scenarios covering edge cases like interruptions, unexpected input, and multi-turn flows
- Monitor every production call with automated quality scoring
- Track evaluation metrics over time to catch regressions early
Pipecat Cloud users
If your agent is deployed on Pipecat Cloud, Bluejay offers two zero-configuration integration paths:No-Code API Integration
Enter your Pipecat Cloud API key and agent name in Bluejay’s dashboard.
Bluejay connects directly to your agent’s API to spin up simulation sessions
with no code changes required.
No-Code Telephony Integration
Enter your agent’s phone number into Bluejay and start running simulations
immediately. Bluejay calls your agent just like a real user would, testing
end-to-end behavior over telephony.
Self-hosted users
If you’re running Pipecat on your own infrastructure, Bluejay integrates via a WebSocket connection. Point Bluejay at your agent’s WebSocket endpoint and it will establish a session to run simulations against your agent directly. See the Bluejay WebSocket integration guide for setup instructions.Observability
Simulations cover pre-deployment testing, but observability ensures your agent maintains quality with real users. Bluejay’s Evaluate API lets you submit any production call for automated evaluation viaPOST https://api.getbluejay.ai/v1/evaluate.
Integrate the evaluate endpoint into your agent’s session cleanup logic to
automatically evaluate every production call without manual intervention.
Traces
Bluejay supports tracing to monitor and observe your agent’s execution flow, latency, and performance in real-time. Traces conform to the OpenTelemetry standard, so you can use any compatible instrumentation library, including OpenInference, Langfuse, and OpenLLMetry. To send traces to Bluejay:- Instrument your application to export traces to Bluejay’s OTLP endpoint
- Link traces to call evaluations by including the
trace_idin your Evaluate API requests - View traces alongside your call evaluations in the Bluejay dashboard
Example: OpenTelemetry setup
Configure the OpenTelemetry SDK to export traces to Bluejay:Next steps
Bluejay Documentation
Full setup guides, API reference, and configuration options.
Pipecat Integration Guide
Step-by-step guide for connecting Bluejay to your Pipecat agent.
Metrics
Learn about Pipecat’s built-in performance and usage metrics.
Saving Transcripts
Capture conversation transcripts to use with evaluation tools.