Overview

Building a voice AI agent is only half the challenge. You also need to know that it handles real conversations reliably. Evaluations give you that confidence through two complementary approaches:
  • Simulations: Automated test conversations that exercise your agent’s behavior across scenarios, edge cases, and failure modes before they reach users.
  • Observability: Continuous evaluation of live calls so you can catch regressions, track quality over time, and close the loop between what you test and what users experience.
Together, these form a feedback loop: simulations validate changes before deployment, and observability surfaces issues that inform your next round of tests.

Bluejay

Bluejay is a simulation, observability, and evaluation platform purpose-built for voice AI agents. It provides no-code simulation testing and production call monitoring that integrate directly with Pipecat, whether you’re running on Pipecat Cloud or self-hosting.
With Bluejay, you can:
  • Run automated simulations that call your agent and evaluate its responses
  • Define test scenarios covering edge cases like interruptions, unexpected input, and multi-turn flows
  • Monitor every production call with automated quality scoring
  • Track evaluation metrics over time to catch regressions early

Pipecat Cloud users

If your agent is deployed on Pipecat Cloud, Bluejay offers two zero-configuration integration paths:

No-Code API Integration

Enter your Pipecat Cloud API key and agent name in Bluejay’s dashboard. Bluejay connects directly to your agent’s API to spin up simulation sessions with no code changes required.

No-Code Telephony Integration

Enter your agent’s phone number into Bluejay and start running simulations immediately. Bluejay calls your agent just like a real user would, testing end-to-end behavior over telephony.
The telephony integration tests the full call stack, from phone network to your agent and back, making it ideal for catching issues that only surface in real call conditions.

Self-hosted users

If you’re running Pipecat on your own infrastructure, Bluejay integrates via a WebSocket connection. Point Bluejay at your agent’s WebSocket endpoint and it will establish a session to run simulations against your agent directly. See the Bluejay WebSocket integration guide for setup instructions.

Observability

Simulations cover pre-deployment testing, but observability ensures your agent maintains quality with real users. Bluejay’s Evaluate API lets you submit any production call for automated evaluation via POST https://api.getbluejay.ai/v1/evaluate.
import requests

# Bluejay Evaluate API endpoint
url = "https://api.getbluejay.ai/v1/evaluate"
headers = {"X-API-Key": "<your-bluejay-api-key>"}
payload = {
    "agent_id": "<your-bluejay-agent-id>",
    "start_time_utc": "2025-03-31T18:30:00Z",  # call start time, ISO 8601 UTC
    "participants": [
        {"role": "AGENT", "name": "Healthcare Agent Harry"},
        {"role": "USER", "name": "John Doe"},
    ],
    "recording_url": "https://s3.amazonaws.com/my-recordings/call-123.wav",
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # surface submission errors immediately
Integrate the evaluate endpoint into your agent’s session cleanup logic to automatically evaluate every production call without manual intervention.
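A sketch of that cleanup hook is below. The endpoint, header, and payload shape follow the request above; the helper names and parameters are illustrative, not part of Bluejay's SDK:

```python
import requests

BLUEJAY_EVALUATE_URL = "https://api.getbluejay.ai/v1/evaluate"


def build_evaluation_payload(agent_id, start_time_utc, recording_url,
                             agent_name, user_name):
    """Assemble an Evaluate API payload in the shape shown above."""
    return {
        "agent_id": agent_id,
        "start_time_utc": start_time_utc,
        "participants": [
            {"role": "AGENT", "name": agent_name},
            {"role": "USER", "name": user_name},
        ],
        "recording_url": recording_url,
    }


def submit_call_for_evaluation(api_key, payload):
    """POST a finished call to Bluejay; intended for session-cleanup code."""
    response = requests.post(
        BLUEJAY_EVALUATE_URL,
        json=payload,
        headers={"X-API-Key": api_key},
        timeout=10,  # don't let evaluation submission block teardown forever
    )
    response.raise_for_status()
    return response
```

Call `submit_call_for_evaluation` from whatever teardown path your agent already has (for example, after the pipeline task completes and the recording has been uploaded), so every call is submitted without a manual step.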

Traces

Bluejay supports tracing to monitor and observe your agent’s execution flow, latency, and performance in real time. Traces conform to the OpenTelemetry standard, so you can use any compatible instrumentation library, including OpenInference, Langfuse, and OpenLLMetry. To send traces to Bluejay:
  1. Instrument your application to export traces to Bluejay’s OTLP endpoint
  2. Link traces to call evaluations by including the trace_id in your Evaluate API requests
  3. View traces alongside your call evaluations in the Bluejay dashboard
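Step 2 can be sketched as follows. OpenTelemetry represents a trace ID as a 128-bit integer, conventionally serialized as a 32-character lowercase hex string (this mirrors OpenTelemetry's own `format_trace_id` helper); passing that string as a `trace_id` field in the Evaluate payload is an assumption based on step 2 above:

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit OpenTelemetry trace ID as 32 lowercase hex chars."""
    return format(trace_id, "032x")


def attach_trace(payload: dict, trace_id: int) -> dict:
    """Return a copy of an Evaluate API payload with the trace linked."""
    return {**payload, "trace_id": format_trace_id(trace_id)}
```

At call time you would read the integer ID from the active span, e.g. `trace.get_current_span().get_span_context().trace_id`, before the span ends.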

Example: OpenTelemetry setup

Configure the OpenTelemetry SDK to export traces to Bluejay:
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Bluejay's OTLP trace endpoint
endpoint = "https://otlp.getbluejay.ai/v1/traces"
resource = Resource.create({SERVICE_NAME: "my-pipecat-agent"})

tracer_provider = trace_sdk.TracerProvider(resource=resource)
headers = {
    "X-API-KEY": "<your-bluejay-api-key>",
}

# SimpleSpanProcessor exports each span as it ends; consider
# BatchSpanProcessor in production to reduce export overhead.
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint=endpoint, headers=headers))
)
Once the tracer provider is configured, use it with any OpenTelemetry-compatible instrumentation. For example, to automatically trace LLM calls:
from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
With simulations, evaluations, and traces all flowing into Bluejay, you get a single dashboard for every dimension of agent quality, with no need to stitch together data from multiple tools. Looking to get started with Bluejay? Book a demo today.
By combining simulation testing, production monitoring, and tracing, you create a continuous evaluation loop: test before you ship, monitor after you deploy, and use trace data to understand exactly what happened in every conversation.

Next steps

Bluejay Documentation

Full setup guides, API reference, and configuration options.

Pipecat Integration Guide

Step-by-step guide for connecting Bluejay to your Pipecat agent.

Metrics

Learn about Pipecat’s built-in performance and usage metrics.

Saving Transcripts

Capture conversation transcripts to use with evaluation tools.