Wavelength: AI Voice Agent Platform
Built a full voice AI pipeline from the transport layer up. Not an API wrapper. Real telephony, real conversations, real sales outcomes.
The Problem
Our own company (Freedom With AI) runs weekly webinars for a community of 480,000+ learners. We needed to qualify leads, recover no-shows, and warm up registrants before each session. Doing this manually with a team of telecallers was expensive, inconsistent, and couldn't scale. We needed AI agents that could have real phone conversations — not chatbots, actual voice calls over telephony.
The existing solutions (Vapi, Retell) were either too expensive at scale or didn't give us the control we needed over conversation design, voice quality, and call flow logic.
So we built our own.
What We Built
A full voice AI pipeline called Wavelength, built from the ground up:
- Pipecat (open-source voice AI framework) for orchestration
- Plivo for telephony (SIP trunking, call recording)
- Deepgram and Sarvam Saaras V3 for speech-to-text (Sarvam specifically chosen for Indian English and Hindi accuracy)
- Google Gemini Flash for LLM (chosen for speed, evaluated against Groq + Llama 4 Maverick)
- Gemini Flash TTS for text-to-speech (migrated from Google Cloud Chirp3)
We didn't use a no-code voice AI builder. We built the pipeline from the transport layer up, including writing a custom stateful overlap-save resampler to fix chunk-boundary discontinuities that were garbling the audio.
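The core idea behind a stateful chunk-boundary resampler is that resampling state (the fractional read position and the last samples of the previous chunk) must survive across chunk boundaries; resetting it per chunk is what produces pops and garbled transitions. A minimal linear-interpolation sketch of that idea (illustrative only; the production resampler uses an overlap-save design, and all names here are hypothetical):

```python
import numpy as np

class StatefulResampler:
    """Resamples audio chunk-by-chunk while carrying state across
    chunk boundaries, so the output stream stays continuous."""

    def __init__(self, in_rate: int, out_rate: int):
        self.ratio = in_rate / out_rate  # input samples per output sample
        self.phase = 0.0                 # fractional read position, carried over
        self.tail = np.zeros(0)          # last input sample of the previous chunk

    def process(self, chunk: np.ndarray) -> np.ndarray:
        # Prepend the previous chunk's tail so interpolation can span
        # the boundary instead of restarting at sample 0.
        buf = np.concatenate([self.tail, chunk.astype(np.float64)])
        out = []
        pos = self.phase
        while pos < len(buf) - 1:
            i = int(pos)
            frac = pos - i
            out.append(buf[i] * (1 - frac) + buf[i + 1] * frac)
            pos += self.ratio
        # Carry the fractional position and last sample into the next call.
        self.phase = pos - (len(buf) - 1)
        self.tail = buf[-1:]
        return np.array(out)
```

Feeding a ramp signal through in two chunks yields one seamless upsampled ramp; with per-chunk resets, the second chunk would restart its phase and glitch at the seam.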
The Numbers
| Metric | Value | Context |
|---|---|---|
| Calls in single batch | 1,436 | Outbound warm-up campaign for one webinar |
| Call connection rate | 54.7% | Industry average for cold outbound is ~30-40% |
| Engaged beyond 30s | 25.9% | Meaning the AI held a real conversation |
| Engaged beyond 60s | 20.3% | Deep conversations with qualification |
| Warm/Hot qualification rate | 16.4% | From total calls, not just connected |
| Bot personas built | 10+ | Sales, qualifier, no-show recovery, warm-up, onboarding, attendance |
| Prompt iterations | v1 through v11 | Each version based on transcript analysis of real calls |
| Bot silence bug | 23.9% → under 5% | Diagnosed as concurrency issue: 46% silence at 6 PM peak, 0% at off-peak |
Key Engineering Challenges
- Audio Frame Loss (BOT_VAD_STOP_SECS): Calls were cutting out mid-sentence. Traced to a Voice Activity Detection timeout that was too aggressive. The bot was interpreting natural speech pauses as end-of-utterance, killing the audio stream. Fixed by tuning VAD parameters and implementing a buffer window.
- Audio Chunking Discontinuities: TTS output was arriving in chunks that didn't align at sample boundaries, causing audible pops and garbled transitions. Built a stateful overlap-save resampler that maintains phase continuity across chunk boundaries. This is low-level DSP work, not prompt engineering.
- 23.9% Silent Call Rate: Nearly 1 in 4 connected calls had the bot go completely silent after the greeting. Data analysis revealed it was a concurrency problem: at 3 PM (228 calls) silence was 30%, at 6 PM (196 calls) it hit 46%, but at 5 PM (249 calls) it was only 1%. Fixed by implementing call staggering and connection pooling.
- STT Aggregation Bug: ~24% of connected calls went silent after the initial greeting because partial transcripts weren't being assembled into complete utterances, so the LLM never received a finished user turn. Diagnosed by analyzing transcript patterns across 786 connected calls.
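The BOT_VAD_STOP_SECS fix above boils down to endpointing with a silence hold window: end-of-utterance is declared only after a continuous stretch of silence, so natural mid-sentence pauses don't kill the stream. A sketch of that logic under assumed frame sizes (not Pipecat's actual API; names are hypothetical):

```python
class EndpointDetector:
    """Declares end-of-utterance only after `stop_secs` of continuous
    silence, so natural speech pauses aren't treated as the end."""

    def __init__(self, stop_secs: float = 0.8, frame_secs: float = 0.02):
        self.stop_secs = stop_secs    # required silence before ending the turn
        self.frame_secs = frame_secs  # duration of one VAD frame
        self.silence = 0.0            # accumulated continuous silence

    def push(self, frame_is_speech: bool) -> bool:
        """Feed one VAD frame; returns True once the utterance has ended."""
        if frame_is_speech:
            self.silence = 0.0  # any speech resets the hold window
            return False
        self.silence += self.frame_secs
        return self.silence >= self.stop_secs
```

Tuning is the trade-off: too small a `stop_secs` cuts speakers off mid-sentence (the bug above); too large adds response latency after every turn.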
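Call staggering plus a concurrency cap, as used for the silent-call fix, can be sketched with an `asyncio.Semaphore`: launches are spaced out, and the number of simultaneously live calls is bounded so peak-hour bursts can't exhaust downstream STT/TTS connections. This is a simplified sketch; `dial` stands in for the real Plivo call coroutine, and the parameter values are illustrative:

```python
import asyncio

async def run_campaign(numbers, dial, max_concurrent=25, stagger_secs=0.5):
    """Place outbound calls with staggered launches and a hard cap
    on how many calls are live at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one_call(i, number):
        await asyncio.sleep(i * stagger_secs)  # stagger call launches
        async with sem:                        # cap concurrent live calls
            return await dial(number)

    return await asyncio.gather(
        *(one_call(i, n) for i, n in enumerate(numbers))
    )
```

With this shape, a 6 PM batch of hundreds of numbers ramps up gradually instead of hitting the pipeline all at once.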
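The STT aggregation bug comes from a property of streaming recognizers: interim results are successive hypotheses that replace each other, while only finalized segments should be accumulated into the utterance. A sketch of aggregation logic that respects that distinction (illustrative, not the production code; method names are hypothetical):

```python
class TranscriptAggregator:
    """Assembles streaming STT results into one complete utterance.
    Interim hypotheses overwrite each other; only final segments are
    committed, and the utterance is flushed at end-of-speech."""

    def __init__(self):
        self.committed = []  # finalized segments of the current utterance
        self.interim = ""    # latest non-final hypothesis

    def on_result(self, text: str, is_final: bool):
        if is_final:
            self.committed.append(text)
            self.interim = ""
        else:
            self.interim = text  # replaces the previous interim, never appends

    def end_of_speech(self) -> str:
        """Flush the utterance; call when VAD signals the user stopped."""
        utterance = " ".join(self.committed).strip()
        self.committed, self.interim = [], ""
        return utterance
```

Passing the LLM an interim hypothesis (or an empty flush) instead of the assembled utterance is exactly the failure mode that looks like "the bot went silent after the greeting."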
Want to build something like this? Let's talk.