Enterprise Voice AI: Overcoming the Common Pitfalls of Plug-and-play Approaches

Introduction

Starting with the first wave of enterprise Generative AI back in 2023, businesses have been on a rapid journey of ideating, building, and deploying agentic applications. The earliest and most dominant design pattern was the RAG-enabled, turn-based text chatbot.

Now, as these text-based agent applications reach maturity, consumer expectations are shifting to demand fluid, human-like, and real-time voice interactions. Meeting this trend, many vendors beyond the big hyperscalers are hitting the market with new products that promise to overcome the historical challenges of voice applications, revolutionise existing use-cases such as traditional IVR systems, and expand the reach of voice to new experiences.

However, enabling Voice AI for an existing text-based agent stack is rarely as simple as “plug-and-play”. Bolting speech-to-text and text-to-speech onto an existing application is not enough to enable enterprise-grade conversation, and voice engines offered by some vendors will rarely meet expectations when used off the shelf, without significant changes to your application’s back-end. These naive approaches are a recipe for user frustration and wasted development effort. And while these products are seeing some adoption, maturity for many voice offerings is likely to arrive in 6–12 months’ time. Making the correct technical and design decisions now is crucial to stay ahead of the pack and take advantage of expected innovations like speech-to-speech models.

Based on our recent deployments and key learnings from creating enterprise-grade Agentic AI applications and platforms, here is a guide to avoiding and overcoming the pitfalls of plug-and-play Voice AI and architecting for true conversation.

Pitfall 1: Relying on “touch and feel” to guide decisions for voice

Most of us converse daily and therefore have a strong intuition for how conversations should look and feel. However, this intuition can rapidly lead both business stakeholders and development teams astray, with gut feel and emotional impressions rarely guiding development in the right direction to satisfy end-users.

Validate your use-case

Due to the complex nature of the task that conversational AI tries to solve and the many business, technical and user requirements for voice applications, rigorous use-case ideation and validation must first be performed to ensure that the effort you and your team invest leads to a successful outcome.

To begin with, map out the existing customer journey that your text-based agentic application already solves. Understand at each step what data is being exchanged between the user and your application and how the user is able to move along their journey, which may be non-linear.

Next, based on your understanding of the user journey, identify stages where voice is likely to be useful and effective. These stages of the user journey are likely to have the following characteristics:

  • Target rapid exchange of simple information: eliciting simple user information like the purpose of their call is a good use-case, as voice can make this process feel smooth and natural

  • Contain only low-stakes decision making: leave data-intensive or irreversible decisions to more advanced agents or humans, as these decisions require careful consideration and make voice experiences feel sluggish

  • Have clearly delineated hand-offs: understanding clearly when the agent or user should hang up or proceed in their journey is important to help avoid long but ineffective conversations

Identifying these high-impact, low-risk slices of the customer journey allows your team to target their efforts judiciously and make their initiatives more likely to succeed.

Finally, it is important to ask yourself whether voice would be “nice to have” in your application at the stages identified above, or if it is “necessary”. While it may be technically feasible to deploy voice functionality in your application, the higher than usual implementation effort required to reach a satisfactory result should temper your expectations and willingness to divert effort from other initiatives. Identifying the criticality of voice to your business plan and product roadmap will determine the level of investment your organisation is willing to make.

We have found that demoing Voice AI is straightforward, but reaching the targets required for enterprise applications requires significant upfront investment of engineering effort, beyond what is usually justifiable for the typical feature. Compounding this issue are the ongoing technological innovations in this area, demos of which set executive expectations higher than usual. Throughout the process, it is important to understand the limits of state of the art versus what is genuinely possible for your organisation. Strong use-case selection and targeted effort unlocks value for your organisation and sidesteps low-impact sidequests.

Metrics matter

The landscape for voice is rapidly evolving, with many competing vendors and architectures. As mentioned above, these developments produce impressive demos which, coupled with the everyday intuitions about conversation that many of us hold, set expectations that turn to disappointment when challenges are hit during development.

Key to limiting this disappointment is a well-defined set of KPIs and metrics that objectively measure the progress of your voice application. Metrics matter, and the success of a voice application is measured in milliseconds. The golden metric for the responsiveness of voice applications is time-to-first-token (TTFT), which measures the end-user’s perceived latency of the application’s responses: the time from when the user stops speaking to when they first hear the agent’s response.

For voice applications, the key to a good experience is keeping TTFT under about one second, as users quickly notice longer pauses and disengage. While delays of several seconds are usually acceptable in text-based chat, this means that simply adding a voice layer on top of an existing text agent is unlikely to work well without extra engineering effort. To avoid this pitfall, teams should establish clear baselines for TTFT and related metrics such as inter-token latency (ITL, how smoothly responses stream) and word error rate (WER), then use these measurements to understand where latency builds and where optimisation will have the most impact. If your existing agentic application does not track and visualise these metrics, this should be the first priority when starting development on voice. Progress must be measured against these metrics and communicated to stakeholders throughout the project; this helps avoid the impression of stalled progress and keeps business objectives and technical outcomes aligned.
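As a concrete illustration, TTFT and ITL can be measured directly from token arrival timestamps. This is a minimal sketch: the `measure_latency` helper and the simulated agent stream below are illustrative, not part of any particular framework.

```python
import time
from typing import Iterable, Iterator


def measure_latency(token_stream: Iterable[str], start: float) -> dict:
    """Measure TTFT and mean inter-token latency for one response.

    `start` is the moment the user stopped speaking; `token_stream`
    yields response tokens as they arrive from the agent stack.
    """
    arrivals: list[float] = []
    tokens: list[str] = []
    for token in token_stream:
        arrivals.append(time.monotonic())
        tokens.append(token)
    if not arrivals:
        return {"ttft_ms": None, "itl_ms": None, "tokens": 0}
    ttft = arrivals[0] - start
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_ms": ttft * 1000, "itl_ms": itl * 1000, "tokens": len(tokens)}


def fake_stream() -> Iterator[str]:
    # Simulated agent: ~300 ms to first token, then ~40 ms per token.
    time.sleep(0.3)
    yield "Hello"
    for t in ["there,", "how", "can", "I", "help?"]:
        time.sleep(0.04)
        yield t


stats = measure_latency(fake_stream(), start=time.monotonic())
print(stats)  # ttft_ms ≈ 300, itl_ms ≈ 40
```

In production, the same timestamps would come from your streaming pipeline rather than a wrapper loop, and would be exported to your observability stack rather than printed.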

Before going live, business stakeholders are likely to call out subjective measures such as naturalness of conversation, tone of voice and branding consistency. While harder to quantify, these measures are just as important as technical metrics such as TTFT. Starting voice initiatives with an agreed-upon approach to assess your solution’s performance in these subjective measures, no matter how rudimentary, will allow technical and business teams to communicate desired outcomes and areas for improvement effectively. After deploying, do not neglect application metrics such as user adoption and CSAT to understand whether your application is genuinely meeting users’ needs.

Pitfall 2: Assuming monolithic agents can handle the rigours of conversation

Achieving a sub-1-second TTFT is an impressive technological feat, and reaching it typically requires a considered and thoughtful approach to both the design and execution of your agent stack’s architecture. Current agent orchestration frameworks and best practices prioritise small, single-responsibility agents with judiciously selected models and efficiently implemented MCP tools.

Modularity + decomposition = observability + speed

As discussed above, latency and TTFT are core concerns for voice applications. Architecting from the ground up to prioritise speed is possible for greenfields applications, but fully rearchitecting an existing solution to introduce new voice functionality is rarely practical. However, choosing an agent orchestration framework that allows monolithic agents to be broken down is an effective way to meet these challenges.

While modular and decomposed agents are considered best practice, the additional advantages for voice are twofold:

  1. Breaking apart your monolithic agent into smaller components that can be observed independently allows you to understand which tasks or tool calls take the most time. By moving away from a black-box design, you may be able to make significant gains by prioritising and optimising a single task or subagent while ensuring minimal degradation of response quality.

  2. Decomposed agents allow you to optimally configure model size, tool selection, chat history length, compute resources and other parameters, which is not possible with monolithic agents. Using a large model for small tasks will take longer than needed. Likewise, some tasks will not require the full chat history or the full set of MCP tools. Carefully tuning model size reduces the generation time and TTFT, while limiting the available tools and chat history reduces total tokens thereby improving latency. Allocating more expensive compute such as GPUs for slow but critical tasks can lead to significant speedups with minimal changes.
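As an illustration of the second point, per-agent configuration can be as simple as a small data structure. The agent names, model identifiers and tool names below are hypothetical; the point is that each decomposed agent gets its own independently tunable knobs, which a monolithic agent cannot offer.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Per-agent tuning knobs for a decomposed voice stack."""
    model: str                 # smaller models for simple tasks cut TTFT
    max_history_turns: int     # fewer turns means fewer input tokens
    tools: list[str] = field(default_factory=list)  # only the tools this task needs
    use_gpu: bool = False      # reserve expensive compute for slow, critical tasks


# Hypothetical agents and model names, for illustration only.
configs = {
    "greeting": AgentConfig(model="small-fast", max_history_turns=2),
    "triage": AgentConfig(model="small-fast", max_history_turns=4,
                          tools=["lookup_intent"]),
    "booking": AgentConfig(model="large-accurate", max_history_turns=10,
                           tools=["search_slots", "create_booking"],
                           use_gpu=True),
}
```

The greeting agent needs almost no context or tooling and can run on the smallest, fastest model, while the booking agent justifies a larger model, a longer history window and GPU allocation.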

While these practices are good for voice, they also improve performance for any text-based applications.

Streaming and parallelism are critical features for conversation

Streaming responses keeps users engaged and lets them begin interpreting a response before it is finished. Hence, it is critical that your voice application can stream audio as it becomes available; this improves the end-user’s perceived latency (TTFT). However, it is equally important that inter-token latency (ITL) stays small enough to avoid choppy audio that damages the user experience.

One critical architectural functionality for voice that is not as common for other use-cases is parallelism. Regardless of your voice architecture, one thing is clear: conversation management and processing user queries are two different tasks. (Stay tuned for a future blog post on traditional cascade vs. modern speech-to-speech architectures, which will cover this topic in more depth.)

For existing text-based agent stacks that are being expanded to voice, most of the processing of a user’s query does and indeed should happen textually, and any normalisation for TTS should be delegated to a final layer before pronunciation. There are therefore two primary patterns to parallelise conversational processing.

  1. Passthrough pattern: simultaneously send the query to the conversation management agent and the processing agent. Immediately start streaming the response from the CMA to keep the user engaged and demonstrate a good TTFT. The processing agent’s stream can then be appended to the end of the CMA’s stream, appearing seamless to the user.

  2. Parallel delegation or asynchronous tool call pattern: delegate the processing via your orchestration framework’s delegation mechanism or via a tool call made asynchronously or in parallel. Your chosen framework must be capable of responding with some initial thinking tokens (“Let me look that up for you…”) before delegating to or calling any subagents. This also allows the delegating model to condense and normalise responses for TTS and helps to maintain a consistent tone. Some frameworks allow multiple parallel or even pre-emptive tool calls; consider if your use-case requires this behaviour.

Both patterns lead to a good user experience. The passthrough pattern is the easiest to implement but is wasteful, since both agents process every query; it can nonetheless be a good starting point for a voice MVP while more sophisticated approaches are implemented in the background. The parallel delegation or tool call pattern is generally more efficient but often requires some UI adjustments to keep users engaged while the agent is “thinking”, as well as more extensive back-end changes.
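The passthrough pattern can be sketched with Python’s asyncio. The agent coroutines below are stand-ins for your framework’s streaming interfaces: the slow processing agent is started concurrently while the conversation management agent’s response streams first.

```python
import asyncio
from typing import AsyncIterator


async def conversation_manager(query: str) -> AsyncIterator[str]:
    # Fast, lightweight agent: acknowledges immediately to keep TTFT low.
    yield "Sure, let me check that for you. "


async def processing_agent(query: str) -> AsyncIterator[str]:
    await asyncio.sleep(0.5)  # stands in for retrieval and tool calls
    yield "Your order shipped yesterday."


async def passthrough(query: str) -> AsyncIterator[str]:
    # Kick off the slow processing agent concurrently.
    async def collect() -> list[str]:
        return [chunk async for chunk in processing_agent(query)]

    task = asyncio.create_task(collect())
    # Stream the conversation manager's response first.
    async for chunk in conversation_manager(query):
        yield chunk
    # Then append the processing agent's output once it is ready.
    for chunk in await task:
        yield chunk


async def main() -> str:
    return "".join([chunk async for chunk in passthrough("Where is my order?")])


print(asyncio.run(main()))
```

For brevity this sketch buffers the processing agent’s output; a real system would forward its chunks through a queue as they arrive, so both streams stay incremental.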

Additional advantages

The architectural patterns outlined above satisfy the minimal requirements for a voice architecture but come with additional benefits beyond just your use-case.

  1. An architecture that allows good observability and highly configurable components permits rapid experimentation and validation across all use-cases, not just voice. By investing in the right combination of architecture and frameworks, you can accelerate your organisation’s overall development, while keeping your patterns flexible and adaptable.

  2. Evaluation and testing of agents becomes much simpler, with targeted scenario and end-to-end tests made possible per agent. Small changes to a monolithic agent can cause unexpected regressions; changes to decomposed agents remain relatively small and testable, minimising the risk of unexpected impacts.

  3. Parallelisable agents will prepare you for the future of voice, in particular speech-to-speech models. While STT + LLM + TTS architectures are currently considered production-ready, by parallelising conversation management and query processing, you will be able to take advantage of S2S models for conversation management without changing the underlying query processing. Future channels such as video, or further composing your agent with other agents, are also more achievable.

A later blog post is planned which will expand on the points above and help you to prepare for emerging voice architectures.

Pitfall 3: Expecting turn-based chat to translate to conversational experiences

Finally, we come to the true promise of voice and conversational systems. Investing time and effort into deciding on use-cases, capturing metrics and implementing a performant architecture is moot if an existing IVR (interactive voice response) system is simply reimplemented with LLMs. You can leverage the above technologies to genuinely transform how users interact with your application, delivering value for both users and your organisation.

Design for human conversation not web chat

Human conversation is noisy, messy and non-linear. Compare this to turn-based text chat: signals have virtually zero noise, turns are unambiguous and chat history is directly visible to all parties. While the messiness of human conversation may initially seem like a disadvantage for voice, treating this as a new set of requirements unlocks several major benefits.

  • Non-linearity and lack of history: Putting users in the driver’s seat lets them navigate to their solution faster. Traditional IVR systems focussed on routing you to the operator with the specialised skills to resolve your particular issue, at the cost of lengthy information gathering, multiple triage steps and many operator hops. Because that experience was so frustrating, users typically just wanted to reach an operator as quickly as possible, regardless of whether that operator could assist them. With the parallelised architecture discussed above, all specialised agents can operate in tandem, delegating as needed and eliciting further information as and when required. Multiple tasks can be completed in a single call, without the need for a new conversation.

  • Interruptions and back and forth: Allowing users to interrupt is more aligned to human conversation and feels more natural. Good interruption handling allows users to guide the conversation to their desired destination faster and gives an impression of agency and responsiveness that traditional IVR systems lack. Likewise, handling repetitions and back-channelling allows users to confirm their understanding and helps build rapport and trust. Your architecture must implement an interruption handling mechanism that ensures the agent understands when it has been interrupted and that keeps message history coherent despite interruptions. Adding functionality such as push-to-talk or a mute button can help improve turn and interruption detection.

  • Noisy signals and open-ended questions: Unlike a traditional IVR system, where users needed to travel along discrete but brittle intent paths, agentic voice systems can ask open-ended questions and respond flexibly to messy requests. Eliciting rich user information lets users feel listened to, while also obviating the need to concretely define every resolution path. As an added bonus, metrics such as WER benefit from longer and richer user input. Your conversation management design should handle noisy environments or signals, for example by asking users to rephrase their question or move to a quieter area.
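One way to keep message history coherent through interruptions, as described above, is to truncate the agent’s last message to the portion that was actually spoken and tag it as cut off. This is a minimal sketch; the message schema and `interrupted` flag are illustrative assumptions, not a specific framework’s API.

```python
def handle_interruption(history: list[dict], spoken_so_far: str) -> list[dict]:
    """Keep message history coherent when the user barges in.

    The agent's last message is truncated to what the TTS layer actually
    spoke, and tagged so the model knows it was cut off mid-sentence.
    """
    last = history[-1]
    if last["role"] == "assistant":
        last["content"] = spoken_so_far
        last["interrupted"] = True
    return history


history = [
    {"role": "user", "content": "What are your opening hours?"},
    {"role": "assistant",
     "content": "We are open 9 to 5 on weekdays, and on weekends we..."},
]
# TTS had only spoken the first clause when the user barged in.
history = handle_interruption(history, "We are open 9 to 5 on weekdays,")
history.append({"role": "user", "content": "And on public holidays?"})
```

Without the truncation, the model would believe the user heard the full response, leading to incoherent follow-ups; with it, the agent can gracefully resume or abandon the interrupted thought.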

With the above conversational features in mind, designing whole new experiences is now possible. For example, consider a voice agent that can coach a customer through filling out a complex web form in real time. This keeps the user engaged, reduces friction by providing specialised knowledge proactively and is more likely to result in a successfully completed user journey, translating to value for both the user and the business. Alternatively, consider assistive technologies like screen readers that could be simplified for new users or extended to previously inaccessible websites. For example, using voice commands instead of shortcuts allows new users to navigate more effectively without a steep learning curve, while a voice agent combined with a computer use agent would allow richer descriptions and easier navigation of image-heavy or poorly built websites.

Keep the user engaged

Similar to the earlier point around heightened expectations, the user and stakeholder expectations placed on voice system responsiveness are high. While traditional cascade architectures and newer speech-to-speech approaches are routinely capable of meeting latency expectations with respect to TTFT, users find long gaps without audible feedback jarring. Paradoxically, users expect voice systems to immediately have the answer they are looking for while human operators are expected to spend some time entering information and searching their systems before responding.

Key to overcoming this perceived deficiency is audio or visual feedback that keeps the user engaged. Starting agent responses with phrases such as “Thank you” or “I see” before continuing with processing keeps users engaged. To reduce the repetitiveness of the phrases above, you can design your user journey to minimise the number of turns required, which has the additional benefit of keeping the overall conversation shorter.

Since processing is likely to take several seconds, keeping the user updated on the status of the call and agent is also key. For voice calls over the web, where a browser or mobile interface is typically available, ensure the following information is displayed during a call.

  • Call status and quality: Ringing, connected, disconnected, call ended, network quality

  • Agent state: Listening, thinking, talking, interrupted

  • User stage: Un/muted, talking, noisy environment

Optionally, a running transcript of the call can be added but consider whether this is really necessary.

For voice calls over telephony systems or where visual feedback is not available, consider the following audio cues.

  • Call status: Ringing, connected, disconnected, on hold

  • Background media: Keyboard typing, office sounds, branded music

While the agent is thinking, putting the user on hold or playing background media lets them know the call hasn’t dropped and the agent hasn’t frozen. In either scenario, tailoring your conversation management to use phrases like “Let me put you on hold” or “Give me a minute to look that up” before calling the tools that put the user on hold or play media is crucial to a good user experience. Finally, as your agent transitions between turns or stages in a conversation, make sure it includes an audible transition phrase indicating success, failure or escalation before moving on.
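The hold flow described above can be sketched as follows. The phrases, event strings and timings are illustrative; in a real system `slow_lookup` would be a back-end tool call, and the events would drive your TTS and media layers rather than being collected in a list.

```python
import asyncio

HOLD_PHRASE = "Give me a minute to look that up."


async def slow_lookup(query: str) -> str:
    await asyncio.sleep(1.0)  # stands in for a slow back-end search
    return "Your parcel is due on Thursday."


async def answer_with_hold(query: str) -> list[str]:
    """Return the ordered audio events the caller would hear."""
    events = [f"say: {HOLD_PHRASE}"]  # announce the hold before the slow work
    done = asyncio.Event()

    async def hold_media() -> None:
        while not done.is_set():
            events.append("play: hold music")  # stand-in for branded audio
            await asyncio.sleep(0.4)

    media = asyncio.create_task(hold_media())
    result = await slow_lookup(query)  # slow work happens under the hold media
    done.set()
    await media
    events.append(f"say: {result}")
    return events


for event in asyncio.run(answer_with_hold("Where is my parcel?")):
    print(event)
```

The caller hears the hold phrase first, then looping media while the lookup runs, then the result, so there is never a silent gap long enough to suggest a dropped call.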

Regardless of their level of maturity, production voice systems should implement the features above. That said, less mature systems benefit most, as these features let you start testing your voice system in front of customers while minimising the risk of drop-offs. Meanwhile, your team can focus on optimising latency and the rest of the voice experience behind the scenes.

Key takeaways

Plug-and-play Voice AI almost never works for serious, enterprise-grade use. Simply wrapping a text agent with speech-to-text and text-to-speech won’t deliver a fluid, human-like conversation, especially if your stack can’t consistently hit sub 1 second response times. The biggest gains come from treating voice as its own problem: with the right use-cases, metrics, and architecture.

To set your voice initiatives up for success:

  • Be ruthless about where voice adds value. Map your existing customer journeys and pick low‑risk, high‑impact slices where voice is genuinely necessary, not just demo‑worthy. Define clear business outcomes up front.

  • Make metrics your source of truth. Establish and track a small, focused set of KPIs such as time‑to‑first‑token (TTFT), inter‑token latency, word error rate, task completion, and adoption/CSAT. Pair these with an agreed way to assess subjective qualities like tone, naturalness and on‑brand behaviour so stakeholders don’t end up debating “vibes”.

  • Invest in a voice-ready agent architecture. Decompose monolithic agents into smaller, observable components. Optimise model sizes, tool usage, and history per task, and support streaming and parallelism. This is what unlocks sub 1 second TTFT, easier experimentation, and future‑ready upgrades like speech‑to‑speech and new channels.

  • Design for human conversation, not web chat or IVR. Embrace the reality that real conversations are noisy, interruptible, and non‑linear. Build in interruption handling, open‑ended questioning, and flexible flows that let users drive the interaction rather than forcing them down brittle IVR‑style paths.

  • Keep users engaged while the system thinks. Use short, natural acknowledgement phrases, clear status indicators, and simple visual or audio cues (e.g., “Let me check that for you…”) to bridge pauses. This matters just as much as raw latency to avoid drop‑offs and maintain trust.

Voice can transform how customers interact with your organisation, but only if you treat it as a first‑class capability, not an add‑on.

03/25/2026