About
When a leading brand experience agency envisioned a real-time conversational AI avatar to embody its client’s product brand, it turned to us. The goal was an avatar capable of engaging users and answering their questions instantly. The twist: this avatar would be an audio chatbot aiming for human-like response latency, so that conversations feel natural. This initial phase served as a proof of concept, demonstrating that the avatar could be integrated into a website and provide real-time, intelligent responses to user inquiries.
Challenge
Our team faced several significant challenges during the project. The foremost was reducing response latency so that conversing with the avatar felt as natural as talking to a real person. Establishing smooth audio stream input and output between the backend and frontend also proved difficult. Lastly, integrating a 3D avatar with moving parts like a mouth and eyes presented its own set of difficulties.
Process
The journey began with a detailed discussion with the client to define the scope and objectives of the proof of concept (POC). Our dedicated team consisted of two backend AI developers, a project manager, and ad hoc frontend developers. With just three weeks to deliver, we were tasked with creating a functional product featuring a 2D, non-animated avatar capable of responding to questions with audio output.
From the outset, the backend developers took on the complex task of implementing a streaming architecture that would enable real-time processing of audio inputs and text outputs. They carefully selected and integrated the fastest components for each phase of the pipeline—speech-to-text, language model, and text-to-speech. Their work didn’t stop there; they continuously monitored and evaluated new component releases, ready to switch to better-performing alternatives as they became available.
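As a rough illustration of the streaming idea, the three pipeline phases can be chained as Python async generators so that each stage starts working as soon as the previous one yields something. This is a minimal sketch, not the project's implementation: `speech_to_text`, `language_model`, `text_to_speech`, and `microphone` are hypothetical stubs standing in for real components.

```python
import asyncio
from typing import AsyncIterator, List


async def microphone(chunks: List[bytes]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: yields raw audio chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield chunk


async def speech_to_text(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stub STT stage: emits a fake transcript for each audio chunk."""
    async for chunk in audio:
        yield f"{len(chunk)} bytes heard"


async def language_model(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stub LLM stage: streams a reply word by word instead of all at once."""
    async for text in transcripts:
        for word in f"You said: {text}".split():
            yield word


async def text_to_speech(words: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stub TTS stage: encodes each word as bytes in place of synthesized audio."""
    async for word in words:
        yield word.encode("utf-8")


async def run_pipeline(chunks: List[bytes]) -> List[bytes]:
    """Chain the stages; output is collected here, but could be streamed onward."""
    stream = text_to_speech(language_model(speech_to_text(microphone(chunks))))
    return [audio async for audio in stream]


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"\x00" * 320])))
```

Because every stage is a generator, swapping one component for a faster alternative only means replacing the corresponding function, which fits the team's habit of re-evaluating components as new releases appear.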
One of the innovative strategies devised by the backend team to reduce perceived latency involved developing a "short filler response" approach. By streaming a pre-recorded audio snippet (such as “That’s an interesting question” or “Hmm, let me think”) while querying the language model, they created the illusion of a more immediate response.
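The filler approach can be sketched with `asyncio`: the language model query starts in the background, the filler clip plays while it runs, and the real answer follows. This is an illustrative sketch under stated assumptions; `query_language_model` and `play_audio` are hypothetical placeholders, and the sleep durations merely simulate latency.

```python
import asyncio
import random
from typing import List

# Filler phrases taken from the examples in the text above.
FILLERS = ["That's an interesting question.", "Hmm, let me think."]


async def query_language_model(question: str) -> str:
    """Hypothetical LLM call; the sleep simulates model latency."""
    await asyncio.sleep(0.05)
    return f"Answer to: {question}"


async def play_audio(clip: str, played: List[str]) -> None:
    """Hypothetical audio output; records what was played, in order."""
    await asyncio.sleep(0.01)  # simulated playback time
    played.append(clip)


async def respond(question: str, played: List[str]) -> List[str]:
    # Start the model query first so it runs while the filler plays.
    llm_task = asyncio.create_task(query_language_model(question))
    await play_audio(random.choice(FILLERS), played)
    # By now part (or all) of the model latency has already elapsed.
    answer = await llm_task
    await play_audio(answer, played)
    return played


if __name__ == "__main__":
    print(asyncio.run(respond("What does the product do?", [])))
```

The key design point is ordering: the query is launched before the filler plays, so the filler's playback time overlaps with the model's processing time rather than adding to it.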
Additionally, they handled communication and format conversion between the backend and frontend components to ensure smooth audio stream input and output. This was achieved by assuming the audio stream output would be in PCM format at a fixed sample rate (currently 16 kHz) and then implementing the necessary format conversions and communication protocols around that assumption.
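To make the format conversion concrete, here is a minimal standard-library sketch that resamples floating-point samples to 16 kHz and packs them as PCM bytes. The 16-bit little-endian sample format, the 48 kHz source rate in the usage example, and the function names are assumptions for illustration; the text only specifies PCM at 16 kHz.

```python
import struct
from typing import List

TARGET_RATE_HZ = 16_000  # output rate stated in the text


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = TARGET_RATE_HZ) -> List[float]:
    """Resample by linear interpolation.

    Note: this applies no low-pass filter, so downsampling may alias;
    a production system would use a proper resampling library.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def float_to_pcm16(samples: List[float]) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack as 16-bit little-endian PCM (assumed format)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    captured = [0.0, 0.5, -0.5] * 16           # 48 samples at an assumed 48 kHz
    pcm = float_to_pcm16(resample_linear(captured, src_rate=48_000))
    print(len(pcm), "bytes of 16 kHz PCM")
```

Fixing one format at the boundary keeps both sides simple: each side converts to and from its own native representation and never needs to know what the other side uses internally.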
Meanwhile, the frontend developers played an equally vital role in the project. They focused on gathering audio from the browser and sending it to the backend, ensuring that the captured sound was transmitted accurately and efficiently. Moreover, they set up authentication mechanisms to ensure that only authorized users could access the system, adding an essential layer of security to the project.
Solution
In just three weeks, our team transformed the concept into reality, achieving the following milestones:
We implemented a real-time Speech-To-Text transcription system that could transcribe spoken words with an impressive delay of around 1 second. This quick turnaround ensured that users experienced a smooth and natural conversation with the avatar.
To enhance the avatar's responsiveness, we combined Claude 3 Haiku, which is optimized for speed and tasks like instant customer support, with GPT-4's advanced reasoning capabilities. This combination allowed the avatar to provide quick and intelligent answers, making the interaction more engaging and informative.
Our team achieved real-time Text-To-Speech audio streaming with a minimal delay of approximately 0.5 seconds. This swift response time was crucial in maintaining the conversational flow, ensuring that users felt as though they were speaking with a live person.
By introducing a "filler sentence" approach (described in the Process section), we effectively bridged the gap while the system processed user queries. This tactic brought the overall perceived latency down to around 1.5 seconds, enhancing the user experience by making the avatar's responses seem more immediate.
What’s Next?
The client is pleased with our POC results, so we are continuing this exciting journey. The next challenges on our roadmap are:
While effective, filler sentences can't be used constantly. We are exploring additional UX strategies to further reduce perceived latency.
We are working on adding conversation memory, knowledge grounding, and moderation capabilities to provide more contextually relevant and grounded responses.
Identifying a scalable and resilient audio streaming platform that ensures high audio quality remains our priority.
Sounds interesting? Stay tuned: we will soon publish the next case study with the final project outcome!