Giving chatbots the gift of voice

A battle of the bots

Chatbots are proliferating like rabbits. Personally, I have lost count, but according to Gartner, at the time of writing between 1,500 and 2,000 companies worldwide have, in the past couple of years, developed a chatbot platform that they offer their customers as the base for applications. Of course, not all of them are good, and one-shot question-and-answer bots abound. But some bots use a well-designed knowledge base and an AI-based semantic and learning infrastructure to really recognize and understand what people type, keep the context, and follow up with more questions if the initial meaning is unclear.

But even these “good chatbots” are mostly text-based. While chat usually refers to text (embedded in a website, over a dedicated chat platform like Facebook Messenger or WhatsApp or even via text messages), it’s important to recognize that voice — and in particular voice over the telephone network — is still a big part of how customers interact with businesses.

At the beginning of 2019, voice-enabled bots are still the exception and not the rule. But the time is fast approaching when omni-bots (which can manage voice and text conversations equally well) will emerge victorious from this “battle of the bots”. This in turn will be a factor in deciding the winners in the inevitable shake-out that the conversational interactions industry will experience in the next couple of years.

Why voice?

It’s undeniable that people like to text and type. It’s a very convenient way to communicate on most occasions, but it’s a fairly new habit, sparked by the availability of devices that let people text and type to communicate. Until 10 years ago, customer service happened only over voice. So text-based communication is rising in importance as more people adopt it, but it has not supplanted voice as the main channel of customer service communication.

After all, if you get upset about a service or a product, it’s difficult to yell while typing: you could use ALL CAPS but, somehow, it’s not as satisfying. Jokes aside, voice is what people use when they need real-time feedback, and voice lets people convey information much faster than chatting: if you type fast and well, you can manage 40 words per minute, but an average talker will speak 150 words in the same minute (even excluding from the stats the end of pharmaceutical commercials). Finally, when all else fails, people pick up the phone and call, so one could say that while voice calls may be going down as a percentage of total interactions, their importance is actually going up.

Also, there are occasions when it’s OK to talk, but not to type: don’t text and drive! Although there are also occasions when texting is the only way to communicate, like at a Metallica concert…

So, voice has a big role, and in particular voice over the telephone network: dialing 1-800-SUPPORT is still the easiest way to reach customer service. Adding voice support to bots is consequently a great way to expand the reach of conversational technology in the customer service domain to the 50%+ of communications that are currently out of reach.

The challenges of voice

For bots, voice is harder than text. While voice can be transcribed into text rather easily by an ASR (automatic speech recognition) engine and the transcription can be fed to the bot’s AI, this is still an additional step that needs to be integrated into the system. There are also several TTS (text-to-speech) services that can be used to convert the bot’s answers back to voice, which is yet another step.
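To make the two extra steps concrete, here is a minimal sketch of one voice turn: audio goes through ASR, the transcript goes to the existing text bot, and the reply goes through TTS. The names `handle_voice_turn`, `VoiceTurn`, and the three stub stages are hypothetical and exist only for illustration; a real deployment would wrap vendor services behind the same signatures.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real ASR, bot, and TTS services would be
# wrapped so that they match these shapes.
Transcriber = Callable[[bytes], str]   # ASR: caller audio -> text
BotEngine = Callable[[str], str]       # chatbot: user text -> reply text
Synthesizer = Callable[[str], bytes]   # TTS: reply text -> audio


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def handle_voice_turn(
    audio: bytes, asr: Transcriber, bot: BotEngine, tts: Synthesizer
) -> VoiceTurn:
    """Run one caller utterance through ASR, the text bot, and TTS."""
    transcript = asr(audio)        # extra step 1: speech to text
    reply_text = bot(transcript)   # the unchanged text-based bot
    reply_audio = tts(reply_text)  # extra step 2: text to speech
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stub stages (assumptions) so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_bot(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `handle_voice_turn(b"hello", fake_asr, fake_bot, fake_tts)` exercises the full chain. The point of the design is that the text bot in the middle does not need to know it is being used over voice.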

What’s more, the knowledge base and AI training for text and voice do not completely overlap: we say things and use turns of phrase while speaking that we wouldn’t use while typing; on the other hand, the ASR will not make the typographical mistakes that are common in chat and must be accounted for by chatbot engines. But these are issues that can be overcome with better AI training. We at Interactive Media know this, since we support both voice and chat in our conversational Virtual Agents.

More challenging, the system needs to be very responsive for voice: while no one would object to a 10-second pause between typing a message and receiving a response, try that with voice! So the integration needs to be architecturally sound and fast. Nor are all ASR systems created equal: while the recognition performance of the latest ASRs is uniformly quite good, some systems have an advantage for specific tasks. For instance, Google Speech APIs excel in recognizing addresses due to their integration with Google Maps. It therefore makes sense to use different ASR vendors for different parts of an application.
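One way to picture both points, per-task ASR selection and responsiveness, is a small router that picks an engine by dialogue step and measures how long the recognition took. This is only a sketch under assumptions: `AsrRouter`, the step name `collect_address`, and the stub engines are hypothetical, and no real vendor API is called.

```python
import time
from typing import Callable, Dict, NamedTuple

Transcriber = Callable[[bytes], str]  # ASR: caller audio -> text


class AsrResult(NamedTuple):
    engine: str     # name of the engine that handled this utterance
    text: str       # transcript returned by that engine
    seconds: float  # wall-clock time spent in the engine


class AsrRouter:
    """Pick an ASR engine per dialogue step and time each recognition."""

    def __init__(
        self,
        engines: Dict[str, Transcriber],
        routes: Dict[str, str],
        default: str,
    ) -> None:
        # Fail early if a route or the default names an unknown engine.
        for name in list(routes.values()) + [default]:
            if name not in engines:
                raise ValueError(f"unknown ASR engine: {name}")
        self.engines = engines
        self.routes = routes
        self.default = default

    def transcribe(self, step: str, audio: bytes) -> AsrResult:
        name = self.routes.get(step, self.default)
        start = time.perf_counter()
        text = self.engines[name](audio)
        return AsrResult(name, text, time.perf_counter() - start)


# Stub engines (assumptions) standing in for two different vendors.
def general_asr(audio: bytes) -> str:
    return "I want to change my delivery"


def address_asr(audio: bytes) -> str:
    return "12 Main Street"


router = AsrRouter(
    engines={"general": general_asr, "address": address_asr},
    routes={"collect_address": "address"},  # hypothetical dialogue step
    default="general",
)
```

Here `router.transcribe("collect_address", audio)` goes to the address-oriented engine, while any other step falls back to the general one. The `seconds` field gives the application a number to compare against whatever response-time budget it sets for voice.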

And then, there is the telephone network to deal with. There are certainly RESTful telephony APIs that are easily integrated into a conversational system, but at volume they can be expensive. Also, the companies deploying the bot usually already have their own telephony infrastructure, and it doesn’t make sense to overhaul it for the sake of the bot. Whether it is implemented through a local switch (PBX) or a SIP trunk from a carrier, telephony is more challenging to integrate with than a purely HTTP-based interface.

Finally, if the interaction does not complete within the self-service conversational domain, it will need to be forwarded to a human agent. This implies not only forwarding the call to a Contact Center suite (usually over SIP), but also passing along the context gathered so far, and for this an integration with the CTI interface of the Contact Center is needed.
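As an illustration of what "the context gathered so far" might contain, the sketch below bundles it into a small record that can be serialized and restored on the agent side. Every field name here is an assumption made for the example, and how the payload actually reaches the agent desktop depends on the Contact Center product and its CTI interface, which is not shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HandoffContext:
    """Hypothetical record of what the bot learned before escalating."""

    call_id: str        # identifier used to match the payload to the call
    caller_number: str
    intent: str         # what the bot believes the caller wants
    slots: Dict[str, str] = field(default_factory=dict)   # collected data
    transcript: List[str] = field(default_factory=list)   # dialogue so far

    def to_json(self) -> str:
        # Sorted keys keep the payload stable and easy to compare in logs.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "HandoffContext":
        return cls(**json.loads(payload))


# Example values are invented for illustration only.
context = HandoffContext(
    call_id="call-0001",
    caller_number="+15550100",
    intent="change_delivery_address",
    slots={"order_id": "A123"},
    transcript=["caller: I want to change my delivery"],
)
payload = context.to_json()
```

The agent side can rebuild the same record with `HandoffContext.from_json(payload)`, so the human agent sees what the caller already told the bot instead of starting over.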

So, there are several factors that contribute to making voice and telephony for bots a complex proposition.

An offer to help

Interactive Media knows a lot about voice and integration with other voice platforms. We started with voice applications, telephony and customer experience in 1996, so we have both long experience in what it takes to integrate successfully with the telephone network (it was the only game in town then!) and a super-solid platform that has evolved to incorporate the latest architectures and protocols into a proven foundation for all voice communication.

We also have a platform for conversational applications with several sizable deployments, both for voice and chat. This has helped us understand which features of the telephony platform have the most impact on bots and optimize them accordingly.

So the idea is simple: Interactive Media is on a mission to help chatbots add voice to their repertoire. This starts with telephony integration, of course, but continues with speech transcription and generation if necessary, and integration with Contact Center platforms (we integrate natively with several of the most common ones). Our software is ready in the cloud, but it’s also easy to install on premises if the project requires it.

We are looking forward to giving all deserving chatbots the gift of voice.