It’s undeniable that people like to text and type. It’s a very convenient way to communicate on most occasions, but it’s a fairly new habit, sparked by the availability of devices that make texting and typing practical. Until 10 years ago, customer service was delivered only over voice. Text-based communication is rising in importance as more people adopt it, but it has not supplanted voice as the main channel of customer service communication.
After all, if you get upset about a service or a product, it’s difficult to yell while typing: you could use ALL CAPS but, somehow, it’s not as satisfying. Jokes aside, voice is what people use when they need real-time feedback, and it lets them convey information much faster than chat. If you type fast and well, you can manage 40 words per minute, but an average talker will speak 150 words in the same time, almost four times as many (even excluding from the stats the end of pharmaceutical commercials). Finally, when all else fails, people pick up the phone and call. So one could say that while voice calls may be going down as a percentage of total interactions, their importance is actually going up.
Also, there are occasions when it’s OK to talk, but not to type: don’t text and drive! Although there are also occasions when texting is the only way to communicate, like at a Metallica concert…
So, voice has a big role, and in particular voice over the telephone network: dialing 1-800-SUPPORT is still the easiest way to reach customer service. Adding voice support to bots is consequently a great way to expand the reach of conversational technology in the customer service domain to the 50%+ of communications that are currently out of reach.
For bots, voice is harder than text. While voice can be transcribed into text rather easily by an ASR (automatic speech recognition) engine and the transcription can be fed to the bot’s AI, this is still an additional step that needs to be integrated into the system. There are also several TTS (text-to-speech) services that can be used to convert the bot’s answers back to voice, which is yet another step.
What’s more, the knowledge base and AI training for text and voice do not completely overlap: we say things and use turns of phrase while speaking that we wouldn’t use while typing. On the other hand, the ASR will not make the typographical mistakes that are common in chat and must be accounted for by chatbot engines. But these are issues that can be overcome with better AI training. We at Interactive Media know this, since we support both voice and chat in our conversational Virtual Agents.
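The extra steps described above can be pictured as a three-stage pipeline: ASR, then the bot, then TTS. What follows is only a minimal sketch under stated assumptions: `handle_voice_turn`, `VoiceTurn`, and the three `fake_*` stages are hypothetical names invented for illustration, and a real deployment would replace the stand-ins with calls to actual ASR, bot, and TTS services.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real stages would wrap vendor services.
Transcriber = Callable[[bytes], str]   # ASR: caller audio -> text
BotEngine = Callable[[str], str]       # bot AI: user text -> reply text
Synthesizer = Callable[[str], bytes]   # TTS: reply text -> audio


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def handle_voice_turn(audio: bytes, asr: Transcriber, bot: BotEngine,
                      tts: Synthesizer) -> VoiceTurn:
    """Run one caller utterance through ASR, the bot, and TTS in sequence."""
    transcript = asr(audio)          # extra step 1: speech to text
    reply_text = bot(transcript)     # the same text bot used for chat
    reply_audio = tts(reply_text)    # extra step 2: text back to speech
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_bot(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = handle_voice_turn(b"hello", fake_asr, fake_bot, fake_tts)
```

Keeping each stage behind a plain callable means the text bot stays unchanged, and the ASR or TTS provider can be swapped without touching the rest of the flow.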
More challenging still, the system needs to be very responsive for voice. No one would object to a 10-second pause between typing a message and receiving a response, but try that with voice! The integration therefore needs to be architecturally sound and fast. Nor are all ASR systems created equal: while the recognition performance of the latest ASRs is uniformly quite good, some systems have an advantage for specific tasks. For instance, Google Speech APIs excel at recognizing addresses due to their integration with Google Maps. It makes sense to use different ASR vendors for different parts of an application.
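One way to picture both points, per-task ASR selection and responsiveness, is a small routing table plus a timing check. This is a sketch only: the task names, the two stand-in recognizers, and the one-second budget are assumptions made for illustration, not figures or APIs from any vendor.

```python
import time
from typing import Callable, Dict, Tuple

Transcriber = Callable[[bytes], str]


# Stand-in recognizers; real ones would call different ASR vendors.
def asr_general(audio: bytes) -> str:
    return "general:" + audio.decode("utf-8")


def asr_address(audio: bytes) -> str:
    return "address:" + audio.decode("utf-8")


# Hypothetical routing table: dialog task -> preferred recognizer.
ASR_BY_TASK: Dict[str, Transcriber] = {"address": asr_address}


def pick_asr(task: str) -> Transcriber:
    """Return the recognizer for this task, or the general one by default."""
    return ASR_BY_TASK.get(task, asr_general)


def timed_transcribe(task: str, audio: bytes,
                     budget_s: float = 1.0) -> Tuple[str, bool]:
    """Transcribe and report whether the call stayed within the budget.

    budget_s is an assumed target, chosen here only for illustration.
    """
    start = time.monotonic()
    text = pick_asr(task)(audio)
    elapsed = time.monotonic() - start
    return text, elapsed <= budget_s


text, within_budget = timed_transcribe("address", b"10 Main St")
```

Measuring each stage separately like this makes it easier to see which step is consuming the time a caller spends waiting in silence.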
And then, there is the telephone network to deal with. There are certainly RESTful telephony APIs that are easily integrated into a conversational system, but at volume they can be expensive. Also, the companies deploying the bot usually already have their own telephony infrastructure, and it doesn’t make sense to overhaul it for the sake of the bot. Whether it is implemented through a local switch (PBX) or a SIP trunk from a carrier, telephony is more challenging to integrate with than a purely HTTP-based interface.
Finally, if the interaction does not complete within the self-service conversational domain, it will need to be forwarded to a human agent. This implies not only forwarding the call to a Contact Center suite (usually over SIP), but also passing along the context gathered so far, and for this an integration with the CTI (computer telephony integration) interface of the Contact Center is needed.
So, there are several factors that contribute to making voice and telephony for bots a complex proposition.
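To make the idea of passing context concrete, here is a minimal sketch of what the bot might hand over alongside the transferred call. Everything in it is an assumption for illustration: the `HandoffContext` fields and the JSON shape are hypothetical, and an actual CTI interface defines its own data format and transport.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class HandoffContext:
    """Hypothetical summary of what the bot learned before escalating."""
    call_id: str
    caller_number: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)
    transcript: List[str] = field(default_factory=list)


def build_handoff_payload(ctx: HandoffContext) -> str:
    """Serialize the context so it can accompany the transferred call."""
    return json.dumps(asdict(ctx), sort_keys=True)


ctx = HandoffContext(
    call_id="call-001",
    caller_number="+15550100",
    intent="billing_dispute",
    slots={"account": "42"},
    transcript=["I was charged twice."],
)
payload = build_handoff_payload(ctx)
```

With something like this attached to the call, the human agent can see the caller's intent and the details already collected rather than asking for them again.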