A conversational journey
The popularity of voice assistants is on the rise, but learning to design an intuitive and effective model is still a work in progress. How can organizations use the three Ts—training, testing, and tuning—to create more human-like voice assistants?
In the first article of our conversational AI series, we explored how the proliferation of voice assistants and messaging platforms is giving way to a new era of user interfaces (see the sidebar, “A five-part series on conversational AI”). Whether it’s in the car, a phone, or a smart home device, nearly 112 million US consumers rely on their voice assistants at least once a month, and that number continues to grow.1
Yet the popularity of voice assistants isn’t without its growing pains. These can range from the mundane, such as misinterpreting a request to order a roll of paper towels, to the more troubling error of providing a harmful health recommendation (or, conversely, providing an accurate but difficult-to-interpret recommendation).2 Despite the uptick in adoption of voice-enabled virtual assistants, designing effective products is a nontrivial endeavor. Virtual assistants often deal with multiple, sometimes complex scenarios that require understanding a range of queries, and users expect quick, accurate, and easily interpretable responses.
In our experience, designing an intuitive and effective voice assistant is not as straightforward as combining structured and unstructured data with powerful AI capabilities such as natural language processing (NLP) and machine learning. Instead, virtual voice assistants require designers to match their technical capabilities and resources with human intuition and oversight. Voice assistant design is both art and science. This means incorporating sociological and geographical factors (such as accounting for regional accents), and simultaneously ensuring these voice assistants are properly calibrated to deliver messages in a conversational manner (e.g., proper tone and tenor). In this article, we explore “three Ts” of designing dynamic and flexible voice assistants: training, testing, and tuning.
Over the next year, we will discuss the implications and use cases of conversational AI. In this chapter, we discuss the three Ts of developing effective voice assistants. The other chapters in the series draw on secondary research and case studies to explore the following topics:
Conversational AI makes its business case: The initial chapter of this series breaks down what constitutes conversational AI and the myriad ways companies can leverage its capabilities.
Acoustic authentication: Explains how conversational systems can enhance security protocols by integrating voice into the multiauthentication process.
Industry use cases: Highlights how virtual assistants appear to be changing the face of customer service in banking, technology, and health care.
The liability of conversational systems: Explores why, the more we integrate conversational bots into our work and lives, the more we should take steps to understand their liability in terms of insurance, training, auditing, and ethics.
There’s a paradox to designing voice assistants. While these assistants are underpinned by advanced AI and NLP capabilities, AI is only “smart” in a very narrow sense—that is, it is most effective at solving well-defined problems.3 But consider the nature of a conversation: It’s free-flowing, words and turns of phrase can take on multiple meanings based on context and tone, and at a moment’s notice, we can jump from one topic to another. So how do designers marry an expansive need, conversational interaction, with a traditionally narrow solution?
Human-assisted trainers. Perhaps a common misperception is that voice assistants need to be everything to everyone. Instead, most are asked to perform relatively specific tasks, such as responding to routine call center issues or helping people select an artist from their music library. With this in mind, designers can benefit from working directly with stakeholders to identify requirements and goals. At its core, this means solving well-defined problems that are easily tied to productivity measures (e.g., an airport voice assistant can measure how quickly and accurately it resolves customer queries).
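As a rough illustration of tying a voice assistant to productivity measures, the sketch below computes how accurately and how quickly a set of queries was resolved. The session fields (`resolved_correctly`, `seconds_to_resolve`) and the sample airport queries are hypothetical and are not taken from any real system.

```python
from statistics import median


def summarize_sessions(sessions):
    """Return the share of correctly resolved queries and the median time to resolve."""
    if not sessions:
        raise ValueError("at least one session is required")
    resolved = [s for s in sessions if s["resolved_correctly"]]
    return {
        "resolution_rate": len(resolved) / len(sessions),
        "median_seconds": median(s["seconds_to_resolve"] for s in sessions),
    }


# Hypothetical log entries for an airport voice assistant.
sessions = [
    {"query": "Where is gate B12?", "resolved_correctly": True, "seconds_to_resolve": 8.0},
    {"query": "Is my flight delayed?", "resolved_correctly": True, "seconds_to_resolve": 12.0},
    {"query": "Where can I find chaps?", "resolved_correctly": False, "seconds_to_resolve": 30.0},
    {"query": "Where is baggage claim?", "resolved_correctly": True, "seconds_to_resolve": 10.0},
]
summary = summarize_sessions(sessions)
```

Tracking both numbers together matters: a fast assistant that answers incorrectly, or an accurate one that takes too long, would each look acceptable on a single measure.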
In some of our earlier research, we found some of the best systems are designed directly with the communities that will interact with the AI solutions.4 That is, they benefit from making the human the focal point of the design process (also referred to as keeping the “human in the middle”). In the call center example, this means working with and observing how call center employees interact with customers. What are the routine inquiries? Are there more complex asks that trip employees up? When does confusion arise between employees and customers?
Understanding these common challenges empowers designers to map a high-level process flow of the call fulfillment process. As demonstrated in figure 1, these mappings create the underlying foundation for recording and organizing calls into a manageable data set populated with keywords and phrases.
Admittedly, figure 1 simplifies the data structure. But once designers are able to properly categorize these conversations, millions of recorded conversations can be translated into text and processed through mappings similar to this example.
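A mapping of this kind can be sketched as a lookup from keywords to business issues and from issues to resolutions. The categories, keywords, and resolutions below are hypothetical placeholders, not the contents of figure 1, and simple keyword overlap stands in for whatever matching a production system would use.

```python
import re

# Hypothetical issue categories with the keywords that signal them.
ISSUE_KEYWORDS = {
    "billing": {"bill", "charge", "refund", "invoice"},
    "password_reset": {"password", "login", "locked"},
}

# Hypothetical resolution attached to each business issue.
RESOLUTIONS = {
    "billing": "Route to billing review",
    "password_reset": "Send password reset link",
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def categorize(utterance):
    """Return the issue whose keywords overlap the utterance most, or None."""
    tokens = tokenize(utterance)
    best_issue, best_hits = None, 0
    for issue, keywords in ISSUE_KEYWORDS.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best_issue, best_hits = issue, hits
    return best_issue


issue = categorize("I forgot my password and now I'm locked out")
resolution = RESOLUTIONS.get(issue)
```

Utterances that match no keyword return `None`, which is a useful signal that the process flow may be missing a category.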
Training the right data for your AI solution. After designers map the high-level process flow, numerous data sources are processed to train the voice assistants. This starts with transcribing voice data to text and parsing it into “human utterances.” These utterances are stretches of speech separated by pauses in conversation, and they range from single words to clauses to complete sentences. As seen in figure 1, utterances can then be structured into business issues and resolutions.
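One way to picture the parsing step is to split a transcript wherever the gap between two words exceeds a pause threshold. The sketch below assumes the transcription step yields word-level timestamps; the `Word` type, the 0.6-second threshold, and the sample words are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def split_into_utterances(words, pause_threshold=0.6):
    """Group words into utterances, starting a new one after each long pause."""
    utterances = []
    current = []
    previous_end = None
    for word in words:
        long_pause = (
            previous_end is not None
            and word.start - previous_end >= pause_threshold
        )
        if long_pause:
            utterances.append(" ".join(current))
            current = []
        current.append(word.text)
        previous_end = word.end
    if current:
        utterances.append(" ".join(current))
    return utterances


# Hypothetical transcript with two pauses of roughly 0.8 seconds.
words = [
    Word("hello", 0.0, 0.4),
    Word("I", 1.2, 1.3),
    Word("lost", 1.35, 1.6),
    Word("my", 1.65, 1.8),
    Word("card", 1.85, 2.2),
    Word("please", 3.0, 3.4),
    Word("help", 3.45, 3.8),
]
utterances = split_into_utterances(words)
```

The result mixes a single word, a clause, and a short request, which mirrors the range of utterance lengths described above.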
After transforming the unstructured text into structured utterances, machine learning techniques, such as clustering analysis, create fine-grained groupings within the data to uncover common patterns in the conversation. At this point, supervised algorithms provide confidence scores that subject matter experts can validate and, when appropriate, use to correct machine learning conclusions. Taken together, keeping humans in the middle and applying machine learning creates the foundational insights that inform these prospective voice assistants.
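The review loop described here can be sketched as a simple triage: predictions above a confidence threshold are accepted, and the rest are queued for a subject matter expert whose corrections override the model. The threshold of 0.8, the record fields, and the sample labels are assumptions made for illustration.

```python
def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted and expert-review queues."""
    accepted, needs_review = [], []
    for prediction in predictions:
        if prediction["confidence"] >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review


def apply_corrections(needs_review, corrections):
    """Apply expert labels (utterance -> label); keep the model label otherwise."""
    return [
        {
            **prediction,
            "label": corrections.get(prediction["utterance"], prediction["label"]),
            "reviewed": True,
        }
        for prediction in needs_review
    ]


# Hypothetical model output for three utterances.
predictions = [
    {"utterance": "I lost my card", "label": "lost_card", "confidence": 0.95},
    {"utterance": "my card got eaten", "label": "billing", "confidence": 0.41},
    {"utterance": "cancel it please", "label": "cancel_card", "confidence": 0.62},
]
accepted, needs_review = triage(predictions)

# Hypothetical expert decision: one label is corrected, one is confirmed as-is.
reviewed = apply_corrections(needs_review, {"my card got eaten": "lost_card"})
```

The corrected records can then be fed back into the training data, so the expert's time is spent only where the model is least certain.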
Testing a conversational system, such as a voice assistant, involves more than ensuring that business issues are correctly mapped to resolutions. As many of us know from our own experiences, one-to-one conversations can easily be misinterpreted. If we aren’t familiar with an accent, we may misunderstand a question; if we are speaking to someone from a different geographical location, words can take on different meanings (for instance, “chaps” can mean a good friend or something a cowboy wears). Conversational systems are no different, except that, unlike us, they lack the ability to understand context.
For these reasons, designers should build quality assurance metrics that stress-test their models across a number of user personas, including:
All four dimensions show the importance of uncovering and accounting for implicit bias. If the algorithm doesn’t understand a specific accent, it may have been trained on a biased data set. In this case, the designers should work back to the training data to create a more inclusive design. Fortunately, the testing process can help bring these issues to light.
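A quality assurance check of this kind can be sketched by breaking test accuracy down by persona attribute and flagging any group that trails the best-performing group. The `accent` dimension, the group names, the 10-point gap, and the test results below are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(results, dimension):
    """Compute accuracy for each value of one persona dimension."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for result in results:
        group = result["persona"][dimension]
        totals[group] += 1
        correct[group] += int(result["correct"])
    return {group: correct[group] / totals[group] for group in totals}


def flag_gaps(group_accuracy, max_gap=0.1):
    """Return groups whose accuracy trails the best group by more than max_gap."""
    best = max(group_accuracy.values())
    return sorted(
        group for group, accuracy in group_accuracy.items()
        if best - accuracy > max_gap
    )


# Hypothetical test outcomes: accent_a is always understood, accent_b half the time.
results = (
    [{"persona": {"accent": "accent_a"}, "correct": True} for _ in range(4)]
    + [{"persona": {"accent": "accent_b"}, "correct": True} for _ in range(2)]
    + [{"persona": {"accent": "accent_b"}, "correct": False} for _ in range(2)]
)
by_accent = accuracy_by_group(results, "accent")
flagged = flag_gaps(by_accent)
```

A flagged group is a prompt to revisit the training data for that persona, not proof of its cause; small test samples can also produce gaps by chance.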
Voice assistants do not have to pass as humans, but they should be able to communicate in a pleasant and interpretable manner. In this spirit, designers can improve upon their voice assistants by tuning their models with a more natural delivery. Tuning a voice assistant includes:
These natural changes in prosody work in concert to make conversations more natural and inviting. And with the help of virtual assistants, designers can deliver helpful conversations at scale.
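Many text-to-speech engines accept Speech Synthesis Markup Language (SSML), a W3C standard whose `<prosody>` and `<break>` elements control rate, pitch, and pauses. The article does not name SSML or any particular engine, so the sketch below is only one possible way to express such tuning; the function name, default values, and sample sentences are assumptions, and individual engines differ in which attributes they honor.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(sentences, rate="medium", pitch="medium", pause_ms=300):
    """Wrap sentences in SSML with a shared rate and pitch and a pause between them."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # Escape text so characters such as "&" do not break the markup.
    body = pause.join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{body}</prosody></speak>"
    )


# Hypothetical response delivered slightly slower, with a 250 ms pause.
ssml = to_ssml(["Hi.", "Gate B & C."], rate="slow", pause_ms=250)
```

Keeping delivery settings as parameters lets designers adjust pacing between test rounds without rewriting the response text itself.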
Building an accurate and natural voice assistant is an iterative process. Although the work starts with training, it does not proceed in a straight line through testing and then tuning. Instead, each part of the process builds and iterates on the others. Implicit biases can creep in during training, but testing can help designers uncover and address them; and if pauses sound inappropriate during tuning, then the training data should be restructured to properly account for these natural breaks in conversation.
When designing your own voice assistants, remember:
By establishing a well-articulated goal, designers can continually improve upon their voice assistants to sound a bit more human with each iteration.