"There's many a slip between the cup and the lip!"
A key challenge in the design of a voice interface is that people tend to behave with it as if it were 'human'. This is part of a broader and complex human tendency to 'anthropomorphize' - to give human-like qualities to - the world around them, from nature to cars. With computers and machines especially, we seem to quickly develop a sense of their being 'sentient' - 'living' or 'conscious' things - and often of their having a 'personality'. (This human phenomenon has been well explored by researchers such as the late Clifford Nass in his books 'The Media Equation' and 'The Man Who Lied to His Laptop', both of which I highly recommend.)
It's fairly obvious when you think about it. We tend to see a display with lights and buttons as some form of 'streaming media', due to our familiarity with the television and the cinema. But a human voice, even if it emerges from a screen, seems to have an 'interactive quality' to it. It invites us into a conversation, and our minds involuntarily shift into a very different mode: we sense an emotional connection with the voice and, consequently, with the device it comes from. We tend to make snap judgments about the 'speaker': his or her gender, age, or even cultural or racial background. From hearing a single sentence, the user could even make some assumptions about the speaker's 'personality' - is she or he calm or aggressive, focussed or confused, open-minded or not? And after a couple of conversational exchanges with a voice interface, this sense of the speaker's personality extends towards something 'human' or 'alive'. In time, some of us might even sense the speaker's political orientation.
Voice has such a powerful and primitive effect on us. It's an interface unlike any other that we have developed in the history of technology.
Only with voice does the user immediately develop an expectation of sentience from the machine. By 'sentience', I mean that the user starts to expect the interface to have a human-like consciousness - she expects the interface to 'know' and 'understand' things about her without explicitly being told about those things.
Remember, the other side of user expectation is user frustration. So, a user who is driving would get frustrated with a voice interface that doesn't adapt to changing road conditions, or that doesn't 'shut up and not bother the driver' when the traffic gets heavy.
And this is where today's voice interfaces let us down: just as the user is warming up to the apparent sentience of the system, it reveals its machine heart and disappoints the user.
How do we get past this? As always in design, we are going to have to fake it till we make it. So here are a few tips on how to fake 'humanness' in conversational interfaces.
Conversational interfaces should be invoked spontaneously
Humans don't usually wait for an order to speak up. And so, conversational interfaces can't be perceived as 'conversational' if they aren't expressive. Showing expressiveness will become easier with deeper sensor and IoT integration into everyday technology.
Conversational interfaces should be spontaneous, maybe even a teeny bit unpredictable
The above response from Siri was touching the first time, felt machine-like the second time around, and was annoying when it came up the third time. Human perception of what is entertaining or interesting is not consistent. In an age of constrained attention we are always on a slippery slope to annoyance. Sustained and enjoyable unpredictability in voice interfaces comes down to much more than alternating between a handful of pre-recorded responses.
An important digression: making interfaces spontaneous is a massively new and unexplored frontier in interaction design. Remember that predictability has been a cornerstone characteristic of all machines, and of almost all digital and other interfaces, up until the arrival of the conversational interface. Embedding human qualities like spontaneity, unpredictability, impulsivity and temperamentality into machines is an entirely new design challenge. With conversation, designers will have to create consistent, yet not entirely predictable, interfaces.
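To make 'consistent, yet not entirely predictable' a little more concrete, here is a minimal Python sketch of the most basic version of the idea: the meaning of the response stays the same, while the wording varies and never repeats one of the last few phrasings used. The class name `VariedResponder`, the `memory` parameter and the sample phrasings are all my own hypothetical choices for illustration, not how Siri or any real assistant works, and real spontaneity will need far more than this.

```python
import random
from collections import deque


class VariedResponder:
    """Hypothetical helper: same intent, varied wording, no recent repeats."""

    def __init__(self, variants, memory=2, rng=None):
        # We need at least one variant left over after excluding recent ones.
        if len(variants) <= memory:
            raise ValueError("need more variants than memory slots")
        self.variants = list(variants)
        # Remembers only the last `memory` phrasings that were spoken.
        self.recent = deque(maxlen=memory)
        self.rng = rng or random.Random()

    def respond(self):
        # Consistent: every candidate expresses the same intent.
        # Not entirely predictable: the wording is picked at random,
        # excluding anything the user heard very recently.
        candidates = [v for v in self.variants if v not in self.recent]
        choice = self.rng.choice(candidates)
        self.recent.append(choice)
        return choice


# Assumed sample phrasings for a single 'thanks' intent.
thanks = VariedResponder([
    "You're welcome.",
    "Happy to help.",
    "Any time.",
    "No problem at all.",
])
```

Calling `thanks.respond()` repeatedly returns a different wording each time, and never one of the two most recent ones. Even this small amount of state already goes beyond blindly cycling through recordings, which is the point of the digression above.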
Unpredictable yes, but not too much!
Unpredictability can be fun, but too much variance, especially off-topic variance, can be downright off-putting. Imagine if Siri's response above came at a time when the user was fumbling for a shortcut while stressing over a deadline at three in the morning.
Conversational interfaces will evolve to be increasingly anticipative of context (what the user is doing, time of day, social situation, ongoing tasks, motivations, interests and goals).
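As a rough illustration of what 'anticipating context' could mean in practice, the Python sketch below gates whether the interface speaks up at all, based on a few context signals. Everything here is an assumption made for the example: the `Context` fields, the `should_speak` function and the quiet-hours thresholds are hypothetical, and a real system would infer these signals from sensors and usage history rather than receive them as tidy flags.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical snapshot of what the interface knows about the moment."""
    driving: bool = False
    heavy_traffic: bool = False
    user_busy: bool = False  # e.g. in the middle of a demanding task
    hour: int = 12           # local time, 0-23


def should_speak(ctx: Context, urgent: bool = False) -> bool:
    """Decide whether a spontaneous remark is welcome right now."""
    # Urgent messages (assumed to be rare) always get through.
    if urgent:
        return True
    # 'Shut up and don't bother the driver' when traffic gets heavy.
    if ctx.driving and ctx.heavy_traffic:
        return False
    # Don't interrupt someone who is fumbling with a deadline.
    if ctx.user_busy:
        return False
    # Assumed quiet hours: before 7 in the morning, or from 11 at night.
    if ctx.hour < 7 or ctx.hour >= 23:
        return False
    return True
```

With these assumptions, `should_speak(Context(driving=True, heavy_traffic=True))` is `False`, while the same call with `urgent=True` is `True`. The design choice worth noticing is that silence is treated as a first-class response: the interface earns its spontaneity by knowing when not to use it.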
Note: Header background image: The new Echo Dot is a clear play by Amazon to use voice to position itself to own the primary interface to connected consumers. (Image courtesy: Amazon.com)