The multimodal future is still voice-first

One of the most difficult thing for technology industry observers to do is to hold simultaneously in their minds the possibility that multiple “hot” new technologies will actually succeed. The temptation is always to pit one trend against the other and determine which will win. The truth is that the future typically involves the mashing up up more than one of these buzzwords once they have gone through their respective hype cycles. Mobile and social. Open source software and closed app stores. QR codes and NFC. AR and VR. And so on. But even where technology trends intersect, overlap and blend together, keystone technologies tip the scales and give the future a shape. Voice is one of those keystone technologies.

Nowhere do I feel the need to clarify that things will be “both and” rather than “either or” more strongly than in the realm of contextual computing – voice interfaces, messaging, chatbots and predictive GUIs. After all, the adoption of human-human messaging (whether c2c or b2c) is a direct ramp to chatbots. Voice interfaces are really just a form of chatbot, and they can return a GUI menu for user confirmation. You can already ask Bixby to identify what it is you are looking at in your camera viewport. Apple Watch can automagically suggest actions on the Siri watch face that you could actively invoke with your voice and vice versa. The future of contextual computing is clearly multimodal.


SiriKit’s multimodal responses

Brian Roemmele – the Rafiki of voice – has coined this entire category of computing “voice-first.” It is a term that has proliferated far and wide on the interwebs as a rallying flag for the emergent voice interface tribe. Having spent time working from the messaging piece backwards at HeyNow and Layer, I was always a believer in voice, but I really latched onto the idea that voice-first didn’t mean voice-only. And it doesn’t. Brian has been very vocal about the need for other modalities alongside voice, and that we suddenly aren’t going to stop using screens or typing altogether.

Yet there is something about voice in particular that feels different, and it wasn’t until yesterday’s Siri section in the WWDC keynote that I was able to really put my finger on it. Apple demoed Siri Suggestions on Monday – where Siri begins to learn about actions you take in apps and making contextually relevant suggestions as to what next actions you might want to take at a given point in time (context being a function of past usage patterns and the current state of your machine). And while represents a laudable improvement to the way iOS helps you make use of apps, it lays bare the limitations of an approach that does not put voice at the center of human computer interaction, however multimodal it may end up being.

Screen Shot 2018-06-05 at 10.44.30 AM.png
Siri Suggestions


Smartphone GUIs are paradoxically “single tracked” in that they demand your full attention, and yet smash your attention into dozens of pieces across apps, notifications and other stimuli. Even the most perfectly tuned GUI – with options and actions triaged ruthlessly by your own personalized context such as we are seeing with Siri Suggestions – at the same time absorbs you completely in the machine’s understanding of the world such that you can’t do anything else and bombards your eyes with stimuli. I’m not sure about you, but even as I go through well-tread workflows in apps I know inside and out to get stuff done, a sense of anxiety, distraction and mild panic is not far behind the leading edge of my perception. I feel like I am running on an ever quickening treadmill, constantly trying to outrun a robotic Red Queen who’s speed and parallelism leaves my wetware in the dust. My attention reserves are depleted each time I look at and interact with a screen, no matter how well designed and tuned.

Voice interfaces, on the other hand, are “dual tracked” in that you can do something else while engaging (driving, cleaning, working out, just passing by). And yet funnily enough, this dual tracked nature does not contribute to sensory overload or multitasking drowning, it rather focuses all inputs and outputs of the machine into a single, linear thread – just like the way the human mind works. Speaking to a computer and hearing responses – even ones that come with visual affordances – is a development in human computer interaction that most closely resembles the way we think. You can only have one thought at a time, only hear one thing at a time and only say one thing at a time. Indeed thoughts and speech are intertwined in a strange loop with one another, with Broca’s region (our internal voice) both shaping and being shaped by our speech. Do we speak our thoughts? Or do we think in words?

As our attention continues to fragment, even looking at a screen to evaluate Siri Suggestions and acting on that “next best action” is going to strain us. No matter the amount of personalization or context used to render visually options and actions to the user, the attentional price will always be higher than speaking. GUI will never go away, in fact in the AR world the entire FOV will be a GUI. But to deal with that overstimulation, the ultimate skeuomorphism will need to emerge for computers to interact with us the same way we think – that is, the same way we talk to each other and ourselves.

We’ll point our camera (or look with our glasses) at a thing and ask our assistant about it. Our assistant may present a notification to quietly nudge us about a recommended next action, but we will engage with it fully with our voice to get an answer to our question or unambiguously express our intent without futzing around with the interface. As we get ready in the morning, we will compose wildly complex queries by speaking a short sentence to our assistant, and have it resolved on our behalf without lifting more brain cells than required to express that need. Voice will be the shortest distance between a user declaring she has a job to be done and the computer working out how to do it for her. And in doing so, voice will become the first interface among equals in our multimodal future.


One thought on “The multimodal future is still voice-first

  1. Ben: Well thought out and written. I especially liked “Voice will be the shortest distance”. Clever and concise. In the B2B world, I envision a time when Voice is the default, and tactile entry required only to drill into details. Truly multimodal, driven by context, intent, and preference.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s