The multimodal future is still voice-first

One of the most difficult things for technology industry observers to do is to hold simultaneously in their minds the possibility that multiple “hot” new technologies will actually succeed. The temptation is always to pit one trend against the other and determine which will win. The truth is that the future typically involves the mashing up of more than one of these buzzwords once they have gone through their respective hype cycles. Mobile and social. Open source software and closed app stores. QR codes and NFC. AR and VR. And so on. But even where technology trends intersect, overlap and blend together, keystone technologies tip the scales and give the future a shape. Voice is one of those keystone technologies.

Nowhere do I feel the need to clarify that things will be “both and” rather than “either or” more strongly than in the realm of contextual computing – voice interfaces, messaging, chatbots and predictive GUIs. After all, the adoption of human-human messaging (whether c2c or b2c) is a direct ramp to chatbots. Voice interfaces are really just a form of chatbot, and they can return a GUI menu for user confirmation. You can already ask Bixby to identify what it is you are looking at in your camera viewport. Apple Watch can automagically suggest actions on the Siri watch face that you could actively invoke with your voice and vice versa. The future of contextual computing is clearly multimodal.


SiriKit’s multimodal responses

Brian Roemmele – the Rafiki of voice – has dubbed this entire category of computing “voice-first.” It is a term that has proliferated far and wide on the interwebs as a rallying flag for the emergent voice interface tribe. Having spent time working from the messaging piece backwards at HeyNow and Layer, I was always a believer in voice, but I really latched onto the idea that voice-first didn’t mean voice-only. And it doesn’t. Brian has been very vocal about the need for other modalities alongside voice, and about the fact that we aren’t suddenly going to stop using screens or typing altogether.

Yet there is something about voice in particular that feels different, and it wasn’t until yesterday’s Siri section in the WWDC keynote that I was able to really put my finger on it. Apple demoed Siri Suggestions on Monday – where Siri begins to learn about actions you take in apps and makes contextually relevant suggestions as to what next actions you might want to take at a given point in time (context being a function of past usage patterns and the current state of your machine). And while this represents a laudable improvement to the way iOS helps you make use of apps, it lays bare the limitations of an approach that does not put voice at the center of human-computer interaction, however multimodal it may end up being.

Siri Suggestions


Smartphone GUIs are paradoxically “single tracked” in that they demand your full attention, and yet smash your attention into dozens of pieces across apps, notifications and other stimuli. Even the most perfectly tuned GUI – with options and actions triaged ruthlessly by your own personalized context such as we are seeing with Siri Suggestions – at the same time absorbs you completely in the machine’s understanding of the world such that you can’t do anything else and bombards your eyes with stimuli. I’m not sure about you, but even as I go through well-trodden workflows in apps I know inside and out to get stuff done, a sense of anxiety, distraction and mild panic is not far behind the leading edge of my perception. I feel like I am running on an ever quickening treadmill, constantly trying to outrun a robotic Red Queen whose speed and parallelism leave my wetware in the dust. My attention reserves are depleted each time I look at and interact with a screen, no matter how well designed and tuned.

Voice interfaces, on the other hand, are “dual tracked” in that you can do something else while engaging (driving, cleaning, working out, just passing by). And yet funnily enough, this dual tracked nature does not contribute to sensory overload or multitasking drowning; rather, it focuses all inputs and outputs of the machine into a single, linear thread – just like the way the human mind works. Speaking to a computer and hearing responses – even ones that come with visual affordances – is a development in human-computer interaction that most closely resembles the way we think. You can only have one thought at a time, only hear one thing at a time and only say one thing at a time. Indeed thoughts and speech are intertwined in a strange loop with one another, with Broca’s area (our internal voice) both shaping and being shaped by our speech. Do we speak our thoughts? Or do we think in words?

As our attention continues to fragment, even looking at a screen to evaluate Siri Suggestions and acting on that “next best action” is going to strain us. No matter the amount of personalization or context used to visually render options and actions to the user, the attentional price will always be higher than speaking. GUIs will never go away; in fact, in the AR world the entire field of view (FOV) will be a GUI. But to deal with that overstimulation, the ultimate skeuomorphism will need to emerge for computers to interact with us the same way we think – that is, the same way we talk to each other and ourselves.

We’ll point our camera (or look with our glasses) at a thing and ask our assistant about it. Our assistant may present a notification to quietly nudge us about a recommended next action, but we will engage with it fully with our voice to get an answer to our question or unambiguously express our intent without futzing around with the interface. As we get ready in the morning, we will compose wildly complex queries by speaking a short sentence to our assistant, and have it resolved on our behalf without lifting more brain cells than required to express that need. Voice will be the shortest distance between a user declaring she has a job to be done and the computer working out how to do it for her. And in doing so, voice will become the first interface among equals in our multimodal future.


Siri and all her friends: why it’s SiriOS or bust this WWDC

“A wizard is never late, nor is he early. He arrives precisely when he means to.” – Gandalf

Siri remains the biggest liability turned threat Apple has faced in quite some time. It’s clear from FB, Google I/O and Microsoft Build (let alone the blistering pace of progress of Amazon Alexa) that Apple needs to move quickly to close the gap before it’s too late. And while it’s clear that digital assistants on smartphones don’t quite matter yet, the day when users begin to change their purchase behavior on the basis of assistance is drawing near. One can’t help but feel that this WWDC is a make-or-break moment for Siri and for Apple.

What began as a multi-year lead has given way to a serious deficit compared to erstwhile competitors Alexa, Cortana and Google Assistant. iPhone user satisfaction with Siri is dramatically lower than their overall satisfaction, but that should not comfort the paranoid in Apple’s executive team. Safe for now, Apple’s formidable iOS ecosystem stands to face serious competitive pressure when the basis of competition shifts underneath its feet, as it appears to be doing in the case of assistance. Left without a major upgrade in capability as a platform in its own right rather than simply an appendage to iOS, Siri will be alone in its fight against the other assistants, fighting with a vastly smaller data corpus and with far less mature cloud and data practices internally.

Apple struggles with all things cloud services, machine learning and data. Siri unfortunately relies quite heavily on all three, and as a result, its ability to even correctly transcribe my words lags the field significantly. This will always be an issue, and while Apple basically needs to build or buy its way out of this deficiency, its strategic ace in the hole does not necessarily require it to beat Alexa or Google Assistant’s voice-to-text capability overnight. By leveraging the power of arguably the most important and robust consumer-facing developer platform, community and economy in history, iOS, Apple can bring to bear an ecosystem that will unlock differentiated, delightful conversational user experiences.

Rather than renting space on Alexa or Facebook Messenger, iOS developers can leverage the master assistant as a sort of “router” to assistance experiences totally owned and controlled by that business. Siri gains new superpowers to help get users’ jobs done: asking for help from the App Store, iOS’ crown jewel. By leaning on its developer community rather than trying to be itself the smartest AI out there, Apple can securely, richly and sustainably deliver the science fiction style digital assistants we’ve envisioned. We have Alexa, Erica, Cortana, Eno, Luvo, Cleo and more, and so rather than winning a platform war with a better product, Apple can win it with a superior ecosystem. It can win it with SiriOS.

SiriOS – Siri operating system – would more or less be a rewritten SiriKit, sans the domain guardrails and with the capability for some sort of developer (and possibly user) defined intents and ontologies. The new Siri “applet” would require the app to be installed on the device. From third party audio apps to shopping experiences and beyond, giving developers the type of flexibility afforded by other assistant platforms would return Apple to pole position, if only in the nick of time. Rather than Apple being threatened by Alexa, Amazon would become Apple’s best friend in the voice world. There’s no need to go so far as Android’s recently announced ability to set other assistants like Alexa and Cortana as the primary. “Hey Siri, ask Alexa to order some more paper towels,” is winning the war without firing a shot.

As things progress, one can imagine usage of Siri beginning to climb, helping Apple with that voice-to-text problem by providing useful data for Siri to learn from. Apple Business Chat is another fascinating new piece of the puzzle, whereby we could see a convergence into a multimodal experience that not only mixes text, voice and rich GUI, but human and bot interactions as well. And as discovery of new Siri apps gets more robust, the ability for users to interact with Siri apps without the core iOS app needing to be downloaded may come to the fore as Apple dabbles around things like app thinning, bitcode and (allegedly) declarative UI frameworks. Things start getting very interesting for iOS as SiriOS becomes a powerful abstraction leak that gets to the core of how we use computers.

Siri needs to become the preferred voice UI platform for consumers and developers, and a point of aggregation of the user experience which Apple can control entirely. In doing so, they stand poised to be the ones to spike the assistance football. Yet again, Apple goes last.

The Conversational Economy

Markets are conversations. Trade routes pave the storylines. Across the millennia in between, the human voice is the music we have always listened for, and still best understand.

— The Cluetrain Manifesto, 1999

Long ago it was obvious that markets were conversational. You’d visit the bazaar, browse the wares and meet the merchants. You might have a relationship with the shopkeeper who refilled your weekly staples or the cobbler that fixed your shoes. In the early industrial economy, you might have dealt with traveling salesmen for a number of different products. You talked, you bought, and they remembered you.

In the past, each sale, each “conversion” was highly dependent upon how the conversation went. Yes, was the product itself good, but also, was the salesperson knowledgeable? Did they help me find what I needed? Did I trust them, did they hear me? Do they remember what kinds of things I like and dislike? Do they serve my particular needs, even as those needs evolve over time?

Similarly, customer retention was a function of how that conversation evolved over time. Individual conversations with the salesperson, shopkeeper or craftsman constituted an ongoing “conversation” out of which your purchases organically emerged. Purchases were bookends to parts of an ongoing conversation, and as that conversation was sustained, in good faith and with trust on both sides, so too was your loyalty to that merchant.

Over the past half century of mass production and mass marketing, these conversations have been distorted and fragmented. In a world where physical distribution — of both products and of media — required massive scale, the business models that naturally arose to govern the exchange of stuff were often impersonal, uniform and alienating.

Mass marketers became experts in creating one-size-fits-all messages delivered by a handful of media gatekeepers to promote one-size-fits-all products carried by a handful of mega retailers. When marketing spoke, customers listened. These media and commerce channels enjoyed a tight symbiosis which primarily served the purpose of one-way communications from businesses to customers.

In the mass marketing era, the customer conversation didn’t go away, but it became diluted across every TV & print ad, every coupon, every unsatisfactory purchase, every support call where they sat on hold, and every email complaint gone unanswered. Even as advanced targeting capabilities became available with the rise of Google and Facebook, companies spoke to their customers as befit the media they were doing it with: as audiences. In many ways, online advertising has simply amplified the existing distortions in the customer conversation presented by the mass marketing era; the relationship between the business and the customer became even more lopsided.

Even when communications channels were made available to customers like mail-in feedback, customer service lines, and email support, customers rarely feel heard and frequently feel like they’re being given the runaround. How fun is it to navigate a phone tree when you urgently need to talk to a human? For millennials and gen-Z, merely being forced to talk to a salesperson or a support rep on the phone — a communication medium not even reserved for one’s immediate family — is a few steps short of torture.

Fortunately for consumers, the cracks that had begun to appear in this system with the rise of Amazon have become a slow-motion collapse of the mass-marketing status quo over the last few years. Don Peppers and Martha Rogers, authors of the seminal 1993 book The One to One Future, were prescient to notice how the internet was accelerating trends towards a more personalized, individualized approach to marketing and sales. Instead of looking at markets simply in terms of psychographic segments and market share, they proposed companies think about their business as a collection of relationships with individual customers, one by one, and over the long run. Their warning to companies in 1993 rings even truer today:

Don’t be confused, however, by the fact that technology, to date, has not made it easy for your customers to communicate their ideas, feelings and suggestions to you. Don’t let a momentary accident of technological history convince you that your customers don’t have individual feelings and suggestions they would like to communicate to you, if it were as easy for them as it is for you.

Because, lo and behold, the end of that “momentary accident of technological history” is upon us.

Rising expectations by customers around the holistic customer experience are well established across industries. Media and entertainment, ever the canary in the internet coal mine with a product reducible down to pure ones and zeroes, showed us that people want what they want when they want it, not just what they’re given. Amazon gave it to us with low prices, two-day shipping, easy returns, proactive customer service and personalized recommendations. Through their tech-enabled business models and customer-centric practices, the companies of tomorrow are already displacing the giants of yesteryear.

Technology and new business models have coincided to deliver better customer experiences and in doing so have raised the bar for every other industry. People are frustrated when their banking app is slower and more cumbersome than Uber’s. Why should they care about how difficult regulatory and legacy code issues are to overcome, or about internal bureaucracy? Between 2014 and 2016 alone, the percentage of customers who reported they had stopped doing business with a company after a bad experience jumped from 76% to 82% (KPCB, Ovum). Customers judge companies by the ever-rising gold standards of customer experience, and large swaths of the Fortune 1000 have already begun to wake up to this reality.

The ubiquity of social media and the increasing role of word-of-mouth referrals in the purchase process both amplify the customer experiences people have across their networks and drive customers’ desire for authentic communications with companies. People are connecting with one another more frequently and transparently. And with always-on smartphones, our connectivity is real-time by default. No longer can businesses hide in their corporate ivory towers, blanketing the airwaves with their carefully crafted, one-way messages. In the same way that customer expectations are shaped by their experiences with other companies, so too are they shaped by the new ways they interact with the world and with their friends.

So what are most businesses to do? How can companies — old and new — keep up with the ever-rising tide of customer expectations? We at Layer believe the answer lies in another mega-trend precipitated by the mobile revolution.

As the smartphone install base matures, a powerful pattern has emerged in the way people use their devices: messaging is consistently the #1 thing people use their phones for. It only makes sense that a device whose ancestor was exclusively used for communication, and which was dubbed by Steve Jobs in the iPhone keynote as an “internet communicator,” would manifest the fundamental human need to connect and communicate.

Modern messaging apps, by their nature, are used not simply as a means of sending messages but as a means of maintaining a conversation. That means a nearly constant loop of notifications, checking one’s phone, and responding, all the while maintaining a relevance that no other type of app notification can match. These notifications, when implemented correctly, are constantly being opted into by the user. So-called “over-the-top” (OTT) messaging is able to go far beyond mere text, and can incorporate voice, video, and a whole host of entirely programmable interactive message elements.

What Operator pioneered with its concierge shopping experience over rich messaging, others are taking to the next level. Laurel & Wolf and Havenly connect customers to interior designers to help you transform your home (and sell you furniture and accessories). Trunk Club connects you over rich messaging to a stylist with whom you collaboratively craft a custom outfit to fit your style and taste. Accolade Health cuts through the red tape of the healthcare bureaucracy by matching employees with their own personal health advisor.

The Layer-powered Trunk Club experience for stylists (left) and customers (right)

These companies are cultivating a differentiated, defensible customer relationship by anchoring their customer conversation in today’s communication medium of choice: messaging. Whether customers are talking to human reps, automated text or voice interfaces, or some combination, the UX metaphor of messaging is the anchor that companies are settling on. Beyond just chat, the companies that define the conversational economy are combining rich, interactive messages, synchronous voice and video, and powerful agent-side customer service dashboards to help their employees be more effective and efficient. The upstarts are not alone: established giants like Staples and Bank of America have gotten the message (😉). The race is on.

Rich messaging is now Amazon’s default customer service option on mobile

Companies that foster one-to-one, direct and personalized customer relationships will stand a chance in a game defined by platform giants like Amazon and agile incumbents like Walmart. Those that do not will go the way of the media companies that the internet has already hollowed out or destroyed outright. Using a rich, branded messaging experience as the backbone of the customer conversation is going to be table stakes.

But the Conversational Economy is about so much more than surviving digital disruption. It is about one-to-one technologies allowing us to return to personal and personalized commerce. It means we’ll get “mass customization at scale,” as Don Peppers put it, where customers are treated as human beings and companies are no longer guessing as to how to serve their needs. And as Trunk Club and others have demonstrated with their custom clothing services, the personalized future is about more than just raw data. It is about a conversation.

The revolution is here, and it has a voice.


What do we mean when we say bots solve the discovery problem?

Much of the early excitement around bots has been predicated not just upon the idea that developers and businesses have a problem reaching new customers via apps, but also that the end user has the problem of finding new apps to download. So enter the dream of bots: a frictionless way to message a service inside an app you’re already using where you can easily install software without really having to do much. Installing new apps becomes trivial and the status-quo-preserving gatekeepers of the App Store / Play Store are bypassed. Discovery problem solved. Bango.

There’s a pretty obvious problem here: outside of the tech early adopter community (and honestly, even within it), users don’t have the “problem” of not being able to find new apps and easily install them. They might have the problem of finding accurate weather info in a new city and search the app store, but that presupposes a high level of intent and awareness. It requires that the user has some idea that there’s a solution to this problem in an app store.

Launching a new social app, marketplace or even niche productivity tool is hard not simply because the person has to download an app, install it and enter their iTunes password. It’s hard because you need to get people to care about problems for which they don’t know there is a solution, or even problems that they may not know they have. You need to find a way to get to them, whether it’s through word of mouth, paid search or social. The app store is simply not enough because categories that the user knows to search for on the app store have already been well served by existing players.

We’ve all heard it, and sometimes it can sting for people in the startup world to be reminded of this, but users simply don’t need another app. The “problem” here for users is basically nonexistent, and the problem for developers is less app discovery than it is awareness and creating a relationship with the customer that doesn’t start with the Top 150 list on the App Store.

So bots are screwed, right? Most of the bots out there are pretty dumb and unpolished, and their stumbles have been roundly ridiculed in screenshots on social media. All this hype from techies, VCs and Facebook is predictably misplaced shiny object chasing, and it’s another example of Silicon Valley trying to solve a problem that doesn’t exist with cool technology. The doubters may very well turn out to be right, and I do think that bots are decently overhyped (which is not a very original view, and I don’t pat myself on the back for holding it 🐸☕️). But there are the seeds of new, novel and powerful use cases in the ashes of this narrative, and I actually think the biggest impact of chatbots in consumer will stem from bringing services and intelligent agents into users’ conversations with their friends. This looks to be missing from Facebook’s current Messenger Platform iteration, but I have no doubt that we will someday soon gain the ability to add a bot to a Messenger chat of any size and interact with it together.

If the bot store migrated into the share sheet, users could explicitly share services for their friends to interact with on a demo basis and potentially add to their own Facebook contact book. Most successful new apps (and especially ones that need to build a graph and achieve network-effect critical mass) are very much spread at the beginning by a core evangelist in each group of friends who loves the product (or the idea of the product), and who persuades their friends to join them in using it. The potential to remove those barriers for that kind of user, by arming them with a proposition to their friends to try something without having to download or sign up for anything, shouldn’t be overlooked.

Services delivered via bots could actually indirectly solve for “discovery” in this sense. Supercharging word of mouth and helping power users convert their friends might prove to be very powerful advantages for services delivered via bots. But an even more exciting implication of mixed chats between friends and bots is the potential for truly multiplayer experiences inside of Messenger et al. You could imagine a digital bartender bot that takes a group’s order for the night and fulfills it with a delivery service. Or maybe a group gift shopping tool that lets users view a carousel of potential outfits to discuss, vote on and ultimately purchase. Perhaps none of these “going out” apps have taken off because they actually just belonged inside of a rich conversation between friends and a robot ticket agent.

Much of our economic lives is social, and our social lives have a huge economic component to them. We constantly make group decisions on what to do, where to go, what to buy. Don’t be surprised if conversational commerce turns out to be as much about commerce woven into our existing conversations as it is about texting a computer.

Originally published on Medium