Voice is what makes artificial intelligence come to life, says writer James Vlahos. It’s an “imagination-stirring” aspect of technology, one that has been part of stories and science fiction for a long time. And now, Vlahos argues, it’s poised to change everything.
Vlahos is the author of Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think (Houghton Mifflin Harcourt). It’s already the case that home assistants can talk and show personality — and as this technology develops, it’ll bring a host of questions that we haven’t reckoned with before.
The Verge spoke to Vlahos about the science of voice computing, which people will benefit most, and what this means for the power of Big Tech.
This interview has been lightly edited for clarity.
What exactly is happening when you talk to a gadget like Alexa and it talks back?
If you’re just used to talking to Siri or Alexa and you say something and hear something back, it feels like one process is taking place. But you should really think about it as multiple things, each of which is complex to pull off.
First, the sound waves of your voice have to be converted into words, so that’s automatic speech recognition, or ASR. Those words then have to be interpreted by the computer to figure out the meaning, and that’s NLU, or natural language understanding. If the meaning has been understood in some way, then the computer has to figure out something to say back, so that’s NLG, or natural language generation. Once this response has been formulated, there’s speech synthesis, so that’s taking words inside a computer and converting them back into sound.
Each of these things is very difficult. It’s not as simple as the computer looking up a word in a dictionary and figuring things out. The computer has to get some things about how the world and people work to be able to respond.
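The four stages described above can be sketched as a chain of functions. This is a toy illustration, not how Alexa or Siri is actually built: every function name here is hypothetical, and each stage is a trivial stand-in (the "audio" is just text encoded as bytes, the understanding step is keyword matching, and the replies are canned templates). It only shows how the output of one stage becomes the input of the next.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """What the system believes the user wants (hypothetical structure)."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    """ASR stand-in: real systems decode sound waves; here the bytes are already text."""
    return audio.decode("utf-8").strip().lower()


def understand(text: str) -> Intent:
    """NLU stand-in: naive keyword matching instead of a trained model."""
    if "weather" in text:
        return Intent("get_weather")
    if "time" in text:
        return Intent("get_time")
    return Intent("unknown", {"text": text})


def generate_reply(intent: Intent) -> str:
    """NLG stand-in: fixed templates instead of a neural generator."""
    templates = {
        "get_weather": "I cannot check the weather in this sketch.",
        "get_time": "I cannot read a clock in this sketch.",
    }
    return templates.get(intent.name, "Sorry, I did not understand that.")


def synthesize(reply: str) -> bytes:
    """Speech synthesis stand-in: real systems produce audio; here we re-encode the text."""
    return reply.encode("utf-8")


def handle_utterance(audio: bytes) -> bytes:
    """Run the full chain: ASR, then NLU, then NLG, then speech synthesis."""
    text = recognize_speech(audio)
    intent = understand(text)
    reply = generate_reply(intent)
    return synthesize(reply)


if __name__ == "__main__":
    print(handle_utterance(b"What's the weather like?").decode("utf-8"))
```

Keeping the stages separate mirrors Vlahos's point that what feels like one process is really several, each of which can fail or improve independently.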
Are there any really exciting advances in this area that piqued your curiosity?
There’s a lot of really interesting work being done in natural language generation where neural networks are crafting original things for the computer to say. They’re not just grabbing prescripted words, they’re doing so after being trained on huge volumes of human speech — movie subtitles and Reddit threads and such. They’re learning the style of how people communicate and the types of things person B might say after person A. So, the computer being creative to a degree, that got my attention.
What’s the ultimate goal of this? What will it look like when voice computing is ubiquitous?
The big opportunity is for the computers and phones that we’re using now to really fade in their primacy and importance in our technological lives, and for computers to sort of disappear. You have a need for information and want to get something done, you just speak and computers do your bidding.
That’s a huge shift. We’ve always been toolmakers and tool users. There are always things we hold or grab or touch or swipe. So when you imagine that all just fading away and your computing power is effectively invisible because we’re speaking to tiny embedded microphones in the environment that are connected to the cloud — that’s a profound shift.
A second big one is that we are starting to have relationships with computers. People like their phones, but they don’t treat them as people, per se. We’re in the era where we start to treat computers as beings. They exhibit emotions to a degree and they have personalities. They have dislikes, we look to them for companionship. These are new types of things you don’t expect from your toaster oven or microwave or smartphone.
Who might benefit the most from the rise of voice assistants? The elderly is one group that we often hear about — especially because they can have poor eyesight and find it easier to talk. Who else?
The elderly and kids are really the guinea pigs for voice computing and personified AI. Elderly people have the issue often of being alone a lot, so they are the ones that might be more likely to turn to chitchat with Alexa. There are also applications out there where voice AI is used almost as a babysitter, giving medication reminders or letting family members do remote check-ins.
Though, and not to way overgeneralize, some older people have dementia and it’s a little bit harder to recognize that the computer is not actually alive. Similarly, for kids, their grasp of reality is not so firm so they are arguably more willing to engage with these personified AIs as if they were really alive in some way. You also see the voice AIs being used as virtual babysitters, like, I’m not at home but the AI can watch out. That’s not totally happening yet, but it seems to be close to happening in some ways.
What will happen when we get virtual babysitters and such and all the technology fades into the background?
The dark scenario is that we seek out human companionship less because we can turn to our digital friends instead. There’s already data pouring into Amazon that people are turning to Alexa for company and chat and small talk.
But you can spin that in a positive way and I sometimes do. It’s a good thing that we’re making machines more human-like. Like it or not, we spend a lot of time in front of our computers. If that interaction becomes more natural and less about pointing and clicking and swiping, then we’re moving in the direction of being more authentic and human, versus us having to make ourselves like quasi-machines as we interact with devices.
And I think we’re going to hand more centralized authority to Big Tech. Especially when it comes to something like internet search, we are less likely to browse around, find the information we want, synthesize it, open magazines, open books, whatever it is we do to get information versus just asking questions of our voice AI oracles. It’s really convenient to be able to do that, but also we give even greater trust and authority to a company like Google to tell us what is true.
How different is that scenario from the current worry about “fake news” and misinformation?
With voice assistants, it’s not practical or desirable for them to, when you ask them a question, give you the verbal equivalent of 10 blue links. So Google has to choose which answer to give you. Right there, they’re getting enormous gatekeeper power to select what information is presented, and history has shown that if you consolidate the control of information very highly in a single entity’s hands, that’s rarely good for democracy.
Right now, the conversation is very centered on fake news. With voice assistants, we’re going to skew in a different direction. Google’s going to have to really focus on not presenting [fake news]. If you’re only presenting one answer, it better not be junk. I think the conversation is going to turn more toward censorship. Why do they get to choose what is deemed to be fact?
How much should we worry about privacy and the types of analyses that can be done with voice?
I am equally worried about privacy implications as I am with just smartphones in general. If tech companies are abusing that access to my home, they can do it equally with my computer as they can do it with Alexa sitting across the room.
That’s not at all to play down privacy concerns. I think they’re very, very real. I think it’s unfair to single out voice devices as being worse. Though there is the sense that we’re using them in different settings, in the kitchen and living room.
Switching topics a little bit, your book spends some time discussing the personalities of various voice assistants. How important is it to companies that their products have personality?
Personality is important. That’s definitely key, otherwise why do voice at all? If you want pure efficiency, you might be better off with a phone or desktop. What hasn’t happened heavily yet is differentiation around the edges between Cortana, Alexa, Siri. We’re not seeing tech companies design vastly different personalities with an idea toward capturing different slices of the market. They’re not doing what cable television or Netflix do where you have all these different shows that are slicing and dicing the consumer landscape.
My prediction is that we will do that in the future. Right now, Google and Amazon and Apple just want to be liked by the greatest number of people so they’re going pretty broad, but [I think they will develop] the technology so my assistant is not the same as your assistant is not the same as your co-worker’s assistant. I think they’ll do that because it would be appealing. With every other product in our lives we don’t have a one-size-fits-all, so I don’t see why we would do that with voice assistants.
There’s some trickiness there, though, as we see in discussions around why assistants tend to have female voices. Is more of that in store?
We’re seeing questions already about issues relating to gender. There’s been very little conversation about the issue of race or perceived race of virtual assistants, but I have a sense that that conversation is coming. It’s funny. When you press the big tech companies on this issue, except for Amazon, which admits Alexa is female, everyone else is like, “It’s an AI, it doesn’t have a gender.” That’s not going to stop people from perceiving clues about what sort of gender or race identity it’s going to have.
All this to say, Big Tech is going to have to be really careful to negotiate those waters. They might want to specialize a little more, but they might get into dangerous waters where they do something that sounds like cultural appropriation, or something that is just off, or stereotypical.