In 2018, fears of fake news will pale in comparison to new technology that can fake the human voice. This could create security nightmares. Worse still, it could strip away from each of us a part of our uniqueness. But companies, universities and governments are already working furiously to decode the human voice for many applications. These range from better integration of our internet-of-things devices to enabling more natural interactions between humans and machines. Technologically adept nation states (the US, China and Estonia) have waded into this space and tech giants such as Google, Amazon, Apple and Facebook also have special projects on voice.
It’s not that hard to develop an artificial voice, then model and reproduce spoken words and phrases. I remember being amazed when my original Apple Macintosh informed me of the date and time in a dry, digital tone. Making a natural-sounding voice involves algorithms that are far more complex and computationally expensive. But that technology is available now.
As any speech pathologist will attest, the human voice is far more than vocal-cord vibrations. These vibrations are caused by air escaping our lungs and forcing open our vocal folds, a process that produces tones as unique as a fingerprint because of the thousands of waveforms that are conjured simultaneously and in chorus. But a voice’s uniqueness is also tied to qualities we rarely consider, such as intonation, inflection and pacing. These aspects of our speech are situational and often subconscious, and they make all the difference to the listener. They tell us when a phrase such as “Wow, that outfit is something!” should be interpreted as mean-spirited, sarcastic, loving or indifferent. This challenge explains the early use of emoji in text messages. They were needed to clarify the intent of a written message because it is extremely difficult to interpret the true meaning of conversational speech that’s written instead of spoken.
Details such as intonation, inflection and pacing are particularly difficult to model, but we are getting there. Adobe’s Project Voco is developing what is essentially a Photoshop of soundwaves. It works by substituting waveforms for pixels to produce something that sounds natural. The company is betting that, if enough of a person’s speech can be recorded (or data mined), it will require little more than a cut-and-paste action to alter a recording of their voice. Adobe’s initial results from Voco are eerie, as well as awe-inspiring. The prowess of the prototype indicates how soon common citizens will be unable to distinguish between real voices and spoof ones. If you have enough samples stored in your data library, then you can make anyone appear to say almost anything.
Technology companies and investors are betting on the idea that these systems will eventually have tremendous commercial value. Even before that situation arises, though, this particular type of technology will present big risks. By 2018, a nefarious actor may easily be able to create a good enough vocal impersonation to trick, confuse, enrage or mobilise the public. Most citizens around the world will be simply unable to discern the difference between a fake Trump or Putin soundbite and the real thing.
When you consider the widespread distrust of the media, institutions and expert gatekeepers, audio fakery could be more than disruptive. It could start wars. Imagine the consequences of manufactured audio of a world leader making bellicose remarks, supported by doctored video. In 2018, will citizens – or military generals – be able to determine that it’s fake?
-William Welser IV