THE FACE OF YOUR VOICE
PRIVACY and SURVEILLANCE. An exploration at the intersection of art, science and technology. Facial reconstruction through voice AI.
Every person is defined by a unique voice. That voice is mostly used to exchange information with other people. Beyond the stories it tells, it says a lot about identity. With contemporary technology, a short voice recording is enough to recognize that identity. What started as an abstract idea (remember the speech technology company Lernout & Hauspie from the 1990s) is now possible on a small scale. A voice can grant people access to personal accounts, and anyone can put a question to Apple's virtual assistant Siri. Another application is the reconstruction of a unique face through artificial intelligence. The way we use our voices keeps evolving, however, and the possibilities of technology are exploding. When the two come together, the opportunities, and also the risks, are endless.
Raw data is turned into information through technology, and is then transformed by intelligence. That intelligence is a neural network, trained to recognize correlations between speech and faces, which then produces images that depict physical attributes of the speaker, for example age, gender or ethnicity. Until now, that reconstruction has happened in a canonical form: the face seen from the front with a neutral expression.
The reconstructed face can be printed as a physical object from a 3D model. If speakers can interact with their own sculpture, surprising situations will follow. Interacting with an AI system is different from interacting with a real person: artificial intelligence doesn't know the difference between good and bad. Sculptures and people can have conversations, but is it possible for the prints to think rationally? The logical question to follow is: can a computer be human? An important difference here is that humans are responsible for their own thoughts and actions.
A voice is an identity that cannot simply be exchanged. Not all questions about personal identity can be solved, but it is important to remember that there will always be answers. Am I the same person at five years old as I am at forty? Does a robot still exist when it is turned off? The use of voices in technology is evolving so quickly that new techniques can manipulate identities.
What if we can use natural language and artificial intelligence to make machines that communicate as humans do? What if we match a voice to a 3D model of a face, or even of a whole person? What if technology becomes so intuitive that it disappears into the background? We could simply ask for what we want, whenever we want. When is that good, and when is it bad? Can we let new technological developments go hand in hand with the ethical considerations they raise? How can we make room in our lives for the paradoxes that come with innovation? Do we just wait, are more revolutions required, or do we already live amid revolutionary developments in what voice technology can do? To be continued…
In the meantime, we can enjoy discovering what our voice reveals about us. Maybe we don't look like the physical features that artificial intelligence recognizes in the sound of our speech. Or maybe we do.
This collaborative project is an artistic continuation of the paper "Speech2Face: Learning the Face Behind a Voice". From its abstract: How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
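
The self-supervised idea in that abstract can be illustrated with a toy sketch. It is not the Speech2Face model: a small linear map stands in for the deep neural network, and short lists of numbers stand in for the voice recording and the face. All names here (make_toy_pairs, train_voice_to_face, predict) are hypothetical and chosen only for illustration. What the sketch does show is the training signal: the target for each voice is simply the face that co-occurs with it, so no age, gender or ethnicity labels are ever supplied.

```python
import random


def predict(weights, voice):
    """Map a voice feature vector to a face feature vector (linear stand-in)."""
    return [sum(w * v for w, v in zip(row, voice)) for row in weights]


def make_toy_pairs(n, seed=0):
    """Create n co-occurring (voice, face) pairs from a hidden, made-up relation.

    In the real setting the pairs would come from videos of people speaking;
    here an invented linear relation plays that role.
    """
    rng = random.Random(seed)
    hidden_map = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]  # assumption, toy values
    pairs = []
    for _ in range(n):
        voice = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        face = predict(hidden_map, voice)
        pairs.append((voice, face))
    return pairs, hidden_map


def train_voice_to_face(pairs, dim_voice, dim_face, lr=0.1, epochs=200):
    """Fit weights by stochastic gradient descent on squared error.

    The only supervision is the face that appears together with each voice.
    """
    weights = [[0.0] * dim_voice for _ in range(dim_face)]
    for _ in range(epochs):
        for voice, face in pairs:
            guess = predict(weights, voice)
            for i in range(dim_face):
                error = guess[i] - face[i]
                for j in range(dim_voice):
                    weights[i][j] -= lr * error * voice[j]
    return weights


if __name__ == "__main__":
    pairs, hidden_map = make_toy_pairs(50)
    weights = train_voice_to_face(pairs, dim_voice=3, dim_face=2)
    new_voice = [0.2, -0.5, 0.7]  # a voice not seen during training
    print("predicted face features:", predict(weights, new_voice))
    print("actual face features:   ", predict(hidden_map, new_voice))
```

Run as a script, it prints the face features predicted for an unseen voice next to the ones the hidden relation would give, which is a miniature version of comparing a reconstruction with the true face of the speaker.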