
If you've ever had the urge to converse with an AI version of yourself, now you can, sort of.
On Thursday, AI startup Hume announced the launch of a new "hyperrealistic voice cloning" feature for the latest iteration of its Empathic Voice Interface (EVI) model, EVI 3, which was unveiled last month. The idea is that by uploading a short audio recording of yourself speaking (ideally between 30 and 90 seconds), the model should be able to quickly churn out an AI-generated replica of your voice, which you can then interact with verbally, just as you would with another person standing in front of you.
Also: Text-to-speech with feeling – this new AI model does everything but shed a tear
I uploaded a recording of my voice to EVI 3 and spent some time idly chatting with the model's imitation of my voice. I was hoping (perhaps naively) for an uncanny valley experience, that exceedingly rare feeling of interacting with something that seems almost entirely real yet is just off-kilter enough to leave you slightly uneasy, and was disappointed when the EVI 3 me turned out to be more like an audio cartoon version of myself.
Let me unpack that a bit.
Using EVI 3's voice cloning feature
The imitation of my voice was, in some ways, undeniably realistic. It seemed to pause intermittently while speaking in roughly the same way that I tend to, with a touch of familiar vocal fry. But the mirroring stopped there.
Hume claims in its blog post that EVI 3's new voice cloning feature can capture "aspects of the speaker's personality." That's a vague promise (probably intentionally so), but in my own trials, the model seemed to fall short in this regard. Far from feeling like a convincing simulation of my own behavioral quirks and sense of humor, the model spoke with a chipper, eager-to-please tone that would have been well suited to a radio ad for antidepressants. I like to think of myself as friendly and generally upbeat, but the AI was clearly exaggerating those particular personality traits.
Also: Fighting AI with AI, finance firms prevented $5 million in fraud – but at what cost?
Despite its generally puppy-like demeanor, the model was surprisingly staunch in its refusal to try speaking in an accent, which seemed to me like exactly the kind of playful vocal exercise it would excel at. When I asked it to give an Australian accent a whirl, it said "g'day" and "mate" once or twice in my normal voice, then immediately shied away from anything more daring. And no matter what I prompted it to talk about, it tended to find some creative and roundabout way to circle back to the topic I was discussing when I recorded my voice sample for it to use, reminiscent of an experiment from Anthropic last year in which Claude was tweaked to become obsessed with the Golden Gate Bridge.
In my second trial, for example, I had recorded myself talking about Led Zeppelin, which I'd been listening to earlier that morning. When I then asked EVI 3's voice clone of myself to explain its thoughts on the nature of dark matter, it quickly found a way to bring its response back to the subject of music, comparing the mysteriously invisible force pervading the cosmos to the intangible melody that imbues a song with meaning and power.
You can try EVI 3's new voice cloning feature for yourself here.
According to Hume's website, user data produced from interactions with the EVI API is collected and anonymized by default in order to train the company's models. You can turn this off, however, via the "Zero data retention" feature in your profile. For non-API products, including the demo linked above, the company says it "may" collect and use data to improve its models, but again, you can toggle this off if you create a personal profile.
Whispering robots
AI voices have been around for quite some time, but they have historically been rather limited in their realism; it's very obvious you're talking to a robot when you get responses from classic Siri or Alexa, for example. In contrast, a new wave of AI voice models, EVI 3 among them, has been engineered not only to speak in natural language but also, and more importantly, to mimic the subtle inflections, intonations, idiosyncrasies, and cadences that inflect real, everyday human speech.
"A big part of human communication is emphasizing the right words, pausing at the right times, using the right tone of voice," Hume CEO and chief scientist Alan Cowen told me.
As Hume wrote in a blog post on Thursday, EVI 3 "knows what words to emphasize, what makes people laugh, and how accents and other voice characteristics interact with vocabulary." According to the company, this marks a significant technical leap forward from earlier speech-generating models, "which lack a meaningful understanding of language."
Many AI experts would take umbrage at the use of terms like "understanding" in this context, since models like EVI 3 are trained merely to detect and recreate patterns gleaned from their voluminous swathes of training data, a process that arguably leaves no room for what we would recognize as true semantic comprehension.
Also: ChatGPT isn't just for chatting anymore – now it will do your work for you
EVI 3 was trained "on trillions of tokens of text and then millions of hours of speech," according to Hume's blog post. According to Cowen, this approach alone has enabled the model to speak in voices that are far more realistic than one would intuitively expect. "With voice [models], what's been most surprising is how human [they] can be just by training on lots of data," he said.
But philosophical arguments aside, the new wave of AI voice models is uncontroversially impressive. When prompted, they can explore a much vaster range of vocal expression than their predecessors. Companies like Hume and ElevenLabs claim that these new models will have practical benefits for industries like entertainment and marketing, but some experts fear they will open new doors for deception, as was illustrated just last week when an unknown individual used AI to mimic the voice of US Secretary of State Marco Rubio and then deployed the voice clone in an attempt to dupe government officials.
"I don't see any reason we would need a robot whispering," Emily M. Bender, a linguist and coauthor of The AI Con, recently told me. "Like, what's that for? Except maybe to disguise the fact that what you're listening to is synthetic?"
Revolutionary becomes routine
Yes, EVI 3's voice cloning feature, like all AI tools, has its shortcomings. But those are somewhat overshadowed by its remarkable qualities.
For one thing, we should remember that the generative AI models hitting the market today are part of the technology's infancy, and they will only continue to improve. In less than three years, we've gone from the public launch of ChatGPT to AI models that can simulate real human voices more or less convincingly, and to tools like Google's Veo 3, which can produce realistic video and synchronized audio. The breathtaking pace of generative AI advances should give us pause, to say the least.
Also: AI agents will change work and society in internet-sized ways, says AWS VP
Today, EVI 3 can simulate a rough approximation of your voice. It isn't unreasonable to expect, however, that its successor, or perhaps its grand-successor, will be able to capture your voice in a way that feels truly convincing. In such a world, one can imagine EVI or a similar voice-generating model being paired with an AI agent to, say, join Zoom meetings on your behalf. It could also, less optimistically, be a scam artist's dream come true.
Perhaps the most striking fact about my experience with EVI 3's voice cloning feature, though, is how mundane this technology already feels.
As the pace of technological innovation accelerates, so too does our capacity to instantly normalize what would have shocked earlier generations of humans into awestruck silence. OpenAI's Sam Altman made this very point in a recent blog post: according to Altman, we're approaching the Singularity, yet for the most part, it feels like business as usual.
Want more stories about AI? Sign up for Innovation, our weekly newsletter.