I spent the last 10 days of 2023 on “vacation” from work, researching how I could use AI not to replace creatives, but to turbo-charge my creativity. One of my investigations during that ten-day dive was whether I could perform all the voice-overs for a cartoon I wanted to make, then use voice morphing to “recast” it with AI voices while keeping my performances. These are the results.
Converting vs. Generating
First, it’s important to note that this is not generative AI. I am not giving a tool a text to speak. I am providing it a recording of a human giving a vocal performance and having it convert that to a new recording of that performance in another voice.
Also, note that I wouldn’t even try to use a famous voice for the cartoon, though some of the tests below show it would be possible. I’m happy to start a discussion in a future blog post on the implications of that, but this one is all about results… not what they might mean.
The Base Audio Sample(s)
The base sample was recorded using a Blue Yeti mic and Audacity, not well tuned for the room or distance, and was just me getting through a dashed-off script quickly. It uses five vocal styles, and I’ll warn you right now: none of them were rehearsed. This was a cold read on mic with plenty of warts. But the warts were useful, especially the ones that created distortion.
If you don’t like my scripting or voicing, please be like Elsa and let it go, because they’re supposed to be exaggerated and the sample isn’t the point. The point is how a morphing tool and its voice model handle the tonal differences. Do they remove accents, gruffness, and inflection? Do they enhance or distort them?
This wasn’t just a test of how good each conversion was. As I progress and create voice models for my characters, I need to understand how much of the acting has to live in the voice model and how much in the recording the model is applied to. Long term, the goal is to understand the nuances as a performer, director, and technologist so I can get the best quality when I finally learn enough about animation (likely in Blender) to start making the cartoon.
As I went along, some set-ups didn’t like the raw style, so I ran it through Adobe’s free podcast cleaner to remove echo and distortion. Some tools had a 30-second limit, so here are those versions for your reference:
Cleaned:
Cleaned & trimmed:
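Several of the tools below have a 30-second upload limit. For the curious, here is a minimal sketch of the trimming step using only Python’s standard library (the file names are placeholders; in practice you’d pick a cut point that finishes a sentence rather than cutting at exactly 30 seconds, as I did with the sled line):

```python
import wave

def trim_wav(src_path, dst_path, max_seconds=30):
    """Copy a WAV file, keeping at most max_seconds of audio."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never keep more frames than the source actually has.
        frames_to_keep = min(src.getnframes(),
                             int(src.getframerate() * max_seconds))
        data = src.readframes(frames_to_keep)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # same channels, sample width, and rate
        dst.writeframes(data)  # header frame count is fixed up on close
```

Tools like Audacity or ffmpeg do the same job interactively, of course; this is just the operation reduced to its essentials.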
Altered (8/10)
Altered Studio lets you sign up for free and get 5 minutes of voice morphing, which was more than enough to test my 80-second clip with three voices. It’s fairly simple. I uploaded my clip, picked a voice, and went with the default settings.
I started with the default “Austin” voice. At the beginning, it sounded good. It even got some of the gruffness in the tractor pull ad, but some of the words seemed not to come out right. Then, going into the bad foreign accents, it sounded like a loud whisper. It caught some of the accent, but not enough.
When I tried the “Danai” voice to see if it could morph it into a woman, it didn’t charge me any seconds from my quota, so it seems the quota is for converting the sample to a baseline it can apply the new voices to, not each morph of the same sample. That’s cool.
Danai had some of the same issues. The phrase “tell the cows from the queens” was muddled, and the foreign accents got progressively more whispery. One of the good things is that there are a lot of options to fine-tune the morph.
I took Danai, switched from Timbre-2.5 to Style-changer-2 and chose an aggressive style. This ended up making things a bit more monotone, but louder. The accents weren’t as pronounced, but they were at a conversational volume and didn’t sound whispered.
I was afraid that if I chose this route, I’d need the $120-a-month pro membership to get enough time to really familiarize myself with the settings. But it seems the real cost is in turning the audio samples into tokens they can morph into another voice, so with a few minutes of properly set-up samples, I could explore the different options and potentially find a good fit.
They also have a “rapid-clone” service (which I haven’t tried yet) so I can record some dialogue in a more distinctive character voice (like the tractor pull commercial), edit it a bit, then model it and maybe be able to make my throat less sore when I need that kind of voice.
While the conversions were not as good as I’d like, Altered got the 8/10 because there’s a lot more to explore in nailing down the right settings, and charging quota only to ingest the sample, rather than for every morph, is a plus.
Austin Timbre 2.5:
Danai Timbre 2.5:
ElevenLabs* (9/10)
ElevenLabs has five subscription levels. The free version is great for exploring, and the initial test of the recording used up 13.35% of my free monthly quota.
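A quick back-of-the-envelope from those numbers (assuming the quota scales linearly with clip length, which the pricing page doesn’t confirm):

```python
# An 80-second clip used 13.35% of the free monthly quota, so the
# free tier covers roughly this much conversion per month:
clip_seconds = 80
quota_fraction_used = 0.1335

monthly_seconds = clip_seconds / quota_fraction_used
monthly_minutes = monthly_seconds / 60
print(f"{monthly_minutes:.1f} minutes")  # roughly 10 minutes/month
```

That’s plenty for experimenting, though a real project would need a paid tier.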
I tried a morph with their Drew voice and it came out pretty well. It muddled some of the accent stuff at the end, but it stayed consistent with the new voice and just made the accents less pronounced.
Given that I had 6 more tries left in my freebie quota, I adjusted Drew’s voice by turning “stability” down to 30 and “style exaggeration” up to 30. It improved the accents a bit, but the result was fairly similar; you can compare Drew Default and Drew Adjusted below.
Then I tried the Freya voice with the default settings and again got Drew-level clarity, maybe a bit better with the accents. If I need an accent consistently, I can presumably use a model that has that accent. But there may be times when I want an established character who has one accent to try to pull off a different accent.
Overall, though, the clarity was very good, and while it costs quota for every try, it could be mastered using a lower-priced plan. They offer both a voice-design lab that lets you randomly generate and then tweak a voice, and “instant” voice cloning if you subscribe to a paid plan.
Drew Default:
Drew Adjusted:
Freya Default:
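For anyone who wants to script these conversions instead of using the web UI, ElevenLabs exposes the same knobs through its public speech-to-speech API, expressed as 0–1 floats rather than the UI’s 0–100 sliders. This sketch just assembles the request pieces without sending anything; the endpoint path, model ID, and field names are my assumptions from the API docs, and the voice ID is a placeholder:

```python
import json

API_BASE = "https://api.elevenlabs.io/v1"  # assumed public API base URL

def build_sts_request(voice_id, audio_path, stability_pct, style_pct):
    """Assemble (but don't send) the pieces of a speech-to-speech
    request, mapping the web UI's 0-100 sliders to 0-1 floats."""
    voice_settings = {
        "stability": stability_pct / 100,  # UI "stability" slider
        "style": style_pct / 100,          # UI "style exaggeration" slider
    }
    return {
        "url": f"{API_BASE}/speech-to-speech/{voice_id}",
        "data": {
            # Model name is an assumption; check the current API docs.
            "model_id": "eleven_english_sts_v2",
            "voice_settings": json.dumps(voice_settings),
        },
        # The source recording is uploaded as a multipart file field.
        "files": {"audio": audio_path},
    }

# The "Drew Adjusted" settings from this test: stability 30, style 30.
request = build_sts_request("DREW_VOICE_ID", "sample.wav", 30, 30)
```

The payload would then be POSTed with an `xi-api-key` header; I stuck to the web UI for these tests.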
FineVoice (6/10)
One of a number of products from FineShare, this was pretty disappointing. I signed up for the free account, picked a model (Michael), uploaded my sample, and… it was raspy and sounded sort of autotuned. I really didn’t like it.
But I figured I might as well try another voice model. They have a wide library of both “pro” and “community” models; Michael, from my first try, was a pro model. On the second try, I used the community “Darth Vader” model. It wasn’t “bad.” The announcer segment sounded somewhat like Lord Vader, but it was still raspy and autotuned. I might use FineVoice to make a voice for a computer character, but not a human one.
Wondering if any ambient echo or noise in my sample was causing some of the distortion, I ran it through Adobe’s podcast enhancer. That did create a noticeable improvement, but not enough to make this feel natural.
Michael model:
Michael model using Adobe podcast cleanup:
Darth Vader model:
Kits.ai (7.5/10)
Like FineVoice, Kits.ai felt like it did more pitch and timbre shifting than actual voice conversion. That preserved a lot of the accent work, but it was sensitive to echo and noise in the original recording.
I tried their “Male Pop Rock” voice twice, once with the original recording and once with the Adobe-enhanced one. The first definitely had distortion issues; the second still wasn’t perfectly natural, but the cleaner source file raised the overall quality a lot. I might come back in 6 months and see how they’ve improved.
Kits.ai pop rock base:
Kits.ai pop rock with cleaned file:
Replay (6/10)
Replay is designed mostly to recast songs in another voice. It’s a desktop app that does the transforms locally with a wide catalog of celebrity and character voices modeled by different members of its community. You download a voice, add in the song or audio clip you want to transform, and then it will separate out the music (if any), convert the voice, then blend them back together. For fun, I had it make Frank Sinatra sing a Justin Timberlake song, but I won’t post that because I’m not a sadist.
As it’s tuned for singing, my early tests showed that spoken word could come out a little sing-song. The button to do the conversion even says “Create Song,” and the downloaded file is named “[name of model] sings [name of audio file].mp3” by default.
For this particular test, I chose their model for Joe Rogan because he has a distinctive voice, then Mark Hamill because he’s a voice-acting superstar and his model was 3-4x the size of the others. I also tried their Matthew McConaughey and Morgan Freeman models, and even came back and tried Morgan with the cleaned-up audio file and a sing-songy one. Neither Morgan nor Matthew was at all good, so it seems the quality of the model is a major factor.
Mark Hamill was identifiable, but not perfect. It reproduced the different tones, inflections, and even accents with reasonable accuracy.
Joe Rogan… if you know it’s supposed to be Joe Rogan, you can hear him, but it’s still not great. At times it sounds more like Bert from Sesame Street. It picks up most of the intonations and accents with reasonable accuracy.
The downside was that when I tried to get that rough Announcer/Wrestler voice going, all the rasp and rumble was removed. It seems that when I want that kind of voice, I’ll have to create a model for it with the rasp and rumble in the model.
You can create your own models for Replay, but at the time of this writing it looks a bit fiddly to create an RVC2 model, and it’s not obvious what the perfect script for training a really good model would be.
Respeecher (4.5/10)
Right off the bat, Respeecher was hampered by a 30-second limit on free-trial conversions. I clipped the enhanced sample down to 30 seconds (finishing the sentence about the sled).
After uploading the 30-second clip, it sat spinning a loading circle for quite a long time on “selecting voice models.” The web UI seemed to lock up regularly, and the only way to get to the next step was to wait a while and refresh, or open the URL in a new tab. This was on the latest Chrome on Mac.
Beyond the 30-second limit, there was no option to download the samples. And the next level up from free was $199 a month for 2 hours of conversion… about $1.65 a minute. The quality of the 30-second sample didn’t feel significantly better than ElevenLabs or Altered, which cost significantly less; it wasn’t significantly worse either.
Since I had to cut my test sample, couldn’t download the results to share with you, and the cheapest upgrade was $199 a month, I averaged an 8 for conversion quality with a 1 for the broken UI and high prices to get the 4.5 above.
UberDuck (6/10)
This is another one focused on converting singing rather than spoken word. The free trial only offered 4 voices to choose from. On the base recording, the Grimes voice sounded a bit like Phyllis Diller with laryngitis, with some autotune artifacting.
I uploaded the cleaned-up version and tried it with the Quackmaster Pika voice, dropping it down a bit in key. That came out a LOT better.
It’s only $8 a month ($96 a year) for a paid account, but the documentation is really sparse and focuses more on the API than the web UI. They have “private voices,” but it’s not immediately clear how to make one and voice cloning is only a feature in the enterprise plan. Overall, it feels like a promising MVP, but it has a way to go in functionality, quality, and documentation to be a serious contender.
Grimes test:
Quackmaster Pika test (with cleaned sample):
Voice Swap (6/10)
Another singing-focused service, Voice Swap offers “free” session-singer voices and named-artist voices. The licensing for the named-artist voices seems to require a contract for any use beyond social media, and may be more restrictive than that.
I tried the “Chaucer” voice with the 30-second cleaned-up clip due to the free-trial limits. The reproduction wasn’t bad; I could see using that voice in a workflow. But this is FAR from a Swiss Army knife of voice morphing. I didn’t see any options for using or making your own models, and they only have 15 models, only 3 of which don’t require a special contract for commercial use.
Like UberDuck, it seems like a promising MVP, but still has a way to go.
Voice Swap Chaucer on cleaned 30-second clip:
Conclusion
ElevenLabs (9/10) won this shootout by delivering not only some of the highest-quality conversions, but also features and pricing that give me confidence it can meet my needs, that I can master it with a reasonable learning curve, and that I can keep my costs in line.
My second-place choice actually has a deceptive score: Replay (6/10). While it isn’t delivering conversions as good, hence the score, some of that seems attributable to the quality of the models. If I could learn to create my own high-quality models, Replay gives me two benefits. First, it’s free for now, which helps me bootstrap more efficiently. Second, it runs locally, which is VERY useful: if the creator of the app stops updating it and pulls down the site, I still have the tool and my models in perpetuity. It’s definitely one I’ll watch.
So, while I’ll probably start out on ElevenLabs, I’ll be looking for open source tools to model and morph voices that I can run locally on a Mac. If the projects go quiet but are at a usable point, I don’t need to worry about support… mostly. Plus I can contribute to documentation or maybe even some code if I find myself in uncharted territory.
If you have any suggestions of open source voice morphers, please leave them in the comments.
And if you’d like to see one of my newest creative projects, check out Todd The God, a comic strip about an ancient god of hammers trapped inside a rock.
* Since it looks like I’ll be using ElevenLabs, the link to it is an affiliate link that will generate some free credits for me if you use it and become a paid user. I did not research affiliate options until after this blog post was written, and affiliate options did not influence the conclusions.