
The future speaks and it sounds like us


October 13, 2025


This article was written by Sam Martin, Technical Director, from our London studio.

Imagine joining a live event where every word is instantly transcribed, translated, and streamed, even in the speaker’s voice, but in another language. What once sounded like sci-fi is now transforming how we connect, tell stories, and engage across cultures.

At Imagination, we’re exploring how these advances in speech-to-text, translation, and speech synthesis are reshaping the way brands communicate and how creativity can thrive in an age where language is no longer a barrier.

The landscape of tooling that supports speech-to-text, translation and speech synthesis feels like magic these days. In some ways it always has, but in 2025 we truly have the modern-day Babel Fish. Services like OpenAI Whisper, AWS Translate, ElevenLabs Text to Speech, Azure AI Speech and countless other local or hosted models, services and platforms enable us not only to enrich experiences, but to change the way we communicate.

The deep neural networks that power modern-day tooling are a marked departure from early products, but before looking at what we have now, it’s worth remembering what we had to make do with. Most applications required meticulous manual configuration, and once they shipped, that was pretty much it: no meaningful upgrades, no learning, no getting better over time.

Microsoft’s Speech API, with its voice of nightmares, Microsoft Sam, was just one example of a platform that followed defined rules for how to say something rather than understanding why. There was no way to configure it to better pronounce brand names, perfect the intonation of certain words, or just generally sound better. Throw in a word like “soy” and you’d get something very weird.

If you wanted to use your own voice, Dragon NaturallySpeaking filled a gap. It required installed software to let you dictate your speech into documents, as long as you spoke at just the right speed and volume. It showed early signs of learning by being trained on your voice: read a few sentences out loud, or more if you had the patience, and the specific way you said “sausages” would be better matched next time. If the system didn’t know enough about how you spoke, it could give very unpredictable results.

It’s different these days, though. Today, we can leverage speech-to-text services that listen to, break down and identify what you say with a huge degree of accuracy. Your accent rarely throws it off (though there are still exceptions), and tone usually isn’t an issue either. If you’ve ever tried the wake word for your Amazon Echo more than three times and it’s simply not listening today, you’ll know the next “ALEXA!” you shout isn’t the happiest, but it usually does the trick.

And these capabilities keep evolving: the services and underlying models get better as more and more examples of speech become available to train on.
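For the curious, here’s roughly what that first step looks like in practice. This is a minimal sketch using the open-source Whisper model; the audio file name and model size are placeholders rather than anything from a real project.

```python
# Minimal speech-to-text sketch using the open-source Whisper model.
# "keynote.mp3" and the "base" model size are illustrative placeholders;
# larger models trade speed for accuracy, and ffmpeg must be installed.
import whisper

model = whisper.load_model("base")
result = model.transcribe("keynote.mp3")  # language is auto-detected by default

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the full transcript
```

The transcript that comes back can be handed straight to a translation service, which is where the next step comes in.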

Once our words are captured and understood, the next step is to make sure they mean the right thing in every language.

There was a point in the not-too-distant past when Google Translate was the well-recognised get-out-of-jail-free card for translation, but it also came with a cloud hanging over it, reminding you of all the times you’d been told how wrong it could be. Just as speech recognition has continually improved, though, so has translation.

Neural Machine Translation (NMT) has shifted these services from simply following linguistic rules to understanding them. Translations are constructed statistically, using enormous amounts of training data to figure out which word should come next and generate sentences comparable to a human translator’s.

Importantly, in a live translation pipeline, an AI approach needs most, if not all, of a sentence to be spoken in the original language before it can be translated. This ensures the full context of the subject is captured, including where nouns and adjectives land. Services like Azure are getting better at incremental translations, refining partial results as the sentence completes, as in the sketch below.
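As a rough illustration of that incremental behaviour, here’s a sketch using the Azure Speech SDK for Python. The key, region and language codes are placeholders, and the details reflect my understanding of the SDK rather than anything from this article.

```python
# Sketch of incremental (streaming) speech translation with the Azure Speech SDK.
# Key, region and languages are placeholders; audio comes from the default microphone.
import azure.cognitiveservices.speech as speechsdk

config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YOUR_KEY", region="westeurope")
config.speech_recognition_language = "en-US"
config.add_target_language("fr")

recognizer = speechsdk.translation.TranslationRecognizer(translation_config=config)

# "recognizing" fires with partial hypotheses while the speaker is mid-sentence;
# "recognized" fires once the sentence is final and the full context is available.
recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.translations))
recognizer.recognized.connect(lambda evt: print("final:  ", evt.result.translations))

recognizer.start_continuous_recognition()
input("Listening on the default microphone; press Enter to stop.\n")
recognizer.stop_continuous_recognition()
```

The partial results are handy for captions that keep pace with the speaker, but it’s the final result that has the word order and agreement sorted out.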

Translation is not a one-to-one, word-for-word mapping from one language to another. Translating “the red car” as “la rouge voiture” would not be acceptable in French; it would instead be “la voiture rouge”. Or so Google Translate assures me. I hope it’s right.
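If you’d rather not take Google Translate’s word for it either, a single call to a translation API makes the point. Here’s a hedged sketch using AWS Translate via boto3, assuming credentials and a region are already configured.

```python
# Word order is handled by the model, not by mapping words one-to-one.
# Assumes AWS credentials and a default region are configured for boto3.
import boto3

translate = boto3.client("translate")
response = translate.translate_text(
    Text="the red car",
    SourceLanguageCode="en",
    TargetLanguageCode="fr",
)
print(response["TranslatedText"])  # expected: "la voiture rouge", adjective after the noun
```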

Lastly, speech synthesis brings the whole conversation to life for an audience who might otherwise remain in the dark, whether at a conference, ordering dinner at a foreign restaurant, or watching an Instagram reel on how to cook burritos (mobile only).

Platforms offer a wide range of incredibly natural-sounding voices to choose from, so you can find the perfect match for your use case, or, in the case of Meta on the reel linked, even clone your own voice. These systems also continually improve with more use and time, sampling huge amounts of provided data. If you chose to listen to this article using the audio player at the top, you’re actually hearing my cloned voice. If you’re reading instead, it’s your own voice, but feel free to click the play button for a few seconds. It’s not perfect, but to my ears there’s an instant similarity. ElevenLabs’ voice cloning can change your voice somewhat convincingly with just 10 seconds of audio.
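And for the final step, turning text back into speech is close to a single API call. The sketch below uses the public ElevenLabs REST endpoint as I understand it; the voice ID, model name and API key are placeholders, with the voice ID pointing at a voice you’ve created or cloned in the ElevenLabs dashboard.

```python
# Sketch of text-to-speech over the ElevenLabs REST API (endpoint, voice ID and
# model name are assumptions/placeholders, not details taken from this article).
import requests

VOICE_ID = "YOUR_VOICE_ID"  # e.g. a cloned voice from the ElevenLabs dashboard
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "The future speaks, and it sounds like us.",
        "model_id": "eleven_multilingual_v2",  # multilingual model, assumed name
    },
)
response.raise_for_status()

with open("speech.mp3", "wb") as f:  # the endpoint returns audio bytes (MP3 by default)
    f.write(response.content)
```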

More platforms are opening their doors for you to explore what they have to offer. If you’ve ever wondered what you sound like in another language, SeamlessExpressive by Meta is quite something.

From the rule-bound voices of the past to the adaptive, expressive systems of today, speech technology has evolved from novelty to necessity. These tools don’t just help us talk to machines; they help us talk to each other.

As they continue to improve, we’re edging ever closer to a world where language is no longer a barrier but a bridge, one that sounds remarkably like us.
