Speech to Text to Speech: Using AI for Video Translation

Artificial Intelligence (AI) automation is everywhere right now. With video content growing rapidly, many companies are wondering whether AI tools could reduce the time and cost of their video translations.

What tools are available for companies looking to use AI in video translation?

And are such tools reliable and accurate enough to use in place of human translations?

There are certainly some promising benefits to using AI tools in video translation… with some caveats.

Do AI tools mean the end for human translation?

You might think that we would shy away from talking about AI tools, given that we are a translation provider. After all, it might seem like automated services would mark the end of human translation…

But, here at Rubric, we love technology!

We’re as much technology experts as we are language experts.

If an automated tool can help us to streamline our process and help our clients to streamline theirs, why wouldn’t we use it?

The two “headline benefits” of AI-assisted translation are:

Faster translations
Lower translation costs

These benefits can certainly be the reality. However, you need to make sure that you are using those tools in the most effective way possible.

We actively look for ways to use technology, including AI, to reduce our clients’ long-term localization costs. But we always do so in a measured way, and we only use such technology when it is appropriate: when it aligns with the client’s strategy, content, and the needs of their end-users.

How AI technology fits into the video translation process

There are three stages of the video translation process where AI tools can be helpful:

Speech to Text (STT) that turns the video’s audio into a source text for translation.
Machine Translation (MT) that converts the source text into the target language.
Text to Speech (TTS) that creates artificial spoken audio of the translated text.

Stage 1: Speech to Text (STT)

Speech to text AI “listens” to a person speaking in natural language and automatically turns this information into readable text. STT algorithms are currently used in a wide variety of situations, including live meeting transcription, digital personal assistants, and smart home technology.

For video translation, STT can be used to quickly turn a spoken video into (almost) translatable text. You just feed your video into the algorithm and it produces a script of what is being said.

What to watch out for with STT

It’s important to remember that speech to text technology is not perfect. Accuracy varies hugely between the most common STT algorithms, from about 40% to 95%, and some score even lower than this.

Even in perfect conditions, STT still struggles with some words and phrases. As Rubric’s Global Content Business Analyst Rebecca Metcalf explains:

“I have found that the most common type of error that speech to text AI makes is the omission of words. These can be very important words which completely alter the meaning, for example omitting ‘non’ in ‘non-flammable’. Another very common error is the misspelling of proper nouns like the names of people, companies, or products.”

The key to getting the most from STT is to add the human touch. Someone should always check and edit the transcription before sending it for translation.
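To see why a transcript can score well and still be risky, it helps to look at how accuracy is often measured. The sketch below assumes that accuracy means one minus the word error rate (WER), a common STT metric; the percentages quoted above may have been measured differently. The function name and the sample sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # the STT dropped a word
            insertion = curr[j - 1] + 1  # the STT added a word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: what was said vs. what the STT produced.
reference = "this material is non flammable"
hypothesis = "this material is flammable"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
```

Here a single dropped word leaves the score looking respectable while reversing the meaning of the sentence, which is why a human check of the transcript still matters.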

Stage 2: Machine Translation (MT)

A lot of our clients ask us if they can use machine translation. Google Translate now gives everyone instant translation into many languages so, naturally, people are interested in using such technology for their video translations as well.

MT tools take a text written in the source language and automatically translate that text into the target language.

What to watch out for with MT

While our clients ask if they can use machine translation, we always ask if they should.

As with speech to text technology, MT is not perfect. Tools can have around 70% accuracy, which plummets if the source content is of low quality, for example if it was created by a speech to text algorithm and hasn’t been edited.

Rebecca explains:

“If the narrator in a video says ‘And so, well, ummm, next up, next we’re going to take a look at…’, the STT might transcribe this verbatim. But, it’s usually better to just transcribe it as ‘Next…’ How concise the original subtitle is has a big impact on the final translated version.”

This is true of human translation, but it is especially important with MT.
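Some of this cleanup can be automated before a human editor takes over. The following is a minimal sketch, assuming an English transcript and a small, hypothetical list of filler sounds; the patterns and the function name are illustrative, not part of any particular STT tool.

```python
import re

# Hypothetical filler sounds; extend the list for your own speakers.
FILLER_PATTERN = re.compile(r"\b(?:um+|uh+|erm+)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated, such as "the the" or "we, we".
REPEATED_WORD = re.compile(r"\b(\w+)(?:[,.]?\s+\1\b)+", re.IGNORECASE)


def strip_fillers(line: str) -> str:
    """Remove filler sounds and stuttered repeats from one transcript line."""
    cleaned = FILLER_PATTERN.sub("", line)
    cleaned = REPEATED_WORD.sub(r"\1", cleaned)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_fillers("And so, well, ummm, next up, next we're going to take a look at the settings."))
print(strip_fillers("Uh, the the valve is open"))
```

A script like this only removes the obvious noise. Deciding that the whole opening can be shortened to “Next…” is still an editorial judgment, so the cleaned text should go to a human editor before translation.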

One way to make the best use of MT is to combine it with a translation memory: a database that stores your business’s commonly translated words and phrases alongside their accepted translations. The MT algorithm can then access the translation memory and automatically swap in the accepted translation whenever one of these phrases appears, which reduces the cost of translation.

In our work with AccuWeather, we used a translation memory to reduce the number of words they needed translated from 1 million to just 50 thousand.
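The lookup step can be sketched in a few lines. This is a simplified illustration that assumes exact matching after basic normalization; the memory contents, the sample script, and the function names are all hypothetical, and a real memory would be loaded from your translation tool rather than typed in.

```python
def normalize(segment: str) -> str:
    """Case-fold and collapse whitespace so trivial differences still match."""
    return " ".join(segment.casefold().split())


def split_by_memory(segments, memory):
    """Return (reused translations, segments still needing translation)."""
    reused, pending = {}, []
    for segment in segments:
        key = normalize(segment)
        if key in memory:
            reused[segment] = memory[key]
        else:
            pending.append(segment)
    return reused, pending


# Hypothetical English-to-German memory, keyed by normalized source text.
translation_memory = {
    normalize("Save your changes"): "Änderungen speichern",
    normalize("Press the power button."): "Drücken Sie die Ein/Aus-Taste.",
}

# Hypothetical video script, split into subtitle segments.
script = [
    "Press the power button.",
    "save your  changes",
    "Wait for the light to turn green.",
]

reused, pending = split_by_memory(script, translation_memory)
total_words = sum(len(s.split()) for s in script)
pending_words = sum(len(s.split()) for s in pending)
print(f"{pending_words} of {total_words} words still need translating")
```

Only the segments in `pending` go on to MT or a human translator, which is where the cost saving comes from. Translation memory tools typically also offer fuzzy matching for near-identical segments, which this sketch leaves out.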

The key to getting the most from MT is also the human touch. It should be a tool to help streamline human translation, not remove humans completely.

Stage 3: Text to Speech (TTS)

When you are creating videos, overdubbing in another language can be a huge expense. Text to speech provides an option to turn translated text into artificially spoken words and is a significantly cheaper alternative.

While the quality of TTS voices is still rather unnatural, the technology is getting better all the time and can be a usable solution for some types of video. For example, it might be okay to use AI overdubbing for an internal training video.

What to watch out for with TTS

One of the difficulties with text to speech is that TTS engines don’t pronounce every word correctly. For example, proper nouns can sometimes sound strange when spoken by a machine. The accuracy of text to speech can range from 60% to 98%.

Pronunciation problems are exacerbated when the input text is of low quality, for example when it has been produced by machine translation and hasn’t been edited.

You can improve pronunciation accuracy by tweaking the punctuation and the spelling of incorrectly pronounced words. However, depending on the type of video you are producing, this might not be necessary.
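If the same words keep tripping up the TTS engine, the respellings can be kept in one list and applied to every script before it is voiced. The sketch below is one possible approach; the respellings, the product name “XR2”, and the function name are made up for illustration and would need tuning by ear for your chosen voice.

```python
import re

# Hypothetical respellings, found by listening to the TTS output.
RESPELLINGS = {
    "SQL": "sequel",
    "XR2": "X R two",
}


def apply_respellings(text: str, respellings: dict[str, str]) -> str:
    """Swap whole-word matches for spellings the TTS voice reads correctly."""
    if not respellings:
        return text
    # Try longer terms first so they win over shorter overlapping entries.
    terms = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, terms)) + r")(?!\w)"
    )
    return pattern.sub(lambda match: respellings[match.group(0)], text)


print(apply_respellings("Export the SQL report from the XR2.", RESPELLINGS))
```

The respelled text is only for the TTS engine; the on-screen subtitles should keep the original spelling. Many TTS engines also accept SSML markup for pronunciation hints, which may be a cleaner option where it is supported.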

As with the previous stages, the human touch is vital. Both reviewers and translators should look over the final video to ensure that no problems exist.

How to get started with better video translation

Overall, AI technologies can be a great tool to streamline the video localization process, but only if you use them in the right way and don’t forget the importance of the human touch.

Whatever type of videos you are creating, there are also many other ways that you can improve your video localization process.