When to Use Speech to Text Translation for Videos

You need to translate some spoken videos…

Should you use speech to text technology to prepare them for translation?

There are some important factors to remember when you are translating speech into another language. Speech to text translation, with the aid of artificial intelligence (AI), can be a good option.

But you need to use it correctly…

Why some spoken videos are tricky to translate

Video projects are relatively complex when it comes to translation and localization.

Although the translation of the words is usually straightforward, the engineering work associated with video localization can be significant. Extra time is needed for tasks like formatting subtitle files, stripping text from on-screen graphics, and precisely matching the timings of the video file.
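To give a sense of what formatting a subtitle file and matching timings involves, here is a minimal sketch in Python. It assumes the widely used SubRip (.srt) layout, which this article does not specify; the function names and the sample cue are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form used in SubRip (.srt) files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered cue: index, timing line, then the subtitle text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Hypothetical cue: shown from 3.5 s to 6.25 s of the video.
print(srt_cue(1, 3.5, 6.25, "Next, the product range."))
```

Every translated subtitle has to fit the same start and end times, which is one reason a change to the video after translation creates rework.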

One factor that affects the complexity of the translation is whether the video contains someone speaking in a conversational style.

For example, a video of a speaker giving a conference presentation is likely to be unscripted or semi-scripted. The person will speak with more filler words (for example “um”, “er”) and their language will be less structured than, say, a highly scripted marketing video.

To reduce the work needed to turn such speech into translatable text, people often use speech to text technology.

What is speech to text translation?

Speech to text translation simply involves turning spoken words into text and then translating those words into another language.

The transcription can be carried out either…

by a human transcription service.
by an AI-powered speech to text program.
by a mixture of both.

Speech to text AI has grown in popularity over recent years and is now a fundamental feature in our lives. If you have dictated a message to your cellphone or searched for something with a smart speaker, you have interacted with a speech to text algorithm… and probably experienced the shortcomings of the technology.

It’s important to remember that auto-generated transcriptions are not perfect. At best, they reach around 70% to 85% accuracy, and results vary hugely. As a result, the output of a speech to text program should always be checked by a human.

This reduced accuracy means you need to make specific allowances when it comes to video localization.
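Accuracy figures for transcription are commonly derived from the word error rate: the number of word-level substitutions, insertions, and deletions needed to turn the machine transcript into a correct one, divided by the length of the correct transcript. The sketch below assumes that definition (this article does not say how its percentages were measured), and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


human = "welcome to our annual conference"      # checked by a person
machine = "welcome to are annual conference"    # hypothetical AI output
accuracy = 1 - word_error_rate(human, machine)
print(f"Word accuracy: {accuracy:.0%}")
```

A single misheard word in a five-word sentence already brings the score down to the range quoted above, which is why a human check matters.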

Why use speech to text translation?

Speech to text technology can be a valuable method to help streamline the translation process.

AI algorithms provide a tool to quickly turn spoken words into (almost) translatable text. For example, an hour-long video of a conference presentation could take just a few minutes to process with an automated service, compared to hours for human transcription.

With one hour of conversational speech possibly coming in at around 8,000 words, handing the task to automation can mean a significant time saving.

Even with the additional step of manual checking and correction, there is certainly a benefit to using speech to text automation.
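As a rough illustration of the checking workload, the following sketch estimates how many words a human might need to correct in a one-hour video. It takes the 8,000-word and 70% to 85% figures from this article and assumes, for simplicity, that accuracy is measured per word; the function name is hypothetical.

```python
WORDS_PER_HOUR = 8_000           # figure quoted in this article for one hour of speech
ACCURACY_RANGE = (0.70, 0.85)    # best-case accuracy range quoted in this article


def words_to_correct(total_words: int, accuracy: float) -> int:
    """Estimate how many words are wrong, assuming per-word accuracy."""
    return round(total_words * (1 - accuracy))


for accuracy in ACCURACY_RANGE:
    errors = words_to_correct(WORDS_PER_HOUR, accuracy)
    print(f"At {accuracy:.0%} accuracy: about {errors:,} words to check and fix")
```

Under these assumptions the checker corrects somewhere between 1,200 and 2,400 words, which is still far less work than transcribing all 8,000 from scratch.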

Due to the complexity of video projects, any time saving that we can make is valuable. Translation is usually the last stage in a long process and the time for translation often gets cut down as other activities run longer than originally expected.

How not to use speech to text translation

Although speech to text can be a useful tool, there are several pitfalls that you can run into if you are not careful.

Here are some common mistakes people make with speech to text translation:

Not sending the source video file to translators: Even when you have transcribed the video yourself and had it checked by a human, it’s still important to send the original video file to the translation provider. Information can be lost in the transcription process (especially when using automation), so the translator needs the original file to get the best result.
Ignoring the human step: Many people think that speech to text technology is good enough and skip the human check of the transcription. This is a mistake. It may be better to give the video file directly to the translation provider and have them do all the work than to give them a low-quality machine transcription. An added benefit of the human touch is that the translation itself becomes cheaper, because editing makes the text shorter and more concise. For example, the speech “And so, well, ummm, next up, next we’re going to take a look at…” becomes simply “Next…” This is much better for subtitles.
Not being clear about jargon: AI tools often struggle with jargon and slang. This can wreak havoc on a translation project if you’re not careful, especially with unscripted or semi-scripted speech, where slang and jargon are more common.
Forgetting the environment: Speech to text technology works best when the audio is clear, with easily understood accents, no background noise, and no overtalk. It can struggle when people are speaking in a non-studio environment.
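Part of the cleanup described in the list above can be pre-screened automatically before a human editor takes over. The sketch below strips a few common filler words from a transcript line. The filler pattern and function name are illustrative assumptions, and this is not a substitute for human editing: it would not, for instance, condense a rambling sentence all the way down to “Next…”.

```python
import re

# Hypothetical, deliberately short filler pattern ("um", "uh", "er", "erm" and
# stretched variants); a real project would agree the list with the editor.
FILLER_PATTERN = re.compile(r"\b(?:u+m+|u+h+|e+r+m*)\b[,.]?\s*", re.IGNORECASE)


def strip_fillers(line: str) -> str:
    """Remove standalone filler words and tidy up leftover spacing."""
    cleaned = FILLER_PATTERN.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_fillers("And so, well, ummm, next up, next we're going to take a look at the results"))
```

The word boundaries in the pattern keep it from touching ordinary words such as “number” or “umbrella”; repetitions like “next up, next” are left for the human editor.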

These issues might seem small, but they can have significant impacts on the translation.

For example, the other day, one of our clients was translating a video where the speaker often used the word “kit” to mean “a set of products.” The AI detected this as “kid” (i.e. child). If we hadn’t been on top of this jargon, it could have been translated into niño, enfant, or дитя instead of productos, produits, or товары, which would have caused many issues in the review process.
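One way to stay on top of jargon like this is to agree on a glossary with your provider and flag suspicious words in the machine transcript for the human checker. The sketch below is a hypothetical illustration: the glossary contents, the function name, and the idea of listing known mis-hearings by hand are assumptions, not a description of any particular tool.

```python
import re

# Hypothetical glossary: a word the speech to text engine tends to produce,
# mapped to the term the speaker most likely meant.
LIKELY_MISHEARINGS = {"kid": "kit", "kids": "kits"}


def flag_mishearings(transcript: str) -> list[tuple[int, str, str]]:
    """Return (line number, word found, suggested term) for a human to review."""
    flags = []
    for line_no, line in enumerate(transcript.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            suggestion = LIKELY_MISHEARINGS.get(word.lower())
            if suggestion:
                flags.append((line_no, word, suggestion))
    return flags


sample = "Each kid ships in one box.\nOur kits include a manual."
for line_no, word, suggestion in flag_mishearings(sample):
    print(f"Line {line_no}: heard '{word}', did the speaker mean '{suggestion}'?")
```

The script only raises questions; deciding whether the speaker really said “kid” still needs someone who can listen to the original video.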

How to use speech to text translation effectively

When you are aware of the challenges of speech to text translation, there are some simple steps you can take to ensure that your video localization project is carried out effectively.

Steps you can take include:

Talk to your translation provider early to discuss your needs for the translated video content.
Decide up front what you want the final outcome to look like, and communicate this to your provider.
Provide the correct files (for example, the original video and scripts, if available) in the right format.
Be clear on the extra time that will be needed for specific tasks.
Decide with your provider whether they will transcribe the speech for you or you will take care of transcription yourself (and ask them for advice on which is best for your unique project).
If possible, make sure you send them the final version of the video, because the subtitle timings shouldn’t change afterwards.

Of course, many factors affect a video localization project, not just the speech to text step of the process.