Microsoft brings real-time translation to custom apps with Speech Translation API

Arif Bacchus

Microsoft’s artificial intelligence technologies have long powered the Microsoft Translator API in both Skype and the Microsoft Translator apps. Now, Microsoft is expanding on this with a new version of the Microsoft Translator API, which adds real-time speech translation capabilities to the existing text translation API.

This means that businesses will be able to add speech translation capabilities to their own services and bring a better experience to their customers. Supporting speech recognition in eight languages and spoken output in 18, the new version is also the first end-to-end speech translation solution optimized for real-life conversations.

The speech translation service can be used in numerous situations, both person-to-person and human-to-machine. Examples include personal translation, subtitling, and remote or in-person multilingual communication. It can even be used with groups, such as in online gaming chatrooms or real-time presentations like keynotes.

Still, speech translation is a very challenging process. Microsoft claims that the service uses the latest AI technologies and that no other fully integrated speech solution on the market today supports real-life speech translation scenarios.

A look at how the Microsoft Translator API works

The full details of how this technology works are outlined below; a rough code sketch of the whole pipeline follows the list.

  1. Automatic Speech Recognition (ASR) — A deep neural network trained on thousands of hours of audio analyzes incoming speech. This model is trained on human-to-human interactions rather than human-to-machine commands, producing speech recognition that is optimized for normal conversations.
  2. TrueText — A Microsoft Research innovation, TrueText takes the literal text and transforms it to more closely reflect user intent. It achieves this by removing speech disfluencies, such as “um”s and “ah”s, as well as stutters and repetitions. The text is also made more readable and translatable by adding sentence breaks, proper punctuation and capitalization. 
  3. Translation — The text is translated into any of the 50+ languages supported by Microsoft Translator. The eight speech languages have been further optimized for conversations by training on millions of words of conversational data using deep neural network-powered language models.
  4. Text to Speech — If the target language is one of the eighteen speech languages supported, the text is converted into speech output using speech synthesis. This stage is omitted in speech-to-text translation scenarios such as video subtitling.
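To make the flow concrete, here is a minimal conceptual sketch of the four-stage pipeline in Python. The stage functions are simple stand-ins for illustration only (the disfluency cleanup is a toy regex, and the function names are invented for this sketch); they are not Microsoft's models or APIs.

```python
import re

# Conceptual sketch of the four stages described above, chained end to end.
# Each function is an illustrative stand-in, not Microsoft's implementation.

def automatic_speech_recognition(audio_transcript: str) -> str:
    """Stand-in for the deep neural network ASR stage. The real service
    consumes audio; here we simply pass through a raw transcript that
    still contains disfluencies."""
    return audio_transcript

def true_text(raw_text: str) -> str:
    """Stand-in for TrueText: strip common disfluencies and repetitions,
    then tidy capitalization so the text is easier to translate."""
    cleaned = re.sub(r"\b(um+|uh+|ah+)\b[, ]*", "", raw_text, flags=re.IGNORECASE)
    cleaned = re.sub(r"\b(\w+) \1\b", r"\1", cleaned)  # drop simple word repetitions
    cleaned = cleaned.strip()
    return cleaned[:1].upper() + cleaned[1:] if cleaned else cleaned

def translate(text: str, target_language: str) -> str:
    """Stand-in for the translation stage; the real pipeline calls the
    Microsoft Translator text translation service here."""
    return f"[{target_language}] {text}"

def text_to_speech(text: str, speak: bool) -> str:
    """Stand-in for speech synthesis; skipped in speech-to-text scenarios
    such as subtitling."""
    return f"<synthesized audio of: {text}>" if speak else text

def translate_speech(raw_transcript: str, target_language: str, speak: bool = True) -> str:
    recognized = automatic_speech_recognition(raw_transcript)
    normalized = true_text(recognized)
    translated = translate(normalized, target_language)
    return text_to_speech(translated, speak)

print(translate_speech("um, hello hello, uh, how are you today", "es"))
```

Running the sketch turns the disfluent transcript "um, hello hello, uh, how are you today" into a cleaned, capitalized sentence before the (stubbed) translation and synthesis steps, which mirrors the order of the stages listed above.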

If you’re curious about the technology, a free two-hour trial of the API is available, and you can learn more about the API and read the documentation on Microsoft’s Swagger page.
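The article doesn’t describe the client protocol, but as a rough illustration, a call to a speech translation endpoint might look something like the sketch below, assuming a WebSocket endpoint and subscription-key authentication. The URL, query parameters, header name, and audio handling are all assumptions to verify against the Swagger documentation.

```python
# Rough illustration only: endpoint, query parameters, header name, and audio
# format are assumptions; check the Swagger documentation for the real values.
import websocket  # pip install websocket-client

SUBSCRIPTION_KEY = "YOUR-AZURE-SUBSCRIPTION-KEY"            # placeholder
ENDPOINT = ("wss://dev.microsofttranslator.com/speech/translate"
            "?api-version=1.0&from=en-US&to=es")            # assumed URL and params

def translate_wav(path: str) -> str:
    """Send a WAV file to the (assumed) speech translation endpoint and
    return the first text frame the service sends back."""
    ws = websocket.create_connection(
        ENDPOINT,
        header=[f"Ocp-Apim-Subscription-Key: {SUBSCRIPTION_KEY}"],  # assumed header
    )
    try:
        with open(path, "rb") as audio:
            ws.send_binary(audio.read())  # a real client would stream in small chunks
        return ws.recv()                  # expected: JSON with the translation result
    finally:
        ws.close()

if __name__ == "__main__":
    print(translate_wav("hello.wav"))
```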