Azure AI Speech Service Fast Transcription
Azure recently updated its Speech to Text service with what is known as Fast Transcription. As the name suggests, it aims to improve Speech to Text transcription turnaround time. At the time of this writing it is in preview, available through a REST endpoint, and still comes with the same features as before, such as profanity masking and speaker diarization. Although similar in name, it should not be confused with Azure AI Speech Real-time Transcription (which appears to be Microsoft's implementation and hosted service around OpenAI's RealTime API 🤔).
Some clarifications:
- Azure AI Speech Real-time Transcription - Azure's service for transcribing 'live' audio as it streams from an input device.
- Azure AI Speech Fast Transcription - Azure's transcription service for static files that have to be uploaded as a payload to the service (which this post discusses).
- OpenAI RealTime API - OpenAI's state-of-the-art API service for continuous dialogue with their multimodal models, with interruption capabilities and tone/sentiment understanding.
We will take a look at how it performs compared to the regular Azure AI Speech to Text service, which I wrote about just 7 months ago.
Azure AI Speech Fast Transcription Example
The updated Speech Service with Fast Transcription works no differently from a standard Azure AI Speech Service call: provide an endpoint using the Fast Transcription-specific API, a subscription key, and the audio file as a payload along with request headers, then deserialize the response. The key difference is the inference speed, as we will see shortly:
Using this audio sample (which is from the Last Week In AI podcast):
And using the following code:
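A minimal sketch of such a call in Python, assuming the preview REST route `speechtotext/transcriptions:transcribe` with `api-version=2024-05-15-preview`, the `Ocp-Apim-Subscription-Key` header, and a multipart body with `audio` and `definition` parts; the region, key, and file name are placeholders:

```python
import json


def build_transcribe_request(region: str, key: str, locale: str = "en-US"):
    """Assemble the URL, headers, and JSON 'definition' part for a
    Fast Transcription request (endpoint shape assumed from the preview)."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com/speechtotext/"
        "transcriptions:transcribe?api-version=2024-05-15-preview"
    )
    headers = {"Ocp-Apim-Subscription-Key": key}
    definition = json.dumps({"locales": [locale]})
    return url, headers, definition


def transcribe_file(path: str, region: str, key: str) -> dict:
    """POST the audio file as multipart/form-data and return the parsed JSON."""
    import requests  # third-party: pip install requests

    url, headers, definition = build_transcribe_request(region, key)
    with open(path, "rb") as audio:
        files = {
            "audio": audio,
            "definition": (None, definition, "application/json"),
        }
        response = requests.post(url, headers=headers, files=files, timeout=120)
    response.raise_for_status()
    return response.json()


# Usage (placeholders): transcribe_file("podcast-clip.wav", "eastus", "<your-key>")
```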
The output is seen as follows, in an amazing 7.8 seconds (an over 8x improvement), compared to 68 seconds last time in the context of a Docker container:
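To pull the plain transcript out of the response, a sketch like the following can be used; the `combinedPhrases` and `durationMilliseconds` field names are assumptions about the preview response shape, and the sample payload is illustrative only:

```python
def extract_transcript(response: dict) -> str:
    """Join the combined phrase texts into a single transcript string."""
    return " ".join(p["text"] for p in response.get("combinedPhrases", []))


# Illustrative response shape (not real output from the service)
sample_response = {
    "durationMilliseconds": 120000,
    "combinedPhrases": [{"text": "Welcome to Last Week in AI."}],
}

print(extract_transcript(sample_response))  # -> Welcome to Last Week in AI.
```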
Fast Transcription can also be previewed at ai.azure.com.
Conclusion
This result of 7.8 seconds for a 2-minute clip even beats the result from OpenAI's standalone open-source Whisper tiny model as of April 2024, which transcribed the same audio clip in 13 seconds in a previous blog.
It should also be noted that part of the reason inference took so long last time, under the conditions of an Azure AI Speech Service Docker container, was likely an artefact of the overheads brought on by the container image itself, which weighed in at 9 to 12 GB and required 6 GB of memory just to operate. However, even giving a whole 25 seconds away to these overheads (leaving us with 43 seconds of actual inference computation) would still mean the standard Azure AI Speech Service was taking several times longer than Fast Transcription's 7.8 seconds.