Azure AI Speech to Text vs OpenAI Whisper
Azure AI Speech and OpenAI's Whisper model can both transcribe speech to text, not only in the cloud but at the edge too. We do not always need to rely on API keys that authenticate against a cloud-hosted inference service.
Let's take a quick look at Azure AI Speech to Text and OpenAI's Whisper model running inference locally. Although Azure AI Speech to Text primarily runs as a pure cloud service, it can also run as what could be called a tethered Docker container (we will see why shortly). The OpenAI Whisper model can run entirely inside a Docker container and comes in different model sizes to suit various compute capacities and devices. We will be transcribing this 2min 12s audio clip, in practice as a .wav file (the mp3 shown is for illustration only, because of blog file limitations):
Download the mp3 here (convert to .wav):
Azure AI Speech to Text Container
It should be noted that the official Azure AI Speech Docker image is currently available only as a Linux image designed for x64 machines (it would be nice if there was ARM support 🙂). I will be running this Linux container on a Windows PC.
Some pre-requisites:
- Azure AI Speech Service (Free Tier is fine): yes, even with a Docker container running locally, we still require a billable Speech Service in Azure. According to Microsoft, the Docker images are not licensed to operate without being tethered back to Azure for metering/billing. It is still possible to run these containers completely offline, but this requires a permissioned application process where you fill in a form with Microsoft directly (these are called 'Disconnected Containers').
- Docker Desktop for Windows
With Docker Desktop installed and running on the machine, use the following command to pull the image:
It will take some time to get the image:
Run your container instance with the following as an example, supplying your Azure AI Speech Service API Key and your Speech Service endpoint URL:
The container start-up will take a few seconds to complete. After running a container instance and going to your localhost endpoint (localhost on port 5000) you should see this as evidence of the container running successfully:
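Rather than refreshing the browser, the same check can be scripted. The following is a minimal sketch, assuming the container is published on localhost port 5000 as described above. The helper names `http_ok` and `wait_until_up` are hypothetical and only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not up yet.
        return False


def wait_until_up(url: str, attempts: int = 30, delay: float = 1.0, probe=http_ok) -> bool:
    """Call `probe(url)` until it succeeds or `attempts` tries have been made."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the container time to finish starting
    return False
```

Calling `wait_until_up("http://localhost:5000/")` before starting a transcription avoids sending audio to a container that is still loading. The `probe` parameter is there only so the polling loop can be exercised without a running container.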
Speech Recognition using local Azure Speech to Text Container
To interact with the container in code, we point the speech configuration at localhost on port 5000 instead of supplying the cloud service API key and region, and then carry out speech recognition into text as usual. The following C# code (for .NET 8) will work for transcribing long audio, whereas the RecognizeOnceAsync() method only recognises up to 20s of speech at the time of writing:
With the container running in the background, this is what our code produces, followed by the time taken to complete inference (68s for a 2min 12s audio clip):
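For readers working in Python rather than C#, the same idea can be sketched with the Speech SDK for Python. This is not the C# code used in this post. It is a rough equivalent, assuming the `azure-cognitiveservices-speech` package is installed and the container is listening on `ws://localhost:5000`. The function names `transcribe_file` and `join_segments` are hypothetical. It uses continuous recognition so that the whole file is processed rather than a single utterance.

```python
import threading


def join_segments(segments):
    """Join recognised phrases into one transcript, skipping empty ones."""
    return " ".join(s.strip() for s in segments if s.strip())


def transcribe_file(wav_path: str, host: str = "ws://localhost:5000") -> str:
    """Transcribe a whole WAV file through a local Speech container."""
    # Imported lazily so this module loads even without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    # Assumption: point the SDK at the container host, not a cloud key/region.
    speech_config = speechsdk.SpeechConfig(host=host)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    segments = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per recognised phrase during continuous recognition.
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            segments.append(evt.result.text)

    recognizer.recognized.connect(on_recognized)
    # Stop waiting when the end of the file is reached or an error occurs.
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    done.wait()
    recognizer.stop_continuous_recognition()
    return join_segments(segments)
```

A call such as `transcribe_file("audio.wav")` (the file name is a placeholder) returns the full transcript once the session stops. Wrapping the call in `time.perf_counter()` readings would give a timing comparable to the one reported above.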
OpenAI Whisper Speech to Text
The Whisper model comes in different sizes, as mentioned before. For a standard Windows machine like mine, I'll be selecting the tiny model to carry out the transcription. I'll be containerising a small Python script that transcribes the same audio as before and prints the transcription at the end. The Python script is as simple as follows:
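As a rough idea of what such a script can look like (this is a sketch, not the exact app.py from this post), the following assumes the `openai-whisper` package and ffmpeg are installed in the image. The function names `transcribe` and `format_report` and the file name `audio.wav` are placeholders.

```python
import time


def format_report(text: str, seconds: float) -> str:
    """Build the final output: the transcript followed by the time taken."""
    return f"{text}\n\nTranscribed in {seconds:.1f}s"


def transcribe(audio_path: str, model_name: str = "tiny"):
    """Load a Whisper model and transcribe one audio file.

    Returns the transcript text and the seconds spent in the transcribe call.
    """
    # Imported lazily so this module loads even without whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    start = time.perf_counter()
    result = model.transcribe(audio_path)
    elapsed = time.perf_counter() - start  # excludes model loading time
    return result["text"].strip(), elapsed
```

Inside the container, the entry point would then be something like `print(format_report(*transcribe("audio.wav")))`. Timing only the `transcribe` call is an assumption made here so that model loading is measured separately.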
My Dockerfile is as follows:
To build the image, navigate to the local path with the app.py and Dockerfile and execute:
To run the OpenAI Whisper model and start inference, run a container from the newly built whisper-tiny image with the following:
And you get a result fairly quickly with the OpenAI tiny model after it gets loaded (in 13.4s!!!):
Final Conclusions
The results are in for transcribing a 2min 12s audio clip. The machine used had the following specifications:
- CPU: Intel Core i5 4690K, 4 cores
- RAM: 16GB DDR3
- VRAM (Whisper takes advantage of this): 2GB Nvidia GTX 960 GPU
Azure AI Speech to Text (allocated 6GB RAM, 4 cores): 68 seconds
OpenAI Whisper (using fastest tiny model): 13 seconds!!! 😎
It could be argued that Azure AI Speech was slower because of its lower RAM allocation, but my machine was the limiting factor: when I allocated 8GB of memory to the Azure AI container, the machine could barely run anything else.
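To put the two timings on a common scale, they can be expressed as a real-time factor (processing time divided by audio duration, where lower is faster). The short calculation below uses only the figures reported in this post: a 2min 12s clip, 68s for the Azure container and 13.4s for Whisper tiny. The function name `real_time_factor` is just illustrative.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


AUDIO_SECONDS = 2 * 60 + 12  # the 2min 12s clip
AZURE_SECONDS = 68.0         # Azure AI Speech container, as reported above
WHISPER_SECONDS = 13.4       # Whisper tiny, as reported above

azure_rtf = real_time_factor(AZURE_SECONDS, AUDIO_SECONDS)
whisper_rtf = real_time_factor(WHISPER_SECONDS, AUDIO_SECONDS)
speedup = AZURE_SECONDS / WHISPER_SECONDS

print(f"Azure container RTF: {azure_rtf:.2f}")    # 0.52
print(f"Whisper tiny RTF:    {whisper_rtf:.2f}")  # 0.10
print(f"Whisper was about {speedup:.1f}x faster") # 5.1x
```

Both setups ran faster than real time on this machine. The figures say nothing about transcription accuracy, which was not measured here.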