I tried uploading MP3 files, and the generated text is only about 10 words in total: the same 3 words repeated over and over for a few pages. The files are SBS Swahili podcasts, about 10 minutes long (around 10 MB each).
@bamboozled might know more about other models.
You can try this Swahili-only model and see if it works better.
You can also see this post for more ideas.
Generally, Whisper is state of the art for about 10 (mostly European) languages and pretty decent for another 10. Most other languages are not really usable out of the box, and Swahili appears to be one of those less lucky languages. The fix in this case is to take OpenAI’s model as a basis and continue training it on Swahili data; this is called fine-tuning. There are many ways to do it, but a popular one is to follow this guide: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
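To give a rough idea of what that guide sets up, here is a minimal sketch. The checkpoint name, output path and hyperparameter values are illustrative assumptions, not the guide’s exact numbers, and a real run needs datasets, a data collator and a metrics function on top:

```python
# Sketch of a Whisper fine-tuning setup in the style of the
# 🤗 Transformers guide. All values here are illustrative.

# Base checkpoint to continue training from (a larger size is likely
# needed for good Swahili accuracy, per the discussion above).
BASE_CHECKPOINT = "openai/whisper-small"

# Hyperparameters you would pass to Seq2SeqTrainingArguments.
training_config = {
    "output_dir": "./whisper-small-sw",  # hypothetical output path
    "per_device_train_batch_size": 16,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 4000,
    "fp16": True,                         # requires a CUDA GPU
    "predict_with_generate": True,        # needed to compute WER during eval
}

def build_trainer():
    """Assemble model, processor and trainer (requires `transformers`)."""
    # Imported lazily so the sketch can be read without the library installed.
    from transformers import (
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        WhisperForConditionalGeneration,
        WhisperProcessor,
    )

    processor = WhisperProcessor.from_pretrained(
        BASE_CHECKPOINT, language="swahili", task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(BASE_CHECKPOINT)
    args = Seq2SeqTrainingArguments(**training_config)
    # A real run also needs train/eval datasets, a data collator and a
    # compute_metrics function -- see the linked guide for those parts.
    return Seq2SeqTrainer(model=model, args=args, tokenizer=processor)
```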
As for training data, a go-to choice is the Mozilla Common Voice project: Common Voice
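For reference, the Swahili subset can be pulled with the `datasets` library roughly like this. The dataset version tag is an assumption (check the Hub for the current release), and the dataset requires accepting its terms on the Hugging Face Hub:

```python
def load_swahili_common_voice(split="train+validation"):
    """Load the Swahili ('sw') subset of Mozilla Common Voice.

    Requires the `datasets` library; the version tag below is an
    assumption -- use whatever release is current.
    """
    from datasets import Audio, load_dataset  # lazy import: needs `datasets`

    ds = load_dataset("mozilla-foundation/common_voice_13_0", "sw", split=split)
    # Whisper's feature extractor expects 16 kHz audio, while Common Voice
    # ships 48 kHz recordings, so resample on the fly.
    return ds.cast_column("audio", Audio(sampling_rate=16_000))
```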
Since I’m completely unfamiliar with Swahili, I can’t say anything about other possible sources for training data, but they probably exist if you look for specific datasets.
The linked model is indeed a fine-tuned version of Whisper, but the resulting accuracy is still insufficient at 27% WER (roughly one word in four wrong); I’d say it becomes usable at around 10%. The model size is also not appropriate: I don’t think anything below ‘large’ will cut it.
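For intuition on that 27% figure: WER is edit distance (substitutions, insertions, deletions) over reference words. A minimal self-contained version, rather than the `evaluate`/`jiwer` implementations normally used, with a made-up Swahili example:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over word tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1] / len(ref)

# One wrong word out of four reference words -> 25% WER,
# in the same ballpark as the 27% reported for the linked model.
print(wer("habari ya asubuhi rafiki", "habari za asubuhi rafiki"))  # 0.25
```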
As for the fine-tuning process, I believe it should be pretty straightforward. But there is a chance that the results might not be satisfactory, because the OpenAI model doesn’t seem to have any understanding of Swahili.
Obviously, fine-tuning will cost some money: you need fairly big GPUs (e.g. an NVIDIA A100) and time.