ChatGPT voice... tried it

I watched some videos about using ChatGPT to have spoken conversations in other languages. And I thought I would try it for myself.

To do this, you need to install the ChatGPT app on iOS or Android, and have a paid account. And you can give some grounding to ChatGPT to smooth out the experience.

It works. It’s just not very good. The reason it was a bad experience for me is that the word error rate is still relatively high for speech recognition. You’ll have to trust me that my non-native Spanish pronunciation is decent, I guess. ChatGPT recognized every word correctly in only about 3 out of 4 of my sentences.
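
To make the "3 in 4 sentences" figure concrete: that is really a sentence-level accuracy, which is a stricter measure than word error rate (WER), since one wrong word fails the whole sentence. Here is a minimal Python sketch of both measures, assuming WER is the word-level edit distance divided by the number of reference words. The sentence pairs are invented for illustration and are not real transcripts from my sessions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped by the recognizer
                curr[j - 1] + 1,     # extra word inserted by the recognizer
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def sentence_accuracy(pairs) -> float:
    """Fraction of sentences in which every word was recognized correctly."""
    exact = sum(
        1 for ref, hyp in pairs
        if ref.lower().split() == hyp.lower().split()
    )
    return exact / len(pairs)


# Hypothetical (what I said, what was recognized) pairs, made up for this example.
pairs = [
    ("me gusta el café", "me gusta el café"),
    ("no me gusta la lluvia", "no me gusta la lluvia"),
    ("quiero practicar español", "quiero practicar español"),
    ("vivo en una ciudad pequeña", "vivo en una ciudad pequeño"),
]

for ref, hyp in pairs:
    print(f"{word_error_rate(ref, hyp):.2f}  {hyp}")
print("sentence accuracy:", sentence_accuracy(pairs))
```

With these made-up pairs, a single wrong word in one sentence is enough to drop sentence accuracy to 3 in 4, even though most individual words were recognized.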

From a technical point of view, ChatGPT is not the weak link here. It’s the speech recognition (Whisper, I believe) that isn’t ready for this task. ChatGPT will do a great job of simulating a conversation in Spanish (and with varying quality in other languages) as long as you are typing/reading.

The next problem with voice ChatGPT is the lengthy delay from uploading audio to the server, having it processed, and then the LLM part of ChatGPT responding. While it’s technically amazing that it can work at all, it still takes at least a few seconds, and sometimes five or six, to hear a response to what you say.

The last problem is that ChatGPT will interrupt you unless you can speak very quickly with no pauses. If you are learning to speak a language, there are lots of pauses. It is possible to use a button to indicate when you are still speaking, but this is cumbersome.

I feel compelled to write this because I watched two videos that presented the experience as wonderful. And I guess they were just lying. I mean, one YouTuber edited out the delays to make the ChatGPT conversation seem natural.

I leave open the possibility that I’m doing it wrong. And also, I’m curious if anyone else has tried it and had a similar or different experience.


Interesting, thanks for sharing! It does make sense that the clunkiness you experienced was disappointing and unnatural. But in my view it is (as you say) pretty impressive that it even worked that well. Considering the pace of progression with ChatGPT, I’m sure the recognition responsiveness will improve dramatically before too long; it won’t surprise me if someone finds/creates a better solution to the interruption issue as well.

I’ve been having pretty good results with a somewhat related but admittedly much simpler task: using generative AI including text-to-speech to generate simple content and audio clips to recreate something similar to the Mini-Stories experience for my studies of Amharic, which is not yet supported on LingQ. It’s of course nowhere near as good as a native speaker, but for my purposes it is working pretty well for these initial stages of getting used to hearing a completely unfamiliar language. Anyway, thanks again for sharing your experience.


Interesting experience, Erik. Thanks for your post!

That said, my experience with the AI is a bit different:
I’ve used ChatGPT v3.5 on Memrise with various languages (English, French, Spanish, Dutch, and Portuguese) and I’m amazed at how good the conversational quality already is.

Yes, the AI has issues with sloppy and fast pronunciation, some words / collocations or code switching, but overall my experience has been pretty smooth.

The difference could be that

  1. The topic of the conversation is predetermined,
  2. Memrise is probably using a customized genAI version.

PS -
Perhaps it’s better to use customized genAI versions for talking in our L2s such as:


I’ve tried it as well on ChatGPT 4 and it works well enough for me in French. It recognizes everything I say, but I’m quite advanced.

My main problem with it is that it speaks for far too long. The point would be for me to speak and give my opinion, but usually I just say a couple of sentences and then ChatGPT responds by giving me its four-paragraph view on the subject.

At that point it becomes mostly a listening exercise and not a speaking one.


You could tell the AI to give you only short answers :slight_smile:


I’ve used ChatGPT Voice for a few sessions to practice my German and my experience is that it understood every single word I said. If you’re having problems with ChatGPT understanding you, and you’re sure it’s not your accent, maybe you should see if it’s a problem with the microphone you’re using.

My issue with ChatGPT Voice is that the interface is too clunky. Sometimes it will respond with a proper German accent, but other times it defaults to an American voice speaking German. It’s very off-putting. Also, setting up good prompts is very time-consuming.

As for delays after speaking, I had no real issue with that. The delays I experienced weren’t long. Maybe it has something to do with processing power and upload/download speeds.

I think AI like this will become very useful for language learners, but it’s just not there yet. I think we need someone to develop some kind of interface so that it works flawlessly for language learners. I suspect we’ll see such apps appearing in the near future.


The customized GPT versions are already there.
And they probably will get much better once ChatGPT v5 becomes available in 2024.


I actually saw an icon by chance when I opened ChatGPT last night. I clicked it and, to my surprise, ChatGPT was talking. I don’t have a paid account yet!

I had a brilliant conversation about Dune for 30 minutes, and I continued the conversation this morning about the same subject. It was very cool because I was doing something else and didn’t want to write or read, and having the conversational option was just great!

The experience has been awesome so far. Just a couple of times ChatGPT answered me in 2 different languages, and I had to ask to go back to English.

I like that it stays engaged with the previous conversations, so I can keep asking things on the subject hours later, or the next day, for example.

I also like that ChatGPT stays engaged after answering, so I can do something else before talking, or better, think before asking my next question, and I don’t have to worry about clicking anything again.

Very excited! I admit! :rofl:

I believe I will soon go with a paid account. I’m deciding between ChatGPT and Poe, but I have seen that OpenAI’s app has been updated a lot, there are lots of other services included now, and I’m interested in image generation as well. I want to pay for a service that gives me more things (image generation, writing improvement, conversation…).

PS: I’m also trying to find out whether it works well with a HomePod mini, which I don’t have yet. If any of you has an iPhone and a HomePod mini and would like to run some tests, that would be great. It would be the only reason for me to buy one now.


You can set up a specific chat with just the prompt you want. As @PeterBormann said, you can ask for shorter answers in that specific chat. You can create your own prompt stating that you want to practice speaking, and telling it to keep its answers short and encourage you to speak.
Then you can use that chat only when you need to. Sometimes you will need to remind the AI of your initial prompt.


all good tips, hadn’t thought of it. thx


Y’all convinced me to give it another try this weekend. Checklist:

- Try it from another device/microphone.
- Set the prompt grounding to limit replies to one sentence.
- Get better at using the UI to signal when I’m talking.
- Remove the instructions I had in the prompt grounding to correct me when I’ve made a mistake.

We’ll see how it goes.


Okay, I came back and tried it again. It’s a decent experience after improving the prompt grounding and learning to hold down the button walkie-talkie-style. I found that ChatGPT wanted to switch to other languages if I happened to speak a tiny bit off or if its speech recognition got confused, and the experience was improved by grounding it to always, always, always stay in Spanish.

Here is a GPT I made that just asks you to say three things you like and three things you don’t like in Spanish.

The grounding I used is:

I am a Spanish student. This is a role-playing exercise, where you can help me learn. The goal will be for me to tell you three things that I like and three things that I do not like. You can ask me questions to point me towards this goal. Once I have completed the goal, you should congratulate me on completing the exercise.

All of our conversation will be in Spanish. Never speak in English, even if it seems I am speaking in a different language. My intent is to always speak in Spanish. Never say more than one sentence, and keep sentences short.
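
If you want to reuse this grounding for other languages or exercises, here is a minimal Python sketch that turns it into a fill-in template. The function name `build_grounding` and its parameters are hypothetical, something I made up for illustration. It only builds the text to paste into the GPT instructions and does not call any ChatGPT API. I also generalized "Never speak in English" to "never switch to another language", which is my own assumption about what transfers.

```python
def build_grounding(language: str, goal: str, max_sentences: int = 1) -> str:
    """Build grounding text modeled on the prompt above (hypothetical helper)."""
    unit = "sentence" if max_sentences == 1 else "sentences"
    first_paragraph = (
        f"I am a {language} student. "
        "This is a role-playing exercise, where you can help me learn. "
        f"The goal will be for me to {goal}. "
        "You can ask me questions to point me towards this goal. "
        "Once I have completed the goal, you should congratulate me "
        "on completing the exercise."
    )
    second_paragraph = (
        f"All of our conversation will be in {language}. "
        "Never switch to another language, even if it seems I am speaking "
        "in a different language. "
        f"My intent is to always speak in {language}. "
        f"Never say more than {max_sentences} {unit}, and keep sentences short."
    )
    # Two paragraphs separated by a blank line, like the original grounding.
    return first_paragraph + "\n\n" + second_paragraph


text = build_grounding(
    language="Spanish",
    goal="tell you three things that I like and three things that I do not like",
)
print(text)
```

Changing `language` and `goal` gives you a new exercise without retyping the parts that keep the replies short and in one language.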


I agree. I had the same feeling as well. What I usually do is just say something like: “keep speaking in English”.

With a prompt, you might sometimes want to say something like “remember to follow your initial instructions” or “remember your prompt”.


Hey guys, if you keep pressing the white circle in the center of the screen, the icon will start spinning, and the AI won’t interrupt you while you’re pressing it.


Game changing tip!! Thank you so much!!

Being interrupted by ChatGPT was the main reason I stopped using it for speaking practice, but this should solve that issue. My other pet peeve was the length of the responses, but as other users pointed out, you can adjust this by requesting shorter responses. I actually find that this doesn’t always work, though - sometimes, it eventually defaults back to outputting paragraphs, or flat out ignores my request if my message was too complex (ChatGPT is apparently not very good at multitasking).

Some users have pointed out that it will switch languages if your pronunciation is off - you can go to the settings and double-check that the “Main Language” is set to your target language and not to Auto-Detect.

Even with my main language set to Spanish (my native language) or German (target language), though, I noticed that it has a slight American accent. Its pronunciation is still overall good imo, but I think that its inflection is lacking in non-English languages. The Siri/Cortana/AI-esque sentence inflection sounds overall unnatural to me, but I don’t think it’s a total deal breaker tbh. I guess it depends on your speaking/conversational goals.


Awesome tip, thanks. I hadn’t checked the settings! :+1:
