Using ChatGPT to write Mini-Stories

ChatGPT is really good with language, but I don’t trust it with factual information :slightly_smiling_face:
What I wanted to express was that these neural networks are a black box, at least to me. But I believe two things could help in understanding what is going on:

  1. a deep understanding of the underlying statistical / mathematical theory that forms the basis of machine learning
  2. exact information about the training data
    I can’t claim no. 1, unfortunately, but I have been able to find a table showing the exact numbers of words / characters that formed the training data for GPT3, sorted by language: gpt-3/dataset_statistics/languages_by_word_count.csv at master · openai/gpt-3 · GitHub
    I found this by looking through the original paper introducing GPT3: https://arxiv.org/pdf/2005.14165.pdf (p. 14).

I think this table can be a good guide for estimating performance in various languages. It also shows clearly why GPT3 is an English native speaker - 93% of all words in the training data were English. This can also explain why its performance in languages like Chinese is mediocre: only a minuscule amount of the training data was in Chinese. Since Germanic languages seem to be well represented, transfer or abstraction of underlying principles seems to be a reasonable explanation for its performance in languages absent from the training data (I couldn’t find Low Saxon in the list).
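As a rough sketch of how one might turn the word counts from that table into per-language shares, here is a small Python snippet. The numbers below are placeholders for illustration, not the actual CSV contents; with the real file you would read the word-count column instead (and the column names there may differ):

```python
# Placeholder word counts per language (NOT the real GPT-3 figures;
# substitute the values from languages_by_word_count.csv).
word_counts = {
    "English": 93_000,
    "German": 1_500,
    "French": 1_800,
    "Chinese": 100,
}

# Each language's share of the total word count.
total = sum(word_counts.values())
shares = {lang: n / total for lang, n in word_counts.items()}

# Print languages sorted from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {share:.1%}")
```

With the actual CSV, the English share should come out around the 93% mentioned above, and the long tail of small shares makes the weak coverage of languages like Chinese immediately visible.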

As for newer models like GPT4, I suspect the ratios are similar to GPT3; after all, the Internet is predominantly English, and they most likely have a hard time finding good training data in other languages. That’s just my personal speculation.

Agreed.