Percentage of new words

Purely out of curiosity, I have been trying to figure out how the percentage of unknown new words is calculated in a lesson but am having difficulty doing so.

For example, in my current lesson…

  • number of words: 2851
  • known words: 591 (obviously words are repeated)
  • unknown new words: 111 (10.9%)

??? What am I not seeing?

I just ran a test, because I have wondered about this myself. I imported the text given at the bottom of this post into LingQ and also used

http://textalyser.net/
http://www.online-utility.org/text/analyzer.jsp

to analyse the text. According to the online text analyser, there are 127 words and 99 different word forms. According to LingQ, there are 125 words (I don’t know why there is a two-word difference) and 18 new words, which make up 18.6% of the document. So 18.6% is the percentage of word forms that are new, not the percentage of all words in the text.

In your example, the text must have 2851 words but only approximately 1020 word forms (based on the fact that 111 new words are 10.9% of the document). This means that a large number of words must appear many times. I did not know how reasonable this was, so I ran a few more large documents through the online text analyser and got similar results.
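
The back-of-the-envelope estimate above can be sketched in a few lines of Python. This assumes the percentage is simply new word forms divided by all distinct word forms, which is my reading of the numbers rather than anything LingQ has confirmed; the function name is made up for illustration.

```python
def implied_word_forms(new_forms: int, percent_new: float) -> float:
    """Distinct word forms implied by a new-form count and its percentage.

    Assumption: percent_new = 100 * new_forms / distinct_word_forms.
    """
    return new_forms / (percent_new / 100.0)


# The lesson from the original question: 111 new words shown as 10.9%.
print(round(implied_word_forms(111, 10.9)))  # about 1018 word forms

# The short German test text: 18 new words shown as 18.6%.
print(round(implied_word_forms(18, 18.6)))   # about 97 word forms
```

For the short test text this gives roughly 97 word forms, close to the 99 reported by the online analyser, so the two tools probably split the text into words slightly differently.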


Here is the text:

"Seit Tagen, seit Jahren wird über die Agenda 2010 diskutiert. Eine Volkspartei ist darüber fast eingegangen, eine ganz neue Partei hat sich deswegen gebildet. Befürworter sagen: Viele, die damals von Sozialhilfe lebten, arbeiten jetzt immerhin in Niedriglohnjobs. Die Gegner erwidern: Aber nun gibt es eine Kaste von Geringverdienern. Gerade bei den Urhebern der Reform, den Sozialdemokraten, ist keine Versöhnung zu sehen. Mindestlohn hin oder her – man ist für Hartz IV oder dagegen. Wer soll diese Ansichten versöhnen?

Vielleicht ist das zehnjährige Jubiläum der Verkündung der Agenda einfach der falsche Anlass, um zu diskutieren, was sich verändert hat. Schließlich ist die Frage nicht nur eine volkswirtschaftliche. Schaut man sich die noch immer erbittert verfeindeten Lager an, muss man sich auch fragen, in welcher Stimmung diese Gesetze entstanden."

@mikebooks - As ColinPhilipJohnstone touched on, the word count for the text is the total number of words, not the number of unique words. The Known, New and LingQed counts, however, only count unique words, so two instances of the same new word count as only 1 new word in the text. Let me know if you still have questions about this!
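
To illustrate the difference between the total word count and the unique-word counts, here is a minimal sketch in Python. It is not LingQ's actual code: the tokenizer (runs of letters, lower-cased), the `lesson_stats` name and the `known` set are all assumptions made for the example.

```python
import re


def lesson_stats(text: str, known: set[str]) -> dict:
    """Count total words, unique word forms, and new (not yet known) forms."""
    # Assumed tokenizer: runs of letters, lower-cased. A real one may differ.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    unique = set(tokens)
    new = unique - known
    # Percentage of unique word forms that are new, not of all words.
    percent_new = round(100 * len(new) / len(unique), 1) if unique else 0.0
    return {
        "total_words": len(tokens),
        "unique_words": len(unique),
        "new_words": len(new),
        "percent_new": percent_new,
    }


stats = lesson_stats("The cat and the dog and the bird.", {"the", "and"})
print(stats)
# {'total_words': 8, 'unique_words': 5, 'new_words': 3, 'percent_new': 60.0}
```

Here "the" appears three times and "and" twice, so the text has 8 words but only 5 unique ones, and the 3 new words are 60% of the unique words rather than 37.5% of the total.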

Remember there are also “non-words”.

For some reason, when I start a new lesson, there are usually words that I swear I have never met before that are not in blue, ie they are considered either “known” or “non”. You need to go through and, if they look interesting, mark them as unknown so you can LingQ them. If you choose to ignore them, I don’t know how they affect the percentage of new words.

@skyblueteapot “For some reason, when I start a new lesson, there are usually words that I swear I have never met before that are not in blue, ie they are considered either ‘known’ or ‘non’.”

The same for me. Not often, but it does happen.

I suspect you have met them before and just forgotten about them, or just not noticed them.