Statistics for Dummies (seemingly inconsistent New Word-%)

So, I believe it has been talked about before, but I am still not sure I exactly understand this:

Very commonly (or universally?), the percentage of New Words in a course will be shown as much higher than in any individual lesson. For example, I am currently reading a long ebook in Spanish (105 lessons, imported before the expanded word limit). The individual lessons average somewhere between 15% and 25% New Words, with the highest individual lesson being 30%. Yet, the entire course says it contains 50% New Words.

What causes this discrepancy? Is it a fluke of statistics that’s not obvious to me? Or is one counting total words vs. unique words? If so, is there a particular reason to change the metric between courses and lessons?

2 Likes

My assumption is they are talking about unique new words and unique words (but I could be wrong).

(please double check my math)
Let’s say I have:

Lesson 1 - 100 unique words, 20 new (unique) words. 20% new words
Lesson 2 - 100 unique words, 20 new (unique) words. 20% new words

Course - let’s say, in the extreme case, that the 80 known words in each lesson are the same words across both lessons, and the new (unique) words in each lesson are all different. Then you’d have 120 unique words, 40 of them new: 33% new words across the course. So it’s possible for the percentage across the course to be higher than the per-lesson average. Likely there is some overlap among the new words too, which would pull the number a little lower. Not sure if it makes sense for it to be as high as 50% across the course, but maybe? Either way, you could definitely have a higher percentage across the entire course (assuming my math is correct).
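
The arithmetic above can be checked with a minimal Python sketch. The integer word IDs, the set names, and the `pct_new` helper are all made up for illustration, and the sketch assumes the percentage is computed over unique words; how the platform actually computes it is not confirmed anywhere in this thread.

```python
# Hypothetical word IDs: integers stand in for actual words.
known = set(range(80))          # 80 known words shared by both lessons
new_1 = set(range(100, 120))    # 20 new words that appear only in lesson 1
new_2 = set(range(200, 220))    # 20 new words that appear only in lesson 2

lessons = [known | new_1, known | new_2]
all_new = new_1 | new_2


def pct_new(words, new_words):
    """Percentage of the unique words in `words` that are new."""
    return 100 * len(words & new_words) / len(words)


per_lesson = [pct_new(lesson, all_new) for lesson in lessons]
course_words = set().union(*lessons)   # 80 shared + 20 + 20 = 120 unique words
course = pct_new(course_words, all_new)

print(per_lesson)        # [20.0, 20.0]
print(round(course, 1))  # 33.3
```

The shared known words collapse into one copy in the course-wide set, while the new words do not, which is what pushes the course figure above the per-lesson figures.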

Again, though, I’m not entirely sure how THEY calculate and whether there is a difference between the lessons vs. course.

3 Likes

Ah! I see! So it is a non-obvious (to me) statistical phenomenon, after all. Yes, true, it is much more likely for the known words to be repeated across lessons than the new words. Now it makes sense to me. Repeated known words count for every individual lesson anew, but for the whole course they count only once.

Let me create a hypothetical example to help myself grasp this. Let’s say bold are “new words”.

  • Lesson 1: “The **man** **gave** me a **book**.”
    3 known words, 3 new words - 50%
  • Lesson 2: “The **woman** **gifted** me a **pen**.”
    3 known words, 3 new words - 50%
  • Lesson 3: “The **teacher** **handed** me a **pencil**.”
    3 known words, 3 new words - 50%
  • Lesson 4: “The **boy** **sent** me a **message**.”
    3 known words, 3 new words - 50%

Overall:

  • 3 known words: The, me, a
  • 12 unknown words: man, gave, book, woman, gifted, pen, teacher, handed, pencil, boy, sent, message
    80% new words
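
The same four sentences can be run through a short Python sketch that compares the two counting methods asked about in the first post: unique words versus total (running) words. The tokenizer and helper names here are hypothetical, and nothing in this thread confirms which method the platform really uses for courses.

```python
import re

lessons = [
    "The man gave me a book.",
    "The woman gifted me a pen.",
    "The teacher handed me a pencil.",
    "The boy sent me a message.",
]
known = {"the", "me", "a"}


def tokens(text):
    """Lowercased words in reading order, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def pct_new(words):
    """Percentage of the given words (set or list) that are not known."""
    return 100 * sum(1 for w in words if w not in known) / len(words)


# Per lesson, counting unique words: 3 new out of 6 each time.
per_lesson = [pct_new(set(tokens(s))) for s in lessons]

# Whole course, counting unique words: "the", "me", "a" collapse to 3 entries.
course_unique = pct_new(set(w for s in lessons for w in tokens(s)))

# Whole course, counting every occurrence: repeats of known words still count.
course_total = pct_new([w for s in lessons for w in tokens(s)])

print(per_lesson)     # [50.0, 50.0, 50.0, 50.0]
print(course_unique)  # 80.0
print(course_total)   # 50.0
```

Counting every occurrence keeps the course figure at the per-lesson level, while counting unique words raises it to 80%, so the jump comes from deduplicating the repeated known words rather than from any change in the lessons themselves.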

Alright, well, that does clear things up. Thank you!

3 Likes

Hence why I’m on a language learning platform in my spare time, not a math platform 😛 😂

2 Likes

Well… I didn’t know for sure myself until I played around with the numbers, like you did. I am mathematically inclined, but I never took statistics, so it wasn’t inherently obvious to me either.

2 Likes