French tokenizer issue: apostrophes split words and break existing LingQs

I’m experiencing what looks like a parser regression for French apostrophes, and it’s affecting existing content as well.

Previously, words like “c’est” were recognized and selectable as a single unit. Now, in the same texts, LingQ consistently splits them into “c” + “est” when I create a LingQ.

More importantly, this change also affects existing vocabulary:

  • I already have LingQs like “C’est” saved

  • These LingQs no longer match or get recognized in texts

  • They can’t be recreated in the same form anymore
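
To illustrate what I think is happening, here is a minimal sketch in Python. I have no knowledge of LingQ's actual tokenizer, so the function names (`tokenize_keep_elisions`, `tokenize_split_elisions`, `matches_saved`) and the regexes are purely hypothetical. It only shows why a saved term containing an apostrophe can no longer match once apostrophes are treated as word separators.

```python
import re

# Hypothetical tokenizers for illustration only; NOT LingQ's real code.

def tokenize_keep_elisions(text):
    """Treat an apostrophe (U+0027 or U+2019) between letters as part of the word."""
    return re.findall(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*", text)

def tokenize_split_elisions(text):
    """Treat apostrophes as separators, so only runs of letters become tokens."""
    return re.findall(r"[^\W\d_]+", text)

def matches_saved(tokens, saved_terms):
    """Return the tokens that match a saved term, ignoring case."""
    saved = {term.lower() for term in saved_terms}
    return [tok for tok in tokens if tok.lower() in saved]

text = "C\u2019est l\u2019autoroute."
saved_terms = ["C\u2019est"]  # a term saved before the assumed change

old_tokens = tokenize_keep_elisions(text)
new_tokens = tokenize_split_elisions(text)

print(old_tokens, matches_saved(old_tokens, saved_terms))
print(new_tokens, matches_saved(new_tokens, saved_terms))
```

Under these assumptions, the first tokenizer still finds the saved term, while the second produces only the fragments and finds nothing, which matches what I am seeing.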

Additionally, there are selection issues with some elided forms:

  • For example, “l’autoroute” is sometimes not selectable at all

  • The text itself is correct and unchanged

Important details:

  • This happens across multiple devices

  • The apostrophes in the text are correct

  • The same texts were parsed differently before

  • This strongly suggests a server-side/tokenizer change, not a local or formatting issue
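
For anyone who wants to rule out the apostrophe character itself, this is roughly how it can be checked. It is a small sketch using only Python's standard library; the helper name `apostrophe_like_chars` and the set of candidate characters are my own choices, not anything taken from LingQ.

```python
import unicodedata

# Characters that commonly appear where an apostrophe is expected.
# This candidate set is an assumption and is not exhaustive.
CANDIDATES = {"\u0027", "\u2019", "\u2018", "\u02bc", "\u0060", "\u00b4"}

def apostrophe_like_chars(text):
    """List (index, code point, Unicode name) for each apostrophe-like character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in CANDIDATES
    ]

print(apostrophe_like_chars("l\u2019autoroute"))
print(apostrophe_like_chars("c'est"))
```

If the same code point shows up in texts that used to parse as one word and now do not, the character is unlikely to be the cause.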

I understand that function words may be split intentionally, but the current behavior:

  • breaks matching with existing LingQs

  • creates unselectable words in some cases

  • causes inconsistency within the same language

Thanks for reporting, we are looking into this.