Identical words not being treated as known?

kianjones9 · April 19, 2022, 8:16pm

I just uploaded an ebook and was shocked to see that my known words and LingQs in this text were almost 0. When I started reading it, I noticed many very common words that I have already marked as known showing up in blue. I am not sure why this is. Could it be because the font is different, or because the file was a PDF? Maybe because I tagged it with an accent/dialect? Notably, I am not seeing any user translations on the words either. Any advice?

kianjones9 · April 19, 2022, 9:06pm

Update: I ran Optical Character Recognition (OCR) on one page of the PDF, uploaded the result, and it worked properly. I still wonder why it doesn’t work when the upload is the PDF itself.

rhess · April 19, 2022, 10:20pm

It could be that the text in the raw PDF contains hidden characters. I’m not sure exactly what causes it, but it happens sometimes, and I’m not familiar with how to fix it (if that is even the cause in the first place).
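
For anyone who wants to test the hidden-character theory, here is a minimal Python sketch. It assumes you have already copied some text out of the PDF into a string. The function names `describe_hidden` and `clean` are made up for this example and have nothing to do with LingQ's importer. It flags invisible format characters and compatibility forms (for example, Arabic presentation forms, which look identical on screen but use different code points than normally typed letters). Whether that is what happened with this particular PDF is only a guess.

```python
import unicodedata

# Zero-width non-joiner is meaningful in some languages (e.g. Persian),
# so this sketch keeps it by default. Adjust for your language.
DEFAULT_KEEP = frozenset({"\u200c"})


def describe_hidden(text):
    """List characters that are invisible or are compatibility forms.

    Returns (index, code point, Unicode name) tuples.
    """
    findings = []
    for i, ch in enumerate(text):
        is_format = unicodedata.category(ch) == "Cf"  # direction marks, joiners, etc.
        is_compat = unicodedata.normalize("NFKC", ch) != ch  # e.g. presentation forms
        if is_format or is_compat:
            findings.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return findings


def clean(text, keep=DEFAULT_KEEP):
    """Fold compatibility forms to standard letters and drop format characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) != "Cf" or ch in keep
    )


if __name__ == "__main__":
    sample = "\ufedb\ufe98\ufe8e\ufe8f"  # an Arabic word stored as presentation forms
    for finding in describe_hidden(sample):
        print(finding)
    print(clean(sample) == "\u0643\u062a\u0627\u0628")
```

If `describe_hidden` reports a lot of hits on words that look perfectly ordinary, that would be consistent with the importer seeing them as different words from the ones you already marked as known.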

zoran · April 20, 2022, 3:25am

Sounds like some sort of encoding issue with that specific PDF file. Most imported PDF files work just fine.

kianjones9 · April 20, 2022, 6:17pm

My final solution was as follows:

1. Convert the PDF to PNGs (https://pdf2png.com/).
2. Find effective OCR software. This can be tricky, especially depending on the language. I used Arabic OCR (Online & Free) from Convertio, but I would recommend uploading the same page to several tools and comparing the results until you find one that works for you.
3. Only necessary if your software yields a single page at a time: collate the txt files for the pages into a single document.
4. Import the resulting ebook into LingQ.

Edit: I converted the PDF to PNGs first to prevent the OCR software from just scraping the PDF's existing readable text, since that led to encoding issues.
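
For the collating step, a small Python sketch like the one below can save some copy-pasting. It assumes the OCR tool gave you one UTF-8 `.txt` file per page, all in one folder, with the page number somewhere in the file name. The names `collate`, `natural_key`, `ocr_pages` and `book.txt` are placeholders for this example. It sorts by the numbers in the file names, so that page 10 comes after page 2 rather than after page 1.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that compares digit runs as numbers ("page10" after "page2")."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", path.name)
    ]


def collate(page_dir, output_file):
    """Join every .txt file in page_dir into output_file, in page order.

    Returns the file names in the order they were written.
    """
    out = Path(output_file).resolve()
    pages = sorted(
        # Skip the output file itself in case it lives in the same folder.
        (p for p in Path(page_dir).glob("*.txt") if p.resolve() != out),
        key=natural_key,
    )
    text = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in pages)
    out.write_text(text + "\n", encoding="utf-8")
    return [p.name for p in pages]


if __name__ == "__main__":
    # Hypothetical folder and output names; change them to match your files.
    print(collate("ocr_pages", "book.txt"))
```

Check the printed order before importing. If your OCR tool names files in some other way, the sort key may need adjusting.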