Getting around poorly executed word parsing

I’m relatively new to this site, but I’m liking what I’m seeing and looking forward to taking advantage of the resources available here. That being said, I have one major issue that I am having a hard time getting over: the automatic word detection, particularly for Chinese, is really frustrating. It tends to cut words apart in Japanese and to join characters that shouldn’t be joined in Chinese.

I don’t want to make a bunch of cards that are redundant because they are actually two or three words on a single card. I find that the system as it stands actually slows my reading down quite dramatically.

If we could at least manipulate the boundaries of words, so that one could use an external tool such as the Chinese Text Annotation at Mandarin Spot (link below), I would feel much better about this problem.

Any advice on how to best deal with this? At the moment, I feel like the best move for me would be to export the lessons and read them off the site, which doesn’t leave me with much enthusiasm to stay as a paying member…

Thanks for showing me that link.

I hear that LingQ is working on the parsing of Chinese characters by using feedback from users who override the default LingQ parser. You can do that by just “selecting” the characters you know to be words and ignoring the ones you know not to be. What you select becomes blue and you can make LingQs. Of course, if you are still learning, this is a little absurd. Maybe use the Mandarin Spot parser as a guide to making LingQs by overriding the LingQ parser.

That’s what I was planning on doing, but it is impossible to do so thoroughly because of the automatic word recognition.

Here’s an example that illustrates why this is the case, from “LingQ 101 - Getting Started”, lesson “2)选课”.
Look at the last line:
After the comma, there are the two characters “你就”. These are in fact two separate words, but they are recognized as one by the automatic word parser. Unfortunately, even if I only highlight one character, both of them are selected as the “new LingQ” because it believes it is a single word and thus treats it as one.
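To illustrate how a pairing like this can come about (a rough sketch only; I don't know how LingQ's parser actually works): a segmenter that greedily takes the longest match from a word list will merge two characters whenever that pair happens to be in the list. The `segment` function and both word lists below are made up for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation (hypothetical sketch)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Made-up word lists; the first one wrongly contains the pair "你就".
noisy_lexicon = {"你就", "可以", "选课"}
clean_lexicon = {"可以", "选课"}

sentence = "你就可以选课"
print(segment(sentence, noisy_lexicon))
print(segment(sentence, clean_lexicon))
```

With the noisy list, 你 and 就 come out glued together; with the clean list they come out as two separate words. If the real parser behaves anything like this, it would explain why the pair is treated as a single unit no matter which character is highlighted.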

For now, I’m going to deal with it by not adding LingQs at all and just reading the text on the site I linked to above. Luckily the Japanese is not quite as frustrating for me. If anyone has a better solution than this, let me know. I’m looking forward to a day when we have better control over the “barriers” of a new LingQ.

I see what you mean. I do not think the problem is with the parser; it is something to do with the stickiness of the selection process. I guess I would focus on what you can do with LingQ, knowing that as long as the texts and audio have integrity, you will always get ahead, with little inconsistencies being sorted out naturally.

This came up somewhere before, where I noted that as the number of lessons you’ve opened and your known-words total increase, the automatic word parser becomes more accurate. This can be a pain in the beginning if you are a true beginner. Otherwise, it’s not that bothersome.

The way I dealt with this was by going through many lessons, and as I read through them I would X out the false new words so they don’t appear again. At the same time, it should be fairly early on when you come across the characters 你 and 就 highlighted as individual words, at which point you can LingQ or mark them known. If you are unsure about something, just leave it blue and you will come across it again later as you learn more individual words and can decide what to do with it.

Right now I have nearly 8k known words saved on LingQ and each lesson I open is usually quite accurate, with only truly new words appearing blue. Occasionally an odd pairing of a couple of unrelated characters will appear, but by that point I already know them individually, so I just X it out.

So that’s my advice. If you read through many lessons, you’ll be hitting a lot of the X key, and soon enough things will begin to be more accurate.

I’ll do just that. Thanks for the encouraging advice, glad to know it gets less frustrating over time.

@u50623 “That’s IMHO not true. LingQ uses some kind of algorithm with a fixed word list in behind which does not use the saved words by LingQ members.”

I don’t know about that, but in my experience (not opinion) it is true.

The more lessons you go through, the more false new words you will notice and ignore by X-ing them out. At the same time, words you have marked as known will show up less in other character combinations as false new words, as far as I’ve seen.

Also when a good portion of your lessons, or each sentence in the lessons, is white or yellow, it’s clear when blue words are actual words or false words and it is really not a problem.

I have nearly 8k known words saved (and have done plenty of X-ing as I went through lessons), and I rarely have anything other than truly new words appear blue anymore. It has been that way for a while.

I can see this being a real problem for true beginners though, who may not know when something is a word or not. For that, I would also suggest installing a popup dictionary extension on your browser, such as “Zhongwen” on Chrome.

Just hover your cursor over the first character and valid words will be identified, whether they are single- or multiple-character words. I use that in addition to the LingQ tools for identifying words, and that really helps with knowing what to X out and what to keep.

ddw0 mentioned the problem of not being able to reselect individual words or characters that are incorrectly parsed after X-ing them out, because it will automatically reselect the false words. A way around that is to use the Import Vocabulary button on the Vocabulary page, or import a lesson with just those individual words together on a single line. These options may not be available on free accounts though.

Oh, I’ve never paid attention to that. All the lessons I’ve used have been from the LingQ library (lots and lots). So whatever they are by default.

We realize that the word splitting algorithms in Japanese and Chinese are not ideal and will eventually get around to improving this. I personally am not bothered by the small number of words that are not split correctly. I know that others are. We will eventually get to it. Sorry and thanks for your patience.

I’m excited to tell you all I’ve found a great way to speed up the vocab-creating process for Chinese!
Simply paste the lesson into the Chinese Text Annotator at Mandarin Spot (Chinese Text Annotation - MandarinSpot), and check the “For printing” box. The page it generates for you will have a complete list of all the vocabulary terms on the page, which you can then copy and paste directly into the “Import Vocabulary” tool here on LingQ. This makes it much faster and more accurate, which is very useful for beginners like me.

EDIT: Sorry, I shouldn’t have posted that without fully testing it out. The result is less than desirable: all of the information is put into the “term” box, which really doesn’t help much. A CSV file created in Excel never seems to import either.
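One guess about the failed import is encoding or column layout: Excel often saves CSV files in a non-UTF-8 encoding, which can mangle Chinese characters. Below is a minimal sketch of building the file with Python instead. The tab-separated input (word, pinyin, definition) and the two-column term/hint output are my assumptions, not LingQ's documented format or Mandarin Spot's actual output, so check both against the real tools first. The names `vocab_to_csv` and `save_utf8` are made up for this example.

```python
import csv
import io


def vocab_to_csv(raw_text):
    """Turn 'word<TAB>pinyin<TAB>definition' lines into term,hint CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        parts = [p.strip() for p in line.split("\t")]
        if len(parts) < 3 or not parts[0]:
            continue  # skip blank or malformed lines
        term, pinyin, definition = parts[0], parts[1], parts[2]
        # Assumed layout: column 1 = term, column 2 = hint.
        writer.writerow([term, f"{pinyin}: {definition}"])
    return out.getvalue()


def save_utf8(path, text):
    """Write the CSV text as UTF-8 so Chinese characters survive intact."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Made-up sample rows standing in for a pasted vocabulary list.
sample = "你\tnǐ\tyou\n就\tjiù\tthen, right away\n"
print(vocab_to_csv(sample))
```

Calling `save_utf8("vocab.csv", vocab_to_csv(sample))` would then produce a file to try in the Import Vocabulary tool. Keeping the term alone in the first column is the point here, since the problem above was everything landing in the “term” box.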

I’m not sure how often the Chinese word splitting fails to separate characters that should be split, but in Japanese I don’t find it to be that frequent. I would suggest putting the definitions for both words in the hint box for the time being. As for words that get broken up into smaller parts, that shouldn’t be that big of an issue, considering that words can be joined into a phrase.

@ddw0

You also would lose the context phrase.

I would check out if you need a spoon feeding of Chinese. (I am enjoying it.) I do not work for this site, BTW :) I actually think it is the answer to a lot of the frustrations that LingQers have with starting a language from scratch. Memrise have made flashcarding fun. It would be great if LingQ had a component like this, or at least links to super-beginner content in Memrise.