Issue with Characters in Exported Excel file

Grism · April 16, 2020, 8:37am

Howdy - I am having issue with the exported CSV file where it seems characters with umlaut are not represented properly. For example, the German word “Erklärung” comes in CSV as “erklÃ¤rung”. Anyone else have this issue or know what to do? Regards, Graham

zoran · April 17, 2020, 4:50am

After you download a CSV file from LingQ, close it and then create new regular EXCEL file. Open that new blank excel file and then:
Click on the DATA tab on the top and choose FROM TEXT option. Popup will appear and you should find your CSV file downloaded from LingQ, select it and click on IMPORT. On the next popup choose DELIMITED, click on NEXT and check TAB and COMMA options. Click on NEXT and then FINISH.
That should be it. You can then save the file again in CSV format if you want to import it somewhere. Hope this helps!

Grism · April 17, 2020, 6:51am

Thanks although that did not work, but that could be because I have Office 365 and I might need to do some additional fiddle to make it work. I am a bit skeptical though - seems like language format issue and reminds me of wrangling with Excel’s date formats. Gra

herbm · April 17, 2020, 4:47pm

Ok, you are in my domain now.

You may already know this is likely due to a character set misidentification.

TLDR:
Just knowing the general ideas below about encodings and characters sets can give you enough info to import the file into your application correctly or to find and use an editor like NotePad++ or another utility that can convert it.

The problem is that “text” files don’t all have a definitive indicator what “kind” of text they contain, even though most modener computer systems and programs like Excel can handle Unicode (extended) characters just fine – IF they know what the file contains.

THE most likely actual problem today is that the file is that the file is a “UTF-8 without BOM”.

BOM == byte order mark I believe but it also signals this is NOT a simple text file and programs like Excel may need that signal to figure it out. (Notepad++, a free download, can easily deal with this in almost all cases and even allows you to change the encoding and save the file back to disk to make it easier for other programs to process.)

UTF is just a set of encodings for Unicode – and there are a pile of them UTF-8-bom, UTF-16 etc.
8 means that the data is encoded in 8 bits (or a byte) just like an older ASCII (American) standard or other language file was encoding historically *.
ASCII files are legal UTF-8 without a BOM but not necessarily the other way around.
Without the BOM, a program may not realize that non-ASCII characters * are in the file

You probably need to either:

Convert the file to a UTF form with a BOM (e.g., UTF-8) – see NotePad++ above, though you could do this from a PowerShell (or Linux) command prompt without having to learn how “to program”.
Or you need to tell Excel when opening the file what to expect, not sure it has this setting but Google is our friend.

PowerShell like this should do it:

Get-Content -path “OldFile.txt” -Encoding UTF8 | Out-File -Encoding UTF8 -FilePath “NewFile.txt”

AND it cannot do any harm since the -NoOverwrite prevents overwriting an existing file.

Also not this may look like we are converting from something to the same thing but turns out that reading UTF8 works whether there is a BOM or not, and even if the file is ASCII while writing UTF8 with the Out-File does add the BOM.

For most complicated choices you’ll need something like NotePad++ or a purpose build file converter unless your app (e.g. Excel) allows you to specify the file type when you open it.

It is also possible that the file is encoded with an ASCII-like encoding that is using an older encoding designed for non-English languages (Western Europe, Japan, ISO-8859-1, or 8859-X etc.). The files probably will need a specific file converter or something like NotePad++ which has this in the Encoding menu as “Character Sets”.

This latter is less likely but since the file is for German (or might have originated on any non-English operating system) it isn’t completely unlikely.

Note: Before I understood the above, I could frequently just putz around and get the characters to display and even today, I don’t always have to think about all the details.

BOTTOM LINE:
Just knowing the general ideas above can give you enough info to import the file into your app correctly or to find and use an editor or utility that can convert it.

Grism · April 18, 2020, 7:48am

Thanks I may in future give some of those a try to develop a more automated method, however for the time being I have a quick work around using Excel’s find and replace function. I only need to do this 4 times in a document to get the 3 umlauts (Ä,Ö,Ü) and one ligature (ß). For example, ß is exported as ÃŸ, and finding and replacing corrects it. This does not address Upper or Lower Case for the start of a word, but that is not a big thing in LingQ anyway, especially if you already know, for example, what is a noun in German. Gra

herbm · April 18, 2020, 8:10am

The expedient way is fine if that’s all you have.

Typically with French (or languages with even more non-English characters) there is no practical way for me to do it as a searching and replace even with full progamming and regular expression tools

I would end up having to write a full blown character converter – and other people have already done those much better than my time would allow.

Though I have been meaning to write a simple PowerShell program filter that would look at files and tell me if any non-ASCII characters were present.