Umlauts in CSV Import

spatterson · November 18, 2012, 4:42am

I’m attempting to import some commonly used phrases.

I tried to import a file with:

Ich bin auf Geschäftsreise,I’m on business,empty

but it’s importing as:

Ich bin auf GeschaÌˆftsreise

I’ve put the file up on my github site:

I created the file in Linux and the file appears to be in the proper format:
[shaun@home phrases] file test.csv
test.csv: UTF-8 Unicode text

[shaun@home phrases] cat test.csv
Ich bin auf Geschäftsreise,I’m on business,empty

Does the import function support utf-8?

spatterson · November 18, 2012, 5:21am

Well… the plot thickens.

This file works

This file doesn’t

I presumed this would work

The file this_works.csv is UTF-8 with a BOM. this_should.csv is also UTF-8 with a BOM. this_doesnt.csv is UTF-8 without a BOM.

this_should.csv imports “testä” as “testa”.

[shaun@home phrases] cat this_works.csv
testä,blah,blah,[shaun@home phrases] file this_works.csv
this_works.csv: UTF-8 Unicode (with BOM) text, with no line terminators
[shaun@home phrases] cat this_doesnt.csv
testä,blah,blah,
[shaun@home phrases] file this_doesnt.csv
this_doesnt.csv: UTF-8 Unicode text, with CRLF line terminators
[shaun@home phrases] cp this_doesnt.csv this_should.csv
[shaun@home phrases] vi this_should.csv
[shaun@home phrases] file this_should.csv
this_should.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators
[shaun@home phrases] cat this_should.csv
testä,blah,blah,

nobody · November 18, 2012, 10:15am

The import format is:
"term","hint","sentence"
"term","hint","sentence"
…
The characters ", <, >, & should be avoided in all 3 fields.
Hint and sentence should be less than 250 characters.
The file should be encoded in UTF-8 (important!), without a BOM.
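
For anyone generating such a file from a script, here is a rough Python sketch of the rules above. It is not LingQ code: the helper names check_row and to_import_bytes are made up for illustration, the "\n" line ending is an assumption (the rules above do not specify one), and it assumes straight double quotes delimit the fields.

```python
import csv
import io

FORBIDDEN = set('"<>&')  # characters the rules above say to avoid
MAX_LEN = 250            # hint and sentence must be shorter than this


def check_row(term, hint, sentence):
    """Return a list of problems for one row; an empty list means it passes."""
    problems = []
    for name, value in (("term", term), ("hint", hint), ("sentence", sentence)):
        bad = sorted(FORBIDDEN & set(value))
        if bad:
            problems.append(f"{name} contains {''.join(bad)}")
    for name, value in (("hint", hint), ("sentence", sentence)):
        if len(value) >= MAX_LEN:
            problems.append(f"{name} has {len(value)} characters")
    return problems


def to_import_bytes(rows):
    """Serialize rows as fully quoted CSV, encoded as UTF-8 without a BOM."""
    buf = io.StringIO(newline="")
    # lineterminator="\n" is an assumption, not something the rules state
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        problems = check_row(*row)
        if problems:
            raise ValueError("; ".join(problems))
        writer.writerow(row)
    # The "utf-8" codec never writes a BOM; "utf-8-sig" would add one.
    return buf.getvalue().encode("utf-8")
```

Writing the returned bytes with open(path, "wb") avoids any platform-specific newline or encoding translation.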

If I use this file, UTF-8 encoded:

"中國","zhōng guó = China","你去過中國幾次？ = Nǐ qù guò Zhōngguó jǐ cì? = How many times have you been in China?"
"Ich bin auf Geschäftsreise","I’m on business","empty"
two new terms are successfully imported.

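If I analyze your files, the “ä” is stored as the UTF-8 bytes “61 CC 88”, i.e. as a base letter followed by a combining diacritical mark: a (UTF-8: “61” = U+0061) + ¨ (UTF-8: “CC 88” = U+0308) → ä. This is valid UTF-8, but it is the decomposed form of the character, which is the wrong one for this import. See http://www.wordiq.com/definition/Combining_diacritical_mark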

LingQ does not allow “Combining diacritical marks”!!

Correct would be “ä” as a single precomposed character (UTF-8: “C3 A4”, U+00E4). I use this in my upload example above, and it works.
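
If retyping the umlauts is not practical, the two forms can be told apart and converted with Python's standard unicodedata module. This is only a sketch: the function names are made up for illustration, and the last line is a guess at how the garbled text arose (reading the bytes as Windows-1252), not something known about the importer.

```python
import unicodedata


def to_precomposed(text):
    """Return text in NFC, composing base letter + combining mark where possible."""
    return unicodedata.normalize("NFC", text)


def has_combining_marks(text):
    """True if any code point in text is a combining character."""
    return any(unicodedata.combining(ch) for ch in text)


decomposed = "a\u0308"                 # a + COMBINING DIAERESIS (U+0308)
composed = to_precomposed(decomposed)  # single code point U+00E4

print(decomposed.encode("utf-8").hex(" ").upper())  # 61 CC 88
print(composed.encode("utf-8").hex(" ").upper())    # C3 A4
print(has_combining_marks(decomposed), has_combining_marks(composed))  # True False

# Guess only: decoding the decomposed bytes as Windows-1252 gives the
# same garbled sequence that showed up after the import.
print(decomposed.encode("utf-8").decode("cp1252"))  # aÌˆ
```

Running to_precomposed over the whole file contents before saving would convert every such pair at once. Note that bytes.hex(" ") with a separator needs Python 3.8 or later.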

See also this article: Character codes and encoding - Red Leopard

spatterson · November 18, 2012, 12:53pm

‘LingQ does not allow “Combining diacritical marks”!!’

Okay, that makes sense. As you discovered, I was using U+0308. I knew I was doing something wrong last night when I was able to import a file I created in Windows. I bet I typed ALT+0228 to create the umlaut without even thinking.

Sorry for wasting your time.

nobody · November 18, 2012, 6:10pm

No problem. I was just wondering why it did not work in your case.