For anyone learning Japanese, Chinese or Korean: the small characters that normally help us read a text written in hieroglyphs are a huge nuisance when importing to LingQ!
When I was importing texts and audio from Librivox (recommended!), I noticed that the texts (taken from the Aozora collection) are full of furigana, so once imported they were impossible to use in LingQ… Even though the audio was excellent and the text was all there, in front of my eyes, it was unusable… For a second I even thought I could erase the furigana manually. But no, don’t try, it’s too much.
So I went googling and found out that these characters are called “ruby characters” and that they have special markup in HTML. When I searched for how to remove them, I was quite discouraged by the heavy programmers’ stuff: advice on how to get rid of these characters aimed at Python programmers (well, it was complicated for me, but maybe it’s easy for you; see Digital Japanese Literature: Aozora Bunko – Arts and Humanities Research Computing)
However, I had to look for another solution, one for mere mortals. So I thought that if I know how these characters are marked, I can get rid of them in a simpler fashion… well, using just MS Word!
And that’s how you do it:
a) go to the Aozora text (or any other online text you need that has furigana)
b) while in the browser, press Ctrl+U (in Chrome; for other browsers, see How to View the HTML Source Code of a Web Page) and…
c) you’ll see the text with all the tags! (or whatever they are called; see the attached print screens #1 and #2)
d) copy this text
e) paste it to MsWord
f) in Word go to “find and replace” (see print screen #3)
g) fill the “Find what” field with the stuff you want to get rid of* and leave the “Replace with” field empty; then hit the “Replace all” button and see the magic happen!
(*delete things one by one! First get rid of “rb”, then “rt”, then “rp”, then “ruby”, then “<”, then “>” etc… IMPORTANT! Don’t touch the round brackets!)
h) now, the real magic! You didn’t touch the round brackets, right?
- find and replace all left round brackets with “a”
- find and replace all the right round brackets with “b”
- in the “find and replace” window tick the “use wildcards” box
- in the “Find what” field write “a*b” and leave the “Replace with” field empty, as before
- press the “replace all”…
Now copy and paste the clean text into LingQ.
Don’t delete the round brackets!
Substitute the left round bracket with “a” and the right one with “b”!
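
For anyone who does want to try the programmers' route after all, here is a rough Python sketch of the same cleanup. It is not the method from the article linked above, just my own minimal attempt using the standard "re" module. The sample line is made up, and I am assuming the page marks furigana as ruby/rb/rp/rt tags, with the brackets inside rp and the reading inside rt. Real pages may differ, so check the page source first.

```python
import re

# Made-up sample line; the tag layout (ruby/rb/rp/rt) is an assumption
# about how the page marks furigana, so compare it with the real source.
sample = (
    "<ruby><rb>吾輩</rb><rp>（</rp><rt>わがはい</rt><rp>）</rp></ruby>は"
    "<ruby><rb>猫</rb><rp>（</rp><rt>ねこ</rt><rp>）</rp></ruby>である。"
)


def strip_furigana(html: str) -> str:
    """Remove ruby readings and keep only the base text."""
    # 1. Delete the bracket fallbacks: <rp>...</rp>
    text = re.sub(r"<rp\b[^>]*>.*?</rp>", "", html, flags=re.DOTALL)
    # 2. Delete the readings themselves: <rt>...</rt>
    text = re.sub(r"<rt\b[^>]*>.*?</rt>", "", text, flags=re.DOTALL)
    # 3. Delete the leftover <ruby> and <rb> tags, keeping what is inside
    text = re.sub(r"</?(?:ruby|rb)\b[^>]*>", "", text)
    return text


print(strip_furigana(sample))
```

This only touches the ruby-related tags. Any other HTML on the page (headers, line breaks and so on) stays in, so you would still need to copy the visible text out or remove the remaining tags separately.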