How to remove furigana when importing .epub files

storercd · April 18, 2022, 8:01pm

I had a similar issue with some ruby text in a book where a single word carried multiple furigana. Throw the above formula into a regex tester and you’ll see it. Here is a clip of text to test it on (the text below uses a different style, with rb and rt tags, but use your imagination to see how the formula would adapt):

そういいたいのに、<ruby><rb>言</rb><rt>こと</rt><rb>葉</rb><rt>ば</rt></ruby>がのどにつかえたまま

So the ruby tag opens, two kanji are defined along with their readings, and the ruby tag closes. That regex grabs the first kanji and discards the other(s), so you lose data. I couldn’t solve this with a single regex (I’m just not that good at regex), but I did limp along by stripping the ruby tags entirely, then processing the rb/rt pairs inside them independently. The formulas run one after the other and look like this:

regex: <ruby>(.*?)</ruby>
replace with: \1

regex: (<rb>(.*?)</rb><rt>.*?</rt>)
replace with: \2

Worked like a charm!
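
If you would rather script it than paste formulas into an editor, here is a rough Python sketch of the same two passes. A few things in it are my assumptions, not something from the book itself: the function name strip_furigana is made up, the sample markup is the rb/rt style described above, and the patterns expect bare tags with no attributes and no rp tags.

```python
import re


def strip_furigana(html: str) -> str:
    """Remove furigana in two passes (hypothetical helper, rb/rt-style markup assumed)."""
    # Pass 1: unwrap each <ruby>...</ruby> element but keep everything inside it.
    unwrapped = re.sub(r"<ruby>(.*?)</ruby>", r"\1", html, flags=re.DOTALL)
    # Pass 2: for every <rb>base</rb><rt>reading</rt> pair, keep only the base text.
    return re.sub(r"<rb>(.*?)</rb><rt>.*?</rt>", r"\1", unwrapped, flags=re.DOTALL)


# Assumed sample markup: one ruby element holding two kanji, each with its own reading.
sample = (
    "そういいたいのに、"
    "<ruby><rb>言</rb><rt>こと</rt><rb>葉</rb><rt>ば</rt></ruby>"
    "がのどにつかえたまま"
)

print(strip_furigana(sample))  # そういいたいのに、言葉がのどにつかえたまま
```

Because the second pass works on each rb/rt pair separately, both kanji survive instead of only the first one. If your files put attributes on the tags, or use rp fallbacks, the patterns would need loosening.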