Creating LingQ lessons from subtitle files

Does anyone know how to do this, in particular for Japanese?

My husband has explained to me that subtitles for films, at least “soft” ones, are held in text files which ought to be able to be viewed in a text editor. I’ve browsed through my video collection and found a few films with .idx and .sub files attached. But I can’t open the sub files. My husband suspects that for Japanese they are stored in some image format, since there doesn’t seem to be a universally-recognised international standard for Japanese text. Is he right? Is there some clever piece of freeware that could convert Japanese subtitles in .sub files into text files I can copy and paste from?

What about .srt files? So far I have found these subtitles in text format, thanks to a jolly helpful Russian guy: Learn Japanese by watching movies with Japanese subtitles

Me too, I am curious to know how to get text files from DVDs of Chinese films.

Actually, I type the characters in manually while reading the subtitles in order to make text files.

I think the answer is some subtitle ripper / editor. I’ve been studying a very technical language-learning forum (forum.koohii.com) and I expect the answer is in there, I just don’t understand it :-o

Maybe Vobsub?

SRT files are technically just text files: each subtitle is a numbered block with a time mark on its own line, followed by the subtitle text.
So there should be no difficulty opening them in Notepad or any other text editor and importing the subtitles into LingQ.
Googling “vobsub2srt” should turn up a handful of tools for converting DVD subtitles into SRT if they turn out to be binary, but you can simply try opening the SUB or IDX files on your DVD (I don’t have one at hand now) in Notepad to find out whether they are binary or plain text.
Some subtitle formats (USF, SSF, etc.) are based on XML structures. These are more or less readable without conversion, but they include a lot of markup that is garbage from a human’s point of view, so it may be easier to convert them into SRT first as well.
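
The "open it in Notepad and see" check can also be done with a few lines of Python. This is only a rough sketch under a stated assumption: that binary subtitle files (such as the image-based SUB files from DVDs) contain NUL bytes near the start, while plain 8-bit text files do not. The function name `looks_like_text` is made up for this example.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic guess: True if the bytes look like a text subtitle file."""
    sample = data[:4096]  # the first few kilobytes are enough for a guess
    # A byte-order mark means Unicode text, even though UTF-16 contains NUL bytes.
    if sample.startswith((b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")):
        return True
    # Assumption: binary formats contain NUL bytes, plain 8-bit text does not.
    return b"\x00" not in sample
```

To use it, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. A result of True only means "probably text"; it says nothing about which character set the text is in.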

I have tried opening SUB files containing Russian and Japanese subtitles; they come out as garbage.

@skyblueteapot That is what happens in Notepad with files in a non-Unicode character set that is not your system’s own. Just open the SRT file in your browser (you can drag and drop it from the folder into the browser’s address bar and press Enter) and then choose the character set manually (in the View or Tools menu, depending on the browser). For Russian it is mostly Windows-1251.
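
What the browser does here is decode the file’s bytes using a character set you name. The same thing can be sketched in Python, assuming you already know (or are willing to guess) the encoding. The helper name `to_utf8` is made up for this example; "cp1251" and "shift_jis" are Python’s codec names for Windows-1251 and Shift_JIS.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode subtitle bytes from a legacy character set to UTF-8."""
    # Decoding raises UnicodeDecodeError if the bytes do not fit the
    # named encoding, which is a hint that the guess was wrong.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")
```

Read the SRT file in binary mode, pass the bytes through `to_utf8`, and write the result to a new file in binary mode. Note that a wrong guess does not always raise an error: some encodings will decode almost any bytes into nonsense, so it is still worth looking at the result.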

@skyblueteapot

A while back I tried doing exactly what you are trying to do now. From my experience Japanese subs are normally stored as images. If you want to get them into a text format you need a program that will view the images and try to “guess” what characters they are supposed to be. There was a program I used that I think was freeware (that or it was a free trial) that did just this.

The problem is that it doesn’t work all that well. The program would give me the opportunity to correct the mistakes as it would show the subtitle image and show what it thinks the image is. You can change it at that point to match the image. Unfortunately this can be rather tedious for an entire movie, especially when you are still learning the language like I am.

Anyway, I think I still have that program installed on my desktop computer. I’m not at home right now but I can tell you what it is when I get home later this evening. It might be that previously mentioned “Vobsub” but I’m not sure.

I don’t study Japanese, but I did a quick search on Google and found a bunch of websites where you can download SRT subtitle files for various languages. Here’s one of many for Japanese that might be helpful.

http://bit.ly/oU9zkF

I’m assuming these sub files are meant for illegally downloaded movies… Nonetheless, as far as I know, there’s nothing illegal about downloading SRT sub files.
Just click the movie you want subs for and click “download subtitles” on the next page. Save the zip file and extract it.
The only way I could open the SRT file and view the content was with Firefox. Open the file with Firefox and make sure you change the encoding (in Firefox it’s under View > Character Encoding). In my case I changed the encoding to “Japanese (Shift_JIS)”…
It shows up perfectly and it’s READY to cut and paste.

Here’s a screenshot to show what it looks like in my Firefox…

http://ScrnSht.com/kzkhcw

You know, last year I tried to make lessons from Korean subtitle files exactly like this, cutting and pasting. The problem is that it was taking WAY too much time to format the text into something usable. There’s just too much to delete and fix, so I gave up. Not to mention either having to cut the audio so that there aren’t all these gaps between the dialogue, or having to rewatch the movies to listen to the audio. All of which takes a lot of time. Perhaps there is some software out there that could reformat your text by automatically deleting the timecodes and any other unwanted text.
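
Deleting the timecodes automatically is a small scripting job. Here is a minimal Python sketch, assuming the usual SRT layout (a counter line, a `00:00:01,000 --> 00:00:03,000` style timecode line, then the text). The name `strip_srt` is made up for this example, and the tag removal is a guess at what "other unwanted text" usually means.

```python
import re

# Matches an SRT timecode line such as "00:00:01,000 --> 00:00:03,000".
TIMECODE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def strip_srt(srt_text: str) -> str:
    """Return only the dialogue lines of an SRT file, one per line."""
    kept = []
    for line in srt_text.lstrip("\ufeff").splitlines():
        # Drop simple formatting tags such as <i>...</i>.
        line = re.sub(r"<[^>]+>", "", line).strip()
        # Skip blank lines, counter lines and timecode lines.
        if not line or line.isdigit() or TIMECODE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```

One known limitation: a dialogue line consisting only of digits (a year, say) is mistaken for a counter and dropped.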

I might try studying from sub files again later on…

Hope this helps.

@keroro - Thanks for the links!

@skyblueteapot - The subtitle program I used is called “Subtitle Edit” and it is a free program. You can find it here: Nikse.dk

You might need VobSub along with Subtitle Edit to actually be able to do the whole process. I can’t remember to be honest and it was a bit of a hassle so I don’t really want to try again right now to find out, sorry. :(

I might try again down the road when my Japanese typing skills improve. Right now there are just too many Kanji that I don’t know how to type off the top of my head and it is very tedious to look up all the Kanji I don’t know just to correct the mistakes Subtitle Edit makes when reading the images. If you do decide to try it out, good luck! And if you find a better method, please share, I’d love to know!

@kekoro: You speak words of wisdom. I probably only need a way of reading SRT files and a couple of good SRT subtitle databases. I downloaded some subtitles and viewed as you recommended, it works and is easy.

I think I now have the functionality to rip “hard” subtitles from a DVD, convert them into text and edit them. Whether I will ever need this is another matter entirely.

Thank you all for your ideas!

The next question (which I’ll put in a separate thread) will be:

Which are your favourite SRT subtitle websites?

Hello Sky blue:

Could you please tell us whether this functionality you now have to rip “hard” subtitles comes from Eugrus’s advice, from the Subtitle Edit program or from VobSub? It all seems very confusing, and I would like to rip subtitles too, since I have a lot of Japanese and other foreign movies (which, btw, I bought and paid for). Some of them have Japanese or other foreign-language subtitles, but I have no clue how to get the subtitles into text. So far I have been unable to find the subs for my movies on the subtitle-sharing websites either.

I might be able to answer John’s question, at least partially. From what I remember, “vobsub” was an extra program that you needed to enable Subtitle Edit to read the subtitle images. Without it, Subtitle Edit can only handle “soft” subtitles. It’s either that, or you need vobsub to extract the hard subs from the DVD and then use Subtitle Edit to read them. I can’t remember exactly how they worked together, but I think you needed both. Fortunately both programs are freeware, so you should be able to at least try it out.

As for Eugrus’s advice, it really only applies to soft subtitles as that kind of subtitle is stored in a text file. Hope that helps.

Btw, after you get to the point where Subtitle Edit can read the subtitles from your DVD you can have it save them as a text file.

It seems that SubRip will do the OCR-ing (converting idx/sub image-based subtitles into text-based ones that you can edit). You need to either train the OCR program (very time-consuming with Japanese because of all the kanji) or find a character matrix file that someone else has already created that will handle the character recognition.

I have downloaded SubRip and found a character matrix file, but haven’t played with them yet.

Download the SRT file, then work with Firefox, Notepad and KMPlayer…
Sometimes you need to install some fonts…
An easy and simple job…

Helen, a while back I found extensive information in Russian on the Internet about this topic and about how to subtitle movies for oneself, although the articles were not specifically aimed at Japanese files. I no longer have those links, but I’ll look around over the next few days and report back if I find something good. Sorry, but I am very short of time this a.m. and cannot look further now.

A Google search for “.srt файлы” (“.srt files”) provides some interesting links, for instance http://open-file.ru/types/srt , which contains a description of the .srt file format from a programming point of view (notice the little section “Как, чем открыть файл .srt?”, i.e. “How, and with what, do you open a .srt file?”, and the link to “видео-файлы”, i.e. “video files”), and a Wikipedia article on SubRip. The rest of the first page of search results has many other articles that look promising. The Russian Internet is filled with informative discussion of technical programming subjects, and would always be worth your while to search, since you read Russian . . . but you probably know that already.

I found that opening SRTs in Word works better than opening them in Notepad or Open Office, especially if you want to copy and paste into LingQ. It’s something to do with how the ends of lines are coded.

Unfortunately, I have no idea how to run a Java program, so I’ve had to mess about a bit with Word and Excel.

This is my method for stripping out the time data from .srt files:

In Word, do a search and replace for a double paragraph marker (^p^p). Replace them all with a hamster smiley (8:@). Any placeholder that never appears in the subtitles themselves will do.

Now do a search and replace for a single paragraph marker (^p). Replace them all with a tab character.

Now do a search and replace for the hamster smileys (8:@) and replace them with paragraph markers (^p). Each subtitle now sits on a line of its own, with its parts separated by tabs.

Save as a text file. Be sure to select the correct encoding, e.g. Windows-1251 (Cyrillic) for Russian subtitles.

Open the text file in Excel. Excel will look at the structure and correctly realise you have a tab-delimited data file. Again, be sure to select the correct encoding, e.g. Windows-1251 (Cyrillic) for Russian subtitles.

You will get a spreadsheet where columns A and B are the subtitle number and the time data, which you can delete. Column C, and maybe column D also, will be the subtitle text. If there are some typos, you might want to correct them manually at this point.

WARNING - if you try and convert the Excel file back to a text file at this point, you have to be really careful how you do it. Excel tends not to ask you which character encoding to use on conversion to text. Personally, I skip this step.

The subtitle columns can now be copied and pasted into a LingQ lesson.
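
For anyone who would rather script it, the Word and Excel steps above translate fairly directly into Python. This is just a sketch of the same idea, not a tested tool: it assumes subtitles are separated by blank lines, and the function names `srt_to_rows` and `subtitle_text` are made up for this example.

```python
import re

def srt_to_rows(srt_text: str) -> list:
    """Split SRT text into one row per subtitle, like the spreadsheet rows."""
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    # A blank line plays the role of Word's double paragraph marker (^p^p).
    blocks = re.split(r"\n\s*\n", text)
    # Splitting each block on line ends gives the tab-separated "columns".
    return [block.split("\n") for block in blocks if block.strip()]

def subtitle_text(rows: list) -> str:
    """Drop columns A and B (counter and timecode) and keep the text."""
    return "\n".join(" ".join(row[2:]) for row in rows if len(row) > 2)
```

Because the text is handled as Python strings, the encoding question only comes up twice: when reading the file (`open(path, encoding="cp1251")` for Windows-1251, for example) and when writing the result, where UTF-8 can be requested explicitly. That sidesteps the Excel conversion problem mentioned in the warning.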