Getting Subtitles From Documentaries / TV Shows

Hi all,

I would dearly, dearly love to put subtitles into LingQ from documentaries I watch.

For just one example:

I would type out the script myself, but it’ll take so much time.

I have downloaded a program that converts Chinese speech into characters, but it doesn’t process the audio from even the clearest documentary I can find well enough to be useful.

I have tried to get subtitle files for various TV shows, such as 北平无战事, but I can’t find any, nor any program that will auto-generate Chinese subtitles for documentaries, such as the one I listed. I’ve searched for hours.

I’m assuming this might be trickier than with other languages simply because most Chinese shows / documentaries have subtitles embedded into the video itself.

Could anyone give me any advice about where to find subtitle files, or some program that would let me compile a text file for documentaries (or TV shows, though I much prefer documentaries)?

Many, many thanks!

Anthony.


There is a plug-in for the Mozilla browser that adds a link to YouTube pages to download any subtitles that are not the (often worthless) automatically generated ones. I use this and like it. Sorry, I don’t have its name available right now, but you can search for it in the Mozilla plug-ins. I would expect there to be similar tools available for other browsers, but I don’t know.

There is also a website, downsub.com, into which you can paste the link to a YouTube video. It will then give you links to any subtitles for the video, including the automatically generated subtitles. (Sometimes they’re not completely worthless.) There may be other similar sites, and there are videos on YouTube (that I have not watched) about downloading subtitles. Just google “youtube subtitle download”.

When you download subtitles, they will be in the “.srt” format which is very simple. Each instance of a subtitle consists of 3 or more lines of text separated from its neighbors by blank lines. The first line is a sequence number. The second line has the start and stop times for displaying the subtitle. The third and subsequent lines contain the subtitle itself.
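To make that layout concrete, here is a minimal Python sketch that splits .srt content into its blocks and pulls out the sequence number, the two timestamps, and the subtitle text. The names Cue and parse_srt are just made up for this illustration, and it assumes a well-formed file with the usual HH:MM:SS,mmm --> HH:MM:SS,mmm time line; real files can be messier.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """One subtitle entry (hypothetical helper type for this sketch)."""
    index: int   # sequence number (first line of the block)
    start: str   # start time, e.g. "00:00:01,000"
    end: str     # stop time
    text: str    # subtitle text (third and subsequent lines)


# Second line of each block: start and stop times separated by "-->".
TIME_LINE = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})"
)


def parse_srt(content: str) -> List[Cue]:
    """Parse .srt text into a list of cues, skipping malformed blocks."""
    # Drop a leading byte-order mark and normalise Windows line endings.
    content = content.lstrip("\ufeff").replace("\r\n", "\n")
    cues = []
    # Blocks are separated from their neighbours by blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.split("\n")
        if len(lines) < 3:
            continue
        match = TIME_LINE.match(lines[1].strip())
        if match is None or not lines[0].strip().isdigit():
            continue
        cues.append(
            Cue(int(lines[0]), match.group(1), match.group(2),
                "\n".join(lines[2:]))
        )
    return cues


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,500\n大家好\n\n"
        "2\n00:00:04,000 --> 00:00:06,000\n欢迎收看\n本期节目\n"
    )
    for cue in parse_srt(sample):
        print(cue.index, cue.start, cue.end, cue.text.replace("\n", " "))
```

Joining the text fields of the returned cues gives you the plain transcript to paste into a lesson.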

Those subtitles that are “burned” into the video image itself are probably not accessible any other way than by reading. I’ve transcribed those manually for one or two short videos, but that’s not very sustainable.

If you’re handy with Linux or other Unix-like systems, or have any system with a *nix-like shell and the sed utility, this script can extract the subtitles from a .srt file. It also removes any directives that are embedded in some subtitles.

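#!/bin/bash

# Temporary file for the sed commands; removed again when the script exits.
SEDFILE=.desrt.$$
trap "/bin/rm -f ${SEDFILE}; exit" 0

# The quoted 'EOF' keeps the shell from touching $ and \ in the sed commands.
# The commands delete sequence-number lines and time lines, then strip
# <font ...> and </font> directives.
cat >${SEDFILE} <<'EOF'
/^[1-9][0-9]*$/d
/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9]* *-->/d
s/<font[^>]*>//g
s/<\/font[^>]*>//g
EOF

sed -f ${SEDFILE} "$@"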

And a cleaner-looking script to do the same thing with Perl:

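#!/usr/bin/perl -nl --

# Skip sequence-number lines.
next if (m/^\d+$/) ;
# Skip time lines such as 00:00:01,000 --> 00:00:03,500.
next if (m/^\d\d:\d\d:\d\d,\d+\s*-->/) ;
# Strip <font ...> and </font> directives.
s/<\/?font[^>]*>//g ;
print ;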

If you have an .srt file and want to extract the pure text, you can use the “Convert subtitles to plain text” tool on Subtitle Tools.
I’ve been using it for Chinese just like you, and it works great.

I wrote a post about this using anime: “How to Use Anime Subtitles to Help Improve Your Japanese” on the LingQ Blog.

As for Chinese subtitles, you can find them here, I think: https://www.quora.com/What-is-the-best-site-for-downloading-subtitles-2