Export your LingQ text content as a tree of files to Google Drive

Hi all,

I just created a script that uses the LingQ API to get all my content from LingQ and store it as text files on Google Drive. Here is a video of the export in action:

The script runs on Google Colab for free and should, I hope, be fairly self-explanatory to use:

Jamie


The next task I am going to work on is to write a script that:

  • iterates through a directory of text files on Google Drive
  • runs each text file through spaCy, an excellent natural language processing library
  • creates a db with a record for every word in your content, holding linguistic features of that word such as:
    • the sentence the word appears in, for context
    • the base form of the word, be it a verb, adjective or noun
    • details of how the word is inflected, e.g. for verbs the person, tense, etc., or for Slavic nouns the case, etc.
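
The steps above could be sketched roughly as follows. This is only a minimal sketch, not the planned script: the function names, the `words` table and its columns are made up for illustration, and the model name is an assumption (`en_core_web_sm` for English; `pl_core_news_sm` would be the Polish counterpart). It relies on spaCy's `token.lemma_`, `token.pos_`, `token.morph` and `token.sent`.

```python
import sqlite3
from pathlib import Path


def token_records(doc, source):
    """Yield one row per word-like token in a spaCy Doc (hypothetical helper)."""
    for token in doc:
        if token.is_punct or token.is_space:
            continue  # skip punctuation and whitespace tokens
        yield (
            source,                   # file the word came from
            token.text,               # the word as written
            token.lemma_,             # base form
            token.pos_,               # coarse part of speech (VERB, NOUN, ...)
            str(token.morph),         # inflection details, e.g. "Case=Gen|Number=Sing"
            token.sent.text.strip(),  # sentence the word appears in
        )


def store_records(conn, rows):
    """Insert rows into a 'words' table (table layout is an assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "source TEXT, word TEXT, lemma TEXT, pos TEXT, morph TEXT, sentence TEXT)"
    )
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def analyse_directory(root, db_path, model_name="en_core_web_sm"):
    """Run every .txt file under root through spaCy and store the results."""
    import spacy  # imported lazily so the helpers above work without spaCy installed

    nlp = spacy.load(model_name)  # the model must be downloaded beforehand
    conn = sqlite3.connect(db_path)
    for path in sorted(Path(root).rglob("*.txt")):
        doc = nlp(path.read_text(encoding="utf-8"))
        store_records(conn, token_records(doc, str(path)))
    conn.close()
```

On Colab, `root` would presumably point at the mounted Drive folder that the export script wrote to.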

Here is an example of spaCy analysing text. spaCy can analyse about 70 languages, I believe; here I have it set up to work with either English or Polish language models:

@jamespratt

Very cool script mate. Good to see some other public development going on.

I didn't see any handling for pagination in the courses/lessons. The LingQ API usually limits results to 10, 25 or 99.


Ah! Yes, as yet there is no handling of pagination, so if you have more than one page of courses in any language, or of lessons in a course, then the second and later pages will not be transferred to Google Drive.

I'll take a stab at fixing that sometime soon. That should be fairly simple to do. If anyone wants to submit a pull request, suggest some code, or simply explain how they would do that, please do!

Thanks! Well done with all your great work too!


I think the JSON response has "next" and "previous" keys right at the start. The only code I have handy atm gets all courses in a language, but it should be similar for lessons.

Summary
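# Snippet only: assumes `requests` is imported and that base_url,
# language_code and headers are already defined.
def fetch_and_save_to_csv():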
    library_data_list = []
    page = 1
    while True:
        library_url = f"{base_url}/api/v3/{language_code}/search/?level=1&level=2&level=3&level=4&level=5&level=6&sortBy=mostLiked&type=collection&page={page}"
        library_response = requests.get(library_url, headers=headers)

        if library_response.status_code == 200:
            library_data = library_response.json()
            results = library_data.get("results", [])
            library_data_list.extend(results)

            next_url = library_data.get("next")
            if not next_url:
                print("No further pages")
                break
            else:
                page += 1
                print("going to page: ", page)
        else:
            print(
                f"Failed to fetch data from page {page}. Status code: {library_response.status_code}"
            )
            break

Another thing you could try is to change the "page_size" parameter in the URL. If you set it to something like "1000" you are probably good, though I'm not sure whether this has any side effects.
https://www.lingq.com/api/v3/zh/search/?collection=1240751&page=1&page_size=25&sortBy=position&type=content
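
For anyone who wants to combine both ideas, here is a rough sketch that sets `page_size` on a URL like the one above and then follows "next" until it runs out. The helper names are made up for illustration, and it assumes "next" holds the full URL of the following page (or nothing on the last page), which I have not verified for every endpoint.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_size(url, page_size):
    """Return url with its page_size query parameter set to page_size."""
    parts = urlsplit(url)
    # Keep every existing parameter (including repeated ones) except page_size.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page_size"
    ]
    query.append(("page_size", str(page_size)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_all(first_url, get_json):
    """Collect "results" from every page by following the "next" URL.

    get_json is any callable that takes a URL and returns the decoded
    JSON response as a dict.
    """
    results = []
    url = first_url
    while url:
        data = get_json(url)
        results.extend(data.get("results", []))
        url = data.get("next")  # assumed to be empty/None on the last page
    return results
```

With `requests`, `get_json` could be something like `lambda url: requests.get(url, headers=headers).json()`, where `headers` carries your API token.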


One button to copy the entire LingQ lesson database... Why do they allow this...!

Guessing you need higher privileges, or you get rate limited?

Thanks for the nice simple fix, @bamboozled! I just added that parameter to all JSON requests:


@roosterburton
Nothing nefarious going on, I had a couple of things I was looking for:

  • Find out how popular my courses are in the Chinese library (it turns out they account for only 2% of the total views)
  • Create a simple spreadsheet for librarian purposes; navigating the website and wading through potentially 1000+ courses to find, for example, untagged courses is just too slow and annoying
  • Gather some statistics about the library and get data on total duration, word counts, popularity by level, etc.

It turns out that for many stats you actually need to comb through all lessons in all courses to get the relevant data, and that turned out to be really slow for larger libraries. Additionally, a lot of the data is just not accurate, e.g. word counts; for those you need to call "counters" on top of everything else, but at that point I lost interest. Just too annoying.
Here is a simplified example without the lesson stuff. Also, you don't need any special privileges: the data is accessible to all users who have added the language.

Summary
import requests
import csv

KEY = "12345"

language_code = "es"

base_url = "https://www.lingq.com"
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "Authorization": f"Token {KEY}",
}


def fetch_and_save_to_csv():
    library_data_list = []
    page = 1
    while True:
        library_url = f"{base_url}/api/v3/{language_code}/search/?level=1&level=2&level=3&level=4&level=5&level=6&sortBy=mostLiked&type=collection&page={page}"
        library_response = requests.get(library_url, headers=headers)

        if library_response.status_code == 200:
            library_data = library_response.json()
            results = library_data.get("results", [])
            library_data_list.extend(results)

            next_url = library_data.get("next")
            if not next_url:
                print("No further pages")
                break
            else:
                page += 1
                print("going to page: ", page)
        else:
            print(
                f"Failed to fetch data from page {page}. Status code: {library_response.status_code}"
            )
            break

    # Processing
    data_list = []

    for result in library_data_list:
        id_value = result.get("id")
        title = result.get("title")
        lessons_count = result.get("lessonsCount")
        duration = result.get("duration")
        level = result.get("level")
        roses = result.get("rosesCount")
        shared_by = result.get("sharedByName")
        sharer_role = result.get("sharedByRole")
        tags = result.get("tags", "")
        status = result.get("status", "")

        data_list.append(
            {
                "link": f"{base_url}/uni/learn/{language_code}/web/library/course/{id_value}",
                "title": title,
                "level": level,
                "lessonsCount": lessons_count,
                "duration": duration,
                "roses": roses,
                "shared_by": shared_by,
                "sharer_role": sharer_role,
                "tags": tags,
                "status": status,
            }
        )

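    # Sort by roses, most first; treat a missing count as 0 so that
    # None values cannot break the comparison
    data_list.sort(key=lambda x: x["roses"] or 0, reverse=True)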

    # Write data to CSV
    output_filename = f"{language_code}_library.csv"

    with open(output_filename, "w", newline="", encoding="utf-8") as csv_file:
        fieldnames = [
            "link",
            "title",
            "level",
            "lessonsCount",
            "duration",
            "roses",
            "shared_by",
            "sharer_role",
            "tags",
            "status",
        ]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

        writer.writeheader()
        for data in data_list:
            writer.writerow(data)

    print(f"Data saved to {output_filename}.")


if __name__ == "__main__":
    fetch_and_save_to_csv()
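
As a rough idea of how the per-level statistics mentioned above could be pulled out of the same results, here is a small sketch. The function name is made up, it only uses the fields already read in the script (`level`, `lessonsCount`, `duration`, `rosesCount`), and it treats missing values as 0. I am not assuming anything about the unit of `duration`.

```python
from collections import defaultdict


def summarise_by_level(results):
    """Aggregate course count, lessons, duration and roses per level."""
    summary = defaultdict(
        lambda: {"courses": 0, "lessons": 0, "duration": 0, "roses": 0}
    )
    for course in results:
        row = summary[course.get("level")]
        row["courses"] += 1
        # "or 0" guards against fields that are missing or null
        row["lessons"] += course.get("lessonsCount") or 0
        row["duration"] += course.get("duration") or 0
        row["roses"] += course.get("rosesCount") or 0
    return dict(summary)
```

It would be called on the raw list of results collected from the API (`library_data_list` in the script above), before the rows are reshaped for the CSV.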

I think it's great, especially for the right reasons. Just unexpected.

2% doesn't seem too bad. Not sure how many make it past the short stories... I guess you already know the answer to that.

There was a good article about Twitter (X) and how they locked down excessive scraping (paywalling, only being able to see 100 tweets a day, etc.). They would experience this sort of thing to a much wider and more blatant extent, though.
