Export your LingQ text content as a tree of files to Google Drive

Hi all,

I just created a script that uses the LingQ API to fetch all my content from LingQ and store it as text files on Google Drive. Here is a video of the export in action:

The script runs on Google Colab for free and should, I hope, be fairly self-explanatory to use:

Jamie

The next task I am going to work on is to write a script that:

  • iterates through a directory of text files on Google Drive
  • runs each text file through spaCy, an excellent natural language processing library.
  • creates a database with a record for every word in your content, together with linguistic features of that word, such as:
    • the sentence the word appears in, for context.
    • the base form (lemma) of the word, be it a verb, adjective or noun.
    • details of how the word is inflected, e.g. for verbs the person, tense, etc., and for Slavic nouns the case, etc.

Here is an example of spaCy analysing text. spaCy can analyse about 70 languages, I believe; here I have it set up to work with either the English or the Polish language model:

@jamespratt

Very cool script mate. Good to see some other public development going on.

I didn’t see any handling for pagination in the courses/lessons. The LingQ API usually limits results to 10, 25 or 99 per page.

Ah! Yes, as yet there is no handling of pagination, so if you have more than one page of courses in any language, or of lessons in a course, then the second page of courses or lessons will not be transferred to Google Drive.

I’ll take a stab at fixing that sometime soon; it should be fairly simple to do. If anyone wants to submit a pull request, suggest some code, or simply explain how they would do it, please do!

Thanks! Well done with all your great work too!

I think the JSON response has “next” and “previous” keys right at the start. The only code I have handy at the moment gets all courses in a language, but it should be similar for lessons.

import requests

# Assumes base_url, headers and language_code are defined as in the
# full script further down the thread.
def fetch_and_save_to_csv():
    library_data_list = []
    page = 1
    while True:
        library_url = f"{base_url}/api/v3/{language_code}/search/?level=1&level=2&level=3&level=4&level=5&level=6&sortBy=mostLiked&type=collection&page={page}"
        library_response = requests.get(library_url, headers=headers)

        if library_response.status_code == 200:
            library_data = library_response.json()
            results = library_data.get("results", [])
            library_data_list.extend(results)

            next_url = library_data.get("next")
            if not next_url:
                print("No further pages")
                break
            else:
                page += 1
                print("going to page: ", page)
        else:
            print(
                f"Failed to fetch data from page {page}. Status code: {library_response.status_code}"
            )
            break
    return library_data_list
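An alternative worth considering (a sketch, assuming the “next” field holds a full URL, which the check above suggests) is to follow the “next” links directly instead of incrementing a page counter:

```python
import requests

def fetch_all_results(start_url, headers):
    """Collect results from every page by following the API's "next" links."""
    results = []
    url = start_url
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        results.extend(data.get("results", []))
        url = data.get("next")  # None on the last page, which ends the loop
    return results
```

This sidesteps building page URLs by hand and stops naturally when the API reports no further page.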

Another thing you could try is to change the “page_size” parameter in the URL; if you set it to something like “1000” you are probably good, though I’m not sure whether this has any side effects.
https://www.lingq.com/api/v3/zh/search/?collection=1240751&page=1&page_size=25&sortBy=position&type=content
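In code that might look like the following (a minimal sketch; the parameter names mirror the URL above, but whether the server honours a very large page_size is an assumption worth testing):

```python
import requests

def fetch_collections(base_url, language_code, api_key, page_size=1000):
    """Fetch one large page of collections instead of paginating.

    Whether the server honours very large page_size values is not
    guaranteed; it may silently cap the result count.
    """
    headers = {"Authorization": f"Token {api_key}"}
    params = {
        "type": "collection",
        "sortBy": "mostLiked",
        "page": 1,
        "page_size": page_size,
    }
    url = f"{base_url}/api/v3/{language_code}/search/"
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json().get("results", [])
```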

One-button copy of the entire LingQ lesson database… Why do they allow this…!

Guessing you need higher privileges, or you get locked out by rate limits?

Thanks for the nice simple fix @bamboozled! I just added that parameter to all the JSON requests:

@roosterburton
Nothing nefarious going on, I had a couple of things I was looking for:

  • Find out how popular my courses are in the Chinese library (it turns out they account for only 2% of the total views)
  • Create a simple spreadsheet for librarian purposes; navigating the website and wading through potentially 1000+ courses to find, for example, untagged courses is just too slow and annoying
  • Gather some statistics about the library: total duration, word counts, popularity by level, etc.

It turns out that for many stats you actually need to comb through all lessons in all courses to get the relevant data, which proved really slow for larger libraries. Additionally, a lot of the data is just not accurate, e.g. the word counts; for those you need to call “counters” on top of everything else, but at that point I lost interest: just too annoying.
Here is a simplified example without the lesson stuff. Also, you don’t need any special privileges; the data is accessible to all users who have added the language.

import requests
import csv

KEY = "12345"  # replace with your LingQ API key

language_code = "es"

base_url = "https://www.lingq.com"
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "Authorization": f"Token {KEY}",
}


def fetch_and_save_to_csv():
    library_data_list = []
    page = 1
    while True:
        library_url = f"{base_url}/api/v3/{language_code}/search/?level=1&level=2&level=3&level=4&level=5&level=6&sortBy=mostLiked&type=collection&page={page}"
        library_response = requests.get(library_url, headers=headers)

        if library_response.status_code == 200:
            library_data = library_response.json()
            results = library_data.get("results", [])
            library_data_list.extend(results)

            next_url = library_data.get("next")
            if not next_url:
                print("No further pages")
                break
            else:
                page += 1
                print("going to page: ", page)
        else:
            print(
                f"Failed to fetch data from page {page}. Status code: {library_response.status_code}"
            )
            break

    # Processing
    data_list = []

    for result in library_data_list:
        id_value = result.get("id")
        title = result.get("title")
        lessons_count = result.get("lessonsCount")
        duration = result.get("duration")
        level = result.get("level")
        roses = result.get("rosesCount")
        shared_by = result.get("sharedByName")
        sharer_role = result.get("sharedByRole")
        tags = result.get("tags", "")
        status = result.get("status", "")

        data_list.append(
            {
                "link": f"{base_url}/uni/learn/{language_code}/web/library/course/{id_value}",
                "title": title,
                "level": level,
                "lessonsCount": lessons_count,
                "duration": duration,
                "roses": roses,
                "shared_by": shared_by,
                "sharer_role": sharer_role,
                "tags": tags,
                "status": status,
            }
        )

    # Sort by rose count, descending (rosesCount can be missing, so default to 0)
    data_list.sort(key=lambda x: x["roses"] or 0, reverse=True)

    # Write data to CSV
    output_filename = f"{language_code}_library.csv"

    with open(output_filename, "w", newline="", encoding="utf-8") as csv_file:
        fieldnames = [
            "link",
            "title",
            "level",
            "lessonsCount",
            "duration",
            "roses",
            "shared_by",
            "sharer_role",
            "tags",
            "status",
        ]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

        writer.writeheader()
        for data in data_list:
            writer.writerow(data)

    print(f"Data saved to {output_filename}.")


if __name__ == "__main__":
    fetch_and_save_to_csv()

I think it’s great, especially for the right reasons. Just unexpected.

2% doesn’t seem too bad; not sure how many make it past the short stories… I guess you already know the answer to that.

There was a good article about Twitter (X) and how they locked down excessive scraping (paywalling, only being able to see 100 tweets a day, etc.). They would experience this sort of thing on a much wider and more blatant scale, though.
