LingQ -> spaCy natural language processing -> an SRS such as Anki

Think this will really help with my Polish studies and hopefully the tools I produce will be useful to others too. Here is a short description of the project I am working on.

spaCy NLP for SRS

Some scripts that apply spaCy natural language processing to target-language content, to help adult foreign language learners get to grips with the grammar of their target language.

The idea is to take content that a learner is already familiar with and analyse the grammatical features of this content with spaCy.

We will take spaCy’s output and produce a CSV file that can be imported into an SRS such as Anki, so that the learner can then, for example, expose themselves to categories of grammar such as:

  • lots of examples of a particular base form of a verb, noun, adjective etc. used in different cases, tenses, genders, in singular or plural, and so on.

  • examples of masculine plural nouns, in whatever case.

  • examples of verbs in the third person, contrasting plural and singular inflection.

  • etc. etc.

By mining familiar content and categorising sentences according to their grammatical features, we let the learner see lots of examples of whichever grammatical feature they are interested in, progressively and systematically, in whatever order they like. We feel this will help them become familiar with the grammar patterns of their target language as efficiently as possible.
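
As a rough illustration of that categorising step, here is a minimal Python sketch. It assumes each word has already been reduced to a plain record holding its sentence, its part of speech and a Universal Dependencies style feature string (the format spaCy prints for `token.morph`). The record layout, the function names and the sample records are hand-written for illustration and are not real model output.

```python
def parse_feats(feats):
    """Turn 'Case=Nom|Number=Plur' into {'Case': 'Nom', 'Number': 'Plur'}."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def sentences_with(records, pos, **wanted):
    """Return the sentences containing a word of the given part of speech
    whose features include every key=value pair in `wanted`."""
    found = []
    for rec in records:
        feats = parse_feats(rec["morph"])
        matches = all(feats.get(key) == value for key, value in wanted.items())
        if rec["pos"] == pos and matches and rec["sentence"] not in found:
            found.append(rec["sentence"])
    return found


# Hand-written sample records (hypothetical, not produced by a model).
RECORDS = [
    {"sentence": "Studenci czytają książki.", "word": "Studenci", "pos": "NOUN",
     "morph": "Case=Nom|Gender=Masc|Number=Plur"},
    {"sentence": "Studenci czytają książki.", "word": "czytają", "pos": "VERB",
     "morph": "Number=Plur|Person=3|Tense=Pres"},
    {"sentence": "Studenci czytają książki.", "word": "książki", "pos": "NOUN",
     "morph": "Case=Acc|Gender=Fem|Number=Plur"},
    {"sentence": "Student czyta książkę.", "word": "Student", "pos": "NOUN",
     "morph": "Case=Nom|Gender=Masc|Number=Sing"},
    {"sentence": "Student czyta książkę.", "word": "czyta", "pos": "VERB",
     "morph": "Number=Sing|Person=3|Tense=Pres"},
]

# Masculine plural nouns in whatever case, and third person plural verbs.
masc_plural = sentences_with(RECORDS, "NOUN", Gender="Masc", Number="Plur")
third_plural = sentences_with(RECORDS, "VERB", Person="3", Number="Plur")
print(masc_plural)
print(third_plural)
```

Passing the features as keyword arguments keeps each "category of grammar" down to a one-line query, so adding a new card type should not need new code.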

Tools

Done

Export all your content from LingQ into a tree of text files in Google Drive.

Discussed here: Export your lingq text content as a tree of files to Google drive

Experiment to allow a user to select a single lesson from LingQ and run the content through spaCy

Discussed here: Google colab notebook to select a language -> course -> lesson and then to run natural language processing from spaCy on the text

See also:

This web app I created lets you enter some text and select a Polish or English language model to run it through spaCy.

Remarkably simple code here: GitHub - jamiepratt/spacy-pl: Streamlit application using spacy to analyse Polish text.

To do

Run text through spaCy and produce a CSV file

  • iterates through a tree of text files on Google Drive.

  • runs each text file through spaCy.

  • creates a database with a record for every word in your content, holding linguistic features of that word such as:

      • the sentence the word appears in, for context.

      • the base form of the word, be it a verb, adjective or noun.

      • details of how the word is inflected, e.g. for verbs what person, tense etc. the verb is in, or for Slavic nouns what case etc. the noun is in.
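
The steps in this to-do item could be sketched roughly as follows. This is only an outline under some assumptions: the Drive folder is already available as a local directory, the files are UTF-8 `.txt` files, and the Polish model `pl_core_news_sm` is installed. The function names, the column names, the `lingq_export` folder and the `words.csv` output file are made up for illustration.

```python
import csv
from pathlib import Path

COLUMNS = ["source", "sentence", "word", "lemma", "pos", "morph"]


def iter_text_files(root):
    """Yield every .txt file below root, in a stable order."""
    yield from sorted(Path(root).rglob("*.txt"))


def token_rows(doc, source):
    """Yield one row per word in a spaCy Doc, skipping punctuation and whitespace."""
    for sent in doc.sents:
        for token in sent:
            if token.is_punct or token.is_space:
                continue
            yield {
                "source": source,
                "sentence": sent.text.strip(),  # context for the card
                "word": token.text,
                "lemma": token.lemma_,          # base form
                "pos": token.pos_,
                "morph": str(token.morph),      # e.g. Case=Acc|Number=Sing
            }


def export_csv(root, out_path, nlp):
    """Run every text file under root through nlp and write a single CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for path in iter_text_files(root):
            doc = nlp(path.read_text(encoding="utf-8"))
            writer.writerows(token_rows(doc, str(path)))


def main():
    """Example entry point; needs spaCy and the Polish model installed."""
    import spacy  # imported here so the helpers above load without spaCy

    export_csv("lingq_export", "words.csv", spacy.load("pl_core_news_sm"))
```

Calling `main()` would write one row per word. A flat CSV is used here instead of a real database simply because that is the format an SRS import expects.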


@roosterburton my full diabolical plan as mentioned in PM. :)

Very exciting stuff. Having access to data like this is game changing.

I’ll need to look at setting up a spaCy API endpoint that supports multiple languages… unless you know of a free one…!

No @roosterburton, I don’t know of an API for querying spaCy from a client. Nice idea to set one up!

If you do want to set up an endpoint this might be the way to go?

Great, that looks straightforward enough. I’ll just set it up as a private API for my tools.
Integrating the data in a meaningful way will still be a big challenge.
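
For what it's worth, a private endpoint of that kind might look roughly like the sketch below. It uses only Python's standard `http.server`, so nothing beyond spaCy itself is assumed. The request format (a POST body of `{"lang": ..., "text": ...}`), the port and the model names in `MODELS` are illustrative choices, not anything agreed in this thread, and a real deployment would probably want a proper web framework and some authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Language code -> spaCy model name. Which models to install is an assumption.
MODELS = {"pl": "pl_core_news_sm", "en": "en_core_web_sm"}


def analyse(nlp, text):
    """Return a JSON-serialisable list with one entry per token."""
    return [
        {"word": t.text, "lemma": t.lemma_, "pos": t.pos_, "morph": str(t.morph)}
        for t in nlp(text)
    ]


def make_handler(pipelines):
    """Build a request handler class that serves the given loaded pipelines."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                request = json.loads(self.rfile.read(length))
                nlp = pipelines[request["lang"]]
                status, payload = 200, analyse(nlp, request["text"])
            except (ValueError, KeyError, TypeError):
                # Bad JSON, unknown language or missing field.
                status, payload = 400, {"error": "expected JSON with 'lang' and 'text'"}
            body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(port=8000):
    """Load every model in MODELS and serve on localhost until interrupted."""
    import spacy  # imported lazily so the helpers above load without spaCy

    pipelines = {lang: spacy.load(name) for lang, name in MODELS.items()}
    HTTPServer(("127.0.0.1", port), make_handler(pipelines)).serve_forever()
```

Loading each model once at start-up and picking one per request by language code is what lets a single endpoint cover several languages.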

“Integrating the data in a meaningful way will still be a big challenge”

I’m excited to see what you come up with, Dan! It is definitely a big UI challenge and, speaking for myself, it is not immediately obvious what all the data you can get from spaCy means for me as a language learner.

I think that spaCy is indeed a VERY helpful tool to have as a language learner.

I am particularly interested in the morphology of words it reveals. Here is a quote I found in the spaCy documentation:

"We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form."

But this is just one use of spaCy.
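
Going back to the morphology quote for a moment, here is a small sketch of what "lemma plus features gives a surface form" could look like as data. It groups surface forms under their lemma. The triples are hand-written for the Polish noun "książka" (book). With spaCy they would presumably come from `token.text`, `token.lemma_` and `str(token.morph)`, and the function name is just an illustration.

```python
from collections import defaultdict


def inflection_table(analysed):
    """Map each lemma to the distinct (surface form, features) pairs seen for it."""
    table = defaultdict(list)
    for surface, lemma, feats in analysed:
        entry = (surface.lower(), feats)
        if entry not in table[lemma]:
            table[lemma].append(entry)
    return dict(table)


# Hand-written (surface form, lemma, features) triples, not model output.
ANALYSED = [
    ("Książka", "książka", "Case=Nom|Gender=Fem|Number=Sing"),
    ("książkę", "książka", "Case=Acc|Gender=Fem|Number=Sing"),
    ("książki", "książka", "Case=Nom|Gender=Fem|Number=Plur"),
    ("książek", "książka", "Case=Gen|Gender=Fem|Number=Plur"),
    ("książkę", "książka", "Case=Acc|Gender=Fem|Number=Sing"),  # repeat, ignored
]

for surface, feats in inflection_table(ANALYSED)["książka"]:
    print(f"{surface:10} {feats}")
```

A table like this is more or less the "lots of examples of one base form" card idea from the first post, just viewed from the lemma side.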

Let me know if you want to jump onto a video / audio call to discuss possibilities and / or bounce ideas off each other about what this could be used for.

I’ve been going through the course at https://course.spacy.io/en/chapter1

I have just started getting into it, but this word dependency system seems extremely useful. The other straightforward thing I can see is being able to group words based on these tags.
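
Both ideas can be sketched in a few lines. This is a minimal example, assuming a parsed spaCy `Doc` and using the token attributes `pos_`, `dep_`, `head` and `i`. The helper names, the sample sentence in `demo` and the choice of the `pl_core_news_sm` model are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_tag(doc):
    """Group the distinct word forms in a parsed text by coarse part-of-speech tag."""
    groups = defaultdict(set)
    for token in doc:
        groups[token.pos_].add(token.text.lower())
    return {tag: sorted(words) for tag, words in groups.items()}


def dependency_triples(doc):
    """List (head word, relation, dependent word) for every non-root token."""
    return [
        (token.head.text, token.dep_, token.text)
        for token in doc
        if token.head.i != token.i  # in spaCy the root token is its own head
    ]


def demo(text="Student czyta książkę."):
    """Example run; needs spaCy and the Polish model installed."""
    import spacy  # imported lazily so the helpers above load without spaCy

    doc = spacy.load("pl_core_news_sm")(text)
    print(group_by_tag(doc))
    print(dependency_triples(doc))
```

The triples make it possible to ask questions such as "which nouns appear as the object of this verb", which plain tag grouping cannot answer.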

I’ll send you a message about collab
