
SRT2CSV

Read in a .srt file, clean the data, and export to a .csv file for upload to ANKI.

This script currently fully supports only Spanish and German, but the spaCy library and PyMultiDictionary offer support for various other languages. The main purpose of this project was to create an easier workflow that lets me take the subtitle files for shows that I watch and quickly identify the words that I don't know, so I can expand my ANKI deck.

Individual word tokens are extracted from each row of a dataframe and, if they do not already appear in a "stop word" file, are added to a csv file in the format:

original subtitle sentence with bolded and highlighted word | word [pos] : definition | tags

Example Output:

No quiero que sea lunes. | lunes [Noun]:En el diccionario castellano lunes significa primer día de la semana, segundo de la semana litúrgica. | spanish,es,sentence::mining,La Caza,sentence::recognition

USAGE

(1) The original .srt file includes the subtitle number, the start and end times, and other features, such as formatting tags, that need to be removed.


(2) After loading the subtitles with the pysrt library you get a SubRipFile object.

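The project uses pysrt.open, which returns a SubRipFile of SubRipItem objects. As a stdlib-only sketch of what each entry carries (index, timestamps, text), assuming a small inline sample instead of a real file:

```python
# Hypothetical sample of raw .srt content; in the project this comes from
# pysrt.open("episode.srt"), which parses the file into a SubRipFile.
raw = """1
00:00:01,000 --> 00:00:03,000
<i>No quiero que sea lunes.</i>

2
00:00:04,000 --> 00:00:06,000
Vamos a la playa."""

# Each blank-line-separated block holds an index, a timestamp line, and text.
entries = []
for block in raw.strip().split("\n\n"):
    lines = block.split("\n")
    index = int(lines[0])
    start, end = lines[1].split(" --> ")
    text = "\n".join(lines[2:])
    entries.append({"index": index, "start": start, "end": end, "text": text})
```

Note that the text still carries formatting tags like `<i>`, which the next step removes.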

(3) The formatting is removed with the remove_formatting function, which returns a cleaner SubRipFile object.

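A minimal sketch of the tag-stripping step; the project's remove_formatting operates on a whole SubRipFile, but the core of it can be a single regex per subtitle line:

```python
import re

def remove_formatting(text):
    """Strip HTML-style tags such as <i>...</i> from a subtitle line."""
    return re.sub(r"</?[^>]+>", "", text)

cleaned = remove_formatting("<i>No quiero que sea lunes.</i>")
# cleaned == "No quiero que sea lunes."
```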

(4) Using that SubRipFile's text attribute, you can then create the first dataframe and inspect it for further areas to clean. In this dataframe you can note that newline characters ("\n") still appear on a few lines.

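A sketch of the dataframe construction, with hypothetical subtitle texts standing in for each SubRipItem's .text attribute (the column name "Subtitles" is an assumption, not necessarily the project's):

```python
import pandas as pd

# Hypothetical texts; in the project these come from the SubRipFile's items.
texts = ["No quiero que sea lunes.", "Vamos a\nla playa.", "Buenos días."]

df = pd.DataFrame({"Subtitles": texts})

# Rows with embedded "\n" still need cleaning.
has_newline = df["Subtitles"].str.contains("\n")
```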

(5) You can do some basic exploratory data analysis on the dataframe to see the top 5 rows, the total rows and columns, the data types, the unique values, and the most frequent values.

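The EDA step amounts to a handful of standard pandas calls; a sketch on toy data (the column name is assumed):

```python
import pandas as pd

df = pd.DataFrame({"Subtitles": ["No quiero que sea lunes.",
                                 "Vamos a la playa.",
                                 "Vamos a la playa.",
                                 "Buenos días."]})

top5 = df.head()                        # first rows
rows, cols = df.shape                   # total rows and columns
dtypes = df.dtypes                      # data types
unique = df["Subtitles"].nunique()      # count of unique values
most_common = df["Subtitles"].mode()[0] # most frequent value
```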

(6) Removing the newline characters.

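One way to do this, assuming the subtitles live in a column named "Subtitles":

```python
import pandas as pd

df = pd.DataFrame({"Subtitles": ["Vamos a\nla playa.", "Buenos días."]})

# regex=False treats "\n" as a literal newline rather than a pattern.
df["Subtitles"] = df["Subtitles"].str.replace("\n", " ", regex=False)
```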

(7) Breaking down the text into individual words through word tokenization will be vital for identifying words that I do not know.

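The project can lean on nltk or spaCy for this; as a self-contained stand-in, a minimal regex tokenizer shows the idea (the real tokenizers handle punctuation and language-specific rules far more robustly):

```python
import re

def tokenize_words(sentence):
    """Minimal word tokenizer: runs of word characters, Unicode-aware."""
    return re.findall(r"\w+", sentence)

tokens = tokenize_words("No quiero que sea lunes.")
# tokens == ['No', 'quiero', 'que', 'sea', 'lunes']
```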

(8) Tokenizing the sentences. Sentiment analysis is not yet implemented, but the full sentences will be needed to perform that step.

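A minimal splitter on sentence-ending punctuation, as a sketch; nltk's sent_tokenize is the more robust choice in the project itself:

```python
import re

def tokenize_sentences(text):
    """Split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = tokenize_sentences("No quiero que sea lunes. Vamos a la playa.")
```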

(9) Stop words are loaded from the nltk.corpus library, and you can extend that list with your own words. Character names and city names do not need to be added to ANKI for study, so they can be filtered out.

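A sketch of the filtering step. In the project the base list comes from nltk.corpus.stopwords.words("spanish"); here a small hand-written subset stands in so the example is self-contained, and the extra names are hypothetical:

```python
# Stand-in for nltk.corpus.stopwords.words("spanish").
stop_words = {"no", "que", "sea", "a", "la"}

# Extend with character/city names that don't belong in an ANKI deck.
stop_words |= {"marta", "madrid"}

tokens = ["no", "quiero", "que", "sea", "lunes", "marta"]
unknown = [t for t in tokens if t not in stop_words]
# unknown == ['quiero', 'lunes']
```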

(10) You can get a visual overview of the "x" most frequent words so you know which words to focus on learning first.

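The frequency count behind that overview can be done with collections.Counter (nltk's FreqDist offers the same plus plotting); a sketch:

```python
from collections import Counter

tokens = ["lunes", "playa", "lunes", "días", "playa", "lunes"]

x = 2  # the "x" most frequent words to focus on first
top = Counter(tokens).most_common(x)
# top == [('lunes', 3), ('playa', 2)]
```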

(11) If you want to see the full subtitle lines for one of those most frequent words, you can use the find_word function.

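A sketch of what find_word might look like with pandas (the real signature and column name may differ):

```python
import pandas as pd

def find_word(df, word, column="Subtitles"):
    """Return all rows whose subtitle contains `word` as a whole word."""
    return df[df[column].str.contains(rf"\b{word}\b", case=False, regex=True)]

df = pd.DataFrame({"Subtitles": ["No quiero que sea lunes.",
                                 "Vamos a la playa."]})
matches = find_word(df, "lunes")
```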

(12) The Educalingo dictionary defaults to a mix of English and target-language definitions. For my purposes I only wanted to see the definition in the target language.


(13) A function chunks the definition into sections, uses LangDetect to check whether each section is in the target language, and, if so, returns only that section of the definition.

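A sketch of that function. The detector is passed in as a parameter so the example runs without langdetect installed; in the project you would pass langdetect.detect, and the fake detector below is purely for illustration:

```python
def target_language_section(definition, target, detect):
    """Return the first sentence-chunk of `definition` whose detected
    language code matches `target`; `detect` maps text -> language code."""
    for chunk in definition.split(". "):
        if chunk and detect(chunk) == target:
            return chunk
    return None

# Fake detector for illustration only; real code passes langdetect.detect.
fake_detect = lambda text: "es" if "semana" in text else "en"

definition = "First day of the week. primer día de la semana"
section = target_language_section(definition, "es", fake_detect)
# section == "primer día de la semana"
```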

(14) Another issue with Educalingo is that the definition doesn't include the article of a noun, which is crucial when learning a word in gendered languages like Spanish (el, la) and German (der, die, das). spaCy includes the gender as part of the token, but the actual article has to be supplied separately.

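spaCy exposes grammatical gender via token.morph.get("Gender"), which returns values like ["Masc"], but the surface article still has to be mapped by hand. A sketch for Spanish singular nouns, with the gender list supplied directly so no language model is needed:

```python
# Mapping from spaCy's Gender feature values to Spanish singular articles.
ARTICLES_ES = {"Masc": "el", "Fem": "la"}

def with_article(noun, gender_values):
    """`gender_values` mimics token.morph.get("Gender"), e.g. ["Masc"]."""
    article = ARTICLES_ES.get(gender_values[0]) if gender_values else None
    return f"{article} {noun}" if article else noun

card_word = with_article("lunes", ["Masc"])
# card_word == "el lunes"
```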

(15) It's better to look up the base word with Educalingo, because the definition is often tied to that form rather than to the plural of a noun or the conjugated form of a verb.

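In the project, spaCy's token.lemma_ supplies the base form; this tiny lookup table is a stand-in so the sketch runs without a language model, and its entries are illustrative only:

```python
# Stand-in for spaCy lemmatization (token.lemma_).
lemmas = {"pájaros": "pájaro", "quiero": "querer", "lunes": "lunes"}

def base_word(word):
    """Lowercase and lemmatize before querying the dictionary."""
    return lemmas.get(word.lower(), word.lower())

query = base_word("Pájaros")
# query == "pájaro" — lowercase singular avoids the lookup failures above
```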

(16) The csv file is constructed by iterating through all words in the specified column 'No Stop Words'. For each unknown word, I look up all the sentences in the dataframe that include that word. Each entry is stored as a string and added to a list, which is then turned into a new dataframe and exported to a csv file.

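A sketch of building one entry in the output format shown earlier; the sentence, definition, and tags are hypothetical inputs standing in for the dataframe lookup and the Educalingo result:

```python
import pandas as pd

word, pos, definition = "lunes", "Noun", "primer día de la semana"
sentence = "No quiero que sea lunes."
tags = "spanish,es,sentence::mining"

# Bold the word in the sentence (ANKI renders the <b> tag as highlighting).
highlighted = sentence.replace(word, f"<b>{word}</b>")

rows = [f"{highlighted} | {word} [{pos}]:{definition} | {tags}"]

out = pd.DataFrame({"Cards": rows})
# utf-8-sig writes a UTF-8 BOM, matching the encoding note below:
# out.to_csv("anki_cards.csv", index=False, encoding="utf-8-sig")
```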

(17) In the language-learning community there is the n+1 principle: comprehensible input should be just slightly more difficult than what you can already understand. This can be applied to the 'No Stop Words' column by only adding a sentence to the csv if it contains exactly one unknown word (although the user has the option to expand this to 'x' desired words).

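The n+1 filter reduces to counting unknown tokens per sentence and keeping only the rows that match; a self-contained sketch with a hand-written stop set:

```python
# Stand-in for the extended stop-word list.
stop_words = {"no", "que", "sea", "a", "la", "vamos"}

def unknown_count(tokens):
    return sum(1 for t in tokens if t.lower() not in stop_words)

# Sentence -> its word tokens (the 'No Stop Words' pipeline supplies these).
sentences = {
    "No quiero que sea lunes.": ["no", "quiero", "que", "sea", "lunes"],
    "Vamos a la playa.": ["vamos", "a", "la", "playa"],
}

n = 1  # user-adjustable "x": how many unknown words a sentence may contain
keep = [s for s, toks in sentences.items() if unknown_count(toks) == n]
# The first sentence has two unknowns ("quiero", "lunes"), so only the
# second is kept.
```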

NOTES / LIMITATIONS

  • Use Notepad++ to convert ANSI to UTF-8 so special characters survive.

  • The file seems to need UTF-8 BOM encoding in order to preserve special characters.

  • Hinglish doesn't have an official language code.

  • On occasion the Educalingo library is unable to find a definition for the word.

    ex) Pájaros [?]:None , Buenos [?]:None

    • This is possibly because the definition is linked to the singular stem, compounded by the first letter being uppercase.
  • Perhaps the biggest limitation of this project's approach is that it does not account for idioms or slang, which may combine previously known/stop words into a completely different meaning. For this reason I combine a phrase dictionary and a book of idioms in my personal studies.

    ex) Mi media naranja. ("My better half." - Literally "My half orange.")

FUTURE IMPROVEMENTS

  • Adding sentiment analysis on the sentences
  • Adding a category tag for each sentence
  • Creating word clouds (for categories such as family & friends, outside & sports, cooking, etc.)
  • Replacing the Educalingo dictionary with web scraping of an actual dictionary for better definitions and noun articles?
  • Extending the verb definition to include tense -> go deeper into spaCy lemmatization
  • Adding support for more languages
