Question

I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment each script into parts such that each part falls into one of the following categories:
MainConceptDescription-> Explanation of the Main concept(s)
SubConceptDescription-> Explanation of Subconcept related to the main concept
Methodology / Technique-> To achieve something, what should one do
Summary-> Summary of the discussed material or of the whole course
Application-> Practical advice for applying the concept
Example-> Concept example

Now, for the first two I think I should try Latent Dirichlet Allocation (LDA) to extract the topics. Another idea is to take the words in the resource name and search for them in the text. A third idea is to read some of the resources, manually build a dictionary of indicative words for each category, and then create regex patterns and search for them in the text.
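To make the dictionary-plus-regex idea concrete, here is a minimal sketch of what I have in mind. The cue phrases, the category subset, and the function names (`tag_sentence`, `split_sentences`, `tag_transcript`) are placeholders I made up for illustration, not a tested vocabulary, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical cue phrases per category; a real list would come from
# reading a sample of the transcripts.
CUE_PHRASES = {
    "Example": ["for example", "for instance", "let's say"],
    "Summary": ["to summarize", "in summary", "to recap"],
    "Methodology / Technique": ["the first step", "in order to", "how to"],
}

# One compiled, case-insensitive pattern per category, matching whole phrases.
PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    for category, phrases in CUE_PHRASES.items()
}


def tag_sentence(sentence):
    """Return the categories whose cue phrases occur in the sentence."""
    return [c for c, pattern in PATTERNS.items() if pattern.search(sentence)]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag_transcript(text):
    """Pair every sentence of a transcript with its matched categories."""
    return [(s, tag_sentence(s)) for s in split_sentences(text)]


if __name__ == "__main__":
    sample = "For example, take a fair coin. To summarize, we covered priors."
    for sentence, tags in tag_transcript(sample):
        print(tags, "|", sentence)
```

A sentence can match several categories or none, so this only produces candidate labels; it does not decide where a segment starts or ends.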

But the last option seems too crude, so now I am not sure what to do. I've seen similar work for research papers; however, research papers have their own specific expressions that are more or less constant across most papers. That is not the case with my video scripts, which are 100% spoken natural language. Do you have any ideas on how I can approach this? I do have a list of keywords that roughly signal whether an example follows or a concept is being explained, but I am compiling it manually, which is surely not what I want to do for 563 files, a number that may well grow.

Furthermore, I'd like to link the topics I find to ontologies in order to enrich the metadata for each file. I have no idea how to approach this either. Any advice would be very much appreciated.

Forgive me if my explanations make no sense; I am not too familiar with the terminology. If you could also explain some of the terminology you use, I'd appreciate that too. And please suggest algorithms I can try.


Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange