Question

My question is pretty straightforward. I've spent a few hours searching the web for existing methods of generating keywords for a topic or word. For example, if my input is:

Object oriented programming

I want my output to be along the lines of:

classes, objects, friend functions, static variables, etc.

My current idea for a solution is to google the subject I want keywords for, grab the first x (many) result pages, remove all tags and stop words from them, and pass each word through the Python NLTK lemmatizer to get its base form, so that I don't count inflected forms of the same word more than once ("classes" and "class" both become "class"). I would then count the number of occurrences of each word and take the top x% as the words most relevant to my search topic.
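A minimal sketch of the pipeline described above, using only the standard library so that it runs without extra downloads. The stop-word list, the `naive_normalize` helper, and the `top_keywords` function are hypothetical names made up for illustration; `naive_normalize` only strips plural-looking suffixes and is a stand-in for a real lemmatizer (for example NLTK's), which could be passed in through the `normalize` parameter instead. Fetching the result pages is left out; the function simply takes a list of HTML strings.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it",
              "of", "or", "that", "the", "to", "with"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def naive_normalize(word):
    """Placeholder for a lemmatizer: only strips a plural-looking suffix."""
    if word.endswith("es") and len(word) > 4 and word[:-2].endswith(("ss", "x", "ch", "sh")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def top_keywords(pages, top_fraction=0.1, normalize=naive_normalize):
    """Count normalized non-stop-words across pages; return the top fraction."""
    counts = Counter()
    for html in pages:
        for token in re.findall(r"[a-z]+", strip_tags(html).lower()):
            if token not in STOP_WORDS:
                counts[normalize(token)] += 1
    keep = max(1, int(len(counts) * top_fraction))
    return counts.most_common(keep)
```

Calling `top_keywords(pages, top_fraction=0.05)` on a list of downloaded pages would return `(word, count)` pairs for the most frequent 5% of distinct words. Because every token is counted on its own, this sketch has the limitation of never producing multi-word phrases.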

The first issue with this is that it won't generate any phrases, because it treats each word individually. The second is that there must be something already done in this field. The results I've come up with during my research are: context vectors (they seem quite similar to what I want, but in reality I don't think they are) and the Porter stemmer algorithm, but then I realized lemmatization is much better for my purpose. I also saw a lot of "keyword generators" for sites to increase their traffic, but I highly doubt I can use any of those for what I'm trying to do.

If anyone could point me in the direction of an algorithm or existing research on this, or anything at all, I'd be really grateful :)

Solution

What you are looking for is a focused crawler. Have a look at BootCaT. BootCaT extracts keywords as n-grams, but you could use your own algorithm to extract keywords from the web pages (instead of extracting space-separated strings as words). You could also use a library or REST API for keyword extraction, which will extract multi-word keywords for you. Here, in the "External links" section, you can find a list of some keyword extractors.
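To illustrate the n-gram idea in general terms, here is a rough sketch of counting word n-grams so that phrases such as "friend functions" can surface as keywords. It is not BootCaT's implementation; the `candidate_phrases` function, its parameters, and the small stop-word list are assumptions made up for this example. N-grams are kept within sentence boundaries, and any n-gram that starts or ends with a stop word is discarded.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"a", "an", "and", "are", "can", "in", "is", "not", "of", "the", "to"}


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def candidate_phrases(text, max_n=3, min_count=2):
    """Count 1..max_n word n-grams and keep those seen at least min_count times."""
    counts = Counter()
    # Split on sentence-like punctuation so n-grams do not cross sentence boundaries.
    for sentence in re.split(r"[.!?;:\n]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # Skip n-grams that begin or end with a stop word.
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]
```

Run over the text of the crawled pages, `candidate_phrases(text)` returns `(phrase, count)` pairs in which repeated multi-word terms appear alongside single words. Raw frequency is a crude ranking; a dedicated keyword extraction library would likely do better.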

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow