extracting technical keywords from a text document [closed]

https://stackoverflow.com/questions/21371416

03-10-2022

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 8 years ago.


Re-written:

I have a corpus of computer-science-related documents and I want to extract domain-specific keywords from it, for example JAVA, C#, HTML, OOP, UML, Unity, etc. I was looking for a source similar to the Oxford dictionary for computing, but its API is not up and running yet. I also tried Webopedia for computer science terms, but it is not as inclusive or up to date (e.g. it doesn’t include some words that appear in my documents, such as F#), and Wikipedia does not list all the terms together in one place. Is there a more inclusive source, or a more appropriate approach, for extracting those keywords? I am using Python with NLTK. tf-idf wasn’t helpful, because some domain-specific words are common to almost all documents, so those words don’t get a high score. I think POS tagging could help, but I’m not sure which option would be best for my application. Take the string below as an example:

“Expert level capabilities in JavaScript, JSON, and AJAX, and a deep knowledge of JavaScript frameworks such as JQuery”

Here I want to extract these words: [‘JavaScript’, ‘JSON’, ‘AJAX’, ‘Frameworks’, ‘JQuery’], but when I search for nouns using NLTK’s POS tagging, I get ‘level’, ‘capability’, ‘knowledge’, ... as well. Thanks for your help.

Solution

Why don't you download the Stack Overflow data dumps and write a program that filters your text against the tags?

They have just been released on archive.org.

Of course, the tag list would not include every term, and there would be some false positives, but I assume this is about as close as you will get.
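A minimal sketch of that idea, under some assumptions: that the dump contains a Tags.xml file whose rows look like <row TagName="..." Count="..." />, and that a simple regex tokenizer is good enough for your documents. The names load_tags, extract_keywords, min_count and demo_tags are made up for this illustration and are not part of any library.

```python
import re
import xml.etree.ElementTree as ET


def load_tags(source, min_count=0):
    """Collect lowercase tag names from a Tags.xml-style file.

    Assumed layout (check it against the dump you download):
    <tags><row Id="1" TagName="javascript" Count="123" /> ... </tags>
    """
    tags = set()
    # iterparse streams the file, so a large dump is not loaded all at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            name = elem.get("TagName")
            count = int(elem.get("Count", "0"))
            if name and count >= min_count:
                tags.add(name.lower())
        elem.clear()  # release memory held by the processed element
    return tags


# Tokens may contain '#', '+', '.' and '-' so that terms such as
# "C#", "F#", "C++" and "ASP.NET" are kept in one piece.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9#+.\-]*")


def extract_keywords(text, tags):
    """Return tokens of `text` whose lowercase form is in `tags`.

    Order of first appearance is kept; repeats are dropped.
    """
    found, seen = [], set()
    for token in TOKEN_RE.findall(text):
        token = token.rstrip(".-")  # drop trailing sentence punctuation
        key = token.lower()
        if key in tags and key not in seen:
            seen.add(key)
            found.append(token)
    return found


# Hand-written stand-in for the real tag list, for demonstration only.
demo_tags = {"javascript", "json", "ajax", "jquery", "frameworks", "f#"}
sentence = ("Expert level capabilities in JavaScript, JSON, and AJAX, and a "
            "deep knowledge of JavaScript frameworks such as JQuery")
print(extract_keywords(sentence, demo_tags))
```

With the real dump you would call load_tags("Tags.xml") instead of using demo_tags. Raising min_count, or removing ordinary English words from the tag set, is one way to cut down the false positives mentioned above, since common words can also be tags.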

Licensed under: CC-BY-SA with attribution
