Question

Re-written:

I have a corpus of computer-science-related documents and I want to extract domain-specific keywords from it, for example Java, C#, HTML, OOP, UML, Unity, etc. I was looking for a source similar to the Oxford dictionary for computing, but their API is not up and running yet. I have also tried Webopedia for computer science terms, but it is not as inclusive or up to date (e.g. it doesn't include some words in my documents, such as F#), and on Wikipedia the terms are not listed together in one place. Is there a more inclusive source, or a more appropriate approach, for extracting those keywords?

I am using Python with NLTK. tf-idf wasn't helpful, because some domain-specific words are common to almost all documents, so those words don't get a high score. I think POS tagging could help, but I'm not sure which option would be best for my application. Take the string below as an example:

“Expert level capabilities in JavaScript, JSON, and AJAX, and a deep knowledge of JavaScript frameworks such as JQuery”

Here I want to extract these words: [‘JavaScript’, ‘JSON’, ‘AJAX’, ‘Frameworks’, ‘JQuery’], but when I search for nouns using NLTK's POS tagging, I get ‘level’, ‘capability’, ‘knowledge’, … as well. Thanks for your help.
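To make the tf-idf problem concrete, here is a minimal sketch using the plain, unsmoothed inverse document frequency, idf = log(N / df). The three toy documents are invented for illustration and are not from the actual corpus. A term that occurs in every document gets an idf of zero, so its tf-idf score is zero no matter how relevant it is to the domain.

```python
import math

# Toy documents, made up for illustration only.
docs = [
    "expert in javascript and json",
    "javascript developer with ajax experience",
    "javascript and jquery frameworks",
]
tokenized = [d.split() for d in docs]


def idf(term, tokenized_docs):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / df)


# "javascript" appears in all three documents, "jquery" in only one.
idf_javascript = idf("javascript", tokenized)
idf_jquery = idf("jquery", tokenized)
print(idf_javascript, idf_jquery)
```

Here the domain term found in every document scores lower than the rarer one, which is the behaviour described above.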


Solution

Why don't you download the StackOverflow data dumps and write a program to filter the tags?

They have just been released on archive.org.

Of course, it would not include all terms and there would be some false positives, but I assume this is about as close as you will get.
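A rough sketch of that idea in Python follows. It assumes the dump contains a `Tags.xml` file whose `<row>` elements carry a `TagName` attribute; check this against the actual download. The function names `load_tags` and `extract_keywords`, the token pattern, and the small in-memory sample standing in for the real file are all hypothetical choices made for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET


def load_tags(source):
    """Collect lower-cased tag names from a Tags.xml-style file.

    Assumes each tag is a <row> element with a TagName attribute.
    `source` may be a file path or a binary file object.
    """
    tags = set()
    for _, elem in ET.iterparse(source):
        if elem.tag == "row" and elem.get("TagName"):
            tags.add(elem.get("TagName").lower())
        elem.clear()  # keep memory use low on a large dump
    return tags


# Tokens may contain '#', '+', '.' and '-' so that C#, F#, C++ survive.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9+#.\-]*")


def extract_keywords(text, tags):
    """Return tokens of `text` that match a known tag, in order, no repeats."""
    found = []
    for token in TOKEN_RE.findall(text):
        token = token.rstrip(".-")  # drop sentence-final punctuation
        if token.lower() in tags and token not in found:
            found.append(token)
    return found


# Tiny stand-in for the real Tags.xml, for demonstration only.
sample = io.BytesIO(
    b'<tags>'
    b'<row Id="1" TagName="javascript" Count="10" />'
    b'<row Id="2" TagName="json" Count="5" />'
    b'<row Id="3" TagName="ajax" Count="5" />'
    b'<row Id="4" TagName="jquery" Count="5" />'
    b'<row Id="5" TagName="f#" Count="1" />'
    b'</tags>'
)
tags = load_tags(sample)

text = ("Expert level capabilities in JavaScript, JSON, and AJAX, and a deep "
        "knowledge of JavaScript frameworks such as JQuery")
keywords = extract_keywords(text, tags)
print(keywords)
```

With the real tag list, ordinary English words that happen to be tags will also match, so you may want to drop rarely used tags or keep a small stop list to reduce those false positives.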

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow