Where to find a domain-specific corpus for a text mining task?

https://stackoverflow.com//questions/22071685

23-12-2019

Question

I am working on a text mining project that focuses on computer technology documents, so there is a lot of jargon. Tasks like part-of-speech tagging require training data to build a POS tagger, and I think this training data should come from the same domain, with words like ".NET, COM, JAVA" correctly tagged.

So where can I find such a corpus? Is there any workaround? Or can we tune an existing tagger to handle a domain-specific task?

Solution

Gathering training data (and defining features) is going to be the hardest step of this problem. I'm sure there are datasets out there, but an alternative would be to identify a few journals or news sites that focus on your area of interest, crawl them, and pull down the text, perhaps validating each article by searching it for keywords. I've done that before to develop a corpus focused on elections.
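The keyword validation step could look roughly like the sketch below. It is only an illustration, not the exact code used for the elections corpus: the keyword list, the function names `keyword_hits` and `is_on_topic`, and the `min_hits` threshold are all assumptions, and the crawling itself is left out (the articles are assumed to be already downloaded as plain strings).

```python
import re

# Hypothetical keyword list for the computer-technology domain.
DOMAIN_KEYWORDS = [".NET", "COM", "Java", "compiler", "API"]


def keyword_hits(text, keywords=DOMAIN_KEYWORDS):
    """Return the set of keywords that occur in text as whole tokens."""
    hits = set()
    for kw in keywords:
        # (?<!\w) and (?!\w) act as word boundaries that also work for
        # terms starting with punctuation, such as ".NET".
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.add(kw)
    return hits


def is_on_topic(text, keywords=DOMAIN_KEYWORDS, min_hits=2):
    """Keep an article only if it mentions at least min_hits distinct keywords."""
    return len(keyword_hits(text, keywords)) >= min_hits


# Stand-ins for crawled article texts.
articles = [
    "The new .NET release improves the Java interop API.",
    "The election results were announced on Tuesday.",
]
corpus = [a for a in articles if is_on_topic(a)]
```

Requiring several distinct keywords rather than a single match is a design choice to reduce false positives; case-insensitive matching means a short term like "COM" will also match the "com" in a domain name, so the list and threshold need tuning for your sources.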

Other tips

Unfortunately, where you can find such a corpus is itself domain-specific.

Catch-22. There is no general source for specialized data.

Just like there is no universal software to solve domain-specific problems.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow