MALLET for automatic topic tagging - with training data

https://stackoverflow.com/questions/12726728

05-07-2021

Question

I have a corpus of documents that I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been given one or more tags and a short title. (I also have a much larger list of titles, which I often reuse when a document contains very similar content.)

I want to make an interface that will suggest tags/titles (from my existing lists) for new documents that I add to the corpus, based on how I have tagged the existing documents.

I have read about the probabilistic topic model LDA classes, which look great for analyzing text when you don't have any existing tagged data. But I don't see any way I can incorporate my existing work.

Any suggestions would be appreciated.

Kind Regards

Swami

Solution

For tag suggestion, our experience is that a search engine is enough; there is no need for topic modeling.

Try the steps below:

1. Set up an index on the title and abstract of all your documents.
2. Use the title or abstract of the new document as a query against the index to get a list of similar documents.
3. Take the first few most similar documents from that list and aggregate all of their tags into a tag bundle.
4. Sort the tag bundle by the frequency of each tag; the most frequent tags are the final result.

In our experience this approach works in practice.
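
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the setup we actually run: the `TinyIndex` class, the `suggest_tags` function, the toy corpus, and the `top_docs`/`top_tags` parameters are all made up for the example, and a small in-memory TF-IDF index with cosine similarity stands in for a real search engine. There is no stemming, so "brew" does not match "brewing" here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory TF-IDF index standing in for a search engine."""

    def __init__(self, docs):
        # docs: list of (text, tags) pairs; text is the title and/or abstract.
        self.tags = [tags for _, tags in docs]
        counts = [Counter(tokenize(text)) for text, _ in docs]
        doc_freq = Counter(term for c in counts for term in c)
        n = len(docs)
        # Smoothed inverse document frequency, always positive.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        # L2-normalised TF-IDF vector; terms unknown to the index are dropped.
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, top_k):
        """Return indices of the top_k most similar documents (score > 0)."""
        q = self._weigh(Counter(tokenize(query)))
        scored = []
        for i, vec in enumerate(self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:top_k]]


def suggest_tags(index, query, top_docs=3, top_tags=3):
    """Pool the tags of the nearest documents and rank them by frequency."""
    bundle = Counter()
    for i in index.search(query, top_docs):
        bundle.update(index.tags[i])
    # Descending frequency, then alphabetical so ties are deterministic.
    ranked = sorted(bundle.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_tags]]


# Toy corpus of already-tagged documents (invented for illustration).
corpus = [
    ("Brewing green tea at home", ["tea", "brewing"]),
    ("Green tea health benefits", ["tea", "health"]),
    ("Training for a marathon", ["running", "health"]),
    ("Choosing marathon running shoes", ["running", "gear"]),
]
index = TinyIndex(corpus)
suggestions = suggest_tags(index, "How to brew green tea", top_docs=2, top_tags=2)
print(suggestions)  # ['tea', 'brewing']
```

With 400 tags and a real corpus you would probably swap `TinyIndex` for an actual search engine and tune how many neighbours and tags to keep; the aggregation step in `suggest_tags` stays the same.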

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow