MALLET for automatic topic tagging - with training data
05-07-2021
Question
I have a corpus of documents that I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been tagged with one or more tags and given a short title. (I also have a much larger list of titles, which I often reuse if a document contains very similar content.)
I want to make an interface that will suggest tags/titles (from my existing lists) for new documents that I add to the corpus, based on how I have tagged the existing documents.
I have read about the LDA (probabilistic topic model) classes, which look great for analyzing text when you don't have any existing tagged data, but I don't see any way to incorporate my existing work.
Any suggestions would be appreciated.
Kind Regards
Swami
Solution
For tag suggestion, in our experience a search engine is enough; there is no need for topic modeling.
Try the steps below:
- Set up an index on the title and abstract of all your documents.
- Use the title or abstract of the new document as a query against the index to retrieve a list of similar documents.
- Take the first few most similar documents from that list and aggregate all of their tags into a tag bundle.
- Sort the tag bundle by the frequency of each tag; the most frequent tags are the final suggestions.
This approach is workable in practice.
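The steps above can be sketched roughly as follows. This is only a minimal illustration, not the setup the answer actually used: a small in-memory TF-IDF index with cosine similarity stands in for a real search engine, and the `TagSuggester` class, its `k` and `top_n` parameters, and the toy corpus are all hypothetical names and data made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TagSuggester:
    """Toy stand-in for a search index: TF-IDF vectors plus cosine similarity."""

    def __init__(self, docs):
        # docs: list of (text, tags) pairs, where text is e.g. title + abstract.
        self.tags = [list(tags) for _, tags in docs]
        token_lists = [tokenize(text) for text, _ in docs]
        n = len(docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter()
        for tokens in token_lists:
            df.update(set(tokens))
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(tokens) for tokens in token_lists]

    def _vectorize(self, tokens):
        # Terms never seen in the indexed corpus are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def suggest(self, text, k=3, top_n=5):
        """Return up to top_n tags voted for by the k most similar documents."""
        query = self._vectorize(tokenize(text))
        scored = []
        for i, vec in enumerate(self.vectors):
            # Both vectors are unit length, so the dot product is the cosine.
            score = sum(w * vec.get(t, 0.0) for t, w in query.items())
            scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        # Aggregate the tags of the nearest documents into a "tag bundle".
        votes = Counter()
        for score, i in scored[:k]:
            if score > 0:  # skip documents that share no terms with the query
                votes.update(self.tags[i])
        # Most frequent tags first; ties are broken alphabetically.
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        return [tag for tag, _ in ranked[:top_n]]


# Hypothetical toy corpus, for illustration only.
corpus = [
    ("Training neural networks with gradient descent", ["machine-learning", "optimization"]),
    ("Convolutional neural networks for image recognition", ["machine-learning", "vision"]),
    ("Baking sourdough bread at home", ["cooking"]),
    ("Slow-cooked tomato sauce for pasta", ["cooking", "italian"]),
]
suggester = TagSuggester(corpus)
print(suggester.suggest("Recurrent neural networks for speech", k=2))
```

With 400 tags and a larger corpus, a real index (for example Lucene or Elasticsearch) would probably replace the linear scan, and `k` would need tuning. The same nearest-document vote could presumably be applied to the list of titles as well.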