Question

I've thought about this a bit (and looked at every "auto-generate tags for content" type post on StackOverflow).

I have an Article (body:string) with multiple Tags (joined through Taggings).

Right now, to suggest tags for an Article, the app uses pgsearch to search other Articles' body text for the (stemmed) words in the new Article's body, and suggests tags based on the tags of those related Articles. Of course, this only works if similar Articles have already been tagged, and as more Articles are tagged in the database, there may be better tags to use.
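To make the current approach concrete, here is a minimal Python sketch of the ranking step only. It assumes the related Articles have already been found (by pgsearch or anything else) and that each one exposes a list of tag names; the `suggest_tags` name and the dict shape are hypothetical, not part of the app.

```python
from collections import Counter


def suggest_tags(related_articles, limit=5):
    """Rank tags by how many related articles carry them."""
    counts = Counter()
    for article in related_articles:
        # set() so a tag repeated on one article is counted only once
        counts.update(set(article["tags"]))
    # Sort by count (descending), then by name so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:limit]]


# Hypothetical related articles, as a similarity search might return them
related = [
    {"tags": ["ruby", "rails"]},
    {"tags": ["rails", "postgres"]},
    {"tags": ["rails", "ruby", "search"]},
]
print(suggest_tags(related, limit=3))  # ['rails', 'ruby', 'postgres']
```

The weakness described above shows up directly: if none of the related Articles are tagged, `counts` stays empty and nothing can be suggested.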

Is there a smarter way, using say Elasticsearch, to automatically find the popular words in other Articles' body text (unique and stemmed) and auto-generate a list of tags from them?

If I were to do this myself, are there any examples to follow for doing this efficiently?

Solution

You can use the more-like-this query to find similar articles, and a terms facet to find the most popular tags among them:

curl -XGET 'http://127.0.0.1:9200/myindex/article/_search?pretty=1'  -d '
{
   "query" : {
      "more_like_this_field" : {
         "body" : {
            "min_doc_freq" : 1,
            "like_text" : "BODY OF THE NEW ARTICLE",
            "min_term_freq" : 1,
            "percent_terms_to_match" : 0.2
         }
      }
   },
   "facets" : {
      "tags" : {
         "terms" : {
            "field" : "tags"
         }
      }
   }
}
'

Depending on the size of your corpus, you may need to tune the more_like_this_field parameters (min_doc_freq, min_term_freq and percent_terms_to_match) to get the best matches.
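Note that more_like_this_field and facets belong to early Elasticsearch releases; as far as I know, later versions replaced them with the more_like_this query and with aggregations. The Python sketch below builds a roughly equivalent request body for a newer cluster and reads the tag counts back out of a response. It is only a sketch: the function names are hypothetical, "20%" for minimum_should_match is my approximation of percent_terms_to_match 0.2, and it assumes tags is mapped as a keyword (not analyzed) field.

```python
import json


def build_tag_suggestion_request(body_text, max_tags=10):
    """Build a search body: similar articles plus a count of their tags."""
    return {
        "size": 0,  # only the tag counts are needed, not the hits themselves
        "query": {
            "more_like_this": {
                "fields": ["body"],
                "like": body_text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Assumed stand-in for percent_terms_to_match: 0.2
                "minimum_should_match": "20%",
            }
        },
        "aggs": {
            "tags": {"terms": {"field": "tags", "size": max_tags}}
        },
    }


def tags_from_response(response):
    """Return (tag, doc_count) pairs from a terms aggregation named 'tags'."""
    buckets = response.get("aggregations", {}).get("tags", {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


request_body = build_tag_suggestion_request("BODY OF THE NEW ARTICLE")
print(json.dumps(request_body, indent=2))

# Hypothetical response fragment, only to show the shape being parsed
sample_response = {
    "aggregations": {
        "tags": {"buckets": [{"key": "rails", "doc_count": 12},
                             {"key": "search", "doc_count": 7}]}
    }
}
print(tags_from_response(sample_response))
```

The JSON printed by the first call can be sent as the body of the same kind of `_search` request shown above, for example with curl.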

OTHER TIPS

The best way to do this is to use the Elasticsearch Percolator API. Check out this answer:

Elasticsearch - use a "tags" index to discover all tags in a given string
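The idea behind percolation is the reverse of a normal search: the queries are stored (here, one per tag) and a new document is matched against all of them. The following is a plain-Python illustration of that idea only, not the Elasticsearch API; the stored queries, the `percolate` name and the naive tokenizer (no stemming) are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical stored queries: one per tag, each a set of required terms
TAG_QUERIES = {
    "elasticsearch": {"elasticsearch"},
    "full-text-search": {"full", "text", "search"},
    "rails": {"rails"},
}


def percolate(text, tag_queries):
    """Return the tags whose required terms all appear in the text."""
    tokens = tokenize(text)
    return sorted(tag for tag, terms in tag_queries.items() if terms <= tokens)


article_body = "Adding full text search to a Rails app with Elasticsearch"
print(percolate(article_body, TAG_QUERIES))
# ['elasticsearch', 'full-text-search', 'rails']
```

Unlike the more-like-this approach, this does not depend on similar articles already being tagged, but it does require a maintained list of tags and the terms that should trigger each one.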

Licensed under: CC-BY-SA with attribution