Question

I have a problem with facets when the tags contain spaces.

I have the following mappings:

    curl -XPOST "http://localhost:9200/pictures" -d '
    {
      "mappings" : {
        "pictures" : {
                "properties" : {
                    "id": { "type": "string" },
                    "description": {"type": "string", "index": "not_analyzed"},
                    "featured": { "type": "boolean" },
                    "categories": { "type": "string", "index": "not_analyzed" },
                    "tags": { "type": "string", "index": "not_analyzed", "analyzer": "keyword" },
                    "created_at": { "type": "double" }
                }
            }
        }
    }'

And My Data is:

    curl -X POST "http://localhost:9200/pictures/picture" -d '{
      "picture": {
        "id": "4defe0ecf02a8724b8000047",
        "title": "Victoria Secret PhotoShoot",
        "description": "From France and Italy",
        "featured": true,
        "categories": [
          "Fashion",
          "Girls",
        ],
        "tags": [
          "girl",
          "photoshoot",
          "supermodel",
          "Victoria Secret"
        ],
        "created_at": 1405784416.04672
      }
    }'

And My Query is:

    curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
    {
      "query": {
        "text": {
          "tags": {
            "query": "Victoria Secret"
          }
        }
      },
      "facets": {
        "tags": {
          "terms": {
            "field": "tags"
          }
        }
      }
    }'

The Output result is:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 0,
        "max_score" : null,
        "hits" : [ ]
      },
      "facets" : {
        "tags" : {
          "_type" : "terms",
          "missing" : 0,
          "total" : 0,
          "other" : 0,
          "terms" : [ ]
        }
      }
    }

Now I get total 0 in the facets and total: 0 in the hits.
Any idea why it isn't working?
I know that when I remove the keyword analyzer from tags and just leave it "not_analyzed", I get results.
But there is still a problem with case sensitivity.
If I run the same query after removing the keyword analyzer, I get this result:

    "facets" : {
      "tags" : {
        "_type" : "terms",
        "missing" : 0,
        "total" : 12,
        "other" : 0,
        "terms" : [ {
          "term" : "photoshoot",
          "count" : 1
        }, {
          "term" : "girl",
          "count" : 1
        }, {
          "term" : "Victoria Secret",
          "count" : 1
        }, {
          "term" : "supermodel",
          "count" : 1
        } ]
      }
    }


Here "Victoria Secret" keeps its space and shows up as a single term when the field is "not_analyzed", but matching is case sensitive: when I query with the lowercase "victoria secret" I don't get any results.


Any suggestions??

Thanks,
Suraj


Solution

The first examples are not totally clear to me. If you use the KeywordAnalyzer, the field is indexed as it is, but then it makes much more sense to simply not analyze the field at all, which amounts to the same thing. The mapping you posted contains both

"index": "not_analyzed", "analyzer": "keyword"

which doesn't make a lot of sense. If you are not analyzing the field, why would you select an analyzer for it?

Apart from this, if you don't analyze the field, the tag Victoria Secret will of course be indexed as it is, so the query victoria secret won't match. If you want matching to be case-insensitive you need to define a custom analyzer that uses the KeywordTokenizer, since you don't want to tokenize the value, together with the lowercase token filter. You can define a custom analyzer through the analysis section of the index settings and then use it in your mapping. But that way the facet entries would always be lowercase, which is probably not what you want. That's why it's better to define a multi field and index the value with two different analysis chains, one for the facet and one for search.

You can create the index like this:

    curl -XPOST "http://localhost:9200/pictures" -d '{
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "lowercase_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "keyword",
                        "filter" : [ "lowercase" ]
                    }
                }
            }
        },
        "mappings" : {
            "pictures" : {
                "properties" : {
                    "id": { "type": "string" },
                    "description": { "type": "string", "index": "not_analyzed" },
                    "featured": { "type": "boolean" },
                    "categories": { "type": "string", "index": "not_analyzed" },
                    "tags" : {
                        "type" : "multi_field",
                        "fields" : {
                            "tags": { "type": "string", "analyzer": "lowercase_analyzer" },
                            "facet": { "type": "string", "index": "not_analyzed" }
                        }
                    },
                    "created_at": { "type": "double" }
                }
            }
        }
    }'

Then the custom lowercase_analyzer will also be applied by default to the text query when you search on that field, so you can search for either Victoria Secret or victoria secret and get the result back. You also need to change the facet part and run the terms facet on the new tags.facet field, which is not analyzed.
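
For example, the reworked search request could look like this (just a sketch, assuming the index was created as above): the query still runs against tags, while the facet points at the not_analyzed tags.facet sub-field.

    curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
    {
      "query": {
        "text": {
          "tags": {
            "query": "victoria secret"
          }
        }
      },
      "facets": {
        "tags": {
          "terms": {
            "field": "tags.facet"
          }
        }
      }
    }'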

Furthermore, you might want to have a look at the match query, since the text query has been deprecated as of the latest elasticsearch version (0.19.9).
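
As a sketch, the text clause above would translate to a match clause like the following; the rest of the request stays the same.

    "query": {
      "match": {
        "tags": {
          "query": "victoria secret"
        }
      }
    }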

OTHER TIPS

I think this gist is relevant to my answer:

https://gist.github.com/2688072

Licensed under: CC-BY-SA with attribution