Titan ES regex query across tokenized field?
21-12-2019
Question
I'm running Titan 0.4.0 and am trying to use the latest REGEX operator for the ES string search.
I've created an index on my_key for my ES index named search:
gremlin> g.makeKey("my_key").dataType(String.class).indexed("search",Vertex.class).single().make()
==>v[82]
Then I add a vertex:
gremlin> v = g.addVertex(null, ["my_key":"123-abc"])
==>v[8]
gremlin> v.map
==>{my_key=123-abc}
The REGEX seems to work...
gremlin> g.query().has("my_key", REGEX, "[12]{2}3").vertices()
==>v[8]
...but only on my tokenized "123" and "abc" independently:
gremlin> g.query().has("my_key", REGEX, "123").vertices()
==>v[8]
gremlin> g.query().has("my_key", REGEX, "abc").vertices()
==>v[8]
However, if I attempt to run a regular expression that matches my full value, my vertex is not retrieved (none of the below return results):
gremlin> g.query().has("my_key", REGEX, "123-abc").vertices()
gremlin> g.query().has("my_key", REGEX, "123.abc").vertices()
gremlin> g.query().has("my_key", REGEX, "[0-9]+.[abc]{3}").vertices()
gremlin> g.query().has("my_key", REGEX, "123.").vertices()
Is there a way in Titan to query the index in this way (regex w/o tokenized/analyzed terms)?
Solution
The way this was handled in Titan up through 0.4.0 can be a little confusing, because strings are always tokenized when they are indexed in an external indexing backend. As a result, strings are "chunked" into words, and non-letter characters (as well as stop words) are ignored.
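To make the effect concrete, here is a minimal Python sketch of what a tokenized index does to the value from the question. It is only a simulation: the tokenize and matches_any_term helpers are hypothetical names, the splitting rule is a rough stand-in for a real analyzer (no stop-word handling), and it assumes the regular expression has to match an indexed term in full. It is not Titan or Elasticsearch code.

```python
import re


def tokenize(value):
    """Split a string into lowercase alphanumeric terms.

    Rough stand-in for an analyzer: everything that is not a letter
    or digit acts as a separator and is dropped.
    """
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def matches_any_term(pattern, value):
    """Return True if the pattern matches at least one indexed term in full."""
    return any(re.fullmatch(pattern, term) for term in tokenize(value))


stored = "123-abc"
print("indexed terms:", tokenize(stored))

# Patterns taken from the question above.
for pattern in ["[12]{2}3", "123", "abc", "123-abc", "123.abc", "123."]:
    print(pattern, "->", matches_any_term(pattern, stored))
```

Under these assumptions the hyphen never reaches the index, so only patterns that fit a single term ("123" or "abc") can match, which is consistent with the behaviour shown in the question.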
In the upcoming Titan 0.4.1 release we are making this more explicit. Have a look at the updated documentation: https://github.com/thinkaurelius/titan/wiki/Full-Text-and-String-Search
The gist: You can now specify whether you want your strings indexed "as-is" or as a bag of words after analysis. For your use case, it would be the former. We also straightened out the terminology: If you are looking for words in a string matching a regular expression, the predicate Text.CONTAINS_REGEX is used. If you want the entire string to match an expression, use Text.REGEX.
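The difference between the two predicates can be modelled in a few lines of Python. This is a sketch of the semantics as described above, not of Titan's implementation: regex_whole and contains_regex are hypothetical helper names standing in for Text.REGEX and Text.CONTAINS_REGEX, and the word splitting is a simplifying assumption.

```python
import re


def regex_whole(pattern, value):
    """Model of whole-string matching: the entire value must match."""
    return re.fullmatch(pattern, value) is not None


def contains_regex(pattern, value):
    """Model of word-level matching: some word in the value must match."""
    # Assumption: words are runs of word characters, compared in lowercase.
    words = [w for w in re.split(r"\W+", value.lower()) if w]
    return any(re.fullmatch(pattern, w) for w in words)


value = "123-abc"
for pattern in ["123", "123.abc", "[0-9]+.[abc]{3}", "123."]:
    print(
        pattern,
        "| whole string:", regex_whole(pattern, value),
        "| any word:", contains_regex(pattern, value),
    )
```

In this model the full-value patterns from the question succeed only with whole-string matching, while a pattern such as "123" succeeds only with word-level matching.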
Titan 0.4.1 is currently in final preview and will be released next week.