Question

I am trying to index millions of strings that are associated with metadata objects.

Each metadata object can have thousands of strings.

I need to be able to search both the string content and the associated object metadata.

Currently, this means I am indexing copies of the relevant metadata fields with each string, which leads to ridiculous amounts of duplication and incredibly large index sizes.

In a relational database model, I could store one copy of the metadata and join the tables to filter and search by the combined fields, but I can't see any way of eliminating this duplication in Solr.

Is there something obvious I am missing, or is Solr just the wrong tool for the job?


Solution

Solr supports joins. A Solr join behaves more like a subquery than a join in relational database terms, but it might do what you want: Solr can return the metadata objects that have one or more strings matching your query. With a second, non-join query, you can also find out which strings matched. (Note: a related SO question explains why you cannot yet get both the metadata objects and the matched strings with one query.)

If your metadata objects and strings have a 1-to-N relationship, you should also look into block join, which is designed for such relationships. You can index the metadata objects as parent documents and the strings as child documents.
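To make the two options more concrete, here is a minimal Python sketch that only builds the request parameters and the nested document; it does not contact a Solr server. The `{!join}` and `{!parent}` query parsers and the `_childDocuments_` key are Solr features, but every field name (`id`, `parent_id`, `doc_type`, `text`, `author`), the helper function names, and the URL are assumptions made up for illustration, so adapt them to your own schema.

```python
from urllib.parse import urlencode

# NOTE: all field names below (id, parent_id, doc_type, text) are hypothetical.


def join_query_params(string_query, metadata_filter=None):
    """Query-time join over flat documents.

    Assumes each string document stores its metadata document's id in
    `parent_id`. The join finds strings matching `string_query`, then
    returns the metadata documents whose `id` appears in `parent_id`.
    """
    params = [("q", "{!join from=parent_id to=id}" + string_query)]
    if metadata_filter:
        # Filter applied to the returned metadata documents.
        params.append(("fq", metadata_filter))
    return params


def block_join_parent_params(string_query, parent_filter="doc_type:metadata"):
    """Block join: return parent documents that have a matching child.

    `parent_filter` must match every parent document in the index.
    """
    return [("q", '{!parent which="%s"}%s' % (parent_filter, string_query))]


def nested_document(meta_id, metadata, strings):
    """Build one parent document with its strings nested as children.

    The metadata fields are stored once, on the parent only.
    """
    doc = {"id": meta_id, "doc_type": "metadata"}
    doc.update(metadata)
    doc["_childDocuments_"] = [
        {"id": "%s-%d" % (meta_id, i), "doc_type": "string", "text": s}
        for i, s in enumerate(strings)
    ]
    return doc


def select_url(base_url, params):
    """Append URL-encoded parameters to a (hypothetical) select endpoint."""
    return base_url + "?" + urlencode(params)
```

For example, `select_url("http://localhost:8983/solr/strings/select", join_query_params("text:foo", "author:smith"))` would build a request for metadata documents by a given author that own at least one string matching `text:foo`. Either way, the metadata lives in a single document per object instead of being copied onto every string.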

Licensed under: CC-BY-SA with attribution