Question

I need some assistance deciding between creating a single index in a single Solr instance and creating multiple cores in a single Solr instance, each core serving one index. My understanding is that a single index in Solr is usually implemented to index one type of document. What is the best practice when you have different document types? For example, if you want to index details of an invoice transaction, you could create a schema with fields for an invoice transaction document as follows:

  • invoiceDate
  • dueDate
  • invoiceSummary
  • billingContact
  • invoiceLineItems
  • notes

Let's say you also want to index details of products. Would you create a new document type with a schema as follows:

  • productCode
  • productDescription
  • sellingPrice
  • buyingPrice
  • onHand
  • avgCost
  • notes

and create a new core in Solr to index product documents? Or would you merge both invoice transactions and products into one schema as follows:

  • invoiceDate
  • dueDate
  • invoiceSummary
  • billingContact
  • invoiceLineItems
  • productCode
  • productDescription
  • sellingPrice
  • buyingPrice
  • onHand
  • avgCost
  • notes

and have just the one core indexing the above document, instead of having an "Invoice" core and a "Product" core indexing the two different documents?

I guess it makes sense to have a single flat index, as suggested in the Solr wiki, when the fields are similar. However, in an example like the one above, the data are not even remotely related because they are separate entities. I have seen people suggest adding an extra field to distinguish between the different entities, such as a table name field, and filtering the query on that field, which I guess works. I am not sure how well that scales, though, when you have a use case like the following:

"Search invoices for the keyword 'John'; the fields to search are 'billingContact', 'invoiceSummary', 'notes'. Boost the 'billingContact' field at query time. Also search products for 'John'; the fields to search are 'productDescription', 'supplier', 'notes'. Boost 'supplier' at query time. Return only 100 invoices and 100 products."

The application I am working on needs to search across invoices and products from a single form. There are no separate parts of the application that search for different things.
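To make the single-index option concrete, here is a minimal sketch of how the use case above might look with one combined core and a discriminator field. The core name `combined`, the field name `docType`, the boost factor of 2, and the local Solr URL are all assumptions for illustration; `defType`, `q`, `qf`, `fq`, and `rows` are standard Solr query parameters. Note that even with a single index, two requests are still needed to get 100 invoices and 100 products with different boosts.

```python
from urllib.parse import urlencode

# Assumed local Solr instance; adjust to your deployment.
SOLR_BASE = "http://localhost:8983/solr"


def typed_query_url(core, doc_type, keyword, boosted_field, other_fields,
                    rows=100, boost=2):
    """Build a select URL restricted to one entity type in a shared index.

    'docType' is a hypothetical discriminator field added to every document.
    """
    # Query fields for edismax: the boosted field first, then the rest.
    qf = " ".join([f"{boosted_field}^{boost}", *other_fields])
    params = {
        "defType": "edismax",         # query parser that understands qf boosts
        "q": keyword,
        "qf": qf,
        "fq": f"docType:{doc_type}",  # filter query on the discriminator field
        "rows": rows,
    }
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


invoice_url = typed_query_url(
    "combined", "invoice", "John",
    "billingContact", ["invoiceSummary", "notes"])
product_url = typed_query_url(
    "combined", "product", "John",
    "supplier", ["productDescription", "notes"])

print(invoice_url)
print(product_url)
```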

My concerns about putting everything in one index:

1) Large index size, e.g. 50 million invoices + 50 million products in a single index.

2) Reindexing an index of that size.

3) Index tuning: wouldn't it be easier to tweak/tune each separate index to serve specific expected search outcomes, rather than trying to do that in a single index?

4) We may decide to index billing contact details as well in the future, which would add more fields to be indexed and contribute to my concerns in points 1) and 2).


Solution

"Return only 100 invoices and 100 products."

and also

"Boost 'billingContact' field at query time" and "Boost 'supplier' at query time"

This suggests that even though you are searching for the same terms, you are searching for them as separate concepts.

Based on this and the lack of common fields, I would recommend starting with separate collections.
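As a rough sketch of what that could look like from the application side, the single search form can fan out into one request per core, each with its own boosts and its own limit of 100 rows. The core names `invoices` and `products`, the boost values, and the local Solr URL are assumptions for illustration; the field names come from the question.

```python
from urllib.parse import urlencode

# Assumed local Solr instance; adjust to your deployment.
SOLR_BASE = "http://localhost:8983/solr"

# Hypothetical core names and boost weights, one search profile per core.
SEARCH_PROFILES = {
    "invoices": {"billingContact": 2.0, "invoiceSummary": 1.0, "notes": 1.0},
    "products": {"supplier": 2.0, "productDescription": 1.0, "notes": 1.0},
}


def build_requests(keyword, rows=100):
    """Return one select URL per core for the same keyword."""
    urls = {}
    for core, boosts in SEARCH_PROFILES.items():
        # edismax 'qf' takes space-separated field^boost pairs.
        qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
        params = {"defType": "edismax", "q": keyword, "qf": qf, "rows": rows}
        urls[core] = f"{SOLR_BASE}/{core}/select?{urlencode(params)}"
    return urls


urls = build_requests("John")
for core, url in urls.items():
    print(core, url)
```

Each core can then be tuned and reindexed on its own, and no discriminator filter is needed because each core holds only one entity type.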

Licensed under: CC-BY-SA with attribution