Question

In my current project I need to index all e-mails and their attachments from multiple mailboxes.

I will use Solr, but I don't know the best way to structure my index. My first approach was:

<fields>
  <field name="id" required="true"/>
  <field name="uid" required="true"/>
  <!-- A lot of other fields -->
  <dynamicField name="attachmentName_*" required="false"/>
  <dynamicField name="attachmentBody_*" required="false"/>
</fields>

But now I am not sure this is the best structure. I don't think I can search for one term (e.g. stackoverflow) and find out which field it matched in (e.g. attachmentBody_1, attachmentBody_2 or attachmentBody_3) with a single query.

Does anyone have a better suggestion for my index structure?


Solution 2

I found one possible solution. All I need to do is set the attachmentBody fields as stored.

This solution is not ideal, because storing the attachment bodies will dramatically increase the size of the index. In my case that is not a problem, because I will also implement highlighting, and those fields need to be stored for that anyway.
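As a rough sketch of what that could look like: the first fragment marks the dynamic attachment fields as stored in the schema, and the second turns on highlighting for them by default in solrconfig.xml. The `text_general` and `string` field types and the `/select` handler name are assumptions taken from Solr's stock example configuration, not from the question, so adjust them to your own setup.

```xml
<!-- schema.xml: store the attachment text so it can be returned and highlighted.
     The field types are assumed (stock Solr example types). -->
<dynamicField name="attachmentName_*" type="string" indexed="true" stored="true"/>
<dynamicField name="attachmentBody_*" type="text_general" indexed="true" stored="true"/>

<!-- solrconfig.xml: highlight every attachmentBody_* field by default.
     The handler name /select is an assumption. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">attachmentBody_*</str>
  </lst>
</requestHandler>
```

The highlighting section of the response is keyed by document id and then by field name, so the field name tells you which attachment the term was found in.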

OTHER TIPS

You can use multiValued fields for attachmentName and attachmentBody, so you would have two regular fields instead of the dynamic fields. You can then use highlighting to bring back the specific values that matched, together with their surrounding context.
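A minimal sketch of the multiValued variant, assuming the stock `string` and `text_general` field types (the question does not say which types are in use):

```xml
<!-- schema.xml: one multiValued field per kind of attachment data,
     replacing the attachmentName_* / attachmentBody_* dynamic fields.
     Field types are assumed. -->
<field name="attachmentName" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="attachmentBody" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

Each attachment then contributes one value to each field, added in the same order, and a query with `hl=true&hl.fl=attachmentBody` returns snippets from the values that matched.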

Another option would be to make each attachment a separate document and store a field that identifies which email it belongs to. The downside of this approach is that you may need to index the data from the email itself several times, once per attachment. But this is really only a problem if most of the email messages have more than one attachment.
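To illustrate the one-document-per-attachment idea, here is a sketch of an XML update message. The `emailId` field name and all of the values are hypothetical, made up for this example; the schema would need matching single-valued fields.

```xml
<!-- Update message posted to Solr: two attachments of the same email,
     each indexed as its own document. emailId is a hypothetical field
     that links an attachment back to its email. -->
<add>
  <doc>
    <field name="id">mail-42-att-1</field>
    <field name="emailId">mail-42</field>
    <field name="attachmentName">report.pdf</field>
    <field name="attachmentBody">extracted text of the first attachment</field>
  </doc>
  <doc>
    <field name="id">mail-42-att-2</field>
    <field name="emailId">mail-42</field>
    <field name="attachmentName">notes.txt</field>
    <field name="attachmentBody">extracted text of the second attachment</field>
  </doc>
</add>
```

With this layout every hit is a single attachment, so the matching document itself tells you where the term occurred, and `emailId` leads back to the email.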

Licensed under: CC-BY-SA with attribution