Configure fields for considering duplicates
29-06-2021
Question
Consider a Solr index with the following fields:
<fields>
<field name="id" type="uuid" indexed="true" stored="true" default="0"/>
<field name="user" stored="true" type="string" multiValued="false" indexed="true"/>
<field name="text" stored="true" type="textmulti" multiValued="false" indexed="true"/>
<field name="media" stored="true" type="string" multiValued="false" indexed="true"/>
</fields>
I would consider a newly indexed document to be a dupe (and therefore to be rejected) if an existing document has identical user and text fields, no matter what the id or media fields contain. A match on only user or only text is not enough to be considered a dupe; both user and text must match.
I have read through Document Duplication Detection and XML Messages for Updating a Solr Index on the Solr wiki but I still do not see how to configure this. Any ideas? I am using the wonderful solr-php-client to connect to Solr via PHP.
Thanks.
Solution
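You probably have some reason not to do so, but you could use the concatenation of user and text as the id. You would then not need Duplicate Detection at all, because Solr enforces uniqueness on the id for you. Note that with Solr's default overwrite behavior, adding a document whose id already exists replaces the existing document rather than rejecting the new one, so the index never holds two documents with the same user and text. Since the id field in your schema is currently of type uuid, it would have to become a plain string to hold such a value.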
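As a rough sketch of what the schema might look like under this approach: the id field becomes a string and is declared as the uniqueKey. This assumes that id is (or can be made) the uniqueKey of your index, and that your client code builds the id value from user and text before sending the document. How you join the two values (separator, hashing long text, and so on) is your choice and not something Solr does for you here.

```xml
<fields>
  <!-- id now holds a client-built key derived from user + text,
       so it is a plain string instead of a uuid (assumption: the
       client fills it in, hence no default value) -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="user" stored="true" type="string" multiValued="false" indexed="true"/>
  <field name="text" stored="true" type="textmulti" multiValued="false" indexed="true"/>
  <field name="media" stored="true" type="string" multiValued="false" indexed="true"/>
</fields>

<!-- Solr keeps at most one document per uniqueKey value -->
<uniqueKey>id</uniqueKey>
```

With this in place, two documents that share both user and text end up with the same id, while documents that differ in either field get different ids, provided the way you combine the two values cannot produce the same key for different pairs.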