Question

I have a Solr search setup that uses a Lucene index as its backend. I also have some data in Hadoop that I would like to use. How do I copy this data into Solr?

When I search Google, the only links I can find explain how to make Solr use an HDFS index instead of a local index. I don't want to read the index directly from Hadoop; I want to copy the data into Solr and read it from there.

How do I copy it? It would also be great if there were some incremental copy mechanism.

Solution

If you have a standalone Solr instance, you could face scaling issues, depending on the volume of data.

I am assuming a high volume, given that you are using Hadoop/HDFS. In that case, you might need to look at SolrCloud.

As for reading from HDFS, here is a tutorial from LucidImagination that addresses this issue and recommends the use of Behemoth.

You might also want to look at the Katta project, which claims to integrate with Hadoop and provide near real-time read access to large datasets. The architecture is illustrated here.

EDIT 1

Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.

Other answers

You cannot just copy custom data into Solr; you need to index* it. Your data may have any type and format (free text, XML, JSON or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
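As a rough illustration of "create documents and add them to Solr", here is a minimal Python sketch using only the standard library. It assumes a Solr version whose `/update` handler accepts a JSON array of documents; the field names (`id`, `title`, `body`), the base URL and the core name are all hypothetical and must match your own schema and setup.

```python
import json
import urllib.request


def to_solr_doc(record):
    """Flatten one source record into a flat key/value Solr document.

    The field names here are placeholders; use the ones in your schema.
    """
    return {
        "id": str(record["id"]),
        "title": record.get("title", ""),
        "body": record.get("body", ""),
    }


def build_update_request(solr_url, collection, docs):
    """Build (but do not send) a POST to the JSON update handler."""
    url = f"{solr_url}/{collection}/update?commit=true"
    payload = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running Solr instance; returns the HTTP status."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would look something like `send(build_update_request("http://localhost:8983/solr", "mycore", [to_solr_doc(r) for r in records]))`, where `records` is whatever you have parsed out of your source data. Committing on every request is simple but slow for large loads, so for bulk imports you would probably commit less often.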

Note that reading data from HDFS is a separate question. For Solr, it doesn't matter where you read the data from, as long as you provide it with documents.
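To make the "any source works" point concrete, here is one possible sketch of the reading side. It assumes the data in HDFS is a text file with one JSON document per line (an assumption, your format may differ) and that the `hdfs` command-line client is on the PATH. The helper names are made up for illustration.

```python
import itertools
import json
import subprocess


def read_hdfs_lines(hdfs_path):
    """Stream a text file out of HDFS line by line via `hdfs dfs -cat`."""
    proc = subprocess.Popen(
        ["hdfs", "dfs", "-cat", hdfs_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()


def lines_to_docs(lines):
    """Parse newline-delimited JSON into dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(docs, size):
    """Group documents into lists of at most `size` for bulk updates."""
    iterator = iter(docs)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch from `batched(lines_to_docs(read_hdfs_lines(path)), 1000)` can then be sent to Solr in a single update request, which keeps memory use bounded no matter how large the file is.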

Whether to store the index on local disk or in HDFS is also a separate question. If you expect your index to be really large, you can configure Solr to use HDFS. Otherwise, you can keep the default properties and use the local disk.

* "Indexing" is the common term for adding documents to Solr, but adding documents to Solr's internal storage and indexing them (making fields searchable) are in fact two distinct things that can be configured separately.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow