Question

Spoiler: this is just another Lucene vs. Sphinx vs. whatever question.
I saw that all the other threads on this were almost two years old, so I decided to ask again.

Here are the requirements:

data size: 10 GB max
rows: close to a billion
indexing should be fast
searching should be under 0 ms (OK, joke... laugh... but keep this as low as possible)

Given today's options, which tool should I pick, and how should I go about it?

Edit: I did some timing with Lucene. Indexing 1.8 GB of data took 5 minutes.
Searching is pretty fast, unless I run a wildcard query like a*, which takes 400-500 ms.
My biggest worry is indexing, which is taking a long time and a lot of resources!
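A quick back-of-envelope check on the numbers reported above (my calculation, not from the original post): 1.8 GB indexed in 5 minutes is only about 6 MB/s of raw input, which is indeed well below what a tuned Lucene setup typically sustains.

```python
# Rough throughput implied by the reported timing: 1.8 GB in 5 minutes.
size_mb = 1.8 * 1024   # ~1843 MB of input data
seconds = 5 * 60       # 300 s
throughput = size_mb / seconds
print(round(throughput, 1))  # ~6.1 MB/s
```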


Solution

I have no experience with anything other than Lucene. It's pretty much the default indexing solution, so I don't think you can go too wrong with it.

10 GB is not a lot of data. You'll be able to re-index it pretty rapidly, or keep it on SSDs for extra speed. And of course you can keep your whole index in RAM (which Lucene supports) for super-fast lookups.
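As a concrete illustration of the RAM-resident option: in Lucene 8+ the heap-backed directory is ByteBuffersDirectory (it replaced the deprecated RAMDirectory). A minimal sketch, assuming Lucene is on the classpath; the field name "body" is illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class InMemoryIndex {
    public static void main(String[] args) throws Exception {
        // Heap-resident index: no disk I/O on the read path at all.
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "hello lucene", Store.NO));
            writer.addDocument(doc);
        }
        // Open an IndexSearcher over `dir` for in-memory lookups.
    }
}
```

At 10 GB the index (typically smaller than the raw data) fits comfortably in RAM on a modern server, so this is a realistic option, not just a toy.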

OTHER TIPS

Please check the Lucene wiki for tips on improving Lucene indexing speed. It is quite succinct. In general, Lucene is quite fast (it is used for real-time search). The tips will be handy for figuring out whether you are missing something "obvious."
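A few of the standard knobs from that wiki page can be sketched as follows. This is a hedged example, not a definitive recipe; the buffer size is an illustrative value you should benchmark for your own hardware:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class IndexingTuning {
    public static void main(String[] args) {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // Flush segments by RAM usage rather than document count;
        // larger buffers mean fewer, bigger flushes. 256 MB is illustrative.
        cfg.setRAMBufferSizeMB(256);
        // Skip compound-file packing to reduce write amplification
        // (at the cost of more open file handles).
        cfg.setUseCompoundFile(false);
        // Rebuild from scratch instead of appending.
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        // Also: reuse a single IndexWriter and feed it documents from
        // multiple threads; IndexWriter is thread-safe.
    }
}
```

Of these, raising the RAM buffer and indexing with multiple threads usually give the largest wins for bulk loads.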

"My biggest worry is indexing, which is taking a long time and a lot of resources!"

Take a look at LuSql; we used it once. FWIW, indexing 100 GB of data from MySQL on a decent machine took little more than an hour, with the index on an ordinary filesystem (NTFS).

If you add an SSD or other ultra-fast disk technology, you can bring that down considerably.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow