What is the most efficient data indexing technique

https://datascience.stackexchange.com/questions/143

16-10-2019
|

Question

As we all know, there are some data indexing techniques, using by well-known indexing apps, like Lucene (for java) or Lucene.NET (for .NET), MurMurHash, B+Tree etc. For a No-Sql / Object Oriented Database (which I try to write/play a little around with C#), which technique you suggest?

I read about MurMurhash-2 and specially v3 comments say Murmur is very fast. Also Lucene.Net has good comments on it. But what about their memory footprints in general? Is there any efficient solution which uses less footprint (and of course if faster is preferable) than Lucene or Murmur? Or should I write a special index structure to get the best results?

If I try to write my own, then is there any accepted scale for a good indexing, something like 1% of data-node, or 5% of data-node? Any useful hint will be appreciated.

Solution

I think you messed up some things in your question. Lucene (I know nothing about Lucene,NET, but I suppose is the same) is a library used to analyze, split in tokens, and store documents in order to be able to query and retrieve them later. Lucene has a pretty old but effective model, it uses inverted trees to find and retrieve documents. Without further details, all documents are split in tokens (terms), and for each term is maintained a data structure, which stores all the documents which contains the given term. As a data structure could be used a BTree, a hash table and in the latest major revisions you can even plug in your own data structures.

A BTree (see Wikipedia page for further details), is a kind of a tree data structure, which is appropriate for working with big chunks of data and is often used for storing tree-like ordered structures on disk. For in-memory other trees performs better.

Murmur hash (see Wikipedia page for further details), is a family of hash functions used in hash table. The implementation of the hash table is not important, it could be a standard chained implementation or more advanced open hash addressing scheme. The idea is that the hash tables allows one to get fast a key, from an unordered set of keys, and can answer to tasks like: is this key part of this set of keys? which is the value associated with this key?

Now back to your main problem. You have one library (Lucene) and to data structures, both data structures are used in Lucene. Now you see that it is not possible to answer your question in these terms since they are not comparable.

However, regarding you footprint and performance part of the question. First of all you have to know which kind of operations you need to implement.

Do you need only get value for key, or do you need to find all elements in a range? In other words do you need order or not? If you do, than a tree can help. If you do not, than a hash table, which is faster could be used instead.

Do you have a lot of data which does not fit the memory? If yes than a disk-based solution would help (like BTree). If your data fit the memory, than use the fastest in-memory solution and use disk only as a storage (with a different structure, much simpler).

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange