Question

I am trying to write a small program that searches a key-value type structure. My goal is to find the FASTEST approach possible for searching the key-value data.

I would prefer to use C# for this program, unless another language gives me a significant advantage. Another limitation I am imposing is that everything has to be on the same computer. I don't want to use an Oracle or SQL Server database, because I believe the other options will be much faster. The data is mostly read and rarely written. Whenever there are changes or updates to the data, a new set is created, and it is OK if writing the data takes time.

Assumptions:

  - The data is sorted in numeric order.
  - The structure is as simple as this:

Char3 file: (This file will only store 3-character keys)
Key|Value
100|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
999|2,5,6,7:9:3,4:3:2,5

Char5 file: (This file will only store 5-character keys)
Key|Value
A1000|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
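
For what it's worth, a value like 2,5,6,7:9:3,4,5:3,4,5:2,5,6,7 reads as colon-separated groups of comma-separated numbers. A parse of that layout (my interpretation; the question does not spell it out) could look like this:

    using System;
    using System.Linq;

    class ValueParser
    {
        // "2,5,6,7:9:3,4,5" -> [[2,5,6,7], [9], [3,4,5]]
        static int[][] Parse(string value) =>
            value.Split(':')
                 .Select(group => group.Split(',').Select(int.Parse).ToArray())
                 .ToArray();

        static void Main()
        {
            var groups = Parse("2,5,6,7:9:3,4,5:3,4,5:2,5,6,7");
            Console.WriteLine(groups.Length);     // 5 groups
            Console.WriteLine(groups[0].Length);  // 4 numbers in the first group
        }
    }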

Char3 and Char5 follow the same storage structure but have different types of keys. Within a given file, however, every key is the same length.

I have multiple files like these; each file follows the same structure. The only variation is the key length in each file.

The task: given a set of 1-200 keys (of variable lengths), find all the data related to each key.

I am generating this data from a database and thus can create the data in any format.

For the FileStream test I am going to pad each line in a given file to a fixed length and then use FileStream.Seek to jump straight to a given record based on that padding.
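
A rough sketch of that padded-record lookup, assuming a made-up record length of 256 bytes and file name char3.dat; since the lines are sorted, a binary search keeps the number of seeks logarithmic:

    using System;
    using System.IO;
    using System.Text;

    class FixedWidthLookup
    {
        // Hypothetical layout: every line padded to RecordLength bytes,
        // 3-character keys, lines sorted by key.
        const int RecordLength = 256;
        const int KeyLength = 3;

        // Binary search over fixed-width records using FileStream.Seek.
        static string Find(FileStream fs, string key)
        {
            long lo = 0, hi = fs.Length / RecordLength - 1;
            var buffer = new byte[RecordLength];

            while (lo <= hi)
            {
                long mid = lo + (hi - lo) / 2;
                fs.Seek(mid * RecordLength, SeekOrigin.Begin);
                fs.Read(buffer, 0, RecordLength);

                string line = Encoding.ASCII.GetString(buffer).TrimEnd('\0', '\r', '\n', ' ');
                int cmp = string.CompareOrdinal(line.Substring(0, KeyLength), key);
                if (cmp == 0) return line.Substring(KeyLength + 1);  // skip "key|"
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null;  // key not present
        }

        static void Main()
        {
            using (var fs = new FileStream("char3.dat", FileMode.Open, FileAccess.Read))
                Console.WriteLine(Find(fs, "100") ?? "not found");
        }
    }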

What I want to do is find out which of these approaches would be the fastest:

  1. FileStream - I will eventually also look at memory-mapped files. (Open to other options)
  2. Embedded SQL - SQLite (Open to other options; see the sketch after this list)
  3. NoSQL - ?? (Looking for suggestions)
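
For the SQLite option, a fair equivalent would be a single indexed point lookup. A sketch using the Microsoft.Data.Sqlite package, with a hypothetical table char3(key TEXT PRIMARY KEY, value TEXT):

    using System;
    using Microsoft.Data.Sqlite;  // NuGet package: Microsoft.Data.Sqlite

    class SqliteLookup
    {
        static string Find(SqliteConnection conn, string key)
        {
            var cmd = conn.CreateCommand();
            // Single point lookup; relies on the PRIMARY KEY index on "key".
            cmd.CommandText = "SELECT value FROM char3 WHERE key = $key";
            cmd.Parameters.AddWithValue("$key", key);
            return (string)cmd.ExecuteScalar();  // null when the key is absent
        }

        static void Main()
        {
            using (var conn = new SqliteConnection("Data Source=char3.db"))
            {
                conn.Open();
                Console.WriteLine(Find(conn, "100") ?? "not found");
            }
        }
    }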

My question is: what should I be using in each of these categories for a proper comparison? For example, if I were using FileStream but not FileStream.Seek, then it would not be a proper comparison.
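
In the same spirit, the memory-mapped variant mentioned under option 1 would use random-access views rather than sequential reads. A sketch, reusing the same hypothetical 256-byte padded layout as above:

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Text;

    class MmfLookup
    {
        const int RecordLength = 256;  // same hypothetical padding as above
        const int KeyLength = 3;

        static string Find(MemoryMappedViewAccessor view, long recordCount, string key)
        {
            var buffer = new byte[RecordLength];
            long lo = 0, hi = recordCount - 1;
            while (lo <= hi)
            {
                long mid = lo + (hi - lo) / 2;
                // Random access straight into the mapped view; no Seek/Read pair.
                view.ReadArray(mid * RecordLength, buffer, 0, RecordLength);
                string line = Encoding.ASCII.GetString(buffer).TrimEnd('\0', '\r', '\n', ' ');
                int cmp = string.CompareOrdinal(line.Substring(0, KeyLength), key);
                if (cmp == 0) return line.Substring(KeyLength + 1);
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null;
        }

        static void Main()
        {
            long records = new FileInfo("char3.dat").Length / RecordLength;
            using (var mmf = MemoryMappedFile.CreateFromFile("char3.dat", FileMode.Open))
            using (var view = mmf.CreateViewAccessor())
                Console.WriteLine(Find(view, records, "100") ?? "not found");
        }
    }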

I would eventually also like to run the searches in parallel as much as I can. My primary requirement is SEARCH performance.
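
The parallel part could be as simple as PLINQ over the key set. A sketch, where Lookup is a stand-in for whichever implementation is under test:

    using System.Collections.Generic;
    using System.Linq;

    static class ParallelSearch
    {
        // Stand-in for whichever implementation is being benchmarked
        // (FileStream, SQLite, Berkeley DB, ...). It must be thread-safe,
        // e.g. open its own stream or connection per call.
        static string Lookup(string key) => null;

        // Fan the 1-200 key lookups out across cores with PLINQ.
        public static Dictionary<string, string> FindAll(IEnumerable<string> keys) =>
            keys.AsParallel()
                .Select(k => (Key: k, Value: Lookup(k)))
                .Where(r => r.Value != null)
                .ToDictionary(r => r.Key, r => r.Value);
    }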

Any ideas or suggestions would be great.

Thanks,

UPDATE: I will list the option details and results as I process them.
Find 5000 random entries (by line number or some other similar characteristic) in a file that contains 10K lines, 2.28 MB.

  1. FileStream options - Best time: 00:00:00.0398530 (about 40 ms)
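
For context, timings in that TimeSpan format typically come from Stopwatch; a harness along these lines (my own sketch, not the poster's code) would reproduce the measurement:

    using System;
    using System.Diagnostics;
    using System.IO;

    class Bench
    {
        const int RecordLength = 256;  // hypothetical padded line length

        static void Main()
        {
            var rng = new Random();
            var buffer = new byte[RecordLength];

            using (var fs = new FileStream("char3.dat", FileMode.Open, FileAccess.Read))
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < 5000; i++)
                {
                    long record = rng.Next(0, 10000);  // random line in the 10K-line file
                    fs.Seek(record * RecordLength, SeekOrigin.Begin);
                    fs.Read(buffer, 0, RecordLength);
                }
                sw.Stop();
                Console.WriteLine(sw.Elapsed);  // prints a TimeSpan like 00:00:00.0398530
            }
        }
    }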

Solution

Your best bet is Berkeley DB, via the C# API (which uses key-value pair storage). Berkeley DB is a library, so it links into your application. There is no separate server to install and no client/server overhead. Berkeley DB is extremely fast, scalable, and reliable, and it is designed to do exactly what you describe here.
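
A minimal lookup through the Berkeley DB .NET bindings might look roughly like this; the file name is made up and the configuration is the bare minimum, so check the BerkeleyDB namespace documentation before relying on it:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using BerkeleyDB;  // the .NET bindings shipped with the Berkeley DB distribution

    class BdbLookup
    {
        static void Main()
        {
            // Open (or create) a B-tree key/value database; "char3.db" is made up.
            var cfg = new BTreeDatabaseConfig { Creation = CreatePolicy.IF_NEEDED };
            BTreeDatabase db = BTreeDatabase.Open("char3.db", cfg);
            try
            {
                var key = new DatabaseEntry(Encoding.UTF8.GetBytes("100"));
                // Get returns the matching key/value pair or throws NotFoundException.
                KeyValuePair<DatabaseEntry, DatabaseEntry> pair = db.Get(key);
                Console.WriteLine(Encoding.UTF8.GetString(pair.Value.Data));
            }
            catch (NotFoundException)
            {
                Console.WriteLine("not found");
            }
            finally
            {
                db.Close();
            }
        }
    }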

Disclaimer: I'm the Product Manager for Berkeley DB, so I'm a little biased. But I'm serious when I say that this is exactly the scenario that Berkeley DB is designed for.

OTHER TIPS

As far as I understand, your data is already in a database, indexed and ready to be searched. What you want to do is extract it from the database and implement your own custom search scheme, manually manipulating byte offsets in a file and so on. IMHO this approach is bound to fail.

Not using a database because of one's beliefs is known to not be the best approach to performance tuning. :-)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow