Question

I have a 200 MiB binary file containing pairs of Int64 and Int32 values. The file is already sorted by the Int64 keys.

I'm reading it like this:

private SortedDictionary<Int64, int> edgeID = new SortedDictionary<Int64, int>();

// Load the edge ID mapping file
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 keyID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();

    for (int i = 0; i < numArchi; i++)
    {
        keyID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[keyID] = grafoID;
    }
}

But it's super slow. I noticed that with a plain Dictionary I could speed things up a bit by passing the expected size to the constructor, but apparently SortedDictionary doesn't support that. I also think the bigger problem is that the program doesn't know I'm passing data that is already ordered, so it checks the key position on every insert.

Solution

Your second assumption is partially correct: SortedDictionary, in general, has no way of telling that the data is already ordered, so it needs to check that each key is inserted in the right place.

Furthermore, SortedDictionary is a red-black tree internally in the current Mono and Microsoft implementations of the BCL.

Insertion of already-ordered data is actually the worst-case scenario for a red-black tree (see the insertion section of any red-black tree reference).

Also, the implementation might change in the future, so optimizing for or against it is a risky move!

In comparison, Dictionary is a hash table internally (that's why you can give it an initial size: hash tables are implemented using arrays underneath, of different shapes depending on the implementation), so both insertion and lookup times are faster: amortized O(1).

You have to consider: what do you need it for? If you just want to search by key, and the data is already ordered, probably the best way is to load the data into a simple contiguous array (a .NET array or a List) and use binary search on it. You will need the overload that takes a comparer if your array holds Tuple elements, or you may just use two parallel arrays, one of "indexes" and one of "values"; your choice. A sketch of the two-array variant follows.
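For illustration, here is a minimal sketch of that two-array approach, assuming the file layout from the question (the EdgeLookup class name and TryGetValue helper are illustrative, not from the original answer):

using System;
using System.IO;

class EdgeLookup
{
    private readonly long[] keys;   // sorted Int64 keys, as stored in the file
    private readonly int[] values;  // Int32 values, parallel to keys

    public EdgeLookup(string path)
    {
        using (var b = new BinaryReader(File.Open(path, FileMode.Open)))
        {
            int count = b.ReadInt32();
            keys = new long[count];
            values = new int[count];

            // Sequential load into plain arrays: no per-insert ordering checks.
            for (int i = 0; i < count; i++)
            {
                keys[i] = b.ReadInt64();
                values[i] = b.ReadInt32();
            }
        }
    }

    // Binary search over the sorted key array: O(log N) per lookup.
    public bool TryGetValue(long key, out int value)
    {
        int index = Array.BinarySearch(keys, key);
        if (index >= 0)
        {
            value = values[index];
            return true;
        }
        value = 0;
        return false;
    }
}

Usage would be something like: var lookup = new EdgeLookup(@"file.bin"); lookup.TryGetValue(someKey, out int grafoID);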

This will be much faster than SortedDictionary and Dictionary on loading (and should be faster than SortedDictionary in searching, too).

It might be that Dictionary is still faster at searching for a single item: hash tables have O(1) access, while binary search on an ordered array is O(log N). But locality plays an important role at these "intermediate" data sizes, so the real result might differ from the theoretical one. You will need to profile and measure to decide which is best in your case!

(By intermediate sizes I mean: something for which the big-O notation does not yet play a dominant role.)
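If it helps, here is a rough measurement sketch (a hypothetical harness; the synthetic sorted keys stand in for the real file.bin data, and only the relative timings matter):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Benchmark
{
    static void Main()
    {
        // Synthetic sorted keys standing in for the real data set.
        var rng = new Random(42);
        const int n = 1_000_000;
        var dict = new Dictionary<long, int>(n);
        var keys = new long[n];
        long k = 0;
        for (int i = 0; i < n; i++)
        {
            k += rng.Next(1, 10);   // strictly increasing, hence sorted
            keys[i] = k;
            dict[k] = i;
        }

        // Hash-table lookups: amortized O(1) per item.
        var sw = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < n; i++)
            if (dict.TryGetValue(keys[i], out _)) hits++;
        sw.Stop();
        Console.WriteLine($"Dictionary:   {sw.ElapsedMilliseconds} ms ({hits} hits)");

        // Binary search on the sorted array: O(log N) but cache-friendly.
        sw.Restart();
        hits = 0;
        for (int i = 0; i < n; i++)
            if (Array.BinarySearch(keys, keys[i]) >= 0) hits++;
        sw.Stop();
        Console.WriteLine($"BinarySearch: {sw.ElapsedMilliseconds} ms ({hits} hits)");
    }
}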

OTHER TIPS

Use a Dictionary preallocated with the number of elements, like this:

using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 tomtomID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();

    // Preallocating the capacity avoids repeated internal resizing.
    edgeID = new Dictionary<long, int>(numArchi);

    for (int i = 0; i < numArchi; i++)
    {
        tomtomID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[tomtomID] = grafoID;
    }
}

This solved the issue.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow