Question

I have a Dictionary<string, int[]> which I need to store to and retrieve from disk as efficiently as possible.

The key length (string) will typically vary from 1 to 60 characters (Unicode) but could exceed that length (such cases are marginal and those entries could be discarded). The integers in the array will be in the range 1 to 100 million (typically 1 to 5 million).

My first idea was to use a delimited format:

key [tab] int,int,int,int,...
key2 [tab] int,int,int,int,...
...

and to load the dictionary as follows:

string[] Lines = File.ReadAllLines(sIndexName);
string[] keyValues = new string[2];
List<string> lstInts =  new List<string>();
// Skip the header line of the index file.
for (int i = 1; i < Lines.Length; i++)
{
    lstInts.Clear();
    keyValues = Lines[i].Split('\t');
    if (keyValues[1].Contains(','))
    {
        lstInts.AddRange(keyValues[1].Split(','));
    }
    else
    {
        lstInts.Add(keyValues[1]);
    }
    int[] iInts = lstInts.Select(x => int.Parse(x)).ToArray();
    Array.Sort(iInts);
    dic.Add(keyValues[0], iInts);               
}

It works, but going over the potential size requirements, it's obvious this method is never going to scale well enough.

Is there an off-the-shelf solution for this problem or do I need to rework the algorithm completely?


Edit: I am a little embarrassed to admit it, but I didn't know dictionaries could just be serialized to binary. I gave it a test run and it's pretty much what I needed.

Here is the code (suggestions welcome)

public static void saveToFile(Dictionary<string, List<int>> dic)
{
    // FileMode.Create truncates any existing file; OpenOrCreate would leave
    // stale trailing bytes behind if the new data happens to be shorter.
    using (FileStream fs = new FileStream(_PATH_TO_BIN, FileMode.Create))
    {
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(fs, dic);
    }
}

public static Dictionary<string, List<int>> loadBinFile()
{
    try
    {
        // The using block guarantees the stream is disposed even if deserialization throws.
        using (FileStream fs = new FileStream(_PATH_TO_BIN, FileMode.Open))
        {
            BinaryFormatter bf = new BinaryFormatter();
            return (Dictionary<string, List<int>>)bf.Deserialize(fs);
        }
    }
    catch
    {
        return null;
    }
}

With a dictionary of 100k entries, each with an array of 4k integers, serialization takes 14 seconds, deserialization takes 10 seconds, and the resulting file is 1.6 GB.
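(As a rough sanity check, assuming 4-byte integers, 100,000 entries × 4,000 integers × 4 bytes ≈ 1.6 × 10^9 bytes ≈ 1.6 GB, so the serialized file is close to the raw size of the data itself.)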

@Patryk: Please convert your comment to an answer so I can mark it as accepted.


Solution

Dictionary<TKey, TValue> is marked as [Serializable] (and implements ISerializable), as can be seen here.

That means you can use e.g. BinaryFormatter to perform binary serialization and deserialization to and from a stream, say a FileStream. :)

OTHER TIPS

I'm guessing you want to reduce the memory footprint during load. Right now you're loading everything into memory in an array, then copying everything into a dictionary. Until the original array goes out of scope and gets garbage collected, there will be a period of time with roughly 2x the memory usage needed. If it's a very large file, that could be a lot; if it's only a few megabytes, it's not a big deal.

If you want to do this more efficiently, you can read the data from a stream like so:

string fileName = @"C:\...";
var dict = new Dictionary<string, int[]>();

using (var fs = new FileStream(fileName, FileMode.Open))
using (var reader = new StreamReader(fs))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var values = line.Split(',');
        dict.Add(values[0], values.Skip(1).Select(x => Convert.ToInt32(x)).ToArray());
    }       
}

Or you can use the shortcut Jim suggested:

string fileName = @"C:\...";
var dict = new Dictionary<string, int[]>();

foreach (string line in File.ReadLines(fileName))
{
    var values = line.Split(',');
    dict.Add(values[0], values.Skip(1).Select(x => Convert.ToInt32(x)).ToArray());
}

This makes some strict assumptions about the file format: notably, that each line is in the format key,int1,int2,int3,int4,... with each record on its own line, and that the key doesn't contain a comma.
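If you keep the tab-delimited layout from the question instead (key, then a tab, then comma-separated integers), the same streaming approach only needs a slightly different split. A minimal sketch, assuming a header line should be skipped as in the original code (requires System.Linq):

string fileName = @"C:\...";
var dict = new Dictionary<string, int[]>();

using (var reader = new StreamReader(fileName))
{
    reader.ReadLine(); // skip the header line, as in the original code
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // key [tab] comma-separated integers
        var parts = line.Split('\t');
        dict.Add(parts[0], parts[1].Split(',').Select(int.Parse).ToArray());
    }
}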

It's also worth noting that while your current code is not terribly efficient, it's probably not your main bottleneck; file read speed usually is. If you are actually experiencing performance issues, the most likely cause is that you're reading the file synchronously. In an application with a user interface, any file I/O should be done asynchronously.
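For illustration only (the method name LoadDictionaryAsync is not from the original code), a minimal sketch of an asynchronous version of the same comma-delimited load, using StreamReader.ReadLineAsync so a UI thread isn't blocked:

public static async Task<Dictionary<string, int[]>> LoadDictionaryAsync(string fileName)
{
    var dict = new Dictionary<string, int[]>();
    using (var reader = new StreamReader(fileName))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            // key,int1,int2,int3,... with the key in the first field
            var values = line.Split(',');
            dict.Add(values[0], values.Skip(1).Select(int.Parse).ToArray());
        }
    }
    return dict;
}

// usage from an async event handler: var dict = await LoadDictionaryAsync(fileName);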

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow