Question

I have 4 dictionaries that contain 800k strings of 200 to 6000 characters each. When I load them into memory they take up about 11 GB. It takes me 2 minutes to parse the data and 2 minutes to output it. Is there any way to output the data faster than with the code below? I am only getting 20-31 MB per second of disk I/O, and I know the hard drive can do around 800.

var hashs1 = new Dictionary<int, Dictionary<string, string>>(f.Count + 2);
var hashs2 = new Dictionary<int, Dictionary<string, string>>(f.Count + 2);
var hashs3 = new Dictionary<int, Dictionary<string, string>>(f.Count + 2);
var hashs4 = new Dictionary<int, Dictionary<string, string>>(f.Count + 2);
....
foreach (var me in mswithfilenames)
{
    filename = me.Key.ToString();
    string filenamef = filename + "index1";
    string filenameq = filename + "index2";
    string filenamefq = filename + "index3";
    string filenameqq = filename + "index4";

    StreamWriter sw = File.AppendText(filenamef);
    StreamWriter sw2 = File.AppendText(filenameq);
    StreamWriter swq = File.AppendText(filenamefq);
    StreamWriter sw2q = File.AppendText(filenameqq);

    for (i = 0; i <= totalinhash; i++)
    {
        if (hashs1[i].ContainsKey(filenamef))
        {
            sw.Write(hashs1[i][filenamef]);
        }
        if (hashs2[i].ContainsKey(filenameq))
        {
            sw2.Write(hashs2[i][filenameq]);
        }
        if (hashs3[i].ContainsKey(filenamefq))
        {
            swq.Write(hashs3[i][filenamefq]);
        }

        if (hashs4[i].ContainsKey(filenameqq))
        {
            sw2q.Write(hashs4[i][filenameqq]);
        }
        }
    }

    sw.Close();
    sw2.Close();
    swq.Close();
    sw2q.Close();
}

Solution

The most expensive part is the I/O. And this loop:

for (i = 0; i <= totalinhash; i++)
{
    if (hashs1[i].ContainsKey(filenamef))
    {
        sw.Write(hashs1[i][filenamef]);
    }
    if (hashs2[i].ContainsKey(filenameq))
    {
        sw2.Write(hashs2[i][filenameq]);
    }
    ...
}

is alternating between different files. That will probably cause some extra head movement, and it creates fragmented files (slowing down future operations on those files).

I would use:

for (i = 0; i <= totalinhash; i++)
{
    if (hashs1[i].ContainsKey(filenamef))
    {
        sw.Write(hashs1[i][filenamef]);
    }
}

for (i = 0; i <= totalinhash; i++)
{
    if (hashs2[i].ContainsKey(filenameq))
    {
        sw2.Write(hashs2[i][filenameq]);
    }
}
...

But of course you should measure this. It won't make much difference on SSDs for instance, only on mechanical disks.
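As a concrete sketch of that reordering, assuming the same hashs1..hashs4 dictionaries and file name variables as in the question, each writer can be wrapped in a using block so its file is written completely, flushed, and closed before the next one is started:

using (var sw = File.AppendText(filenamef))
{
    // Finish the first file entirely before touching the others.
    for (int i = 0; i <= totalinhash; i++)
    {
        if (hashs1[i].ContainsKey(filenamef))
        {
            sw.Write(hashs1[i][filenamef]);
        }
    }
}

using (var sw2 = File.AppendText(filenameq))
{
    for (int i = 0; i <= totalinhash; i++)
    {
        if (hashs2[i].ContainsKey(filenameq))
        {
            sw2.Write(hashs2[i][filenameq]);
        }
    }
}

// ...and the same pattern again for hashs3/filenamefq and hashs4/filenameqq.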

Other tips

Did you measure anything? It sounds like you have a non-trivial amount of data to read and write, so the first step would be to establish an absolute baseline for your disk subsystem: how fast it can read and write that much data. A simple read of a file, followed by a write of approximately the amount of data you expect to produce, will show how far you can go in optimizing your code.

You may find that your code itself does not add much time on top of the raw reading and writing.
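A minimal sketch of such a baseline measurement (the file names and the idea of copying a single large file are assumptions for illustration, not part of the question):

using System;
using System.Diagnostics;
using System.IO;

class DiskBaseline
{
    static void Main()
    {
        var timer = Stopwatch.StartNew();

        // Read an existing file of roughly the size you expect to produce...
        byte[] data = File.ReadAllBytes("sample_input.dat");

        // ...and write it straight back out to a new file.
        File.WriteAllBytes("baseline_copy.dat", data);

        timer.Stop();
        double mbPerSecond = data.Length / (1024.0 * 1024.0) / timer.Elapsed.TotalSeconds;
        Console.WriteLine($"Raw read+write throughput: {mbPerSecond:F1} MB/s");
    }
}

If your real code gets close to that number, the bottleneck is the disk, not the dictionaries.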

Can you have a single Dictionary<int, Dictionary<string, myCustomDataHolder>> rather than four separate parallel Dictionary<int, Dictionary<string, string>> instances? Not only should it reduce the space consumed quite a lot, but it also means a quarter of the dictionary lookups.

It's not entirely clear from your question whether the dictionaries are fully parallel, but it seems likely enough to me.
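A hedged sketch of what that could look like; the holder type (written MyCustomDataHolder here, following C# naming conventions) and its field names are made up for illustration:

using System.Collections.Generic;
using System.IO;

// One value object holding the four strings that currently live in four parallel dictionaries.
class MyCustomDataHolder
{
    public string Index1;
    public string Index2;
    public string Index3;
    public string Index4;
}

class Example
{
    static void WriteFirstIndex(Dictionary<int, Dictionary<string, MyCustomDataHolder>> combined,
                                int i, string filename, StreamWriter sw)
    {
        // A single lookup replaces four lookups across four parallel dictionaries.
        if (combined[i].TryGetValue(filename, out MyCustomDataHolder holder))
        {
            sw.Write(holder.Index1);
        }
    }
}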

I'd like to add that

if (hashs1[i].ContainsKey(filenamef))
{
   sw.Write(hashs1[i][filenamef]);
}

takes two hash table accesses: one for ContainsKey and one for the actual read. Many dictionary accesses add up, so you can halve them by using the dictionary's TryGetValue method, which combines the two calls into one. I could explain how it works, but this does the job better than I could: http://www.dotnetperls.com/trygetvalue
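Applied to the snippet above (a sketch using the same hashs1 and filenamef names from the question), the two lookups collapse into one:

string value;
if (hashs1[i].TryGetValue(filenamef, out value))
{
    sw.Write(value);
}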
