Question

I am just wondering what the best approach is for this calculation. Let's assume I have an input array of values and an array of boundaries, and I want to calculate/bucketize the frequency distribution for each segment in the boundaries array.

Is it a good idea to use bucket search for that?

Actually, I found this question: Calculating frequency distribution of a collection with .Net/C#

But I do not understand how to use buckets for that purpose, because the size of each bucket can be different in my situation.

EDIT: After all the discussions I have an inner/outer loop solution, but I still want to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, I need to hash the input values into a bucket index, so we need some sort of hash function with O(1) complexity. Any ideas how to do that?
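(For what it's worth: a true O(1) "hash" into a bucket index only exists when the index can be computed arithmetically, e.g. when all buckets share a single width. A minimal sketch of that case, with hypothetical min/width parameters that my actual variable-width buckets don't have:)

// Hypothetical sketch - valid only when every bucket has the same width.
// 'min' is the lowest boundary and 'width' the common bucket width;
// neither exists in my actual variable-width setup.
int BucketIndex(int value, int min, int width)
{
    return (value - min) / width;   // O(1) arithmetic "hash" into a bucket index
}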


Solution

Bucket Sort is already O(n^2) in the worst case, so I would just do a simple inner/outer loop here. Since your bucket array is presumably shorter than your input array, keep it in the inner loop. Since you're using custom bucket sizes, there are really no arithmetic tricks that can eliminate that inner loop.

int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    for (int i = 0; i < buckets.Length - 1; i++)
    {
        if (d >= buckets[i] && d < buckets[i + 1])
        {
            freq[i]++;
            break;
        }
    }
}

It's also O(n^2) in the worst case, but you can't beat the code's simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But since frequency distributions typically have fewer than 100 buckets, I doubt you'd see much real-world performance benefit.
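For reference, a binary-search variant might look like this (a sketch rather than the original answer's code; it assumes buckets is sorted ascending and uses Array.BinarySearch, which returns either the index of an exact match or the bitwise complement of the insertion point):

int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    int idx = Array.BinarySearch(buckets, d);
    if (idx < 0)
        idx = ~idx - 1;                  // insertion point minus one = candidate bucket
    if (idx >= 0 && idx < freq.Length)
        freq[idx]++;                     // values outside all boundaries are skipped
}

This drops the per-value cost from O(m) to O(log m), where m is the number of boundaries.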

Other tips

If your input array represents real-world data (with its patterns) and the array of boundaries is too large to iterate over again and again in an inner loop, you can consider the following approach:

  • First of all, sort your input array. If you work with real-world data, I would recommend considering Timsort - Wiki for this. It provides very good performance guarantees for the patterns that occur in real-world data.

  • Traverse the sorted array, comparing each value with the current value in the array of boundaries:

    • If the value in the input array is less than the boundary, increment the frequency counter for this boundary.
    • If the value in the input array is greater than or equal to the boundary, advance to the next value in the array of boundaries (repeating if the value skips several boundaries at once) and then increment the counter for the new boundary.

In code it can look like this:

Timsort(myArray);                       // or Array.Sort(myArray) - any O(n log n) sort will do
int[] boundaries = GetBoundaries();     // boundary values, sorted ascending
int[] counts = new int[boundaries.Length];
int boundPos = 0;

for (int i = 0; i < myArray.Length; i++)
{
    // skip every boundary this value has already passed
    while (boundPos < boundaries.Length && myArray[i] >= boundaries[boundPos])
        boundPos++;
    if (boundPos == boundaries.Length)
        break;                          // remaining values exceed the last boundary
    counts[boundPos]++;                 // myArray[i] is below boundaries[boundPos]
}
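Since this pass only ever moves forward through the input array and the boundaries array, the counting loop is O(n + m); the initial O(n log n) sort dominates the overall cost.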
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow