Question

Sorry in advance for the long post, and thanks to whoever will take the time to get to the end and give me feedback.

I have performance related questions regarding operations on lists and arrays.

I have written software to perform some operations on data gathered from an array of sensors. To make it run faster, I am currently writing some optimizations.

The data gathered is in an N-by-M array of doubles (actually implemented as a class extending List<List<double>>). We always have this.Count() == N and this[i].Count() == M for any value of i. Basically, it's a rectangular array.

Each data point in this array is correlated to some points on an X-by-Y map. Basically, imagine this as an image I build from the data to represent it quickly and clearly. Hence, for each data point there is a List<int[]> of the map-points it relates to. This is represented by List<int[]>[,] pointsLocal. I also hold a static List<int[]>[,] where I store the same information: this way I can modify pointsLocal at my leisure in an elaboration loop, and have a fresh copy the next time these methods are called. The same sensor is always correlated to the same points, which is why I keep that local array. Some points (most of them, actually) are related to more than one sensor, and are thus in many lists.

In other parts of my code I am able to correctly identify that some sensors of the array had an issue, so the data contains errors. This is represented in private List<List<bool>> faultyData. If a sensor gives a faulty output, I have to assume that all the points it is correlated to may suffer from the fault, so I don't care about the results of further analysis in those map-points.

The computational part of my code aggregates data from all the sensors in the array for each map-point. What I am trying to do is to predetermine a subset of map-points on which I don't have to perform any analysis.

The class PointsEqualityComparer is a custom equality comparer for int[], which I use because map-points are identified by their 2D coordinates.

public class Sinogram : List<List<double>>
{
    //various enums
    private List<List<bool>> faultyData; //faultyData[i][j] is true if there is an error in the data
    //constructors
    //some methods
    public void dataAnalysis()
    {
        List<int[]>[,] pointsLocal = new List<int[]>[this.Count(), this[0].Count()];
        List<int[]> faultyPoints = new List<int[]>();
        //Fill pointsLocal with the correlated points from the static array
        PointsEqualityComparer myComparer = new PointsEqualityComparer();
        //Point selection parts (see later for the two different implementations)
        //Actual analysis parts (not here because it is not relevant to my question, but it works)
    }
}

The comparer class is as follows. I already found out that the GetHashCode method has to return results that are as unique as possible to improve performance, so I implemented it as you see in this snippet.

public class PointsEqualityComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] p1, int[] p2)
    {
        return (p1.Length == p2.Length) && (p1.Length == 2) && (p1[0] == p2[0]) && (p1[1] == p2[1]);
    }

    public int GetHashCode(int[] obj)
    {
        return ((obj[0] + obj[1]) * (obj[0] + obj[1] + 1) + obj[1]);
    }
}
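As an aside (this is an editorial note, not part of the original question): if your compiler supports value tuples (C# 7 and later), an (int, int) key gives you structural equality and a decent hash for free, so a custom comparer is not needed at all. A minimal sketch with illustrative point values:

```csharp
using System;
using System.Collections.Generic;

class TuplePointDemo
{
    static void Main()
    {
        // (x, y) value tuples compare and hash structurally,
        // so no IEqualityComparer implementation is required.
        var points = new HashSet<(int x, int y)>();
        points.Add((3, 5));
        points.Add((3, 5)); // duplicate: ignored by the set
        points.Add((5, 3)); // different point: coordinate order matters

        Console.WriteLine(points.Count);            // 2
        Console.WriteLine(points.Contains((3, 5))); // True
    }
}
```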

Now for the tricky parts. I have two different implementations of the portion of code that selects which map-points are interesting, meaning the map-points on which I will have to aggregate the data from the sensors. I select them by identifying the points subject to errors and removing them from the lists.

In my first implementation I loop through all the lists of map-points. If the corresponding sensor was faulty, I add its points to a list of faulty points (avoiding duplicates). Once I have looped through all points and generated the full list of faulty ones, I update allPairsLocal by removing them. The faultyPoints list can become rather big, especially when a lot of sensors report errors (the maximum theoretical size is more than 2,000,000 elements, if all the sensors report errors and I am trying to create a 1920x1080 map to plot as an HD image).

for (int i = 0; i < this.Count; i++)
{
    for (int j = 0; j < this[i].Count; j++)
    {
        if (faultyData[i][j])
        {
            faultyPoints = faultyPoints.Union<int[]>(allPairsLocal[i, j], myComparer).ToList();
        }
    }
}
for (int i = 0; i < this.Count; i++)
{
    for (int j = 0; j < this[i].Count; j++)
    {
        allPairsLocal[i, j] = allPairsLocal[i, j].Except(faultyPoints, myComparer).ToList();
    }
}
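For comparison, the same two passes can be written around a HashSet<int[]> built once from the faulty sensors' lists: Add and Contains are amortized O(1), which avoids the repeated set rebuilds of Union(...).ToList() and Except(...). This is only a sketch; the 1-by-2 sensor grid and the contents of faultyData and allPairsLocal below are made-up stand-in data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FaultyPointFilterDemo
{
    // Same comparer as in the question, with Length instead of LINQ Count().
    class PointsEqualityComparer : IEqualityComparer<int[]>
    {
        public bool Equals(int[] p1, int[] p2) =>
            p1.Length == 2 && p2.Length == 2 && p1[0] == p2[0] && p1[1] == p2[1];
        public int GetHashCode(int[] p) => (p[0] + p[1]) * (p[0] + p[1] + 1) + p[1];
    }

    static void Main()
    {
        // Stand-in data: a 1x2 sensor row where sensor (0,1) is faulty.
        var faultyData = new[] { new[] { false, true } };
        var allPairsLocal = new List<int[]>[1, 2];
        allPairsLocal[0, 0] = new List<int[]> { new[] { 0, 0 }, new[] { 1, 1 } };
        allPairsLocal[0, 1] = new List<int[]> { new[] { 1, 1 }, new[] { 2, 2 } };

        var comparer = new PointsEqualityComparer();

        // Pass 1: collect faulty map-points once. HashSet.Add is amortized
        // O(1) and deduplicates for free, unlike repeated Union(...).ToList().
        var faultyPoints = new HashSet<int[]>(comparer);
        for (int i = 0; i < 1; i++)
            for (int j = 0; j < 2; j++)
                if (faultyData[i][j])
                    foreach (var p in allPairsLocal[i, j])
                        faultyPoints.Add(p);

        // Pass 2: filter each list with an O(1) Contains instead of Except.
        for (int i = 0; i < 1; i++)
            for (int j = 0; j < 2; j++)
                allPairsLocal[i, j] =
                    allPairsLocal[i, j].Where(p => !faultyPoints.Contains(p)).ToList();

        Console.WriteLine(allPairsLocal[0, 0].Count); // only (0,0) survives -> 1
        Console.WriteLine(allPairsLocal[0, 1].Count); // 0
    }
}
```

The total work becomes roughly proportional to the number of point entries, instead of repeatedly rebuilding ever-larger intermediate lists.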

In my second implementation I tried to keep the faultyPoints list smaller. So, for each sensor reporting errors, I use its list to remove map-points from all the other lists (and its own as well). This keeps the lists smaller, but at the cost of more nested loops.

for (int i = 0; i < this.Count; i++)
{
    for (int j = 0; j < this[i].Count; j++)
    {
        if (faultyData[i][j])
        {
            faultyPoints = allPairsLocal[i, j].ToList();
            for (int x = 0; x < this.Count; x++)
            {
                for (int y = 0; y < this[x].Count; y++)
                {
                    allPairsLocal[x, y] = allPairsLocal[x, y].Except(faultyPoints, myComparer).ToList();
                }
            }
        }
    }
}

Both of these implementations are very slow, and I guess this is at least partly because of the size of the data sets. Both take far longer than performing the data analysis steps on the whole datasets. Are there ways to perform similar operations with faster implementations? Some steps could probably be parallelized, but that wouldn't really change the substance. Are there data structures with O(1) methods to implement what I do here with Union and Except?

Again thank you for having read through my whole post. I appreciate any feedback, even if it is not a full answer, and I am more than available to clarify what points I can.


Solution

If I understand you correctly, once you populate the pointsLocal array, we have the following for each sensor (i,j):

  • this[i][j] = data from sensor (i,j)
  • pointsLocal[i,j] = list of map points for sensor (i,j)
  • faultyData[i][j] = true if data from sensor (i,j) is bad, and is false otherwise

Consider "inverting" your data, so that given a map point (x,y) you can efficiently

  • Find out whether the point is faulty (i.e. any sensor reporting data for the map point is faulty)
  • Obtain a list of sensors reporting data correlated with the map point

To do this, we can create a dictionary that uses the comparer that you have already written. Each key is an (x,y) pair (i.e. an int[2]) representing a map point; the value returned, if any, is the list of known sensors contributing to that point. A returned value of null indicates that the map point is "infected" by a faulty sensor and should be ignored. If a given pair does not exist in the dictionary at all, it means no sensors contribute to that point.

var mapPoints = new Dictionary<int[], List<int[]>>(new PointsEqualityComparer());

for (int i = 0; i < this.Count; i++)
{
    for (int j = 0; j < this[i].Count; j++)
    {
        foreach (var point in pointsLocal[i,j]) 
        {
            if (faultyData[i][j])
            {
                // infected point
                mapPoints[point] = null;  
            }
            else
            {
                // Add the current sensor's indices (i,j) to the list of 
                // known sensors for the current map point

                List<int[]> sensors = null;
                if (!mapPoints.TryGetValue(point, out sensors)) 
                {
                    sensors = new List<int[]>();
                    mapPoints[point] = sensors;
                }

                // null here means that we previously determined that the
                // current map point is infected 
                if (sensors != null) 
                {
                    // Add sensor to list for this map point
                    sensors.Add(new int[] { i, j });
                }
            }
        }
    } 
}

Now you can enumerate all of the map points, classifying each as good or bad:

var faultyPoints = new List<int[]>();  // not sure you really need this? 
var goodPoints = new List<int[]>();
foreach (var point in mapPoints.Keys)
{
    var sensors = mapPoints[point];
    if (sensors == null)
         faultyPoints.Add(point);
    else
         goodPoints.Add(point);
}

Finally, you could enumerate the sensors for each good map point, to do your analysis:

foreach (var point in goodPoints) 
{
    var sensors = mapPoints[point]; 
    // for current point, aggregate data for each sensor listed in "sensors"
}

Note that I have not altered allPairsLocal, because it doesn't seem necessary for the analysis step. However, if you really need to remove the faulty map points from that, you could do that efficiently as well:

for (int i = 0; i < this.Count; i++)
{
    for (int j = 0; j < this[i].Count; j++)
    {
        var points = allPairsLocal[i, j];
        var cleanedUp = new List<int[]>();
        foreach (var point in points)
        {
            // Important: do NOT use 'faultyPoints' here. It will kill performance
            if (mapPoints[point] != null)
            {
                cleanedUp.Add(point);
            }
        }
        allPairsLocal[i, j] = cleanedUp;
    }
}

The performance improvement in all of this comes from using a Dictionary to look up a single map point whenever you need to know if it is faulty or what its contributing sensors are. The lookup is essentially a constant time operation (amortized), if your hash function is good.

There are a number of optimizations you could make here. For example, do you really need to know the sensor indices to do the aggregations for each map point, or do you just need the data values? If the latter, then your dictionary could be a Dictionary<int[], List<double>>. Finally, the code could be made more compact by using LINQ (instead of loops) for many of the enumerations.
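As one example of that LINQ compaction, the good/bad classification loop could collapse into two queries over the dictionary. This sketch uses string keys and made-up entries purely for brevity; with int[] keys you would pass the custom comparer as before:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ClassifyDemo
{
    static void Main()
    {
        // Illustrative stand-in for the mapPoints dictionary: a null value
        // marks an "infected" map point, a non-null list marks a good one.
        var mapPoints = new Dictionary<string, List<int[]>>
        {
            ["1,2"] = new List<int[]> { new[] { 0, 0 } },
            ["3,4"] = null,
            ["5,6"] = new List<int[]> { new[] { 0, 1 } },
        };

        // Classify in one pass each instead of an explicit foreach with branches.
        var goodPoints = mapPoints.Where(kv => kv.Value != null)
                                  .Select(kv => kv.Key)
                                  .ToList();
        var faultyPoints = mapPoints.Where(kv => kv.Value == null)
                                    .Select(kv => kv.Key)
                                    .ToList();

        Console.WriteLine(goodPoints.Count);   // 2
        Console.WriteLine(faultyPoints.Count); // 1
    }
}
```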

Other tips

Yes, you're right, and it's because of the complexity of the Union and Except operations.
You have an N-by-M table of sensors. Every sensor affects an array of points (your allPairsLocal[i, j]), and every such array is a subset of a global predetermined array of points (the points on the X-by-Y map).
If I am right, then:

  1. The points on the X-by-Y map form the global array of points. Moreover, since you can compare points, you can keep this array sorted (maybe not literally sorted, but with good read-operation complexity). Use a Dictionary<int[], int> for it, where the key is a point's coordinates and the value is its order index (set it after inserting all points).
  2. Now we have sets of sensors and points (let's call the Dictionary<int[], int> from step 1 points). We need to construct two mappings - sensors2points (call it s2p) and points2sensors (call it p2s). You have allPairsLocal as sensors2points, and it looks like List<int[]>[][], i.e. a list of point coordinates for each sensor. But we need to keep, for each sensor, a list of indices into points, i.e. convert each int[] to its order index in points:

    // straight and inverted mappings
    var s2p = new List<int>[N * M];
    var p2s = new List<List<int>>(points.Count);
    // initialize p2s inner lists (note: the capacity constructor leaves
    // Count == 0, so we must Add, not assign by index)
    for (int i = 0; i < points.Count; i++)
        p2s.Add(new List<int>());

    for (int i = 0; i < N * M; i++)
    {
        s2p[i] = new List<int>(allPairsLocal[i / M][i % M].Count);

        // convert the list of point coordinates to a list of their indices
        // and construct the inverted mapping
        foreach (int[] p in allPairsLocal[i / M][i % M])
        {
            // points[p] - the index of point p in the Dictionary, if you remember
            s2p[i].Add(points[p]);
            p2s[points[p]].Add(i);
        }
    }
    

I think it's clear that steps 1 and 2 need to be performed only once, at initialization time. Then, to select the interesting points, you need:

// I don't know which set you need as a result - valid points or sensors - so I do both

// false - correct, true - error. Initialized with false
BitArray sensorsMask = new BitArray(sensors.Count);
BitArray pointsMask = new BitArray(points.Count);

for (int i = 0; i < N * M; i++)
{
    if (faultyData[i / M][i % M])
    {
        sensorsMask[i] = true; // means error in sensor

        // mark every point touched by the faulty sensor
        foreach (int p in s2p[i])
            pointsMask[p] = true;
    }
}

// so you can get only valid sensors
var validSensors = new List<int>();
for (int i = 0; i < N * M; i++)
    if (!sensorsMask[i])
        validSensors.Add(i);

// or only valid points
var validPoints = new List<int[]>();
foreach (var pair in points)
    if (!pointsMask[pair.Value])
        validPoints.Add(pair.Key);

It may not be the most efficient way (it's hard to tell what exactly you want to get), but it's better than operating with sets - a mask array versus set operations. Hope it helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow