Pairwise Distance Calculation in c++

Question 1

A few potential micro-optimsations:

Multiplication may be faster than using pow() to square the distances;
The common factor -1 * G * pow(p_mass,2) / mpc_to_m could be made a constant;
It might be slightly faster to sum 1.0 / dist_betw, then multiply by the common factor at the end.
You might (or might not) be able to approximate the square-root accurately enough, faster than sqrt(). There are plenty of approximations to try.
Convert each co-ordinate to double just once, storing the values in another array, rather than on every iteration of the inner loop.

Algorithmically, you could discard particles that are too far from the current point to make a significant contribution to the energy. A simple modification could check the squared distance, before the expensive square-root, and move on if that's too large. You might get a further improvement by storing the particles in a spatially-aware data structure, like an Octree, or just a simple grid, so that you don't even need to look at distant points; but that might not be practical if the particles move frequently.

Question 2

Conversion from string to double is expensive, move these outside the loop and cache them into a separate datacache[],

typedef struct { double x, y, z; } position;
position datacache[data.size()];
for(int i = 0; i < data.size()-1; i++)
{
   datacache[i].x = (double)atof(data[i][2].c_str());
   datacache[i].y = (double)atof(data[i][3].c_str());
   datacache[i].z = (double)atof(data[i][4].c_str());
}

Then use datacache[].x, .y, .z in your loop.

You can use float instead of double, as you are willing to approximate within 1% (so losing the extra digits of precision you get from double is still within get 8-9 digits of precision).

Another efficiency improvement - you might also consider using fixedpoint integer arithmetic (for the distances), where you decide upon the range of values, and rather than use an an explicit decimal point storage as in float/double, scale your fixedpoint number by an implicit value (which would factor out of the distance calculation.

Algorithmic optimization - separate your 3D space into regions, compute aggregates over regions, and then aggregate the effects from regions.

Question 3

Precalculate everything possible outside the loop
Move conversion from array elements out of the loops to avoud to convert any value more than once.

The Code:

int colstart = 2;
int colend = 4;
double PE = 0;
double p_mass = 8.721e9 * 1.989e30; // mass of sim particle in kg
double mpc_to_m = 3.08567758e22; // meters per mpc
double G = 6.67384e-11; // grav. constant in mks units
double constVal= (-1 * G * pow(p_mass,2)) / mpc_to_m

double values[][3]= ... ;// allocate an array of size  data.size()*3
for(int i = 0; i < data.size()-1; i++){
    value[i][0]=(double)atof(data[i][2].c_str());
    value[i][1]=(double)atof(data[i][3].c_str());
    value[i][2]=(double)atof(data[i][4].c_str());
}

// Calculating PE
for(int i = 0; i < data.size()-1; i++){
     //cout << i << " " << data[i].size() << endl;
     double xi=value[i][0]
     double yi=value[i][1];
     double yi=value[i][2];
     for(int j = 0; j < i; j++){
         double xDiff = xi - value[j][0] ;
         double yDiff = yi - value[j][1] ;
         double zDiff = zi - value[j][2];
         PE += constVal / (xDiff*xDiff + yDiff*yDiff + zDiff*zDiff) ;  
     }
}

But only profiling can show if, and how much this increases speed.

Question 4

You could use a divide and conquer approach like in the closest-pair problem: http://en.m.wikipedia.org/wiki/Closest_pair_of_points_problem. You can also compute the delaunay or voronoi diagram in O(n log n).

Question 5

The Fast Multipole Method has been used to speed up a seemingly similar O(n^2) all-pairs n-body simulation problem to O(n). I'm not familiar with the details beyond once seeing it in a "Top Ten Algorithms Ever Invented" list, but I think the basic idea is that most particle pairs are long-range interactions, which are weak and are not too sensitive to small changes in position, meaning that they can be approximated well by clustering (I'm not sure exactly how this is done) without losing too much accuracy. You might be able to apply similar techniques to your problem.

Question 6

One thing you should do for sure, at the very least, is move the lines

   double x_i = (double)atof(data[i][2].c_str());
   double y_i = (double)atof(data[i][3].c_str());
   double z_i = (double)atof(data[i][4].c_str());

outside of the inner loop. These values depend only on i and not on j, so you definitely don't want to re-parse them every time. Then there are a few micro-optimizations that could get it to run a little faster. Finally, if you are on a multi-processor machine, you can easily parallelize this with openMP. An semi-optimized and parallelized version of the code might look like this:

inline double squared(double x){
  return x * x;
}

double compute_pe(vector<string *> data){
  double PE = 0;
  double p_mass = 8.721e9 * 1.989e30; // mass of sim particle in kg
  double mpc_to_m = 3.08567758e22; // meters per mpc
  double G = 6.67384e-11; // grav. constant in mks units

  double PEarray[data.size()];

  double numerator = (-1 * G * pow(p_mass,2))/ mpc_to_m;


  size_t i,j;

  // Calculating PE
  #pragma omp parallel for private(i, j)
  for(i = 0; i < data.size()-1; i++) // -1 is for empty line at the end of every file
    {
    PEarray[i]=0;
    double x_i = (double)atof(data[i][2].c_str());
    double y_i = (double)atof(data[i][3].c_str());
    double z_i = (double)atof(data[i][4].c_str());

      //cout << i << " " << data[i].size() << endl;
      for(j = 0; j < i; j++)
       {

        double x_j = (double)atof(data[j][2].c_str());
        double y_j = (double)atof(data[j][3].c_str());
        double z_j = (double)atof(data[j][4].c_str());

        double dist_betw = sqrt(squared(x_i-x_j) + squared(y_i-y_j) + squared(z_i-z_j));

        PEarray[i] += numerator / (dist_betw);
       }
    }

  for(i = 0; i < data.size()-1; i++){
    PE += PEarray[i];
  }

  return PE;
}

Question 7

On top of all previous advice, a common trick is to precompute the squared norms of all vectors.

Given that ||x-y||^2 = ||x||^2 + ||y||^2 - 2*x.y, one can avoid a lot of useless multiplications if ||x||^2 is computed for all x once and for all at the beginning.

This pairwise distance problem thus becomes a dot product problem, which is a basic linear algebra operation that can be computed by various optimized libraries depending on your hardware.