The main part of the answer lies in memory management.
Look at these lines:
long **resultMatrix = new long*[capacity];
for (long i = 0; i < capacity; ++i) {
    resultMatrix[i] = new long[capacity];
}
Each row is allocated separately, so the rows end up in different places in memory rather than in one contiguous block. We know what the physical memory looks like on the Mac Mini (2 pieces of plastic), but on the server it may even be spread across different hosts (a cluster).
Now let's try to fix this:
#include <stdlib.h>

long **allocateMatrix(long capacity)
{
    // Allocating a vector of pointers to rows
    long **matrix = (long **)malloc(capacity * sizeof(long *));
    if (matrix == NULL) {
        return NULL;
    }
    // Allocating the matrix as one contiguous block
    matrix[0] = (long *)malloc(capacity * capacity * sizeof(long));
    if (matrix[0] == NULL) {
        free(matrix);
        return NULL;
    }
    // Pointing each entry of the vector at the start of its row inside the block
    long *lineAddress = matrix[0];
    for (long i = 0; i < capacity; ++i) {
        matrix[i] = lineAddress;
        lineAddress += capacity;
    }
    return matrix;
}
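void deallocateMatrix(long **matrix, long capacity)
{
    // capacity is not needed here: the data lives in a single block
    // Free the data block first, then the vector of row pointers
    free(matrix[0]);
    free(matrix);
}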
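A quick way to convince yourself that the rows are now adjacent is to compare consecutive row pointers: each one should start exactly `capacity` elements after the previous one. The sketch below is only an illustration, not part of the measured code. It repeats the same layout as allocateMatrix so that it compiles on its own, and the names allocateContiguous and rowsAreContiguous are made up for this example.

```cpp
#include <cstddef>
#include <cstdlib>

// Same layout as allocateMatrix: one vector of row pointers plus one data block.
// (Hypothetical name, repeated here only to keep the sketch self-contained.)
long **allocateContiguous(long capacity)
{
    const std::size_t n = static_cast<std::size_t>(capacity);
    long **matrix = static_cast<long **>(std::malloc(n * sizeof(long *)));
    if (matrix == nullptr) {
        return nullptr;
    }
    matrix[0] = static_cast<long *>(std::malloc(n * n * sizeof(long)));
    if (matrix[0] == nullptr) {
        std::free(matrix);
        return nullptr;
    }
    // Each row starts `capacity` elements after the previous one
    for (long i = 1; i < capacity; ++i) {
        matrix[i] = matrix[i - 1] + capacity;
    }
    return matrix;
}

// Hypothetical helper: true if every row begins right where the previous row ends
bool rowsAreContiguous(long **matrix, long capacity)
{
    for (long i = 1; i < capacity; ++i) {
        if (matrix[i] - matrix[i - 1] != capacity) {
            return false;
        }
    }
    return true;
}
```

With this layout, matrix[i][j] and matrix[0][i * capacity + j] refer to the same element, which is what makes the whole matrix a single block.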
This brings the running time down to 9.8 seconds on the Mac Mini and to 58 seconds on the server.
But I still don't know where the rest of the time goes. Maybe I should somehow optimize the way the loop walks through one of the matrices.
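One possible direction, assuming the computation is a classic triple-loop matrix multiplication (the loop itself is not shown here, so this is a guess), is the loop order. With the usual i-j-k order the innermost loop walks down a column of the second matrix and jumps `capacity` elements on every step. Swapping the two inner loops makes every access in the innermost loop run along a row. A minimal sketch under that assumption follows. The name multiplyIKJ and the flat row-by-row std::vector storage are illustrative choices, not the code measured above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: c = a * b for n x n matrices stored row by row in flat vectors.
// Loop order is i-k-j, so the innermost loop reads b and writes c sequentially.
std::vector<long> multiplyIKJ(const std::vector<long> &a,
                              const std::vector<long> &b,
                              std::size_t n)
{
    std::vector<long> c(n * n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const long aik = a[i * n + k]; // constant for the whole inner loop
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

The result is the same as with the i-j-k order, since only the order of the additions changes. Whether it actually helps on the Mac Mini or the server would have to be measured.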