Question

I have a list of 3 million records and I want to get all difference values between every two records. A simple nested loop may take forever. Could anyone suggest an algorithm that is capable of handling this problem?


Solution

If you want to calculate the mean of all absolute differences and your timestamps are sorted, you just need one loop:

t[i] <= t[i + 1]    -->    abs(t[i] - t[j]) = t[j] - t[i]    for    i < j

That is, each of the N * (N - 1) / 2 timestamp differences contributes one summand with a positive sign and one summand with a negative sign. Let's look at an example with 4 timestamps:

sum = (t[3] - t[2]) + (t[3] - t[1]) + (t[3] - t[0])
    + (t[2] - t[1]) + (t[2] - t[0]) 
    + (t[1] - t[0])

Here, t[3] is always added, t[2] is added twice and subtracted once, t[1] is added once and subtracted twice, and finally the lowest value, t[0], is always subtracted.

Or, more generally: the first timestamp, i.e. the one with the lowest value, always has a negative sign, N - 1 times. The second has a negative sign N - 2 times and a positive sign once, namely when compared with the first timestamp. The third has a negative sign N - 3 times and a positive sign twice, and so on.

So your loop goes like this:

sum = 0;
for i = 0 to N:
    sum = sum + (2*i - N + 1) * t[i]

where i is a zero-based index and N an exclusive upper bound, C-style. To get the average, divide by (N - 1) * N / 2.
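
A quick sanity check with N = 4 and the hypothetical input t = [1, 3, 7, 10] (the coefficients 2*i - N + 1 are -3, -1, 1, 3):

brute force:  (3-1) + (7-1) + (10-1) + (7-3) + (10-3) + (10-7) = 31
one loop:     (-3)*1 + (-1)*3 + 1*7 + 3*10                     = 31
average:      31 / 6  (about 5.17)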

If your array isn't sorted, you must sort it first. Sorting takes O(N log N) time, so you are still far better off than with the quadratic nested loop.
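
Putting it together, here is a minimal C sketch; the int64_t timestamp type and the qsort-based sorting are assumptions on my part, not part of the original answer:

#include <stdlib.h>
#include <stdint.h>

/* ascending-order comparison function for qsort */
static int cmp_i64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* mean of all pairwise absolute differences in O(N log N), n >= 2 */
double mean_abs_diff(int64_t *t, size_t n)
{
    qsort(t, n, sizeof *t, cmp_i64);   /* skip if already sorted */

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (2.0 * i - (double)n + 1.0) * (double)t[i];

    return sum / ((double)n * (n - 1) / 2.0);
}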

One thing that might occur is that by summing up large values, you hit the limits of your data type. You could try to fix that by splitting the loop in half and summing from both ends, in the hope that the differences roughly cancel each other out. Alternatively, you could already divide by the total number of differences inside the loop, possibly introducing some nasty floating-point round-off errors.
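
As a sketch of that second option, the loop in mean_abs_diff above would become the following; the accumulator then stays close to the final mean instead of growing:

double pairs = (double)n * (n - 1) / 2.0;
double mean = 0.0;
for (size_t i = 0; i < n; i++)
    mean += (2.0 * i - (double)n + 1.0) * (double)t[i] / pairs;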

OTHER TIPS

You could parallelise the problem by splitting the file into, say, 8 chunks and processing them all at the same time, making the most of those expensive Intel cores you paid for...

Use the split command to generate the lists.

#!/bin/bash
split -l 375000 yourfile sublist      # split into subfiles of 375,000 lines each, named sublist*
for f in sublist*                     # for all sublist* files
do
   # Start a background process to work on one sublist
   echo "start processing file $f in background" &
done
wait                                  # until all background jobs have finished
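
One caveat: the chunks cannot be processed in complete isolation, because differences also pair records from different chunks. With the weighted-sum formula from the solution above, though, a worker only needs its own chunk of the globally sorted data, its global start index, and the total record count; the partial sums of the workers then simply add up. A hypothetical worker function in C (all names and parameters are mine):

/* partial weighted sum for the chunk holding sorted global
   indices [start, start + len) out of total_n records */
double partial_sum(const int64_t *chunk, size_t len,
                   size_t start, size_t total_n)
{
    double s = 0.0;
    for (size_t i = 0; i < len; i++)
        s += (2.0 * (start + i) - (double)total_n + 1.0) * (double)chunk[i];
    return s;   /* sum the results of all workers, then divide
                   by total_n * (total_n - 1) / 2 for the mean */
}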
Licensed under: CC-BY-SA with attribution