Question

I am trying to solve a problem on Codeforces and I am getting a Time Limit Exceeded verdict. The only time-consuming operation is calculating the sum of a big array, so I have tried to optimize it, but with no result.

What I want: optimize the following function:

// the array could be up to Integer.MAX_VALUE in length
private long canonicalSum(int[] array) {
    int sum = 0; // note: an int accumulator wraps on overflow, despite the long return type
    for (int i = 0; i < array.length; i++)
        sum += array[i];
    return sum;
}

Question1 [main]: Is it possible to optimize canonicalSum?

I've tried: to avoid operations with very big numbers, I decided to use auxiliary data. For instance, I convert array1[100] into array2[10], where array2[i] = array1[10*i] + array1[10*i+1] + ... + array1[10*i+9].

private long optimizedSum(int[] array, int step) {
    do {
        array = sumItr(array, step);
    } while (array.length != 1);
    return array[0];
}

private int[] sumItr(int[] array, int step) {
    // ceiling division: one slot per full chunk, plus one for a trailing partial chunk
    int length = (array.length + step - 1) / step;
    boolean needCompensation = array.length % step != 0;
    int[] aux = new int[length];
    for (int i = 0, auxSum = 0, auxPointer = 0; i < array.length; i++) {
        auxSum += array[i];
        if ((i + 1) % step == 0) {
            aux[auxPointer++] = auxSum;
            auxSum = 0;
        }
        if (i == array.length - 1 && needCompensation) {
            aux[auxPointer++] = auxSum;
        }
    }
    return aux;
}

Problem: it turns out that canonicalSum is ten times faster than optimizedSum. Here is my test:

@Test
public void sum_comparison() {
    final int ARRAY_SIZE = 100000000;
    final int STEP = 1000;
    int[] array = genRandomArray(ARRAY_SIZE);

    System.out.println("Start canonical Sum");
    long beg1 = System.nanoTime();
    long sum1 = canonicalSum(array);
    long end1 = System.nanoTime();
    long time1 = end1 - beg1;
    System.out.println("canon:" + TimeUnit.MILLISECONDS.convert(time1, TimeUnit.NANOSECONDS) + "milliseconds");

    System.out.println("Start optimizedSum");
    long beg2 = System.nanoTime();
    long sum2 = optimizedSum(array, STEP);
    long end2 = System.nanoTime();
    long time2 = end2 - beg2;
    System.out.println("custom:" + TimeUnit.MILLISECONDS.convert(time2, TimeUnit.NANOSECONDS) + "milliseconds");

    assertEquals(sum1, sum2);
    assertTrue(time2 <= time1);
}

private int[] genRandomArray(int size) {
    int[] array = new int[size];
    Random random = new Random();
    for (int i = 0; i < array.length; i++) {
        array[i] = random.nextInt();
    }
    return array;
}

Question2: Why is optimizedSum slower than canonicalSum?

Solution 2

Question1 [main]: Is it possible to optimize canonicalSum?

Yes, it is, but I can't say by what factor.

Some things you can do are:

  • use the parallel pipelines introduced in Java 8. Modern processors have SIMD instructions for summing arrays in parallel. This can be observed in Octave: summing two vectors with ".+" or "+" (vectorized addition) is far faster than using a loop. A short parallel-stream sketch follows the unrolling example below.

  • use multithreading. You could use a divide and conquer algorithm. Maybe like this:

    • divide the array into 2 or more parts
    • keep dividing recursively until you get arrays of a size manageable by a single thread
    • compute the sums of the subarrays (the divided arrays) in separate threads
    • finally, add the sums produced by all the threads together to get the final result
  • maybe unrolling the loop would help a bit, too. By loop unrolling I mean reducing the number of iterations by manually doing more work per iteration.

An example from http://en.wikipedia.org/wiki/Loop_unwinding :

for (int x = 0; x < 100; x++)
{
    delete(x);
}

becomes

for (int x = 0; x < 100; x+=5)
{
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}

but, as mentioned, this must be done with caution and backed by profiling, since the JIT can often perform this kind of optimization itself.
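
For the first suggestion above, a minimal sketch using Java 8 parallel streams (widening each element to long is an extra safeguard against overflow, not something the original answer spells out):

import java.util.Arrays;

public class ParallelSum {
    public static long parallelSum(int[] array) {
        return Arrays.stream(array)  // IntStream over the array
                .asLongStream()      // widen each element to long to avoid overflow
                .parallel()          // split the work across the common ForkJoinPool
                .sum();
    }
}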

An implementation of mathematical operations using the multithreaded approach can be seen here.

An example implementation using the Fork/Join framework introduced in Java 7, which essentially does what the divide-and-conquer algorithm above describes:

import java.util.concurrent.RecursiveTask;

public class ForkJoinCalculator extends RecursiveTask<Double> {

   public static final long THRESHOLD = 1_000_000;

   private final SequentialCalculator sequentialCalculator;
   private final double[] numbers;
   private final int start;
   private final int end;

   public ForkJoinCalculator(double[] numbers, SequentialCalculator sequentialCalculator) {
     this(numbers, 0, numbers.length, sequentialCalculator);
   }

   private ForkJoinCalculator(double[] numbers, int start, int end, SequentialCalculator sequentialCalculator) {
     this.numbers = numbers;
     this.start = start;
     this.end = end;
     this.sequentialCalculator = sequentialCalculator;
   }

   @Override
   protected Double compute() {
     int length = end - start;
     if (length <= THRESHOLD) {
         return sequentialCalculator.computeSequentially(numbers, start, end);
     }
     ForkJoinCalculator leftTask = new ForkJoinCalculator(numbers, start, start + length/2, sequentialCalculator);
     leftTask.fork();
     ForkJoinCalculator rightTask = new ForkJoinCalculator(numbers, start + length/2, end, sequentialCalculator);
     Double rightResult = rightTask.compute();
     Double leftResult = leftTask.join();
     return leftResult + rightResult;
  }
}

Here we develop a RecursiveTask that splits an array of doubles until the length of a subarray drops below a given threshold. At that point the subarray is processed sequentially, applying to it the operation defined by the following interface:

public interface SequentialCalculator {
  double computeSequentially(double[] numbers, int start, int end);
}

And a usage example, computing the variance of a population:

// requires: import java.util.concurrent.ForkJoinPool;
public static double varianceForkJoin(double[] population) {
    final ForkJoinPool forkJoinPool = new ForkJoinPool();
    double total = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
        @Override
        public double computeSequentially(double[] numbers, int start, int end) {
            double total = 0;
            for (int i = start; i < end; i++) {
                total += numbers[i];
            }
            return total;
        }
    }));
    final double average = total / population.length;
    double variance = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
        @Override
        public double computeSequentially(double[] numbers, int start, int end) {
            double variance = 0;
            for (int i = start; i < end; i++) {
                variance += (numbers[i] - average) * (numbers[i] - average);
            }
            return variance;
        }
    }));
    return variance / population.length;
}
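
Applied directly to the question's problem, the same machinery reduces to a plain sum. A minimal sketch reusing the classes above (the helper name sumForkJoin is mine, and it assumes the int[] has first been copied into a double[], since ForkJoinCalculator works on doubles):

public static double sumForkJoin(double[] numbers) {
    final ForkJoinPool forkJoinPool = new ForkJoinPool();
    return forkJoinPool.invoke(new ForkJoinCalculator(numbers, new SequentialCalculator() {
        @Override
        public double computeSequentially(double[] numbers, int start, int end) {
            double total = 0;
            for (int i = start; i < end; i++) {
                total += numbers[i]; // sequential sum of this chunk
            }
            return total;
        }
    }));
}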

OTHER TIPS

As of Java 9, vectorisation of this operation has been implemented but disabled, based on benchmarks measuring the all-in cost of the code plus its compilation. Depending on your processor, this leads to the relatively entertaining result that if you introduce artificial complications into your reduction loop, you can trigger autovectorisation and get a quicker result! So the fastest code, for now, assuming numbers small enough not to overflow, is:

public int sum(int[] data) {
    int value = 0;
    for (int i = 0; i < data.length; ++i) {
        value += 2 * data[i];
    }
    return value / 2;
}

This isn't intended as a recommendation! This is more to illustrate that the speed of your code in Java is dependent on the JIT, its trade-offs, and its bugs/features in any given release. Writing cute code to optimise problems like this is at best vain and will put a shelf life on the code you write. For instance, had you manually unrolled a loop to optimise for an older version of Java, your code would be much slower in Java 8 or 9 because this decision would completely disable autovectorisation. You'd better really need that performance to do it.

If you want to add N numbers, the runtime is O(N), so in that respect your canonicalSum cannot be "optimized".
What you can do to reduce the runtime is make the summation parallel, i.e. break the array into parts, hand them to separate threads, and at the end add up the results returned by each thread.
Update: this implies a multicore system, but there is a Java API to get the number of cores (Runtime.getRuntime().availableProcessors()).
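
A minimal sketch of that approach (the class name ThreadedSum is mine; a long accumulator is used so the partial sums don't overflow):

import java.util.concurrent.atomic.AtomicLong;

public class ThreadedSum {
    public static long sum(final int[] array) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[cores];
        final AtomicLong total = new AtomicLong();
        int chunk = (array.length + cores - 1) / cores; // ceiling division
        for (int t = 0; t < cores; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, array.length);
            threads[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    long partial = 0;
                    for (int i = start; i < end; i++) {
                        partial += array[i];
                    }
                    total.addAndGet(partial); // combine per-thread partial sums
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return total.get();
    }
}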

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow