Question

I'm developing an image processing app in Java (Swing) that does a lot of calculations. It crashes when big images are loaded:

java.lang.OutOfMemoryError: Java heap space, caused by things like:

double matrizAdj[][] = new double[18658][18658];

So I decided to experiment with a database, as light and fast as possible, to deal with this problem. I'm thinking of using a table as if it were a 2D array, looping through it and inserting the resulting values into another table.

I'm also thinking about using JNI, but I'm not familiar with C/C++ and I don't have the time needed to learn it.

Currently, my problem is not processing speed, only heap overload.

I would like to hear what my best option is to solve this.

EDIT: A little explanation: first I get all the white pixels from a binarized image into a list. Let's say I get 18k pixels. Then I perform a lot of operations with that list: variance, standard deviation, covariance, and so on. At the end I have to multiply two 2D arrays ([2][18000] and [18000][2]), resulting in a double[18000][18000] that is causing me trouble. After that, other operations are done with this 2D array, resulting in more than one big 2D array.

I can't require large amounts of RAM to use this app.


Solution 2

Since you stated you can't require large amounts of RAM to use this app, your only option is to store the big array outside of RAM, with disk being the most obvious choice (a relational database would just add unnecessary overhead).

You can use a little utility class that provides a persistent two-dimensional double array. Here is my solution using RandomAccessFile. This approach also has the advantage that you can keep the array on disk and reuse it when you restart the application!

Note: the presented solution is not thread-safe. Synchronization is needed if you want to access it from multiple threads concurrently.

Persistent 2-dimensional double array:

import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileDoubleMatrix implements Closeable {

    private final int rows;
    private final int cols;
    private final long rowSize;
    private final RandomAccessFile raf;

    public FileDoubleMatrix(File f, int rows, int cols) throws IOException {
        if (rows < 0 || cols < 0)
            throw new IllegalArgumentException(
                "Rows and cols cannot be negative!");
        this.rows = rows;
        this.cols = cols;
        rowSize = cols * 8L; // 8 bytes per double; use a long to avoid int overflow
        raf = new RandomAccessFile(f, "rw");
        raf.setLength(rowSize * rows); // total file length is rows * rowSize
    }

    /**
     * Absolute get method.
     */
    public double get(int row, int col) throws IOException {
        pos(row, col);
        return get();
    }

    /**
     * Absolute set method.
     */
    public void set(int row, int col, double value) throws IOException {
        pos(row, col);
        set(value);
    }

    public void pos(int row, int col) throws IOException {
        if (row < 0 || col < 0 || row >= rows || col >= cols)
            throw new IllegalArgumentException("Invalid row or col!");
        raf.seek(row * rowSize + col * 8L);
    }

    /**
     * Relative get method. Useful if you want to go through the whole array or
     * through a continuous part; use {@link #pos(int, int)} to position.
     */
    public double get() throws IOException {
        return raf.readDouble();
    }

    /**
     * Relative set method. Useful if you want to go through the whole array or
     * through a continuous part; use {@link #pos(int, int)} to position.
     */
    public void set(double value) throws IOException {
        raf.writeDouble(value);
    }

    public int getRows() { return rows; }

    public int getCols() { return cols; }

    @Override
    public void close() throws IOException {
        raf.close();
    }

}

The presented FileDoubleMatrix supports relative get() and set() methods, which are very useful if you process your whole array or a continuous part of it (e.g. you iterate over it). Use the relative methods when you can, for faster operation.

Example using the FileDoubleMatrix:

final int rows = 10;
final int cols = 10;

try (FileDoubleMatrix arr = new FileDoubleMatrix(
            new File("array.dat"), rows, cols)) {

    System.out.println("BEFORE:");
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            System.out.print(arr.get(row, col) + " ");
        }
        System.out.println();
    }

    // Process the array; here we add each element's linear index to its value
    for (int row = 0; row < rows; row++)
        for (int col = 0; col < cols; col++)
            arr.set(row, col, arr.get(row, col) + (row * cols + col));

    System.out.println("\nAFTER:");
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++)
            System.out.print(arr.get(row, col) + " ");
        System.out.println();
    }
} catch (IOException e) {
    e.printStackTrace();
}

More about the relative get and set methods:

The absolute get and set methods require the position (row and column) of the element to be returned or set. The relative get and set methods do not require a position; they return or set the current element, which is determined by the file pointer of the underlying RandomAccessFile. The position can be set with the pos() method.

Whenever a relative get() or set() method is called, it implicitly moves the pointer to the next element, in row-major order (moving to the next element in the row, and when the end of a row is reached, moving to the first element of the next row, and so on).

For example here is how we can zero the whole array using the relative set method:

// Fill the whole array with zeros using relative set
// First position to the beginning:
arr.pos(0, 0);

// And execute a "set zero" operation
// as many times as many elements the array has:
for (int i = rows * cols; i > 0; i--)
    arr.set(0);

The relative get and set methods automatically move the pointer to the next element.

It should be obvious that in my implementation the absolute get and set methods also move the pointer, which must not be forgotten when relative and absolute get/set methods are mixed.

Another example: let's set the sum of each row as the last element of the row, but also include the last element in the sum! For this we will use a mixture of relative and absolute get/set methods:

// Start with the first row:
arr.pos(0, 0);

for (int row = 0; row < rows; row++) {
    double sum = 0;
    for (int col = 0; col < cols; col++)
        sum += arr.get(); // Relative get to calculate row sum

    // Now set the sum to the end of row.
    // For this we have to position back, so we use the absolute set.
    arr.set(row, cols - 1, sum);

    // The absolute set method also moves the pointer, and since
    // it is the end of row, it moves to the first of the next row.
}

And that's all. Using the relative get/set methods we don't have to pass the "matrix indices" when processing continuous parts of the array, and also the implementation does not have to move the internal pointer which is more than handy when processing millions of elements as in your example.

Other tips

Well, for trivia's sake, the matrix you're showing consumes roughly 2.6 GB of RAM. So that's a benchmark for how much memory you need should you decide to pursue that tack.

If it's efficient enough for you, you could store the rows of the matrix as BLOBs in a database. In that case you'd have 18658 rows, each with a serialized double[18658] stored in it.

I wouldn't suggest that though.

A better tack would be to use the image file directly, and look at NIO and byte buffers, using mmap to map it into your program's address space.

Then you can use things like DoubleBuffer to access the data. This lets the VM page in as much of the original file as necessary, and it also keeps the data off the Java heap (rather, it's stored in process RAM associated with the JVM). The big benefit is that it keeps these monster data structures away from the garbage collector.

You'll still need physical RAM on the machine, of course, but it's not Java Heap RAM.

But this would likely be the most efficient way to access this data from your process.
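As a rough sketch of the NIO approach described above (the file name, sizes, and helper method here are illustrative assumptions, not code from the question):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

public class MappedMatrix {

    /**
     * Writes a value at (row, col) through a memory-mapped DoubleBuffer
     * and reads it back, without allocating the matrix on the Java heap.
     */
    static double roundTrip(String file, int rows, int cols,
                            int row, int col, double value) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel ch = raf.getChannel()) {
            // Note: a single mapping is limited to Integer.MAX_VALUE bytes,
            // so a full 18658 x 18658 matrix (~2.6 GB) would have to be
            // split across several mappings (e.g. one per group of rows).
            DoubleBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE,
                                      0, (long) rows * cols * 8)
                                 .asDoubleBuffer();
            buf.put(row * cols + col, value);   // row-major indexing
            return buf.get(row * cols + col);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small demo sizes; the OS pages the file in and out as needed.
        System.out.println(roundTrip("matrix.dat", 100, 100, 3, 5, 42.0));
    }
}
```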

I would recommend the following things in order.

  1. Investigate why your app is running out of memory. Are you creating arrays or other objects bigger than what you need? I hope you have checked this already, but it's worth mentioning because it should not be ignored.
  2. If there is nothing wrong in step 1, check that you are not running with memory settings that are too low, or with a 32-bit JVM.
  3. If there is no issue with step 2: it's not always true that a lightweight database will give you the best performance. If you don't need to search the temporary data, you probably won't gain much from implementing a lightweight database. But if your application needs a lot of searching/querying of the temporary data, it may be a different case. If you don't need searching, a custom file format may be fast and efficient.
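Step 2 can be checked at runtime; here is a minimal sketch (note that the sun.arch.data.model property is HotSpot-specific and may be absent on other JVMs, so it is read with a fallback):

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (-Xmx, or the default).
        long maxHeapMb = Runtime.getRuntime().maxMemory() >> 20;
        // "32" or "64" on HotSpot; 32-bit JVMs cap the heap around 1.5-2 GB.
        String arch = System.getProperty("sun.arch.data.model", "unknown");
        System.out.println("Max heap: " + maxHeapMb + " MB, JVM: " + arch + "-bit");
    }
}
```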

I hope it helps you solve the issue at hand :)

The simplest fix would be simply to give your program more memory. For example, if you specify -Xmx11g on your Java command line, the JVM will be able to allocate up to 11 GB of heap space, enough to hold several copies of your array, which is around 2.6 GB in size, in memory at a time.

If speed is really not an issue, you can do this even if you don't have enough physical memory, by allocating enough virtual memory and letting the OS swap the memory to disk.

I personally also think this is the best solution. Memory on this scale is cheaper than programmer time.

I would suggest a different approach.

Since most image processing operations go over all of the pixels in some order exactly once, it's usually possible to perform them on one piece of the image at a time. What I mean is that there's usually no random access to the pixels of the image. If I'm not mistaken, all of the operations you mention in your question fit this description.

Therefore, I would suggest loading the image lazily, one piece at a time. Then, implement methods that retrieve the next chunk of pixels once the previous one is processed, and feed these chunks to the algorithms you use.

In order to support that, I would suggest converting the images to an uncompressed format for which you could easily create a lazy reader.
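The one-pass idea can be sketched as follows, assuming (purely for illustration) that the pixel values have been dumped to a raw file of big-endian doubles; mean and variance are accumulated with Welford's streaming algorithm, so the data set is never held in memory:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class ChunkedStats {

    /**
     * Returns { mean, population variance } of a raw file of doubles,
     * reading one value at a time (Welford's streaming algorithm).
     */
    static double[] meanAndVariance(String file) throws IOException {
        long n = 0;
        double mean = 0, m2 = 0; // running mean and sum of squared deviations
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                double x;
                try { x = in.readDouble(); } catch (EOFException e) { break; }
                n++;
                double delta = x - mean;
                mean += delta / n;
                m2 += delta * (x - mean);
            }
        }
        return new double[] { mean, n > 1 ? m2 / n : 0 };
    }
}
```

The same pattern extends to covariance (two running means plus a running cross-product), so none of the statistics mentioned in the question require the full pixel list in RAM at once.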

Not sure I would bother with a database for this; just open a temporary file, spill parts of your matrix in there as needed, and delete the file when you're done. Whatever solution you choose also depends somewhat on your matrix library being able to use it. If you're using a third-party library, you're probably limited to whatever options (if any) it provides. However, if you've implemented your own matrix operations, I would definitely just go with a temporary file that I manage myself. That will be the fastest and lightest-weight option.

You can use a split-and-reduce technique: split your image into small fragments, or use a sliding-window technique.

http://forums.ni.com/t5/Machine-Vision/sliding-window-technique/td-p/2586621

cheers,

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow