Question

more updates

As is explained in the selected answer, the problem is in JVM's garbage collection algorithm. JVM uses card marking algorithm to keep track of modified references in object fields. For each reference assignment to a field, it marks an associated bit in the card to be true -- this causes a false-sharing hence blocks scaling. The details are well described in this article: https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card

The option -XX:+UseCondCardMark (in Java 1.7u40 and up) mitigates the problem, and makes it scale almost perfectly.


updates

I found out (hinted from Park Eung-ju) that assigning an object into a field variable makes the difference. If I remove the assignment, it scales perfectly. I think probably it has something to do with Java memory model -- such as, an object reference must point to a valid address before it gets visible, but I am not completely sure. Both double and Object reference (likely) have 8 bytes size on 64 bit machine, so it seems to me that assigning a double value and an Object reference should be the same in terms of synchronization.

Anyone has a reasonable explanation?


Here I have a weird Java multi-threading scalability problem.

My code simply iterates over an array (using the visitor pattern) to compute simple floating-point operations and assign the result to another array. There is no data dependency, nor synchronization, so it should scale linearly (2x faster with 2 threads, 4x faster with 4 threads).

When primitive (double) array is used, it scales very well. When object type (e.g. String) array is used, it doesn't scale at all (even though the value of the String array is not used at all...)

Here's the entire source code:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;

class Table1 {
    public static final int SIZE1=200000000;
    public static final boolean OBJ_PARAM;

    static {
        String type=System.getProperty("arg.type");
        if ("double".equalsIgnoreCase(type)) {
            System.out.println("Using primitive (double) type arg");
            OBJ_PARAM = false;
        } else {
            System.out.println("Using object type arg");
            OBJ_PARAM = true;
        }
    }

    byte[] filled;
    int[] ivals;
    String[] strs;

    Table1(int size) {
        filled = new byte[size];
        ivals = new int[size];
        strs = new String[size];

        Arrays.fill(filled, (byte)1);
        Arrays.fill(ivals, 42);
        Arrays.fill(strs, "Strs");
    }

    public boolean iterate_range(int from, int to, MyVisitor v) {
        for (int i=from; i<to; i++) {
            if (filled[i]==1) {
                // XXX: Here we are passing double or String argument
                if (OBJ_PARAM) v.visit_obj(i, strs[i]);
                else v.visit(i, ivals[i]);
            }
        }
        return true;
    }
}

class HeadTable {
    byte[] filled;
    double[] dvals;
    boolean isEmpty;

    HeadTable(int size) {
        filled = new byte[size];
        dvals = new double[size];
        Arrays.fill(filled, (byte)0);

        isEmpty = true;
    }

    public boolean contains(int i, double d) {
        if (filled[i]==0) return false;

        if (dvals[i]==d) return true;
        return false;
    }

    public boolean contains(int i) {
        if (filled[i]==0) return false;

        return true;
    }
    public double groupby(int i) {
        assert filled[i]==1;
        return dvals[i];
    }
    public boolean insert(int i, double d) {
        if (filled[i]==1 && contains(i,d)) return false;

        if (isEmpty) isEmpty=false;
        filled[i]=1;
        dvals[i] = d;
        return true;
    }
    public boolean update(int i, double d) {
        assert filled[i]==1;
        dvals[i]=d;
        return true;
    }
}


class MyVisitor {
    public static final int NUM=128;

    int[] range = new int[2];
    Table1 table1;
    HeadTable head;
    double diff=0;
    int i;
    int iv;
    String sv;

    MyVisitor(Table1 _table1, HeadTable _head, int id) { 
        table1 = _table1;
        head = _head;
        int elems=Table1.SIZE1/NUM;
        range[0] = elems*id;
        range[1] = elems*(id+1);
    }

    public void run() {
        table1.iterate_range(range[0], range[1], this);
    }

    //YYY 1: with double argument, this function is called
    public boolean visit(int _i, int _v) {
        i = _i;
        iv = _v;
        insertDiff();
        return true;
    }

    //YYY 2: with String argument, this function is called
    public boolean visit_obj(int _i, Object _v) {
        i = _i;
        iv = 42;
        sv = (String)_v;
        insertDiff();
        return true;
    }

    public boolean insertDiff() {
        if (!head.contains(i)) {
            head.insert(i, diff);
            return true;
        }
        double old = head.groupby(i);
        double newval=Math.min(old, diff);
        head.update(i, newval);
        head.insert(i, diff);
        return true;
    }
}
public class ParTest1 {
    public static int THREAD_NUM=4;

    public static void main(String[] args) throws Exception {
        if (args.length>0) {
            THREAD_NUM = Integer.parseInt(args[0]);
            System.out.println("Setting THREAD_NUM:"+THREAD_NUM);
        }
        Table1 table1 = new Table1(Table1.SIZE1);
        HeadTable head = new HeadTable(Table1.SIZE1);

        MyVisitor[] visitors = new MyVisitor[MyVisitor.NUM];
        for (int i=0; i<visitors.length; i++) {
            visitors[i] = new MyVisitor(table1, head, i);
        }

        int taskPerThread = visitors.length / THREAD_NUM;
        MyThread[] threads = new MyThread[THREAD_NUM];
        CyclicBarrier barrier = new CyclicBarrier(THREAD_NUM+1);
        for (int i=0; i<THREAD_NUM; i++) {
            threads[i] = new MyThread(barrier);
            for (int j=taskPerThread*i; j<taskPerThread*(i+1); j++) {
                if (j>=visitors.length) break;
                threads[i].addVisitors(visitors[j]);
            }
        }

        Runtime r=Runtime.getRuntime();
        System.out.println("Force running gc");
        r.gc(); // running GC here (excluding GC effect)
        System.out.println("Running gc done");

        // not measuring 1st run (excluding JIT compilation effect)
        for (int i=0; i<THREAD_NUM; i++) {
            threads[i].start();
        }
        barrier.await();

        for (int i=0; i<10; i++) {
            MyThread.start = true;

            long s=System.currentTimeMillis();
            barrier.await();
            long e=System.currentTimeMillis();
            System.out.println("Iter "+i+"  Exec time:"+(e-s)/1000.0+"s");
        }
    }

}

class MyThread extends Thread {
    static volatile boolean start=true;
    static int tid=0;

    int id=0;
    ArrayList<MyVisitor> tasks;
    CyclicBarrier barrier;
    public MyThread(CyclicBarrier _barrier) {
        super("MyThread"+(tid++));

        barrier = _barrier;
        id=tid;
        tasks = new ArrayList(256);
    }

    void addVisitors(MyVisitor v) {
        tasks.add(v);
    }

    public void run() {
        while (true) {
            while (!start) { ; }

            for (int i=0; i<tasks.size(); i++) {
                MyVisitor v=tasks.get(i);
                v.run();
            }
            start = false;
            try { barrier.await();}
            catch (InterruptedException e) { break; }
            catch (Exception e) { throw new RuntimeException(e); }
        }
    }
}

The Java code can be compiled with no dependency, and you can run it with the following command:

java -Darg.type=double -server ParTest1 2

You pass the number of worker threads as an argument (the above uses 2 threads). After setting up the arrays (that is excluded from the measured time), it does a same operation for 10 times, printing out the execution time at each iteration. With the above option, it uses double array, and it scales very well with 1,2,4 threads (i.e. the execution time reduces to 1/2, and 1/4), but

java -Darg.type=Object -server ParTest1 2

With this option, it uses Object (String) array, and it doesn't scale at all! I measured the GC time, but it was insignificant (and I also forced running GC before measuring times). I have tested with Java 6 (updates 43) and Java 7 (updates 51), but it's the same.

The code has comments with XXX and YYY describing the difference when arg.type=double or arg.type=Object option is used.

Can you figure out what is going on with the String-type argument passing here?

Was it helpful?

Solution

HotSpot VM generates following assemblies for reference type putfield bytecode.

mov ref, OFFSET_OF_THE_FIELD(this)  <- this puts the new value for field.

mov this, REGISTER_A
shr 0x9, REGISTER_A
movabs OFFSET_X, REGISTER_B
mov %r12b, (REGISTER_A, REGISTER_B, 1)

putfield operation is completed in 1 instruction. but there are more instructions following.

They are "Card Marking" instructions. (http://www.ibm.com/developerworks/library/j-jtp11253/)

Writing reference field to every objects in a card (512 bytes), will store a value in a same memory address.

And I guess, store to same memory address from multiple cores mess up with cache and pipelines.

just add

byte[] garbage = new byte[600];

to MyVisitor definition.

then every MyVisitor instances will be spaced enough not to share card marking bit, you will see the program scales.

OTHER TIPS

This is not a complete answer but may provide a hint for you.

I have changed your code

Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];

Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, "Strs");
}

to

Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];

Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, new String("Strs"));
}

after this change, the running time with 4 threads with object type array reduced.

According to http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.7

For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64-bit value from one write, and the second 32 bits from another write.

Writes and reads of volatile long and double values are always atomic.

Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.

Assigning references are always atomic, and double is not atomic except when it is defined as volatile.

The problem is sv can be seen by other threads and its assignment is atomic. Therefore, wrapping visitor's member variables (i, iv, sv) using ThreadLocal will solve the problem.

"sv = (String)_v;" makes the difference. I also confirmed that the type casting is not the factor. Just accessing _v can't make the difference. Assigning some value to sv field makes the difference. But I can't explain why.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top