Question

I am trying to find a way to execute CPU-intensive parallel jobs over a cluster. My objective is to schedule one job per core, so that every job hopefully gets 100% CPU utilization once scheduled. This is what I have come up with so far:

FILE build_sshlogin.sh

#!/bin/bash

serverprefix="compute-0-"
lastserver=15
function worker {
    server="$serverprefix$1"
    # Sample /proc/stat twice, 2 seconds apart, on the remote node and
    # estimate the number of idle cores from the utilization delta.
    free=$(ssh "$server" /bin/bash << 'EOF'
        cores=$(grep -c "cpu MHz" /proc/cpuinfo)
        stat=$(head -n 1 /proc/stat)
        work1=$(echo "$stat" | awk '{print $2+$3+$4}')
        total1=$(echo "$stat" | awk '{print $2+$3+$4+$5+$6+$7+$8}')
        sleep 2
        stat=$(head -n 1 /proc/stat)
        work2=$(echo "$stat" | awk '{print $2+$3+$4}')
        total2=$(echo "$stat" | awk '{print $2+$3+$4+$5+$6+$7+$8}')

        util=$(echo "($work2 - $work1) / ($total2 - $total1)" | bc -l)
        # Free cores = total cores scaled by the idle fraction, rounded.
        echo "$cores * (1 - $util)" | bc -l | xargs printf "%1.0f"
EOF
    )
    # Note: the here-document terminator must start at the beginning of
    # the line; an indented "EOF" never terminates the here-document.

    # Emit "n/hostname" only if at least one core is free; GNU parallel
    # reads this as "run up to n jobs on hostname".
    if [ "${free:-0}" -gt 0 ]
    then
        echo "$free/$server"
    fi
}

export serverprefix
export -f worker

# Probe every node in parallel; -k keeps the output in input order.
seq 0 $lastserver | parallel -k worker {}

This script is used by GNU parallel as follows:

parallel --sshloginfile <(./build_sshlogin.sh) --workdir $PWD command args {1} :::  $(seq $runs) 

The problem with this technique is that if someone starts another CPU-intensive job on a server in the cluster without checking the CPU usage, the script will end up scheduling jobs to cores that are already busy. In addition, if the CPU usage has changed by the time the first job finishes, the newly freed cores will not be considered by GNU parallel when scheduling the remaining jobs.

So my question is the following: is there a way to make GNU parallel re-calculate the free cores per server before it schedules each job? Any other suggestions for solving the problem are welcome.

NOTE: In my cluster all cores have the same frequency. If someone can generalize to account for different frequencies, that's also welcome.

Solution

Look at --load, which is meant for exactly this situation.

Unfortunately it does not look at CPU utilization but at the load average. But if your cluster nodes do not have heavy disk I/O, CPU utilization will be very close to the load average.

Since the load average changes slowly, you probably also need to use the new --delay option to give the load average time to rise.
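
Combining the two with the invocation from the question might look like this (a sketch: nodes.txt is a hypothetical file listing one hostname per line, and the 100% threshold and 2-second delay are illustrative values, not recommendations):

# Do not start a new job on a node whose load average exceeds 100% of
# its core count, and wait 2 seconds between job starts so the load
# average has time to reflect the previously started jobs.
parallel --sshloginfile nodes.txt --load 100% --delay 2 \
    --workdir $PWD command args {1} ::: $(seq $runs)

Since --load is consulted each time a new job is about to start, cores freed up mid-run are picked up for the remaining jobs as well, which the static sshloginfile approach cannot do.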

OTHER TIPS

Try mpstat

$ mpstat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db)       07/09/2011

10:25:32 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:25:32 PM  all    5.68    0.00    0.49    2.03    0.01    0.02    0.00   91.77    146.55

That gives an overall snapshot; for a breakdown on a per-core basis:

$ mpstat -P ALL
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db)       07/09/2011      _x86_64_        (4 CPU)

10:28:04 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:28:04 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.99
10:28:04 PM    0    0.01    0.00    0.01    0.01    0.00    0.00    0.00    0.00   99.98
10:28:04 PM    1    0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00   99.98
10:28:04 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
10:28:04 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

There are lots of options; these two give a simple actual %idle per CPU. Check the man page.
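
If you wanted to feed this back into something like build_sshlogin.sh above, here is a rough sketch that counts cores which were more than 90% idle over a 2-second sample (the 90% threshold is arbitrary, and it assumes %idle is the last column of mpstat's output, as in recent sysstat versions):

# Count cores more than 90% idle over a 2-second sample, using the
# "Average:" block printed by mpstat; $2 is the CPU number, $NF is %idle.
mpstat -P ALL 2 1 | awk '/^Average/ && $2 ~ /^[0-9]+$/ && $NF > 90 {n++} END {print n+0}'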

Licensed under: CC-BY-SA with attribution