Question

This question is related to an earlier one, "pbs job no output when busy": some of the jobs I submit produce no output when PBS/Torque is 'busy'. I imagine the system is busier when many jobs are submitted one after another, and, as it happens, some of the jobs submitted in this fashion often produce no output.

Here is some code.

Suppose I have a Python script called "x_analyse.py" that takes as its input a file containing some data and analyses the data stored in the file:

./x_analyse.py data_1.pkl

Now, suppose I need to: (1) prepare N such data files, data_1.pkl, data_2.pkl, ..., data_N.pkl; (2) have "x_analyse.py" work on each of them and write the results to a file for each; and (3) since the analyses of the different data files are all independent of one another, use PBS/Torque to run them in parallel to save time. (I think this is essentially an 'embarrassingly parallel' problem.)

I have this Python script to do the above:

import os
import sys
import time

N = 100

for k in range(1, N+1):
    datafilename = 'data_%d' % k
    f = open(datafilename + '.pkl', 'wb')
    # Prepare data set k, and save it in the file
    f.close()

    # Write a PBS submit file for analysing this data set
    jobname = 'analysis_%d' % k
    f = open(jobname + '.sub', 'w')
    f.writelines([ '#!/bin/bash\n',
                   '#PBS -N %s\n' % jobname,
                   '#PBS -o %s\n' % (jobname + '.out'),
                   '#PBS -q compute\n',
                   '#PBS -j oe\n',
                   '#PBS -l nodes=1:ppn=1\n',
                   '#PBS -l walltime=5:00:00\n',
                   'cd $PBS_O_WORKDIR\n',
                   '\n',
                   './x_analyse.py %s\n' % (datafilename + '.pkl') ])
    f.close()

    # Submit the job, then pause briefly before the next submission
    os.system('qsub %s' % (jobname + '.sub'))

    time.sleep(2.)

The script prepares a set of data to be analysed, saves it to a file, writes a PBS submit file for analysing that data set, submits the job, and then moves on to the next data set, and so on.

As it is, when the script is run, a list of job IDs is printed to standard output as the jobs are submitted. 'ls' shows that there are N .sub files and N .pkl data files. 'qstat' shows that all the jobs run with status 'R' and then complete with status 'C'. However, afterwards 'ls' shows that there are fewer than N .out output files and fewer than N result files produced by "x_analyse.py". In effect, no output is produced by some of the jobs. If I clear everything and re-run the script, I get the same behaviour, with some jobs (but not necessarily the same ones as last time) producing no output.

It has been suggested that things improve if the waiting time between consecutive job submissions is increased, e.g.

time.sleep(10.) #or some other waiting time

But I feel this is not the most satisfactory solution: I have tried waiting times of 0.1 s, 0.5 s, 1.0 s, 2.0 s and 3.0 s, none of which really helped. I have been told that a 50 s waiting time seems to work fine, but if I have to submit 100 jobs, the total waiting time will be about 5000 s, which seems awfully long.
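
For what it's worth, an alternative to a fixed sleep would be to throttle submissions against the length of my own queue. The sketch below is only an idea, not something I have tested: the helper name, the 20-job threshold and the 5-second polling interval are arbitrary, the job count is a crude line count over 'qstat -u', and it assumes Python 2.7+ for subprocess.check_output.

import getpass
import subprocess
import time

# Rough sketch: before each qsub, wait until the number of my queued/running
# jobs drops below a threshold, instead of sleeping for a fixed time.
def wait_for_queue_slot(max_jobs=20, poll_interval=5.0):
    user = getpass.getuser()
    while True:
        out = subprocess.check_output(['qstat', '-u', user]).decode()
        # Crude count: one line per job that mentions my username.
        njobs = sum(1 for line in out.splitlines() if user in line)
        if njobs < max_jobs:
            return
        time.sleep(poll_interval)

wait_for_queue_slot() would then be called just before each 'qsub' in the loop above.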

I have tried reducing the number of times 'qsub' is called by submitting a job array instead. I prepare all the data files as before, but have only one submit file, "analyse_all.sub":

#!/bin/bash
#PBS -N analyse
#PBS -o analyse.out
#PBS -q compute
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l walltime=5:00:00
cd $PBS_O_WORKDIR

./x_analyse.py data_$PBS_ARRAYID.pkl

and then submit with

qsub -t 1-100 analyse_all.sub

But even with this, some jobs still do not produce output.

Is this a common problem? Am I doing something wrong? Is waiting between job submissions the best solution? Can I do anything to improve this?

Thanks in advance for any help.

Edit 1:

I'm using Torque version 2.4.7 and Maui version 3.3.

Also, suppose the job with ID 1184430.mgt1 produces no output and the job with ID 1184431.mgt1 produces output as expected. When I use 'tracejob' on these jobs I get the following:

[batman@gotham tmp]$tracejob 1184430.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184430.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_1, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Exit_status=135 resources_used.cput=00:00:00  resources_used.mem=15596kb resources_used.vmem=150200kb resources_used.walltime=00:01:35
12/13/2012 13:54:53  S    Post job file processing error
12/13/2012 13:54:53  S    Email 'o' to batman@mgt1 failed: Child process '/usr/lib/sendmail -f adm batman@mgt1' returned 67 (errno 10:No child processes)
[batman@gotham tmp]$tracejob 1184431.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184431.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_2, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Exit_status=0 resources_used.cput=00:00:16 resources_used.mem=19804kb resources_used.vmem=154364kb resources_used.walltime=00:00:18

Edit 2: For a job that produces no output, 'qstat -f' returns the following:

[batman@gotham tmp]$qstat -f 1184673.mgt1
Job Id: 1184673.mgt1   
Job_Name = analysis_7
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 17572kb
resources_used.vmem = 152020kb
resources_used.walltime = 00:01:36
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:00:31 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_7.e1184673
exec_host = node26/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:02:07 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_7.out
Priority = 0
qtime = Fri Dec 14 14:00:31 2012
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 9397
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8, PBS_O_LOGNAME=batman,
    PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
    ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
    in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
    n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
    PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
    PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
    PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
sched_hint = Post job file processing error; job 1184673.mgt1 on host node
    26/0Unknown resource type  REJHOST=node26 MSG=invalid home directory '
    /gpfs1/batman' specified, errno=116 (Stale NFS file handle)
etime = Fri Dec 14 14:00:31 2012
exit_status = 135  
submit_args = analysis_7.sub
start_time = Fri Dec 14 14:00:31 2012
Walltime.Remaining = 1790
start_count = 1
fault_tolerant = False
comp_time = Fri Dec 14 14:02:07 2012

as compared with a job that produces output:

[batman@gotham tmp]$qstat -f 1184687.mgt1
Job Id: 1184687.mgt1
Job_Name = analysis_1
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 19652kb
resources_used.vmem = 162356kb
resources_used.walltime = 00:02:38
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:40:46 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_1.e118468
    7
exec_host = ionode2/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:43:24 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_1.out
Priority = 0
qtime = Fri Dec 14 14:40:46 2012
Rerunable = True   
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 28039 
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
    PBS_O_LOGNAME=batman,
    PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
    ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
    in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
    n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
    PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
    PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
    PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
etime = Fri Dec 14 14:40:46 2012
exit_status = 0
submit_args = analysis_1.sub
start_time = Fri Dec 14 14:40:47 2012
Walltime.Remaining = 1784
start_count = 1

It appears that the exit status of one job is 0 while that of the other is 135.

Edit 3:

From 'qstat -f' outputs like the ones above, it seems that the problem has something to do with a 'Stale NFS file handle' during post-job file processing. By submitting hundreds of test jobs, I have been able to identify a number of nodes that produce failed jobs. By sshing onto these nodes, I can find the missing PBS output files in /var/spool/torque/spool, where I can also see output files belonging to other users. One strange thing about these problematic nodes is that if one of them is the only node chosen for a job, the job runs fine on it. The problem only arises when they are mixed with other nodes.
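
For reference, the per-node failure counts can be assembled by mapping each completed test job to its exec_host (as reported by 'qstat -f') and checking for the expected result file. This is only a sketch: 'jobids' and 'expected_output' are hypothetical placeholders built while submitting the test jobs, and it assumes Python 2.7+ for subprocess.check_output.

import os
import re
import subprocess
from collections import defaultdict

# Sketch: count, per execution host, how many test jobs failed to produce
# their expected output file. expected_output maps job ID -> result file path.
def failures_per_node(jobids, expected_output):
    failures = defaultdict(int)
    for jobid in jobids:
        info = subprocess.check_output(['qstat', '-f', jobid]).decode()
        m = re.search(r'exec_host = (\S+)', info)
        if m is None:
            continue
        node = m.group(1).split('/')[0]   # e.g. 'node26/0' -> 'node26'
        if not os.path.exists(expected_output[jobid]):
            failures[node] += 1
    return dict(failures)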

Since I do not know how to fix the 'Stale NFS file handle' error in post-job file processing, I avoid these nodes by submitting 'dummy' jobs to them

echo sleep 60 | qsub -lnodes=badnode1:ppn=2+badnode2:ppn=2

before submitting the real jobs. Now all jobs produce output as expected, and there is no need to wait between consecutive submissions.
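
The same 'dummy' submission can be built for an arbitrary list of bad nodes; a small sketch (the node names below are placeholders):

import os

# Sketch: occupy every known-bad node with a short sleep job.
bad_nodes = ['badnode1', 'badnode2']
nodespec = '+'.join('%s:ppn=2' % n for n in bad_nodes)
os.system('echo sleep 60 | qsub -l nodes=%s' % nodespec)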

Solution

I see two issues in the tracejob output from the failed job.

The first is Exit_status=135. This is not a Torque error code but an exit status returned by the script itself, i.e. x_analyse.py. Python has no fixed convention for the values passed to sys.exit(), so the source of the 135 code might be one of the modules used in the script.
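
To help tell script-side failures apart from Torque-side ones, it might be worth making the exit codes of x_analyse.py explicit. A minimal sketch, not your actual script, with an assumed structure:

#!/usr/bin/env python
import sys

# Sketch: return explicit exit codes so a non-zero Exit_status in tracejob
# can be traced back to the analysis script itself.
def main(path):
    # ... load the data in 'path', analyse it, write the result file ...
    return 0

if __name__ == '__main__':
    try:
        sys.exit(main(sys.argv[1]))
    except Exception as exc:
        sys.stderr.write('analysis failed: %s\n' % exc)
        sys.exit(1)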

The second issue is the 'Post job file processing error'. This might indicate a misconfigured node.

From here on I am guessing. Since a successful job takes about 00:00:16 of walltime, with a delay of 50 seconds all of your jobs probably land on the first available node. With a smaller delay more nodes get involved, and eventually you hit a misconfigured node or have two scripts executing concurrently on a single node. I would modify the submission script by adding the line

  'echo $PBS_JOBID :: $(hostname) >> debug.log\n',

to the list passed to writelines() in the Python script that generates the .sub files. This would record the name of each job's execution host in debug.log, which would reside on a shared filesystem, if I have understood your setup correctly.
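
Concretely, the writelines() call in your generator script would become something like the following (only the echo line is new; $(hostname) records the execution node rather than the submission host):

f.writelines([ '#!/bin/bash\n',
               # ... the #PBS directives as before ...
               'cd $PBS_O_WORKDIR\n',
               'echo $PBS_JOBID :: $(hostname) >> debug.log\n',
               './x_analyse.py %s\n' % (datafilename + '.pkl') ])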

Then you (or the Torque admin) might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.
