Question

I need to run a Java application on a PBS cluster and I'm a bit unclear on how it should be behaving.

The application decides how many threads to start from the number of cores on the node it is running on: it starts 2 threads per core.

Ideally I would request full access to a node in the PBS cluster, that is, reserve all cores on that node. I haven't found a way to do this. All I see is the ppn parameter, which requests a specific number of cores per node. The nodes are heterogeneous, though, so the right ppn value depends on the type of node I get and I don't want to hard-code a single number.

If this is not possible, I need to understand how jobs behave when ppn is specified. I could instruct the Java application to create only X threads, but I don't think I would have any control over which cores those threads run on. Creating 2 threads per core is a rule of thumb for us, and it could happen that all threads want to run all the time; in that case I would be using 100% more CPU resources than I requested. Is my understanding correct that PBS won't enforce any limit on my process, but may monitor it and even kill it if it exceeds the resource usage that was specified?

TL;DR

So to summarize:

  1. Can I request full access to a node (reserve all the cores on the node I get for a job)?
  2. If I request only some fraction of the cores on a node, will PBS kill my job if I exceed that limit?

Solution

Can I request full access to a node (reserve all the cores on the node I get for a job)?

In conjunction with Moab, you can use the parameter you state in your comment:

#PBS -W x=NACCESSPOLICY:SINGLEJOB

This guarantees that Moab won't send any more jobs to the same node. This won't make all of the processors for each node show up in $PBS_NODEFILE, but it will allow you to use the entire node without stepping on anything else.
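To see that difference from inside the job, the application can compare the number of entries in $PBS_NODEFILE with the number of processors the JVM reports. The following is a minimal sketch, assuming the node file contains one hostname line per assigned core (the usual TORQUE layout); the class and method names are illustrative and not part of any PBS or TORQUE API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class NodeFileCheck {
    // Counts non-blank lines. Assumption: one line per assigned core.
    static long countSlots(List<String> lines) {
        return lines.stream().filter(line -> !line.trim().isEmpty()).count();
    }

    public static void main(String[] args) throws IOException {
        String nodeFile = System.getenv("PBS_NODEFILE");
        if (nodeFile == null) {
            System.out.println("PBS_NODEFILE is not set; probably not running under PBS.");
            return;
        }
        long slots = countSlots(Files.readAllLines(Paths.get(nodeFile)));
        int visible = Runtime.getRuntime().availableProcessors();
        System.out.println("slots in node file: " + slots
                + ", processors visible to the JVM: " + visible);
    }
}
```

With SINGLEJOB access and a small ppn request, the first number would be expected to stay at the requested value while the second reflects the whole node.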

If I request only some fraction of the cores on a node, will PBS kill my job if I exceed that limit?

No, it won't. By default, TORQUE does nothing to enforce that you use only the cores that you request. The caveat is that if TORQUE is configured to use cpusets, the cpuset will restrict your process to only the processors that you were assigned. If you are using TORQUE 3.0.0 or newer, you can add

#PBS -E

to your job script to lift that restriction. This tells the MOM (pbs_mom, the TORQUE daemon on the compute node) that you have exclusive access to the node and instructs it to put all of the CPUs in the machine in your cpuset.
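If exclusive access is not available, the application can cap its own thread count at what was requested, since TORQUE will not do so by default. The sketch below is one possible way to do that, not a TORQUE feature. It assumes the job environment exports the per-node core request in a variable named PBS_NUM_PPN; if your installation does not set it, pass the value in some other way. The class name ThreadPlan and its method are hypothetical.

```java
final class ThreadPlan {
    // Rule of thumb from the question: 2 threads per core.
    static final int THREADS_PER_CORE = 2;

    // visibleCores: processor count reported by the JVM.
    // requestedPpn: requested cores per node as text, or null if unknown.
    // An unparsable or non-positive request is ignored.
    static int threadCount(int visibleCores, String requestedPpn) {
        int cores = visibleCores;
        if (requestedPpn != null) {
            try {
                int ppn = Integer.parseInt(requestedPpn.trim());
                if (ppn > 0) {
                    cores = Math.min(cores, ppn);
                }
            } catch (NumberFormatException e) {
                // Fall back to the visible core count.
            }
        }
        return Math.max(1, cores) * THREADS_PER_CORE;
    }

    public static void main(String[] args) {
        int visible = Runtime.getRuntime().availableProcessors();
        // Assumption: PBS_NUM_PPN is exported by the batch system.
        String ppn = System.getenv("PBS_NUM_PPN");
        System.out.println("threads to start: " + threadCount(visible, ppn));
    }
}
```

Taking the minimum of the two values keeps the thread count within the request on a shared node, while still using every visible core when no request is known.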

Licensed under: CC-BY-SA with attribution