Running a job on multiple nodes of a GridEngine cluster

https://stackoverflow.com/questions/3872977

28-09-2019

Question

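I have access to a 128-core cluster on which I would like to run a parallelised job. The cluster uses Sun GridEngine, and my program is written to run with Parallel Python, NumPy and SciPy on Python 2.5.8. Running the job on a single node (4 cores) gives roughly a 3.5x speedup over a single core. I now want to take this to the next level and split the job across about 4 nodes. My qsub script looks something like this: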

#!/bin/bash
# The name of the job, can be whatever makes sense to you
#$ -N jobname

# The job should be placed into the queue 'all.q'.
#$ -q all.q

# Redirect output stream to this file.
#$ -o jobname_output.dat

# Redirect error stream to this file.

#$ -e jobname_error.dat

# The batchsystem should use the current directory as working directory.
# Both files will be placed in the current
# directory. The batchsystem assumes to find the executable in this directory.
#$ -cwd

# request Bourne shell as shell for job.
#$ -S /bin/sh

# print date and time
date

# spython is the server's version of Python 2.5. Using python instead of spython causes the program to run in python 2.3
spython programname.py

# print date and time again
date

Does anyone know how to do this?

Solution

Yes, you need to include the Grid Engine option -np 16, either in your script like this:

# Use 16 processors
#$ -np 16

or on the command line when you submit the script. Or, for a more permanent arrangement, use an .sge_request file.
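As a rough sketch of where such a request sits in the submit script from the question (an illustration under stated assumptions, not a tested recipe for any particular cluster): the fragment below uses the stock qsub option -pe <pe_name> <slots> to request 16 slots, which is the form to try if an installation does not accept -np. The parallel environment name mpi is hypothetical; qconf -spl lists the ones actually configured.

```shell
#!/bin/bash
#$ -N jobname
#$ -q all.q
#$ -cwd
# Assumption: a parallel environment named 'mpi' exists on this cluster.
#$ -pe mpi 16

# Grid Engine exports NSLOTS for jobs running in a parallel environment;
# fall back to 1 so the script also runs outside the batch system.
slots="${NSLOTS:-1}"
echo "granted slots: $slots"
```

The same options can instead be given on the command line (for example qsub -pe mpi 16 followed by the script name) or placed, one per line, in an .sge_request file in the home or working directory.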

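On all the GE installations I've ever used, this will give you 16 processors (or processor cores, these days) on as few nodes as necessary. So if your nodes have 4 cores you'll get 4 nodes, if they have 8 cores you'll get 2, and so forth. Placing the job on, say, 2 cores on each of 8 nodes (which you might want to do if each process needs a lot of memory) is a little more complicated, and you should consult your support team.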
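The node arithmetic above can be sketched as a small shell helper. The function name nodes_needed is made up for illustration, and the result is only the minimum node count when slots are packed as tightly as possible; the actual placement depends on how the site's parallel environment is configured (its allocation_rule), which is why spreading a job 2 cores per node is a question for the administrators. The last part assumes the job runs inside a parallel environment, where Grid Engine names the granted hosts in the file pointed to by PE_HOSTFILE.

```shell
#!/bin/bash
# nodes_needed SLOTS CORES_PER_NODE
# Prints the smallest number of nodes that can hold SLOTS slots,
# i.e. the ceiling of SLOTS / CORES_PER_NODE.
nodes_needed() {
    local slots=$1 cores=$2
    echo $(( (slots + cores - 1) / cores ))
}

nodes_needed 16 4   # 4-core nodes: prints 4
nodes_needed 16 8   # 8-core nodes: prints 2

# Inside a parallel job, list the hosts and slot counts actually granted.
# Each line of the host file starts with the host name and its slot count.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    awk '{ print $1 ": " $2 " slots" }' "$PE_HOSTFILE"
fi
```

Run outside the batch system, the script only prints the two node counts; the host listing is skipped because PE_HOSTFILE is unset.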

Licensed under: CC-BY-SA with attribution
