I'm running a fork of galaxy-central latest_2014.08.11. The instance is
configured to run jobs on a SLURM cluster. The problem is that the SLURM
controller sometimes becomes too busy, which results in errors like:
galaxy.jobs.runners.drmaa INFO 2014-10-23 21:10:47,768 (1813/22896754) job
left DRM queue with following message: code 1: slurm_load_jobs error:
Socket timed out on send/recv operation,job_id: 22896754
This causes Galaxy to assume that the job has failed:
galaxy.jobs.runners ERROR 2014-10-23 21:10:47,881 (1813/22896754) Job
output not returned from cluster: [Errno 2] No such file or directory:
This happens with both galaxy.jobs.runners.drmaa:DRMAAJobRunner and
galaxy.jobs.runners.slurm:SlurmJobRunner. Is there any way to handle this
condition in Galaxy?
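
To illustrate the behaviour I am hoping for, here is a minimal sketch. It is not Galaxy code: `is_transient`, `check_with_retry`, `query_state` and the `TRANSIENT_MARKERS` tuple are hypothetical names I made up for this example. The idea is to treat the "Socket timed out" message as a temporary controller problem and retry the status check a few times before declaring the job failed.

```python
import time

# Assumption: substrings that indicate a busy controller rather than a real
# job failure. Only the message from the log above is listed here.
TRANSIENT_MARKERS = ("Socket timed out on send/recv operation",)


def is_transient(message):
    """Return True if the DRM error message looks like a temporary problem."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def check_with_retry(query_state, max_retries=5, delay=30.0, sleep=time.sleep):
    """Call query_state() until it succeeds or a non-transient error occurs.

    query_state is a hypothetical callable that returns the job state, or
    raises an exception whose text is the DRM error message.
    """
    for attempt in range(max_retries + 1):
        try:
            return query_state()
        except Exception as exc:
            # Give up on real errors, or once the retry budget is exhausted.
            if not is_transient(str(exc)) or attempt == max_retries:
                raise
            sleep(delay)
```

With something like this, a job would only be marked as failed if the controller stayed unreachable for all retries, or if the error was of a different kind.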