drmaa job status


drmaa job status

Matthias Bernt
Dear list,

I have two questions for all DRMAA users. Here is the first one.

I was checking how our queuing system (Univa GridEngine) and Galaxy
react if jobs are submitted that exceed run time or memory limits.

I found out that the python drmaa library cannot query the job status
after the job is finished (for both successful and unsuccessful jobs).

In lib/galaxy/jobs/runners/drmaa.py the following call raises an exception:
     self.ds.job_status( external_job_id )

Is this always the case? Or might this be a problem with our GridEngine?

I have attached some code for testing. Here the first call to
s.jobStatus(jobid) works, but the second one after s.wait(...) doesn't.
Instead I get "drmaa.errors.InvalidJobException: code 18: The job specified
by the 'jobid' does not exist."

The same error pops up in the Galaxy logs. The consequence is that jobs
that reached the limits are shown as completed successfully in Galaxy.

Interestingly, quite a bit of information can be obtained from the
return value of s.wait(). I was wondering if this can be used to
differentiate successful from failed jobs. In particular, hasExited,
hasSignal, and terminatedSignal differ in the two cases.
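
For reference, a minimal sketch of the test (not the attached drmaa-submit.py itself; it assumes the attached sleeper.sh as the job script):

    import drmaa

    s = drmaa.Session()
    s.initialize()

    jt = s.createJobTemplate()
    jt.remoteCommand = './sleeper.sh'   # assumed: the attached test script
    jobid = s.runJob(jt)

    print(s.jobStatus(jobid))           # works: the session still knows the job

    info = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

    # A second s.jobStatus(jobid) here raises
    # drmaa.errors.InvalidJobException (code 18) on our Univa GridEngine.

    # The JobInfo returned by wait() still distinguishes the two cases:
    if info.hasExited and info.exitStatus == 0:
        print('finished successfully')
    elif info.hasSignal:
        print('killed by signal', info.terminatedSignal)   # e.g. a limit was exceeded
    else:
        print('failed with exit status', info.exitStatus)

    s.deleteJobTemplate(jt)
    s.exit()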

Cheers,
Matthias



Attachments: drmaa-submit.py (2K), sleeper.sh (46 bytes)

Re: drmaa job status

Nate Coraor (nate@bx.psu.edu)
Hi Matthias,

I can't speak for GridEngine's specific behavior because I haven't used it in a long time, but it's not surprising that jobs "disappear" as soon as they've exited. Unfortunately, Galaxy uses periodic polling rather than waiting on completion. We'd need to create a thread per submitted job, unless you can still get job exit details by looping over the jobs with a timeout wait.
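
For illustration, such a timeout-wait loop might look roughly like this with python-drmaa (a sketch, not code from Galaxy):

    import drmaa

    def reap_finished(session, job_ids):
        """Return {jobid: JobInfo} for finished jobs and drop them from job_ids."""
        finished = {}
        for jobid in list(job_ids):
            try:
                # non-blocking wait: returns JobInfo once the job has finished
                finished[jobid] = session.wait(jobid, drmaa.Session.TIMEOUT_NO_WAIT)
                job_ids.remove(jobid)
            except drmaa.errors.ExitTimeoutException:
                pass   # still running; check again on the next poll
        return finished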

You can gain some control over how Galaxy handles InvalidJobException exceptions with the drmaa job runner plugin params, see here:

https://github.com/galaxyproject/galaxy/blob/dev/config/job_conf.xml.sample_advanced#L9
However, if normally finished jobs also result in InvalidJobException, that probably won't help. Alternatively, you could create a DRMAAJobRunner subclass for GridEngine like we've done for Slurm that does some digging to learn more about terminal jobs.
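
For reference, the relevant plugin section in job_conf.xml looks roughly like this (a sketch; the parameter names are taken from the sample file linked above, so please verify them against your Galaxy version):

    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
            <!-- state to assign when the DRM no longer knows the job id -->
            <param id="invalidjobexception_state">ok</param>
            <param id="invalidjobexception_retries">0</param>
        </plugin>
    </plugins>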

--nate


Re: drmaa job status

Curtis Hendrickson
In reply to this post by Matthias Bernt
Matthias,

We have had this problem on our SGE-based installation for years. We referred to it as the "green screen of death": it would allow a biologist to continue an analysis using output that was partial at best, often resulting in a seemingly successful completion of the entire analysis but completely bogus results (say, cuffdiff killed halfway through the genome, but shown green in Galaxy, so no transcripts on the smaller chromosomes, yet no error either).

We ended up implementing an external reaper that detected these killed jobs from SGE and notified the user and Galaxy post hoc. It was not a very satisfactory solution. We are currently moving to Slurm for other reasons and hope the problem will not be present there.

Regards,
Curtis



Re: drmaa job status

Matthias Bernt
In reply to this post by Nate Coraor (nate@bx.psu.edu)
Dear Nate and Curtis,

I read a bit in the documentation of the Python and the underlying C
library and played around a bit.

> I can't speak for GridEngine's specific behavior because I haven't used
> it in a long time, but it's not surprising that jobs "disappear" as soon
> as they've exited. Unfortunately, Galaxy uses periodic polling rather
> than waiting on completion. We'd need to create a thread-per-submitted
> job unless you can still get job exit details by looping over jobs with
> a timeout wait.
>
> You can gain some control over how Galaxy handles InvalidJobException
> exceptions with drmaa job runner plugin params, see here:
>
> https://github.com/galaxyproject/galaxy/blob/dev/config/job_conf.xml.sample_advanced#L9
>
> However, if normally finished jobs also result in InvalidJobException,
> that probably won't help. Alternatively, you could create a
> DRMAAJobRunner subclass for GridEngine like we've done for Slurm that
> does some digging to learn more about terminal jobs.

I found a way to get the information. The problem in my script was that
wait() "reaps" (that is the term used by python-drmaa) all information on
the job from the session. Hence, calls to jobStatus() after wait() fail.
The solution here is to use synchronize() with the parameter
dispose=False, see the attached file. Alternatively, one can also wait
until the job status is DONE or FAILED.
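
A simplified sketch of that fix (not the attached file itself; it again assumes a sleeper.sh test script):

    import drmaa

    s = drmaa.Session()
    s.initialize()

    jt = s.createJobTemplate()
    jt.remoteCommand = './sleeper.sh'   # assumed test script
    jobid = s.runJob(jt)

    # Block until the job finishes, but keep its information in the session.
    s.synchronize([jobid], drmaa.Session.TIMEOUT_WAIT_FOREVER, False)

    print(s.jobStatus(jobid))           # now DONE or FAILED instead of raising
    info = s.wait(jobid, drmaa.Session.TIMEOUT_NO_WAIT)   # reaps the job, returns JobInfo
    print(info.hasExited, info.exitStatus)

    s.deleteJobTemplate(jt)
    s.exit()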

But this does not seem to be the source of the problem within Galaxy,
since Galaxy never calls wait(). The problem seems to be that an external
Python script submits the job in another session (when jobs are submitted
as the real user), and jobs created in another session cannot be queried.
(The documentation is a bit vague here: in the C library documentation of
SGE I read that it is definitely not possible to query finished jobs.)

I tried whether it is possible to pickle the session, without success.

Does anyone have an idea how one could pass the active DRMAA session from
Galaxy to the external script?

 > We have had this problem on our SGE based installation for years. We
 > referred to it as the "green screen of death" - as it would allow a
 > biologist to continue analysis using output that was partial, at best,
 > often resulting in seemingly successful completion of the entire
 > analysis, but completely bogus results (say, cuffdiff killed half way
 > through the genome, but it's green in galaxy, so no transcripts on the
 > smaller chromosomes, but no error, either).

Did you use submission as the real user? Or does the problem also appear
if jobs are submitted as the single user running Galaxy?

 > We ended up implementing an external reaper that detected these killed
 > jobs from SGE, and notified the user and galaxy post-hoc. It was not a
 > very satisfactory solution. We are currently moving to SLURM for other
 > reasons and hope the problem will not be present there.

I was also thinking about not using the Python library at all and instead
using system calls to qsub, qkill, qacct, etc. Any opinions on this idea?
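
For example, something along these lines (a rough sketch; the qacct field names are as printed by a typical SGE installation and may differ on Univa):

    import subprocess

    def qacct_exit_info(jobid):
        """Read the 'failed' and 'exit_status' fields for a finished job."""
        out = subprocess.check_output(['qacct', '-j', str(jobid)],
                                      universal_newlines=True)
        fields = {}
        for line in out.splitlines():
            parts = line.split(None, 1)
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        # failed != "0" or exit_status != "0" indicates the job hit a limit
        # or otherwise did not finish cleanly
        return fields.get('failed'), fields.get('exit_status')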

I guess your reaper could also be of interest to others. Is it available
somewhere?

Best,
Matthias



Attachment: drmaa-submit.py (2K)