drmaa resubmission

drmaa resubmission

Matthias Bernt
Dear list,

I was thinking about implementing the job resubmission feature for drmaa.

I hope that I can simplify the job configuration for our installation
(and probably others as well) by escalating through different queues (or
resource limits). That way I hope to reduce the number of special cases
that I need to take care of.

I was wondering if there are others

- who are also interested in this feature and want to join? I would try
to give this project a head start in the next week.

- that may have started to work on this feature or just started to think
about it and want to share code/experience

Best,
Matthias
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/

Re: drmaa resubmission

John Chilton-4
We've done a lot of work in Galaxy dev on this problem over the last
few years - I'm not sure how much concrete progress we have made.

Nate started it and I did some work at the end of last year. Just to
summarize my most recent work on this - in
https://github.com/galaxyproject/galaxy/pull/3291/commits/b78287f1508db2c06f0c309ed8d3747adb4d17fa
I added some test cases for the existing job runner resubmission stuff
- it was mostly my attempt to understand what was there - hopefully the
examples in the form of test cases help you as well. This includes a
little test job_conf.xml file that describes how you can catch job
walltime and memory limit hits registered by the job runner and send
jobs to different destinations. This requires that the job runner knows
how to record these problems - which the SLURM job runner does - other
job runners like the generic drmaa runner may need to be subclassed to
check for these things in general.
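
A minimal sketch of the escalation idea described above, not the test file from the linked commit. It assumes that a destination in job_conf.xml can carry `resubmit` elements with `condition` and `destination` attributes, and that `walltime_reached` and `memory_limit_reached` are the condition names. The destination ids, the `nativeSpecification` values, and the helpers `load_resubmit_rules` and `escalate` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical job_conf.xml fragment: destination ids and the
# nativeSpecification values are made up for this example.
JOB_CONF = """
<job_config>
  <destinations default="short">
    <destination id="short" runner="slurm">
      <param id="nativeSpecification">--time=00:30:00 --mem=4G</param>
      <resubmit condition="walltime_reached" destination="long"/>
      <resubmit condition="memory_limit_reached" destination="bigmem"/>
    </destination>
    <destination id="long" runner="slurm">
      <param id="nativeSpecification">--time=24:00:00 --mem=4G</param>
      <resubmit condition="memory_limit_reached" destination="bigmem"/>
    </destination>
    <destination id="bigmem" runner="slurm">
      <param id="nativeSpecification">--time=24:00:00 --mem=64G</param>
    </destination>
  </destinations>
</job_config>
"""


def load_resubmit_rules(xml_text):
    """Return {destination_id: {condition: next_destination_id}}."""
    root = ET.fromstring(xml_text)
    rules = {}
    for dest in root.iter("destination"):
        rules[dest.get("id")] = {
            rule.get("condition"): rule.get("destination")
            for rule in dest.findall("resubmit")
        }
    return rules


def escalate(rules, start, failures):
    """Follow the rules for a sequence of reported failure conditions.

    Returns the destinations visited, stopping as soon as the current
    destination has no rule for the reported condition.
    """
    path = [start]
    current = start
    for condition in failures:
        next_dest = rules[current].get(condition)
        if next_dest is None:
            break
        path.append(next_dest)
        current = next_dest
    return path


rules = load_resubmit_rules(JOB_CONF)
# A job that first hits the walltime limit and then the memory limit.
print(escalate(rules, "short", ["walltime_reached", "memory_limit_reached"]))
```

The point of the sketch is only that each destination names its own fallback per failure type, so the special cases live in the configuration rather than in per-tool rules.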

In https://github.com/galaxyproject/galaxy/pull/3291/commits/7d52b28ab2ab0314cd4fa31108a6750cb9750ef3
I created a little DSL for resubmissions to make what can be expressed
in job_conf more powerful. Then I added variables to the expression
language such as seconds_since_queued, seconds_running
(https://github.com/galaxyproject/galaxy/pull/3291/commits/18eb1c8d0e4c3f7616d44fd177c90943695b7053),
and attempt number
(https://github.com/galaxyproject/galaxy/pull/3291/commits/7e338d790964f594ae67b33e6a72e1777e774b8c).
I also added the ability to resubmit on unknown job runner problems
here (https://github.com/galaxyproject/galaxy/pull/3291/commits/0559cff6e94b250ddd98275b119ab51b36491e34).
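
To make the idea of a condition language with variables concrete, here is a small sketch of how such expressions could be evaluated. It is not the code from the linked commits: the evaluator is a stand-in built on Python's `ast` module, and only `seconds_since_queued` and `seconds_running` are names taken from the message above. The variable names `attempt`, `unknown_error` and `walltime_reached`, and the function `evaluate`, are assumptions made for the example.

```python
import ast
import operator

# Comparison operators the sketch accepts; anything else is rejected.
_COMPARE = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def evaluate(expression, variables):
    """Evaluate a boolean condition over a fixed set of named variables."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.Compare):
            left = visit(node.left)
            for op, right_node in zip(node.ops, node.comparators):
                compare = _COMPARE.get(type(op))
                if compare is None:
                    raise ValueError("unsupported comparison")
                right = visit(right_node)
                if not compare(left, right):
                    return False
                left = right
            return True
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, bool)):
            return node.value
        # Calls, attribute access, arithmetic etc. are deliberately refused.
        raise ValueError("unsupported syntax: %s" % type(node).__name__)

    return visit(ast.parse(expression, mode="eval"))


# Hypothetical job state handed to the evaluator by a job runner.
state = {
    "attempt": 1,
    "seconds_running": 45,
    "seconds_since_queued": 300,
    "walltime_reached": False,
    "unknown_error": True,
}
print(evaluate("unknown_error and attempt < 3", state))
print(evaluate("walltime_reached or seconds_running > 3600", state))
```

Walking the syntax tree instead of calling `eval` keeps the condition language limited to names, numbers, comparisons and boolean operators, which seems a sensible property for strings that come from a configuration file.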

None of this is really documented outside the test cases - it is
waiting for someone to come along and find it useful.

I think the next thing I'd like to see for job resubmission besides
documentation and more job runner support for common runners is
described in this issue
(https://github.com/galaxyproject/galaxy/issues/3320) - all the
existing resubmission logic is based on errors detected by job
runners. If the underlying error manifests itself as a tool failure,
we need a way to reason about that, and currently we cannot.

Hope this helps.

-John

On Thu, Jun 15, 2017 at 10:37 AM, Matthias Bernt <[hidden email]> wrote:

> [...]

Re: drmaa resubmission

Matthias Bernt
Dear John,

thanks a lot for all the information. I guess I will need some time to
dig into this.

For drmaa, the wait() function of the python library seems to return
quite a bit of useful information: hasExited, hasCoreDump, hasSignal, and
terminatedSignal. I guess this would be of help.
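
The fields listed above could feed a resubmission decision roughly as follows. This is a minimal sketch, not Galaxy's DRMAA runner. `JobInfo` is a local stand-in whose field names are assumed to match the tuple returned by `Session.wait()` in drmaa-python (including `terminatedSignal`, `wasAborted` and `exitStatus`). The `classify` helper and its labels are hypothetical. Treating SIGKILL or SIGXCPU as a possible resource-limit hit is an assumption that depends on the scheduler and its configuration.

```python
from collections import namedtuple

# Stand-in for the result of drmaa-python's Session.wait(); field names are
# assumed to mirror the library's JobInfo tuple.
JobInfo = namedtuple(
    "JobInfo",
    "jobId hasExited hasSignal terminatedSignal hasCoreDump "
    "wasAborted exitStatus resourceUsage",
)

# Assumption: signals a scheduler might send when a limit is exceeded.
LIMIT_SIGNALS = {"SIGKILL", "SIGXCPU"}


def classify(info):
    """Map a JobInfo-like tuple to a coarse, hypothetical failure label."""
    if info.wasAborted:
        return "aborted"          # the job never ran
    if info.hasSignal:
        if info.hasCoreDump:
            return "crashed"      # the program itself dumped core
        if info.terminatedSignal in LIMIT_SIGNALS:
            return "possible_limit"
        return "signaled"
    if info.hasExited:
        return "ok" if info.exitStatus == 0 else "tool_error"
    return "unknown"


finished = JobInfo("101", True, False, None, False, False, 0, {})
killed = JobInfo("102", False, True, "SIGKILL", False, False, None, {})
print(classify(finished))
print(classify(killed))
```

A label such as "possible_limit" is only a hint. Telling a walltime hit apart from a memory hit would most likely need scheduler-specific information beyond what wait() reports.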

The problem seems to be that when the external run script is used, jobs
cannot be queried properly (see my other post). But I have not
understood this completely yet.

Cheers,
Matthias

On 15.06.2017 19:05, John Chilton wrote:

> [...]