Per-tool configuration


Per-tool configuration

Jan Kanis
I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?

Jan

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Per-tool configuration

John Chilton-4
I would have different answers for you depending on what options are
available to the server admin. What exactly about the tool is
configurable - can you be more specific?

-John

On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <[hidden email]> wrote:

> I am writing a tool that should be configurable by the server admin. I am
> considering adding a configuration file, but where should such a file be
> placed? Is the tool-data directory the right place? Is there another
> standard way for per-tool configuration?
>
> Jan
>

Re: Per-tool configuration

Jan Kanis

I have two use cases: the first is a modification of the NCBI BLAST wrapper to limit the query input size (for a publicly accessible Galaxy instance), so it needs a configuration option for the query size limit. I was thinking about a separate config file in tool-data for this.

The second is for a tool I have written to convert BLAST XML output into an HTML report. The report contains links for each match to a gene bank (e.g. the NCBI database). These links should be configurable per database that was searched, and preferably have an option of linking to the location of the match within the gene if the gene bank supports such links. One option is to add an extra column to the BLAST .loc files (if that doesn't break BLAST), where the databases are already configured.
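For illustration only, reading such an extended .loc file might look like the sketch below; the column layout, file name, and the idea of appending a URL-template column are assumptions, not the actual BLAST .loc format.

```python
# Hypothetical sketch: read a tab-separated BLAST .loc file where an
# optional extra column holds a gene-bank URL template. Rows written
# before the new column was added simply lack it.
import csv

def read_db_links(loc_path):
    """Map database ID -> URL template (None if the row has no extra column)."""
    links = {}
    with open(loc_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comments
            # Assumed columns: value, name, path, then the new URL template.
            links[row[0]] = row[3] if len(row) > 3 else None
    return links
```

Older rows without the extra column fall back to `None`, so existing tools that only read the first three columns should be unaffected.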

Jan

On 13 Jun 2014 18:02, "John Chilton" <[hidden email]> wrote:
I would have different answers for you depending on what options are
available to the server admin. What exactly about the tool is
configurable - can you be more specific?

-John

On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <[hidden email]> wrote:
> I am writing a tool that should be configurable by the server admin. I am
> considering adding a configuration file, but where should such a file be
> placed? Is the tool-data directory the right place? Is there another
> standard way for per-tool configuration?
>
> Jan
>


Re: Per-tool configuration

John Chilton-4
Hello Jan,

Thanks for the clarification. Not quite what I was expecting so I am
glad I asked - I don't have great answers for either case so hopefully
other people will have some ideas.

For the first use case - I would just specify some default input to
supply to the input wrapper - let's call this N - add a parameter to
the tool wrapper "--limit-size=N" - test that, and then allow it to be
overridden via an environment variable - so in your command block use
"--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit
is set, but deployers can set limits. There are a number of ways to
set such variables - DRM-specific environment files, login rc files,
etc. Just this last release I added the ability to define
environment variables right in job_conf.xml
(https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specification-of-environment/diff).
I thought the tool shed might have a way to collect such definitions
as well and insert them into package files - but Google failed to find
this for me.
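The pattern John describes can be sketched in shell (the variable name and default value here are illustrative):

```shell
# A default limit baked into the tool command that a deployer can
# override via the environment. In POSIX shell, ${VAR:-default}
# expands to VAR if it is set and non-empty, otherwise to the default.
BLAST_QUERY_LIMIT="${BLAST_QUERY_LIMIT:-1000000}"
echo "using query size limit: $BLAST_QUERY_LIMIT"
```

In a Galaxy tool's Cheetah command block the dollar sign needs escaping, e.g. `\${BLAST_QUERY_LIMIT:-1000000}`, so that Cheetah passes it through to the shell unchanged.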

Not sure about how to proceed with the second use case - extending the
.loc file should work locally - I am not sure it is feasible within
the context of the existing tool shed tools, data managers, etc. You
could certainly duplicate this stuff with your modifications - this
has downsides in terms of interoperability though.

Sorry I don't have great answers for either question,
-John




On Sat, Jun 14, 2014 at 5:12 AM, Jan Kanis <[hidden email]> wrote:

> I have two use cases: the first is a modification of the NCBI BLAST
> wrapper to limit the query input size (for a publicly accessible Galaxy
> instance), so it needs a configuration option for the query size limit. I
> was thinking about a separate config file in tool-data for this.
>
> The second is for a tool I have written to convert BLAST XML output
> into an HTML report. The report contains links for each match to a gene bank
> (e.g. the NCBI database). These links should be configurable per database
> that was searched, and preferably have an option of linking to the location
> of the match within the gene if the gene bank supports such links. One
> option is to add an extra column to the BLAST .loc files (if that doesn't
> break BLAST), where the databases are already configured.
>
> Jan
>
> On 13 Jun 2014 18:02, "John Chilton" <[hidden email]> wrote:
>
>> I would have different answers for your depending on what options are
>> available to the server admin. What exactly about the tool is
>> configurable - can you be more specific?
>>
>> -John
>>
>> On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <[hidden email]> wrote:
>> > I am writing a tool that should be configurable by the server admin. I
>> > am
>> > considering adding a configuration file, but where should such a file be
>> > placed? Is the tool-data directory the right place? Is there another
>> > standard way for per-tool configuration?
>> >
>> > Jan
>> >

Re: Per-tool configuration

Peter Cock
On Mon, Jun 16, 2014 at 4:18 AM, John Chilton <[hidden email]> wrote:

> Hello Jan,
>
> Thanks for the clarification. Not quite what I was expecting so I am
> glad I asked - I don't have great answers for either case so hopefully
> other people will have some ideas.
>
> For the first use case - I would just specify some default input to
> supply to the input wrapper - let's call this N - add a parameter to
> the tool wrapper "--limit-size=N" - test that, and then allow it to be
> overridden via an environment variable - so in your command block use
> "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit
> is set, but deployers can set limits. There are a number of ways to
> set such variables - DRM-specific environment files, login rc files,
> etc. Just this last release I added the ability to define
> environment variables right in job_conf.xml
> (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specification-of-environment/diff).
> I thought the tool shed might have a way to collect such definitions
> as well and insert them into package files - but Google failed to find
> this for me.

Hmm. Jan emailed me off list earlier about this. We could insert
a pre-BLAST script to check the size of the query FASTA file,
and abort if it is too large (e.g. number of queries, total sequence
length, perhaps scaled according to the database size if we want
to get clever?).
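A minimal sketch of such a pre-BLAST check follows; the limit values, and checking only the query count and total residue length, are illustrative choices rather than anything from an actual wrapper.

```python
# Hypothetical pre-BLAST check: scan the query FASTA, count records and
# total residues, and abort with a non-zero exit status if either
# exceeds a limit.
import sys

def check_fasta_size(path, max_queries=1000, max_residues=10000000):
    queries = residues = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                queries += 1  # each FASTA header starts a new query
            else:
                residues += len(line.strip())
    if queries > max_queries or residues > max_residues:
        sys.exit("Query file too large: %d sequences, %d residues"
                 % (queries, residues))
    return queries, residues
```

Scaling the limits by database size, as suggested above, would just mean computing `max_residues` from the chosen database before calling the check.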

I was hoping there was a more general mechanism in Galaxy -
after all, BLAST is by no means the only computationally
expensive tool ;)

We have had query files of 20,000 and more genes against NR
(both BLASTP and BLASTX), but our Galaxy has task-splitting
enabled so this becomes 20 (or more) individual cluster jobs
of 1000 queries each. This works fine apart from the occasional
glitch with the network drive when the data is merged afterwards.
(We know this failed once shortly after the underlying storage
had been expanded, and would have been under heavy load
rebalancing the data across the new disks.)

> Not sure about how to proceed with the second use case - extending the
> .loc file should work locally - I am not sure it is feasible within
> the context of the existing tool shed tools, data managers, etc. You
> could certainly duplicate this stuff with your modifications - this
> has downsides in terms of interoperability though.

Currently the BLAST wrappers use the *.loc files directly, but
this is likely to switch to the newer "Data Manager" approach.
That may or may not complicate local modifications like adding
extra columns...

> Sorry I don't have great answers for either question,
> -John

Thanks John,

Peter

Re: Per-tool configuration

Jan Kanis
Too bad there aren't any really good options. I will use the environment variable approach for the query size limit. For the gene bank links I guess modifying the .loc file is the least bad way. Maybe it can be merged into galaxy_blast; that would at least solve the interoperability problems.

@Peter: One potential problem in merging my blast2html tool could be that I have written it in Python 3, and the current tool wrapper therefore installs Python 3 and a host of its dependencies, making for quite a large download.

Jan


On 16 June 2014 09:08, Peter Cock <[hidden email]> wrote:
On Mon, Jun 16, 2014 at 4:18 AM, John Chilton <[hidden email]> wrote:
> Hello Jan,
>
> Thanks for the clarification. Not quite what I was expecting so I am
> glad I asked - I don't have great answers for either case so hopefully
> other people will have some ideas.
>
> For the first use case - I would just specify some default input to
> supply to the input wrapper - let's call this N - add a parameter to
> the tool wrapper "--limit-size=N" - test that, and then allow it to be
> overridden via an environment variable - so in your command block use
> "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit
> is set, but deployers can set limits. There are a number of ways to
> set such variables - DRM-specific environment files, login rc files,
> etc. Just this last release I added the ability to define
> environment variables right in job_conf.xml
> (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specification-of-environment/diff).
> I thought the tool shed might have a way to collect such definitions
> as well and insert them into package files - but Google failed to find
> this for me.

Hmm. Jan emailed me off list earlier about this. We could insert
a pre-BLAST script to check the size of the query FASTA file,
and abort if it is too large (e.g. number of queries, total sequence
length, perhaps scaled according to the database size if we want
to get clever?).

I was hoping there was a more general mechanism in Galaxy -
after all, BLAST is by no means the only computationally
expensive tool ;)

We have had query files of 20,000 and more genes against NR
(both BLASTP and BLASTX), but our Galaxy has task-splitting
enabled so this becomes 20 (or more) individual cluster jobs
of 1000 queries each. This works fine apart from the occasional
glitch with the network drive when the data is merged afterwards.
(We know this failed once shortly after the underlying storage
had been expanded, and would have been under heavy load
rebalancing the data across the new disks.)

> Not sure about how to proceed with the second use case - extending the
> .loc file should work locally - I am not sure it is feasible within
> the context of the existing tool shed tools, data managers, etc. You
> could certainly duplicate this stuff with your modifications - this
> has downsides in terms of interoperability though.

Currently the BLAST wrappers use the *.loc files directly, but
this is likely to switch to the newer "Data Manager" approach.
That may or may not complicate local modifications like adding
extra columns...

> Sorry I don't have great answers for either question,
> -John

Thanks John,

Peter



Re: Per-tool configuration

Peter Cock
On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <[hidden email]> wrote:
> Too bad there aren't any really good options. I will use the environment
> variable approach for the query size limit.

Are you using the optional job splitting (parallelism) feature in Galaxy?
That seems to me to be a good place to insert a Galaxy level
job size limit. e.g. BLAST+ jobs are split into 1000 query chunks,
so you might wish to impose a 25 chunk limit?

Long term being able to set limits on the input file parameters
of each tool would be nicer - e.g. Limit BLASTN to at most
20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.

> For the gene bank links I guess modifying the .loc file is the least
> bad way. Maybe it can be merged into galaxy_blast, that would at
> least solve the interoperability problems.

It would have to be sufficiently general, and backward compatible.

FYI other people have also looked at extending the blast *.loc
files (e.g. adding a category column for helping filter down a
very large BLAST database list).

> @Peter: One potential problem in merging my blast2html tool
> could be that I have written it in python3, and the current tool
> wrapper therefore installs python3 and a host of its dependencies,
> making for a quite large download.

Without seeing your code, it is hard to say, but actually writing
Python code which works unmodified under Python 2.7 and
Python 3 is quite doable (and under Python 2.6 with a few
more provisos). Both NumPy and Biopython do this if you
wanted some reassurance.

On the other hand, Galaxy itself will need to move to Python 3
at some point, and certainly individual tools will too. This will
probably mean (as with Linux Python packages) having double
entries on the ToolShed (one for Python 2, one for Python 3),

e.g. a ToolShed package for NumPy under Python 2 (done)
and under Python 3 (needed).

Peter

Re: Per-tool configuration

John Chilton-4
On Tue, Jun 17, 2014 at 2:55 PM, Peter Cock <[hidden email]> wrote:

> On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <[hidden email]> wrote:
>> Too bad there aren't any really good options. I will use the environment
>> variable approach for the query size limit.
>
> Are you using the optional job splitting (parallelism) feature in Galaxy?
> That seems to me to be a good place to insert a Galaxy level
> job size limit. e.g. BLAST+ jobs are split into 1000 query chunks,
> so you might wish to impose a 25 chunk limit?
>
> Long term being able to set limits on the input file parameters
> of each tool would be nicer - e.g. Limit BLASTN to at most
> 20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.

Trello card created, please vote!

https://trello.com/c/0XQXVhRz

>
>> For the gene bank links I guess modifying the .loc file is the least
>> bad way. Maybe it can be merged into galaxy_blast, that would at
>> least solve the interoperability problems.
>
> It would have to be sufficiently general, and backward compatible.
>
> FYI other people have also looked at extending the blast *.loc
> files (e.g. adding a category column for helping filter down a
> very large BLAST database list).
>
>> @Peter: One potential problem in merging my blast2html tool
>> could be that I have written it in python3, and the current tool
>> wrapper therefore installs python3 and a host of its dependencies,
>> making for a quite large download.
>
> Without seeing your code, it is hard to say, but actually writing
> Python code which works unmodified under Python 2.7 and
> Python 3 is quite doable (and under Python 2.6 with a few
> more provisos). Both NumPy and Biopython do this if you
> wanted some reassurance.
>
> On the other hand, Galaxy itself will need to move to Python 3
> at some point, and certainly individual tools will too. This will
> probably mean (as with Linux Python packages) having double
> entries on the ToolShed (one for Python 2, one for Python 3),

I certainly hope Galaxy can move to Python 3 at some point... being a
pessimist though I would place bets against it :).

>
> e.g ToolShed package for NumPy under Python 2 (done)
> and under Python 3 (needed).
>
> Peter

Re: Per-tool configuration

Jan Kanis
In reply to this post by Peter Cock
I am not using job splitting, because I am implementing this for a client with a small (one-machine) Galaxy setup.

Implementing a query limit feature in Galaxy core would probably be the best idea, but that would also probably require an admin screen to edit those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.

I gave a quick try earlier at making the blast2html tool run in both Python 2.6 and 3, but I gave up due to too many encoding issues. The client's machine has Python 2.6. Maybe I should have another look.

Jan


On 17 June 2014 21:55, Peter Cock <[hidden email]> wrote:
On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <[hidden email]> wrote:
> Too bad there aren't any really good options. I will use the environment
> variable approach for the query size limit.

Are you using the optional job splitting (parallelism) feature in Galaxy?
That seems to me to be a good place to insert a Galaxy level
job size limit. e.g. BLAST+ jobs are split into 1000 query chunks,
so you might wish to impose a 25 chunk limit?

Long term being able to set limits on the input file parameters
of each tool would be nicer - e.g. Limit BLASTN to at most
20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.

> For the gene bank links I guess modifying the .loc file is the least
> bad way. Maybe it can be merged into galaxy_blast, that would at
> least solve the interoperability problems.

It would have to be sufficiently general, and backward compatible.

FYI other people have also looked at extending the blast *.loc
files (e.g. adding a category column for helping filter down a
very large BLAST database list).

> @Peter: One potential problem in merging my blast2html tool
> could be that I have written it in python3, and the current tool
> wrapper therefore installs python3 and a host of its dependencies,
> making for a quite large download.

Without seeing your code, it is hard to say, but actually writing
Python code which works unmodified under Python 2.7 and
Python 3 is quite doable (and under Python 2.6 with a few
more provisos). Both NumPy and Biopython do this if you
wanted some reassurance.

On the other hand, Galaxy itself will need to move to Python 3
at some point, and certainly individual tools will too. This will
probably mean (as with Linux Python packages) having double
entries on the ToolShed (one for Python 2, one for Python 3),

e.g. a ToolShed package for NumPy under Python 2 (done)
and under Python 3 (needed).

Peter



Re: Per-tool configuration

Peter Cock
On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <[hidden email]> wrote:
> I am not using job splitting, because I am implementing this for a client
> with a small (one machine) galaxy setup.

Ah - this also explains why a job size limit is important for you.

> Implementing a query limit feature in galaxy core would probably be the best
> idea, but that would also probably require an admin screen to edit those
> limits, and I don't think I can sell the required time to my boss under the
> contract we have with the client.

The wrapper script idea I outlined to you earlier would be the least
invasive (although might cause trouble if BLAST is run at the command
line outside Galaxy), while your idea of inserting the check script into
the Galaxy Tool XML just before running BLAST itself should also
work well.

> I gave a quick try before on making the blast2html tool run in both python
> 2.6 and 3, but I gave up due to too many encoding issues. The client's
> machine has python 2.6. Maybe I should have another look.
>
> Jan

It gets easier with practice - a mixture of little syntax things, and
the big pain about bytes versus unicode (and thus encodings,
and raw versus text mode for file handles).
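For example, a pattern that sidesteps most of those encoding pitfalls on both Python 2.6+ and Python 3 is `io.open` with an explicit encoding (the file paths here are illustrative):

```python
# io.open behaves like Python 3's built-in open() even on Python 2:
# text mode with an explicit encoding yields unicode text on both,
# avoiding implicit-decoding surprises.
from __future__ import unicode_literals
import io

def read_text(path):
    # Text mode + explicit encoding: returns unicode on Python 2 and 3.
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()

def write_text(path, text):
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

Binary data would instead use mode "rb"/"wb" and explicit encode/decode calls at the boundaries.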

Peter

Re: Per-tool configuration

Peter Cock
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <[hidden email]> wrote:

> On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <[hidden email]> wrote:
>> I am not using job splitting, because I am implementing this for a client
>> with a small (one machine) galaxy setup.
>
> Ah - this also explains why a job size limit is important for you.
>
>> Implementing a query limit feature in galaxy core would probably be the best
>> idea, but that would also probably require an admin screen to edit those
>> limits, and I don't think I can sell the required time to my boss under the
>> contract we have with the client.
>
> The wrapper script idea I outlined to you earlier would be the least
> invasive (although might cause trouble if BLAST is run at the command
> line outside Galaxy), while your idea of inserting the check script into
> the Galaxy Tool XML just before running BLAST itself should also
> work well.

While looking at Jan's pull request to insert a query size limit before
running BLAST (https://github.com/peterjc/galaxy_blast/pull/43),
I realised that this will not work so well if job-splitting is enabled.

If using the job-splitting parallelism setting in Galaxy, then the BLAST
query FASTA file is broken up into chunks of 1000 sequences. This
means the new check would be made at the chunk level - so it could
in effect catch extremely long query sequences (e.g. chromosomes),
but could not block anyone submitting one query FASTA file containing
many thousands of moderate length query sequences (e.g. genes).

John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
Generic infrastructure to let deployers specify limits for tools based
on input metadata (number of sequences, file size, etc...)

Would it be fair to say this is not likely to be implemented in the near
future? i.e. Should we consider implementing the BLAST query limit
approach as a short term hack?

Thanks,

Peter

Re: Per-tool configuration

John Chilton-4
On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <[hidden email]> wrote:

> On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <[hidden email]> wrote:
>> On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <[hidden email]> wrote:
>>> I am not using job splitting, because I am implementing this for a client
>>> with a small (one machine) galaxy setup.
>>
>> Ah - this also explains why a job size limit is important for you.
>>
>>> Implementing a query limit feature in galaxy core would probably be the best
>>> idea, but that would also probably require an admin screen to edit those
>>> limits, and I don't think I can sell the required time to my boss under the
>>> contract we have with the client.
>>
>> The wrapper script idea I outlined to you earlier would be the least
>> invasive (although might cause trouble if BLAST is run at the command
>> line outside Galaxy), while your idea of inserting the check script into
>> the Galaxy Tool XML just before running BLAST itself should also
>> work well.
>
> While looking at Jan's pull request to insert a query size limit before
> running BLAST (https://github.com/peterjc/galaxy_blast/pull/43),
> I realised that this will not work so well if job-splitting is enabled.
>
> If using the job-splitting parallelism setting in Galaxy, then the BLAST
> query FASTA file is broken up into chunks of 1000 sequences. This
> means the new check would be made at the chunk level - so it could
> in effect catch extremely long query sequences (e.g. chromosomes),
> but could not block anyone submitting one query FASTA file containing
> many thousands of moderate length query sequences (e.g. genes).
>
> John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
> Generic infrastructure to let deployers specify limits for tools based
> on input metadata (number of sequences, file size, etc...)
>
> Would it be fair to say this is not likely to be implemented in the near
> future? i.e. Should we consider implementing the BLAST query limit
> approach as a short term hack?

It would be good functionality - but I don't foresee myself or anyone
on the core team getting to it in the next six months say.

...

I am now angry with myself though because I realized that dynamic job
destinations are a better way to implement this in the meantime (that
environment stuff was very fresh when I responded so I think I just
jumped there). You can build a flexible infrastructure locally that is
largely decoupled from the tools and that may (?) work around the task
splitting problem Peter brought up.

Outline of the idea:

Create a Python script - say lib/galaxy/jobs/mapper_limits.py - and add
some functions to it like:

------------------
# Helper utilities for limiting tool inputs.
from galaxy.jobs.mapper import JobMappingException

DEFAULT_QUERY_LIMIT_MESSAGE = ("Size of input exceeds query limit of "
                               "this Galaxy instance.")

def assert_fewer_than_n_sequences(input_path, n,
                                  msg=DEFAULT_QUERY_LIMIT_MESSAGE):
    # One way to compute num_sequences: count FASTA ">" header lines.
    with open(input_path) as handle:
        num_sequences = sum(1 for line in handle if line.startswith(">"))
    if num_sequences > n:
        raise JobMappingException(msg)

# Do the same for other checks...
------------------

This is an abstract file that has nothing to do with the institution
or toolbox really. Once you get it working - open a pull request and
we can probably get this integrated into Galaxy (as long as it is
abstract enough). Then deployers can create specific rules for that
particular cluster and toolbox:

Create lib/galaxy/jobs/runners/rules/instance_dests.py

------------------
from galaxy.jobs import mapper_limits

def limited_blast(job, app):
    inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
    query_file = inp_data[ "query" ].file_name
    mapper_limits.assert_fewer_than_n_sequences( query_file, 300 )
    return app.job_config.get_destination( "blast_base" )
------------------

Then open job_conf.xml and add the correct destinations...

<job_conf>
   ...
  <destinations>
     ...
    <destination id="limited_blast" runner="dynamic">
      <param id="function">limited_blast</param>
    </destination>
    <destination id="blast_base" runner="torque"> <!-- or whatever -->
      ....
    </destination>
  </destinations>
  <tools>
    <tool id="ncbi_blastn_wrapper" destination="limited_blast" />
    <tool id="ncbi_blastp_wrapper" destination="limited_blast" />
     ...
   </tools>
</job_conf>

Jan, I am really sorry I didn't come up with this before you did all
that work. Hopefully what you did for "limit_query_size.py" can be
reused in this context.

-John

>
> Thanks,
>
> Peter

Re: Per-tool configuration

Peter Cock
On Fri, Jun 27, 2014 at 3:13 PM, John Chilton <[hidden email]> wrote:

> On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <[hidden email]> wrote:
>> On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <[hidden email]> wrote:
>>
>> John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
>> Generic infrastructure to let deployers specify limits for tools based
>> on input metadata (number of sequences, file size, etc...)
>>
>> Would it be fair to say this is not likely to be implemented in the near
>> future? i.e. Should we consider implementing the BLAST query limit
>> approach as a short term hack?
>
> It would be good functionality - but I don't foresee myself or anyone
> on the core team getting to it in the next six months say.
>
> ...
>
> I am now angry with myself though because I realized that dynamic job
> destinations are a better way to implement this in the meantime (that
> environment stuff was very fresh when I responded so I think I just
> jumped there). You can build a flexible infrastructure locally that is
> largely decoupled from the tools and that may (?) work around the task
> splitting problem Peter brought up.
>
> Outline of the idea:
> <snip>

Hi John,

So the idea is to define a dynamic job mapper which checks the
query input size, and if too big raises an error, and otherwise
passes the job to the configured job handler (e.g. SGE cluster).

See https://wiki.galaxyproject.org/Admin/Config/Jobs

It sounds like this ought to be possible right now, but you are
suggesting since this seems quite a general use case, the
code to help build a dynamic mapper using things like file
size (in bytes or number of sequences) could be added to
Galaxy?

This approach would need the Galaxy Admin to setup a custom
job mapper for BLAST (which knows to look at the query file),
but it taps into an existing Galaxy framework. By providing a
reference implementation this ought to be fairly easy to setup,
and can be extended to be more clever about the limits.

e.g. For BLAST, we should consider both the number (and
length) of the queries, plus the size of the database.
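A rule taking both factors into account could scale the permitted number of queries by the database size. The sketch below is purely illustrative: the helper name, thresholds, and scaling factors are invented, not anything Galaxy provides.

```python
import os


def max_queries_for_database(db_path, base_limit=1000):
    # Shrink the query allowance as the BLAST database grows
    # (thresholds here are examples; real cut-offs would be tuned per cluster).
    db_bytes = os.path.getsize(db_path)
    if db_bytes > 10 * 1024 ** 3:   # > 10 GB database: much stricter
        return base_limit // 100
    if db_bytes > 1024 ** 3:        # > 1 GB database
        return base_limit // 10
    return base_limit
```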

Regards,

Peter

Re: Per-tool configuration

John Chilton-4
On Fri, Jun 27, 2014 at 9:30 AM, Peter Cock <[hidden email]> wrote:

> On Fri, Jun 27, 2014 at 3:13 PM, John Chilton <[hidden email]> wrote:
>> On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <[hidden email]> wrote:
>>> On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <[hidden email]> wrote:
>>>
>>> John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
>>> Generic infrastructure to let deployers specify limits for tools based
>>> on input metadata (number of sequences, file size, etc...)
>>>
>>> Would it be fair to say this is not likely to be implemented in the near
>>> future? i.e. Should we consider implementing the BLAST query limit
>>> approach as a short term hack?
>>
>> It would be good functionality - but I don't foresee myself or anyone
>> on the core team getting to it in the next six months say.
>>
>> ...
>>
>> I am now angry with myself though because I realized that dynamic job
>> destinations are a better way to implement this in the meantime (that
>> environment stuff was very fresh when I responded so I think I just
>> jumped there). You can build a flexible infrastructure locally that is
>> largely decoupled from the tools and that may (?) work around the task
>> splitting problem Peter brought up.
>>
>> Outline of the idea:
>> <snip>
>
> Hi John,
>
> So the idea is to define a dynamic job mapper which checks the
> query input size, and if too big raises an error, and otherwise
> passes the job to the configured job handler (e.g. SGE cluster).
>
> See https://wiki.galaxyproject.org/Admin/Config/Jobs
>
> It sounds like this ought to be possible right now, but you are
> suggesting since this seems quite a general use case, the
> code to help build a dynamic mapper using things like file
> size (in bytes or number of sequences) could be added to
> Galaxy?

Yes, it is possible right now, and everything could just be stuck right
in the rule file itself. I was just suggesting that sharing some of the
helpers with the community might ease the process for future
deployers.

>
> This approach would need the Galaxy Admin to setup a custom
> job mapper for BLAST (which knows to look at the query file),
> but it taps into an existing Galaxy framework. By providing a
> reference implementation this ought to be fairly easy to setup,
> and can be extended to be more clever about the limits.

Yes. As you mention, this can be much more expressive than an XML-based
fixed set of limit types. In addition to static limits, you could
combine inputs as you mentioned, allow local users of the public
resource to run as much as they want, allow larger jobs on the weekend
when things are slow, etc. I recently added a high-level utility for
looking at job metrics in these rules, so you could restrict or expand
the limit based on how many jobs the user has run in the last month or
how many core hours they have consumed, etc.

https://bitbucket.org/galaxy/galaxy-central/commits/9a905e98e1550314cf821a99c2adc1b00a4eed83
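The time-based relaxations mentioned above need nothing more than ordinary Python in the rule file. A minimal sketch (the weekend factor and base limit are invented examples, not Galaxy defaults):

```python
import datetime


def effective_query_limit(base_limit=300, today=None):
    # Allow larger jobs at the weekend, when the cluster is typically quiet.
    today = today or datetime.date.today()
    if today.weekday() >= 5:  # Saturday == 5, Sunday == 6
        return base_limit * 4
    return base_limit
```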

>
> e.g. For BLAST, we should consider both the number (and
> length) of the queries, plus the size of the database.

Thanks for clarifying and providing some context to my (in retrospect)
seemingly random Python scripts :).

>
> Regards,
>
> Peter