Looking for recommendations: How to run galaxy workflows in batch

Looking for recommendations: How to run galaxy workflows in batch

Dave Lin
Hi All,

I'm looking to batch process 40 large data sets with the same galaxy workflow.

This obviously can be done in a brute-force manual manner. 

However, is there a better way to schedule/invoke these jobs in batch, for example:

1) from the UI with a plugin
2) command-line
3) web-service

Thanks in advance for any pointers.
Dave


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: Looking for recommendations: How to run galaxy workflows in batch

Dannon Baker
Hi Dave,

Yes, Galaxy's standard run-workflow dialog lets you select multiple datasets as input for a single "Input Dataset" step.  To do this, click the icon referenced by the tooltip in the screenshot below to select multiple files.  All parameters remain static between executions except for that single input dataset, which changes for each run.  Note that only one input dataset can be set to multiple files in this fashion.

-Dannon

<PastedGraphic-2.png>

On Feb 6, 2012, at 4:18 PM, Dave Lin wrote:

> I'm looking to batch process 40 large data sets with the same galaxy workflow.

Re: Looking for recommendations: How to run galaxy workflows in batch

Dave Lin

Thank you Dannon. That is helpful.

 
What if I need to specify multiple inputs per run (i.e. .csfasta + .qual file)?

-Dave

On Mon, Feb 6, 2012 at 1:27 PM, Dannon Baker <[hidden email]> wrote:
> Yes, galaxy's standard run-workflow dialog has a feature where you can select multiple datasets as input for a single "Input Dataset" step.

Re: Looking for recommendations: How to run galaxy workflows in batch

Dannon Baker
This method only works for single inputs at the moment, though eventually it'd be nice to allow pairing.  Another option for you would be to use the workflows API, with which you can definitely specify multiple inputs.  See workflow_execute.py in the scripts/api folder of your galaxy installation for one method of doing this.
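
As a rough illustration of the API route, here is a minimal Python sketch that only builds one request body per run, with two inputs (for example a .csfasta and a .qual dataset) mapped to two input steps. The payload keys (`workflow_id`, `history`, `ds_map`, `src`, `id`) are an assumption based on what workflow_execute.py appeared to send at the time, and the step numbers and dataset ids are made-up placeholders. Check the script shipped with your own installation before relying on any of this.

```python
import json


def build_workflow_payload(workflow_id, history_name, step_inputs):
    """Build the request body for a single workflow run.

    step_inputs maps an input step id to a (source, dataset_id) tuple,
    e.g. {"38": ("ldda", "abc"), "39": ("ldda", "def")}.
    The key names used here are assumptions, not a confirmed API contract.
    """
    ds_map = {}
    for step_id, (src, dataset_id) in step_inputs.items():
        ds_map[str(step_id)] = {"src": src, "id": dataset_id}
    return {"workflow_id": workflow_id, "history": history_name, "ds_map": ds_map}


def build_batch(workflow_id, runs):
    """Return one payload per run; each run is (history_name, step_inputs)."""
    return [build_workflow_payload(workflow_id, name, inputs) for name, inputs in runs]


if __name__ == "__main__":
    # Hypothetical step ids and dataset ids, for illustration only.
    runs = [
        ("sample_01", {"38": ("ldda", "csfasta_id_01"), "39": ("ldda", "qual_id_01")}),
        ("sample_02", {"38": ("ldda", "csfasta_id_02"), "39": ("ldda", "qual_id_02")}),
    ]
    for payload in build_batch("WORKFLOW_ID", runs):
        # Each printed body would be submitted to the workflows API in turn.
        print(json.dumps(payload, sort_keys=True))
```

Keeping the payload construction separate from the submission makes it easy to inspect all 40 request bodies before anything is actually sent.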

-Dannon


On Feb 6, 2012, at 4:53 PM, Dave Lin wrote:

> Thank you Dannon. That is helpful.
>
>  
> What if I need to specify multiple inputs per run (i.e. .csfasta + .qual file)?
>
> -Dave

Re: Looking for recommendations: How to run galaxy workflows in batch

Louise-Amélie Schmitt
Hello Dannon

Could it be possible to have the input dataset's display name appended
to the new history's name instead of plain numbers when the "Send
results in a new history" option is checked?

This new feature is indeed veeeeery useful (thanks a million for it) but
the numbered suffixes make it hard to track which new history belongs to
which dataset.

Thanks,
L-A


On 06/02/2012 23:00, Dannon Baker wrote:

> This method only works for single inputs at the moment, though eventually it'd be nice to allow pairing.  Another option for you would be to use the workflows API, with which you can definitely specify multiple inputs.  See workflow_execute.py in the scripts/api folder of your galaxy installation for one method of doing this.
>
> -Dannon

Re: Looking for recommendations: How to run galaxy workflows in batch

Dannon Baker
Thanks for the suggestion, I like that!  I'll make the change shortly.

-Dannon

On Feb 7, 2012, at 8:03 AM, Louise-Amélie Schmitt wrote:

> Hello Dannon
>
> Could it be possible to have the input dataset's display name appended to the new history's name instead of plain numbers when the "Send results in a new history" option is checked?
>
> This new feature is indeed veeeeery useful (thanks a million for it) but the numbered suffixes make it hard to track what new history belongs to which dataset.
>
> Thanks,
> L-A

Re: Looking for recommendations: How to run galaxy workflows in batch

Bernd Jagla
In reply to this post by Dannon Baker
Dannon Baker <dannonbaker@...> writes:

> Hi Dave,
>
> Yes, galaxy's standard run-workflow dialog has a feature where you can select multiple datasets as input for a single "Input Dataset" step.  To do this, click the icon referenced by the tooltip in the screenshot below to select multiple files.  All parameters remain static between executions except for the single input dataset that gets modified for each run, and that only one input dataset can be set to multiple files in this fashion.
>
> -Dannon

Dannon,

what if I don't have this icon??? How can I enable this? Where is this
documented?

Thanks,

Bernd



Re: Looking for recommendations: How to run galaxy workflows in batch

Dannon Baker
In your workflows, are you using "Input Dataset" steps?  Galaxy uses these steps to know how to map datasets onto the workflow, which is what makes special features like this possible.  If you're not currently using them, open the workflow editor, add Input Dataset steps (they're at the very bottom of the tool list), and connect them to the tool inputs at the highest level of the workflow.  You'll then see the multiple-dataset icon the next time you go to run it.
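
If you want to check this outside the editor, a small Python sketch like the following could list the Input Dataset steps in an exported workflow file. It assumes the exported JSON has a top-level "steps" mapping and that Input Dataset steps carry "type": "data_input". Both are assumptions about the export format, so compare against one of your own exported workflows first.

```python
import json


def input_dataset_steps(workflow):
    """Return the names of steps marked as input datasets, in step order.

    Assumes workflow["steps"] is a dict keyed by numeric strings and that
    input dataset steps have type "data_input" (an assumed convention).
    """
    steps = workflow.get("steps", {})
    ordered = sorted(steps.items(), key=lambda item: int(item[0]))
    return [step.get("name", "") for _, step in ordered
            if step.get("type") == "data_input"]


if __name__ == "__main__":
    # Tiny made-up workflow with one input step and one tool step.
    example = json.loads("""
    {
      "name": "example workflow",
      "steps": {
        "0": {"type": "data_input", "name": "Input dataset"},
        "1": {"type": "tool", "name": "Some tool"}
      }
    }
    """)
    names = input_dataset_steps(example)
    if names:
        print("Input Dataset steps:", names)
    else:
        print("No Input Dataset steps found; add them in the workflow editor.")
```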

-Dannon

On Jul 4, 2012, at 3:19 AM, Bernd Jagla wrote:

> Dannon,
>
> what if I don't have this icon??? How can I enable this? Where is this
> documented?
>
> Thanks,
>
> Bernd

Re: Looking for recommendations: How to run galaxy workflows in batch

thondeboer
In reply to this post by Bernd Jagla
But this only works if you have a single dataset (such as a BAM file) for each workflow run.
If you have pairs of files (such as paired-end FASTQ files, not an uncommon workflow nowadays :) ) you need to resort to using the API, since batch processing from the Galaxy UI does not support paired-end sequencing (yet?). You can run the workflow one pair at a time, but you have to choose the FASTQ pairs yourself.

I have written a fairly generic execution engine that I can share. It uses a config file to describe the files you need from the library in simple key:value pairs, and it can run the paired-end workflow on hundreds of FASTQ files. It's a little hacky and requires your FASTQ files to have consistent naming for the forward and reverse reads (_R1.fastq & _R2.fastq), but other than that it seems to do the job...
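
To give an idea of the pairing step only, here is a minimal Python sketch (not the engine itself) that groups file names by the _R1.fastq / _R2.fastq convention mentioned above. The function name and the returned structure are hypothetical, chosen just for this illustration.

```python
import os
from collections import defaultdict


def pair_fastq(filenames, fwd_suffix="_R1.fastq", rev_suffix="_R2.fastq"):
    """Group file names into (forward, reverse) pairs keyed by sample name.

    Returns (pairs, unpaired). Files matching neither suffix are ignored.
    """
    found = defaultdict(dict)
    for name in filenames:
        base = os.path.basename(name)
        if base.endswith(fwd_suffix):
            found[base[:-len(fwd_suffix)]]["fwd"] = name
        elif base.endswith(rev_suffix):
            found[base[:-len(rev_suffix)]]["rev"] = name

    pairs, unpaired = {}, []
    for sample, reads in sorted(found.items()):
        if "fwd" in reads and "rev" in reads:
            pairs[sample] = (reads["fwd"], reads["rev"])
        else:
            # Only one mate was found for this sample.
            unpaired.extend(reads.values())
    return pairs, unpaired


if __name__ == "__main__":
    files = ["s1_R1.fastq", "s1_R2.fastq", "s2_R1.fastq", "readme.txt"]
    pairs, unpaired = pair_fastq(files)
    print(pairs)
    print(unpaired)
```

Reporting the unpaired files separately makes it easier to spot a missing mate before submitting hundreds of runs.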

There is, however, a nasty bug in the API: it removes the files from your history if you use them through the API (I will post something on that later). It seems to work fine for data in the libraries, though...

Thon
Regards,

Thon de Boer, Ph.D.
Bioinformatics Guru
+1-650-799-6839




On Jul 4, 2012, at 12:19 AM, Bernd Jagla wrote:

> what if I don't have this icon??? How can I enable this? Where is this
> documented?