Relabeling dataset pairs in 'list:paired' collection

Peter Briggs
Dear Developers

Is there an existing tool or mechanism that can be used to duplicate a
"list of pairs" dataset collection, keeping the paired datasets the same
but relabeling each pair with a new identifier taken from a user
supplied file or list?

I've cobbled together my own tool to try and do something like this:

https://github.com/pjbriggs/Amplicon_analysis-galaxy/blob/77340d8bb2470a646deba4933625413fc70985d1/relabel_samples.xml

and while it works, it doesn't feel like a good solution as it creates
duplicates of the datasets from the first collection and consumes
additional disk/quota space unnecessarily. (This is particularly
undesirable as we expect that the input collections might be relatively
large numbers of FASTQ pairs e.g. 30 or more.)

Looking at some of the 'Collection Operations' tools that come with
Galaxy, it appears that these are able to create new collections without
making duplicate datasets, which seems much better. But these tools work
by directly invoking Python classes from the Galaxy core, so I don't
know if a similar approach could be used in a non-core tool.
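
To make the idea concrete, here is a minimal Python sketch of relabeling
without duplication. It is only an illustration and does not use Galaxy's
internal classes: `Dataset`, `Pair` and `relabel_pairs` are hypothetical
names invented for this example. The point it shows is that the new
collection holds new identifiers but refers to the same dataset objects,
so no file content is copied.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a stored dataset (not a Galaxy class)."""
    path: str


@dataclass(frozen=True)
class Pair:
    """One element of a 'list:paired' collection: a label plus two references."""
    identifier: str
    forward: Dataset
    reverse: Dataset


def relabel_pairs(pairs: List[Pair], new_labels: Dict[str, str]) -> List[Pair]:
    """Build a new list of pairs with new identifiers, reusing the Dataset objects."""
    relabeled = []
    for pair in pairs:
        if pair.identifier not in new_labels:
            raise KeyError(f"no new label supplied for {pair.identifier!r}")
        # Only the identifier changes; forward/reverse are the same objects.
        relabeled.append(Pair(new_labels[pair.identifier], pair.forward, pair.reverse))
    return relabeled


original = [
    Pair("S1", Dataset("S1_R1.fastq"), Dataset("S1_R2.fastq")),
    Pair("S2", Dataset("S2_R1.fastq"), Dataset("S2_R2.fastq")),
]
renamed = relabel_pairs(original, {"S1": "control", "S2": "treated"})
```

In this sketch `renamed[0].forward is original[0].forward` holds, which is
the property the built-in collection operations appear to provide and that
a tool writing new output files cannot.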

Any advice or suggestions are very welcome! Thanks

Best wishes

Peter

--
Peter Briggs [hidden email]
Bioinformatics Core Facility University of Manchester
B.1083 Michael Smith Bldg Tel: (0161) 2751482
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Relabeling dataset pairs in 'list:paired' collection

John Chilton-4
Thanks for your interest in this topic. The collection operations are
distributed with the core framework because they can't be expressed as
normal tools, and they rely on abstractions that I don't consider
public at this time (or really have any confidence in making public).
For this reason I think it would be best to just implement a collection
operation inside the core framework. Philip Mabon created such an
operation in https://github.com/galaxyproject/galaxy/pull/2771 and it
was merged.

I think the kinds of things that would benefit from collection
operations tend to be expressible in fairly generic terms, so they
don't apply to just one field. I think your relabelling example is
exactly such a case. I have opened a PR with the start of such a tool,
which I hope will address your use case:
https://github.com/galaxyproject/galaxy/pull/3603. I'm calling it
relabelling from a file because I'd also like to implement
relabelling with a function someday, but there has been some pushback
on extending these collection operations to consume small functions.
Even if that were implemented, though, the use case of reading from a
file is pretty different and frankly probably more practical for
typical workflows that might make use of these.
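
The two variants mentioned above (labels read from a file versus labels
computed by a small function) can be contrasted with a short Python
sketch. This is not the implementation in the PR: the function names are
hypothetical, and the two-column tab-separated "old, new" file layout is
an assumption made purely for illustration.

```python
from typing import Callable, Dict, Iterable, List


def read_label_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Parse assumed 'old<TAB>new' lines into a mapping, skipping blank lines."""
    mapping: Dict[str, str] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 tab-separated fields")
        old, new = (field.strip() for field in fields)
        if old in mapping:
            raise ValueError(f"line {number}: duplicate identifier {old!r}")
        mapping[old] = new
    return mapping


def relabel_from_mapping(identifiers: List[str], mapping: Dict[str, str]) -> List[str]:
    """Look up each identifier; fail if one is missing or new labels collide."""
    missing = [name for name in identifiers if name not in mapping]
    if missing:
        raise KeyError(f"no new label for: {missing}")
    new_labels = [mapping[name] for name in identifiers]
    if len(set(new_labels)) != len(new_labels):
        raise ValueError("new labels are not unique")
    return new_labels


def relabel_with_function(identifiers: List[str], rename: Callable[[str], str]) -> List[str]:
    """Compute each new label from the old one instead of reading it from a file."""
    return [rename(name) for name in identifiers]


file_lines = ["S1_L001\tcontrol\n", "S2_L001\ttreated\n", "\n"]
from_file = relabel_from_mapping(["S1_L001", "S2_L001"], read_label_mapping(file_lines))
from_function = relabel_with_function(["S1_L001", "S2_L001"], lambda name: name.split("_")[0])
```

The file-driven variant can assign arbitrary labels such as sample names
from a spreadsheet, while the function variant can only derive labels
from the existing identifiers.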

Perhaps we can continue this conversation on #3603.

Thanks again,
-John

On Mon, Feb 13, 2017 at 11:11 AM, Peter Briggs
<[hidden email]> wrote:

> Dear Developers
>
> Is there an existing tool or mechanism that can be used to duplicate a "list
> of pairs" dataset collection, keeping the paired datasets the same but
> relabeling each pair with a new identifier taken from a user supplied file
> or list?

Re: Relabeling dataset pairs in 'list:paired' collection

Peter Briggs
Hello John

Thanks for your extensive and detailed reply, and also for making the
new tool. This approach (implementing the operation within the
framework) looks vastly better than my hack for generating a new
collection, and should address the current use case that I have.

(I also agree that collection operations should be as generic as
possible.)

I'll add any comments to the #3603 PR; however, between you and Marius
I think everything I'd thought of so far (and more) is already covered
there.

Thanks again for looking into this!

Best wishes

Peter

On 13/02/17 19:43, John Chilton wrote:

> Thanks for your interest in this topic. The collection operations
> exist the way they do as tools distributed with the core framework
> because they can't be expressed as normal tools and they utilize
> abstractions that I don't consider public at this time (or really have
> any confidence in making public). For this reason I think it would be
> best to just implement a collection operation inside of the core
> framework - Philip Mabon created such an operation as part of this
> https://github.com/galaxyproject/galaxy/pull/2771 and it was merged.

--
Peter Briggs [hidden email]
Bioinformatics Core Facility University of Manchester
B.1083 Michael Smith Bldg Tel: (0161) 2751482