FastQC wrapper not seeing files at gzipped

ryang
Hi all - I've got a bunch of FASTQ files uploaded into a data library in Galaxy.  The underlying files are gzipped; however, Galaxy strips the .gz from the filename and displays each one as .fastq.  When the Python wrapper rgFastQC.py gets called, it correctly sees the fastq.gz file.  The wrapper creates a symbolic link to the .gz file in a tmp directory, but the link is named .fastq.  When FastQC tries to read this file, it fails because it's compressed.  So one of two things is going wrong here:

1)  It looks like the wrapper is incorrectly renaming the file, but it's using the name given to it in Galaxy.

2)  When the file is uploaded into the data library, Galaxy is stripping off the .gz extension.

I think #2 is the more likely problem.  How can I keep Galaxy from stripping the .gz extension?

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: FastQC wrapper not seeing files at gzipped

Peter Cock
Hi Ryan,

The problem isn't Galaxy stripping the extension, rather
Galaxy is actually decompressing the file as part of the
upload process.

Unfortunately (and there is an open Trello enhancement
request on this), Galaxy does not support storing any of
the defined datatypes in compressed form UNLESS they
are defined that way (like BAM files).

This has led some Galaxy Admins to define a new datatype
lgzippedfastq (or similar - I'd have to check my old emails
for the exact name used as a gzipped alternative to the
Galaxy sangerfastq datatype) and then modify many/all
their tools to handle this. That is a lot of work, but does
offer big disk savings for this key datatype.

The Galaxy team instead use a compressed file system,
so for usegalaxy.org ALL their data files are compressed
but Galaxy can ignore this complexity.

Peter

Re: FastQC wrapper not seeing files at gzipped

ryang
Galaxy is not decompressing the file.  The file is linked to on the filesystem. 

Re: FastQC wrapper not seeing files at gzipped

Peter Cock
Ah. Then this is more subtle... are you using the
library import option where Galaxy just symlinks
to existing files? I thought that was not possible
with gzipped files (for the reasons given in my earlier reply).
Perhaps this is not being blocked, leading to the
confused state you're seeing?

Peter

Re: FastQC wrapper not seeing files at gzipped

ryang
Yes, I'm using the "link to file on the file system" option when doing a library import.  Does this mean I should link to the uncompressed file?

Re: FastQC wrapper not seeing files at gzipped

ryang
To (I think) fix this, I changed line 50 in rgFastQC.py from

    infname = self.opts.inputfilename

to

    infname = self.opts.input

This will force FastQC to look at the "real" file and not the renamed dataset.
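
An alternative that keeps the friendly dataset name would be to check the file content and give the symlink a matching suffix. The following is only a rough sketch, not the actual rgFastQC.py code: the function names `is_gzipped` and `link_for_fastqc` are made up for illustration, and it assumes FastQC picks its decompression from the file name, which is what the failure described in this thread suggests.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def link_for_fastqc(real_path, display_name, workdir):
    """Symlink real_path into workdir under a name whose suffix matches its content.

    Hypothetical helper: real_path is the dataset file on disk, display_name is
    the name Galaxy shows (e.g. "reads.fastq"). Raises FileExistsError if the
    link name is already taken in workdir.
    """
    name = os.path.basename(display_name)
    # Restore the .gz suffix when the content is gzipped but the name hides it.
    if is_gzipped(real_path) and not name.endswith(".gz"):
        name += ".gz"
    link = os.path.join(workdir, name)
    os.symlink(os.path.abspath(real_path), link)
    return link
```

Called with the real dataset path and the Galaxy display name, this returns a link ending in .fastq.gz for a gzipped input and leaves the name unchanged for an uncompressed one.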


Re: FastQC wrapper not seeing files at gzipped

Peter Cock
In reply to ryang's earlier post about linking to the uncompressed file
Hi Ryan,

That is the workaround I am using, which means
keeping an uncompressed copy of the FASTQ
file on our main storage from where Galaxy can
see it (for people to use within their histories).

From a long-term storage perspective this is not
ideal - so I am keen for better handling of gzipped
files within Galaxy (particularly within libraries
which we use for raw data).

Peter

Re: FastQC wrapper not seeing files at gzipped

ryang
Agreed. 

Re: FastQC wrapper not seeing files at gzipped

John Chilton-4
Peter has already voted and if I recall correctly Ryan cannot access
Trello - so this might be a waste to bring up - but here is a Trello
card for voting on this issue and tracking progress
https://trello.com/c/3RkTDnIn.

To summarize previous discussion - this would be fantastic to have and
Galaxy needs this - but we solve this on usegalaxy.org by using a
compressed file system - a more elegant solution when it is a
possibility - so it has never been a tier one priority for the
devteam. The only update on this is that I don't think we are using a
compressed file system anymore so this might become an issue again
someday soon.

This would be non-trivial to implement - but I have always felt this
would be a fairly fun project to work on if anyone really tight on
space locally wants to try to tackle it :).

-John
