strange issue with .RData files

strange issue with .RData files

Lukasse, Pieter

Hi,

 

When I upload any .RData file to my Galaxy server it seems to be unpacking/changing it. The resulting file in my history is different and around 2x larger than the uploaded file. The tool that needs to use it also aborts with an error due to this erroneous file.

 

What are the workarounds?

 

Thanks,

 

Pieter Lukasse

Wageningen UR, Plant Research International

Department of Bioinformatics (Bioscience)

Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB,
Wageningen, the Netherlands

T: +31-317481122;
M: +31-628189540;
skype: pieter.lukasse.wur

http://www.pri.wur.nl

 


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: strange issue with .RData files

John Chilton-4
Hey Pieter,

  Sorry I am swamped right now so I don't have time to dig into this
in detail - but I have encountered this before with datatypes that are
compressed - zipped, gzipped, etc.... Galaxy will attempt to
decompress them in order to figure out what they are. I believe this
is what is happening to your data. If you register the type as a
sniffable binary it looks like it should skip the decompression though
- unless I am reading this logic wrong in tools/data_source/upload.py
https://gist.github.com/jmchilton/54b5d7485fcd16eec984.

E.g. like bam datatypes:

class Bam( Binary ):
   ....

Binary.register_sniffable_binary_format("bam", "bam", Bam)

Have you registered a sniffable binary datatype for RData?

-John



On Wed, Oct 22, 2014 at 9:38 AM, Lukasse, Pieter <[hidden email]> wrote:

> Hi,
>
>
>
> When I upload any .RData file to my Galaxy server it seems to be
> unpacking/changing it. The resulting file in my history is different and
> around 2x larger than the uploaded file. The tool that needs to use it also
> aborts with an error due to this erroneous file.
>
>
>
> What are the workarounds?

Re: strange issue with .RData files

Lukasse, Pieter
Hi John,

Yes, I think this should work as I have seen it work for another binary type I made before. See below:

import zipfile

from galaxy.datatypes.binary import Binary

class FileSet( Binary ):
    """FileSet containing N files"""
    file_ext = "prims.fileset.zip"
    blurb = "(zipped) FileSet containing multiple files"

    def sniff( self, filename ):
        # True only for a zip archive holding more than one file.
        # Anything that is not a readable zip is not a FileSet.
        if not zipfile.is_zipfile(filename):
            return False
        with zipfile.ZipFile(filename) as zf:
            return len(zf.infolist()) > 1

# the if is just for backwards compatibility...could remove this at some point
if hasattr(Binary, 'register_sniffable_binary_format'):
    Binary.register_sniffable_binary_format('FileSet', 'prims.fileset.zip', FileSet)


Now the question I have is: what would be good logic to use in the sniff method? I need something that uniquely distinguishes this zipped file from other zip files, right? In the example above I solved this by checking whether the zip file has multiple files inside and returning true if so. Now with RData, does it mean I have to try to parse the binary contents inside and come up with a good heuristic/rule? Just wondering if someone has already thought about such a rule, specifically for RData.

Thanks,

Pieter.


-----Original Message-----
From: John Chilton [mailto:[hidden email]]
Sent: Thursday, 23 October 2014 3:02
To: Lukasse, Pieter
Cc: [hidden email]
Subject: Re: [galaxy-dev] strange issue with .RData files


Re: strange issue with .RData files

Ross-2
Rdata is binary and serialises R objects so I sure hope you don't have to peek inside - probably needs most of an R environment - like rpy or something.
A binary header signature magic number would be ideal - I checked a few saved rdata files lying around here and all seemed to start with the following bytes - all were variable after the 12th:
1f 8b 08 00 00 00 00 00  00 03 d4 fd
Maybe someone else can confirm that as a reliable binary hex header signature for rdata? 
I couldn't find anything in the R docs - probably take a good deep dive into the guts of the save/load function source to be sure.
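
A note on those bytes: 1f 8b is the gzip magic number and the 08 that follows is the deflate method, so the prefix identifies a gzip stream rather than RData in particular (R's save() gzip-compresses binary saves by default). A minimal Python sketch, using a hypothetical helper name `has_gzip_magic`, suggests that any gzip file would pass the same prefix check:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip ID bytes followed by the deflate method


def has_gzip_magic(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(GZIP_MAGIC)) == GZIP_MAGIC


def demo():
    """Write a gzip file that is not RData and check its prefix."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with gzip.open(path, "wb") as out:
            out.write(b"plain text, not an R object\n")
        return has_gzip_magic(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

So the leading bytes alone are probably not enough to tell an RData file from any other gzipped upload; the bytes after the fixed prefix belong to the gzip header and compressed payload.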

On Thu, Oct 30, 2014 at 8:54 PM, Lukasse, Pieter <[hidden email]> wrote:

Re: strange issue with .RData files

Lukasse, Pieter

An alternative would be perhaps to look at the file name. If it ends in “.RData” then I could mark it as such. Being able to access the original file name (so not the internal one but the name that also appears in the history UI) would allow me to do this. What would be the way to access this from within my sniff method?

 

Thanks,


Pieter

 

 

 

 

From: Ross [mailto:[hidden email]]
Sent: Thursday, 30 October 2014 11:20
To: Lukasse, Pieter
Cc: John Chilton; [hidden email]
Subject: Re: [galaxy-dev] strange issue with .RData files

 


Re: strange issue with .RData files

Lukasse, Pieter

Hi Ross,

 

I also found the following information:

 

“save works by writing a single-line header (typically RDX2\n for a binary save: the only other current value is RDA2\n for save(ascii=TRUE)),”

From http://cran.r-project.org/doc/manuals/r-release/R-ints.html  (section 1.8 – Serialization Formats).

 

Regards,

 

Pieter.
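
Putting the pieces of this thread together, a sniffer could decompress the start of the file and compare it with that header line. The sketch below is untested against Galaxy itself: `sniff_rdata` and `_read_head` are hypothetical helper names, the magic values are only the two quoted from R-ints, and trying gzip, bzip2 and xz is an assumption based on the compression options of R's save().

```python
import bz2
import gzip
import lzma
import zlib

# Header lines quoted from R-ints; other R versions may write different ones.
RDATA_MAGIC = (b"RDX2\n", b"RDA2\n")


def _read_head(opener, filename, size=5):
    """Read the first bytes through `opener`; return b"" if that fails."""
    try:
        with opener(filename, "rb") as handle:
            return handle.read(size)
    except (OSError, EOFError, lzma.LZMAError, zlib.error):
        # Wrong container format, truncated stream, or unreadable file.
        return b""


def sniff_rdata(filename):
    """Return True if the file, raw or compressed, starts with an RData header."""
    for opener in (open, gzip.open, bz2.open, lzma.open):
        if _read_head(opener, filename) in RDATA_MAGIC:
            return True
    return False
```

In a Galaxy datatype class, `sniff( self, filename )` could then simply return `sniff_rdata(filename)`. Only a few bytes are decompressed, so the check should stay cheap even for large files.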

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Lukasse, Pieter
Sent: Thursday, 6 November 2014 10:25
To: 'Ross'
Cc: John Chilton; [hidden email]
Subject: Re: [galaxy-dev] strange issue with .RData files

 


Re: strange issue with .RData files

Ross-2
Pieter, 
<speculation type="untested">
if the uploaded file has an ext of .Rdata and you have defined an Rdata datatype as (eg) a subclass of gzip, then AFAIK the extension will be matched as part of the sniff processing without the contents needing to be examined - ie it should "just work" without a new specialised sniffer?
</speculation>
I think that different versions of R may have different headers in rdata - worse, they may have already been compressed so reading the header may be challenging - and not needed if the extension of the upload matches the datatype definition?

Unfortunately, without actually reading contents to find a reliable magic number, it will be difficult to stop a user uploading an rdata file called "foo.fasta" - unpredictable things may happen.


On Thu, Nov 6, 2014 at 8:35 PM, Lukasse, Pieter <[hidden email]> wrote:
