S3/Swift object store cache path and `extra_dir`s

Brian Claywell-2
Our local instance currently uses the traditional directories under
`database/` for datasets, job working directories, and temporary
files. Ultimately we wish to transition to using our Swift object
store for storage. We've been doing some experimentation with Galaxy's
Swift backend and have run into a few issues.

The first major issue we came across was Swift's 5 GB segment size
limit, since the segmentation/multipart upload code is bypassed for
instances of SwiftObjectStore [1]. SwiftStack support provided a patch
enabling multipart uploads for Swift (PR #648) which has been working
well for us so far. (Thanks, Charles!)
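
To make the size limit concrete, here is a minimal sketch of how a large file could be divided into byte ranges for a segmented upload. The helper name `plan_segments` and the 1 GiB default chunk size are assumptions for illustration only; this is not the code from the patch or from Galaxy.

```python
def plan_segments(total_size, chunk_size=1024 ** 3):
    """Return (offset, length) pairs that together cover total_size bytes.

    Hypothetical helper: chunk_size is assumed to be chosen below the
    store's per-object limit (5 GB for Swift, per the message above).
    """
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    segments = []
    offset = 0
    while offset < total_size:
        # The last segment may be shorter than chunk_size.
        length = min(chunk_size, total_size - offset)
        segments.append((offset, length))
        offset += length
    return segments
```

Each pair could then be read from the cached file and uploaded as its own segment.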

The next issue is that the `path` attribute of the `cache` tag in
`object_store_conf.xml` appears to be ignored. The value does get stored
to `self.cache_path` in `_parse_config_xml`, but elsewhere in the file
`self.staging_path` is used instead.
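
The symptom can be reproduced with a small stand-in class. This is a sketch under assumptions, not Galaxy's code: the class name `CacheConfigSketch`, the method `cache_file_path`, and the default staging path `database/files` are all hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET

CONFIG = (
    '<object_store type="swift">'
    '<cache path="database/object_store_cache" size="1000"/>'
    '</object_store>'
)


class CacheConfigSketch:
    """Hypothetical stand-in for the object store class."""

    def __init__(self, config_xml, default_staging_path="database/files"):
        # Assumed default, used whenever the config does not override it.
        self.staging_path = default_staging_path
        self._parse_config_xml(ET.fromstring(config_xml))

    def _parse_config_xml(self, root):
        cache = root.find("cache")
        # The value lands in an attribute that nothing else reads.
        self.cache_path = cache.get("path")

    def cache_file_path(self, rel_path):
        # Other methods read staging_path, so the configured path is unused.
        return posixpath.join(self.staging_path, rel_path)


store = CacheConfigSketch(CONFIG)
```

In this sketch, assigning to `self.staging_path` inside `_parse_config_xml` makes the configured path take effect, which mirrors the rename described later in this message.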

Finally, adding `extra_dir` tags to the Swift object store config
appears to have no effect. Here's my `object_store_conf.xml`:

<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="swift" id="primary" order="0">
            <auth access_key="..." secret_key="..."/>
            <bucket name="galaxy_store"/>
            <connection host="tin.fhcrc.org" port="443"/>
            <cache path="database/object_store_cache" size="1000"/>
            <extra_dir type="temp" path="database/tmp"/>
            <extra_dir type="job_work" path="database/job_working_directory"/>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files"/>
        </object_store>
    </backends>
</object_store>

The goal with the hierarchical setup above is for new datasets to be
created in the primary (Swift) object store, caching to
`database/object_store_cache`, while the job and temporary directories
remain at `database/job_working_directory` and `database/tmp`,
respectively. Existing (pre-Swift) datasets remain in `database/files`
and are handled by the secondary disk store.
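
The intended behavior can be sketched as follows. The classes `DictBackend` and `HierarchicalSketch` are hypothetical toys, not Galaxy's `HierarchicalObjectStore`; the assumption is only that lookups try backends in ascending `order` and that new objects go to the first backend.

```python
class DictBackend:
    """Toy in-memory backend, for illustration only."""

    def __init__(self, name, objects=()):
        self.name = name
        self.objects = set(objects)

    def exists(self, obj_id):
        return obj_id in self.objects

    def create(self, obj_id):
        self.objects.add(obj_id)


class HierarchicalSketch:
    """Tries backends in ascending order; creates in the first one."""

    def __init__(self, backends):
        # backends maps the config's order value to a backend instance.
        self.backends = [backends[key] for key in sorted(backends)]

    def find(self, obj_id):
        for backend in self.backends:
            if backend.exists(obj_id):
                return backend
        return None

    def create(self, obj_id):
        primary = self.backends[0]
        primary.create(obj_id)
        return primary


swift = DictBackend("primary")
disk = DictBackend("secondary", {"old_dataset"})
hierarchy = HierarchicalSketch({0: swift, 1: disk})
```

Here `hierarchy.find("old_dataset")` resolves to the disk backend, while anything passed to `hierarchy.create` ends up in the Swift stand-in.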

What actually happens (after renaming `self.cache_path` to
`self.staging_path` in `_parse_config_xml` to get the cache path working)
is this:

galaxy.jobs DEBUG 2015-02-06 16:07:26,615 (1) Working directory for job is: /home/bclaywel/workspace/galaxy-central/database/object_store_cache/000/1

That is, the job working directory is created directly under the cache
path's hash directories. I assume temp files would probably end up
there also.
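
The difference between what I expected and what I see can be sketched like this. The function `resolve_dir` and its parameters are hypothetical, and the fixed `000` hash directory is simply taken from the log line above; Galaxy's real path construction may differ.

```python
import posixpath


def resolve_dir(obj_id, default_base, extra_dirs, base_dir=None, hash_dir="000"):
    """Pick the configured extra_dir for base_dir, else fall back to default_base.

    Hypothetical resolver: hash_dir is hard-coded here rather than derived
    from obj_id.
    """
    base = extra_dirs.get(base_dir, default_base)
    return posixpath.join(base, hash_dir, str(obj_id))


CACHE = "database/object_store_cache"
EXTRA_DIRS = {
    "temp": "database/tmp",
    "job_work": "database/job_working_directory",
}

# Expected: the job_work extra_dir is honored.
expected = resolve_dir(1, CACHE, EXTRA_DIRS, base_dir="job_work")
# Observed: behaves as if no extra_dirs were configured.
observed = resolve_dir(1, CACHE, {}, base_dir="job_work")
```

With an empty `extra_dirs` mapping the sketch falls back to the cache path, which matches the relative path in the log output.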

We're quite excited to get Galaxy and Swift working well together, and
I'm more than happy to help debug and test!


Cheers,

Brian


[1] https://bitbucket.org/galaxy/galaxy-central/src/54ed3adb6575addba47d627944ebd72f7547082d/lib/galaxy/objectstore/s3.py?at=default#cl-331

--
Brian Claywell | programmer/analyst
Matsen Group   | http://matsen.fredhutch.org
Fred Hutchinson Cancer Research Center
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: S3/Swift object store cache path and `extra_dir`s

Dannon Baker-2
Hey Brian,

Thanks for the interest in Galaxy's Swift object store!  I also tested Charles' PR and it looks like a nice improvement -- I'll go ahead and get that pulled into Galaxy shortly.

The HierarchicalObjectStore was written with exactly what you're trying to do in mind, so you're definitely on the right track here.  I'll see if I can verify and fix the file location issues you point out and will get back to you.

-Dannon



On Mon Feb 09 2015 at 4:29:35 PM Brian Claywell wrote:
[...]

Re: S3/Swift object store cache path and `extra_dir`s

Brian Claywell-2
Hi Dannon,

I'm stoked that Charles's PR made it into master -- thanks! 

Have you had a chance to look into the job_work and temp extra_dir issue? Please let me know if there's anything I can do to help out!


Cheers,

Brian

On Mon, Feb 9, 2015 at 1:39 PM, Dannon Baker wrote:
[...]

--
Brian Claywell | programmer/analyst
Matsen Group   | http://matsen.fredhutch.org
Fred Hutchinson Cancer Research Center
