Plans for Docker image generation in galaxy


Marius van den Beek

Hi everyone, 

This is my first post to the list, so forgive me if I miss something obvious. I tried to gather all the information I could, so if there has already been a discussion, I’d be glad if you could point me to it.

 

I’m posting to start a discussion about how galaxy is going to handle interactions with Docker.

I know there are already some tools and resources out there that make use of Docker. As I understand it, we can already specify Docker as a job destination in the tool xml, along with an image to use (see https://github.com/apetkau/galaxy-hackathon-2014/tree/master/smalt for a detailed example). This is cool, and allows resources to be controlled, but I think we could take the integration to the next step if we allowed the generation of Docker images in galaxy.

 

I was thinking that, given a trusted base image (i.e. one without root access), we could let users install all the packages they need, commit after the install, and let them use this new Docker image. This could be beneficial for generating new tools with a sandboxed version of the Galaxy Tool Factory (http://www.ncbi.nlm.nih.gov/pubmed/23024011).

 

I am currently working on this (https://bitbucket.org/mvdbeek/dockertoolfactory), and I would like to have your opinion on how to manage these user-generated Docker images.

Some questions that I came across:

Do we store the committed images in the user’s history (they could potentially become very big!)?

How to display available images to the user? Something like a per-user image registry?

How to identify available images (dataset_id as the tag?)

How to transfer user-generated images to the toolshed?

Is it a good idea at all to store user-generated images in the toolshed?

What do we do if the security of the base image is compromised? Obviously we can blacklist execution of images, but what if somebody has installed a dangerous image from the toolshed and is not aware of this?
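
To make two of these questions concrete, here is a minimal sketch, assuming one Docker repository per galaxy user with the dataset_id as the tag, and assuming a blacklist expressed as a set of layer IDs belonging to compromised base images. The `galaxy-user` namespace and the function names `image_ref` and `is_blocked` are hypothetical and not existing galaxy or Docker features; how the layer IDs are obtained is left open.

```python
import re

# Docker tags: letters, digits, '_', '.', '-'; at most 128 characters,
# and not starting with '.' or '-'.
_TAG_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def image_ref(username, dataset_id, namespace="galaxy-user"):
    """Return an image reference such as 'galaxy-user/alice:42'.

    Hypothetical scheme: the repository is derived from the user name,
    the tag is the dataset_id of the committed image.
    """
    # Repository names must be lowercase; replace anything else by '-'.
    repo = re.sub(r"[^a-z0-9]+", "-", username.lower()).strip("-")
    tag = str(dataset_id)
    if not repo or not _TAG_RE.match(tag):
        raise ValueError(
            "cannot build an image reference from %r / %r" % (username, dataset_id)
        )
    return "%s/%s:%s" % (namespace, repo, tag)


def is_blocked(image_layers, blacklisted_layers):
    """Return True if the image contains any blacklisted layer.

    image_layers: IDs of the layers that make up the image.
    blacklisted_layers: IDs of layers from compromised base images.
    """
    return not set(image_layers).isdisjoint(blacklisted_layers)
```

With such a scheme, a per-user listing would amount to showing all tags in that user's repository, and a check like `is_blocked` could be run before any job is dispatched to an image.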

 

Ultimately, I think this could be a way to let advanced users run their own scripts inside galaxy, generate their own tools and tool dependencies inside galaxy and, why not, even have “user-space tools”. It would bridge the gap between galaxy users and galaxy tool developers, so I’m curious what you think about this.

 

Cheers,

Marius

 


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Plans for Docker image generation in galaxy

Björn Grüning
Hi Marius,

thanks for shifting the discussion over to the mailing list.

On 19.08.2014 at 13:13, Marius van den Beek wrote:
> Hi everyone,
>
> it’s my first post to the list, so forgive me if I miss something obvious,
> I tried to get all the information I could so if there was already a
> discussion I’d be glad if you can point me towards this.


https://trello.com/c/fYB8fMY4/43-docker-in-galaxy-in-docker
https://trello.com/c/tm1fPvyq
https://trello.com/c/8k55T24F
https://github.com/bgruening/galaxy-ipython
https://github.com/bgruening/docker-galaxy-stable

The docker-galaxy-stable image can be used to generate a self-defined set
of tools with a fully working Galaxy instance.

> I’m posting to start a discussion about how galaxy is going to handle
> interactions with Docker.
>
> I know there are already some tools and resources out there that make use
> of Docker. As I understood, we can already specify Docker as a job
> destination in the tool xml, along with an image to use (See
> https://github.com/apetkau/galaxy-hackathon-2014/tree/master/smalt for a
> detailed example). This is cool, and allows for the controlling of
> resources, but I think we could take the integration to the next step if we
> allowed the generation of Docker images in galaxy.


Since the latest release, the Tool Shed can generate "ready to go"
Dockerfiles with all your favorite tools.


> I was thinking if there is a trusted baseimage (ie without root access), we
> could let users install

Can you please elaborate on this a little bit more?

> all the packages they need, commit after the install, and let them use this
> new docker image. This could be beneficial for generating new tools with a
> sandboxed version of the Galaxy Tool Factory (
> http://www.ncbi.nlm.nih.gov/pubmed/23024011).


> I am currently working on this (
> https://bitbucket.org/mvdbeek/dockertoolfactory), and I would like to have
> your opinion on how to manage these user-generated Docker images.
>
> Some questions that I came across:
>
> Do we store the committed images in the user’s history (they could
> potentially become very big!)?
>
> How to display available images to the user? Something like a per-user
> image registry?
>
> How to identify available images (dataset_id as the tag?)
>
> How to transfer user-generated images to the toolshed?
>
> Is it a good idea at all to store user-generated images in the toolshed?
>
> What do we do if the security of the baseimage is compromised? Obviously
>
> we can blacklist execution of images, but what if somebody installed a
> dangerous image from the toolshed, and is not aware of this?

To answer these questions it would be good to have some use-cases for
your docker tool images.
For example I think most of the time an IPython instance would be
sufficient. You will only store notebook files in your history (small)
and execute them on new datasets.

I would be really interested in your use-cases.

Imho, all this should be handled with care. In the end we would like to
have proper tools and images in the toolshed that are reusable by many
others. I consider IPython integration and the Toolfactory a bonus, and
tools mostly for developers: nothing a biologist should have to deal with.
And we should take care not to overcomplicate the tool setup/usage.


> Ultimately I think this could be a way to have advanced users run their own
> scripts inside galaxy, to generate their own tools and tool-dependencies
> inside galaxy, and why not, even have “user-space tools”.
>
> It would bridge the gap between galaxy users and galaxy tool-developers,
>
> so I’m curious what you’re thinking about this.

Absolutely, thanks for bringing this up!
Bjoern

>
>
> Cheers,
>
> Marius
>
>
>

Re: Plans for Docker image generation in galaxy

Marius van den Beek
On 19 August 2014 14:07, Björn Grüning <[hidden email]> wrote:
> Hi Marius,
>
> thanks for shifting the discussion over to the mailing list.
>
> On 19.08.2014 at 13:13, Marius van den Beek wrote:
>
> > Hi everyone,
> >
> > it’s my first post to the list, so forgive me if I miss something obvious,
> > I tried to get all the information I could so if there was already a
> > discussion I’d be glad if you can point me towards this.
>
> https://trello.com/c/fYB8fMY4/43-docker-in-galaxy-in-docker
> https://trello.com/c/tm1fPvyq
> https://trello.com/c/8k55T24F
> https://github.com/bgruening/galaxy-ipython
> https://github.com/bgruening/docker-galaxy-stable
>
> The docker-galaxy-stable image can be used to generate a self-defined set of tools with a fully working Galaxy instance.
 
OK, thanks for the collection of resources, that's useful! 


> > I’m posting to start a discussion about how galaxy is going to handle
> > interactions with Docker.
> >
> > I know there are already some tools and resources out there that make use
> > of Docker. As I understood, we can already specify Docker as a job
> > destination in the tool xml, along with an image to use (See
> > https://github.com/apetkau/galaxy-hackathon-2014/tree/master/smalt for a
> > detailed example). This is cool, and allows for the controlling of
> > resources, but I think we could take the integration to the next step if we
> > allowed the generation of Docker images in galaxy.
>
> Since the latest release, the Tool Shed can generate "ready to go" Dockerfiles with all your favorite tools.
 
That's pretty cool indeed, I didn't know this. 
 

> > I was thinking if there is a trusted baseimage (ie without root access), we
> > could let users install
>
> Can you please elaborate on this a little bit more?
 
Sure. My idea was that the normal, non-admin galaxy user would be dropped into a more or less populated container with normal user rights, so that there shouldn't be a way to do much harm to the Docker host. They could then install their favorite R, Python, Perl, ... packages (or import a toolshed tool ...).
When they are satisfied and all the software is there, a commit is triggered and the image is stored for later use.
Alternatively, the steps necessary for installation could be transformed into a script that is added and executed during the Docker build process.
The advantage here is that you don't need to be an admin to do this.
I'll try to get the basic functionality working; a demo of this might clear things up.
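
As a rough sketch of the two approaches described above (commit after an interactive install, or replay the install steps during a build), the Docker invocations could be assembled as follows. The function names, the `galaxy` user inside the base image and the uid/gid of 1000 are assumptions for illustration; only `docker run --name`, `docker run --user`, `docker commit` and the Dockerfile instructions `FROM`, `USER` and `RUN` are standard Docker.

```python
def run_command(base_image, container_name, uid=1000, gid=1000):
    """Build the argv for an interactive shell running as a non-root user."""
    return [
        "docker", "run", "-i", "-t",
        "--name", container_name,        # named so it can be committed later
        "--user", "%d:%d" % (uid, gid),  # normal user rights, not root
        base_image, "/bin/bash",
    ]


def commit_command(container_name, image_ref):
    """Build the argv that snapshots the container as a new image."""
    return ["docker", "commit", container_name, image_ref]


def render_dockerfile(base_image, install_steps, user="galaxy"):
    """Alternative: replay the recorded install steps at build time.

    Assumes the (hypothetical) unprivileged user already exists in the
    base image.
    """
    lines = ["FROM %s" % base_image, "USER %s" % user]
    lines += ["RUN %s" % step for step in install_steps]
    return "\n".join(lines) + "\n"
```

Nothing is executed by the sketch itself; the returned lists could be handed to `subprocess.check_call`, and the Dockerfile text written to a build directory for `docker build`.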


> > all the packages they need, commit after the install, and let them use this
> > new docker image. This could be beneficial for generating new tools with a
> > sandboxed version of the Galaxy Tool Factory (
> > http://www.ncbi.nlm.nih.gov/pubmed/23024011).
> >
> > I am currently working on this (
> > https://bitbucket.org/mvdbeek/dockertoolfactory), and I would like to have
> > your opinion on how to manage these user-generated Docker images.
> >
> > Some questions that I came across:
> >
> > Do we store the committed images in the user’s history (they could
> > potentially become very big!)?
> >
> > How to display available images to the user? Something like a per-user
> > image registry?
> >
> > How to identify available images (dataset_id as the tag?)
> >
> > How to transfer user-generated images to the toolshed?
> >
> > Is it a good idea at all to store user-generated images in the toolshed?
> >
> > What do we do if the security of the baseimage is compromised? Obviously
> > we can blacklist execution of images, but what if somebody installed a
> > dangerous image from the toolshed, and is not aware of this?
>
> To answer these questions it would be good to have some use-cases for your docker tool images.
> For example I think most of the time an IPython instance would be sufficient. You will only store notebook files in your history (small) and execute them on new datasets.

Yes, especially since you can also interact with R and the shell inside IPython, but right now it can't be used inside a workflow, for example ... unless I missed something.

> I would be really interested in your use-cases.

Well, in my experience, many biologists working with genomics data *can* put together a command line, but few are going to use galaxy, because they can't run a specific tool that is not available in their institute's instance, or they need a specific option that is not exposed ... . And so the students and other people in their team are not going to use galaxy either.
I think that the moment they have the possibility to run their own code, a lot of them could be lured into the galaxy ecosystem, especially if you show them how easy it is to share their analyses with coworkers ... .
So you can let those people run any script. If it works as expected, it can be used in workflows and/or transformed into a (local) toolshed tool, which would even allow version control.
Of course, a toolshed tool is not automatically a good tool, but it's something to start with!
You could even think about letting them import tools from the toolshed and have them sandboxed transparently.

So that would be the main use case: basically the same target audience as the original toolfactory, but extended to non-admins.
It might not be a good idea for the main galaxy instance, but why not do this for the server at your local institute?


> Imho, all this should be handled with care. In the end we would like to have proper tools and images in the toolshed that are reusable by many others. I consider IPython integration and the Toolfactory a bonus, and tools mostly for developers: nothing a biologist should have to deal with. And we should take care not to overcomplicate the tool setup/usage.


I agree, it has to stay simple (or become easier!), shareable and reproducible. But I think we can come up with ways to get more command-line people on board without increasing complexity.
I disagree on one point though: I think that biologists dealing with big data have to know at least one programming/scripting language, so why not have them learn inside galaxy and IPython, with their own data?
That would be much better than GFFs edited in Excel ... .
 


> > Ultimately I think this could be a way to have advanced users run their own
> > scripts inside galaxy, to generate their own tools and tool-dependencies
> > inside galaxy, and why not, even have “user-space tools”.
> >
> > It would bridge the gap between galaxy users and galaxy tool-developers,
> >
> > so I’m curious what you’re thinking about this.
>
> Absolutely, thanks for bringing this up!
> Bjoern

Sure, I'm looking forward to how this is going to develop!
More feedback welcome! 



Cheers,

Marius


