galaxy user data on file system with limited file lifetime

Matthias Bernt
Dear galaxy developers,

The question in short:
How can galaxy user data (e.g. file_path) be stored safely on a file
system where files have a limited lifetime?

Galaxy will run on the head node of a cluster (~2000 cores) where
data is stored in three locations:

/home (Individual User Homes) with 50 GB quota
/data (Research Group Directories) with 4 to 100 TB quota
/work (User Directories - Temporary File Area) with 60 days lifetime

We plan to store galaxy related data as follows:

/work/galaxy/files <- file_path
/work/galaxy/tmp   <- new_file_path
/work/galaxy/jobs  <- job_working_directory

As a note: Storing these in /data/ would undermine our quota system,
which our admins would not like.

/data/galaxy will contain the galaxy installation including tool data
(and I hope that we can just set the quotas high enough to never run out
of space).

Data libraries will be added using the "link mechanism" from /home/USER
and /data/GROUP. I hope that I can automate the import and the
appropriate setting of permissions via the API / bioblend. Do scripts
for this already exist?
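
To make the import idea concrete, here is a rough sketch of what such a script might look like. It assumes that `gi` behaves like a bioblend `GalaxyInstance` created with an admin API key, and that path-based library uploads are enabled on the server. The function name `link_paths_into_library` and the role handling are only illustrative, not an existing script.

```python
def link_paths_into_library(gi, library_name, paths, access_role_ids=None):
    """Create a data library and link (not copy) existing files into it.

    `gi` is assumed to behave like bioblend.galaxy.GalaxyInstance.
    Returns the id of the newly created library.
    """
    library = gi.libraries.create_library(library_name)
    # bioblend expects the server-side paths as one newline-separated string.
    gi.libraries.upload_from_galaxy_filesystem(
        library["id"],
        "\n".join(paths),
        link_data_only="link_to_files",  # keep files where they are
    )
    if access_role_ids:
        # Restrict access to the given roles (e.g. a user's private role).
        gi.libraries.set_library_permissions(
            library["id"], access_in=access_role_ids
        )
    return library["id"]
```

With bioblend installed, `gi` would be created along the lines of `GalaxyInstance(url, key)`; the URL, the key, and the role ids depend on the local installation.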

Is this scheme reasonable?

If yes: The main question is how I can guarantee that the lifetime of
data in /work/ and the galaxy server work well together.

My idea consists of two parts:

1. Adapt cleanup_datasets.py (i.e. the function purge_histories) such
that all histories that have reached the file system lifetime are
purged, including those that have not been deleted.
The required modification seems to be removing the test:
app.model.History.table.c.deleted == true()
The data sets contained in these histories will be purged at the same
time.

2. Using the API I will get the update time of each history or the
update time of the youngest included data set (or are these the same
anyway?). For the files corresponding to the included data sets I will
update the access times in the file system. This way I can guarantee
that only complete histories are purged.
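
The two parts above could be sketched roughly as follows. This is only an illustration under several assumptions: the history records are plain dicts with hypothetical keys `id`, `update_time` (a `datetime`) and `dataset_paths` (collected beforehand, e.g. via the API), the file system expires files by access time, and setting each file's access time to its history's update time is what makes all files of one history expire together.

```python
import os
from datetime import datetime, timedelta


def split_histories(histories, now, max_age_days):
    """Partition history records into (expired, alive) by update_time.

    The 'deleted' state is deliberately not consulted, mirroring the
    idea of dropping that test from purge_histories.
    """
    cutoff = now - timedelta(days=max_age_days)
    expired = [h for h in histories if h["update_time"] < cutoff]
    alive = [h for h in histories if h["update_time"] >= cutoff]
    return expired, alive


def refresh_atime(path, when):
    """Set a file's access time to `when`, keeping its modification time."""
    mtime = os.stat(path).st_mtime
    os.utime(path, (when.timestamp(), mtime))


def keep_alive(histories, now, max_age_days):
    """Align file access times with history update times.

    Returns the expired histories, which are the candidates for purging.
    """
    expired, alive = split_histories(histories, now, max_age_days)
    for history in alive:
        for path in history["dataset_paths"]:
            if os.path.exists(path):
                refresh_atime(path, history["update_time"])
    return expired
```

The purging itself is left to the adapted cleanup_datasets.py; the sketch only decides which histories are due and keeps the files of the remaining ones in step with their history.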

The script(s) can then be run via cron with a lifetime set to 1 day
less than the file system lifetime (just to be sure).

In theory jobs could run longer than 60 days. Therefore my idea would be
to update access times of all files in job_working_directory daily.
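
A minimal sketch of such a daily job, assuming again that the file system expires files by access time. The function name `touch_tree` is made up for illustration, and the directory to pass in would be the job_working_directory (here /work/galaxy/jobs).

```python
import os
import time


def touch_tree(root, now=None):
    """Update the access time of every file below `root`.

    Modification times are left unchanged. Returns the number of
    files that were touched.
    """
    now = time.time() if now is None else now
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
                os.utime(path, (now, mtime))
                touched += 1
            except FileNotFoundError:
                # Running jobs may remove files while the tree is walked.
                continue
    return touched
```

Run once a day from cron, for example as `touch_tree("/work/galaxy/jobs")`, this should keep the files of long-running jobs from reaching the 60 days limit.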

Thank you very much for any help.



Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ/
Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296,
[hidden email], www.ufz.de

Sitz der Gesellschaft/Registered Office: Leipzig
Registergericht/Registration Office: Amtsgericht Leipzig
Handelsregister Nr./Trade Register Nr.: B 4703
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board:
MinDirig Wilfried Kraus
Wissenschaftlicher Geschäftsführer/Scientific Managing Director:
Prof. Dr. Dr. h.c. Georg Teutsch
Administrative Geschäftsführerin/ Administrative Managing Director:
Prof. Dr. Heike Graßmann
