LWR runner configuration for shared folder in cluster


LWR runner configuration for shared folder in cluster

Krampis, Konstantinos
Hi all,

  I am trying to set up a Galaxy cluster using the LWR runner. The nodes have
a shared filesystem, and in universe_wsgi.ini these parameters are set:

job_working_directory = /mnt/shared
...
clustalw = lwr://http://192.168.33.12:8913
....

This folder has been chown-ed to the galaxy user and is also a+w; I have
verified that it can be read and written by SSH-ing into each node of the
cluster. The sticky bit is set.

When I try to run jobs (I used clustalw as an example) there seems to be
confusion between where Galaxy puts files and where the LWR tries to read
them from. Here are two setups that error out:




1) When the following is set in the LWR server.ini:
staging_directory = /mnt/shared/000

galaxy error:

galaxy.jobs DEBUG 2013-05-06 10:21:22,320 (128) Working directory for job is: /mnt/shared/000/128
galaxy.jobs.handler DEBUG 2013-05-06 10:21:22,320 dispatching job 128 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:21:22,427 (128) Job dispatched
galaxy.datatypes.metadata DEBUG 2013-05-06 10:21:22,875 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:21:22,902 failure running job 128


lwr error (on the cluster node):

  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/app.py", line 81, in setup
    manager.setup_job_directory(job_id)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/manager.py", line 101, in setup_job_directory
    os.mkdir(job_directory)
OSError: [Errno 17] File exists: '/mnt/shared/000/128'
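
(If it helps reproduce the problem: the collision reduces to two lines of
Python, assuming - as in this setup - that both services point at the same
job directory and /mnt/shared is writable:)

import os

os.makedirs('/mnt/shared/000/128')  # Galaxy creates the job working directory first
os.mkdir('/mnt/shared/000/128')     # the LWR's setup then raises OSError: [Errno 17] File exists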




2) When the following is set in the LWR server.ini:
staging_directory = /mnt/shared

galaxy error:

galaxy.jobs DEBUG 2013-05-06 10:28:46,872 (129) Working directory for job is: /mnt/shared/000/129
galaxy.jobs.handler DEBUG 2013-05-06 10:28:46,872 dispatching job 129 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:28:46,967 (129) Job dispatched
192.168.33.1 - - [06/May/2013:10:28:48 -0200] "GET /api/histories/2a56795cad3c7db3 HTTP/1.1" 200 - "http://192.168.33.11:8080/history" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31"
galaxy.jobs.runners.lwr DEBUG 2013-05-06 10:28:50,653 run_results {'status': 'status', 'returncode': 0, 'complete': 'true', 'stderr': '', 'stdout': ''}
galaxy.datatypes.metadata DEBUG 2013-05-06 10:28:50,970 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:28:51,050 failure running job 129


lwr error (on the cluster node):

    resp.app_iter = FileIterator(result)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 111, in __init__
    self.input = open(path, 'rb')
IOError: [Errno 2] No such file or directory: u'/mnt/shared/129/outputs/dataset_170.dat'
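
(Judging from the two paths, my assumption is that the LWR composes its
output location as staging_directory + job_id + '/outputs', while Galaxy
writes under job_working_directory + '/000/' + job_id, so the two never
line up. A small illustration of that guess:)

import os

staging_directory = '/mnt/shared'   # the LWR's server.ini value in this setup
job_id = '129'
print(os.path.join(staging_directory, job_id, 'outputs', 'dataset_170.dat'))
# /mnt/shared/129/outputs/dataset_170.dat  <- where the LWR looks
print(os.path.join('/mnt/shared', '000', job_id))
# /mnt/shared/000/129                      <- where Galaxy actually writes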




The full error stacks are at the end of this email. It might be something very simple that I am missing,
but any feedback would be greatly appreciated. Thanks!

Ntino



--
Konstantinos (Ntino) Krampis, Ph.D.
Asst. Professor, Informatics
J.Craig Venter Institute

[hidden email]
[hidden email]
+1-540-200-8277

Web:
http://bit.ly/cloud-research
http://cloudbiolinux.org/
http://twitter.com/agbiotec



---- GALAXY ERROR

galaxy.jobs DEBUG 2013-05-06 10:21:22,320 (128) Working directory for job is: /mnt/shared/000/128
galaxy.jobs.handler DEBUG 2013-05-06 10:21:22,320 dispatching job 128 to lwr runner
galaxy.jobs.handler INFO 2013-05-06 10:21:22,427 (128) Job dispatched
galaxy.datatypes.metadata DEBUG 2013-05-06 10:21:22,875 Cleaning up external metadata files
galaxy.jobs.runners.lwr ERROR 2013-05-06 10:21:22,902 failure running job 128
Traceback (most recent call last):
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 286, in run_job
    file_stager = FileStager(client, command_line, job_wrapper.extra_filenames, input_files, output_files, job_wrapper.tool.tool_dir)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 40, in __init__
    job_config = client.setup()
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 212, in setup
    return self.__raw_execute_and_parse("setup", { "job_id" : self.job_id })
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 150, in __raw_execute_and_parse
    response = self.__raw_execute(command, args, data)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 146, in __raw_execute
    response = self.url_open(request, data)
  File "/home/vagrant/galaxy-dist/lib/galaxy/jobs/runners/lwr.py", line 134, in url_open
    return urllib2.urlopen(request, data)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error



---- LWR ERROR

Exception happened during processing of request from ('192.168.33.11', 44802)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 638, in __init__
    self.handle()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 442, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 437, in handle_one_request
    self.wsgi_execute()
  File "/usr/local/lib/python2.7/dist-packages/Paste-1.7.5.1-py2.7.egg/paste/httpserver.py", line 287, in wsgi_execute
    self.wsgi_start_response)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 35, in __call__
    return controller(environ, start_response, **request_args)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/framework.py", line 90, in controller_replacement
    result = func(**args)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/app.py", line 81, in setup
    manager.setup_job_directory(job_id)
  File "/home/vagrant/jmchilton-lwr-5213f6dce32d/lwr/manager.py", line 101, in setup_job_directory
    os.mkdir(job_directory)
OSError: [Errno 17] File exists: '/mnt/shared/000/128'

Re: LWR runner configuration for shared folder in cluster

John Chilton
Hello Dr. Krampis,

  At present, the LWR is most valuable when there is no shared
filesystem between the server executing the jobs and the server
hosting Galaxy. In this case you do seem to have a shared filesystem, so I
would think setting up something like Sun Grid Engine to manage the
jobs and using the DRMAA job runner would be the best route forward.

  The LWR and the corresponding Galaxy job runner coordinate to
stage jobs, but the upshot is that the LWR should be the only thing
writing to its staging directory. In this case you have configured the
LWR and Galaxy to both use the same directory. You should change this
configuration immediately; I am worried the LWR is going to delete or
overwrite files maintained by Galaxy. I am sorry for the confusion; I
will update the documentation to explicitly warn against this.
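
For example, keeping the two trees completely disjoint might look like the
following (hypothetical paths, just to illustrate the idea):

# universe_wsgi.ini (Galaxy)
job_working_directory = /mnt/shared/galaxy/job_working_directory

# server.ini (LWR)
staging_directory = /mnt/shared/lwr/staging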

  If you still feel there is a compelling reason to use the LWR in
this situation, you will just want to change the staging_directory in
the LWR configuration to something else. It has long been on my TODO
list to allow one to disable (or selectively disable by path
regexes/globs) file staging with the LWR; it seems like that would
also help in your situation. Let me know if that is of interest
to you.

-John


Re: LWR runner configuration for shared folder in cluster

Krampis, Konstantinos
Hi John,

  Thanks for the quick reply! It is actually a virtual cluster using virtual machines
with VirtualBox, which are brought up via Vagrant (www.vagrantup.com) - that
also sets up the shared filesystem. I chose the LWR because it is not a real
cluster, and I needed something lightweight instead of having to stand up SGE etc.

  I am very interested in the third option you present in your response. I am
still confused, though: if Galaxy and the LWR do not use a shared filesystem, how does
Galaxy know when the job is finished, or where to find the output files?

Or does the output get "piped" back to Galaxy? That would work (all virtual machines
are on the same box, so no latency), as long as the output gets deleted by the LWR after
being sent to Galaxy (to save space in the worker virtual machines).

  If you could provide some pointers to documentation / code, or explain how
this could be set up, it would be of great help!

 many thanks,

Ntino

 

Re: LWR runner configuration for shared folder in cluster

John Chilton
On Mon, May 6, 2013 at 9:35 AM, Krampis, Konstantinos <[hidden email]> wrote:
> Hi John,
>
>   Thanks for the quick reply! It is actually a virtual cluster using virtual machines
> with VirtualBox, which are brought up via Vagrant (www.vagrantup.com) - that
> also sets up the shared filesystem. I chose the LWR because it is not a real
> cluster, and I needed something lightweight instead of having to stand up SGE etc.

Fascinating. Unless you don't have root on the physical server, I
would still really recommend SGE. I cannot see any reason it shouldn't
work on virtual servers; it works with CloudMan, and there the master
and workers are both virtualized.

>
>   I am very interested in the third option you present in your response. I am
> still confused, though: if Galaxy and the LWR do not use a shared filesystem, how does
> Galaxy know when the job is finished, or where to find the output files?
>
> Or does the output get "piped" back to Galaxy? That would work (all virtual machines
> are on the same box, so no latency), as long as the output gets deleted by the LWR after
> being sent to Galaxy (to save space in the worker virtual machines).

The LWR job runner communicates with the LWR server over HTTP to work
out these details. The job description is rewritten with new file
paths (input, output, tool files, config files), staged, and submitted
to the remote LWR server, which executes it and tracks its progress.
Meanwhile, the LWR job runner polls the LWR server waiting for
completion of the job. Upon completion, it downloads the outputs and
places them in the paths Galaxy is expecting.
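
In rough pseudocode, the flow looks something like this (the method names
here are placeholders to show the shape of the protocol, not the actual
lwr.py API):

import time

def run_via_lwr(client, rewrite_paths, command_line, input_files, output_files):
    # 'client' stands in for the LWR client object; these method names
    # are illustrative only.
    job = client.setup()                          # remote LWR creates a job directory
    remote_command = rewrite_paths(command_line)  # job description rewritten with remote paths
    for path in input_files:
        client.upload_input(job, path)            # inputs staged over HTTP
    client.launch(job, remote_command)            # job submitted on the remote side
    while not client.check_complete(job):         # runner polls the LWR server
        time.sleep(1)
    for path in output_files:
        client.download_output(job, path)         # outputs pulled back to the paths Galaxy expects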

This is quite different from the other job runners, so there are all
sorts of assumptions tools may make that could cause this to fail. If
you are going to run more than a handful of tools this way, I would
really recommend SGE; otherwise I suspect you are going to spend time
tracking down annoying little problems.

As for the option of not staging the files, I am going to make the
rookie mistake of suggesting that this is straightforward to
implement in your case - as long as all of the files are available on
the remote system. I have created an issue for this; I don't know when
I will get to it, but you can subscribe to the issue to stay in the
loop:

https://bitbucket.org/jmchilton/lwr/issue/8/allow-alternate-staging-rules

To outline how to just disable staging, I think most of the changes
will need to be made to





Re: LWR runner configuration for shared folder in cluster

John Chilton
Oops, that last message got sent before I meant to.

I meant to finish saying that this file would need to be changed:

https://bitbucket.org/galaxy/galaxy-central/src/default/lib/galaxy/jobs/runners/lwr.py#cl-80

queue_job() will likely need to skip everything to do with the
FileStager class and just call client.setup() directly. You will want
to submit the original command line, not a rewritten one. Also, in
finish_job(), I think you will just want to skip the for loops that
do the downloads.
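
Very roughly, a no-staging queue_job() might collapse to something like the
sketch below. This is my guess at the shape, not a drop-in patch; FileStager
and client.setup() come from lwr.py, but the other helper and method names
are assumptions:

def queue_job(self, job_wrapper):
    # Hypothetical no-staging variant; helper names are assumed, not the real lwr.py API.
    command_line = job_wrapper.get_command_line()  # assumed accessor for the original command line
    client = self.get_client(job_wrapper)          # assumed helper that builds the LWR client
    client.setup()                                 # still ask the remote LWR to create a job directory
    client.launch(command_line)                    # submit the original, un-rewritten command line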

Thanks,
-John





___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/