troubleshooting Galaxy with LSF

troubleshooting Galaxy with LSF

I Kozin
Hello,
This is largely a repost from the biostar forum following the suggestion there to post here.

I'm doing my first steps in setting up a Galaxy server with an LSF job scheduler. Recently LSF started supporting DRMAA again so I decided to give it a go. 

I have two setups. The one that works is a standalone server (OpenSuse 12.1, Python 2.7.2, LSF 9.1.2). By "works" I mean that when I log into Galaxy using a browser and upload a file, a job gets submitted and run and everything seems fine.

The second setup does not work (RH 6.4, Python 2.6.6, LSF 9.1.2). It's a server running Galaxy which is meant to submit jobs to an LSF cluster. When I similarly pick and upload a file I get

Job <72266> is submitted to queue <short>.
./run.sh: line 79: 99087 Segmentation fault      python ./scripts/paster.py serve universe_wsgi.ini $@

For the moment, I'm not bothered with the full server setup; I'm just testing whether Galaxy works with LSF, and therefore run ./run.sh as a user. 

The job configuration job_conf.xml is identical in both cases:

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="lsf" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
            <param id="drmaa_library_path">/opt/gridware/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so</param>
        </plugin>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="lsf_default">
        <destination id="lsf_default" runner="lsf">
            <param id="nativeSpecification">-W 24:00</param>
        </destination>
    </destinations>
</job_conf>

run.sh is only changed to allow remote access.

Most recently I tried replacing Python with 2.7.5, to no avail; still the same kind of error. I also updated Galaxy.

Any hints would be much appreciated. Thank you


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: troubleshooting Galaxy with LSF

Kandalaft, Iyad

This is just a guess, which may help you troubleshoot.

It could be that Python is hitting the stack size limit: run ulimit -s and set it to a higher value if required.

I’m completely guessing here, but is it possible that the DRMAA library is missing a linked dependency on the Red Hat system? Check with ldd.
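To make both checks concrete, here is a minimal Python sketch. The libdrmaa path is copied from the job_conf.xml earlier in the thread; adjust it for your site.

```python
# Hedged sketch: the two checks suggested above, done from Python.
import ctypes
import resource

# 1. Inspect the stack size limit (the Python-level analogue of "ulimit -s").
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit (soft, hard):", soft, hard)

# 2. Try to dlopen libdrmaa directly; a missing linked dependency surfaces
#    here as an OSError, much like a failure you would spot with ldd.
LIBDRMAA = "/opt/gridware/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so"
try:
    ctypes.CDLL(LIBDRMAA)
    print("libdrmaa loaded OK")
except OSError as e:
    print("libdrmaa failed to load:", e)
```

If the soft limit is small, raising it (ulimit -s unlimited) before starting run.sh is the quick test.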

 

Regards,

Iyad Kandalaft

 

Iyad Kandalaft

Microbial Biodiversity Bioinformatics

Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
960 Carling Ave.| 960 Ave. Carling

Ottawa, ON| Ottawa (ON) K1A 0C6

E-mail Address / Adresse courriel  [hidden email]
Telephone | Téléphone 613-759-1228
Facsimile | Télécopieur 613-759-1701
Teletypewriter | Téléimprimeur 613-773-2600
Government of Canada | Gouvernement du Canada


Re: troubleshooting Galaxy with LSF

I Kozin
Thank you, Iyad. Indeed, setting ulimit -s to unlimited helped move this further along.
I can see now that a job gets generated and submitted. However, Galaxy crashes immediately after that. 
Job <108038> is submitted to queue <short>.
*** glibc detected *** python: free(): invalid pointer: 0x00007fff79f10b64 ***
======= Backtrace: =========
< further output is omitted >

Tracking the job through the scheduler reveals that the job finished successfully.

The command in the job script is something like this:

python /galaxy-dist/tools/data_source/upload.py /galaxy-dist /galaxy-dist/database/tmp/tmpGY5_lI /galaxy-dist/database/tmp/tmpr7VKGy         1:/galaxy-dist/database/job_working_directory/000/1/dataset_1_files:/galaxy-dist/database/files/000/dataset_1.dat

usage: upload.py <root> <datatypes_conf> <json paramfile> <output spec> ...

I cannot re-run it because only the first file in the tmp folder is there. The second (json paramfile, tmpr7VKGy) is gone. I presume dataset_1.dat is the output and it's there.

The second half of the job script is the execution of set_metadata.sh, which I can execute without issues (is this a DB update?).

One significant difference between the two setups is that the working one sits on local disk whereas the non-working one is on Lustre. Could that be relevant?

By the way, is there a method for removing the pending job?
When I re-run Galaxy, it promptly crashes again due to the stuck job.

When Galaxy starts, the only error that I see is this:
IOError: [Errno 2] No such file or directory: './tools/mutation/visualize.xml'
While it might be a good question why the mutation directory is not there, this error is very likely not relevant to the issue.

So I'm open to further suggestions as to how to understand what's going on.

Thank you



Re: troubleshooting Galaxy with LSF

Kandalaft, Iyad

Hi Kozin,

 

Are you using a Python environment specifically for Galaxy?  If not, then jobs running on the compute nodes will be using the wrong Python environment.  I set up Galaxy (via a universe_wsgi.ini option) to source the Python environment for Galaxy before every job.

 

Galaxy is coded to work only if it is shared across the cluster under the same path on all the nodes.  Is this the case for the install sitting on Lustre?  That is, is /home/galaxy/ mounted on every compute node in the cluster from your Lustre filesystem?

I would be interested in the omitted output (assuming it is relevant).

 

Regards,

 

Iyad Kandalaft



Re: troubleshooting Galaxy with LSF

I Kozin
The problem seems to be with DRMAA for Python. While it works fine on the OpenSuse 12.1 box, I'm getting a segfault on RH 6.4. 
Surprisingly, however, the job still gets submitted and runs successfully. 
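One way to confirm this is to take Galaxy out of the picture entirely and drive drmaa-python directly. A minimal sketch, assuming the drmaa-python API (the /bin/sleep payload is a placeholder, not from the thread):

```python
# Hedged sketch: reproduce the submit path outside Galaxy using
# drmaa-python alone. If this segfaults on the RH 6.4 box but not on
# the OpenSuse one, Galaxy itself is off the hook.
try:
    import drmaa
except Exception:  # drmaa-python missing, or libdrmaa not loadable
    drmaa = None

def submit_sleep_job():
    """Submit a trivial job via DRMAA and wait for it to finish."""
    if drmaa is None:
        return None
    s = drmaa.Session()
    s.initialize()
    try:
        jt = s.createJobTemplate()
        jt.remoteCommand = "/bin/sleep"   # trivial payload
        jt.args = ["1"]
        job_id = s.runJob(jt)             # the call reported to segfault on return
        s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        s.deleteJobTemplate(jt)
        return job_id
    finally:
        s.exit()

if __name__ == "__main__":
    print("job id:", submit_sleep_job())
```

Running this under gdb (gdb --args python repro.py) would pinpoint whether the crash is inside libdrmaa or in the ctypes layer on top of it.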


Re: troubleshooting Galaxy with LSF

Kandalaft, Iyad

Just out of curiosity, are you running a Python environment isolated for Galaxy, or just the system-wide Python environment?

 

Regards,

 

Iyad Kandalaft


Re: troubleshooting Galaxy with LSF

INKozin
I'm not sure what you mean. I'm not using a Python virtualenv, Anaconda, etc., but I do set all the variables required to run Galaxy, such as the DRMAA path and PYTHONPATH.
In addition to the default Python 2.6, I also tried Python 2.7 built from source.



Re: troubleshooting Galaxy with LSF

Kandalaft, Iyad
If you are running Galaxy on a cluster, you can do one of two things:
1. Install Python on every node in your cluster and ensure the installs are identical. <-- I've had problems with this method
2. Use pyenv to create a Python environment that gets sourced before you start Galaxy and before any Galaxy jobs are run.  I usually activate it by putting it in .bashrc
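For option 2, a minimal sketch of the Galaxy side, assuming the environment_setup_file option in universe_wsgi.ini (the file path is a placeholder):

```ini
# universe_wsgi.ini -- sketch, path is a placeholder.
# Galaxy sources this shell file on the compute node before each job runs,
# so it is a convenient place to activate a pyenv/virtualenv.
[app:main]
environment_setup_file = /galaxy-dist/env.sh
```

env.sh would then export PATH (or run the pyenv activation) so that compute nodes pick up the same Python as the Galaxy server.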

Iyad Kandalaft
Bioinformatics Programmer
Microbial Biodiversity Bioinformatics
Science & Technology Branch
Agriculture & Agri-Food Canada
[hidden email] | (613) 759-1228

Re: troubleshooting Galaxy with LSF

INKozin
We are doing option 1 on our clusters.
Like I said elsewhere, the problem appears to be in the interaction between Python DRMAA and LSF DRMAA, not Galaxy.



Re: troubleshooting Galaxy with LSF

Nate Coraor (nate@bx.psu.edu)
Hi,

The DRMAA library seems suspect here. I am not sure what the current state of the Platform/IBM-supplied libdrmaa is, but the vendor versions have given me problems in the past. Could you try FedStage DRMAA for LSF?:


This one does look a bit older than FedStage/PSNC's other DRMAA implementations, but the FedStage/PSNC implementations have been known to work well in the past, and this library can also have debugging output enabled (./configure --enable-debug) if you get segfaults with it (which would also be useful for producing cores).

--nate



Re: troubleshooting Galaxy with LSF

INKozin
Hi Nate,
The "new" LSF DRMAA is the old FedStage source code. However, the latest release (v1.1.1) is claimed to be compatible with LSF 9.1.2, which we are using, and it does indeed work.

Yes, I've rebuilt the lib with --enable-debug and confirmed that the segfault happens on returning from drmaa_run_job back to Python. I'm talking to Dan Blanchard, the Python DRMAA maintainer, so there is hope this can be resolved. 

I'm doing my debugging on the latest Python DRMAA release, 0.7.6. Galaxy is still on 0.6, so it will need updating at some point.

Best
Igor
