Error cleaning up Condor jobs


Error cleaning up Condor jobs

Branden Timm-2
Hi All,
  I've been working to configure a new Galaxy instance to run jobs under Condor.  Things are 99% working at this point, but after the Condor job finishes, Galaxy tries to clean up a cluster file that isn't there, namely the .ec (exit code) file.  Relevant log info:

galaxy.jobs DEBUG 2013-05-07 15:02:49,364 (1985) Working directory for job is: /home/GLBRCORG/galaxy/database/job_working_directory/001/1985
galaxy.jobs.handler DEBUG 2013-05-07 15:02:49,387 (1985) Dispatching to condor runner
galaxy.jobs DEBUG 2013-05-07 15:02:49,720 (1985) Persisting job destination (destination id: condor)
galaxy.jobs.handler INFO 2013-05-07 15:02:49,761 (1985) Job dispatched
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,368 (1985) submitting file /home/GLBRCORG/galaxy/database/condor/galaxy_1985.sh
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,369 (1985) command is: python /home/GLBRCORG/galaxy/galaxy-central/tools/fastq/fastq_to_fasta.py '/home/GLBRCORG/galaxy/database/files/000/dataset_3.dat' '/home/GLBRCORG/galaxy/database/files/002/dataset_2842.dat' ''; cd /home/GLBRCORG/galaxy/galaxy-central; /home/GLBRCORG/galaxy/galaxy-central/set_metadata.sh /home/GLBRCORG/galaxy/database/files /home/GLBRCORG/galaxy/database/job_working_directory/001/1985 . /home/GLBRCORG/galaxy/galaxy-central/universe_wsgi.ini /home/GLBRCORG/galaxy/database/tmp/tmpGe1JZJ /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/galaxy.json /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_in_HistoryDatasetAssociation_3161_are5Bg,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_kwds_HistoryDatasetAssociation_3161_p73Yus,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_out_HistoryDatasetAssociation_3161_tLqep6,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_results_HistoryDatasetAssociation_3161_3QSW5X,,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_override_HistoryDatasetAssociation_3161_JUFvmk
galaxy.jobs.runners.condor INFO 2013-05-07 15:02:58,960 (1985) queued as 15
galaxy.jobs DEBUG 2013-05-07 15:02:59,110 (1985) Persisting job destination (destination id: condor)
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:59,536 (1985/15) job is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:16,966 (1985/15) job is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:17,279 (1985/15) job has completed
galaxy.jobs.runners DEBUG 2013-05-07 15:07:17,417 (1985/15) Unable to cleanup /home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec: [Errno 2] No such file or directory: '/home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec'
galaxy.jobs DEBUG 2013-05-07 15:07:17,560 setting dataset state to ERROR
galaxy.jobs DEBUG 2013-05-07 15:07:17,961 job 1985 ended
galaxy.datatypes.metadata DEBUG 2013-05-07 15:07:17,961 Cleaning up external metadata files

I've done a watch on the condor job directory, and as far as I can tell galaxy_1985.ec never gets created.  From a cursory look at lib/galaxy/jobs/runners/__init__.py and condor.py, it looks like the cleanup happens in the AsynchronousJobState::cleanup method, which iterates over the cleanup_file_attributes list.  I naively tried overriding cleanup_file_attributes in CondorJobState to exclude 'exit_code_file', to no avail.
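
Roughly what that override attempt looked like (a from-memory sketch against lib/galaxy/jobs/runners/condor.py; the exact constructor signature and attribute handling may not match the real code):

    # sketch only -- in lib/galaxy/jobs/runners/condor.py, where
    # AsynchronousJobState is already imported from the runners package
    class CondorJobState( AsynchronousJobState ):
        def __init__( self, **kwargs ):
            super( CondorJobState, self ).__init__( **kwargs )
            # drop the exit code file so AsynchronousJobState.cleanup() won't
            # try to unlink the .ec file that never gets created in my setup
            self.cleanup_file_attributes = [ attr for attr in self.cleanup_file_attributes
                                             if attr != 'exit_code_file' ]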

I'm hoping somebody can spot where the hiccup is here.  Another question on my mind: should a failure to clean up cluster files set the dataset state to ERROR?  An inspection of the output file from my job leads me to believe it finished just fine, and indicating failure to the user because Galaxy couldn't clean up a 1-byte exit code file seems a little extreme to me.

Thanks!

--
Branden Timm
[hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Error cleaning up Condor jobs

Nate Coraor (nate@bx.psu.edu)
On May 8, 2013, at 10:08 AM, Branden Timm wrote:

> Hi All,
>   I've been working to configure a new Galaxy instance to run jobs under Condor.  Things are 99% working at this point, but after the Condor job finishes, Galaxy tries to clean up a cluster file that isn't there, namely the .ec (exit code) file.  Relevant log info:
>
> [...]
>
> galaxy.jobs.runners DEBUG 2013-05-07 15:07:17,417 (1985/15) Unable to cleanup /home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec: [Errno 2] No such file or directory: '/home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec'
> galaxy.jobs DEBUG 2013-05-07 15:07:17,560 setting dataset state to ERROR
> galaxy.jobs DEBUG 2013-05-07 15:07:17,961 job 1985 ended
> galaxy.datatypes.metadata DEBUG 2013-05-07 15:07:17,961 Cleaning up external metadata files

Hi Branden,

The ec file is optional, and the message that it's unable to be cleaned up is a red herring in this case.  The state is being set to ERROR, I suspect, because the check of the job's outputs on line 894 of lib/galaxy/jobs/__init__.py is failing:

 894             if ( self.check_tool_output( stdout, stderr, tool_exit_code, job )):

You might need to add some debugging to see where exactly this error determination is coming from.
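
Something as simple as this just above that call should show what's tripping it (an untested sketch; it assumes the module-level log object already used in lib/galaxy/jobs/__init__.py):

    # temporary debugging just above the check_tool_output() call (around line 894)
    log.debug( '(%s) tool_exit_code: %s' % ( job.id, tool_exit_code ) )
    log.debug( '(%s) stdout: %r' % ( job.id, stdout ) )
    log.debug( '(%s) stderr: %r' % ( job.id, stderr ) )
    if ( self.check_tool_output( stdout, stderr, tool_exit_code, job ) ):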

--nate


Re: Error cleaning up Condor jobs

Branden Timm-2
Nate, thanks for the tip.  I'm adding some debugging info around that
block now to inspect what is going on.

One thing I just remembered (been awhile since I debugged Galaxy tools)
- does Galaxy still treat ANY stderr output as an indication of job
failure?  There are two warnings in the stderr for the job:

WARNING:galaxy.datatypes.registry:Error loading datatype with extension 'blastxml': 'module' object has no attribute 'BlastXml'
WARNING:galaxy.datatypes.registry:Error appending sniffer for datatype 'galaxy.datatypes.xml:BlastXml' to sniff_order: 'module' object has no attribute 'BlastXml'

--
Branden Timm
[hidden email]


On 5/8/2013 10:17 AM, Nate Coraor wrote:

> On May 8, 2013, at 10:08 AM, Branden Timm wrote:
>
>> Hi All,
>>    I've been working to configure a new Galaxy instance to run jobs under Condor.  Things are 99% working at this point, but what seems to be happening is after the Condor job finishes Galaxy tries to clean up a cluster file that isn't there, namely the .ec (exit code) file.  Relevant log info:
>>
>> galaxy.jobs DEBUG 2013-05-07 15:02:49,364 (1985) Working directory for job is: /home/GLBRCORG/galaxy/database/job_working_directory/001/1985
>> galaxy.jobs.handler DEBUG 2013-05-07 15:02:49,387 (1985) Dispatching to condor runner
>> galaxy.jobs DEBUG 2013-05-07 15:02:49,720 (1985) Persisting job destination (destination id: condor)
>> galaxy.jobs.handler INFO 2013-05-07 15:02:49,761 (1985) Job dispatched
>> galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,368 (1985) submitting file /home/GLBRCORG/galaxy/database/condor/galaxy_1985.sh
>> galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,369 (1985) command is: python /home/GLBRCORG/galaxy/galaxy-central/tools/fastq/fastq_to_fasta.py '/home/GLBRCORG/galaxy/database/files/000/dataset_3.dat' '/home/GLBRCORG/galaxy/database/files/002/dataset_2842.dat' ''; cd /home/GLBRCORG/galaxy/galaxy-central; /home/GLBRCORG/galaxy/galaxy-central/set_metadata.sh /home/GLBRCORG/galaxy/database/files /home/GLBRCORG/galaxy/database/job_working_directory/001/1985 . /home/GLBRCORG/galaxy/galaxy-central/universe_wsgi.ini /home/GLBRCORG/galaxy/database/tmp/tmpGe1JZJ /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/galaxy.json /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_in_HistoryDatasetAssociation_3161_are5Bg,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_kwds_HistoryDatasetAssociation_3161_p73Yus,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_out_HistoryDatasetAssociation_3161_tLqep6,/home/G!
 LBRCORG/galaxy/database/job_working_directory/001/1985/metadata_results_HistoryDatasetAssociation_3161_3QSW5X,,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_override_HistoryDatasetAssociation_3161_JUFvmk

>> galaxy.jobs.runners.condor INFO 2013-05-07 15:02:58,960 (1985) queued as 15
>> galaxy.jobs DEBUG 2013-05-07 15:02:59,110 (1985) Persisting job destination (destination id: condor)
>> galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:59,536 (1985/15) job is now running
>> galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:16,966 (1985/15) job is now running
>> galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:17,279 (1985/15) job has completed
>> galaxy.jobs.runners DEBUG 2013-05-07 15:07:17,417 (1985/15) Unable to cleanup /home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec: [Errno 2] No such file or directory: '/home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec'
>> galaxy.jobs DEBUG 2013-05-07 15:07:17,560 setting dataset state to ERROR
>> galaxy.jobs DEBUG 2013-05-07 15:07:17,961 job 1985 ended
>> galaxy.datatypes.metadata DEBUG 2013-05-07 15:07:17,961 Cleaning up external metadata files
> Hi Branden,
>
> The ec file is optional, and the message that it's unable to be cleaned up is a red herring in this case.  The state is being set to ERROR, I suspect, because the check of the job's outputs on line 894 of lib/galaxy/jobs/__init__.py is failing:
>
>   894             if ( self.check_tool_output( stdout, stderr, tool_exit_code, job )):
>
> You might need to add some debugging to see where exactly this error determination is coming from.
>
> --nate

Re: Error cleaning up Condor jobs

Nate Coraor (nate@bx.psu.edu)
On May 8, 2013, at 12:21 PM, Branden Timm wrote:

> Nate, thanks for the tip.  I'm adding some debugging info around that block now to inspect what is going on.
>
> One thing I just remembered (been awhile since I debugged Galaxy tools) - does Galaxy still treat ANY stderr output as an indication of job failure?

Yes, unless the tool defines that error codes should be used instead.  Most tools do not.
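
Very roughly, the determination amounts to something like this (illustrative pseudocode only; the names are made up for this sketch, not Galaxy's actual API):

    # illustrative only; not the real check_tool_output() implementation
    def job_failed( tool_uses_exit_codes, stdout, stderr, exit_code ):
        if tool_uses_exit_codes:
            # tool opted in (via <stdio> in its XML) to exit-code-based detection
            return exit_code != 0
        # legacy default: any output at all on stderr marks the job as failed
        return len( stderr ) > 0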

> There are two warnings in the stderr for the job:
>
> WARNING:galaxy.datatypes.registry:Error loading datatype with extension 'blastxml': 'module' object has no attribute 'BlastXml'
> WARNING:galaxy.datatypes.registry:Error appending sniffer for datatype 'galaxy.datatypes.xml:BlastXml' to sniff_order: 'module' object has no attribute 'BlastXml'

This is likely the problem, and should be fixable by updating your datatypes_conf.xml from the sample in your release.

--nate


Re: Error cleaning up Condor jobs

Branden Timm-2
Nate, thanks - replacing the data types file worked.

-Branden

On 5/8/2013 11:29 AM, Nate Coraor wrote:

> On May 8, 2013, at 12:21 PM, Branden Timm wrote:
>
>> Nate, thanks for the tip.  I'm adding some debugging info around that block now to inspect what is going on.
>>
>> One thing I just remembered (been awhile since I debugged Galaxy tools) - does Galaxy still treat ANY stderr output as an indication of job failure?
> Yes, unless the tool defines that error codes should be used instead.  Most tools do not.
>
>> There are two warnings in the stderr for the job:
>>
>> WARNING:galaxy.datatypes.registry:Error loading datatype with extension 'blastxml': 'module' object has no attribute 'BlastXml'
>> WARNING:galaxy.datatypes.registry:Error appending sniffer for datatype 'galaxy.datatypes.xml:BlastXml' to sniff_order: 'module' object has no attribute 'BlastXml'
> This is likely the problem, and should be fixable by updating your datatypes_conf.xml from the sample in your release.
>
> --nate