Database deadlock with large workflows + dataset collections

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Database deadlock with large workflows + dataset collections

Aaron Petkau
Hey,

I wanted to know if anyone else has had experience with database deadlock when using dataset collections and running a large number of samples through a workflow.

Traceback (most recent call last):
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 565, in finish_job
    job_state.job_wrapper.finish( stdout, stderr, exit_code )
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py", line 1250, in finish
    self.sa_session.flush()
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/scoping.py", line 114, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py", line 1718, in flush
    self._flush(objects)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py", line 1789, in _flush
    flush_context.execute()
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py", line 331, in execute
    rec.execute(self)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py", line 475, in execute
    uow
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py", line 59, in save_obj
    mapper, table, update)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py", line 485, in _emit_update_statements
    execute(statement, params)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py", line 1449, in execute
    params)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py", line 1698, in _execute_context
    context)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py", line 1691, in _execute_context
    context)
  File "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/default.py", line 331, in do_execute
    cursor.execute(statement, parameters)
DBAPIError: (TransactionRollbackError) deadlock detected
DETAIL:  Process 25859 waits for ShareLock on transaction 144373; blocked by process 25858.
Process 25858 waits for ShareLock on transaction 144372; blocked by process 25859.
HINT:  See server log for query details.
 'UPDATE workflow_invocation SET update_time=%(update_time)s WHERE workflow_invocation.id = %(workflow_invocation_id)s' {'update_time': datetime.datetime(2015, 2, 27, 3, 51, 57, 81403), 'workflow_invocation_id': 48}

I saw this post http://dev.list.galaxyproject.org/data-collections-workflow-bug-td4666496.html with a similar issue and the solution was to make sure not to use a sqlite database, but I'm using a postgres database and still encountered this issue.  This was after running a very large number of samples (~200) using dataset collections.  Just wondering if anyone else was running into this issue?

Thanks,

Aaron




___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
Reply | Threaded
Open this post in threaded view
|

Re: Database deadlock with large workflows + dataset collections

John Chilton-4
Hey Aaron,

Thanks for the bug report - I have added it to Trello here
(https://trello.com/c/I0n23JEP). I assume this is the January stable
release (15.01)? Any clue if this workflow was being scheduled in a
web thread or a background job handler thread? (Did the submission
take forever - or was the web page responsive and the server bogged
down).

So there is a race condition here it would seem - and I don't have a
fix right away - but I do think (hope) the next release due out in a
couple of weeks will result in a massive speed up in workflow
scheduling - so hopefully we will be less likely to hit these
conditions.

-John

On Fri, Feb 27, 2015 at 10:11 AM, Aaron Petkau <[hidden email]> wrote:

> Hey,
>
> I wanted to know if anyone else has had experience with database deadlock
> when using dataset collections and running a large number of samples through
> a workflow.
>
> Traceback (most recent call last):
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 565, in finish_job
>     job_state.job_wrapper.finish( stdout, stderr, exit_code )
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py",
> line 1250, in finish
>     self.sa_session.flush()
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/scoping.py",
> line 114, in do
>     return getattr(self.registry(), name)(*args, **kwargs)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py",
> line 1718, in flush
>     self._flush(objects)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py",
> line 1789, in _flush
>     flush_context.execute()
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py",
> line 331, in execute
>     rec.execute(self)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py",
> line 475, in execute
>     uow
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py",
> line 59, in save_obj
>     mapper, table, update)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py",
> line 485, in _emit_update_statements
>     execute(statement, params)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1449, in execute
>     params)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1584, in _execute_clauseelement
>     compiled_sql, distilled_params
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1698, in _execute_context
>     context)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1691, in _execute_context
>     context)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/default.py",
> line 331, in do_execute
>     cursor.execute(statement, parameters)
> DBAPIError: (TransactionRollbackError) deadlock detected
> DETAIL:  Process 25859 waits for ShareLock on transaction 144373; blocked by
> process 25858.
> Process 25858 waits for ShareLock on transaction 144372; blocked by process
> 25859.
> HINT:  See server log for query details.
>  'UPDATE workflow_invocation SET update_time=%(update_time)s WHERE
> workflow_invocation.id = %(workflow_invocation_id)s' {'update_time':
> datetime.datetime(2015, 2, 27, 3, 51, 57, 81403), 'workflow_invocation_id':
> 48}
>
> I saw this post
> http://dev.list.galaxyproject.org/data-collections-workflow-bug-td4666496.html
> with a similar issue and the solution was to make sure not to use a sqlite
> database, but I'm using a postgres database and still encountered this
> issue.  This was after running a very large number of samples (~200) using
> dataset collections.  Just wondering if anyone else was running into this
> issue?
>
> Thanks,
>
> Aaron
>
>
>
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
Reply | Threaded
Open this post in threaded view
|

Re: Database deadlock with large workflows + dataset collections

Aaron Petkau
Hello John,

Awesome, thanks so much for looking into this and all the other work you've been doing on dataset collections.  Yes, this is the January stable release, exact commit is https://bitbucket.org/galaxy/galaxy-dist/commits/097bbb3b7d3246faaa5188a1fc2a79b01630025c.  I currently have one web worker and one background job handler so it would have been scheduled in the job handler.  It is configured like:

[server:handler0]
use = egg:Paste#http
port = 8079
host = 127.0.0.1
use_threadpool = true
threadpool_workers = 5


I can paste the relevant section from the handler log, but I didn't see anything there other then the same exception and stack trace.

Our job handler is submitting to our cluster using DRMAA, from job_conf.xml:

<plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="8"/>

The submission did take forever, over an hour.  It was being submitted through the API so I didn't check the web response time but from my experience before I imagine it would be very slow.  I've noticed this slowdown before when I get into the 150-200 paired-end samples range.

I'll look forward to the next release though.  It would be awesome to have the workflow submission sped up.

Let me know if there's any other info I can provide to help debug,

Aaron

On Fri, Feb 27, 2015 at 9:38 AM, John Chilton <[hidden email]> wrote:
Hey Aaron,

Thanks for the bug report - I have added it to Trello here
(https://trello.com/c/I0n23JEP). I assume this is the January stable
release (15.01)? Any clue if this workflow was being scheduled in a
web thread or a background job handler thread? (Did the submission
take forever - or was the web page responsive and the server bogged
down).

So there is a race condition here it would seem - and I don't have a
fix right away - but I do think (hope) the next release due out in a
couple of weeks will result in a massive speed up in workflow
scheduling - so hopefully we will be less likely to hit these
conditions.

-John

On Fri, Feb 27, 2015 at 10:11 AM, Aaron Petkau <[hidden email]> wrote:
> Hey,
>
> I wanted to know if anyone else has had experience with database deadlock
> when using dataset collections and running a large number of samples through
> a workflow.
>
> Traceback (most recent call last):
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 565, in finish_job
>     job_state.job_wrapper.finish( stdout, stderr, exit_code )
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/lib/galaxy/jobs/__init__.py",
> line 1250, in finish
>     self.sa_session.flush()
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/scoping.py",
> line 114, in do
>     return getattr(self.registry(), name)(*args, **kwargs)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py",
> line 1718, in flush
>     self._flush(objects)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/session.py",
> line 1789, in _flush
>     flush_context.execute()
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py",
> line 331, in execute
>     rec.execute(self)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/unitofwork.py",
> line 475, in execute
>     uow
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py",
> line 59, in save_obj
>     mapper, table, update)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/orm/persistence.py",
> line 485, in _emit_update_statements
>     execute(statement, params)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1449, in execute
>     params)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1584, in _execute_clauseelement
>     compiled_sql, distilled_params
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1698, in _execute_context
>     context)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/base.py",
> line 1691, in _execute_context
>     context)
>   File
> "/Warehouse/Applications/irida/galaxy/galaxy-dist/eggs/SQLAlchemy-0.7.9-py2.6-linux-x86_64-ucs4.egg/sqlalchemy/engine/default.py",
> line 331, in do_execute
>     cursor.execute(statement, parameters)
> DBAPIError: (TransactionRollbackError) deadlock detected
> DETAIL:  Process 25859 waits for ShareLock on transaction 144373; blocked by
> process 25858.
> Process 25858 waits for ShareLock on transaction 144372; blocked by process
> 25859.
> HINT:  See server log for query details.
>  'UPDATE workflow_invocation SET update_time=%(update_time)s WHERE
> workflow_invocation.id = %(workflow_invocation_id)s' {'update_time':
> datetime.datetime(2015, 2, 27, 3, 51, 57, 81403), 'workflow_invocation_id':
> 48}
>
> I saw this post
> http://dev.list.galaxyproject.org/data-collections-workflow-bug-td4666496.html
> with a similar issue and the solution was to make sure not to use a sqlite
> database, but I'm using a postgres database and still encountered this
> issue.  This was after running a very large number of samples (~200) using
> dataset collections.  Just wondering if anyone else was running into this
> issue?
>
> Thanks,
>
> Aaron
>
>
>
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/