Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: "Remaining connection slots are reserved for non-replication superuser connections" #117

Open
CasperWA opened this issue Sep 16, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@CasperWA
Copy link
Collaborator

Following error was found from the scdm / autowannier server hosted on Materials Cloud:

[11:35:55 ERROR] QueryError raised.
Error message: 
QueryError
  Message: <font color="red">Error(s) during querying. Message(s) from server:<br> - 'FATAL:  remaining connection slots are reserved for non-replication superuser connections\n'</font>
  Remove target: False
About this exception: Error while querying specific implementation (or provider)
[11:35:55 ERROR] Errored response:
{
  "meta": {
    "query": {
      "representation": "/structures?sort=-nsites&response_format=json&response_fields=nsites&page_limit=1"
    },
    "api_version": "1.0.0",
    "more_data_available": false,
    "time_stamp": "2020-09-16T11:35:55Z",
    "data_returned": 0,
    "provider": {
      "name": "Materials Cloud",
      "description": "A platform for Open Science built for seamless sharing of resources in computational materials science",
      "prefix": "mcloud",
      "homepage": "https://materialscloud.org"
    },
    "data_available": 0,
    "implementation": {
      "name": "aiida-optimade",
      "version": "0.9.0+mcloud",
      "source_url": "https://github.com/aiidateam/aiida-optimade",
      "maintainer": {
        "email": "casper.andersen@epfl.ch"
      }
    }
  },
  "errors": [
    {
      "status": "500",
      "title": "OperationalError",
      "detail": "FATAL:  remaining connection slots are reserved for non-replication superuser connections\n"
    }
  ]
}

This means the server raised a Python exception like: OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections.

@CasperWA CasperWA added the bug Something isn't working label Sep 16, 2020
@CasperWA
Copy link
Collaborator Author

Traceback from within the Docker container on Materials Cloud (aiida-dev):

INFO:     172.19.0.1:34074 - "GET /v1.0.0/structures?sort=nsites&response_format=json&response_fields=nsites&page_limit=1 HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 390, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/fastapi/applications.py", line 180, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/applications.py", line 111, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 25, in __call__
    response = await self.dispatch_func(request, self.call_next)
  File "./aiida_optimade/middleware.py", line 35, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 45, in call_next
    task.result()
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 38, in coro
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 25, in __call__
    response = await self.dispatch_func(request, self.call_next)
  File "/usr/local/lib/python3.7/site-packages/optimade/server/middleware.py", line 73, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 45, in call_next
    task.result()
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/base.py", line 38, in coro
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/cors.py", line 78, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 566, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/site-packages/fastapi/routing.py", line 197, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.7/site-packages/fastapi/routing.py", line 149, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.7/site-packages/starlette/concurrency.py", line 34, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./aiida_optimade/routers/utils.py", line 172, in wrapper
    value = func(*args, **kwargs)
  File "./aiida_optimade/routers/structures.py", line 39, in get_structures
    params=params,
  File "./aiida_optimade/routers/utils.py", line 103, in get_entries
    ) = collection.find(params)
  File "./aiida_optimade/entry_collections.py", line 168, in find
    dict(zip(criteria["project"], entity))
  File "./aiida_optimade/mappers/entries.py", line 68, in map_back
    new_object_attributes, new_object["id"]
  File "./aiida_optimade/mappers/structures.py", line 68, in build_attributes
    translator.store_attributes()
  File "./aiida_optimade/translators/entities.py", line 76, in store_attributes
    with get_manager().get_backend().transaction():
  File "/usr/local/lib/python3.7/site-packages/django/db/transaction.py", line 175, in __enter__
    if not connection.get_autocommit():
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/base/base.py", line 379, in get_autocommit
    self.ensure_connection()
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  remaining connection slots are reserved for non-replication superuser connections

@CasperWA
Copy link
Collaborator Author

@giovannipizzi or @sphuber would you know what the issue is here? Does it have to do with the SQLAlchemy sessions setup in AiiDA or could it be something different?

@sphuber
Copy link

sphuber commented Sep 16, 2020

This is a message from PostgreSQL that it has too many open connections. The default is configured to something like a 100. Exceed that number and new connections will simply be refused. Note that this is across all databases. So if you have a PostgreSQL installation with a hundred databases and each database has a single open connection, then you also hit the limit. Since this is a server, this could be simply this, a scaling/configuration problem, or there may be a bug in our code where connections are not correctly cleaned up. I suspect it is the former where the database just gets too many concurrent open connection requests and some of them fail. How many databases are you running on this server and are they getting a lot of (simultaneous) requests?

@CasperWA
Copy link
Collaborator Author

CasperWA commented Sep 16, 2020

(...) How many databases are you running on this server and are they getting a lot of (simultaneous) requests?

This is connected to a PostgreSQL database cluster on a separate db server, which hosts all Materials Cloud's production databases. Connecting to another database, i.e., another AiiDA-OPTIMADE server, which uses a PostgreSQL database on the separate db server, there is no issue when doing queries.

Each AiiDA-OPTIMADE server is running within Docker, where the external PostgreSQL databases are found through a closed Docker network (as far as I understand, perhaps @ltalirz could clarify this?)

@CasperWA
Copy link
Collaborator Author

On another note, it seems the error is (temporarily) not thrown.

@ltalirz
Copy link
Member

ltalirz commented Sep 16, 2020

There are a number of separate processes using that database (AiiDA REST APIs, optimade APIs, discover sections) but certainly less than 100 overall.
It is possible that some of these services are using more database connections than necessary, and this should be investigated. You could start digging into this by listing the active sessions on the database server
https://dataedo.com/kb/query/postgresql/list-database-sessions

@ltalirz
Copy link
Member

ltalirz commented Sep 16, 2020

these are the active connections from the development server on Materials Cloud using the command I linked above

 process_id | username |    database_name    | client_address | application_name |         backend_start         |        state        |         state_change
------------+----------+---------------------+----------------+------------------+-------------------------------+---------------------+-------------------------------
       1604 |          |                     |                |                  | 2020-09-09 11:12:14.998675+02 |                     |
       1607 | postgres |                     |                |                  | 2020-09-09 11:12:14.998765+02 |                     |
      28998 | mcloud   | optimade_sample     | 192.168.0.33   |                  | 2020-09-16 06:26:14.046928+02 | idle in transaction | 2020-09-17 00:52:11.626857+02
      28999 | mcloud   | optimade_sample     | 192.168.0.33   |                  | 2020-09-16 06:26:14.07599+02  | idle in transaction | 2020-09-17 00:51:11.600115+02
      29001 | mcloud   | threedd             | 192.168.0.33   |                  | 2020-09-16 06:26:19.863526+02 | idle in transaction | 2020-09-17 00:52:18.538998+02
      29002 | mcloud   | threedd             | 192.168.0.33   |                  | 2020-09-16 06:26:19.893697+02 | idle in transaction | 2020-09-17 00:51:18.538999+02
       7035 | mcloud   | optimade_sample     | 192.168.0.33   |                  | 2020-09-09 14:51:51.369393+02 | idle                | 2020-09-17 00:52:14.807548+02
       7036 | mcloud   | optimade_sample     | 192.168.0.33   |                  | 2020-09-09 14:51:51.402525+02 | idle                | 2020-09-17 00:52:14.793051+02
       7042 | mcloud   | hcofs_co2           | 192.168.0.33   |                  | 2020-09-09 14:52:03.437816+02 | idle                | 2020-09-14 12:22:00.713236+02
       7043 | mcloud   | hcofs_co2           | 192.168.0.33   |                  | 2020-09-09 14:52:03.525671+02 | idle in transaction | 2020-09-17 00:51:12.235094+02
       6422 | mcloud   | twodtopo            | 192.168.0.19   |                  | 2020-09-16 11:18:49.982024+02 | idle                | 2020-09-17 00:48:51.440301+02
       6423 | mcloud   | twodtopo            | 192.168.0.19   |                  | 2020-09-16 11:18:50.847797+02 | idle                | 2020-09-16 14:43:59.990336+02
       6424 | mcloud   | twodtopo            | 192.168.0.19   |                  | 2020-09-16 11:18:50.967978+02 | idle                | 2020-09-17 00:48:51.386061+02
       6528 | mcloud   | stoceriaitf         | 192.168.0.19   |                  | 2020-09-16 11:21:22.57046+02  | idle                | 2020-09-16 11:21:22.588494+02
       6575 | mcloud   | twodtopo            | 192.168.0.19   |                  | 2020-09-16 11:22:49.20152+02  | idle                | 2020-09-17 00:48:51.768636+02
       6608 | mcloud   | sssp                | 192.168.0.19   |                  | 2020-09-16 11:23:51.936156+02 | idle                | 2020-09-16 23:53:52.714796+02
       7069 | mcloud   | hcofs_co2           | 192.168.0.33   |                  | 2020-09-09 14:53:03.39575+02  | idle                | 2020-09-14 12:22:17.365695+02
       7070 | mcloud   | hcofs_co2           | 192.168.0.33   |                  | 2020-09-09 14:53:03.477783+02 | idle in transaction | 2020-09-17 00:52:12.235683+02
       6609 | mcloud   | sssp                | 192.168.0.19   |                  | 2020-09-16 11:23:52.760895+02 | idle                | 2020-09-17 00:43:52.827027+02
       6610 | mcloud   | sssp                | 192.168.0.19   |                  | 2020-09-16 11:23:52.954966+02 | idle                | 2020-09-17 00:43:52.776684+02
       6628 | mcloud   | twodmaterials       | 192.168.0.19   |                  | 2020-09-16 11:24:25.369628+02 | idle                | 2020-09-17 00:44:24.782436+02
       6629 | mcloud   | twodmaterials       | 192.168.0.19   |                  | 2020-09-16 11:24:26.166658+02 | idle                | 2020-09-17 00:44:25.351396+02
       7179 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-09 14:57:01.980594+02 | idle                | 2020-09-17 00:47:53.672401+02
       7180 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-09 14:57:02.054614+02 | idle in transaction | 2020-09-17 00:47:53.637744+02
      29263 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.14797+02  | idle                | 2020-09-17 00:44:07.653125+02
      29264 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.148802+02 | idle                | 2020-09-16 14:25:30.074568+02
      29266 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.282765+02 | idle in transaction | 2020-09-17 00:44:07.668528+02
      29265 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.281977+02 | idle in transaction | 2020-09-16 14:25:29.904845+02
      29267 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.563163+02 | idle in transaction | 2020-09-17 00:44:07.391097+02
      29268 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 06:34:14.592616+02 | idle                | 2020-09-17 00:34:07.669507+02
       6630 | mcloud   | twodmaterials       | 192.168.0.19   |                  | 2020-09-16 11:24:26.416613+02 | idle                | 2020-09-17 00:44:25.145242+02
       6766 | mcloud   | tc_applicability    | 192.168.0.19   |                  | 2020-09-16 11:28:02.687732+02 | idle                | 2020-09-17 00:47:57.267954+02
       6767 | mcloud   | tc_applicability    | 192.168.0.19   |                  | 2020-09-16 11:28:03.548588+02 | idle                | 2020-09-17 00:47:57.642733+02
       6768 | mcloud   | tc_applicability    | 192.168.0.19   |                  | 2020-09-16 11:28:03.675171+02 | idle                | 2020-09-17 00:47:57.285721+02
       8607 | mcloud   | sssp                | 192.168.0.19   |                  | 2020-09-16 12:23:51.267473+02 | idle                | 2020-09-17 00:43:53.428577+02
        939 | postgres | postgres            |                | psql             | 2020-09-17 00:52:30.387464+02 | active              | 2020-09-17 00:52:35.385356+02
       2909 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 09:32:46.190128+02 | idle in transaction | 2020-09-17 00:44:07.359878+02
       2910 | mcloud   | curated_cofs_aiida1 | 192.168.0.33   |                  | 2020-09-16 09:32:46.301805+02 | idle                | 2020-09-17 00:24:07.655365+02
       5921 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.01576+02  | idle                | 2020-09-16 11:04:22.717408+02
       5923 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.017616+02 | idle                | 2020-09-16 11:04:04.122343+02
       5925 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.019536+02 | idle                | 2020-09-16 11:04:04.115145+02
       5922 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.016813+02 | idle                | 2020-09-16 11:04:04.11478+02
       5924 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.018356+02 | idle                | 2020-09-16 11:04:04.122174+02
       5920 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.014942+02 | idle                | 2020-09-16 11:04:04.121969+02
       5926 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.24257+02  | idle in transaction | 2020-09-16 11:04:21.874395+02
       5928 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.244214+02 | idle in transaction | 2020-09-16 11:04:22.734747+02
       5927 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.243271+02 | idle in transaction | 2020-09-16 11:04:04.271057+02
       5929 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:04.24653+02  | idle in transaction | 2020-09-16 11:04:21.868967+02
       5936 | mcloud   | scdm                | 192.168.0.33   |                  | 2020-09-16 11:04:14.911125+02 | idle in transaction | 2020-09-16 11:04:21.871229+02
       1602 |          |                     |                |                  | 2020-09-09 11:12:14.979974+02 |                     |
       1601 |          |                     |                |                  | 2020-09-09 11:12:14.979782+02 |                     |
       1603 |          |                     |                |                  | 2020-09-09 11:12:14.981479+02 |                     |
(52 rows)

Idle connections

As one can see from backend_start, all of these connections were opened a long time before I ran this command.
This raises the question: are we using connections correctly? Should idle processes continue to hold an open connection or should they close it?

There is a nice writeup on this topic here and, in essence, the answer is: connections are a scarce resource and idle worker processes should not hold open connections.

We should check the behavior of the REST API, Optimade API and discover sections in this regard and adapt them, if necessary.

Accounting

  • AiiDA 0.X databases (e.g. sssp, twodmaterials) will get connections from the REST API and, possibly, from a discover section (e.g. tc_applicability)
  • AiiDA 1.X databases (e.g. curated_cofs_aiida1, scdm) will get connections from the REST API, 1 connection from the optimade API and, possibly, from a discover section.

Even though the wsgi daemon for the REST API, the optimade API container and the discover section containers each run just a single process (I checked), I suspect that each thread will use its own connection. Perhaps someone can confirm (if threads were sharing DB connection, some logic would need to be in place for them not to interfere between each other; I guess this is not the case).

The wsgi daemons, for example, are currently configured to use up to 20 threads [1], however, I guess only threads that are actually "used" at least once will acquire an open connection.
I.e. the fact that curated_cofs_aiida1 and scdm have so many open connections may simply be related to the fact that there have been more requests for those.

[1] All these threads + a few more are created in advance as one can see from:

$ ps -ef | grep scdm
ubuntu    3116  2453  0 01:13 pts/0    00:00:00 grep --color=auto scdm
ubuntu   14055  7900  0 Sep16 ?        00:00:05 daemon-scdm       -k start
$ cat /proc/14055/status | grep -i threads
Threads:	26

@CasperWA
Copy link
Collaborator Author

Excellent detective work @ltalirz ! Thank you.

I think we have a quite large issue on our hands here, at least in the context of Materials Cloud, and future similar uses of AiiDA databases.

I will continue the discussion in aiidateam/aiida-core#4374, since it seems the solution depends foremost on AiiDA core and what policy we decide should be prevalent concerning this issue. If we decide not to address it in AiiDA, it then becomes a matter of implementing a workaround or solution for Materials Cloud.

@sphuber
Copy link

sphuber commented Sep 17, 2020

Since you guys are mostly just using AiiDA to query the database and are not running the daemon, the only connections should go through the query builder. Even for Django, this goes through a SqlAlchemy and it is SqlAlchemy that manages a connection pool. The idea is then that we don't have to close the connections (which also has an overhead and closing/opening everytime may not be the most efficient) and we simply reuse open ones when needed. I think this may ultimately simply be a configuration problem. Of course if you start serving too many applications from a single server and PSQL cluster, at some point you run out of resources. If you think the current amount of projects should perfectly be manageable with the default of a 100 connections, then we can always configure the connection pool of SqlAlchemy. @CasperWA added a commit not too long ago (the commit to test the correct handling of the session for the REST API) that allows to configure the parameters of the SQLA connection pool. I don't think it is fully plumbed through that you can configure this dynamically per AiiDA profile, but if necessary you could add this. Otherwise you can try a temporary hardcoded solution and limit the number of connections in a pool. The default number of connections per pool seems to be 5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants