Вопрос

I'm having trouble with unexpected system crashes emanating from a postgresql deployment on a moderately busy production server (Django 1.8 app deployed on DO droplet with a gunicorn application server and nginx reverse proxy).

I'm an accidental DBA and need an expert's opinion on piecing together what's going on. I have no clue right now; I'll provide various log data below:

The whole app crashes - new relic ends up showing the following error:

django.db.utils:OperationalError: ERROR: no more connections allowed (max_client_conn)

Full stack trace being:

File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/baseserver.py", line 26, in _handle_and_close_when_done
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 155, in handle
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 56, in handle
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 114, in handle_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 704, in __iter__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 1080, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 108, in get_response
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 228, in wrapper
File "/home/ubuntu/app/mproject/middleware/mycustommiddleware.py", line 6, in process_request
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/functional.py", line 225, in inner
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/functional.py", line 376, in _setup
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 22, in <lambda>
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 10, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 174, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/contrib/auth/backends.py", line 93, in get_user
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 328, in get
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 144, in __len__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 965, in _fetch_all
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/query.py", line 238, in iterator
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 838, in execute_sql
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 164, in cursor
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 135, in _cursor
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 130, in ensure_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 130, in ensure_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 119, in connect
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 176, in get_new_connection
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/database_dbapi2.py", line 102, in __call__
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 126, in connect
File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/psycogreen/gevent.py", line 32, in gevent_wait_callback

New relic also shows a huge spike in Postgres set:

enter image description here

Postgresql slow log shows exorbitant COMMIT durations like so:

2020-05-15 01:22:47.474 UTC [10722] ubuntu@myapp LOG:  duration: 280668.902 ms  statement: COMMIT
2020-05-15 01:22:47.474 UTC [11536] ubuntu@myapp LOG:  duration: 49251.277 ms  statement: COMMIT
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp LOG:  duration: 281609.394 ms  statement: COMMIT
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp LOG:  could not send data to client: Broken pipe
2020-05-15 01:22:47.474 UTC [10719] ubuntu@myapp FATAL:  connection to client lost
2020-05-15 01:22:47.475 UTC [10672] ubuntu@myapp LOG:  duration: 285371.872 ms  statement: COMMIT

Digital ocean's dashboard shows a spike in Load Average and Disk IO (but CPU, memory usage and disk usage remain low).

enter image description here

The snapshot stats at the time of that Load Average spike are:

enter image description here

syslog has myriad instances of the following error log:

May 15 01:22:46 main-app gunicorn[10779]: error: [Errno 111] Connection refused
May 15 01:22:46 main-app gunicorn[10779]: [2020-05-15 01:22:46 +0000] [10874] [ERROR] Socket error processing request.
May 15 01:22:46 main-app gunicorn[10779]: Traceback (most recent call last):
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 66, in handle
May 15 01:22:46 main-app gunicorn[10779]:     six.reraise(*sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 56, in handle
May 15 01:22:46 main-app gunicorn[10779]:     self.handle_request(listener_name, req, client, addr)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
May 15 01:22:46 main-app gunicorn[10779]:     addr)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 129, in handle_request
May 15 01:22:46 main-app gunicorn[10779]:     six.reraise(*sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gunicorn/workers/base_async.py", line 114, in handle_request
May 15 01:22:46 main-app gunicorn[10779]:     for item in respiter:
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 704, in __iter__
May 15 01:22:46 main-app gunicorn[10779]:     for item in self.generator:
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/web_transaction.py", line 1080, in __call__
May 15 01:22:46 main-app gunicorn[10779]:     self.start_response)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
May 15 01:22:46 main-app gunicorn[10779]:     response = self.get_response(request)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 218, in get_response
May 15 01:22:46 main-app gunicorn[10779]:     response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 448, in wrapper
May 15 01:22:46 main-app gunicorn[10779]:     return _wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/hooks/framework_django.py", line 441, in _wrapped
May 15 01:22:46 main-app gunicorn[10779]:     return wrapped(request, resolver, exc_info)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 256, in handle_uncaught_exception
May 15 01:22:46 main-app gunicorn[10779]:     'request': request
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/logging/__init__.py", line 1193, in error
May 15 01:22:46 main-app gunicorn[10779]:     self._log(ERROR, msg, args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log
May 15 01:22:46 main-app gunicorn[10779]:     self.handle(record)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle
May 15 01:22:46 main-app gunicorn[10779]:     self.callHandlers(record)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
May 15 01:22:46 main-app gunicorn[10779]:     hdlr.handle(record)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle
May 15 01:22:46 main-app gunicorn[10779]:     self.emit(record)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/log.py", line 129, in emit
May 15 01:22:46 main-app gunicorn[10779]:     self.send_mail(subject, message, fail_silently=True, html_message=html_message)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/utils/log.py", line 132, in send_mail
May 15 01:22:46 main-app gunicorn[10779]:     mail.mail_admins(subject, message, *args, connection=self.connection(), **kwargs)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/function_trace.py", line 110, in literal_wrapper
May 15 01:22:46 main-app gunicorn[10779]:     return wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/__init__.py", line 98, in mail_admins
May 15 01:22:46 main-app gunicorn[10779]:     mail.send(fail_silently=fail_silently)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/function_trace.py", line 110, in literal_wrapper
May 15 01:22:46 main-app gunicorn[10779]:     return wrapped(*args, **kwargs)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/message.py", line 303, in send
May 15 01:22:46 main-app gunicorn[10779]:     return self.get_connection(fail_silently).send_messages([self])
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 100, in send_messages
May 15 01:22:46 main-app gunicorn[10779]:     new_conn_created = self.open()
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 58, in open
May 15 01:22:46 main-app gunicorn[10779]:     self.connection = connection_class(self.host, self.port, **connection_params)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/smtplib.py", line 256, in __init__
May 15 01:22:46 main-app gunicorn[10779]:     (code, msg) = self.connect(host, port)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/smtplib.py", line 316, in connect
May 15 01:22:46 main-app gunicorn[10779]:     self.sock = self._get_socket(host, port, self.timeout)
May 15 01:22:46 main-app gunicorn[10779]:   File "/usr/lib/python2.7/smtplib.py", line 291, in _get_socket
May 15 01:22:46 main-app gunicorn[10779]:     return socket.create_connection((host, port), timeout)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/socket.py", line 96, in create_connection
May 15 01:22:46 main-app gunicorn[10779]:     sock.connect(sa)
May 15 01:22:46 main-app gunicorn[10779]:   File "/home/ubuntu/.virtualenvs/app/local/lib/python2.7/site-packages/gevent/_socket2.py", line 244, in connect
May 15 01:22:46 main-app gunicorn[10779]:     raise error(err, strerror(err))

Some lines from vmtstat 1:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 4698028 280600 16692992    0    0     0    14    0    0 13  1 87  0  0
 3  0      0 4697204 280600 16693016    0    0     0    24 8294 8111 12  0 88  0  0
 3  0      0 4698148 280600 16693044    0    0     0   204 8533 9047 15  0 85  0  0
 4  0      0 4697628 280600 16693092    0    0     0    76 8695 8775 17  1 83  0  0
 3  0      0 4697288 280600 16693140    0    0     8    16 9578 9720 19  1 80  0  0
 3  0      0 4695768 280600 16693200    0    0     0   152 8594 8073 16  1 83  0  0
 6  0      0 4686044 280600 16693272    0    0     0    32 7998 7706 12  1 87  0  0
 5  0      0 4681876 280600 16693308    0    0     0   224 9130 8810 14  1 85  0  0

That's pretty much all I have. I can't piece together what could be going on at all. It's happening almost once a day now. Can an expert shed some light on this? Thanks in advance!


Note: in case warranted, SELECT version(); yields:

PostgreSQL 9.6.16 on x86_64-pc-linux-gnu (Ubuntu 9.6.16-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
Это было полезно?

Решение

The 4.6 minute COMMIT is nearly definitive that your IO system is either grossly inadequate, or just completely hosed. Which would explain all the rest of the problems. If connections can't commit, they won't ordinarily be closed, so they build up. Eventually you get "ERROR: no more connections allowed". If a client gives up on waiting for a commit response and drops the connection, you might get 'FATAL: connection to client lost on' the server. Note there is no PostgreSQL "crash" here, it is still running and serving connections (slowly) until max_connections is exceeded, at which point it is still serving existing connections.

There are things other than fsync which a COMMIT does, and occasionally but rarely those thing can cause problems. So to be more definitive, you can sample pg_stat_activity and see what wait_event and wait_event_type for the commits are during these episodes. If wait_event is 'WALWriteLock' or 'WALSync' that is pretty definitive.

If you paid for a storage system with a service level > 'gerbil', I would say this falls squarely in Digital Ocean's hands.

Другие советы

The long commit times indicate that I/O on the database server is seriously overloaded. The machine seems to come to a crawl, so that it cannot respond to connection requests in a timely fashion any more.

Examine the database workload and see what causes the overload.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с dba.stackexchange
scroll top