Rebake завершается с ошибкой

Привет!
Когда я пытаюсь перезапечь посты с помощью rake-задачи, процесс иногда останавливается с таким сообщением:

168858 / 329447 ( 51.3%)WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command.

А в конце большого блока текста, содержащего такие строки, как:

/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.7/lib/active_record/relation.rb:828:in `exec_queries'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.7/lib/active_record/relation.rb:631:in `load'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.7/lib/active_record/relation.rb:249:in `records'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.7/lib/active_record/relation/delegation.rb:88:in `each'
/var/www/discourse/lib/tasks/posts.rake:124:in `block in rebake_posts'

есть вот что:
168859 / 329447 ( 51.3%)rake aborted! ActiveRecord::ConnectionNotEstablished: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is in recovery mode


Ошибка в логах Discourse:

Есть ли какие-то идеи, как отладить эту проблему?

Вам следует проверить логи PostgreSQL.

Спасибо!

Признаюсь, что PostgreSQL для меня — целый новый мир, и я совершенно не знаком с ним.
Можете подтвердить, что логи хранятся в /var/discourse/shared/standalone/log/var-log/postgres, или мне стоит искать конкретный лог ошибок где-то ещё?
Я посмотрел файл current в этой папке, но не нашёл ничего подозрительного, особенно в то время, которое показывает интерфейс логов Discourse.

Хотя вполне возможно, что я просто не знаю, что именно искать.


редактирование: нашёл это в файле current:

	
2022-03-12 05:27:51.651 UTC [450818] discourse@discourse LOG:  duration: 258.470 ms  parse <unnamed>: SELECT 1 AS one FROM "permalinks" WHERE "permalinks"."url" = 'images/emoji/twitter/roll_eyes.png' LIMIT 1
2022-03-12 05:27:52.287 UTC [450818] discourse@discourse LOG:  duration: 557.420 ms  bind <unnamed>: SELECT 1 AS one FROM "permalinks" WHERE "permalinks"."url" = 'images/emoji/twitter/roll_eyes.png' LIMIT 1
2022-03-12 05:27:52.500 UTC [450818] discourse@discourse LOG:  duration: 158.213 ms  execute <unnamed>: SELECT 1 AS one FROM "permalinks" WHERE "permalinks"."url" = 'images/emoji/twitter/roll_eyes.png' LIMIT 1
2022-03-12 05:27:53.047 UTC [450818] discourse@discourse LOG:  duration: 301.290 ms  parse <unnamed>: SELECT 1 AS one FROM "permalinks" WHERE "permalinks"."url" = 'images/emoji/twitter/roll_eyes.png' LIMIT 1
2022-03-12 05:27:56.647 UTC [560] LOG:  checkpointer process (PID 180496) was terminated by signal 9: Killed
2022-03-12 05:27:56.650 UTC [560] LOG:  terminating any other active server processes
2022-03-12 05:27:56.652 UTC [455579] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.652 UTC [455579] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.652 UTC [455579] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.652 UTC [455580] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.652 UTC [455580] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.652 UTC [455580] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455573] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455573] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455573] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455560] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455560] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455560] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455476] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455476] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455476] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [180506] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [180506] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [180506] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455477] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455477] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455477] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [180499] WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [180499] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [180499] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455581] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455581] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455581] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.653 UTC [455574] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.653 UTC [455574] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.653 UTC [455574] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.654 UTC [427436] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.654 UTC [427436] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.654 UTC [427436] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.657 UTC [455341] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.657 UTC [455341] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.657 UTC [455341] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.661 UTC [455559] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.661 UTC [455559] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.661 UTC [455559] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.698 UTC [450818] discourse@discourse WARNING:  terminating connection because of crash of another server process
2022-03-12 05:27:56.698 UTC [450818] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-03-12 05:27:56.698 UTC [450818] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-03-12 05:27:56.773 UTC [560] LOG:  all server processes terminated; reinitializing
2022-03-12 05:27:56.957 UTC [455632] LOG:  database system was interrupted; last known up at 2022-03-12 05:25:19 UTC
2022-03-12 05:27:56.957 UTC [455634] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.957 UTC [455633] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.958 UTC [455636] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.958 UTC [455635] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.966 UTC [455637] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.966 UTC [455638] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.966 UTC [455639] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.969 UTC [455640] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:56.977 UTC [455641] discourse@discourse FATAL:  the database system is in recovery mode
2022-03-12 05:27:57.331 UTC [455632] LOG:  database system was not properly shut down; automatic recovery in progress
2022-03-12 05:27:57.334 UTC [455632] LOG:  redo starts at E/B99AC028
2022-03-12 05:27:57.371 UTC [455632] LOG:  invalid record length at E/BA16D680: wanted 24, got 0
2022-03-12 05:27:57.371 UTC [455632] LOG:  redo done at E/BA16D640
2022-03-12 05:27:57.467 UTC [560] LOG:  database system is ready to accept connections

Я предполагаю, что самые важные строки это:

2022-03-12 05:27:56.647 UTC [560] LOG:  checkpointer process (PID 180496) was terminated by signal 9: Killed
2022-03-12 05:27:56.650 UTC [560] LOG:  terminating any other active server processes

Активность на форуме:


300 000 постов на перепечку (но ошибка может возникнуть после 150 000 или после 1000 перепечённых постов, это довольно случайно.

Характеристики сервера:
CPX21 от Hetzner: https://www.hetzner.com/cloud-fr

  • 3 vCPU
  • 4 ГБ ОЗУ
  • 80 ГБ диска (50% свободно)

Также я обнаружил, что команда dmesg может показать дополнительную информацию, вот конец её вывода:

[590461.105649] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=2f5e8cd98980f017bab02228d6af1cfbbd0068935aedea9ab3680470befa2030,mems_allowed=0,global_oom,task_memcg=/docker/2f5e8cd98980f017bab02228d6af1cfbbd0068935aedea9ab3680470befa2030,task=postmaster,pid=1135915,uid=105
[590461.105673] Out of memory: Killed process 1135915 (postmaster) total-vm:1172132kB, anon-rss:7244kB, file-rss:0kB, shmem-rss:839588kB, UID:105 pgtables:2104kB oom_score_adj:0
[590461.109673] oom_reaper: reaped process 1135915 (postmaster), now anon-rss:0kB, file-rss:0kB, shmem-rss:839588kB
[592120.454582] Process accounting resumed
[592121.145664] Process accounting resumed

Это хоть чем-то поможет?

Именно так.
Когда на вашем сервере заканчивается (виртуальная) память, операционная система начинает завершать процессы, чтобы убедиться, что система не станет полностью недоступной.

Поскольку это происходит только во время ребейка, вам может не понадобиться больше физической памяти, а лишь больше виртуальной памяти, чего можно добиться, увеличив файл подкачки.

Отлично!

Я немного удивлён, что возникает ошибка нехватки памяти при повторной обработке, особенно учитывая, что это происходит на случайных «этапах» процесса, что, похоже, память освобождается между каждой партией, и что партии по умолчанию довольно небольшие (около 150 постов?).


Я создал файл подкачки размером 2 ГБ, следуя этой инструкции. Надеюсь, этого будет достаточно. :slight_smile: