Regular Discourse crashes since update from 1.9 to 2.1


(Allen) #1

After migrating from phpbb, Discourse 1.9 was running rock solid for over a month. Recently I performed my first maintenance window and did an upgrade to v2.1.0.beta2 +72. At the same time, I also performed an update of the underlying Ubuntu 16.04 LTS OS. Since this update, all hell broke loose. Of important note, Docker was not updated, and currently remains at 18.03.1-ce. Hence, docker cannot be the culprit, as the same version was not a problem before. After the crashes, I did a rebuild without any added plugins. Unfortunately, the problem still persists.

Preceding the crash, I am seeing many blocks like these in unicorn.stderr.log:

E, [2018-06-28T10:28:34.290585 #78] ERROR -- : master loop error: Connection timed out (Redis::TimeoutError)
E, [2018-06-28T10:28:34.290713 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
E, [2018-06-28T10:28:34.290753 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
E, [2018-06-28T10:28:34.290785 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
E, [2018-06-28T10:28:34.290814 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
E, [2018-06-28T10:28:34.290842 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
E, [2018-06-28T10:28:34.290871 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
E, [2018-06-28T10:28:34.290904 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
E, [2018-06-28T10:28:34.290933 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
E, [2018-06-28T10:28:34.290962 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
E, [2018-06-28T10:28:34.290990 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
E, [2018-06-28T10:28:34.291019 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
E, [2018-06-28T10:28:34.291050 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
E, [2018-06-28T10:28:34.291079 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:889:in `block in get'
E, [2018-06-28T10:28:34.291108 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
E, [2018-06-28T10:28:34.291136 #78] ERROR -- : /usr/local/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'
E, [2018-06-28T10:28:34.291164 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
E, [2018-06-28T10:28:34.291193 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:888:in `get'
E, [2018-06-28T10:28:34.291221 #78] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (3 levels) in <class:DiscourseRedis>'
E, [2018-06-28T10:28:34.291249 #78] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:153:in `ignore_readonly'
E, [2018-06-28T10:28:34.291277 #78] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (2 levels) in <class:DiscourseRedis>'
E, [2018-06-28T10:28:34.291305 #78] ERROR -- : /var/www/discourse/app/jobs/regular/run_heartbeat.rb:15:in `last_heartbeat'
E, [2018-06-28T10:28:34.291333 #78] ERROR -- : config/unicorn.conf.rb:172:in `check_sidekiq_heartbeat'
E, [2018-06-28T10:28:34.291361 #78] ERROR -- : config/unicorn.conf.rb:199:in `master_sleep'
E, [2018-06-28T10:28:34.291391 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
E, [2018-06-28T10:28:34.291420 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
E, [2018-06-28T10:28:34.291449 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/bin/unicorn:23:in `load'
E, [2018-06-28T10:28:34.291489 #78] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.5.0/bin/unicorn:23:in `<main>'
E, [2018-06-28T10:28:34.291573 #78] ERROR -- : worker=1 PID:8585 timeout (33s > 30s), killing

This is what I am seeing in postgres log:

2018-06-28 11:01:35.453 UTC [3657] LOG:  lock file "postmaster.pid" contains wrong PID: 9785 instead of 3657
2018-06-28 11:01:35.453 UTC [3657] LOG:  performing immediate shutdown because data directory lock file is invalid
2018-06-28 11:01:35.453 UTC [3657] LOG:  received immediate shutdown request
2018-06-28 11:01:35.454 UTC [3666] discourse@discourse WARNING:  terminating connection because of crash of another server process
2018-06-28 11:01:35.454 UTC [3666] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-28 11:01:35.454 UTC [3666] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-06-28 11:01:35.457 UTC [3662] WARNING:  terminating connection because of crash of another server process
2018-06-28 11:01:35.457 UTC [3662] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-28 11:01:35.457 UTC [3662] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-06-28 11:01:35.461 UTC [3665] discourse@discourse WARNING:  terminating connection because of crash of another server process
2018-06-28 11:01:35.461 UTC [3665] discourse@discourse DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-28 11:01:35.461 UTC [3665] discourse@discourse HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-06-28 11:01:35.454 UTC [3669] WARNING:  terminating connection because of crash of another server process
2018-06-28 11:01:35.454 UTC [3669] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-28 11:01:35.454 UTC [3669] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2018-06-28 11:01:35.477 UTC [3657] LOG:  database system is shut down
2018-06-28 11:01:35.495 UTC [3670] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2018-06-28 11:01:35.496 UTC [3670] LOG:  listening on IPv6 address "::", port 5432
2018-06-28 11:01:35.497 UTC [3670] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2018-06-28 11:01:35.526 UTC [3671] LOG:  database system was interrupted; last known up at 2018-06-28 11:01:29 UTC
2018-06-28 11:01:35.560 UTC [3671] LOG:  database system was not properly shut down; automatic recovery in progress
2018-06-28 11:01:35.562 UTC [3671] LOG:  invalid record length at 5/70366670: wanted 24, got 0
2018-06-28 11:01:35.562 UTC [3671] LOG:  redo is not required
2018-06-28 11:01:35.573 UTC [3670] LOG:  database system is ready to accept connections

Finally, the server is dead (unicorn.stderr.log - this goes on for a long time):

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

The bottom line is, since my update, the only way I can keep my forum up and running is by having monit check it and reboot it when it dies. This obviously is a really ugly hack, and not sustainable.

Is anyone running v2.1.0.beta2 +72 without any problems? If so, what is your configuration? Do any of the Discourse Team have input on this?


(Jeff Atwood) #2

We are not seeing this, so likely a third party plugin or something else in your config is responsible. Disable all third party plugins and rebuild.


(Jay Pfaffman) #3

Is this different from Redis connection timed out?


(Allen) #4

I first thought that my problem was the same, but from the logs it rather seems that processes are crashing. Also, that thread ended up being caused by a docker bug, which can nearly be excluded in my case.


(Matt Palmer) #5

The Docker bug referenced in the other topic is still present, Discourse has just worked around it as much as possible.

You’ve got PostgreSQL processes dropping off the twig, so you’ve got really big problems. Not Discourse-specific problems, something far more fundamental is wrong with your system.


(Allen) #6

Did that, and it crashed again, roughly 18 hours after starting discourse (like previous times). ZERO plugins this time, this is driving me crazy. I did not change anything in my setup or config, which has been running rock solid for month up to this mess (after upgrade to 2.1 / OS updates). Same thing happened this time again: App looses connection to Redis, Postgres goes down the tubes (same log patterns as above, not going to repost). My monit setup worked and rebooted the forum once it went down. But this is nonsense.

Here is my config:

## this is the all-in-one, standalone Discourse Docker container template
##
## After making changes to this file, you MUST rebuild
## /var/discourse/launcher rebuild app
##
## BE *VERY* CAREFUL WHEN EDITING!
## YAML FILES ARE SUPER SUPER SENSITIVE TO MISTAKES IN WHITESPACE OR ALIGNMENT!
## visit http://www.yamllint.com/ to validate this file as needed

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  #- "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/web.socketed.template.yml"

## which TCP/IP ports should this container expose?
## If you want Discourse to share a port with another webserver like Apache or nginx,
## see https://meta.discourse.org/t/17247 for details
expose:
#  - "80:80"   # http
#  - "443:443" # https
  - "4432:5432" # fwd host port 4432 to container port 5432 (postgres)
  - "2222:22" # fwd host port 2222 to container port 22 (ssh)

params:
  db_default_text_search_config: "pg_catalog.english"

  ## Set db_shared_buffers to a max of 25% of the total memory.
  ## will be set automatically by bootstrap based on detected RAM, or you can override
  db_shared_buffers: "512MB"

  ## can improve sorting performance, but adds memory usage per-connection
  #db_work_mem: "40MB"

  ## Which Git revision should this container use? (default: tests-passed)
  #version: tests-passed

env:
  LANG: en_US.UTF-8
  # DISCOURSE_DEFAULT_LOCALE: en

  ## How many concurrent web requests are supported? Depends on memory and CPU cores.
  ## will be set automatically by bootstrap based on detected CPUs, or you can override
  ## Should set to one per CPU
  UNICORN_WORKERS: 2

  ## TODO: The domain name this Discourse instance will respond to
  DISCOURSE_HOSTNAME: forum.example.com

  ## Uncomment if you want the container to be started with the same
  ## hostname (-h option) as specified above (default "$hostname-$config")
  #DOCKER_USE_HOSTNAME: true

  ## TODO: List of comma delimited emails that will be made admin and developer
  ## on initial signup example 'user1@example.com,user2@example.com'
  DISCOURSE_DEVELOPER_EMAILS: 'mymail@example.com'

  ## TODO: The SMTP mail server used to validate new accounts and send notifications
  DISCOURSE_SMTP_ADDRESS: smtp.elasticemail.com
  DISCOURSE_SMTP_PORT: 2525
  DISCOURSE_SMTP_USER_NAME: mymail@example.com
  DISCOURSE_SMTP_PASSWORD: "mykey"
  #DISCOURSE_SMTP_ENABLE_START_TLS: true           # (optional, default true)

  ## If you added the Lets Encrypt template, uncomment below to get a free SSL certificate
  #LETSENCRYPT_ACCOUNT_EMAIL: me@example.com

  ## The CDN address for this Discourse instance (configured to pull)
  ## see https://meta.discourse.org/t/14857 for details
  #DISCOURSE_CDN_URL: //discourse-cdn.example.com

## The Docker container is stateless; all data is stored in /shared
volumes:
  - volume:
      host: /var/discourse/shared/standalone
      guest: /shared
  - volume:
      host: /var/discourse/shared/standalone/log/var-log
      guest: /var/log

## Plugins go here
## see https://meta.discourse.org/t/19157 for details
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
#          - git clone https://github.com/discourse/discourse-bbcode.git
#          - git clone https://github.com/angusmcleod/discourse-locations.git
#          - git clone https://github.com/discourse/discourse-voting.git
#          - git clone https://github.com/Ebsy/discourse-nationalflags.git
#          - git clone https://github.com/discoursehosting/discourse-migratepassword.git
#          - git clone https://github.com/discourse/discourse-akismet.git
#          - git clone https://github.com/leodavidson/discourse-reset-bump.git

## Any custom commands to run after building
run:
  - exec: echo "Beginning of custom commands"
  ## If you want to set the 'From' email address for your first registration, uncomment and change:
  ## After getting the first signup email, re-comment the line. It only needs to run once.
  #- exec: rails r "SiteSetting.notification_email='info@unconfigured.discourse.org'"
  - exec: echo "End of custom commands"

I am running this on a DO droplet with 4GB RAM and 2 vCPU’s. Memory is relaxed at permanently under 70%, CPU’s idling and inodes are plenty. I have all relevant resources under supervision with monit, and am getting no resource alerts. So what gives, why is this setup running unstable all of the sudden? WHY is it always crashing after 18 hours of operation? Furthermore, I have an identical droplet running for QA purposes, and it is now crashing too since upgrade (both Discourse 1.9 > 2.1 and OS updates) - not as often though. Please don’t get me wrong, I am not pointing any fingers, just explaining what the facts are.

So what gives? Does anyone have any ideas how i could approach this? I am thinking of doing a clean install next on another distro (probably Debian instead of Ubuntu 16.04 LTS - I don’t want to move to CentOS or something completely different).

Any input is greatly appreciated.


(Allen) #7

I believe I found out what the problem was, and am documenting it here in case others are in the same situation, find this topic, and are in despair like I (hopefully) was.

As said, my forum was migrated from phpbb. I still had my import container from the phpbb importer hanging around, because I will probably be merging another forum down the road. For some reason this import container was set to start automatically, causing a conflict with the app container. This problem only surfaced after my first Discourse and OS update, because I rebooted the server after updating it. I then naturally suspected that it had something to do either with the Discourse or OS update, specially since my QA box was showing the same problems (separate droplet).

So this is how the story went on:

When I realized what was going on by doing a docker ps, and seeing that both app and import were running at the same time, I then proceeded to shutdown both containers, so I could do a clean restart of app. Unfortunately, I was too late, and my database had become corrupted. It was not possible to rebuild any longer. I was getting this:

[75] PANIC: could not locate a valid checkpoint record

After which the whole rebuild came to a grinding halt, following some more error messages. The message indicated that It’s looking for a checkpoint record in the transaction log that probably doesn’t exist or is corrupted. To bring the transaction log back to life I did the following:

./launcher enter app
su postgres
/usr/lib/postgresql/10/bin/pg_resetwal /shared/postgres_data

This repaired the transaction log and I could then rebuild the app. Then the next problem came up, which I could not resolve: It was not possible to login to Dicourse any longer. Password reset didn’t work (user unknown), and not even the admin backdoor was working ( /u/admin-login - email unknown). Removing all plugins did not bring back the login either. As I had a backup which was only a few hours old, and didn’t have any more time to waste on this problem, I restored my backup to a clean install of Discourse. Now I am pretty confident that peace will return.

Bottom line and learning: If you keep your import container from your phpbb (or similar) forum import, for your sanity’s sake, at least run this command:

docker update --restart=no import

This will ensure that the import container will only start when you want it to, thereby you will avoid lots of time and serious head-banging, like I did.


(Jay Pfaffman) #8

I’ve had similar problems when switching from a one container to a two container conductance.

I recommend that you destroy the import container. If you need it again, it’ll be too be the same version as the server.


(Stephen) #9

Out of curiosity did you try using rake admin:create to recreate your admin user and inspect the rest of the data? It’s possible the rest of the phpbb data was intact.


(Allen) #10

I did not do that, but logged into the postgres db instead. At first glance, I could not see that anything was missing. Btw, the migrated forum was already running on Discourse for over a month, so I am not sure what you mean by phpbb data…