Discourse update keeps failing

The core dump and the invalid instruction indicate that something is going wrong at a low level (CPU, memory).

I am not a hardware expert, but this CPU came on the market 12 years ago and I suspect it might be too old (i.e. it is trying to run compiled code that assumes a newer CPU).

We did think about this, but given it has been working fine for the last three years what would have been updated within the stack that suddenly requires a newer instruction? (Also, what/which instruction?)

Would the message_bus commit "FEATURE: Add support for clear_every parameter in Redis backend (#309)" (discourse/message_bus@1baa1ea) be triggering some different behaviour within Redis? :thinking:

I also want to add that last Friday the major version upgrade was performed seamlessly and it ran the entire weekend without a hitch. I even performed a successful update on Sunday. If the CPU is the cause, which would be understandable, then this error should already have shown up with the major version upgrade.

But, perhaps there has been a change since Monday…

That could very well be. It’s crashing in a JSON parse routine in the message bus code, although the change you mentioned is over 4 months old.

-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.2.7(rb_vm_bugreport+0x50a) [0x7f30fc64839a] vm_dump.c:755
[0x7f30fc4b9b47]
/usr/local/lib/libruby.so.2.7(sigill+0x3b) [0x7f30fc5c4f0b] signal.c:962
/lib/x86_64-linux-gnu/libc.so.6(0x7f30fc283d60) [0x7f30fc283d60]
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/oj-3.13.15/lib/oj/oj.so(oj_parse2+0x4f9) [0x7f30f3a68339] /usr/lib/gcc/x86_64-linux-gnu/10/include/smmintrin.h:649

I, [2022-07-05T10:03:30.513303 #1]  INFO -- : > cd /var/www/discourse && su discourse -c 'bundle exec rake db:migrate'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-4.2.0/lib/message_bus/codec/json.rb:11: [BUG] Illegal instruction at 0x00007f30f3a68339
ruby 2.7.6p219 (2022-04-12 revision c9c2245c0a) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0030 p:---- s:0162 e:000161 CFUNC  :parse
c:0029 p:0013 s:0157 e:000156 METHOD /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-4.2.0/lib/message_bus/codec/json.rb:11
c:0028 p:0037 s:0152 e:000151 METHOD /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-4.2.0/lib/message_bus.rb:648
c:0027 p:0020 s:0144 e:000143 BLOCK  /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-4.2.0/lib/message_bus.rb:766
c:0026 p:0082 s:0135 e:000134 BLOCK  /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-4.2.0/lib/message_bus/backends/redis.rb:330
c:0025 p:0024 s:0130 e:000129 BLOCK  /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.5.1/lib/redis/subscribe.rb:46
c:0024 p:0034 s:0124 e:000123 BLOCK  /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.5.1/lib/redis/client.rb:183 [FINISH]

Yeah… so it should already have been present on Sunday. :pensive:

Looking through the logs, it seems like another instance of Redis is already running when it tries to start it.
Could that be the issue?

102:C 05 Jul 2022 09:53:34.597 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
102:C 05 Jul 2022 09:53:34.597 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=102, just started
102:C 05 Jul 2022 09:53:34.597 # Configuration loaded
102:M 05 Jul 2022 09:53:34.598 * monotonic clock: POSIX clock_gettime
102:M 05 Jul 2022 09:53:34.599 * Running mode=standalone, port=6379.
102:M 05 Jul 2022 09:53:34.599 # Server initialized
102:M 05 Jul 2022 09:53:34.599 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
102:M 05 Jul 2022 09:53:34.599 * Loading RDB produced by version 6.2.6
102:M 05 Jul 2022 09:53:34.599 * RDB age 1972 seconds
102:M 05 Jul 2022 09:53:34.599 * RDB memory usage when created 60.60 Mb
102:M 05 Jul 2022 09:53:34.949 # Done loading RDB, keys loaded: 8005, keys expired: 9.
102:M 05 Jul 2022 09:53:34.950 * DB loaded from disk: 0.351 seconds
102:M 05 Jul 2022 09:53:34.950 * Ready to accept connections
129:C 05 Jul 2022 09:53:45.056 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
129:C 05 Jul 2022 09:53:45.056 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=129, just started
129:C 05 Jul 2022 09:53:45.056 # Configuration loaded
129:M 05 Jul 2022 09:53:45.057 * monotonic clock: POSIX clock_gettime
129:M 05 Jul 2022 09:53:45.057 # Warning: Could not create server TCP listening socket *:6379: bind: Address already in use
129:M 05 Jul 2022 09:53:45.057 # Failed listening on port 6379 (TCP), aborting.
102:signal-handler (1657015415) Received SIGTERM scheduling shutdown...
102:M 05 Jul 2022 10:03:35.245 # User requested shutdown...
102:M 05 Jul 2022 10:03:35.245 * Saving the final RDB snapshot before exiting.
102:M 05 Jul 2022 10:03:39.882 * DB saved on disk
102:M 05 Jul 2022 10:03:39.882 # Redis is now ready to exit, bye bye...

This is pretty normal during a ./launcher rebuild app; it doesn’t affect anything (as far as I know, at least…).

Code paths can also trigger on certain data being present or absent. Maybe the offending code was present but it was not being executed.

I’m going to try some quasi-bisecting on the latest set of commits and see if I can narrow it down to a specific recent change. This will take “some time”… :sweat_smile:
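Since the backtrace points into a native gem (oj.so), one way to shorten this kind of bisect is to only try commits that changed Gemfile.lock. A rough sketch, assuming a local clone of discourse/discourse and known good and bad revisions; the function name gem_bump_candidates is made up for illustration:

```shell
# List commits between a known-good and a known-bad revision that touched
# Gemfile.lock, oldest first, as "<short hash> <subject>".
# Arguments: <repo dir> <good rev> <bad rev>
gem_bump_candidates() {
  git -C "$1" log --reverse --format='%h %s' "$2..$3" -- Gemfile.lock
}
```

Each commit it prints can then be tried in turn by setting it as the version in app.yml and rebuilding, which is slow but needs far fewer rebuilds than trying every commit.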

Edit:

OK, so the first bad commit with the illegal instruction is "Build(deps): Bump oj from 3.13.14 to 3.13.15 (#17309)" (discourse/discourse@4c69619), which is linked to "Fix NaN object dump issue" (ohler55/oj@f0122cf).

Some previous commits also fail to build but with a different issue (which also seems like it could be transient…):

I, [2022-07-05T12:14:35.377926 #1]  INFO -- : > cd /var/www/discourse && su discourse -c 'bundle exec rake db:migrate'
102:M 05 Jul 2022 12:14:44.308 * 100 changes in 300 seconds. Saving...
102:M 05 Jul 2022 12:14:44.312 * Background saving started by pid 709
709:C 05 Jul 2022 12:14:45.166 * DB saved on disk
709:C 05 Jul 2022 12:14:45.169 * RDB: 1 MB of memory used by copy-on-write
102:M 05 Jul 2022 12:14:45.217 * Background saving terminated with success
I, [2022-07-05T12:14:46.192386 #1]  INFO -- :
I, [2022-07-05T12:14:46.193317 #1]  INFO -- : > cd /var/www/discourse && su discourse -c 'bundle exec rake themes:update assets:precompile'

Missing yarn packages:
Package: ember-cli-deprecation-workflow
  * Specified: ^2.1.0
  * Installed: (not installed)

Run `yarn` to install missing dependencies.

Stack Trace and Error Report: /tmp/error.dump.ccfa3d8342a442ee6860db37ce7c7330.log
An error occurred in the constructor for ember-cli-dependency-checker at /var/www/discourse/app/assets/javascripts/node_modules/ember-cli-dependency-checker

error Command failed with exit code 1.

Good find, it’s indeed crashing in the oj gem.

Version 3.13.15 also contains a commit that switches to using SSE 4.2 instructions for performance, and those are not supported on AMD Opteron 41xx processors.
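Whether a given CPU advertises SSE 4.2 can be checked on the host or inside the container. A minimal sketch, assuming a Linux system where /proc/cpuinfo lists the kernel's flag names (SSE 4.2 appears there as sse4_2); the helper name has_cpu_flag is made up for illustration:

```shell
# Succeeds (exit status 0) if the named CPU flag is listed in a cpuinfo file.
#   $1: flag name as the Linux kernel spells it, e.g. sse4_2
#   $2: file to read; defaults to /proc/cpuinfo (Linux only)
has_cpu_flag() {
  # Take the first "flags : ..." line; every core normally reports the same set.
  _flags=$(grep -m1 '^flags' "${2:-/proc/cpuinfo}") || return 1
  # Drop everything up to the colon and pad with spaces so that only whole
  # flag names match (sse4 must not match sse4_2 or sse4a).
  case " ${_flags#*:} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_cpu_flag sse4_2 && echo present || echo missing`. Going by the above, an Opteron 41xx would be expected to print missing.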

So we’re back to the CPU being too old for the compiled code.

IMHO it sucks that the gem author chose to make this a compile-time decision.

Lovely. An additional change not mentioned in the oj changelog… :grin:

So, unless the gem does its native compilation during installation (in which case we could potentially prod it into working via OJ_USE_SSE4_2), it looks like it’s going to need a server move… :expressionless:

Edit: the gem doesn’t distribute any pre-compiled objects so this should be workable - so the next question is why it’s compiling with SSE4.2 on a system that doesn’t support it.

Our current base image ships 3.13.14 so it is being compiled on your system.

Can you try reproducing the error with the benchmark script from the commit:

○ → docker run --rm -it -u discourse discourse/base:2.0.20220621-0049 bash
discourse@313d7af3be39:/$ cd
discourse@313d7af3be39:~$ gem install --user pry benchmark-ips oj
…
Successfully installed oj-3.13.15
5 gems installed
discourse@313d7af3be39:~$ /home/discourse/.local/share/gem/ruby/2.7.0/bin/pry
[1] pry(main)> require 'benchmark/ips'
require 'oj'

def json(string)
  "\"#{string}\""
end

Benchmark.ips do |x|
  x.warmup = 5
  x.time = 20

  json_0   = json('a' *   0)
  json_64  = json('a' *  64)
  json_128 = json('a' * 128)

  x.report('Oj.load   [0]') { Oj.load(json_0) }
  x.report('Oj.load  [64]') { Oj.load(json_64) }
  x.report('Oj.load [128]') { Oj.load(json_128) }
end;

You can also check whether or not it was compiled using the problematic instruction with:

discourse@313d7af3be39:~$ objdump -d /home/discourse/.local/share/gem/ruby/2.7.0/gems/oj-3.13.15/lib/oj/oj.so | grep -C3 pcmpestri
   2e32b:	0f 82 b5 03 00 00    	jb     2e6e6 <oj_parse2+0x8a6>
   2e331:	66 0f 6f 05 77 d6 01 	movdqa 0x1d677(%rip),%xmm0        # 4b9b0 <exp_plus+0x330>
   2e338:	00 
   2e339:	66 0f 3a 61 07 00    	pcmpestri $0x0,(%rdi),%xmm0
   2e33f:	83 f9 10             	cmp    $0x10,%ecx
   2e342:	74 dc                	je     2e320 <oj_parse2+0x4e0>
   2e344:	48 63 c9             	movslq %ecx,%rcx

If so, this is probably something to report to the oj gem’s project.

I do want to look into this some more, but 1) I want to avoid more downtime (for a while at least; I know the above doesn’t involve downtime but I might be tempted to try other things) and 2) when the oj version shipped in the Discourse base image changes to 3.13.15 and the base image inherits that same minimum CPU microarchitecture requirement, the current server isn’t going to be sustainable anyway (unless there’s a way of working around it, like (re)installing the gem separately, e.g. as part of a pre-code hook, but I’d also guess that’s a bit of a faff for most people).

It also raises the question of what a reasonable cut-off date for hardware support should be anyway; it’s not reasonable to expect 32-bit CPU support, so perhaps SSE4.2 is a reasonable “new minimum” for modern software.

Indeed, I’ve already raised this internally.

:+1:

Hey!

Thank you for looking into this. I am having the same issue on an Intel Atom N2800 (from the end of 2011).
Do you think there might be a way around this issue, or is the only thing I can do for now to migrate to newer hardware?

Thank you.

I’m dead in the water now with my forum with the update I was prompted to do today. I never saw any warnings about upcoming obsoleting of any CPUs, and to have this happen suddenly is … bad. The available servers all are the same configuration for consistency, and all use the same CPU.

AMD Athlon™ II X2 B22 Processor

Not practical to run out and buy a new server, configure, etc. in this economy, even given the time.

How can I back out of this update until this situation is better understood? I can’t even contact my users right now with the forum down. Thanks.

If you’re using the Docker deployment method, you may have an older container which you can restart (check e.g. docker images and/or docker ps -a).
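To see which stopped containers are still around, something like the following can help. A small sketch, assuming the docker CLI is on the PATH; the function name stopped_containers is hypothetical, while the status=exited filter and the --format placeholders are standard docker ps options:

```shell
# Print one line per stopped container: ID, image it was created from, status.
stopped_containers() {
  docker ps -a --filter status=exited \
    --format '{{.ID}}  {{.Image}}  {{.Status}}'
}
```

A container listed here whose image is local_discourse/app is a candidate for restarting, either with docker start followed by its ID or through the launcher.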

You can also override the commit used to build the Discourse instance by editing app.yml and setting the version to the commit prior to the change, then rebuilding:

params:
  version: adb7fa5e2fc51308efc9fc4ee57ecb1c15a85cfa

Discourse will break again if you update after this, which is not ideal given the security update that has been released since (although exploitation potential seems pretty limited for most instances).

One option (which I haven’t tried yet) is to install the oj gem separately and hope to trigger compilation with the correct CPU features (or lack thereof).

I had planned to try this in app.yml:

hooks:
  before_code:
    - exec:
        cmd:
          - gem install oj

but I haven’t got the scope for more forum downtime.

That specific security update doesn’t appear relevant to me since I’m not in a shared hosting environment. I’m unsure how to interpret the docker info. Here’s the ps:

37c258b23221 local_discourse/app "/sbin/boot" 3 months ago Exited (7) 3 hours ago

Here’s the image list:

REPOSITORY            TAG                 IMAGE ID       CREATED         SIZE
discourse/base        2.0.20220621-0049   a44ca4f67972   3 weeks ago     2.65GB
local_discourse/app   latest              b5f2a8a39709   3 months ago    3.53GB
discourse/base        2.0.20220413-0411   ab71a5d97460   3 months ago    2.81GB
<none>                <none>              58ba7d1c8d7a   3 months ago    3.74GB
discourse/base        2.0.20220224-2005   cd112601450a   4 months ago    2.84GB
<none>                <none>              d9cf1feb92fd   6 months ago    3.19GB
<none>                <none>              d53ee33f6fe1   6 months ago    3.19GB
<none>                <none>              14f79500c49c   6 months ago    3.19GB
<none>                <none>              edff9b614f46   6 months ago    3.19GB
<none>                <none>              e2348b41f937   6 months ago    3.19GB
<none>                <none>              42f6511b414c   6 months ago    3.19GB
<none>                <none>              3086f92af2fe   6 months ago    3.19GB
<none>                <none>              6ada029723ba   6 months ago    3.19GB
<none>                <none>              ca61149580d4   6 months ago    3.19GB
<none>                <none>              ce5ae3bb62ac   6 months ago    3.19GB
<none>                <none>              e9a5c1b1aed4   6 months ago    3.19GB
<none>                <none>              6bb94ce1e01f   6 months ago    3.19GB
<none>                <none>              e1df4acbd927   6 months ago    3.19GB
<none>                <none>              7e05a0b160c5   6 months ago    3.19GB
<none>                <none>              979926f28a73   6 months ago    3.19GB
<none>                <none>              d055f9b01556   6 months ago    3.19GB
<none>                <none>              aa0c779093dc   6 months ago    3.19GB
discourse/base        2.0.20211118-0105   b6cc7cf8974a   7 months ago    2.58GB
discourse/base        2.0.20210528-1735   482386bf57af   13 months ago   2.36GB
<none>                <none>              e6011d2b206c   14 months ago   2.69GB
discourse/base        2.0.20210415-1332   30e4746e631e   15 months ago   2.23GB
<none>                <none>              8066ac13b8c3   17 months ago   2.45GB
discourse/base        2.0.20201221-2020   c0704d4ce2b4   18 months ago   2.11GB
<none>                <none>              043da6b3335d   2 years ago     2.4GB
discourse/base        2.0.20200429-2110   dc919e1dae2c   2 years ago     2.13GB
<none>                <none>              ff15472f4794   2 years ago     2.79GB
discourse/base        2.0.20191013-2320   09725007dc9e   2 years ago     2.3GB
<none>                <none>              f65391a062f0   2 years ago     2.62GB
discourse/base        2.0.20190901-2315   10f636afbeaf   2 years ago     2.29GB
<none>                <none>              6944d06786b4   2 years ago     2.31GB
discourse/base        2.0.20190625-0946   2b3a5b47565f   3 years ago     1.93GB
<none>                <none>              60b39deba7d2   3 years ago     2.3GB
discourse/base        2.0.20190505-2322   ed87227f60d2   3 years ago     1.91GB
<none>                <none>              cc5c0e56298c   3 years ago     2.38GB
discourse/base        2.0.20190321-0122   7db99586b5b5   3 years ago     1.97GB
<none>                <none>              b19f9a483788   3 years ago     2.27GB
discourse/base        2.0.20190217        9c24db193c37   3 years ago     1.92GB
hello-world           latest              fce289e99eb9   3 years ago     1.84kB
<none>                <none>              614db6988e9c   3 years ago     2.25GB
<none>                <none>              729b196da862   3 years ago     2.25GB
<none>                <none>              80584ec5ec01   3 years ago     2.25GB
<none>                <none>              0e2481aefed8   3 years ago     2.25GB
<none>                <none>              725d0c17a6bb   3 years ago     2.25GB
<none>                <none>              220bed95d236   3 years ago     2.25GB
<none>                <none>              fca469dba597   3 years ago     2.25GB
<none>                <none>              edab31d0ffce   3 years ago     2.25GB
<none>                <none>              dbacaff2d35e   3 years ago     2.25GB
<none>                <none>              3d6a0453da1d   3 years ago     2.25GB
<none>                <none>              fbf0529eb303   3 years ago     2.25GB
<none>                <none>              7a45443ae44c   3 years ago     2.25GB
<none>                <none>              ad90d7f42416   3 years ago     2.25GB
<none>                <none>              d61ea07d6084   3 years ago     2.25GB
<none>                <none>              d393fd8b4de0   3 years ago     2.25GB
discourse/base        2.0.20181031        ea31cd77735a   3 years ago     1.88GB


Can you try a ./launcher start app?
