I can see that the server successfully sent a batch of emails for this new topic to everyone who was watching that category. I’m not seeing any errors in the email logs or on the email server either.
I upgraded to bf987af3ca and retried everything, but I still have at least 38 instances of Jobs::HandledExceptionWrapper: Wrapped ActiveRecord::RecordInvalid: Validation failed: Post has already been taken showing in my sidekiq console.
Without further updates, I’m now down to 30. I’d guess that they are timing out, except that my test account got the (delayed) weekly digest, and I assume this is related. I’m not sure where to look in the logs to know whether any are actually giving up.
They seem to mostly fail but occasionally succeed, which sure smells like a race condition somewhere.
My backtraces look the same as what @RBoy and @md-misko saw, but here is the full backtrace, not just the truncated one from the “Copy” button:
activerecord-7.0.3/lib/active_record/transactions.rb:302:in `block in save!'
activerecord-7.0.3/lib/active_record/transactions.rb:354:in `block in with_transaction_returning_status'
activerecord-7.0.3/lib/active_record/relation.rb:115:in `block in create!'
activerecord-7.0.3/lib/active_record/relation.rb:219:in `block in create_or_find_by!'
activerecord-7.0.3/lib/active_record/connection_adapters/abstract/transaction.rb:319:in `block in within_new_transaction'
activesupport-7.0.3/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `block in synchronize'
activerecord-7.0.3/lib/active_record/relation/delegation.rb:67:in `block in transaction'
/var/www/discourse/app/jobs/base.rb:232:in `block (2 levels) in perform'
/var/www/discourse/app/jobs/base.rb:221:in `block in perform'
sidekiq-6.4.2/lib/sidekiq/processor.rb:164:in `block (2 levels) in process'
sidekiq-6.4.2/lib/sidekiq/middleware/chain.rb:138:in `block in invoke'
sidekiq-6.4.2/lib/sidekiq/middleware/chain.rb:140:in `block in invoke'
sidekiq-6.4.2/lib/sidekiq/processor.rb:163:in `block in process'
sidekiq-6.4.2/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
sidekiq-6.4.2/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
sidekiq-6.4.2/lib/sidekiq.rb:40:in `block in <module:Sidekiq>'
sidekiq-6.4.2/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
sidekiq-6.4.2/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
sidekiq-6.4.2/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
sidekiq-6.4.2/lib/sidekiq/processor.rb:124:in `block in dispatch'
sidekiq-6.4.2/lib/sidekiq/util.rb:65:in `block in safe_thread'
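One possible reading of this backtrace, offered as a guess rather than a diagnosis: the failure passes through create_or_find_by!, which in Rails 7.0 rescues only ActiveRecord::RecordNotUnique (the database index violation) before falling back to a find. A model-level uniqueness validation raises ActiveRecord::RecordInvalid instead, with the "has already been taken" message, and that is not rescued. The plain-Ruby sketch below imitates that behavior. Everything in it (ToyTable, attempt, the stand-in exception classes) is hypothetical and is not Discourse or ActiveRecord code.

```ruby
# Stand-ins for the ActiveRecord exception classes (hypothetical, simplified).
class RecordInvalid < StandardError; end
class RecordNotUnique < StandardError; end

# A toy in-memory table with a unique constraint on :post_id (hypothetical).
class ToyTable
  def initialize(validate_uniqueness:)
    @rows = []
    @validate_uniqueness = validate_uniqueness
  end

  def find_by!(post_id)
    @rows.find { |r| r[:post_id] == post_id } || raise(KeyError, "not found")
  end

  def create!(post_id)
    taken = @rows.any? { |r| r[:post_id] == post_id }
    # A model-level uniqueness validation runs first and raises RecordInvalid.
    if @validate_uniqueness && taken
      raise RecordInvalid, "Validation failed: Post has already been taken"
    end
    # Without the validation, the unique index is what rejects the duplicate.
    raise RecordNotUnique, "duplicate key" if taken

    row = { post_id: post_id }
    @rows << row
    row
  end

  # Same shape as create_or_find_by!: only the index violation is rescued,
  # so a validation failure propagates to the caller.
  def create_or_find_by!(post_id)
    create!(post_id)
  rescue RecordNotUnique
    find_by!(post_id)
  end
end

# Returns :ok on success, or the validation message if RecordInvalid escapes.
def attempt(table, post_id)
  table.create_or_find_by!(post_id)
  :ok
rescue RecordInvalid => e
  e.message
end
```

With validate_uniqueness: true, a second attempt for the same post_id surfaces the validation error; with validate_uniqueness: false, the same call quietly falls back to the existing row. Whether the model involved in these jobs actually carries such a validation is an assumption that would need checking against the Discourse source.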
Is there any more information I could provide to help debug this?
I discovered that my mail server was using a certificate for another server in its round-robin, and hoped that the hostname mismatch was the problem. I updated to b850c12793 in the process of changing to a server that did not have a certificate mismatch, but it did not resolve the problem. I retried some of the jobs, but none of them completed successfully. Therefore, this bug is not a symptom of hidden certificate mismatches.
This was built with discourse_docker 2a9faf7e5680b9.
Updating discourse_docker to 241a42ce718, and with it discourse to 95e7e10417, also did not resolve the problem. I still have 30 of these failures being retried.
From what you’re describing and looking at this post there may be multiple issues here:
The server may not be throttling its retries for emails, causing them to time out or be rejected by the mail server. But there’s some other underlying issue if your certificates and configuration are valid and it’s still not sending the emails. For some it also appears to be eating up disk space. I checked mine but I didn’t notice that here.
I didn’t run out of space, and this happens even when I select exactly one job to re-run, so it doesn’t look like a race condition. There’s clearly more than one issue here, and what I’m seeing is not related to that linked topic.
(It turns out that I didn’t have a certificate problem after all; the server names were listed in the certificate’s Subject Alternative Name field. But I moved to using the hostname that matches the SN anyway, and it made no difference.)
I’m successfully sending a tremendous number of mails, just these few jobs are stuck. I don’t know, for example, what log entries to go looking for to help diagnose.
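For anyone wanting to see which jobs are still retrying and which have given up, here is a minimal sketch of a helper that groups failed-job entries by job class and error. It assumes each entry is a Hash shaped like a Sidekiq job payload with the keys "class", "error_class", "error_message" and "retry_count"; summarize_failures is a made-up name, not part of Sidekiq or Discourse.

```ruby
# Group failed-job entries by job class and error so repeats stand out.
# Each entry is assumed to be a Hash shaped like a Sidekiq job payload.
def summarize_failures(entries)
  entries
    .group_by { |e| [e["class"], e["error_class"], e["error_message"]] }
    .map do |(klass, error_class, message), group|
      {
        job: klass,
        error: error_class,
        message: message,
        count: group.size,
        # Highest retry count seen in this group; missing counts are treated as 0.
        max_retry_count: group.map { |e| e["retry_count"].to_i }.max,
      }
    end
    .sort_by { |row| -row[:count] }
end

# Possible usage from a Rails console on the server (assumes Sidekiq's API
# is loadable there; not run as part of this sketch):
#
#   require "sidekiq/api"
#   summarize_failures(Sidekiq::RetrySet.new.map(&:item))  # still retrying
#   summarize_failures(Sidekiq::DeadSet.new.map(&:item))   # retries exhausted
```

In stock Sidekiq, jobs that run out of retries move from the retry set to the dead set, so a non-empty summary for the dead set would suggest that some of these emails really were abandoned.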
I reviewed my 29 failed emails to make sure that there was nothing critical to send, and as far as I could tell, there wasn’t, so I deleted all the jobs in sidekiq, in case this was a transient problem caused by email jobs spanning upgrades. However, without applying further updates, I now have another single case of the same failure.
Just sharing this as information that it’s an ongoing problem and not a weird transient.